Character set and Java

by Hideyuki Inada, Capitola Computing Inc.

Introduction

Character set conversion is one of the trickiest areas in Java. In this section, we will look at several situations that can corrupt characters if you are not aware of the underlying character set and encoding issues.

 

Character set used in Java and javac

First, let's take a look at some simple code that contains a Japanese string, and go through the compilation process to see what character sets are used in each of the file formats that are used in the typical development cycle.

In this example, the string is written to the console. The console is set to codepage 932, so both the source code and the output are displayed in that codepage. Codepage 932 is commonly known as Shift JIS (strictly, it is Microsoft's extended variant of Shift JIS) and is the standard character set for personal computers used in Japan:

 

C:\programs\javatest\site>chcp

現在のコード ページ: 932

 

Shown below is the source code displayed to the same console using the "type" command:

 

C:\programs\javatest\site>type intl1.java

public class intl1 {

 

  public static void main(String args[])

  {

    System.out.print("ソフトウェアの国際化のデモ");

  }

}

intl1.java

 

If you compile intl1.java using javac, and run it with the "java" command, you'll see the following output on the console:

C:\programs\javatest\site>java intl1

ソフトウェアの国際化のデモ

Output of intl1.class

 

As you can see:

•         Source code is encoded in codepage 932.

•         Output is generated in codepage 932.

 

At this point, we still do not know how the string is encoded in the class file.  A natural guess would be that those letters are encoded in codepage 932 as well.

If the string were encoded in codepage 932, "ソフトウェアの国際化のデモ" would be stored as the following bytes (shown in hexadecimal), and we should be able to find them in the class file (unless literal strings are compressed):

835c 8374 8367 8345 8346 8341 82cc 8d91 8ddb 89bb 82cc 8366 8382
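The expected byte sequence above can be reproduced with a few lines of Java. This is a minimal sketch, not part of the original example: the class name Cp932Hex is made up for illustration, and it assumes the JDK in use ships the "windows-31j" charset (Java's name for codepage 932), which is not guaranteed on stripped-down runtimes. The string is written with Unicode escapes so that the sketch itself does not depend on the encoding of its source file.

```java
import java.nio.charset.Charset;

class Cp932Hex {
    // Formats bytes as lowercase hex, two bytes per group, e.g. "835c 8374".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0 && i % 2 == 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same text as the string literal in intl1.java, as Unicode escapes.
        String s = "\u30BD\u30D5\u30C8\u30A6\u30A7\u30A2\u306E"
                 + "\u56FD\u969B\u5316\u306E\u30C7\u30E2";

        // Assumption: "windows-31j" (codepage 932) is available in this JDK.
        if (!Charset.isSupported("windows-31j")) {
            System.out.println("windows-31j is not available in this runtime");
            return;
        }
        System.out.println(hex(s.getBytes(Charset.forName("windows-31j"))));
    }
}
```

Where the charset is available, the output should match the sequence listed above, because every character in this string is a double-byte character in codepage 932.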

 

Let's dump the content to see if this assumption is true:

 

C:\programs\javatest\site>tdump intl1.class

Turbo Dump  Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation

                    Display of File INTL1.CLASS

 

000000: CA FE BA BE 00 03 00 2D  00 22 0A 00 06 00 14 09 ハ.コセ...-."......

000010: 00 15 00 16 08 00 17 0A  00 18 00 19 07 00 1A 07 ................

000020: 00 1B 01 00 06 3C 69 6E  69 74 3E 01 00 03 28 29 .....<init>...()

000030: 56 01 00 04 43 6F 64 65  01 00 0F 4C 69 6E 65 4E V...Code...LineN

000040: 75 6D 62 65 72 54 61 62  6C 65 01 00 12 4C 6F 63 umberTable...Loc

000050: 61 6C 56 61 72 69 61 62  6C 65 54 61 62 6C 65 01 alVariableTable.

000060: 00 04 74 68 69 73 01 00  07 4C 69 6E 74 6C 31 3B ..this...Lintl1;

000070: 01 00 04 6D 61 69 6E 01  00 16 28 5B 4C 6A 61 76 ...main...([Ljav

000080: 61 2F 6C 61 6E 67 2F 53  74 72 69 6E 67 3B 29 56 a/lang/String;)V

000090: 01 00 04 61 72 67 73 01  00 13 5B 4C 6A 61 76 61 ...args...[Ljava

0000A0: 2F 6C 61 6E 67 2F 53 74  72 69 6E 67 3B 01 00 0A /lang/String;...

0000B0: 53 6F 75 72 63 65 46 69  6C 65 01 00 0A 69 6E 74 SourceFile...int

0000C0: 6C 31 2E 6A 61 76 61 0C  00 07 00 08 07 00 1C 0C l1.java.........

0000D0: 00 1D 00 1E 01 00 27 E3  82 BD E3 83 95 E3 83 88 ......'繧ス繝輔ヨ

0000E0: E3 82 A6 E3 82 A7 E3 82  A2 E3 81 AE E5 9B BD E9 繧ヲ繧ァ繧「縺ョ蝗ス髫

0000F0: 9A 9B E5 8C 96 E3 81 AE  E3 83 87 E3 83 A2 07 00  帛喧縺ョ繝・Δ..

000100: 1F 0C 00 20 00 21 01 00  05 69 6E 74 6C 31 01 00 ... .!...intl1..

000110: 10 6A 61 76 61 2F 6C 61  6E 67 2F 4F 62 6A 65 63 .java/lang/Objec

000120: 74 01 00 10 6A 61 76 61  2F 6C 61 6E 67 2F 53 79 t...java/lang/Sy

000130: 73 74 65 6D 01 00 03 6F  75 74 01 00 15 4C 6A 61 stem...out...Lja

000140: 76 61 2F 69 6F 2F 50 72  69 6E 74 53 74 72 65 61 va/io/PrintStrea

000150: 6D 3B 01 00 13 6A 61 76  61 2F 69 6F 2F 50 72 69 m;...java/io/Pri

000160: 6E 74 53 74 72 65 61 6D  01 00 05 70 72 69 6E 74 ntStream...print

000170: 01 00 15 28 4C 6A 61 76  61 2F 6C 61 6E 67 2F 53 ...(Ljava/lang/S

000180: 74 72 69 6E 67 3B 29 56  00 21 00 05 00 06 00 00 tring;)V.!......

000190: 00 00 00 02 00 01 00 07  00 08 00 01 00 09 00 00 ................

0001A0: 00 33 00 01 00 01 00 00  00 05 2A B7 00 01 B1 00 .3........*キ..ア.

0001B0: 00 00 02 00 0A 00 00 00  0A 00 02 00 00 00 01 00 ................

0001C0: 04 00 01 00 0B 00 00 00  0C 00 01 00 00 00 05 00 ................

0001D0: 0C 00 0D 00 00 00 09 00  0E 00 0F 00 01 00 09 00 ................

0001E0: 00 00 37 00 02 00 01 00  00 00 09 B2 00 02 12 03 ..7........イ....

0001F0: B6 00 04 B1 00 00 00 02  00 0A 00 00 00 0A 00 02 カ..ア............

000200: 00 00 00 05 00 08 00 06  00 0B 00 00 00 0C 00 01 ................

000210: 00 00 00 09 00 10 00 11  00 00 00 01 00 12 00 00 ................

000220: 00 02 00 13 00 00 00 00  00 00 00 00 00 00 00 00 ................

 

Unfortunately, the hexadecimal sequence that we are looking for is not in the class file. To make the investigation easier, let's change the source code to the following:

 

public class intl2 {

 

  public static void main(String args[])

  {

    System.out.print("ABCソフト123");

  }

}

 

The hexadecimal representation of each letter in codepage 932, Unicode (UCS-2), and UTF-8 is shown in the table below:

 

Letter    Codepage 932    Unicode (UCS-2)    UTF-8
A         0x41            0x0041             0x41
B         0x42            0x0042             0x42
C         0x43            0x0043             0x43
ソ        0x835C          0x30BD             0xE3 0x82 0xBD
フ        0x8374          0x30D5             0xE3 0x83 0x95
ト        0x8367          0x30C8             0xE3 0x83 0x88
1         0x31            0x0031             0x31
2         0x32            0x0032             0x32
3         0x33            0x0033             0x33

Table 1

 

This way, the Japanese string is sandwiched between "ABC" and "123" and we should be able to figure out the encoding more easily.

 

Below is the result obtained by using tdump against the class file:

C:\programs\javatest\site>tdump intl2.class

Turbo Dump  Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation

                    Display of File INTL2.CLASS

 

000000: CA FE BA BE 00 03 00 2D  00 1D 0A 00 06 00 0F 09 ハ.コセ...-........

000010: 00 10 00 11 08 00 12 0A  00 13 00 14 07 00 15 07 ................

000020: 00 16 01 00 06 3C 69 6E  69 74 3E 01 00 03 28 29 .....<init>...()

000030: 56 01 00 04 43 6F 64 65  01 00 0F 4C 69 6E 65 4E V...Code...LineN

000040: 75 6D 62 65 72 54 61 62  6C 65 01 00 04 6D 61 69 umberTable...mai

000050: 6E 01 00 16 28 5B 4C 6A  61 76 61 2F 6C 61 6E 67 n...([Ljava/lang

000060: 2F 53 74 72 69 6E 67 3B  29 56 01 00 0A 53 6F 75 /String;)V...Sou

000070: 72 63 65 46 69 6C 65 01  00 0A 69 6E 74 6C 32 2E rceFile...intl2.

000080: 6A 61 76 61 0C 00 07 00  08 07 00 17 0C 00 18 00 java............

000090: 19 01 00 0F 41 42 43 E3  82 BD E3 83 95 E3 83 88 ....ABC繧ス繝輔ヨ

0000A0: 31 32 33 07 00 1A 0C 00  1B 00 1C 01 00 05 69 6E 123...........in

0000B0: 74 6C 32 01 00 10 6A 61  76 61 2F 6C 61 6E 67 2F tl2...java/lang/

0000C0: 4F 62 6A 65 63 74 01 00  10 6A 61 76 61 2F 6C 61 Object...java/la

0000D0: 6E 67 2F 53 79 73 74 65  6D 01 00 03 6F 75 74 01 ng/System...out.

0000E0: 00 15 4C 6A 61 76 61 2F  69 6F 2F 50 72 69 6E 74 ..Ljava/io/Print

0000F0: 53 74 72 65 61 6D 3B 01  00 13 6A 61 76 61 2F 69 Stream;...java/i

000100: 6F 2F 50 72 69 6E 74 53  74 72 65 61 6D 01 00 05 o/PrintStream...

000110: 70 72 69 6E 74 01 00 15  28 4C 6A 61 76 61 2F 6C print...(Ljava/l

000120: 61 6E 67 2F 53 74 72 69  6E 67 3B 29 56 00 21 00 ang/String;)V.!.

000130: 05 00 06 00 00 00 00 00  02 00 01 00 07 00 08 00 ................

000140: 01 00 09 00 00 00 1D 00  01 00 01 00 00 00 05 2A ...............*

000150: B7 00 01 B1 00 00 00 01  00 0A 00 00 00 06 00 01 キ..ア............

000160: 00 00 00 01 00 09 00 0B  00 0C 00 01 00 09 00 00 ................

000170: 00 25 00 02 00 01 00 00  00 09 B2 00 02 12 03 B6 .%........イ....カ

000180: 00 04 B1 00 00 00 01 00  0A 00 00 00 0A 00 02 00 ..ア.............

000190: 00 00 05 00 08 00 06 00  01 00 0D 00 00 00 02 00 ................

0001A0: 0E 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 ................

 

In the section following the address 0x90, we can see "ABC" and "123":

 

000090: 19 01 00 0F 41 42 43 E3  82 BD E3 83 95 E3 83 88 ....ABC繧ス繝輔ヨ

0000A0: 31 32 33 07 00 1A 0C 00  1B 00 1C 01 00 05 69 6E 123...........in

 

Comparing this result with Table 1, we can see that the string is stored in UTF-8 in the class file. (Strictly, the class file format uses a slightly modified UTF-8, which is identical to standard UTF-8 for the characters in this example.) If you are interested in the details of the class file format, please refer to Chapter 12 of "Java Virtual Machine" by Jon Meyer & Troy Downing.
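The bytes found in the dump can also be produced from inside Java. The sketch below is an illustration only (the class name ClassFileUtf8 and its helper methods are made up here). It relies on DataOutputStream.writeUTF, which writes a two-byte length followed by the string in the same modified UTF-8 that class files use for string constants.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ClassFileUtf8 {
    // Returns what writeUTF emits: a 2-byte length, then modified UTF-8 bytes.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Formats bytes as uppercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ABC" + the three Japanese letters + "123", written as Unicode escapes.
        System.out.println(hex(writeUtf("ABC\u30BD\u30D5\u30C8123")));
        // prints: 00 0F 41 42 43 E3 82 BD E3 83 95 E3 83 88 31 32 33
    }
}
```

The leading 00 0F is the byte length of the string (15), which is also visible in the dump immediately before 41 42 43.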

 

Next, let's take a look at how these strings are processed inside the Java program.

We will do this by changing intl2.java so that it dumps the content of the String object.

 

public class intl3 {

 

  public static void main(String args[])

  {

    char c;

    Character cObj;

    int i;

    String s = "ABCソフト123";

 

    System.out.print(s + "\n");

 

    for(i = 0; i < s.length(); i++){

      c = s.charAt(i);

      cObj = new Character(c);

 

      System.out.print(Integer.toString(i) +

                        ":" +

                        cObj.toString() +

                        "\n");

    }

  }

}

intl3.java

 

Output is shown below:

 

C:\programs\javatest\site>java intl3

ABCソフト123

0:A

1:B

2:C

3:ソ

4:フ

5:ト

6:1

7:2

8:3

 

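As shown above, during the execution of a Java program each letter is handled as a 16-bit Unicode (UCS-2) value. In other words, each letter occupies one "char" whether it is an ASCII letter or a Japanese letter. (In current Java versions a char is strictly a UTF-16 code unit, so characters outside the Basic Multilingual Plane take two chars; none of the characters in this example are affected.) This is a big improvement over older languages, where you had to check for lead bytes to find character boundaries and branch your code according to the character set.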
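To see the 16-bit values themselves rather than the rendered letters, the loop in intl3.java can be changed to print each char as a number. The following is a small sketch along those lines; the class and method names are chosen here for illustration.

```java
class CodeUnits {
    // Returns each char of s as a four-digit hexadecimal value.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same string as in intl3.java, written with Unicode escapes.
        System.out.println(codeUnits("ABC\u30BD\u30D5\u30C8123"));
        // prints: 0041 0042 0043 30BD 30D5 30C8 0031 0032 0033
    }
}
```

The values agree with the Unicode (UCS-2) column of Table 1: nine chars for nine letters, with no lead-byte handling required.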

To summarize, the following table shows how the code is encoded in each area of Java development.

 

Area                  Encoding
Source code           Native encoding of the system or Unicode-escape string*
class file            UTF-8
Inside the program    UCS-2
Output                Native encoding of the system

Table 2

 

* Unicode-escape strings will be discussed later.
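The "native encoding of the system" in Table 2 is whatever the Java runtime picks up as its default charset. A quick way to check it on a given machine is sketched below; the class name NativeEncoding is made up for illustration. Note that the result depends on the JDK version as well as the operating system locale: recent JDKs may report UTF-8 by default regardless of the locale, so treat the output as information about one particular setup.

```java
import java.nio.charset.Charset;

class NativeEncoding {
    // Describes the default charset the running JVM uses for conversions.
    static String describe() {
        return "Default charset: " + Charset.defaultCharset().name()
             + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this on each machine involved in a build is a cheap way to spot differences in the default encoding before they show up as corrupted strings.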

Even though Java processes strings in Unicode (UTF-8 in the class file, UCS-2 in memory), external systems that interface with Java do not necessarily use Unicode, and there is a good chance that a multibyte character set (MBCS) is still used there. Character corruption can occur whenever the modules that communicate with an external system, whether your own Java classes or third-party classes, do not implement the conversion or handle it incorrectly.

For example, if you store data in an external RDBMS in an MBCS, verify that the JDBC driver you use converts Unicode strings to that MBCS properly. The JDBC driver is responsible for converting the data passed to the methods of the java.sql.Statement class (e.g. the executeUpdate method) into whatever encoding the database uses. Because java.sql.Statement takes a SQL statement as a String, a driver for a database that supports only an MBCS has to convert the Unicode data to that MBCS correctly. In addition, since not every character supported in Unicode is supported in a given MBCS, to guarantee the integrity of the data you must make sure that characters not supported in the target MBCS are filtered out before they are passed to the JDBC driver.
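One way to detect strings that the target character set cannot represent is to ask a CharsetEncoder before handing the data to the driver. The sketch below is only an illustration of that check: the class and method names are invented here, and ISO-8859-1 stands in for the database's character set because it is available on every Java runtime. In a real application you would pass the charset your database actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncodableCheck {
    // Returns true if every character of s can be represented in the target charset.
    static boolean isEncodable(String s, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        // Stand-in for the database encoding (assumption for this sketch).
        Charset target = StandardCharsets.ISO_8859_1;

        System.out.println(isEncodable("ABC123", target));               // true
        System.out.println(isEncodable("ABC\u30BD\u30D5\u30C8", target)); // false
    }
}
```

A string that fails the check can then be rejected or cleaned up by the application instead of being silently corrupted during conversion.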

 

Using javac and locale

Another thing you have to consider is the locale of the machine you run javac on.

Let's think of a situation where two engineers are working on the same source code and both of them use javac to compile the ".java" file.

In this scenario, the first engineer's machine runs Japanese Windows XP, so javac handles source code that contains Shift-JIS characters correctly. During compilation, strings in CP932 are converted to UTF-8 correctly, as we discussed earlier. However, if the second engineer (for example, a configuration management engineer) tries to compile the same source file on a machine with the English-US locale, an error is generated as shown below:

[root@325cds site]# javac intl4.java

intl4.java:8: illegal escape character

 

javac uses the locale of the machine when converting from the native character set to UTF-8. The first character of the Japanese string, "ソ" (0x835C in codepage 932), has 0x5C as its second byte. Since 0x5C is the backslash in ASCII, javac running under the English-US locale interprets it as a backslash that starts an escape sequence rather than as the second byte of a double-byte character, and reports an error.

If you remove the first Japanese letter "ソ", the file can be compiled on the machine with the English-US locale.  However, the compiled class file now contains corrupt characters.
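Both effects can be reproduced without a second machine by decoding codepage 932 bytes as if they were single-byte text. The sketch below makes two assumptions that are not stated in the original: that the English-US machine uses a single-byte encoding such as ISO-8859-1, and that the byte values are the ones listed in Table 1. The class and method names are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

class MisreadCp932 {
    // Decodes bytes the way a single-byte locale would: one char per byte.
    static String decodeAsLatin1(byte[] cp932Bytes) {
        return new String(cp932Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // First Japanese letter in codepage 932: 0x83 0x5C.
        byte[] first = { (byte) 0x83, 0x5C };
        String a = decodeAsLatin1(first);
        System.out.println("chars: " + a.length()
                + ", second is backslash: " + (a.charAt(1) == '\\'));

        // Remaining two letters in codepage 932: 0x83 0x74 0x83 0x67.
        byte[] rest = { (byte) 0x83, 0x74, (byte) 0x83, 0x67 };
        for (char c : decodeAsLatin1(rest).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // prints U+0083, U+0074, U+0083, U+0067 on separate lines
    }
}
```

Under these assumptions the two remaining letters turn into four chars. The trail bytes 0x74 and 0x67 are the ASCII letters "t" and "g", while the lead byte 0x83 becomes a control character that has no printable form.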

When you run the program on the Japanese machine, you can see this problem at offsets 3 to 6, as shown below. In a sense, this is harder to detect because no error is reported during compilation.

C:\programs\javatest\site>java intl5

ABC?t?g123

0:A

1:B