Character set and Java
Introduction
Character set conversion is one of the trickiest areas in Java. In this section, we will look at several areas that can cause character corruption problems if you are not aware of the underlying character set or encoding issues.
Character set used in Java and javac
First, let's take a look at some simple code that contains a Japanese string, and go through the compilation process to see which character set is used in each of the file formats involved in a typical development cycle.
In this example, the string is output to the console, and the console's codepage is set to 932, which is used for displaying both the source code and the output.
Codepage 932 is commonly known as Shift JIS and is the standard character set for personal computers used in Japan:
C:\programs\javatest\site>chcp
現在のコード ページ: 932
Shown below is the source code displayed on the same console using the "type" command:
C:\programs\javatest\site>type intl1.java
public class intl1 {
    public static void main(String args[]) {
        System.out.print("ソフトウェアの国際化のデモ");
    }
}
intl1.java
If you compile intl1.java using javac and run it with the "java" command, you'll see the following output on the console:
C:\programs\javatest\site>java intl1
ソフトウェアの国際化のデモ
Output of intl1.class
As you can see:
- Source code is encoded in codepage 932.
- Output is generated in codepage 932.
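Which native encoding the JVM assumes can be checked at run time. The following is a minimal sketch rather than part of the original example; the class name ShowDefaultCharset is hypothetical. It prints the charset the JVM uses by default when converting between bytes and chars. On a Japanese Windows machine like the one above you would typically expect a codepage 932 name such as windows-31j, but the exact value depends on the JDK version and platform settings.

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {
    // Returns the canonical name of the JVM's default charset.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // file.encoding is the system property traditionally associated
        // with the default charset; it may be null on some runtimes.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running it on each machine involved in a build is a quick way to spot mismatched environments.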
At this point, we still do not know how the string is encoded in the class file. A natural guess would be that the letters are encoded in codepage 932 as well.
If that were the case, the string "ソフトウェアの国際化のデモ" would be stored as the following byte sequence (shown in hexadecimal), and we should be able to find it in the class file (unless literal strings are compressed):
835c 8374 8367 8345 8346 8341 82cc 8d91 8ddb 89bb 82cc 8366 8382
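The expected byte sequence can be derived programmatically instead of by looking up a code table. The sketch below assumes the JDK in use ships the charset named "windows-31j" (the JDK's name for codepage 932); the class and method names are hypothetical. The Japanese text is written with Unicode escapes so that the source file itself stays ASCII-only.

```java
import java.nio.charset.Charset;

class Cp932Bytes {
    // Formats bytes as two-digit uppercase hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text in the named charset and returns the bytes as hex.
    static String encodeToHex(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // The first three katakana of the sample string (U+30BD, U+30D5, U+30C8).
        String text = "\u30BD\u30D5\u30C8";
        // Assumption: windows-31j is available in this JDK.
        System.out.println(encodeToHex(text, "windows-31j"));
    }
}
```

If the charset is available, the output should correspond to the first three values of the sequence above (83 5C, 83 74, 83 67).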
Let's dump the content to see if this assumption is true:
C:\programs\javatest\site>tdump intl1.class
Turbo Dump Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation
Display of File INTL1.CLASS
000000: CA FE BA BE 00 03 00 2D 00 22 0A 00 06 00 14 09 ハ.コセ...-."......
000010: 00 15 00 16 08 00 17 0A 00 18 00 19 07 00 1A 07 ................
000020: 00 1B 01 00 06 3C 69 6E 69 74 3E 01 00 03 28 29 .....<init>...()
000030: 56 01 00 04 43 6F 64 65 01 00 0F 4C 69 6E 65 4E V...Code...LineN
000040: 75 6D 62 65 72 54 61 62 6C 65 01 00 12 4C 6F 63 umberTable...Loc
000050: 61 6C 56 61 72 69 61 62 6C 65 54 61 62 6C 65 01 alVariableTable.
000060: 00 04 74 68 69 73 01 00 07 4C 69 6E 74 6C 31 3B ..this...Lintl1;
000070: 01 00 04 6D 61 69 6E 01 00 16 28 5B 4C 6A 61 76 ...main...([Ljav
000080: 61 2F 6C 61 6E 67 2F 53 74 72 69 6E 67 3B 29 56 a/lang/String;)V
000090: 01 00 04 61 72 67 73 01 00 13 5B 4C 6A 61 76 61 ...args...[Ljava
0000A0: 2F 6C 61 6E 67 2F 53 74 72 69 6E 67 3B 01 00 0A /lang/String;...
0000B0: 53 6F 75 72 63 65 46 69 6C 65 01 00 0A 69 6E 74 SourceFile...int
0000C0: 6C 31 2E 6A 61 76 61 0C 00 07 00 08 07 00 1C 0C l1.java.........
0000D0: 00 1D 00 1E 01 00 27 E3 82 BD E3 83 95 E3 83 88 ......'繧ス繝輔ヨ
0000E0: E3 82 A6 E3 82 A7 E3 82 A2 E3 81 AE E5 9B BD E9 繧ヲ繧ァ繧「縺ョ蝗ス髫
0000F0: 9A 9B E5 8C 96 E3 81 AE E3 83 87 E3 83 A2 07 00 帛喧縺ョ繝・Δ..
000100: 1F 0C 00 20 00 21 01 00 05 69 6E 74 6C 31 01 00 ... .!...intl1..
000110: 10 6A 61 76 61 2F 6C 61 6E 67 2F 4F 62 6A 65 63 .java/lang/Objec
000120: 74 01 00 10 6A 61 76 61 2F 6C 61 6E 67 2F 53 79 t...java/lang/Sy
000130: 73 74 65 6D 01 00 03 6F 75 74 01 00 15 4C 6A 61 stem...out...Lja
000140: 76 61 2F 69 6F 2F 50 72 69 6E 74 53 74 72 65 61 va/io/PrintStrea
000150: 6D 3B 01 00 13 6A 61 76 61 2F 69 6F 2F 50 72 69 m;...java/io/Pri
000160: 6E 74 53 74 72 65 61 6D 01 00 05 70 72 69 6E 74 ntStream...print
000170: 01 00 15 28 4C 6A 61 76 61 2F 6C 61 6E 67 2F 53 ...(Ljava/lang/S
000180: 74 72 69 6E 67 3B 29 56 00 21 00 05 00 06 00 00 tring;)V.!......
000190: 00 00 00 02 00 01 00 07 00 08 00 01 00 09 00 00 ................
0001A0: 00 33 00 01 00 01 00 00 00 05 2A B7 00 01 B1 00 .3........*キ..ア.
0001B0: 00 00 02 00 0A 00 00 00 0A 00 02 00 00 00 01 00 ................
0001C0: 04 00 01 00 0B 00 00 00 0C 00 01 00 00 00 05 00 ................
0001D0: 0C 00 0D 00 00 00 09 00 0E 00 0F 00 01 00 09 00 ................
0001E0: 00 00 37 00 02 00 01 00 00 00 09 B2 00 02 12 03 ..7........イ....
0001F0: B6 00 04 B1 00 00 00 02 00 0A 00 00 00 0A 00 02 カ..ア............
000200: 00 00 00 05 00 08 00 06 00 0B 00 00 00 0C 00 01 ................
000210: 00 00 00 09 00 10 00 11 00 00 00 01 00 12 00 00 ................
000220: 00 02 00 13 00 00 00 00 00 00 00 00 00 00 00 00 ................
Unfortunately, the hexadecimal sequence that we are looking for is not in the class file. To make the investigation easier, let's change the source code to the following:
public class intl2 {
    public static void main(String args[]) {
        System.out.print("ABCソフト123");
    }
}
The hexadecimal representation of each letter in codepage 932, Unicode (UCS-2), and UTF-8 is shown in the table below:
| Letter | Codepage 932 | Unicode (UCS-2) | UTF-8 |
|---|---|---|---|
| A | 0x41 | 0x0041 | 0x41 |
| B | 0x42 | 0x0042 | 0x42 |
| C | 0x43 | 0x0043 | 0x43 |
| ソ | 0x835C | 0x30BD | 0xE3 0x82 0xBD |
| フ | 0x8374 | 0x30D5 | 0xE3 0x83 0x95 |
| ト | 0x8367 | 0x30C8 | 0xE3 0x83 0x88 |
| 1 | 0x31 | 0x0031 | 0x31 |
| 2 | 0x32 | 0x0032 | 0x32 |
| 3 | 0x33 | 0x0033 | 0x33 |
Table 1
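The relationship between the Unicode and UTF-8 columns of Table 1 is purely arithmetic. As an illustration only, the sketch below encodes a single 16-bit value by hand using the standard UTF-8 bit layout; the class name Utf8ByHand is hypothetical, and surrogate pairs (characters above U+FFFF) are deliberately not handled.

```java
class Utf8ByHand {
    // Encodes one 16-bit char (U+0000..U+FFFF, surrogates not handled) as UTF-8.
    static byte[] encode(char c) {
        if (c < 0x80) {
            // 1 byte: 0xxxxxxx
            return new byte[] { (byte) c };
        } else if (c < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        // U+30BD is the first katakana in Table 1; expected bytes: E3 82 BD.
        for (byte b : encode('\u30BD')) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Working through U+30BD (binary 0011 0000 1011 1101): the top four bits give 0xE3, the middle six give 0x82, and the low six give 0xBD, which matches the UTF-8 column.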
This way, the Japanese string is sandwiched between "ABC" and "123" and we should be able to figure out the encoding more easily.
Below is the result obtained by using tdump against the class file:
C:\programs\javatest\site>tdump intl2.class
Turbo Dump Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation
Display of File INTL2.CLASS
000000: CA FE BA BE 00 03 00 2D 00 1D 0A 00 06 00 0F 09 ハ.コセ...-........
000010: 00 10 00 11 08 00 12 0A 00 13 00 14 07 00 15 07 ................
000020: 00 16 01 00 06 3C 69 6E 69 74 3E 01 00 03 28 29 .....<init>...()
000030: 56 01 00 04 43 6F 64 65 01 00 0F 4C 69 6E 65 4E V...Code...LineN
000040: 75 6D 62 65 72 54 61 62 6C 65 01 00 04 6D 61 69 umberTable...mai
000050: 6E 01 00 16 28 5B 4C 6A 61 76 61 2F 6C 61 6E 67 n...([Ljava/lang
000060: 2F 53 74 72 69 6E 67 3B 29 56 01 00 0A 53 6F 75 /String;)V...Sou
000070: 72 63 65 46 69 6C 65 01 00 0A 69 6E 74 6C 32 2E rceFile...intl2.
000080: 6A 61 76 61 0C 00 07 00 08 07 00 17 0C 00 18 00 java............
000090: 19 01 00 0F 41 42 43 E3 82 BD E3 83 95 E3 83 88 ....ABC繧ス繝輔ヨ
0000A0: 31 32 33 07 00 1A 0C 00 1B 00 1C 01 00 05 69 6E 123...........in
0000B0: 74 6C 32 01 00 10 6A 61 76 61 2F 6C 61 6E 67 2F tl2...java/lang/
0000C0: 4F 62 6A 65 63 74 01 00 10 6A 61 76 61 2F 6C 61 Object...java/la
0000D0: 6E 67 2F 53 79 73 74 65 6D 01 00 03 6F 75 74 01 ng/System...out.
0000E0: 00 15 4C 6A 61 76 61 2F 69 6F 2F 50 72 69 6E 74 ..Ljava/io/Print
0000F0: 53 74 72 65 61 6D 3B 01 00 13 6A 61 76 61 2F 69 Stream;...java/i
000100: 6F 2F 50 72 69 6E 74 53 74 72 65 61 6D 01 00 05 o/PrintStream...
000110: 70 72 69 6E 74 01 00 15 28 4C 6A 61 76 61 2F 6C print...(Ljava/l
000120: 61 6E 67 2F 53 74 72 69 6E 67 3B 29 56 00 21 00 ang/String;)V.!.
000130: 05 00 06 00 00 00 00 00 02 00 01 00 07 00 08 00 ................
000140: 01 00 09 00 00 00 1D 00 01 00 01 00 00 00 05 2A ...............*
000150: B7 00 01 B1 00 00 00 01 00 0A 00 00 00 06 00 01 キ..ア............
000160: 00 00 00 01 00 09 00 0B 00 0C 00 01 00 09 00 00 ................
000170: 00 25 00 02 00 01 00 00 00 09 B2 00 02 12 03 B6 .%........イ....カ
000180: 00 04 B1 00 00 00 01 00 0A 00 00 00 0A 00 02 00 ..ア.............
000190: 00 00 05 00 08 00 06 00 01 00 0D 00 00 00 02 00 ................
0001A0: 0E 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
In the section following the address 0x90, we can see "ABC" and "123":
000090: 19 01 00 0F 41 42 43 E3 82 BD E3 83 95 E3 83 88 ....ABC繧ス繝輔ヨ
0000A0: 31 32 33 07 00 1A 0C 00 1B 00 1C 01 00 05 69 6E 123...........in
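Comparing this result with Table 1, we can see that the string is stored in UTF-8 in the class file. (Strictly speaking, the class file format uses a slightly modified form of UTF-8, but for the characters in this example the bytes are identical to standard UTF-8.) If you are interested in the details of the class file format, please refer to Chapter 12 of "Java Virtual Machine" by Jon Meyer & Troy Downing.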
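The bytes seen at offset 0x92 of the dump (00 0F followed by the string data) can be reproduced without a compiler. The sketch below uses DataOutputStream.writeUTF, which writes a two-byte length followed by the string in modified UTF-8. It is used here only as a stand-in with the same layout, not as a description of how javac writes class files; the class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class ConstantUtf8Sketch {
    // Returns what writeUTF emits: a 2-byte length, then modified UTF-8 bytes.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Formats bytes as two-digit uppercase hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The intl2 string, with the katakana written as Unicode escapes.
        String s = "ABC\u30BD\u30D5\u30C8123";
        // prints: 00 0F 41 42 43 E3 82 BD E3 83 95 E3 83 88 31 32 33
        System.out.println(toHex(writeUtf(s)));
    }
}
```

In the class file these bytes are preceded by the tag byte 01, which marks a UTF-8 constant pool entry; the length 0x000F (15) counts bytes, not characters.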
Next, let's take a look at how these strings are processed inside the Java program.
We will do this by changing intl2.java so that it dumps the content of the String object.
public class intl3 {
    public static void main(String args[]) {
        String s = "ABCソフト123";
        System.out.print(s + "\n");
        // Print each char of the string together with its index.
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.print(i + ":" + c + "\n");
        }
    }
}
intl3.java
Output is shown below:
C:\programs\javatest\site>java intl3
ABCソフト123
0:A
1:B
2:C
3:ソ
4:フ
5:ト
6:1
7:2
8:3
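As shown above, during the execution of a Java program, each letter is treated as UCS-2. In other words, each letter is handled as one "char" regardless of whether it is a single-byte or a double-byte character in the native encoding. (Strictly speaking, since J2SE 5.0 a char is a UTF-16 code unit, so characters outside the Basic Multilingual Plane occupy two chars; every character in this example fits in one.) This is a big improvement over older languages, where you had to check for lead bytes and branch your code depending on the character set.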
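The one-letter-per-char behavior holds for every character in the sample string. As a hedged illustration of where it stops holding, the sketch below compares the number of char values with the number of Unicode code points, first for the intl3 string and then for a character outside the 16-bit range. The class name and the choice of U+20BB7 are illustrative, and codePointCount requires J2SE 5.0 or later.

```java
class CharVersusCodePoint {
    // Number of 16-bit char values in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Same text as intl3, with the katakana written as Unicode escapes.
        String bmp = "ABC\u30BD\u30D5\u30C8123";
        // U+20BB7 lies outside the 16-bit range, so it is stored as a surrogate pair.
        String supplementary = "\uD842\uDFB7";

        System.out.println(charCount(bmp) + " " + codePointCount(bmp));                     // 9 9
        System.out.println(charCount(supplementary) + " " + codePointCount(supplementary)); // 2 1
    }
}
```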
To summarize, the following table shows which encoding is used at each stage of Java development.
| Area | Encoding |
|---|---|
| Source code | Native encoding of the system or Unicode-escape string* |
| class file | UTF-8 |
| Inside the program | UCS-2 |
| Output | Native encoding of the system |
Table 2
Note: Unicode-escape strings will be discussed later.
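The last row of Table 2 means that the conversion from the internal representation to bytes happens at the output stream. A minimal sketch of that boundary follows; the class name OutputEncodingSketch and the helper printWith are hypothetical. It sends the same String through PrintStream objects configured with different charsets and shows that the number of bytes produced depends on the stream's encoding, not on the String itself. UTF-8 and UTF-16BE are used because every JDK is required to support them.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class OutputEncodingSketch {
    // Prints text through a PrintStream that encodes with the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] printWith(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u30BD";  // one katakana character, one char
        System.out.println(printWith(text, "UTF-8").length);     // 3
        System.out.println(printWith(text, "UTF-16BE").length);  // 2
    }
}
```

System.out is itself a PrintStream, which is why the intl1 output above appears in the console's native encoding.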
Even though Java processes strings in Unicode (UTF-8, UCS-2), external systems that interface with Java do not necessarily use Unicode, and there is a good chance that a multi-byte character set (MBCS) is still used there. Character corruption can occur whenever the modules that communicate with an external system, whether your own Java classes or third-party classes, do not implement the conversion or do not handle it correctly.
For example, if you are storing data in an external RDBMS in an MBCS format, you should verify that the JDBC driver you use converts Unicode strings to the MBCS properly. The JDBC driver is responsible for converting the data passed to the methods of java.sql.Statement (e.g. executeUpdate) into whatever encoding the database uses. Since java.sql.Statement takes a String for the SQL statement, if your database supports only an MBCS, the JDBC driver has to convert the Unicode data to that MBCS correctly. Because not every character in Unicode is available in a given MBCS, to guarantee data integrity you also have to make sure that characters not supported by the target MBCS are filtered out before the data is passed to the JDBC driver.
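One way to detect characters that the target character set cannot represent is java.nio.charset.CharsetEncoder.canEncode. The sketch below is an assumption about how such a check might look, not a feature of any JDBC driver; the class and method names are hypothetical. US-ASCII stands in for the database's MBCS so that the example runs on any JDK. In practice you would pass the charset that matches your database, for example Charset.forName("windows-31j") if it is available.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class UnmappableCheck {
    // Returns the chars of text that the target charset cannot represent.
    // Limitation: chars are tested one at a time, so each half of a
    // surrogate pair is reported as unencodable.
    static String findUnencodable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder bad = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                bad.append(c);
            }
        }
        return bad.toString();
    }

    public static void main(String[] args) {
        // US-ASCII is a stand-in for the target MBCS (assumption for portability).
        String rejected = findUnencodable("ABC\u30BD123", StandardCharsets.US_ASCII);
        System.out.println(rejected.length());  // 1
    }
}
```

Whether to reject, replace, or strip the reported characters is an application decision; the check only tells you which ones would not survive the conversion.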
Using javac and locale
Another thing you have to consider is the locale of the machine on which you run javac. Consider a situation where two engineers are working on the same source code and both of them use javac to compile the ".java" file.
In this scenario, the first engineer's machine runs Japanese Windows XP, so javac handles source code that contains Shift JIS characters correctly. During compilation, strings in CP932 are converted to UTF-8 correctly, as we discussed earlier. However, if the second engineer (for example, a configuration management engineer) tries to compile the same source file on a machine with the English-US locale, an error is generated as shown below:
[root@325cds site]# javac intl4.java
intl4.java:8: illegal escape character
javac uses the locale of the machine when converting from the native character set to UTF-8, and the first character of the Japanese string has 0x5c as its second byte. Since 0x5c is assigned to the backslash in the ASCII character set, javac interprets it as a backslash instead of the second byte of a double-byte Japanese character, and generates an error.
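The misreading of the 0x5c byte can be reproduced in isolation. The sketch below takes the two codepage 932 bytes of "ソ" (0x83 0x5C, from Table 1) and decodes them with a single-byte Western charset. ISO-8859-1 is used as a stand-in for the English-US machine's encoding, which is an assumption; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class BackslashTrailByte {
    // Decodes each byte as one character, as a single-byte charset does.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Codepage 932 bytes of the katakana "so": lead byte 0x83, trail byte 0x5C.
        byte[] so = { (byte) 0x83, (byte) 0x5C };
        String misread = decodeAsLatin1(so);
        System.out.println(misread.length());           // 2
        System.out.println(misread.charAt(1) == '\\');  // true
    }
}
```

Decoded this way, the trail byte becomes a literal backslash inside the string literal, and whatever byte follows it is then read as an escape sequence, which is consistent with the "illegal escape character" error above.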
If you remove the first Japanese letter "ソ", the file can be compiled on the machine with the English-US locale. However, the compiled class file now contains corrupt characters.
When you run the program on the Japanese machine, you can see this problem at offset 4 to 6 as shown below. In a sense, this is harder to detect because there is no error during the compilation.
C:\programs\javatest\site>java intl5
ABC?t?g123
0:A
1:B