Character set and Java
Introduction
Character set conversion is one of the trickiest areas in Java. In this section, we will look at several areas where character corruption can occur if you are not aware of the underlying character set or encoding issues.
Character set used in Java and javac
First, let's take a look at some simple code that contains a Japanese string, and go through the compilation process to see which character set is used in each of the file formats involved in a typical development cycle.
In this example, the string is output to the console. The console is set to codepage 932, which is used for displaying both the source code and the output. Codepage 932 is commonly known as Shift JIS (strictly, it is Microsoft's extension of Shift JIS) and is the standard character set for personal computers used in Japan:
C:\programs\javatest\site>chcp
現在のコード ページ: 932
Shown below is the source code displayed on the same console using the "type" command:
C:\programs\javatest\site>type intl1.java
public class intl1 {
    public static void main(String args[])
    {
        System.out.print("ソフトウェアの国際化のデモ");
    }
}
intl1.java
If you compile intl1.java using javac and run it with the "java" command, you'll see the following output on the console:
C:\programs\javatest\site>java intl1
ソフトウェアの国際化のデモ
Output of intl1.class
As you can see:
- Source code is encoded in codepage 932.
- Output is generated in codepage 932.
At this point, we still do not know how the string is encoded in the class file. A natural guess would be that it is encoded in codepage 932 as well.
If so, the string "ソフトウェアの国際化のデモ" would be stored as the following byte sequence (shown in hexadecimal), and we should be able to find it in the class file (unless literal strings are compressed):
835c 8374 8367 8345 8346 8341 82cc 8d91 8ddb 89bb 82cc 8366 8382
Let's dump the content to see if this assumption is true:
C:\programs\javatest\site>tdump intl1.class
Turbo Dump Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation
Display of File INTL1.CLASS
000000: CA FE BA BE 00 03 00 2D 00 22 0A 00 06 00 14 09 ハ.コセ...-."......
000010: 00 15 00 16 08 00 17 0A 00 18 00 19 07 00 1A 07 ................
000020: 00 1B 01 00 06 3C 69 6E 69 74 3E 01 00 03 28 29 .....<init>...()
000030: 56 01 00 04 43 6F 64 65 01 00 0F 4C 69 6E 65 4E V...Code...LineN
000040: 75 6D 62 65 72 54 61 62 6C 65 01 00 12 4C 6F 63 umberTable...Loc
000050: 61 6C 56 61 72 69 61 62 6C 65 54 61 62 6C 65 01 alVariableTable.
000060: 00 04 74 68 69 73 01 00 07 4C 69 6E 74 6C 31 3B ..this...Lintl1;
000070: 01 00 04 6D 61 69 6E 01 00 16 28 5B 4C 6A 61 76 ...main...([Ljav
000080: 61 2F 6C 61 6E 67 2F 53 74 72 69 6E 67 3B 29 56 a/lang/String;)V
000090: 01 00 04 61 72 67 73 01 00 13 5B 4C 6A 61 76 61 ...args...[Ljava
0000A0: 2F 6C 61 6E 67 2F 53 74 72 69 6E 67 3B 01 00 0A /lang/String;...
0000B0: 53 6F 75 72 63 65 46 69 6C 65 01 00 0A 69 6E 74 SourceFile...int
0000C0: 6C 31 2E 6A 61 76 61 0C 00 07 00 08 07 00 1C 0C l1.java.........
0000D0: 00 1D 00 1E 01 00 27 E3 82 BD E3 83 95 E3 83 88 ......'繧ス繝輔ヨ
0000E0: E3 82 A6 E3 82 A7 E3 82 A2 E3 81 AE E5 9B BD E9 繧ヲ繧ァ繧「縺ョ蝗ス髫
0000F0: 9A 9B E5 8C 96 E3 81 AE E3 83 87 E3 83 A2 07 00 帛喧縺ョ繝・Δ..
000100: 1F 0C 00 20 00 21 01 00 05 69 6E 74 6C 31 01 00 ... .!...intl1..
000110: 10 6A 61 76 61 2F 6C 61 6E 67 2F 4F 62 6A 65 63 .java/lang/Objec
000120: 74 01 00 10 6A 61 76 61 2F 6C 61 6E 67 2F 53 79 t...java/lang/Sy
000130: 73 74 65 6D 01 00 03 6F 75 74 01 00 15 4C 6A 61 stem...out...Lja
000140: 76 61 2F 69 6F 2F 50 72 69 6E 74 53 74 72 65 61 va/io/PrintStrea
000150: 6D 3B 01 00 13 6A 61 76 61 2F 69 6F 2F 50 72 69 m;...java/io/Pri
000160: 6E 74 53 74 72 65 61 6D 01 00 05 70 72 69 6E 74 ntStream...print
000170: 01 00 15 28 4C 6A 61 76 61 2F 6C 61 6E 67 2F 53 ...(Ljava/lang/S
000180: 74 72 69 6E 67 3B 29 56 00 21 00 05 00 06 00 00 tring;)V.!......
000190: 00 00 00 02 00 01 00 07 00 08 00 01 00 09 00 00 ................
0001A0: 00 33 00 01 00 01 00 00 00 05 2A B7 00 01 B1 00 .3........*キ..ア.
0001B0: 00 00 02 00 0A 00 00 00 0A 00 02 00 00 00 01 00 ................
0001C0: 04 00 01 00 0B 00 00 00 0C 00 01 00 00 00 05 00 ................
0001D0: 0C 00 0D 00 00 00 09 00 0E 00 0F 00 01 00 09 00 ................
0001E0: 00 00 37 00 02 00 01 00 00 00 09 B2 00 02 12 03 ..7........イ....
0001F0: B6 00 04 B1 00 00 00 02 00 0A 00 00 00 0A 00 02 カ..ア............
000200: 00 00 00 05 00 08 00 06 00 0B 00 00 00 0C 00 01 ................
000210: 00 00 00 09 00 10 00 11 00 00 00 01 00 12 00 00 ................
000220: 00 02 00 13 00 00 00 00 00 00 00 00 00 00 00 00 ................
Unfortunately, the hexadecimal sequence that we are looking for is not in the class file. To make the investigation easier, let's change the source code to the following:
public class intl2 {
    public static void main(String args[])
    {
        System.out.print("ABCソフト123");
    }
}
Hexadecimal representations of each letter in codepage 932, Unicode (UCS-2), and UTF-8 are shown in the table below:
| Letter | Codepage 932 | Unicode (UCS-2) | UTF-8 |
|---|---|---|---|
| A | 0x41 | 0x0041 | 0x41 |
| B | 0x42 | 0x0042 | 0x42 |
| C | 0x43 | 0x0043 | 0x43 |
| ソ | 0x835C | 0x30BD | 0xE3 0x82 0xBD |
| フ | 0x8374 | 0x30D5 | 0xE3 0x83 0x95 |
| ト | 0x8367 | 0x30C8 | 0xE3 0x83 0x88 |
| 1 | 0x31 | 0x0031 | 0x31 |
| 2 | 0x32 | 0x0032 | 0x32 |
| 3 | 0x33 | 0x0033 | 0x33 |
Table 1
This way, the Japanese string is sandwiched between "ABC" and "123" and we should be able to figure out the encoding more easily.
Below is the result obtained by using tdump against the class file:
C:\programs\javatest\site>tdump intl2.class
Turbo Dump Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation
Display of File INTL2.CLASS
000000: CA FE BA BE 00 03 00 2D 00 1D 0A 00 06 00 0F 09 ハ.コセ...-........
000010: 00 10 00 11 08 00 12 0A 00 13 00 14 07 00 15 07 ................
000020: 00 16 01 00 06 3C 69 6E 69 74 3E 01 00 03 28 29 .....<init>...()
000030: 56 01 00 04 43 6F 64 65 01 00 0F 4C 69 6E 65 4E V...Code...LineN
000040: 75 6D 62 65 72 54 61 62 6C 65 01 00 04 6D 61 69 umberTable...mai
000050: 6E 01 00 16 28 5B 4C 6A 61 76 61 2F 6C 61 6E 67 n...([Ljava/lang
000060: 2F 53 74 72 69 6E 67 3B 29 56 01 00 0A 53 6F 75 /String;)V...Sou
000070: 72 63 65 46 69 6C 65 01 00 0A 69 6E 74 6C 32 2E rceFile...intl2.
000080: 6A 61 76 61 0C 00 07 00 08 07 00 17 0C 00 18 00 java............
000090: 19 01 00 0F 41 42 43 E3 82 BD E3 83 95 E3 83 88 ....ABC繧ス繝輔ヨ
0000A0: 31 32 33 07 00 1A 0C 00 1B 00 1C 01 00 05 69 6E 123...........in
0000B0: 74 6C 32 01 00 10 6A 61 76 61 2F 6C 61 6E 67 2F tl2...java/lang/
0000C0: 4F 62 6A 65 63 74 01 00 10 6A 61 76 61 2F 6C 61 Object...java/la
0000D0: 6E 67 2F 53 79 73 74 65 6D 01 00 03 6F 75 74 01 ng/System...out.
0000E0: 00 15 4C 6A 61 76 61 2F 69 6F 2F 50 72 69 6E 74 ..Ljava/io/Print
0000F0: 53 74 72 65 61 6D 3B 01 00 13 6A 61 76 61 2F 69 Stream;...java/i
000100: 6F 2F 50 72 69 6E 74 53 74 72 65 61 6D 01 00 05 o/PrintStream...
000110: 70 72 69 6E 74 01 00 15 28 4C 6A 61 76 61 2F 6C print...(Ljava/l
000120: 61 6E 67 2F 53 74 72 69 6E 67 3B 29 56 00 21 00 ang/String;)V.!.
000130: 05 00 06 00 00 00 00 00 02 00 01 00 07 00 08 00 ................
000140: 01 00 09 00 00 00 1D 00 01 00 01 00 00 00 05 2A ...............*
000150: B7 00 01 B1 00 00 00 01 00 0A 00 00 00 06 00 01 キ..ア............
000160: 00 00 00 01 00 09 00 0B 00 0C 00 01 00 09 00 00 ................
000170: 00 25 00 02 00 01 00 00 00 09 B2 00 02 12 03 B6 .%........イ....カ
000180: 00 04 B1 00 00 00 01 00 0A 00 00 00 0A 00 02 00 ..ア.............
000190: 00 00 05 00 08 00 06 00 01 00 0D 00 00 00 02 00 ................
0001A0: 0E 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
In the section following the address 0x90, we can see "ABC" and "123":
000090: 19 01 00 0F 41 42 43 E3 82 BD E3 83 95 E3 83 88 ....ABC繧ス繝輔ヨ
0000A0: 31 32 33 07 00 1A 0C 00 1B 00 1C 01 00 05 69 6E 123...........in
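Comparing this result with Table 1, we can see that the string is stored in UTF-8 in the class file: "ABC" (41 42 43) is followed by E3 82 BD E3 83 95 E3 83 88 and then by "123" (31 32 33). Strictly speaking, the class file format uses a slightly modified form of UTF-8, which differs from standard UTF-8 only in how it encodes U+0000 and characters outside the Basic Multilingual Plane. If you are interested in the details of the class file format, please refer to Chapter 12 of "Java Virtual Machine" by Jon Meyer & Troy Downing.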
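The byte values in Table 1 can be checked with a few lines of Java. The following is a minimal sketch and not one of the original test programs: the class name EncodingTable and its hex helper are made up for illustration, and it assumes that the Java runtime has the "Shift_JIS" charset installed. The string is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingTable {
    // Format a byte array as space-separated, upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "ABC", three katakana letters, "123", written in pure ASCII.
        String s = "ABC\u30BD\u30D5\u30C8123";
        // Assumption: this charset is available in the runtime.
        Charset sjis = Charset.forName("Shift_JIS");
        for (int i = 0; i < s.length(); i++) {
            String letter = s.substring(i, i + 1);
            // Columns: Unicode value | Shift_JIS bytes | UTF-8 bytes
            System.out.println(String.format("U+%04X", (int) s.charAt(i))
                    + " | " + hex(letter.getBytes(sjis))
                    + " | " + hex(letter.getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

The letters themselves are deliberately not printed, so the output looks the same on a console of any codepage.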
Next, let's take a look at how these strings are processed inside the Java program.
We will do this by changing intl2.java so that it dumps the content of the String object.
public class intl3 {
    public static void main(String args[])
    {
        char c;
        Character cObj;
        int i;
        String s = "ABCソフト123";
        System.out.print(s + "\n");
        for(i = 0; i < s.length(); i++){
            c = s.charAt(i);
            cObj = new Character(c);
            System.out.print(Integer.toString(i) + ":" + cObj.toString() + "\n");
        }
    }
}
intl3.java
Output is shown below:
C:\programs\javatest\site>java intl3
ABCソフト123
0:A
1:B
2:C
3:ソ
4:フ
5:ト
6:1
7:2
8:3
As shown above, during the execution of a Java program each letter is held as a 16-bit Unicode value (UCS-2). In other words, each of these letters is one "char" regardless of whether it is an ASCII letter or a Japanese letter. (Since J2SE 5.0, a "char" is a UTF-16 code unit, and characters outside the Basic Multilingual Plane occupy two of them; none of the characters in this example are affected.) This is a big improvement over earlier languages, where you had to check for lead bytes and branch your code based on the character set.
To summarize, the following table shows which encoding is used in each area of Java development.
| Area | Encoding |
|---|---|
| Source code | Native encoding of the system or Unicode-escape string* |
| class file | UTF-8 |
| Inside the program | UCS-2 |
| Output | Native encoding of the system |
Table 2
* Note: Unicode-escape strings will be discussed later.
Even though Java processes strings in Unicode (UTF-8, UCS-2), external systems that interface with Java are not necessarily using Unicode, and there is a good chance that a multi-byte character set (MBCS) is still used there. Character corruption can take place whenever a module in your own Java classes or in third-party classes that communicates with an external system does not implement the conversion or does not handle it correctly. For example, if you are storing data in an external RDBMS in an MBCS format, you should verify that the JDBC driver you use converts Unicode strings to the MBCS properly. The JDBC driver is responsible for converting data passed to the methods of the java.sql.Statement class (e.g. the executeUpdate method) into whatever encoding is used in the database. Since java.sql.Statement takes a SQL statement as a String, if your database only supports an MBCS, the JDBC driver has to convert the Unicode data to that MBCS correctly. Since not all characters supported in Unicode are supported in an MBCS, to guarantee the integrity of the data you also have to make sure that characters that are not supported in the target MBCS are filtered out before they are passed to the JDBC driver.
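One way to detect characters that the target MBCS cannot represent is to ask a CharsetEncoder before the data is handed to the driver. The following is only a sketch under stated assumptions: the class MbcsCheck and the method firstUnmappable are hypothetical names, "Shift_JIS" stands in for whatever encoding your database uses, and the check is done per char, so a character outside the Basic Multilingual Plane is always reported as unmappable.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class MbcsCheck {
    // Return the index of the first char that the target charset cannot
    // represent, or -1 if every char of the string can be encoded.
    static int firstUnmappable(String s, Charset target) {
        CharsetEncoder enc = target.newEncoder();
        for (int i = 0; i < s.length(); i++) {
            if (!enc.canEncode(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumption: the database stores text in Shift_JIS.
        Charset sjis = Charset.forName("Shift_JIS");
        // Katakana letters exist in Shift_JIS.
        System.out.println(firstUnmappable("ABC\u30BD\u30D5\u30C8123", sjis));
        // U+D55C is a Korean Hangul syllable, which Shift_JIS does not contain.
        System.out.println(firstUnmappable("AB\uD55C", sjis));
    }
}
```

Whether to reject, replace, or strip the offending character is an application decision; the sketch only locates it.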
Using javac and locale
Another thing you have to consider is the locale of the machine you run javac on. Let's think of a situation where two engineers are working on the same source code and both of them use javac to compile the ".java" file.
In this scenario, the first engineer's machine runs the Japanese version of Windows XP, so javac handles source code that contains Shift-JIS characters correctly. During the compilation, strings in CP932 are converted to UTF-8 correctly, as we discussed earlier. However, if the second engineer (for example, a configuration management engineer) tries to compile the same source file on a machine with the English-US locale, an error is generated as shown below:
[root@325cds site]# javac intl4.java
intl4.java:8: illegal escape character
Javac uses the locale of the machine when converting from the native character set to UTF-8, and the first character of the Japanese string, "ソ" (0x835C), has 0x5c as its second byte. Since 0x5c is assigned to the backslash in the ASCII character set, javac interprets it as a backslash instead of the second byte of a double-byte character, and generates an error.
If you remove the first Japanese letter "ソ", the file can be compiled on the machine with the English-US locale. However, the compiled class file now contains corrupt characters. When you run the program on the Japanese machine, you can see this problem at offsets 3 to 6 as shown below. In a sense, this is harder to detect because there is no error during the compilation.
C:\programs\javatest\site>java intl5
ABC?t?g123
0:A
1:B
2:C
3:?
4:t
5:?
6:g
7:1
8:2
9:3
This happens because javac assumes that the source code is encoded in the native encoding of the machine on which it runs unless the "-encoding" option is specified. In this case, the locale of the machine is English-US, so the Japanese characters end up corrupt in the generated .class file. (If the source code is compiled with "javac -encoding SJIS" or "javac -encoding SHIFT_JIS", javac compiles the file correctly on the English machine.)
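Both failures can be reproduced without javac by decoding the codepage 932 bytes with the wrong charset. The sketch below is an illustration only: the class WrongSourceEncoding and the method roundTrip are hypothetical, and ISO-8859-1 is used as a stand-in for the default encoding of the English-US machine, which is an assumption because the text above does not say which encoding that machine used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongSourceEncoding {
    // Decode raw bytes with the charset the compiler assumed, then encode
    // the resulting string for a console that uses another charset.
    static byte[] roundTrip(byte[] raw, Charset assumed, Charset console) {
        return new String(raw, assumed).getBytes(console);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        Charset latin1 = StandardCharsets.ISO_8859_1; // assumed English-US default

        // First katakana letter of the string in codepage 932: 0x83 0x5C.
        byte[] so = { (byte) 0x83, 0x5C };
        // Read as single bytes, the second byte is a backslash.
        System.out.println(new String(so, latin1).charAt(1) == '\\');

        // The remaining two katakana letters: 0x83 0x74 0x83 0x67.
        byte[] futo = { (byte) 0x83, 0x74, (byte) 0x83, 0x67 };
        // 0x83 becomes a control character that the Japanese console
        // cannot display, while 0x74 and 0x67 become the letters t and g.
        byte[] shown = roundTrip(futo, latin1, sjis);
        System.out.println(new String(shown, StandardCharsets.US_ASCII));
    }
}
```

The second line of output has the same shape as the corrupt output of intl5: each lead byte turns into a question mark and each trail byte into an ASCII letter.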
To avoid this problem in a situation where multiple engineers use different locales on their machines, it is safer to store the strings in the Unicode-escape format. If you run the tool "native2ascii" against the source code above, "フト" is converted to "\u30d5\u30c8". This string represents the sequence of Unicode code points in hexadecimal, each with the leading escape sequence "\u". If you store the strings this way, non-ASCII strings in the source code are converted to the correct code points in UTF-8 on an English machine even without the "-encoding" option, since the file itself contains only ASCII characters. By setting a rule that all source code must be converted to Unicode-escape sequences before it is checked into the source code control system, you can guarantee that the source code is clean and free from character corruption.
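To make the escape format concrete, here is a rough sketch of the forward conversion. It is not the implementation of native2ascii: the class UnicodeEscaper and the method escape are hypothetical, and the sketch simply escapes every char above 0x7F with four lower-case hex digits, which matches the example output quoted above.

```java
public class UnicodeEscaper {
    // Replace every char above 0x7F with a backslash, the letter u and
    // the four-digit hex value of the char; ASCII is copied unchanged.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two katakana letters from the example, surrounded by ASCII.
        System.out.println(escape("ABC\u30D5\u30C8123"));
    }
}
```

Because the escaped text is plain ASCII, it survives any locale-dependent conversion that keeps ASCII intact.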
There is one caveat with this approach from the developer's perspective: it reduces the readability of the messages in the source code. This problem can be solved by incorporating native2ascii into the check-in process. Namely, you can provide a check-in script that runs "native2ascii" during the check-in, and design the check-out script in such a way that "native2ascii" is invoked with the "-reverse" option so that the messages are converted back to the engineer's native encoding.
To sum up, there are two approaches to compiling a source file that contains non-ASCII strings on an English machine:
- Compile the source code using javac with the "-encoding" option
- Convert the source code using native2ascii and compile it with javac
File I/O
The same principle that applies to compiling a file containing non-ASCII strings with javac also applies to file I/O. When you read external data or write data to an external system, you have to understand the encoding used by the external system and decide where the conversion should take place.
Let's consider the case where you want to read a file called "legacy.txt" that contains the following text in codepage 932. The file is in plain text format and contains the same string that we used in the previous examples.
ABCソフト123
legacy.txt
The following code is used to read the text file:
import java.io.*;
public class intl9 {
    public static void main(String args[])
    {
        char c;
        Character cObj;
        int i;
        String sFile = "legacy.txt";
        FileInputStream fis;
        File f;
        long lLength;
        byte byFile[];
        String s = "";
        // Code
        try{
            f = new File(sFile);
            lLength = f.length();
            byFile = new byte[(int)lLength]; // Allocate buffer
            fis = new FileInputStream(f);
            fis.read(byFile); // Copy to the buffer
            s = new String(byFile); // Decode using the default encoding
            fis.close();
        }
        catch(Exception e){
            System.out.print(e.getMessage());
        }
        for(i = 0; i < s.length(); i++){
            c = s.charAt(i);
            cObj = new Character(c);
            System.out.print(Integer.toString(i) + ":" + cObj.toString() + "\n");
        }
    }
}
If you run this code, you will get the following output:
C:\programs\javatest\site>java intl9
0:A
1:B
2:C
3:ソ
4:フ
5:ト
6:1
7:2
8:3
In this example, the raw file data is read into a byte array and then converted to a Java String by the constructor of the String class. Since we do not specify an encoding in the constructor, it uses the default encoding of the system on which the JVM is running. Since legacy.txt is encoded in codepage 932 and the machine is running on codepage 932, this does not cause any data corruption. What if the encoding of the file is different from that of the machine reading it? Then you have to specify the encoding of the byte array in the String class constructor as follows:
s = new String(byFile, "Shift_JIS");
By specifying the encoding, you can process the file correctly on a machine whose default encoding differs from the encoding of the file.
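The same idea can be written more compactly with the java.nio.file API, which reads the whole file in one call. This is a sketch under stated assumptions rather than a replacement for intl9: the class ReadWithCharset and the method readAll are hypothetical names, and the program creates its own temporary stand-in for legacy.txt so that it can run anywhere.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    // Read the whole file and decode it with the named encoding,
    // independent of the default encoding of the machine.
    static String readAll(Path file, String encoding) throws IOException {
        return new String(Files.readAllBytes(file), Charset.forName(encoding));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for legacy.txt: "ABC", three katakana letters and "123"
        // encoded in codepage 932.
        Path tmp = Files.createTempFile("legacy", ".txt");
        byte[] cp932 = { 0x41, 0x42, 0x43, (byte) 0x83, 0x5C, (byte) 0x83, 0x74,
                         (byte) 0x83, 0x67, 0x31, 0x32, 0x33 };
        Files.write(tmp, cp932);

        String s = readAll(tmp, "Shift_JIS");
        // Print each char as a hex value so the console encoding does not matter.
        for (int i = 0; i < s.length(); i++) {
            System.out.println(i + ":" + Integer.toHexString(s.charAt(i)));
        }
        Files.delete(tmp);
    }
}
```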
Using InputStreamReader
Another approach is to use the InputStreamReader class. InputStreamReader is a subclass of the Reader class, which is designed to read a character stream rather than a binary stream. In the code below, a FileInputStream object is instantiated as in the previous example, but the instance is passed to the constructor of InputStreamReader so that it can convert the byte stream to a character stream using the encoding that we specify.
import java.io.*;
public class intl12 {
    public static void main(String args[])
    {
        char c;
        int i;
        String sFile = "legacy.txt";
        FileInputStream fis;
        File f;
        long lLength;
        char acFile[] = null;
        String s = "";
        int n;
        InputStreamReader isr;
        BufferedReader br;
        int nResult;
        // Code
        try{
            f = new File(sFile);
            acFile = new char[4096]; // Allocate buffer
            fis = new FileInputStream(f);
            isr = new InputStreamReader(fis, "Shift_JIS");
            br = new BufferedReader(isr);
            nResult = br.read(acFile,
                              0, // offset
                              4096);
            s = new String(acFile, 0, nResult);
            br.close();
            isr.close();
            fis.close();
        }
        catch(Exception e){
            System.out.print("Exception thrown:" + e.getMessage());
        }
        for(i = 0; i < s.length(); i++){
            c = s.charAt(i);
            n = (int)c;
            System.out.print(Integer.toString(i) + ":" + Integer.toHexString(n) + "\n");
        }
    }
}
If the encoding of the file is the same as the one used on the machine that reads the text, you can use the FileReader class as shown below.
import java.io.*;
public class intl13 {
    public static void main(String args[])
    {
        char c;
        int i;
        String sFile = "legacy.txt";
        FileInputStream fis;
        File f;
        long lLength;
        char acFile[] = null;
        String s = "";
        int n;
        FileReader fr;
        BufferedReader br;
        int nResult;
        // Code
        try{
            f = new File(sFile);
            acFile = new char[4096]; // Allocate buffer
            fr = new FileReader(f);
            br = new BufferedReader(fr);
            nResult = br.read(acFile,
                              0, // offset
                              4096);
            s = new String(acFile, 0, nResult);
            br.close();
            fr.close();
        }
        catch(Exception e){
            System.out.print("Exception thrown:" + e.getMessage());
        }
        for(i = 0; i < s.length(); i++){
            c = s.charAt(i);
            n = (int)c;
            System.out.print(Integer.toString(i) + ":" + Integer.toHexString(n) + "\n");
        }
    }
}
However, at the time of writing FileReader does not have a constructor that takes an encoding (constructors that accept a Charset were only added in Java 11), so if you need to specify the encoding on an older runtime, you need to use one of the approaches discussed earlier.
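On Java 7 and later, Files.newBufferedReader offers another way to get a character reader with an explicit encoding. The sketch below is an assumption-laden illustration, not part of the original examples: the class ReaderWithCharset and the method firstLine are hypothetical names, and the program writes its own temporary stand-in for legacy.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReaderWithCharset {
    // Read the first line of a file, decoding it with the given charset.
    // Unlike InputStreamReader, this reader reports malformed input with
    // an exception instead of substituting a replacement character.
    static String firstLine(Path file, Charset cs) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file, cs)) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for legacy.txt: "ABC", three katakana letters and "123"
        // encoded in codepage 932.
        Path tmp = Files.createTempFile("legacy", ".txt");
        Files.write(tmp, new byte[] { 0x41, 0x42, 0x43, (byte) 0x83, 0x5C,
                (byte) 0x83, 0x74, (byte) 0x83, 0x67, 0x31, 0x32, 0x33 });

        String s = firstLine(tmp, Charset.forName("Shift_JIS"));
        for (int i = 0; i < s.length(); i++) {
            System.out.println(i + ":" + Integer.toHexString(s.charAt(i)));
        }
        Files.delete(tmp);
    }
}
```

The try-with-resources statement closes the reader even when decoding fails, which the explicit close calls in the earlier examples do not guarantee.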
Code conversion tool
To illustrate how code conversion can be used in practical applications, I wrote a simple tool that converts the encoding of a file.
Usage
java CodeConversion <input file> -input_encoding=<input encoding> -output_encoding=<output encoding>
Output is written to standard output, so you can redirect it to a file if needed.
The complete source code is shown below:
//
// Copyright (C) 2002 Hideyuki Inada, Capitola Computing Inc. All rights reserved
//
//
// Usage:
// java CodeConversion <input file> -input_encoding=<input encoding> -output_encoding=<output encoding>
// Output will be generated to standard output, so you can redirect it to a file if needed.
// Below is an example of how to convert the file "test.txt" from Shift_JIS
// and output to console in UTF-8:
//
// java CodeConversion test.txt -input_encoding=Shift_JIS -output_encoding=UTF-8
//
import java.io.*;
public class CodeConversion {
    public static void main(String args[])
    {
        // Constant
        String sOption1 = "-input_encoding=";
        String sOption2 = "-output_encoding=";
        String sUsage = "Usage : java CodeConversion <input file name> " +
                        sOption1 + "<encoding of the input file> " +
                        sOption2 + "<encoding of the output>\n\n" +
                        "Please note that no space is allowed before and after the equal sign.";
        // Var
        String sFile;
        FileInputStream fis;
        File fIn;
        long lLength;
        byte byFileIn[];
        byte byFileOut[];
        String s = "";
        String sInputEncoding;
        String sOutputEncoding;
        // Code
        try{
            if(args.length != 3){
                showUsage(sUsage);
                System.exit(-1);
            }
            // Parse command line
            // File name
            sFile = args[0];
            // Input encoding: the argument must start with the option name
            if(!args[1].startsWith(sOption1)){
                showUsage(sUsage);
                System.exit(-1);
            }
            sInputEncoding = args[1].substring(sOption1.length());
            // Output encoding: the argument must start with the option name
            if(!args[2].startsWith(sOption2)){
                showUsage(sUsage);
                System.exit(-1);
            }
            sOutputEncoding = args[2].substring(sOption2.length());
            // Read input file
            fIn = new File(sFile); // If the file is not found, an exception will be thrown which we'll catch
            lLength = fIn.length();
            byFileIn = new byte[(int)lLength]; // Allocate buffer
            fis = new FileInputStream(fIn);
            fis.read(byFileIn); // Copy to the buffer
            s = new String(byFileIn, sInputEncoding); // Convert from input encoding to Unicode
            fis.close();
            // Output file
            byFileOut = s.getBytes(sOutputEncoding); // Convert from Unicode to output encoding
            System.out.write(byFileOut, 0, byFileOut.length);
        }
        catch(Exception e){
            // Compare by type; comparing class names with == is unreliable
            if(e instanceof UnsupportedEncodingException){
                System.err.print("Specified encoding is not supported: " + e.getMessage());
            }
            else{
                System.err.print(e.getClass().getName() + ": " + e.getMessage());
            }
        }
    }
    static void showUsage(String sUsage)
    {
        // Code
        System.err.print(sUsage);
    }
}
The following is the output:
C:\programs\javatest\site>java CodeConversion legacy.txt -input_encoding=Shift_JIS -output_encoding=UTF-8 > utf8.txt
C:\programs\javatest\site>tdump legacy.txt
Turbo Dump Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation
Display of File LEGACY.TXT
000000: 41 42 43 83 5C 83 74 83 67 31 32 33 00 00 00 00 ABCソフト123....
C:\programs\javatest\site>tdump utf8.txt
Turbo Dump Version 5.0.16.6 Copyright (c) 1988, 1999 Inprise Corporation
Display of File UTF8.TXT
000000: 41 42 43 E3 82 BD E3 83 95 E3 83 88 31 32 33 00 ABC繧ス繝輔ヨ123.
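CodeConversion holds the whole file in memory. For larger inputs, the same conversion can be done as a stream by pairing an InputStreamReader with an OutputStreamWriter. The following is a sketch of that variation, not the author's tool: the class StreamingConversion and the method convert are hypothetical names, and in-memory streams stand in for the input file and standard output so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamingConversion {
    // Copy text from in to out, decoding with one charset and encoding
    // with another, one buffer of chars at a time.
    static void convert(InputStream in, Charset from,
                        OutputStream out, Charset to) throws IOException {
        Reader r = new InputStreamReader(in, from);
        Writer w = new OutputStreamWriter(out, to);
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            w.write(buf, 0, n);
        }
        w.flush(); // push any buffered bytes to the underlying stream
    }

    public static void main(String[] args) throws IOException {
        // The content of legacy.txt in codepage 932.
        byte[] cp932 = { 0x41, 0x42, 0x43, (byte) 0x83, 0x5C, (byte) 0x83, 0x74,
                         (byte) 0x83, 0x67, 0x31, 0x32, 0x33 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        convert(new ByteArrayInputStream(cp932), Charset.forName("Shift_JIS"),
                out, StandardCharsets.UTF_8);
        // Print the converted bytes in hex for comparison with the dump.
        for (byte b : out.toByteArray()) {
            System.out.print(String.format("%02X ", b & 0xFF));
        }
        System.out.println();
    }
}
```

In a command-line tool, the two in-memory streams would be replaced by a FileInputStream and System.out.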
If you are not sure which encoding name to use, you can consult the character set registry maintained by IANA.
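You can also ask the Java runtime itself whether it recognizes a given name. The sketch below is illustrative only: the class CharsetLookup and the method canonicalName are hypothetical names, and which names and aliases are recognized depends on the Java runtime you use.

```java
import java.nio.charset.Charset;

public class CharsetLookup {
    // Return the canonical name the runtime uses for an encoding name or
    // alias, or null if the name is malformed or not supported.
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for names that contain illegal characters.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = { "Shift_JIS", "SJIS", "UTF-8", "no-such-encoding" };
        for (String name : names) {
            System.out.println(name + " -> " + canonicalName(name));
        }
    }
}
```

Charset.availableCharsets() returns the full list of charsets known to the runtime if you prefer to browse instead of probing individual names.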