JavaScript and character set
JavaScript is becoming an essential part of the web as readers expect more and more interactivity. Because your web site can be accessed from all over the world, it is important to take internationalization issues into account when you write your JavaScript code.
This document discusses the character set of the JavaScript file.
Character set of the JavaScript file
You are unlikely to use non-ASCII characters in function or variable names, so the part of a script that most likely needs translating into other languages is its messages. If your JavaScript contains translatable messages, you have to consider the character set of your JavaScript file. There are two ways to make JavaScript available to your HTML files. One way is to include the JavaScript in the HTML file itself within the <SCRIPT LANGUAGE="JavaScript"> tag. The other is to write a separate standalone JavaScript file and refer to it using the <SCRIPT SRC> tag (for example, <SCRIPT SRC="./common.js"></SCRIPT>).
Including JavaScript in your HTML
In this case, string literals in your script are regarded as being in the same character set as the rest of the HTML file. Let's take a look at an HTML file with one line of JavaScript code:
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<SCRIPT LANGUAGE="JavaScript">
function doButtonClick()
{
alert("Thank you.");
}
</SCRIPT>
<TITLE>js_example_1</TITLE>
</HEAD>
<BODY>
<CENTER>
<FORM>
Please click the button.<P>
<INPUT TYPE=BUTTON VALUE="OK" NAME=btOK onClick='doButtonClick()'>
</FORM>
</CENTER>
</BODY>
</HTML>
In this case, "Thank you.", "Please click the button." and "OK" will be regarded as encoded in ISO-8859-1, which is specified in the meta http-equiv directive.
If you want to translate this into Japanese, you can translate those three strings and set the charset to "Shift_JIS".
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<SCRIPT LANGUAGE="JavaScript">
function doButtonClick()
{
alert("有難うございます。");
}
</SCRIPT>
<TITLE>js_example_2</TITLE>
</HEAD>
<BODY>
<CENTER>
<FORM>
ボタンをクリックしてください。<P>
<INPUT TYPE=BUTTON VALUE="了解" NAME=btOK onClick='doButtonClick()'>
</FORM>
</CENTER>
</BODY>
</HTML>
Note: Normally, you would not translate "OK" into Japanese, but in this example it is translated for illustration purposes. If you open the page with Internet Explorer, the following will be displayed:

When you click the button, the following dialog is displayed with the "Thank you" message in Japanese:

Please note that all three elements (the string in the HTML body, the string on the button and the string in the JavaScript) are displayed correctly. If you want to change the character set of your HTML file, you can simply apply a code conversion tool to the entire HTML file and specify the new character set in the meta http-equiv directive.
For example, js_example_3.htm is the same as js_example_2.htm except that it uses UTF-8. (Sample source code for a character set conversion tool is listed at http://www.capitolacomputing.com/intl_java_charset.htm.)
This sounds easy to implement, but if you generate your web page dynamically from a database or server-side script, you may run into a problem. For example, if you use one database to store the JavaScript and another to store the HTML file without the JavaScript portion, you have to make sure that the query results from both databases are in the same encoding, or characters will be corrupted. Imagine, for instance, that you changed the output encoding from ISO-8859-1 to Shift_JIS for the HTML portion but forgot to do so for the JavaScript portion.
As a footnote, if you are running an English version of the browser, you may get the following dialog box when you click the OK button in js_example_2.htm and js_example_3.htm.

This is because the non-Japanese version of the browser that you use does not support the display of Japanese characters. If you try it on the Japanese version of the browser, it should work.
Referring to a separate JavaScript file with the <SCRIPT SRC> tag
If the same function is used in many HTML files, it is a better idea to store the functions in a single JavaScript file and refer to that file using the <SCRIPT SRC> tag.
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<SCRIPT LANGUAGE="JavaScript" SRC="js_example_4.js"></SCRIPT>
<TITLE>js_example_4a</TITLE>
</HEAD>
<BODY>
<CENTER>
<FORM>
Please click the button.<P>
<INPUT TYPE=BUTTON VALUE="OK" NAME=btOK onClick='doButtonClick()'>
</FORM>
</CENTER>
</BODY>
</HTML>
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<SCRIPT LANGUAGE="JavaScript" SRC="js_example_4.js"></SCRIPT>
<TITLE>js_example_4b</TITLE>
</HEAD>
<BODY>
<CENTER>
<FORM>
Please push the button.<P>
<INPUT TYPE=BUTTON VALUE="OK" NAME=btOK onClick='doButtonClick()'>
</FORM>
</CENTER>
</BODY>
</HTML>
function doButtonClick()
{
alert("Thank you.");
}
js_example_4.js
In the example above, the JavaScript function doButtonClick is defined in js_example_4.js, and this file is referenced in both js_example_4a.htm and js_example_4b.htm, which are basically the same as our previous example.
If you want to localize these three files into Japanese, you translate the messages in the three files and change the character set name to Shift_JIS in the meta http-equiv directive of the two HTML files. This is shown in js_example_5a.htm, js_example_5b.htm and js_example_5.js.
function doButtonClick()
{
alert("有難うございます。");
}
js_example_5.js
The problem arises when you add a new HTML file that uses a different character set and refers to the common JavaScript file. js_example_5c.htm is encoded in UTF-8, and when you press the OK button, you will get the following dialog:

Corrupt message displayed when you click OK in js_example_5c.htm
As you can see, the message in the dialog is now corrupt. This is because Internet Explorer read the JavaScript file as UTF-8, since UTF-8 is used for the main HTML file.
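The effect can be reproduced outside the browser. The sketch below is an illustration only: it assumes an environment that provides the standard TextDecoder (current browsers and Node.js), and it hard-codes the Shift_JIS bytes for the hiragana part of the message, ございます。, rather than reading js_example_5.js. The names shiftJisBytes and misread are made up for this example.

```javascript
// Shift_JIS bytes for "ございます。" (two bytes per character).
const shiftJisBytes = new Uint8Array([
  0x82, 0xb2, // ご
  0x82, 0xb4, // ざ
  0x82, 0xa2, // い
  0x82, 0xdc, // ま
  0x82, 0xb7, // す
  0x81, 0x42, // 。
]);

// Decode the bytes as if they were UTF-8, which is what happens when the
// script file inherits the character set of a UTF-8 page.
const misread = new TextDecoder("utf-8").decode(shiftJisBytes);

// Bytes that do not form valid UTF-8 become U+FFFD (the replacement
// character), so the original text cannot be recovered from the result.
console.log(misread.includes("\uFFFD")); // true
```

An older browser may show different garbage than a modern decoder does, but the cause is the same: the bytes are correct, and only the character set used to read them is wrong.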
To avoid this problem, there are three options that you can consider:
- Use the CHARSET option in the <SCRIPT> tag.
- Convert strings in the JavaScript file to Unicode-escape format.
- Use the same encoding for all the HTML files that refer to the same JavaScript file.
Using the CHARSET option in the <SCRIPT> tag
If you specify the character set of the JavaScript file using the CHARSET option in the SCRIPT tag as shown below, strings in the JavaScript file are interpreted correctly:
<SCRIPT LANGUAGE="JavaScript" CHARSET="Shift_JIS" SRC="js_example_5.js"></SCRIPT>
js_example_5d.htm contains this change, and if you click the "OK" button, the message is now displayed correctly:

However, not all browsers support the CHARSET option in the SCRIPT tag, so you need to check whether it is supported in the browser versions that your website supports. For example, it is supported in Netscape version 7.0, but not in version 6.2.1.
Converting strings in the JavaScript file to the Unicode-escape format
Another alternative is to encode the strings in the JavaScript file using the Unicode-escape format, which represents each non-ASCII character by its Unicode code point value with the \u prefix.
If you have the JDK installed on your machine, you can use the native2ascii tool to do this conversion:
native2ascii <file name>
The output is written to standard output, so you can redirect it to a file.
Shown below is js_example_6.js, which is the Unicode-escape format version of js_example_5.js:
function doButtonClick()
{
alert("\u6709\u96e3\u3046\u3054\u3056\u3044\u307e\u3059\u3002");
}
js_example_6.js
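The same conversion can also be sketched in JavaScript itself. The function below is a minimal illustration, not a replacement for native2ascii: the name toUnicodeEscapes is hypothetical, and it assumes the input has already been read into a JavaScript string with the correct character set.

```javascript
// Replace every non-ASCII UTF-16 code unit with its \uXXXX escape.
// ASCII characters (code units up to 0x7f) are copied unchanged.
function toUnicodeEscapes(text) {
  let result = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0x7f) {
      // Pad to four lowercase hex digits, e.g. 0x3046 becomes "\u3046".
      result += "\\u" + code.toString(16).padStart(4, "0");
    } else {
      result += text[i];
    }
  }
  return result;
}

console.log(toUnicodeEscapes('alert("有難うございます。");'));
```

Run on the alert line of js_example_5.js, this should reproduce the alert line of js_example_6.js. Because it works on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two escapes (a surrogate pair), which is the form a JavaScript string literal expects.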
js_example_6c.htm is the same as js_example_5c.htm except that it contains a link to js_example_6.js. You can see that this works if your browser supports the Unicode-escape format in a string.
However, some browsers do not support the Unicode-escape format, so you need to verify that this feature is supported in the browsers that your site supports.
Using the same encoding for all the HTML files that refer to the same JavaScript file
Since the two options discussed above are not supported by all browsers, this is the safest approach. One caveat is that it forces you to change the character set of all the relevant files at once instead of one file at a time, so careful scheduling is needed.
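A small check can help during such a changeover. The following is a rough sketch under stated assumptions: the function names declaredCharset and groupByCharset are hypothetical, the regular expression only handles a meta http-equiv directive written like the ones in this document, and the two page sources are shortened stand-ins rather than the real example files.

```javascript
// Extract the charset named in a page's meta directive,
// or null if the page does not declare one.
function declaredCharset(html) {
  const match = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Group page names by declared charset. More than one group means the
// pages that share a JavaScript file do not agree on an encoding.
function groupByCharset(pages) {
  const groups = new Map();
  for (const [name, html] of Object.entries(pages)) {
    const charset = declaredCharset(html);
    if (!groups.has(charset)) groups.set(charset, []);
    groups.get(charset).push(name);
  }
  return groups;
}

// Stand-in sources for two pages that refer to the same script file.
const pages = {
  "js_example_5a.htm":
    '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">',
  "js_example_5c.htm":
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">',
};

const groups = groupByCharset(pages);
if (groups.size > 1) {
  console.log("Mixed character sets:", [...groups.keys()].join(", "));
}
```

The check only compares what each page declares. It does not verify that a file is actually saved in the declared character set, so a conversion tool still has to be applied to each file.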