JavaScript and character set

by Hideyuki Inada, Capitola Computing Inc.

 

JavaScript is becoming an essential part of the web as readers expect more and more interactivity.  As your web site can be accessed from all over the world, it is important that internationalization issues are taken into account when you write your JavaScript code.

This document discusses the character set of the JavaScript file.

 

Character set of the JavaScript file

It is unlikely that you would use non-ASCII characters for your function names or variable names, so most likely the only things you would want to translate into other languages are the messages. If your JavaScript contains translatable messages, you have to consider the character set of your JavaScript file.  As you know, there are two ways to place your JavaScript so that it can be invoked from your HTML files.  One way is to include your JavaScript in the HTML file itself within the <SCRIPT LANGUAGE="JavaScript"> tag.  Another way is to write a separate standalone JavaScript file and refer to the file using the <SCRIPT SRC> tag (for example, <SCRIPT SRC="./common.js"></SCRIPT>).

 

Including JavaScript in your HTML

In this case, string literals in your script are regarded as being in the same character set as the rest of the HTML file.  Let's take a look at an HTML file with one line of JavaScript code:

 

<HTML>

<HEAD>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<SCRIPT LANGUAGE="JavaScript">

function doButtonClick()

{

      alert("Thank you.");

}

</SCRIPT>

<TITLE>js_example_1</TITLE>

</HEAD>

<BODY>

<CENTER>

<FORM>

Please click the button.<P>

<INPUT TYPE=BUTTON VALUE="OK" NAME=btOK onClick='doButtonClick()'>

</FORM>

</CENTER>

</BODY>

</HTML>

js_example_1.htm

 

In this case, "Thank you.", "Please click the button." and "OK" will be regarded as encoded in ISO-8859-1 which is specified in the meta http-equiv directive.

If you want to translate this into Japanese, you can translate those three strings and set the charset to "Shift_JIS".

 

 

<HTML>

<HEAD>

<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

<SCRIPT LANGUAGE="JavaScript">

function doButtonClick()

{

      alert("有難うございます。");

}

</SCRIPT>

<TITLE>js_example_2</TITLE>

</HEAD>

<BODY>

<CENTER>

<FORM>

ボタンをクリックしてください。<P>

<INPUT TYPE=BUTTON VALUE="了解" NAME=btOK onClick='doButtonClick()'>

</FORM>

</CENTER>

</BODY>

</HTML>

js_example_2.htm

Note: Normally, you don't translate "OK" to Japanese, but in this example, it is translated for  illustration purposes.  If you open the page with Internet Explorer, the following will be displayed:

 

When you click the button, the following dialog is displayed with the "Thank you" message in Japanese:

 

Please note that all three elements (the string in the HTML body, the string on the button and the string in the JavaScript) are displayed correctly.  If you want to change the character set of your HTML file, you can simply apply a code conversion tool to the entire HTML file and specify the new character set in the meta http-equiv directive.

For example, js_example_3.htm is the same as js_example_2.htm except that it uses UTF-8.  (Sample source code for a character set conversion tool is listed at http://www.capitolacomputing.com/intl_java_charset.htm.)

 

This sounds easy to implement, but if you generate your web page dynamically from a database or server-side scripting, you may run into a problem.  For example, if you are using one database to store the JavaScript and another database to store the HTML file without the JavaScript portion, you have to make sure that the query results are in the same encoding when you fetch data from those two databases, to avoid character corruption. You can imagine a character corruption problem where you forgot to change the output encoding for the JavaScript portion from ISO-8859-1 to Shift_JIS even though you did this for the HTML portion.
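One way to guard against this kind of mismatch is to keep the encoding name together with each fragment and refuse to combine fragments whose encodings differ.  The following is only a minimal sketch of that idea: the fragment objects, the <!--SCRIPT--> placeholder and the function names sameCharset and assemblePage are all hypothetical and do not come from any particular database or server-side API.

```javascript
// Hypothetical fragment shape: { charset: "Shift_JIS", text: "..." }.
// Charset names are compared case-insensitively, so "shift_jis"
// is treated the same as "Shift_JIS".
function sameCharset(a, b) {
  return a.toLowerCase() === b.toLowerCase();
}

// Combine an HTML fragment and a script fragment only when both were
// fetched in the same encoding. Otherwise fail loudly instead of
// sending corrupt characters to the browser.
function assemblePage(htmlPart, scriptPart) {
  if (!sameCharset(htmlPart.charset, scriptPart.charset)) {
    throw new Error("Encoding mismatch: HTML is " + htmlPart.charset +
                    " but script is " + scriptPart.charset);
  }
  var block = "<SCRIPT LANGUAGE=\"JavaScript\">\n" +
              scriptPart.text + "\n</SCRIPT>";
  // "<!--SCRIPT-->" is an assumed placeholder marking where the script goes.
  // split/join is used so that "$" in the script text is copied literally.
  return {
    charset: htmlPart.charset,
    text: htmlPart.text.split("<!--SCRIPT-->").join(block)
  };
}
```

With this check in place, the mistake described above (HTML fetched as Shift_JIS, JavaScript still fetched as ISO-8859-1) stops the page from being generated rather than producing a page with corrupt messages.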

 

As a footnote, if you are running an English version of the browser, you may get the following dialog box when you click the OK button in js_example_2.htm and js_example_3.htm.

 

This is because the display of Japanese characters is not supported in the non-Japanese version of the browser that you use.  If you try it on the Japanese version of the browser, it should work.

 

Referring to a separate JavaScript file with the <SCRIPT SRC> tag

If the same function is used in many HTML files, it is a better idea to store the functions in a single JavaScript file and refer to the file using the <SCRIPT SRC> tag.

 

<HTML>

<HEAD>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<SCRIPT LANGUAGE="JavaScript" SRC="js_example_4.js"></SCRIPT>

<TITLE>js_example_4a</TITLE>

</HEAD>

<BODY>

<CENTER>

<FORM>

Please click the button.<P>

<INPUT TYPE=BUTTON VALUE="OK" NAME=btOK onClick='doButtonClick()'>

</FORM>

</CENTER>

</BODY>

</HTML>

js_example_4a.htm

 

<HTML>

<HEAD>

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

<SCRIPT LANGUAGE="JavaScript" SRC="js_example_4.js"></SCRIPT>

<TITLE>js_example_4b</TITLE>

</HEAD>

<BODY>

<CENTER>

<FORM>

Please push the button.<P>

<INPUT TYPE=BUTTON VALUE="OK" NAME=btOK onClick='doButtonClick()'>

</FORM>

</CENTER>

</BODY>

</HTML>

js_example_4b.htm

 

function doButtonClick()

{

      alert("Thank you.");

}

js_example_4.js

 

In the example above, the JavaScript function doButtonClick is defined in js_example_4.js and this file is referenced in both js_example_4a.htm and js_example_4b.htm which are basically the same as our previous example.

If you want to localize these three files to Japanese, you translate the messages in the three files and change the character set names to Shift_JIS in the meta http-equiv directive of the two HTML files.  This is shown in js_example_5a.htm, js_example_5b.htm and js_example_5.js. 

 

function doButtonClick()

{

      alert("有難うございます。");

}

js_example_5.js

 

The problem arises when you add a new HTML file that uses a different character set and also refers to the common JavaScript file.  js_example_5c.htm is encoded in UTF-8, and when you press the OK button, you will get the following dialog:

Corrupt message displayed when you click OK in js_example_5c.htm

 

As you can see, the message on the dialog is now corrupt.  This is because Internet Explorer used UTF-8 in reading the JavaScript file since UTF-8 is used for the main HTML file. 
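The same kind of corruption can be reproduced outside the browser by decoding bytes with an encoding other than the one they were written in.  The sketch below assumes Node.js, whose Buffer API is not part of the examples in this document.  Because Buffer has no built-in Shift_JIS support, it uses a different pair of encodings for illustration: UTF-8 bytes misread as ISO-8859-1 ("latin1" in Node.js).

```javascript
// Assumes Node.js; Buffer is a Node.js API, not a browser one.
var original = "\u3042";                    // the Japanese character "あ"
var bytes = Buffer.from(original, "utf8");  // three bytes: e3 81 82

// Reading the bytes with the encoding they were written in recovers the text.
var rightlyRead = bytes.toString("utf8");

// Reading the same bytes as ISO-8859-1 turns each byte into its own
// character, so one Japanese character becomes three unrelated ones.
var wronglyRead = bytes.toString("latin1");

console.log(rightlyRead === original);  // true
console.log(wronglyRead.length);        // 3
```

The bytes themselves never change.  Only the assumption about their encoding changes, which is what happens when the browser applies the character set of the HTML file to the JavaScript file.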

To avoid this problem, there are three options that you can consider:

•  Use the CHARSET option in the <SCRIPT> tag.

•  Convert strings in the JavaScript file to Unicode-escape format.

•  Use the same encoding for all the HTML files that refer to the same JavaScript file.

 

Using the CHARSET option in the <SCRIPT> tag

If you specify the character set for the JavaScript file using the CHARSET option in the SCRIPT tag as shown below, strings in the JavaScript file can be interpreted correctly:

 

<SCRIPT LANGUAGE="JavaScript" CHARSET="Shift_JIS" SRC="js_example_5.js"></SCRIPT>

 

js_example_5d.htm contains this change, and if you click the "OK" button, now the message is displayed correctly:

 

 

However, not all the browsers support the CHARSET option in the SCRIPT tag, and you may need to check to see if this is supported in the versions of browsers that your website supports.  For example, this is supported in Netscape version 7.0, but not in version 6.2.1.

 

Converting strings in the JavaScript file to the Unicode-escape format

Another alternative is to encode the strings in the JavaScript file using the Unicode-escape format.  The Unicode-escape format is to use the Unicode code point value with the \u prefix for non-ASCII characters.

If you have the JDK installed on your machine, you can use the native2ascii tool to do this conversion:

 

native2ascii <file name>

 

The output is written to standard output, so you can redirect it to a file.

Shown below is js_example_6.js which is the Unicode-escape format version of js_example_5.js:

 

function doButtonClick()

{

      alert("\u6709\u96e3\u3046\u3054\u3056\u3044\u307e\u3059\u3002");

}

js_example_6.js
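The conversion itself is simple enough to sketch in JavaScript.  The function below is an illustration only and not a replacement for native2ascii; the name toUnicodeEscape is hypothetical.  It assumes the text has already been read into a JavaScript string correctly, and it escapes each non-ASCII UTF-16 code unit, so a character outside the Basic Multilingual Plane becomes two \u escapes.

```javascript
// Hypothetical helper: replace every non-ASCII UTF-16 code unit in
// text with its \uXXXX escape and leave ASCII characters unchanged.
function toUnicodeEscape(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code < 0x80) {
      out += text.charAt(i);
    } else {
      var hex = code.toString(16);
      // Pad to four hexadecimal digits, as the \u format requires.
      while (hex.length < 4) {
        hex = "0" + hex;
      }
      out += "\\u" + hex;
    }
  }
  return out;
}
```

Applied to the message in js_example_5.js, this produces the same sequence of escapes that appears in js_example_6.js.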

 

js_example_6c.htm is the same as js_example_5c.htm except that it refers to js_example_6.js.  You can see that this works if your browser supports the Unicode-escape format in a string.

However, some browsers do not support the Unicode-escape format, so you need to verify if this feature is supported in the browsers that your site supports.

 

Using the same encoding for all the HTML files that refer to the same JavaScript file

Since the two options discussed above are not supported by all the browsers, this is the safest approach. One caveat is that this forces you to change the character set of all the relevant files at once instead of changing the file one by one, so careful scheduling will be needed.
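Before a change like this is rolled out, it can help to check that every HTML file sharing a JavaScript file declares the same character set.  The following is a rough sketch under stated assumptions: the function names extractCharset and findMismatchedPages and the shape of the pages array are hypothetical, and the check simply takes the first "charset=" found in the text, which is the meta http-equiv directive in the examples in this document.

```javascript
// Pull the charset name out of a directive such as
// <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">.
// Returns the name in lowercase, or null when no charset is declared.
function extractCharset(html) {
  var match = /charset\s*=\s*["']?([A-Za-z0-9_\-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// pages is a hypothetical array of { name: "...", html: "..." } objects.
// Returns the names of the pages whose charset differs from the first page's.
function findMismatchedPages(pages) {
  if (pages.length === 0) {
    return [];
  }
  var expected = extractCharset(pages[0].html);
  var mismatched = [];
  for (var i = 1; i < pages.length; i++) {
    if (extractCharset(pages[i].html) !== expected) {
      mismatched.push(pages[i].name);
    }
  }
  return mismatched;
}
```

For the files discussed earlier, passing js_example_5a.htm, js_example_5b.htm and js_example_5c.htm to findMismatchedPages would flag js_example_5c.htm, since it declares UTF-8 while the other two declare Shift_JIS.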