
Java in a global world – some internationalization pitfalls

For the second time in the past few years I had to get a Java-based web application ready for international use, which means supporting various international character sets. The first time I didn’t have much experience yet and the application was difficult to debug, so it was a tiresome and lengthy process of trial and error. The second time was easier because we had a better architecture and better tools, but some unexpected problems still showed up.

There were basically three aspects that made internationalization difficult:

  1. All levels of a system and application are affected (e.g. operating system, I/O operations, database, web tier, web services). If the settings don’t match at some point, the result may be wrong character conversion. Usually this means we have to look at database encoding, file system encoding and HTTP request/response encoding.
  2. A lot of the software in use (e.g. third-party components and libraries) relies on a default character encoding, which is usually ISO-8859-1 (Latin-1). Is this western ignorance?
  3. You don’t have the time to understand the complicated and often unclear issues surrounding character set encoding. If the world could just decide to switch to one global encoding, our lives would be much easier.
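To make the first point concrete, here is a minimal sketch of what a mismatch looks like: one layer writes UTF-8 bytes and the next layer reads them as ISO-8859-1. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are only for illustration.
class EncodingMismatchDemo {

    // Encodes the text as UTF-8 bytes, then decodes those bytes as ISO-8859-1.
    // This mimics one layer writing UTF-8 while the next layer assumes Latin-1.
    static String writeUtf8ReadLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as an escape so the
        // encoding of this source file does not matter.
        String original = "caf\u00e9";
        String garbled = writeUtf8ReadLatin1(original);
        System.out.println(original.length() + " chars in, " + garbled.length() + " chars out");
    }
}
```

The accented character takes two bytes in UTF-8, and each of those bytes is turned into its own Latin-1 character, so the string grows by one character and the accent is replaced by two unrelated symbols.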

The goal is to use only one character encoding within the Java application (in our case UTF-8 seems to be fine for the job), so we only have to handle different character sets at entry and exit points (web, files, …).

Let’s look at places which might need some attention:

Application servers

Start these with the correct JVM arguments.

Default file encoding
The default file encoding is used by InputStreamReader and OutputStreamWriter when no charset is passed explicitly. If it is not set, the encoding of the operating system is used, which can lead to unexpected results if your team works on different systems or if the deployment system differs from the development environment. Set it with -Dfile.encoding=UTF-8
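Independently of the JVM argument, code that you control can avoid the default altogether by always naming the charset. The following is a rough sketch under that assumption; the class and method names are hypothetical, and it uses in-memory streams so that it runs without touching the file system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are only for illustration.
class ExplicitCharsetIo {

    // Writes text through an OutputStreamWriter with an explicit charset,
    // so the result does not depend on the platform default.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads bytes through an InputStreamReader with an explicit charset.
    static String readUtf8(byte[] bytes) {
        StringBuilder result = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                result.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Shows what the JVM would have used if no charset had been given.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        String text = "Gr\u00fc\u00dfe"; // German greeting with u-umlaut and sharp s
        System.out.println("Round trip ok: " + readUtf8(writeUtf8(text)).equals(text));
    }
}
```

Third-party code that creates readers and writers without a charset is still governed by the default, which is why the JVM argument remains worth setting.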

Next, check and, if necessary, configure the default character encoding of the application server.

Resin
The default value is ISO-8859-1.
See: Specify the default character encoding for the environment.

Tomcat
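The default encoding of Tomcat 5 should be UTF-8, but verify this for your setup, since the Servlet specification falls back to ISO-8859-1 for requests that declare no charset. If needed, you can specify the encoding in $CATALINA_BASE/conf/web.xml or in your webapp’s own web.xml.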

WebSphere
Default character encoding is UTF-8. For more information see: Developing J2EE Global Applications : Character Encoding

Database

Switch the complete database, individual tables or individual columns to UTF-8. How this can be done differs per database system.

For some JDBC database drivers you have to specify the encoding explicitly; other drivers are smart enough to determine the database encoding automatically.

Oracle 10g
To review the current settings, run SELECT * FROM V$NLS_PARAMETERS;
NLS_CHARACTERSET and NLS_LENGTH_SEMANTICS are the interesting ones for us. Oracle recommends using the Unicode character set AL32UTF8 for all new system deployments.
If you don’t want to change the settings for the database, you can use the NCHAR, NVARCHAR2, and NCLOB datatypes instead. Their default encoding is AL16UTF16.
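NLS_LENGTH_SEMANTICS matters because, with byte semantics, a column declared as VARCHAR2(10) holds 10 bytes rather than 10 characters, and in UTF-8 many characters need more than one byte. The sketch below only illustrates that difference on the Java side; it does not talk to Oracle, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are only for illustration.
class Utf8Length {

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Zurich",        // plain ASCII
            "Z\u00fcrich",   // with u-umlaut
            "\u6771\u4eac"   // two CJK characters
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, " + utf8Bytes(s) + " bytes");
        }
    }
}
```

A value that fits a column by character count can therefore still be rejected when the column length is counted in bytes.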

Additional information:
Changing The Character Set In Oracle Applications
Character Semantics and Globalization

Spring Framework

Since Spring handles your requests, it needs some extra configuration:
Add filter to web.xml and Spring configuration
In this case the Spring framework does most of the work for you. Without such a framework you might have to do some conversion between different character encodings yourself.
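As an illustration of what "doing the conversion yourself" can look like, here is a sketch of a commonly used workaround for a request parameter that was sent as UTF-8 but decoded as ISO-8859-1. It is an assumption about a typical situation, not something Spring requires, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are only for illustration.
class ParameterRepair {

    // Turns the wrongly decoded string back into its original bytes
    // (ISO-8859-1 maps every byte to exactly one character, so nothing is lost)
    // and then decodes those bytes as UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What "cafe" with an acute accent looks like after the wrong decoding.
        String misdecoded = "caf\u00c3\u00a9";
        String repaired = latin1ToUtf8(misdecoded);
        System.out.println(misdecoded.length() + " chars before, " + repaired.length() + " chars after");
    }
}
```

This only works if the wrong decoding really was ISO-8859-1, and applying it to a string that was decoded correctly will damage it. Setting the request encoding before any parameter is read, which is what an encoding filter is for, is the cleaner fix.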

JavaServer Pages

Use: <%@ page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8" %>

To set the default page encoding used for all JSP files, use

<jsp-property-group>
(…)
<page-encoding>utf-8</page-encoding>
</jsp-property-group>

Additional information:
Setting Properties for Groups of JSP Pages

It is a good idea to add
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
to the HTML as well.

Templating Engines and layout frameworks

Sitemesh
By default Sitemesh uses ISO-8859-1. All JSP pages in use should declare UTF-8 as their encoding. If you have various decorators and includes, these must all use the same encoding.

Velocity
Velocity also uses ISO-8859-1 as its default. This was the big pitfall on my first internationalization project; I wasted a lot of time before I found out.
Velocity allows you to specify the character encoding of your template resources on a template-by-template basis. The output encoding is an application-specific setting and can be set in the runtime configuration with the following configuration key: output.encoding (it might be a good idea to set input.encoding as well).
More information: Velocity Developer Guide
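A small sketch of how these two keys could be collected in code. The key names are the ones mentioned above; the class name is made up, and handing the Properties object to Velocity at startup (for example through an init call that accepts Properties) is an assumption about your setup, so check the key names against the Velocity version you use.

```java
import java.util.Properties;

// Hypothetical helper class; the names are only for illustration.
class VelocityEncodingConfig {

    // Builds runtime properties that set both template input and
    // rendered output to UTF-8.
    static Properties utf8Properties() {
        Properties props = new Properties();
        props.setProperty("input.encoding", "UTF-8");
        props.setProperty("output.encoding", "UTF-8");
        return props;
    }

    public static void main(String[] args) {
        Properties props = utf8Properties();
        // Assumption: these properties are passed to Velocity during
        // initialization; that step is left out to keep the sketch JDK-only.
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```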

Freemarker
You can specify the charset of the template in the template itself, and the charset of the output with the setOutputEncoding(outputCharset) method of a Freemarker processing environment.

Resource bundles

Edit message bundles for non-western languages in UTF-8 and then convert the file to an ASCII format for Java. Call native2ascii with the -encoding option to state that the original file is UTF-8 encoded:
native2ascii -encoding UTF-8 messages_cn.txt messages_cn.properties
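To show what the conversion produces, here is a rough sketch of the same kind of escaping in plain Java: every character outside the ASCII range is replaced by a backslash, the letter u and four hex digits. It is an approximation for illustration, not the tool's actual implementation, and the class and method names are made up.

```java
// Hypothetical helper class; the names are only for illustration.
class UnicodeEscaper {

    // Leaves ASCII characters alone and replaces every other UTF-16 code unit
    // with a backslash, the letter u and its four-digit hex value.
    static String escapeNonAscii(String text) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                escaped.append(c);
            } else {
                escaped.append(String.format("\\u%04x", (int) c));
            }
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // A property line whose value consists of two Chinese characters.
        String line = "greeting=\u4f60\u597d";
        System.out.println(escapeNonAscii(line));
    }
}
```

The resulting file contains only ASCII characters, so it reads the same no matter which encoding is assumed when the properties are loaded.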

Additional information

Java World: Multibyte-character processing in J2EE
An in-depth look at Java’s character type
A tutorial on character code issues

1 Comment

  1. “If you don’t want to change the settings for the database, you can use the NCHAR, NVARCHAR2, and NCLOB datatypes instead. Their default encoding is AL16UTF16.”

    Didn’t you mean VARCHAR2 instead of NVARCHAR2 here?
