
Richard's technical notes

Friday, December 01, 2006

UTF-7


Character encoding is one of those things that needs careful attention all the time. My usual take is: "UTF-8 is the answer. What was the question?" UTF-8 has some handy properties, such as being compatible with ASCII, while being able to store any Unicode character. It's described nicely in What Is UTF-8 And Why Is It Important? and more generally in the Java Internationalization book.
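
To make those two properties concrete, here is a small sketch using only the standard library. The class name Utf8Demo and the helper utf8 are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Demo {
    // Encode a string as UTF-8 bytes.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "plain ASCII";
        // ASCII text has exactly the same bytes in UTF-8 and US-ASCII.
        System.out.println(Arrays.equals(utf8(ascii),
                ascii.getBytes(StandardCharsets.US_ASCII)));

        // A non-ASCII character (the euro sign, U+20AC) takes several bytes
        // and survives a round trip through UTF-8.
        byte[] euro = utf8("\u20AC");
        System.out.println(euro.length);
        System.out.println(new String(euro, StandardCharsets.UTF_8).equals("\u20AC"));
    }
}
```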

But other encodings are available, and I recently had to deal with UTF-7. This was a new one on me, but if you have to deal with low-level email encoding, you'll find it's surprisingly popular. The big thing about UTF-7 is that content can be included in SMTP without having to wrap it in base64 or some other transfer encoding.
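
To show why UTF-7 text can travel over a 7-bit channel, here is a rough, encode-only sketch of the scheme as I understand it from RFC 2152. It is my own illustration, not the Zimbra or JCharSet code. The names Utf7Sketch, encode and flush are made up, and the sketch is deliberately conservative: it base64-encodes some punctuation that a real encoder could leave alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Utf7Sketch {
    // Characters written as themselves: RFC 2152 "Set D" plus space, tab, CR, LF.
    private static final String DIRECT =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'(),-./:? \t\r\n";

    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder(); // pending characters that need encoding
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '+') {
                flush(run, out);
                out.append("+-");                // a literal plus sign is escaped as "+-"
            } else if (DIRECT.indexOf(c) >= 0) {
                flush(run, out);
                out.append(c);
            } else {
                run.append(c);
            }
        }
        flush(run, out);
        return out.toString();
    }

    // Emit a pending run as '+', unpadded base64 of its UTF-16BE bytes, then '-'.
    private static void flush(StringBuilder run, StringBuilder out) {
        if (run.length() == 0) {
            return;
        }
        byte[] utf16 = run.toString().getBytes(StandardCharsets.UTF_16BE);
        out.append('+')
           .append(Base64.getEncoder().withoutPadding().encodeToString(utf16))
           .append('-');
        run.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(encode("\u00A31")); // pound sign followed by 1: +AKM-1
    }
}
```

Everything the sketch emits is plain 7-bit ASCII, which is the whole point: no extra transfer encoding is needed on top.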

UTF-7 is supposed to be dead because, despite its plus points, we now have things like 8BITMIME. In practice, though, it's not at all dead, and for the Java programmer this is a problem because the platform does not support UTF-7.

No panic: the platform has an extension mechanism in the CharsetProvider class. Go find an open source library and drop it in. There are a couple: Zimbra have a Mozilla-licensed implementation (thanks to Andy for spotting this), and there's a GPL version called JCharSet that is also available under a commercial license. I went with the Zimbra one:


$ cvs -z6 -d :pserver:anonymous:@cvs.zimbra.com:/usr/local/cvsroot co main
$ cd main/ZimbraCharset
$ ant

... which produces build/zimbra-charset.jar, which looks like this:

[Image: the contents of the Zimbra Charset JAR]

Drop the jar into your project, and you're away with code like: Charset utf7 = Charset.forName("UTF-7");

However, drop that library and code into a web application, and it won't work. Didn't for me, with Tomcat. You might be able to drop the library into the JRE/lib/ext folder or similar places, but I don't like that option. Instead I thought I'd try to understand what's going on. The short answer is: I don't know.

The documentation says:

"Charset providers may be installed in an instance of the Java platform as extensions, that is, jar files placed into any of the usual extension directories. Providers may also be made available by adding them to the applet or application class path or by some other platform-specific means. Charset providers are looked up via the current thread's context class loader."

I'd have thought that WEB-INF/lib is a good extension directory, but I suspect the phrase "extension directory" is being used in a specific technical sense, so perhaps WEB-INF/lib doesn't apply. Maybe it's a security issue, or perhaps it's a bug.

So for now, I've used an explicit request along the lines of:

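// Needs java.nio.charset.Charset, plus ZimbraCharsetProvider from the Zimbra jar.
final Charset charset;
if (!Charset.isSupported("UTF-7"))
{
    // The platform can't find UTF-7, so ask the Zimbra provider directly.
    ZimbraCharsetProvider zcpi = new ZimbraCharsetProvider();
    charset = zcpi.charsetForName("UTF-7");
}
else
{
    charset = Charset.forName("UTF-7");
}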

It's a pain, but at least that works.
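
A more general variation on the same workaround is to ask ServiceLoader for any CharsetProvider visible to the thread's context class loader, which in a web container is typically the web application's own loader. This is only a sketch: it assumes the jar registers its provider in META-INF/services/java.nio.charset.spi.CharsetProvider, which I haven't verified for every library, and the names CharsetLookup and lookup are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;
import java.nio.charset.spi.CharsetProvider;
import java.util.ServiceLoader;

final class CharsetLookup {
    // Try the platform first, then any providers the context class loader can see.
    static Charset lookup(String name) {
        if (Charset.isSupported(name)) {
            return Charset.forName(name);
        }
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (CharsetProvider provider : ServiceLoader.load(CharsetProvider.class, loader)) {
            Charset found = provider.charsetForName(name); // null if this provider lacks it
            if (found != null) {
                return found;
            }
        }
        throw new UnsupportedCharsetException(name);
    }

    public static void main(String[] args) {
        System.out.println(lookup("UTF-8"));
        try {
            System.out.println(lookup("UTF-7"));
        } catch (UnsupportedCharsetException e) {
            System.out.println("No UTF-7 provider visible: " + e.getCharsetName());
        }
    }
}
```

The advantage over the explicit version is that the calling code no longer names a particular library, so swapping one UTF-7 jar for another should need no code change, provided the replacement registers itself the same way.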

Given that Java is going GPL, I presume I can take the JCharSet UTF-7 implementation and submit it as a bug fix and get this sorted once and for all in the platform...

5 Comments:

Anonymous Dan Karp said...

Tomcat's class loader order requires that you place your extra charset providers in a particular place so that they're loaded in time for your app to see them. If you stick charset.jar in tomcat/common/endorsed, you'll be golden.

9:44 PM  
Blogger Richard said...

Dan - many thanks for that comment. I suspect it'll be of great use to anyone stumbling on this entry. I guess I should just clarify that my concern is that if I distribute an application (WAR file) that requires and supplies a UTF-7 implementation, it seems "smelly" that I should need to mess with the container's classpath (if you see what I mean). Especially so in a shared Tomcat environment. That might just be me having too high an expectation of what can be reasonably achieved.

But yes... thank you for clarifying the details for Tomcat: that is appreciated.

10:03 PM  
Blogger Richard said...

After a quick search it seems you cannot do what you want.

The following explains several scenarios and how you might package them.

http://java.sun.com/j2ee/verified/packaging.html

10:03 PM  
Anonymous Jaap Beetstra said...

Just wanted to point out another implementation:
http://sourceforge.net/projects/jutf7/

It has an MIT license, so it's easily used in all kinds of projects, and is pre-packaged as opposed to the zimbra one.

11:57 AM  
Blogger Iliyana Angelova said...

Thank you very much for this post. It was very helpful for me.

Regards

12:41 PM  
