Java Internationalization: Converting to and from Unicode

Jakob Jenkov
Last update: 2014-06-23

Internally, Java keeps all strings in Unicode: a String is a sequence of UTF-16 code units. Since not all text received from users or the outside world is encoded as Unicode, your application may have to convert text from a non-Unicode character set to Unicode. Likewise, when the application outputs text, it may have to convert the internal Unicode representation to whatever character set the outside world needs.

Java has two different methods you can use to convert text to and from Unicode. These methods are:

  • The String class
  • The Reader and Writer classes and subclasses

I will explain both methods in the sections below.

Converting to and from Unicode Using the String Class

You can use the String class to convert a byte array to a String instance. You do so using the constructor of the String class. Here is an example:

byte[] bytes = new byte[10];

String str = new String(bytes, Charset.forName("UTF-8"));

System.out.println(str);

This example first creates a byte array. The byte array does not contain any sensible data (all 10 bytes are zero), but for the sake of the example, that does not matter. The example then creates a new String, passing the byte array and the character set of the text in the byte array as parameters to the constructor. The String constructor then decodes the bytes from that character set into Unicode.

You can convert the text of a String to bytes in another character set using the getBytes() method. Here is an example:

byte[] bytes = str.getBytes(Charset.forName("UTF-8"));
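To see both directions together, here is a minimal sketch that encodes a String to bytes and decodes it again. The class name CharsetRoundTrip, the helper roundTripUtf8() and the sample text are made up for illustration. The sketch uses the StandardCharsets constants (available since Java 7) instead of Charset.forName(); either works.

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Hypothetical helper: encodes the text as UTF-8 bytes and decodes them again.
    static String roundTripUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three Danish letters, written as escapes: U+00C6, U+00D8, U+00C5
        String original = "\u00C6\u00D8\u00C5";

        byte[] utf8Bytes   = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Each of these letters takes two bytes in UTF-8 but one byte in ISO-8859-1.
        System.out.println(utf8Bytes.length);   // prints 6
        System.out.println(latin1Bytes.length); // prints 3

        // Decoding with the same character set that was used for encoding
        // gives back the original text.
        System.out.println(original.equals(roundTripUtf8(original))); // prints true
    }
}
```

The same String produces different byte arrays depending on the character set, which is why the character set should always be passed explicitly rather than left to the platform default.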

You can also write Unicode characters directly in strings in the code by escaping them with \u followed by four hexadecimal digits. Here is an example:

// The Danish letters Æ Ø Å
String myString = "\u00C6\u00D8\u00C5";
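The compiler translates each \u escape into a single character before the string is built, so the string above has length 3. A small sketch that prints what the string contains is shown here; the class name UnicodeEscapes and the helper describe() are hypothetical names used only for this illustration.

```java
public class UnicodeEscapes {

    // Hypothetical helper: formats every UTF-16 code unit of the text
    // as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                result.append(' ');
            }
            result.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String myString = "\u00C6\u00D8\u00C5";

        System.out.println(myString.length());  // prints 3
        System.out.println(describe(myString)); // prints U+00C6 U+00D8 U+00C5
    }
}
```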

Converting to and from Unicode Using the Reader and Writer Classes

The Reader and Writer classes are stream-oriented classes that enable a Java application to read and write streams of characters. Both classes are explained in my Java IO tutorial. Go to Reader or Writer to read more.

Here is an example that uses an InputStreamReader to convert from a specific character set (UTF-8) to Unicode:

InputStream inputStream = new FileInputStream("c:\\data\\utf-8-text.txt");
Reader      reader      = new InputStreamReader(inputStream,
                                                Charset.forName("UTF-8"));

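int data = reader.read();
while(data != -1){
    // read() returns one decoded character (a UTF-16 code unit) as an int,
    // or -1 when the end of the stream is reached.
    char theChar = (char) data;
    // ... process theChar here ...
    data = reader.read();
}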

reader.close();

This example creates a FileInputStream and wraps it in an InputStreamReader. The InputStreamReader is told to interpret the bytes in the file as UTF-8 encoded text. This is done using the second constructor parameter of the InputStreamReader class.
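The character set passed to the InputStreamReader must match the one the bytes were written in. The following sketch illustrates this by decoding the same two bytes with two different character sets. It reads from a ByteArrayInputStream rather than a file so that it is self-contained, and the class name DecodeStream and the helper readAll() are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeStream {

    // Hypothetical helper: reads all characters from the stream,
    // decoding its bytes with the given character set.
    static String readAll(InputStream input, Charset charset) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (Reader reader = new InputStreamReader(input, charset)) {
            int data = reader.read();
            while (data != -1) {
                text.append((char) data);
                data = reader.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00C6 is the two bytes 0xC3 0x86.
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0x86 };

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 1 (the single character U+00C6)
        System.out.println(wrong.length()); // prints 2 (each byte became its own character)
    }
}
```

Decoding with the wrong character set does not throw an exception here; it silently produces different characters, which is how garbled text typically ends up in an application.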

Here is an example writing a stream of characters back out to UTF-8:

OutputStream outputStream = new FileOutputStream("c:\\data\\output.txt");
Writer       writer       = new OutputStreamWriter(outputStream,
                                                   Charset.forName("UTF-8"));

writer.write("Hello World");

writer.close();

This example creates an OutputStreamWriter which encodes the characters written to it as UTF-8 bytes before passing them on to the underlying OutputStream.
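To inspect the bytes an OutputStreamWriter actually produces, you can write to a ByteArrayOutputStream instead of a file. This is only a sketch: the class name EncodeStream and the helper encode() are hypothetical, and the ISO-8859-1 comparison is added purely for contrast.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeStream {

    // Hypothetical helper: writes the text through an OutputStreamWriter
    // and returns the bytes that reached the underlying stream.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(output, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point,
        // so all encoded bytes are present in the output stream.
        return output.toByteArray();
    }

    public static void main(String[] args) {
        // ASCII characters take one byte each in UTF-8.
        System.out.println(encode("Hello World", StandardCharsets.UTF_8).length); // prints 11

        // U+00C6 takes two bytes in UTF-8 but one byte in ISO-8859-1.
        System.out.println(encode("\u00C6", StandardCharsets.UTF_8).length);      // prints 2
        System.out.println(encode("\u00C6", StandardCharsets.ISO_8859_1).length); // prints 1
    }
}
```

Closing (or at least flushing) the writer matters: an OutputStreamWriter may buffer encoded bytes internally, so bytes can be missing from the output if the writer is never flushed.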
