

UTF-8

Jakob Jenkov
Last update: 2019-04-18

UTF-8 is a byte encoding used to encode Unicode characters. UTF-8 uses 1, 2, 3 or 4 bytes to represent a Unicode character. Remember, a Unicode character is represented by a Unicode code point. Thus, UTF-8 uses 1, 2, 3 or 4 bytes to represent a Unicode code point.

UTF-8 is the most commonly used textual encoding on the web. Web browsers understand UTF-8. Many programming languages also allow you to use UTF-8 in source code, and can import and export UTF-8 text easily. Several textual data formats and markup languages are commonly encoded in UTF-8, for instance JSON, XML, HTML, CSS and SVG.

UTF-8 Marker Bits and Code Point Bits

When translating a Unicode code point to one or more UTF-8 encoded bytes, each of these bytes is composed of marker bits and code point bits. The marker bits tell how to interpret the given byte. The code point bits are used to represent the value of the code point. In the following sections the marker bits are written using 0's and 1's, and the code point bits are written using the characters Z, Y, X, W and V. Each character represents a single bit.

Unicode Code Point Intervals Used in UTF-8

For Unicode code points in the hexadecimal interval U+0000 to U+007F, UTF-8 uses a single byte to represent the character. The code points in this interval represent the same characters as the ASCII characters, and use the same integer values (code points) to represent them. In binary digits, the single byte representing a code point in this interval looks like this:

0ZZZZZZZ

The marker bit has the value 0. The bits representing the code point value are marked with Z.

For Unicode code points in the interval U+0080 to U+07FF, UTF-8 uses two bytes to represent the character. In binary digits, the two bytes representing a code point in this interval look like this:

110YYYYY 10ZZZZZZ

The marker bits are the 110 and 10 bits of the two bytes. The Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.

For Unicode code points in the interval U+0800 to U+FFFF, UTF-8 uses three bytes to represent the character. In binary digits, the three bytes representing a code point in this interval look like this:

1110XXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 1110 and 10 bits of the three bytes. The X, Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.

For Unicode code points in the interval U+10000 to U+10FFFF, UTF-8 uses four bytes to represent the character. In binary digits, the four bytes representing a code point in this interval look like this:

11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 11110 and 10 bits of the four bytes. The bits named V and W mark the code point plane the character is from. The rest of the bits marked with X, Y and Z represent the rest of the code point. The first byte (most significant byte) is the byte on the left.
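The four byte patterns above can be verified with Java's built-in UTF-8 support. The following sketch (the class name and sample code points are my own, not part of the tutorial) encodes one code point from each interval and prints the resulting bytes in binary, so the marker bits are visible:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Patterns {
    public static void main(String[] args) {
        // One sample code point from each interval:
        // U+0041 (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes).
        int[] codePoints = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : codePoints) {
            byte[] bytes = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                // Print each byte as 8 binary digits so the marker bits
                // (0, 110, 1110, 11110 and 10) stand out on the left.
                sb.append(String.format("%8s",
                        Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
                sb.append(' ');
            }
            System.out.printf("U+%04X -> %s%n", cp, sb.toString().trim());
        }
    }
}
```

For example, U+20AC (the euro sign) prints as 11100010 10000010 10101100, matching the three-byte 1110XXXX 10YYYYYY 10ZZZZZZ pattern.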

Reading UTF-8

When reading UTF-8 encoded bytes into characters, you need to figure out if a given character (code point) is represented by 1, 2, 3 or 4 bytes. You do so by looking at the bit pattern of the first byte.

If the first byte has the bit pattern 0ZZZZZZZ (most significant bit is 0) then the character code point is represented only by this byte.

If the first byte has the bit pattern 110YYYYY (3 most significant bits are 110) then the character code point is represented by two bytes.

If the first byte has the bit pattern 1110XXXX (4 most significant bits are 1110) then the character code point is represented by three bytes.

If the first byte has the bit pattern 11110VVV (5 most significant bits are 11110) then the character code point is represented by four bytes.
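In code, this first-byte check is a handful of bit mask comparisons. A minimal sketch (the class and method names are my own):

```java
public class Utf8Length {

    // Returns how many bytes the UTF-8 sequence starting with this
    // byte occupies, determined from its marker bits alone.
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0b1000_0000) == 0b0000_0000) return 1; // 0ZZZZZZZ
        if ((b & 0b1110_0000) == 0b1100_0000) return 2; // 110YYYYY
        if ((b & 0b1111_0000) == 0b1110_0000) return 3; // 1110XXXX
        if ((b & 0b1111_1000) == 0b1111_0000) return 4; // 11110VVV
        throw new IllegalArgumentException(
                "Continuation byte (10ZZZZZZ) or invalid first byte");
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0b0100_0001)); // 1
        System.out.println(sequenceLength((byte) 0b1110_0010)); // 3
    }
}
```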

Once you know how many bytes are used to represent the given character code point, read all the actual code point carrying bits (the bits marked with V, W, X, Y and Z) into a single 32-bit data type (e.g. a Java int). The bits then make up the integer value of the code point. Here is how a 32-bit data type looks after reading a 4-byte UTF-8 character into it:

00000000 000VVVWW XXXXYYYY YYZZZZZZ

Notice how all the marker bits (the most significant bits with the patterns 11110 and 10) have been removed from all of the 4 bytes, before the remaining bits (the bits marked with V, W, X, Y and Z) are copied into the 32-bit data type.
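A decoder following these steps can be sketched as below (the class and method names are my own, and error handling for truncated or overlong sequences is omitted for brevity):

```java
public class Utf8Decode {

    // Strips the marker bits from each byte and shifts the remaining
    // code point bits (V, W, X, Y, Z) together into one int.
    static int decodeCodePoint(byte[] bytes) {
        int first = bytes[0] & 0xFF;
        int cp;
        switch (bytes.length) {
            case 1:  cp = first & 0b0111_1111; break; // 0ZZZZZZZ
            case 2:  cp = first & 0b0001_1111; break; // 110YYYYY
            case 3:  cp = first & 0b0000_1111; break; // 1110XXXX
            default: cp = first & 0b0000_0111; break; // 11110VVV
        }
        for (int i = 1; i < bytes.length; i++) {
            // Each continuation byte 10ZZZZZZ contributes 6 bits.
            cp = (cp << 6) | (bytes[i] & 0b0011_1111);
        }
        return cp;
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.printf("U+%04X%n", decodeCodePoint(euro)); // U+20AC
    }
}
```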

Writing UTF-8

When writing UTF-8 text you need to translate Unicode code points into UTF-8 encoded bytes. First, you must figure out how many bytes you need to represent the given code point. I have explained the code point value intervals at the top of this UTF-8 tutorial, so I will not repeat them here.

Second, you need to translate the bits representing the code point into the corresponding UTF-8 bytes. Once you know how many bytes are needed to represent the code point, you also know what bit pattern of marker bits and code point bits you need to use. Simply create the needed number of bytes with marker bits, and copy the correct code point bits into each of the bytes, and you are done.

Here is an example of translating a code point that requires 4 bytes in UTF-8. The code point has the abstract value (as bit pattern):

00000000 000VVVWW XXXXYYYY YYZZZZZZ

The corresponding 4 UTF-8 bytes will look like this:

11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ
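Putting the steps together, an encoder is a sequence of shifts and masks that place the code point bits behind the correct marker bits. A sketch (the class and method names are my own; surrogate code points are not rejected, to keep the example short):

```java
public class Utf8Encode {

    // Translates one code point into 1-4 UTF-8 bytes by combining
    // the marker bits with the shifted code point bits.
    static byte[] encodeCodePoint(int cp) {
        if (cp <= 0x7F) {                  // 0ZZZZZZZ
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {          // 110YYYYY 10ZZZZZZ
            return new byte[] {
                (byte) (0b1100_0000 | (cp >> 6)),
                (byte) (0b1000_0000 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {         // 1110XXXX 10YYYYYY 10ZZZZZZ
            return new byte[] {
                (byte) (0b1110_0000 | (cp >> 12)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)),
                (byte) (0b1000_0000 | (cp & 0x3F)) };
        } else {                           // 11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ
            return new byte[] {
                (byte) (0b1111_0000 | (cp >> 18)),
                (byte) (0b1000_0000 | ((cp >> 12) & 0x3F)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)),
                (byte) (0b1000_0000 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // U+1F600 encodes to the bytes F0 9F 98 80.
        for (byte b : encodeCodePoint(0x1F600)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```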

Searching Forwards in UTF-8

Searching forwards in UTF-8 is reasonably straightforward. You decode one character at a time, and compare it to the character you are searching for. No big surprise here.

Searching Backwards in UTF-8

The UTF-8 encoding has the nice side effect that you can search backwards in UTF-8 encoded bytes. You can see from each byte if it is the beginning of a character or not by looking at the marker bits. The following marker bit patterns all imply that the byte is the beginning of a character:

0          Beginning of 1 byte character (also an ASCII character)
110        Beginning of 2 byte character
1110       Beginning of 3 byte character
11110      Beginning of 4 byte character

The following marker bit pattern implies that the byte is not the first byte of a UTF-8 character:

10         Second, third or fourth byte of a UTF-8 character

Notice how you can always see from the marker bit pattern whether a byte is the first byte of a character, or a second / third / fourth byte. Just keep searching backwards until you find the beginning of the character, then go forward and decode it, and check if it is the character you are looking for.
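The backward scan then only needs to skip continuation bytes. A minimal sketch (the class and method names are my own):

```java
public class Utf8Backwards {

    // Given an index into UTF-8 encoded bytes, returns the index of the
    // first byte of the character preceding that position, by stepping
    // backwards over continuation bytes (10ZZZZZZ).
    static int previousCharStart(byte[] utf8, int index) {
        int i = index - 1;
        while (i > 0 && (utf8[i] & 0b1100_0000) == 0b1000_0000) {
            i--;
        }
        return i;
    }

    public static void main(String[] args) {
        // The string "A" + euro sign encoded as UTF-8: 41 E2 82 AC
        byte[] bytes = {0x41, (byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.println(previousCharStart(bytes, 4)); // 1 (start of the euro sign)
        System.out.println(previousCharStart(bytes, 1)); // 0 (start of 'A')
    }
}
```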
