Tech and Media Labs

Unicode

Jakob Jenkov
Last update: 2019-04-13

Unicode is a character encoding standard that can represent characters from most of the world's languages and writing systems. Each character is represented by a Unicode code point, an integer value that uniquely identifies the character. Code points can be stored using different encodings, such as UTF-8 or UTF-16. Each encoding specifies, according to its own scheme, how a character's code point is represented as one or more bytes.

Unicode Code Points

As mentioned earlier, each Unicode character is represented by a Unicode code point, which is an integer value. Code point values range from 0 to 10FFFF in hexadecimal (1,114,111 in decimal).

When referring to a Unicode code point in writing, we write U+ followed by the hexadecimal representation of the code point, padded with leading zeros to at least four digits. For instance, the uppercase character A is written as U+0041. This notation is only used when referring to code points in text, though.
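As a minimal sketch, the U+ notation can be produced and parsed with Python's built-in ord() and chr(). The helper names to_u_notation and from_u_notation are made up for this illustration and are not part of any library.

```python
def to_u_notation(ch: str) -> str:
    """Format a single character as U+ followed by at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Parse a string such as 'U+0041' back into the character it names."""
    return chr(int(text[2:], 16))


print(to_u_notation("A"))         # U+0041
print(from_u_notation("U+0041"))  # A
```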

At the byte level, Unicode characters are encoded differently. The uppercase character A does not take up 6 bytes (the 6 ASCII characters of U+0041) when encoded as bytes. The exact number of bytes used depends on the encoding, for example UTF-8 or UTF-16.
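To illustrate how the byte count depends on the encoding, here is a small sketch using Python's str.encode(). The function name encoded_lengths is hypothetical, and the big-endian codec variants are chosen only so that no byte order mark is added to the output.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes a character occupies in each encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}


print(encoded_lengths("A"))           # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_lengths("\u20ac"))      # {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_lengths("\U0001F600"))  # {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```

The same code point (here U+0041, U+20AC and U+1F600) ends up as a different number of bytes in each encoding.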

To create a text using Unicode characters, you use a sequence of Unicode code points. For instance, the sequence U+0041 U+0042 U+0043 makes up the text ABC.
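The sequence above can be turned into text in Python with chr(), and back into code points with ord(). This is only a sketch; the variable names are chosen for illustration.

```python
# The code points U+0041 U+0042 U+0043 as plain integers.
code_points = [0x0041, 0x0042, 0x0043]

# Build the text by converting each code point to its character.
text = "".join(chr(cp) for cp in code_points)
print(text)  # ABC

# Going the other way: recover the code points from the text.
round_trip = [ord(ch) for ch in text]
print(round_trip == code_points)  # True
```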

Special Characters

Unicode contains some special code points which do not represent textual characters. These non-textual code points are typically located in certain intervals of the Unicode value space. For instance:

Interval               Description
U+0000 - U+001F        Control characters
U+007F - U+009F        Control characters
U+D800 - U+DFFF        Surrogates (used in pairs by UTF-16)
U+E000 - U+F8FF        Private use area
U+F0000 - U+FFFFD      Private use area
U+100000 - U+10FFFD    Private use area
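These kinds of code points can also be looked up programmatically. The sketch below uses Python's standard unicodedata module, which reports the general category of a character: Cc for control characters, Cs for surrogates and Co for private use. The LABELS mapping and the describe function are hypothetical names introduced for this example.

```python
import unicodedata

# Hypothetical mapping from Unicode general category to a readable label.
LABELS = {
    "Cc": "control character",
    "Cs": "surrogate",
    "Co": "private use",
}


def describe(code_point: int) -> str:
    """Return a label for the code point based on its general category."""
    category = unicodedata.category(chr(code_point))
    return LABELS.get(category, f"other ({category})")


for cp in (0x0000, 0x007F, 0xD800, 0xE000, 0xF0000, 0x100000, 0x0041):
    print(f"U+{cp:04X}: {describe(cp)}")
```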

Some Unicode code points are not standalone characters. Instead, they are combined with the preceding Unicode character to alter it. For instance, a character with an accent over it could be represented by the code point of the base character followed by the code point of the accent. Rather than being displayed as two characters, these two code points are combined into the first character with the accent displayed on top of it.
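As a small illustration of combining code points, the sketch below builds an accented e from the letter e followed by U+0301 (COMBINING ACUTE ACCENT), using Python's standard unicodedata module. The choice of this particular letter and accent is just an example.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
combined = "e\u0301"
print(len(combined))  # 2

# A non-zero combining class marks U+0301 as a combining code point.
print(unicodedata.combining("\u0301") != 0)  # True

# NFC normalization merges the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", combined)
print(len(composed))           # 1
print(composed == "\u00e9")    # True
```

Note that the two forms compare as different strings unless they are normalized first, even though they are displayed identically.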

Private use areas have no characters assigned to them by the Unicode standard. You can use private use areas to assign characters in your own context (should you need to), but the meaning of these code points is only defined by agreement between the parties exchanging the text.

Unicode Planes

Unicode code points are divided into sections called Unicode planes. There are 17 planes, numbered from 0 to 10 in hexadecimal (0 to 16 in decimal). You can see which Unicode plane a given code point belongs to by writing the code point as 6 hexadecimal digits and looking at the first 2 digits. If a code point is too small to take up 6 hexadecimal digits, add zeros in front of the number until it is 6 digits long.

As an example, the Unicode code point U+0041 becomes U+000041, of which the first two hexadecimal digits are 00. Thus the code point U+0041 belongs to Unicode plane 0.

By the same logic, the code point U+10FFFF is already 6 hexadecimal digits long, and thus does not need any zeros added in front of it. The first two hexadecimal digits are 10, which is 16 in decimal. Thus, the code point U+10FFFF belongs to Unicode plane 16.
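The plane lookup described above can be sketched in Python in two equivalent ways: by padding to 6 hexadecimal digits and reading the first 2, or by shifting the code point 16 bits to the right. The function names plane_from_hex_digits and plane_of are hypothetical.

```python
def plane_from_hex_digits(code_point: int) -> int:
    """Pad to 6 hex digits and interpret the first 2 digits as the plane."""
    six_digits = f"{code_point:06X}"
    return int(six_digits[:2], 16)


def plane_of(code_point: int) -> int:
    """Same result via a bit shift: each plane spans 0x10000 code points."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    return code_point >> 16


print(plane_from_hex_digits(0x0041), plane_of(0x0041))      # 0 0
print(plane_from_hex_digits(0x10FFFF), plane_of(0x10FFFF))  # 16 16
```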

Non-character Code Points

The last 2 code points of each Unicode plane are non-characters. For instance, in plane 0 these are U+FFFE and U+FFFF.
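A minimal sketch of this rule in Python: the last 2 code points of a plane are the ones whose lowest 16 bits are FFFE or FFFF. The function name is hypothetical, and the check only covers the rule stated above. The Unicode standard also reserves the range U+FDD0 to U+FDEF as non-characters, which this sketch does not handle.

```python
def is_plane_final_noncharacter(code_point: int) -> bool:
    """True if the code point is one of the last 2 code points of its plane."""
    return (code_point & 0xFFFF) >= 0xFFFE


print(is_plane_final_noncharacter(0xFFFE))    # True
print(is_plane_final_noncharacter(0x10FFFF))  # True
print(is_plane_final_noncharacter(0x0041))    # False

# 17 planes with 2 such code points each.
count = sum(is_plane_final_noncharacter(cp) for cp in range(0x110000))
print(count)  # 34
```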

Copyright  Jenkov Aps