



ION Encoding

Jakob Jenkov
Last update: 2017-04-12

The IAP Object Notation (ION) is a binary data format which is flexible enough to encode a wide variety of data. The abbreviation of IAP Object Notation would be IAP-ON - which we have further shortened to ION. The name ION is simply easier to write and pronounce.

Please keep in mind that the ION encoding is not yet 100% finalized. The big decisions are final, such as how ION fields in general are encoded, but we are still not 100% fixed on the fine details. We believe we are around 99.9% there. We believe the current field types are stable now, but we might still extend ION with new field types.

The first version of ION (1.0) will only contain what we know for sure makes sense to keep in ION: the field types that we are actually using and which have significant functions. Everything else will be decided later, once we gain more experience with the current field type set.

Also, we haven't yet settled on the "extended types". We have some ideas for a few fields that could be encoded as extended types, but we have not yet analyzed these in detail.

Change Log

Here is a short log of changes to the ION encoding.

Element Count Now an Int64-Positive

On April 8th 2017 the element count of ION Array fields was changed, and required for ION Table fields too.

The element count of ION Array fields was an extended field type. Now that is changed to a mandatory Int64-Positive. That means, that the first field inside a non-null ION Array field must be an Int64-Positive field.

ION Table fields now also have a mandatory element count field as an Int64-Positive field. This field must also be the very first field inside the ION Table field, before any of the key fields.

Field Allocations Changed

On May 20th 2016 we made changes to the ION field encodings. We hope that this version will be the final encoding for ION v. 1.0. We have not changed the 6 encoding types (Normal, Short, Tiny, Extended Normal, Extended Short, Extended Tiny), but the allocation of field types to type codes has changed.

The field types "Copy" and "Complex Type ID Short" have been moved to extended field types. Furthermore, since they are not actually used by IAP Tools we have temporarily "suspended" them. They will most likely return in a later version of ION.

Array, Table and Object have new field type codes, but keep their encodings.

The field type "Tiny" has been renamed back to "Boolean" and should now only be used to contain boolean values. The encoding for the Boolean field type is still the Tiny encoding type (1 byte), not to be confused with the former Tiny field type.

Two core field type codes are now unused - reserved for future core field types that need compact encodings.

Future versions / extensions of the ION encoding should be fully backwards compatible with this version, meaning we expect no further changes to the current field types and their type code allocations.

Negative Integer Encoding Changed

On January 13th 2017 we changed the encoding of negative integers (Int64-Negative). Rather than being encoded as the absolute value of the negative integer value, negative integers are now encoded as the absolute value minus 1.

The reason for subtracting 1 in the encoding is to allow all negative numbers to be encoded. With the absolute-value-only encoding, the most negative 64 bit integer value, -2^63, would be encoded as 2^63, which does not fit in a standard signed long (64 bit) variable and would thus be hard to decode. With the absolute(val) - 1 encoding all N-byte negative values can be encoded within N bytes.

ION Encoding

ION uses a type-length-value (TLV) encoding. An ION encoded data structure consists of one or more "fields". Each field uses a TLV encoding, meaning each field starts with a field type followed by the length of the field value and finally the field value itself.

To avoid allocating a fixed number of bytes to represent the length of a field, ION actually has two length parts. The first length part tells how many bytes the length-of-value counter occupies. The second length part is the bytes making up the length-of-value counter. Thus, ION really uses a TLLV (type, length-of-length, length-of-value, value) encoding. This TLLV format enables ION to contain very large fields, while at the same time being able to encode small data types more compactly.

ION Fields

All ION data types are encoded as ION fields. Some ION fields contain binary encoded data (raw bytes, numbers, text etc.) and some ION fields contain other ION fields nested inside them.

Basic Encoding

The basic encoding for an ION field consists of:

  • 1 lead byte containing field type and number of length bytes (4 bits each)
  • 0..15 length bytes
  • 0..2^120 value bytes

As you can see, an ION field can contain values that are up to 2^120 bytes long. If you need to encode larger blocks of data than that, you would need to break it up into multiple fields.

This is the basic encoding of an ION field. As you will see later, fields can use slightly different encodings (variations of the above) to encode data more compactly.

The Lead Byte

The lead byte of an ION field contains the field type and the number of length bytes that follow the lead byte.

The field type takes up the top 4 bits of the lead byte (the 4 most significant bits). This gives a total of 16 different core field types. By combining fields of these 16 field types you can make pretty complex objects. One of these 16 core field types is an extension type so more types can be defined later, to add more field types to the ION field set. Extended field types are explained later.

The number of length bytes takes up the bottom 4 bits of the lead byte (the 4 least significant bits). This gives a number of length bytes from 0 to 15.
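As a minimal sketch of the lead byte layout described above, the following Java class splits a lead byte into its two 4-bit halves and packs them back together. The class and method names (`IonLeadByte`, `fieldType`, `lengthLength`, `pack`) are hypothetical and are not taken from IAP Tools.

```java
/** Hypothetical helper illustrating the two 4-bit halves of an ION lead byte. */
final class IonLeadByte {
    private IonLeadByte() {}

    /** Returns the field type code stored in the 4 most significant bits (0..15). */
    static int fieldType(int leadByte) {
        return (leadByte >> 4) & 0x0F;
    }

    /** Returns the value stored in the 4 least significant bits (0..15). */
    static int lengthLength(int leadByte) {
        return leadByte & 0x0F;
    }

    /** Packs a field type code and a 4-bit length value into one lead byte. */
    static byte pack(int fieldType, int lengthLength) {
        if (fieldType < 0 || fieldType > 15 || lengthLength < 0 || lengthLength > 15) {
            throw new IllegalArgumentException("both parts must fit in 4 bits");
        }
        return (byte) ((fieldType << 4) | lengthLength);
    }
}
```

For example, `IonLeadByte.pack(5, 2)` yields the byte `0x52`: field type 5 with 2 length bytes following.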

The Length Bytes

The length bytes make up a 0 to 15 byte long number. This number is encoded using network byte order, meaning the most significant byte comes first, and the least significant byte comes last.
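The byte order can be illustrated with a small decoding sketch. The helper below is hypothetical (not part of IAP Tools) and, as a simplification, only accepts up to 7 length bytes so that the result always fits in a non-negative Java `long`; a full implementation would have to handle up to 15 length bytes.

```java
/** Hypothetical helper that reads the big-endian length bytes of a field. */
final class IonLengthBytes {
    private IonLengthBytes() {}

    /**
     * Reads lengthLength bytes starting at offset and combines them,
     * most significant byte first, into a single number.
     */
    static long readLength(byte[] src, int offset, int lengthLength) {
        if (lengthLength < 0 || lengthLength > 7) {
            // Simplification of this sketch: ION itself allows 0..15 length bytes.
            throw new IllegalArgumentException("this sketch supports 0..7 length bytes");
        }
        long length = 0;
        for (int i = 0; i < lengthLength; i++) {
            length = (length << 8) | (src[offset + i] & 0xFF);
        }
        return length;
    }
}
```

With the two length bytes `0x01 0x00`, the method returns 256, because the first byte is the most significant one.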

The Value Bytes

After the length bytes comes the value bytes (if any). The order of the value bytes depends on what the bytes represent. Numbers are encoded into the value bytes using network byte order, meaning the most significant byte comes first, and the least significant byte comes last. If you write your own data types into the byte value, you can choose whatever byte order that makes sense for that data type.

Encoding Variations

As mentioned earlier, ION fields can be encoded using variations of the basic field encoding. In total there are 6 variations, when including the basic encoding in that number. The encodings are:

  • Normal
  • Short
  • Tiny
  • Extended Normal
  • Extended Short
  • Extended Tiny

Normal, Short and Tiny

The three first ION field encodings are illustrated here:

[Figure: ION field encodings illustrated.]

A normal field encoding consists of 1 lead byte, 0..15 length bytes and 0..2^120 value bytes.

A short field encoding consists of 1 lead byte and 0..15 value bytes.

A tiny field encoding consists of 1 lead byte only. The value of the field is contained in the lower 4 bits of the lead byte.

Extended Normal, Short and Tiny

Each of the three encodings mentioned above exists in an "extended" version, where the lead byte is followed by 1 or 2 type bytes (exactly how many type bytes can follow is not yet 100% defined).

The extended versions of the first three encodings are illustrated here:

[Figure: ION extended field categories illustrated.]

Null Values

A field which has the number of length bytes (4 least significant bits of lead byte) set to 0 is assumed to have a value of null. Such a field will have no length bytes, and no value. A field with a null value is thus only 1 byte long - the lead byte. All fields can assume the value null.

Primitive and Complex Fields

ION's many field types can be divided into two groups: Primitive fields and complex fields.

Primitive ION fields contain some kind of primitive data. This could be a boolean, byte, short, int, long, float, double, a byte sequence, UTF-8 string etc. Primitive fields can use any of the 6 field encodings.

Complex ION fields contain other ION fields inside them. Examples of complex fields are objects, tables and arrays. Additionally, the "bytes" field type is theoretically a primitive type because it contains raw bytes, but in practice you could nest serialized ION fields inside a bytes field too.

Since complex field types usually contain other fields inside them, their length is often longer than 15 bytes. Therefore complex field types only use the Normal and Extended Normal field encodings.

Core Field Types

The core field types are the 15 field types that use the field codes 0 to 14. These field types are encoded using either the Normal, Short or Tiny field encoding. The core field types thus have only a single lead byte specifying their type.

ION contains the following core field types:

Type | Code | Encoding | Description
Bytes | 0 | Normal | A sequence of raw bytes.
Boolean | 1 | Tiny | Can contain the value 0 (= null), 1 (= true) or 2 (= false).
Int64-Positive | 2 | Short | An up to 8 byte long positive (unsigned) integer.
Int64-Negative | 3 | Short | An up to 8 byte long negative integer.
Float | 4 | Short | Contains either a 32 bit or 64 bit floating point number.
UTF-8 | 5 | Normal | Contains a variable length sequence of UTF-8 encoded characters.
UTF-8-Short | 6 | Short | Contains a variable length sequence of UTF-8 encoded characters of maximally 15 bytes.
UTC-Date-Time | 7 | Short | Contains a date + optionally a time in UTC date + time (no time zone).
Reserved | 8 | - | Not yet assigned.
Reserved | 9 | - | Not yet assigned.
Array | 10 | Normal | A list of ION fields. Could be anything. The elements are not related to each other, like they are in objects and tables.
Table | 11 | Normal | A list of the exact same type of objects. The key fields (property names) of the objects are only included once, but all value fields of all properties of all objects are included in the table.
Object | 12 | Normal | A sequence of key and value fields making up an object with property names and property values.
Key | 13 | Normal | A key - e.g. a property name in an Object, or the key of a key,value pair in a hashtable.
Key-Short | 14 | Short | Like Key, but represented as a short field, meaning it can be used for all keys that are 15 bytes or less long.
Extended | 15 | * | Signals that this is an extended field, meaning the type of the field is read from the 1-2 type bytes following the lead byte. The encoding used (Normal, Short, Tiny) depends on the extended field type.

Each of these field types and their encodings will be explained in more detail later in this text.
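To show how the type code and the encoding work together, here is a hedged sketch that computes the total size of a core field (lead byte included) from the type codes and encodings listed in the table above. The class `IonCoreTypes` and its method are hypothetical. The sketch rejects reserved and extended type codes, and it only supports up to 7 length bytes so the result fits in a Java `long`.

```java
/** Hypothetical constants and size calculation for ION core field types. */
final class IonCoreTypes {
    static final int BYTES = 0, BOOLEAN = 1, INT64_POSITIVE = 2, INT64_NEGATIVE = 3,
            FLOAT = 4, UTF_8 = 5, UTF_8_SHORT = 6, UTC_DATE_TIME = 7,
            ARRAY = 10, TABLE = 11, OBJECT = 12, KEY = 13, KEY_SHORT = 14;

    private IonCoreTypes() {}

    /** Returns the total number of bytes of the core field that starts at offset. */
    static long fieldSize(byte[] src, int offset) {
        int lead = src[offset] & 0xFF;
        int type = lead >> 4;
        int low = lead & 0x0F;
        switch (type) {
            case BOOLEAN:
                // Tiny: the value lives in the low 4 bits of the lead byte.
                return 1;
            case INT64_POSITIVE: case INT64_NEGATIVE: case FLOAT:
            case UTF_8_SHORT: case UTC_DATE_TIME: case KEY_SHORT:
                // Short: the low 4 bits hold the number of value bytes.
                return 1 + low;
            case BYTES: case UTF_8: case ARRAY: case TABLE: case OBJECT: case KEY: {
                // Normal: the low 4 bits hold the number of length bytes.
                if (low > 7) {
                    throw new IllegalArgumentException("this sketch supports 0..7 length bytes");
                }
                long valueLength = 0;
                for (int i = 0; i < low; i++) {
                    valueLength = (valueLength << 8) | (src[offset + 1 + i] & 0xFF);
                }
                return 1 + low + valueLength;
            }
            default:
                throw new IllegalArgumentException("reserved or extended type code: " + type);
        }
    }
}
```

A reader can use such a size calculation to skip over fields it is not interested in without decoding their values.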

Extended Field Types

Extended field types are all the field types that use extended encodings (Extended Normal, Extended Short and Extended Tiny). Extended encodings use an extra type byte after the lead byte which contains the field type. The lead byte just contains the field type "extended", so it is necessary to look at the following byte to determine the actual field type.

The "code" column in the following table is not the field type code in the lead byte, but the code in the type byte following the lead byte.

Type | Code | Encoding | Description

Extended Field Types - Suggestions

The following extended field types are suggested field types, but which are not yet in use, and not yet implemented in IAP Tools for Java. These field types will most likely change (!!!) so consider them, but don't rely on them.

Type | Code | Encoding | Description
Complex-Type-Id | * | Extended Normal | Contains a longer complex type id, e.g. a Java class name. Not 100% finalized.
Copy | * | Extended Short | Represents a reference to an ION field located earlier in the same ION data. Used e.g. to represent an object reference to another object, so circular object references can be represented.
Reference-Back | * | Extended Short | Represents a reference to an ION field located earlier in the same ION data. Used e.g. to represent an object reference to another object, so circular object references can be represented.
Reference-Forward | * | Extended Short | Represents a reference to an ION field located later in the same ION data. Used e.g. to represent an object reference to another object, so circular object references can be represented. Not 100% finalized.
Cache-Reference | * | Extended Normal | Represents a reference to an ION field located in the cache of the other party communicating via the same network connection. Intended to be used in conjunction with IAP. Not 100% finalized.
Cache-Reference-Short | * | Extended Short | Represents a short reference (key <= 15 bytes) to an ION field located in the cache of the other party communicating via the same network connection. Intended to be used in conjunction with IAP. Not 100% finalized.
UTC-Time | * | Extended Short | Contains a time of day in UTC time (no time zones), or a duration. Not 100% finalized.

Core Field Type Encodings

The core ION field types are those 15 field types (type codes 0 to 14) that are not extended types. Extended types require 1 or 2 extra field type bytes, remember?

Bytes

The bytes field type is the most basic field in ION. A bytes field just contains an opaque sequence of bytes. You have no information about what these bytes represent. The bytes field type can be used to transfer files, voice data and other similar byte sequences, where knowing the exact data format is not necessary in order to transfer it across a network. The bytes field type can also be used as a fallback when no other ION field type matches the data you want to send.

A bytes field uses a normal field encoding, so it consists of a lead byte, 0 to 15 length bytes and 0 to 2^120 value bytes.

Boolean

The Boolean field type uses a tiny field encoding. It can contain either a value of 0 (null), 1 (true) or 2 (false).

Int64-Positive

The Int64 Positive field type can contain 0 to 8 bytes making up a max 64 bit unsigned number.

The Int64 Positive is a short field. Thus the length of the field value is written directly into the 4 least significant bits of the lead byte (a length of 0 means a null value).
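A minimal sketch of this encoding in Java could look as follows. The class `IonInt64Positive` is hypothetical. The sketch assumes that the value 0 is written as a single value byte of 0 (since a length of 0 already means null), that the value bytes use network byte order as described for numbers earlier, and, as a simplification, it only accepts non-negative `long` values.

```java
/** Hypothetical encoder / decoder for Int64-Positive fields (type code 2). */
final class IonInt64Positive {
    static final int TYPE_INT64_POSITIVE = 2;

    private IonInt64Positive() {}

    /** Encodes a non-negative long as a lead byte plus 1..8 big-endian value bytes. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        // Bytes needed for the significant bits; at least 1, since length 0 means null.
        int significantBits = 64 - Long.numberOfLeadingZeros(value);
        int length = Math.max(1, (significantBits + 7) / 8);

        byte[] field = new byte[1 + length];
        field[0] = (byte) ((TYPE_INT64_POSITIVE << 4) | length);
        for (int i = 0; i < length; i++) {
            field[1 + i] = (byte) (value >>> (8 * (length - 1 - i)));
        }
        return field;
    }

    /** Decodes a non-null Int64-Positive field that starts at index 0 of the array. */
    static long decode(byte[] field) {
        int length = field[0] & 0x0F;
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (field[1 + i] & 0xFF);
        }
        return value;
    }
}
```

Under these assumptions the value 300 (hex 012C) is encoded as the three bytes `22 01 2C`: the lead byte with type 2 and length 2, followed by the two value bytes.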

Int64-Negative

The Int64 Negative field type can contain 0 to 8 bytes making up a max 64 bit unsigned number. Negative integers are encoded as positive integers using this simple formula:

encoded = absolute(negativeValue + 1);

You can calculate the encoded value like this:

encoded = -(negativeValue + 1);

Using this encoding -1 is encoded as 0, and -2^7 is encoded as 2^7 - 1 .

The reason that negative integers are sent across the wire as positive numbers is that two's complement negative numbers always take up the maximum number of bytes possible. That means 32 bits for a 32 bit integer and 64 bits for a 64 bit integer. By converting the negative values to positive values we can represent negative numbers with fewer bytes.

The Int64 Negative is a short field. Thus the length of the field value is written directly into the 4 least significant bits of the lead byte (a length of 0 means a null value).
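The conversion between a negative value and its encoded form can be sketched in Java like this. The class and method names are hypothetical. The sketch only covers the value conversion; writing the resulting number into the value bytes works as for any other unsigned number.

```java
/** Hypothetical helper for the Int64-Negative value conversion. */
final class IonInt64Negative {
    private IonInt64Negative() {}

    /** Converts a negative value to the non-negative number written into the field. */
    static long toEncoded(long negativeValue) {
        if (negativeValue >= 0) {
            throw new IllegalArgumentException("value must be negative");
        }
        // Adding 1 before negating avoids overflow for Long.MIN_VALUE.
        return -(negativeValue + 1);
    }

    /** Converts the number read from the field back to the negative value. */
    static long fromEncoded(long encoded) {
        return -encoded - 1;
    }
}
```

Because 1 is added before negating, even `Long.MIN_VALUE` (-2^63) maps to `Long.MAX_VALUE` (2^63 - 1), which still fits in 8 bytes and in a signed `long`.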

Float

The Float field contains either 4 bytes or 8 bytes making up either a 32 bit or 64 bit floating point number. The Float field uses a short field encoding. That means that the number of bytes the Float field contains is stored directly in the 4 least significant bits of the lead byte. The Float field thus has no explicit length bytes.

The bits in the 32 bit or 64 bit floating point number correspond to the bits returned by Java's Float.floatToIntBits() and Double.doubleToLongBits() functions, which follow the "IEEE 754 floating-point "single format" bit layout" and "IEEE 754 floating-point "double format" bit layout".
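A hedged sketch of writing Float fields in Java, using the two functions mentioned above. The class `IonFloat` is hypothetical, and the sketch assumes the bits are written in network byte order, as described for numbers in the value bytes earlier.

```java
import java.nio.ByteBuffer;

/** Hypothetical encoder for Float fields (type code 4). */
final class IonFloat {
    static final int TYPE_FLOAT = 4;

    private IonFloat() {}

    /** Encodes a 32 bit float as a lead byte plus 4 value bytes. */
    static byte[] encodeFloat(float value) {
        byte lead = (byte) ((TYPE_FLOAT << 4) | 4);
        // ByteBuffer writes multi-byte values in big-endian order by default.
        return ByteBuffer.allocate(5)
                .put(lead)
                .putInt(Float.floatToIntBits(value))
                .array();
    }

    /** Encodes a 64 bit double as a lead byte plus 8 value bytes. */
    static byte[] encodeDouble(double value) {
        byte lead = (byte) ((TYPE_FLOAT << 4) | 8);
        return ByteBuffer.allocate(9)
                .put(lead)
                .putLong(Double.doubleToLongBits(value))
                .array();
    }
}
```

For instance, `IonFloat.encodeFloat(1.0f)` produces the bytes `44 3F 80 00 00`, where `3F800000` is the IEEE 754 single format bit pattern of 1.0.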

UTF-8

The UTF-8 field can contain a variable length sequence of UTF-8 encoded characters. The UTF-8 field is a normal length field, meaning it has separate length bytes. Thus, the number of length bytes used to represent the length of the field is written into the 4 least significant bits of the lead byte. After the lead byte come the length bytes, and after that the UTF-8 encoded characters.

A null UTF-8 field is encoded with a lead byte that has 0 written into the 4 least significant bits of the lead byte. A null UTF-8 field is thus just 1 byte long.

An empty string is different from a null string. A UTF-8 field containing an empty string should have the length-of-length (the 4 least significant bits of the lead byte) set to the value 1 (= 1 length byte). Then the lead byte should be followed by a single length byte with the value 0. An empty string UTF-8 field should have no value bytes. Thus, an empty string UTF-8 field consists of 2 bytes: the lead byte + 1 length byte with the value 0.
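The difference between null, empty and non-empty strings can be sketched as follows. The class `IonUtf8` is hypothetical, and the sketch always uses the smallest number of length bytes that can hold the value length, which is one possible choice rather than a requirement stated here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical encoder for UTF-8 fields (type code 5, normal encoding). */
final class IonUtf8 {
    static final int TYPE_UTF_8 = 5;

    private IonUtf8() {}

    static byte[] encode(String text) {
        if (text == null) {
            // Null: lead byte only, with 0 in the 4 least significant bits.
            return new byte[] {(byte) (TYPE_UTF_8 << 4)};
        }
        byte[] value = text.getBytes(StandardCharsets.UTF_8);

        // At least 1 length byte, so an empty string is distinguishable from null.
        int significantBits = 32 - Integer.numberOfLeadingZeros(value.length);
        int lengthLength = Math.max(1, (significantBits + 7) / 8);

        byte[] field = new byte[1 + lengthLength + value.length];
        field[0] = (byte) ((TYPE_UTF_8 << 4) | lengthLength);
        for (int i = 0; i < lengthLength; i++) {
            field[1 + i] = (byte) (value.length >>> (8 * (lengthLength - 1 - i)));
        }
        System.arraycopy(value, 0, field, 1 + lengthLength, value.length);
        return field;
    }
}
```

With this sketch, `encode(null)` gives the single byte `50`, `encode("")` gives `51 00`, and `encode("abc")` gives `51 03 61 62 63`.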

UTF-8-Short

The UTF-8-Short field is like the UTF-8 field except it can only contain up to 15 bytes (not UTF-8 characters - bytes!). The UTF-8-Short field uses a short encoding, meaning the number of bytes contained in the UTF-8-Short field is written into the 4 least significant bits of the lead byte. A length of 0 means null.

A UTF-8-Short field is 1 byte shorter than the same string encoded using a UTF-8 field. Short strings are often transmitted over the wire, so having a more compact field type for short strings is often useful. Examples of short strings are telephone numbers, email addresses (short ones), zip codes, city names, first names, last names, product codes, serial numbers, hash values (short ones) etc.

Since a UTF-8-Short field never has any explicit length bytes, you cannot encode an empty string as a UTF-8-Short. An empty string is encoded using 1 length byte with the value 0. Thus, empty strings can only be encoded using UTF-8 fields.
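A small sketch of the UTF-8-Short encoding, illustrating that the limit applies to bytes and not characters. The class `IonUtf8Short` is hypothetical. It rejects empty strings and strings longer than 15 bytes, which a caller would instead write as a regular UTF-8 field.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical encoder for UTF-8-Short fields (type code 6, short encoding). */
final class IonUtf8Short {
    static final int TYPE_UTF_8_SHORT = 6;

    private IonUtf8Short() {}

    /** Encodes a string of 1..15 UTF-8 bytes as a lead byte followed by the value bytes. */
    static byte[] encode(String text) {
        byte[] value = text.getBytes(StandardCharsets.UTF_8);
        if (value.length < 1 || value.length > 15) {
            throw new IllegalArgumentException("UTF-8-Short needs 1..15 bytes, got " + value.length);
        }
        byte[] field = new byte[1 + value.length];
        // The byte count goes directly into the 4 least significant bits.
        field[0] = (byte) ((TYPE_UTF_8_SHORT << 4) | value.length);
        System.arraycopy(value, 0, field, 1, value.length);
        return field;
    }
}
```

Note that a single character such as "\u00e9" takes 2 bytes in UTF-8, so it is encoded with a length of 2, not 1.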

UTC-Date-Time

The UTC-Date-Time field contains a UTC date (year, month, day) and optionally a time (hours, minutes, seconds, milliseconds / microseconds / nanoseconds).

The UTC-Date-Time will have no time zone information. All dates written into a UTC-Date-Time field should be represented as UTC date and time. Conversion to and from different time zones should happen when the UTC-Date-Time field is written and read. Do not transfer local times in the UTC-Date-Time field.

The UTC-Date-Time field uses a Short field encoding. It uses a binary date format which is similar to the textual ISO date format. The binary date format only uses half the bytes the textual ISO date format uses, and is faster to read and write.

The UTC-Date-Time field encodes date and time information like this:

Year 2 bytes - values from 0 to 65535.
Month 1 byte - values from 1 to 12.
Day of month 1 byte - values from 1 to 31.
Hour of day 1 byte - values from 0 to 23.
Minutes 1 byte - values from 0 to 59.
Seconds 1 byte - values from 0 to 59 (60 when leap seconds occur).
Milliseconds 2 bytes - values from 0 to 999.
Microseconds 3 bytes - values from 0 to 999,999.
Nanoseconds 4 bytes - values from 0 to 999,999,999.

The length of a UTC-Date-Time field specifies how much date and time information the field contains. If the length is 2 bytes, then the UTC-Date-Time field only contains a year. If the length is 3 bytes, then year + month, if the length is 4 bytes, then year + month + day etc.

As you can see, a date with year, month, day, hour, minutes and seconds takes 7 bytes to encode. Compare that to the same compressed ISO date string: 20150301235959 . The compressed ISO date string is 14 bytes. Exactly double of the ION encoding. In fact, a correct ISO date encoding must have a T between date and time, plus a Z at the end to signal "no time zone". That is a total of 16 bytes.

ION's date-time encoding is also more compressed when it comes to milliseconds, and it gives you the option to send microseconds and nanoseconds too. Something you cannot do with the ISO date format (as far as we know).

By the way, you should only provide sub-second time as either milliseconds, microseconds or nanoseconds. In other words, only as either 2, 3 or 4 bytes. Not as 2 + 3 + 4 bytes (this is not valid).
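The layout above can be sketched in Java as shown below. The class `IonUtcDateTime` is hypothetical, the multi-byte parts (year, milliseconds) are assumed to use network byte order like other numbers in ION, and the sketch does not validate the value ranges.

```java
/** Hypothetical encoder for UTC-Date-Time fields (type code 7, short encoding). */
final class IonUtcDateTime {
    static final int TYPE_UTC_DATE_TIME = 7;

    private IonUtcDateTime() {}

    private static byte lead(int length) {
        return (byte) ((TYPE_UTC_DATE_TIME << 4) | length);
    }

    /** Year + month + day: 4 value bytes. */
    static byte[] encodeDate(int year, int month, int day) {
        return new byte[] {lead(4), (byte) (year >>> 8), (byte) year, (byte) month, (byte) day};
    }

    /** Year through seconds: 7 value bytes. */
    static byte[] encodeDateTime(int year, int month, int day, int hour, int minute, int second) {
        return new byte[] {lead(7), (byte) (year >>> 8), (byte) year, (byte) month, (byte) day,
                (byte) hour, (byte) minute, (byte) second};
    }

    /** Year through seconds plus milliseconds: 7 + 2 = 9 value bytes. */
    static byte[] encodeDateTimeMillis(int year, int month, int day,
                                       int hour, int minute, int second, int millis) {
        return new byte[] {lead(9), (byte) (year >>> 8), (byte) year, (byte) month, (byte) day,
                (byte) hour, (byte) minute, (byte) second, (byte) (millis >>> 8), (byte) millis};
    }
}
```

Under these assumptions, 2015-03-01 23:59:59 is encoded as the 8 bytes `77 07 DF 03 01 17 3B 3B` (lead byte plus 7 value bytes).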

As you can see, there is no valid length of 8 bytes. The valid lengths are 2, 3, 4, 5, 6, 7, 9, 10 and 11 bytes. Right now a length of 8 bytes has no special meaning (it is simply invalid) but it could be used to represent a 64-bit integer (long) with the number of milliseconds since 1970. Similarly, a length of 12 bytes could be used to represent a 64-bit integer containing seconds, and a 32-bit integer containing nanoseconds (like Java's new date format does). These 8 and 12 byte representations are not yet decided on, though. You can express just as fine grained time without the 8 and 12 byte modes. They would only make it easier to convert to the internal date / time representations of your programming language.

Array

The Array field is a normal length field just like Table and Object. The Array field is intended to contain lists of data of the same kind (but you could mix field types if you need / want that). The difference between the Table and the Array field is that the Array field does not have any Key / Key-Short sequence in the beginning to represent the names of the columns like a Table has. An Array field just contains the value fields themselves. Each value field inside an Array field is considered independent from any other field in the same Array field. This is different from how Table and Object associate Key / Key-Short fields with value fields.

A non-null ION Array must contain an ION Int64-Positive field listing the number of elements in the array. This field must be the very first field inside the ION Array, located before any of the element fields. Knowing the number of elements in an array makes it easier to allocate an array of the correct size when reading an ION Array into objects (e.g. Java objects).
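The following sketch builds an ION Array of Int64-Positive elements with the mandatory element count in front. All names are hypothetical. The sketch assumes that the number 0 is written with one value byte, and that the Array uses the smallest number of length bytes that fits.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical builder for an ION Array of Int64-Positive elements. */
final class IonArrayExample {
    static final int TYPE_INT64_POSITIVE = 2;
    static final int TYPE_ARRAY = 10;

    private IonArrayExample() {}

    /** Short-encoded non-negative integer: lead byte + 1..8 big-endian value bytes. */
    static byte[] int64Positive(long value) {
        int length = Math.max(1, (64 - Long.numberOfLeadingZeros(value) + 7) / 8);
        byte[] field = new byte[1 + length];
        field[0] = (byte) ((TYPE_INT64_POSITIVE << 4) | length);
        for (int i = 0; i < length; i++) {
            field[1 + i] = (byte) (value >>> (8 * (length - 1 - i)));
        }
        return field;
    }

    /** Wraps a body in a normal-encoded field: lead byte + length bytes + body. */
    static byte[] normalField(int type, byte[] body) {
        int lengthLength = Math.max(1, (32 - Integer.numberOfLeadingZeros(body.length) + 7) / 8);
        byte[] field = new byte[1 + lengthLength + body.length];
        field[0] = (byte) ((type << 4) | lengthLength);
        for (int i = 0; i < lengthLength; i++) {
            field[1 + i] = (byte) (body.length >>> (8 * (lengthLength - 1 - i)));
        }
        System.arraycopy(body, 0, field, 1 + lengthLength, body.length);
        return field;
    }

    /** Element count first, then the element fields themselves. */
    static byte[] arrayOf(long... elements) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        byte[] count = int64Positive(elements.length);
        body.write(count, 0, count.length);
        for (long element : elements) {
            byte[] field = int64Positive(element);
            body.write(field, 0, field.length);
        }
        return normalField(TYPE_ARRAY, body.toByteArray());
    }
}
```

For example, `arrayOf(1, 2)` yields `A1 06 21 02 21 01 21 02`: the Array lead byte, a body length of 6, the element count 2, and then the two elements.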

Table

The Table field is a normal length field just like the Object field. The Table field is intended for tabular data, similar to a CSV file, or lists of objects of the same type.

A Table field must contain an Int64-Positive as the very first element inside the ION Table. This Int64-Positive field must be located before any of the Key / Key-Short fields used to identify the columns of the table. This Int64-Positive must contain the number of rows in the Table field. Knowing the number of rows in the table makes it possible to allocate the right size array for the table elements before reading the elements.

After the row count Int64-Positive field a Table field should contain a sequence of Key or Key-Short fields which are the "column" names of the data in the table. After the sequence of Key / Key-Short fields should come a sequence of other ION fields. The ION fields following the Key / Key-Short fields are matched to the Key / Key-Short fields by their index. The first field belongs to the column of the first Key / Key-Short field, the second field belongs to the column of the second Key / Key-Short field etc.

There is no marker between the "rows" of a table. When the same number of fields as there are Key / Key-Short fields in the table have been read or written, that is interpreted as an implicit "row" boundary. For instance, if there are 10 Key / Key-Short fields, then every 10 fields following the Key / Key-Short fields belong to the same "row".
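The implicit row boundaries can be illustrated with a small sketch that works on already decoded keys and values rather than on raw bytes. The class `IonTableRows` is hypothetical. It groups a flat list of values into rows by counting off one value per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that groups the flat value sequence of a Table into rows. */
final class IonTableRows {
    private IonTableRows() {}

    static List<Map<String, Object>> toRows(List<String> keys, List<Object> values) {
        int columns = keys.size();
        if (columns == 0 || values.size() % columns != 0) {
            throw new IllegalArgumentException("value count must be a multiple of the key count");
        }
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int start = 0; start < values.size(); start += columns) {
            // Every group of `columns` values forms one row.
            Map<String, Object> row = new LinkedHashMap<>();
            for (int column = 0; column < columns; column++) {
                // The value at index i belongs to the key at index i % columns.
                row.put(keys.get(column), values.get(start + column));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

With the keys `id` and `name` and the values `1, "a", 2, "b"`, the result is two rows: `{id=1, name=a}` and `{id=2, name=b}`.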

Tables are a compact way to send tabular data like CSV files, or lists of objects where all the objects are of the same type, and thus have the same Key / Key-Short fields representing their properties. The resulting size of an ION table compared to the corresponding array of objects formatted as JSON, is often down to 1/3 or even 1/4, and can go even lower, depending on the type of data you are sending across, and the length of the property names in the objects.

Tables can contain both primitive and complex ION fields as values in the rows. Thus, you could even use a Table with nested Object and Table fields to represent a complex object graph more compactly.

A Table can contain a Complex-Type-Id field containing the type (e.g. Java class name) of the rows of the table. If used, the Complex-Type-Id field should be the very first field nested inside the Table field. However, the Complex-Type-Id field is optional.

Object

The Object field is a normal length field meaning it consists of a lead byte, 0..15 length bytes and 0..2^120 value bytes. The number of length bytes is stored in the 4 least significant bits of the lead byte.

Inside an object you can nest other ION fields in any order you like. Thus, an Object is a mixed bag of whatever you want it to be. However, the Object field does impose a certain interpretation of certain fields and their order. This interpretation is explained in the following sections.

The Complex-Type-Id field is intended to specify the type (e.g. Java class name) of an object. If used, the Complex-Type-Id field should be the very first field nested inside the Object field. However, the Complex-Type-Id field is optional. Just like class names are not necessary in JSON, you can write an ION Object without a Complex-Type-Id field.

To mimic object properties (property name + property value pairs) use a Key or Key-Short field followed by a primitive field. The Key or Key-Short field represents the property name, and the primitive field represents the property value.

By the time you start writing an Object field you may not know its final length in bytes. To work around that problem simply reserve a number of length bytes that you know for sure will be enough to represent the final length of the object. For instance, if you know for sure that the Object field will be less than 65,536 bytes, just reserve 2 length bytes before you start writing the fields inside the Object. Then, when you have finished writing all the fields inside the Object, jump back up and insert the length into the reserved length bytes.
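The reserve-then-backfill strategy could be sketched with a `java.nio.ByteBuffer` like this. The class `IonObjectWriter` and its methods are hypothetical, and the sketch always reserves 2 length bytes, so it only works for Object bodies of less than 65,536 bytes.

```java
import java.nio.ByteBuffer;

/** Hypothetical writer that reserves 2 length bytes and fills them in afterwards. */
final class IonObjectWriter {
    static final int TYPE_OBJECT = 12;
    static final int RESERVED_LENGTH_BYTES = 2;

    private IonObjectWriter() {}

    /** Writes the lead byte and reserves the length bytes. Returns the field's start index. */
    static int beginObject(ByteBuffer dest) {
        int start = dest.position();
        dest.put((byte) ((TYPE_OBJECT << 4) | RESERVED_LENGTH_BYTES));
        dest.put((byte) 0); // placeholder, most significant length byte
        dest.put((byte) 0); // placeholder, least significant length byte
        return start;
    }

    /** Computes the body length and writes it into the reserved length bytes. */
    static void endObject(ByteBuffer dest, int start) {
        int bodyLength = dest.position() - start - 1 - RESERVED_LENGTH_BYTES;
        if (bodyLength > 0xFFFF) {
            throw new IllegalStateException("body does not fit in 2 length bytes");
        }
        // Absolute puts do not move the buffer position.
        dest.put(start + 1, (byte) (bodyLength >>> 8));
        dest.put(start + 2, (byte) bodyLength);
    }
}
```

A caller would invoke `beginObject`, write the nested fields into the same buffer, and then call `endObject` with the returned start index.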

Of course this strategy means that you need to write the whole ION file to a buffer before you can commit it to disk or send it over the network. However, knowing the length of a field upfront is a big advantage when reading a field, so this is one of the trade-offs we have made between read speed and write speed. Anyways, writing ION data is pretty fast and ION can be very compact compared to other formats (like JSON), so this little write delay is not as big a problem as it would be with other more verbose data formats.

Key

A Key field is a normal length field that represents a property name of an Object or a column name in a Table. You could also use a Key field to represent a key in a hashtable.

A Key field can contain whatever you need it to, but it is common to use a sequence of UTF-8 characters (e.g. a property name in a Java class).

Key-Short

A Key-Short field is similar to a Key field except a Key-Short field can only contain up to 15 bytes as value. The length of the field value is encoded directly into the 4 least significant bits of the Key-Short lead byte.
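As a sketch, a writer could pick between Key-Short and Key based on the byte length of the key. The class `IonKeys` is hypothetical, and it assumes keys are UTF-8 encoded names. Keys of 1 to 15 bytes become Key-Short fields, and everything else becomes a normal Key field with the smallest number of length bytes that fits.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical encoder that chooses between Key (13) and Key-Short (14). */
final class IonKeys {
    static final int TYPE_KEY = 13;
    static final int TYPE_KEY_SHORT = 14;

    private IonKeys() {}

    static byte[] encodeKey(String name) {
        byte[] value = name.getBytes(StandardCharsets.UTF_8);
        if (value.length >= 1 && value.length <= 15) {
            // Key-Short: the value length goes directly into the lead byte.
            byte[] field = new byte[1 + value.length];
            field[0] = (byte) ((TYPE_KEY_SHORT << 4) | value.length);
            System.arraycopy(value, 0, field, 1, value.length);
            return field;
        }
        // Key: normal encoding with explicit length bytes.
        int lengthLength = Math.max(1, (32 - Integer.numberOfLeadingZeros(value.length) + 7) / 8);
        byte[] field = new byte[1 + lengthLength + value.length];
        field[0] = (byte) ((TYPE_KEY << 4) | lengthLength);
        for (int i = 0; i < lengthLength; i++) {
            field[1 + i] = (byte) (value.length >>> (8 * (lengthLength - 1 - i)));
        }
        System.arraycopy(value, 0, field, 1 + lengthLength, value.length);
        return field;
    }
}
```

The key `id` is thus written as the 3 bytes `E2 69 64`, while a 16 byte key needs the normal Key encoding and takes 18 bytes in total.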

Extended ION Field Encodings

As mentioned earlier in this ION encoding document, ION can contain a set of fields that are encoded using extended encoding. Extended encoding means that the field type id in the lead byte will have the value "Extended" (15). The lead byte of an extended field is followed by 1 or 2 type ID bytes.

If the value of the first byte following the lead byte is between 16 and 127, then it is a single-byte field type id. We have not yet decided how to encode 2-byte field type ids, because ION currently has no extended fields.

The length-length bits of the lead byte (least significant 4 bits) mean the same as for core ION fields. They signal the length in bytes of the length (byte count) of the field value. Extended fields can also come in short and tiny encodings. In these encodings the length-length bits change meaning to the length in bytes of the field value (for extended short encodings) or to contain the value itself (extended tiny encodings). Note, that extended short and extended tiny fields still have a field type id byte following the lead byte.

If an extended field contains length bytes (Extended Normal encodings do), the length bytes will follow the field type id byte(s).
