ION vs. Other Formats
This text discusses how ION differs from other data formats that address the same problem of compact, binary data communication. We have compared ION to CBOR, MessagePack, Protobuf and JSON.
Here is a table summing up the differences. A "Yes(*)" means "Yes, but with limitations". A "No(*)" means "No, but it can be bent to support it". If something isn't right in this table, let us know!
|Feature||ION||CBOR||MessagePack||Protobuf||JSON|
|Good at raw bytes||Yes||Yes||Yes||Yes||No|
|Copy (of earlier data)||Yes||No||No||No||No|
|Object / Map||Yes||Yes||Yes||Yes||Yes|
|Schema / Class Id||Yes||No||No||No||No|
|Unspecified length arrays / maps||No||Yes||No||?||Yes|
|Arbitrary hierarchical navigation||Yes||Yes(*)||Yes(*)||Yes||Yes(*)|
|Stream mode reading||Yes||Yes||Yes||Yes||Yes|
|Stream mode writing||Yes(*)||Yes||Yes(*)||Yes||Yes|
|Extendable with new / custom types||Yes||Yes||Yes||Yes||No|
Of these formats, ION is most similar to CBOR. ION is also very similar to MessagePack (though less so than to CBOR). The basic encoding is so similar that these formats should have comparable read and write speeds for fields. The ION Performance Benchmarks confirm that.
ION's major difference from CBOR and MessagePack is the table data structure. A table can model tabular data, similar to a CSV file or a database table. An ION table contains the column names only once, followed by the column values for all rows in the table. Tables can also contain nested tables, so they can make object graphs more compact too (where a parent object has multiple children of the same kind).
Tables are a very compact way to send arrays of objects of the same kind. As the benchmarks also show, ION tables can be as small as 25-33% of the size of the same data encoded as JSON (and 40-50% of the size of CBOR and MessagePack).
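The effect of listing the column names only once can be sketched in a few lines of Python. This is only an illustration of the idea: it uses JSON text as a stand-in for both layouts, not ION's binary encoding, so the ratio it prints is not the benchmark figure quoted above. The helper name `to_table` and the sample data are made up for this example.

```python
import json

def to_table(rows):
    """Turn a list of same-shaped dicts into column names (once) plus row values."""
    columns = list(rows[0].keys())
    values = [[row[c] for c in columns] for row in rows]
    return {"columns": columns, "rows": values}

# 100 objects of the same kind: every object repeats all property names.
rows = [{"id": i, "firstName": "Jane", "lastName": "Doe"} for i in range(100)]

as_objects = json.dumps(rows, separators=(",", ":")).encode("utf-8")
as_table = json.dumps(to_table(rows), separators=(",", ":")).encode("utf-8")

# The table layout is smaller because the property names appear only once.
ratio = len(as_table) / len(as_objects)
print(len(as_objects), len(as_table), round(ratio, 2))
```

The more rows there are, and the longer the property names, the larger the saving becomes.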
Not all formats support raw bytes well. By raw bytes we mean a sequence of bytes such as a file or a video frame. To include raw bytes in JSON they must be text encoded using either Base64 or Hex encoding. Base64 encoding makes the encoded data take up 4/3 of the size of the original data (one third more), and Hex encoding makes the encoded data take up double the size of the original data. ION, MessagePack, CBOR and Protobuf have no problem including raw bytes.
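The overhead of text encoding is easy to verify with Python's standard library. The 768-byte payload below is just an arbitrary stand-in for a file or video frame.

```python
import base64
import binascii

# 768 raw bytes standing in for a file or a video frame.
raw = bytes(range(256)) * 3

b64 = base64.b64encode(raw)     # 4 output characters for every 3 input bytes
hexed = binascii.hexlify(raw)   # 2 output characters for every input byte

print(len(raw), len(b64), len(hexed))  # 768 1024 1536
```

A binary format can carry the 768 bytes as they are, plus a small field header.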
Both ION and CBOR have UTC date-time support, but ION's date-time encoding uses 50% or fewer of the bytes of the ISO standard (textual) format used by CBOR. Also, ION only supports UTC time, not local time, so you are required to convert to and from UTC yourself.
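To give a feel for why a binary date-time is so much smaller than ISO text, here is a minimal sketch. The field layout (2-byte year followed by one byte each for month, day, hour, minute and second) is an assumption made for this example and is not taken from the ION specification. The functions `pack_utc` and `unpack_utc` are likewise hypothetical.

```python
import struct
from datetime import datetime, timezone

def pack_utc(dt):
    """Hypothetical layout: 2-byte year, then month, day, hour, minute, second."""
    dt = dt.astimezone(timezone.utc)  # the sender converts to UTC itself
    return struct.pack(">HBBBBB", dt.year, dt.month, dt.day,
                       dt.hour, dt.minute, dt.second)

def unpack_utc(data):
    """Inverse of pack_utc; always returns a UTC date-time."""
    year, month, day, hour, minute, second = struct.unpack(">HBBBBB", data)
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)

moment = datetime(2014, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
iso_text = moment.strftime("%Y-%m-%dT%H:%M:%SZ").encode("ascii")
binary = pack_utc(moment)

print(len(iso_text), len(binary))  # 20 bytes of text vs. 7 bytes of binary
```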
ION contains a special "Copy" field which enables you to reference an ION field earlier in the same ION data, which should then be copied into this place in the ION data. For instance, it could be a class name (see later), a long property name, a zip + city object, a large object graph, a table or something else. As far as we remember, CBOR and MessagePack can use a special "string back reference" element to refer to often-used strings (e.g. property names of objects of the same kind), but it is not part of their core encodings (as far as we can see).
All of the formats support arbitrary hierarchical navigation of the encoded data, without first converting it to objects. However, since ION always knows the exact size in bytes of a complex field (a field containing other fields, like an object, array or table), ION can skip over a whole field without having to parse into its nested fields. CBOR, MessagePack and JSON cannot do that. They all require some level of parsing of the contents of a field in order to find the next element at the same hierarchical level after it (the next sibling).
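Skipping a length-prefixed field can be sketched as follows. The header used here (a 1-byte type code followed by a 4-byte big-endian body length) and the type codes are invented for the illustration; ION's real field header is different. The point is only that the reader jumps straight to the next sibling without looking inside the nested fields.

```python
import struct

OBJECT, TEXT = 1, 2  # made-up type codes for this sketch

def field(type_code, body):
    """Encode one field as: 1-byte type code, 4-byte big-endian body length, body."""
    return struct.pack(">BI", type_code, len(body)) + body

def skip_field(buf, offset):
    """Return the offset of the next sibling without parsing the field's body."""
    _type_code, length = struct.unpack_from(">BI", buf, offset)
    return offset + 5 + length  # 5 = size of the header

# An object containing nested fields, followed by a sibling text field.
nested = field(OBJECT, field(TEXT, b"deeply") + field(OBJECT, field(TEXT, b"nested")))
sibling = field(TEXT, b"next sibling")
message = nested + sibling

next_offset = skip_field(message, 0)  # one jump over the whole nested object
type_code, length = struct.unpack_from(">BI", message, next_offset)
print(message[next_offset + 5 : next_offset + 5 + length])  # b'next sibling'
```

With an end-marker based or textual format, the reader would instead have to walk through every nested element to find where the object ends.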
ION, CBOR, MessagePack and JSON are all self describing, meaning you don't need an external schema to read them. This is essential for a network protocol where intermediate nodes may have to route messages along to other nodes. According to Protobuf's own docs you cannot see where one Protobuf message ends and the next begins, meaning Protobuf is not fully self describing. You can see where the individual Protobuf fields start and end, but not the full message.
The fact that Protobuf is not fully self describing makes it unsuitable as a network protocol message format (although you could route Protobuf messages inside other types of messages). That a data format is self describing also means that it is possible to convert a file in one of these formats to a textual format (JSON is already textual) to see what is actually stored in the file.
ION supports several levels of self describing messages. At the most descriptive level ION can embed schema or class names (Complex Type Ids) inside objects, arrays or tables. This takes up more space, of course. You can also just embed a short schema / class id in the form of a shorter number or textual code, and then translate that when reading the ION message.
Schema / class names are optional. You can also just serialize objects as key, value pairs like JSON. ION supports that too. In fact, CBOR, MessagePack and JSON support this level of self describing messages too. A tiny difference is that ION keeps the data type when sending NULL values (e.g. a null int-64 or a null UTF-8 text). CBOR, MessagePack and JSON all lose the type when transferring NULL values: a null has no type. (But as said, this is not a big thing.)
ION also supports a compact level of objects where property names are left out. This is very similar to how ION tables work, where the property names of objects are only listed once. With compact objects and tables, encoded ION looks very similar to encoded Protobuf. Consequently, the performance of this encoding mode is comparable to Protobuf's (faster writes but slower reads than Protobuf). Even if these compact objects do not contain any property names, they are still self describing enough that you can see where fields start and end, plus their data type, without an external schema. You cannot do that with Protobuf (as far as we know).
ION has support for expressing cyclic references between objects. At this point this support is not 100% finalized.
ION is designed to function as a network message format (among other things) for the IAP network protocol. One feature we plan to build into IAP is caching of data related to the IAP connection. For instance, a web server could ask a client to cache a file. Or, an API server could ask a client to cache some data (e.g. a service status) which it returns often. Later in the session the server can then refer to this cached file (any ION field, actually) as part of a new message it sends.
ION's biggest drawback compared to CBOR is that CBOR allows for stream-mode writing of arrays and objects. In stream-mode writing the element contains no information in the beginning about how large the object or array is. Instead the object or array has an end-marker. This stream-mode writing allows CBOR data to be generated and streamed directly out on the network.
Since ION fields all contain the size in bytes of a field right at the beginning of the field, you can only use stream-mode writing with ION when you know the size of a field ahead of time. This is normally true with primitive fields (e.g. a string or int-64), or even with files read from disk where you know the size ahead of time. But with larger objects generated from e.g. database queries, stream-mode writing is not possible. You will have to buffer up the message before sending it. This can be done reasonably efficiently, so this is mostly noticeable with larger messages (4K and up).
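The difference between the two writing styles can be sketched like this. The CBOR side uses the real CBOR bytes for an indefinite-length array (start byte 0x9f, "break" stop code 0xff, and single-byte encoding of the integers 0 to 23). The length-prefixed side is a made-up stand-in with a 4-byte length and is not ION's actual field header. Both function names are hypothetical.

```python
import io
import struct

def write_cbor_indefinite_array(out, small_ints):
    """Stream an array without knowing its length: start byte, items, stop code."""
    out.write(b"\x9f")                    # CBOR indefinite-length array
    for n in small_ints:
        if not 0 <= n <= 23:
            raise ValueError("this sketch only handles integers 0..23")
        out.write(bytes([n]))             # goes out as soon as it is produced
    out.write(b"\xff")                    # CBOR "break" stop code

def write_length_prefixed_array(out, small_ints):
    """Length comes first, so the whole body must be buffered before writing."""
    body = bytes(small_ints)              # consumes the entire generator
    out.write(struct.pack(">I", len(body)))
    out.write(body)

def generated_values():
    """Stands in for data produced on the fly, e.g. rows from a database query."""
    for n in range(5):
        yield n

streamed = io.BytesIO()
write_cbor_indefinite_array(streamed, generated_values())

buffered = io.BytesIO()
write_length_prefixed_array(buffered, generated_values())

print(streamed.getvalue().hex())  # 9f0001020304ff
print(buffered.getvalue().hex())  # 000000050001020304
```

In the first function nothing is held back, while the second cannot emit its first byte until the generator is exhausted.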
Remember, files from disk, which are often larger than 4K, can still be written using stream-mode writing with ION, so this is only an issue with large amounts of generated data (e.g. HTML files put together from templates etc.). However, we have plans to address these issues elsewhere in IAP.
Stream-mode reading is fully possible in ION.