Stream Ops for Java is a data streaming toolkit implemented in Java. Right now Stream Ops only works as an embeddable toolkit you can use inside your own Java applications, but we are working on a client-server version too.
Stream Ops is similar in functionality to Apache Kafka and Apache Pulsar, but the focus on making Stream Ops embeddable led to a different design philosophy. Where Kafka and Pulsar are more of a black box internally, Stream Ops opens up the data streaming engine so you can use the parts of it that match your specific needs. This gives you a higher degree of flexibility to customize your data streaming pipeline.
Stream Ops is designed to be fully embeddable in your applications. By fully embeddable we mean that you can use Stream Ops internally in a desktop or mobile application, inside a web application, microservice, or any other service requiring sequential data storage - typically systems using a variation of a CQRS design.
Stream Ops consists of a small set of compact JAR files. Stream Ops was kept small on purpose to minimize the code footprint's impact on mobile apps, desktop apps and microservices that embed Stream Ops.
Embedded Use Cases
Common embedded use cases for Stream Ops include:
- As an application log (info, warn, error etc.).
- As a data log - e.g. records containing measurements or metrics. Analyzing data on a local machine can in some cases be much faster than interacting with remote databases.
- As an event log - to store and replay the internal state of your app.
- As an engine for implementing your own data streaming service (custom protocol, encoding etc.).
It is really nice to only have a single interface to read and write all of these different types of data, which are all just streams of records.
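To illustrate the idea of a single interface for all of these record types, here is a minimal sketch. The class and method names (`RecordLog`, `append()`, `readAll()`) are invented for illustration - they are not the actual Stream Ops API. The point is simply that application log lines, metrics and events can all flow through one append-only record interface.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only - not the Stream Ops API.
// One interface for all record types: app logs, metrics, events.
public class RecordLog {
    private final List<byte[]> records = new ArrayList<>();

    // Append any record as an opaque byte array.
    public void append(byte[] record) {
        records.add(record);
    }

    // Read records back sequentially, in insertion order.
    public List<byte[]> readAll() {
        return new ArrayList<>(records);
    }

    public static void main(String[] args) {
        RecordLog log = new RecordLog();
        log.append("INFO app started".getBytes());   // application log record
        log.append("cpu.load=0.42".getBytes());      // metric record
        log.append("OrderPlaced id=7".getBytes());   // event record
        System.out.println(log.readAll().size() + " records stored");
    }
}
```

Whether the record is a log line, a measurement or an event makes no difference to the log itself - each is just the next record in the stream.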
In time Stream Ops will be fully networkable too. Stream Ops will be using the compact IAP protocol which we are also developing. This tutorial will be updated as more of the networked mode is decided and implemented. We prefer not to make too many promises until we know we can deliver.
Networked Use Cases
Common networked mode use cases for Stream Ops include:
- As a remote data log client / service.
- As a remote event log client / service.
- As a remote application log client / service.
The primary difference between embedded and networked use cases is that in embedded mode only your application can read and write the data, unless you add a network layer on top yourself. In networked mode, one application can write data for other applications to consume.
Designed for Performance
Performance is a high priority for Stream Ops. In September 2019 we (Nanosai) launched the One Billion Records Per Second data streaming challenge - a challenge to ourselves to push Stream Ops to be able to process up to one billion records per second in embedded mode. In networked mode the network IO will obviously be the biggest limitation, but somewhere between 10 and 20 million records per second should be possible.
A lot of the design choices made in Stream Ops are heavily influenced by the requirement for performance. Embedded mode removes the network overhead in cases where a network is not needed. The RION record format is compact, fast to parse, and makes it possible to sub-select columns of records when streaming over a network, just like when selecting columns from a database table. And this is only the tip of the iceberg in terms of performance-influenced design decisions.
Stream Ops Components
Stream Ops contains (or will contain) the following components:
- Data Streaming Engine
- Stream Processing API
These components will be described in a bit more detail in the following sections.
Data Streaming Engine
The data streaming engine handles the writing of records into a data stream on disk, and later reading of those records again. Records can be read sequentially in the same order they were stored. You can iterate over all, or part of the records in the stream. The data streaming engine contains several components to help you iterate cleverly over the records in the stream, filter out the records you are not interested in, and even to only extract part of the fields of each record.
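The core mechanics described above - appending records to a file and reading them back sequentially with filtering - can be sketched in plain Java. This is a simplified illustration of the concept, not the Stream Ops engine itself; `DiskRecordStream`, the length-prefixed record layout, and the method names are all invented for this example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Predicate;

// Illustrative sketch of what a data streaming engine does internally:
// length-prefixed records appended to a file, read back sequentially.
// Not the Stream Ops API - names and layout are invented for illustration.
public class DiskRecordStream {

    // Append one record: a 4-byte length prefix followed by the record bytes.
    public static void append(Path file, byte[] record) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(file.toFile(), true))) {
            out.writeInt(record.length);
            out.write(record);
        }
    }

    // Read records sequentially in stored order, keeping only those
    // matching the filter - similar in spirit to filtered iteration.
    public static List<byte[]> read(Path file, Predicate<byte[]> filter)
            throws IOException {
        List<byte[]> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new FileInputStream(file.toFile()))) {
            while (in.available() > 0) {
                byte[] record = new byte[in.readInt()];
                in.readFully(record);
                if (filter.test(record)) result.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("stream", ".log");
        append(file, "ERROR disk full".getBytes());
        append(file, "INFO all good".getBytes());
        List<byte[]> errors =
            read(file, r -> new String(r).startsWith("ERROR"));
        System.out.println(errors.size() + " error record(s) found");
    }
}
```

A real engine adds much more on top of this - batching, memory-mapped IO, offsets, partial field extraction - but the sequential write/read/filter cycle is the essence.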
Stream Processing API
The stream processing API can process the records of a stream on a higher level. For instance, read all the records, convert them to objects, perform calculations and transformations, and finally output some other result. The stream processing API can be used independently of the data streaming engine. Thus, you could use the stream processing API with Kafka or other data streaming technologies.
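The read-convert-calculate pipeline described above can be sketched with plain Java streams. Since the Stream Ops processing API design is still undecided, this example uses only standard Java; the `Measurement` record, the `sensor;value` encoding and the method names are hypothetical.

```java
import java.util.List;

// Hedged sketch of the kind of processing a stream processing API targets:
// raw records in, domain objects out, a computation over them.
// Plain Java streams here - not the Stream Ops API.
public class TemperatureProcessing {

    // A record deserialized into a domain object (hypothetical schema).
    record Measurement(String sensor, double celsius) {}

    // Parse a raw record of the form "sensor;value".
    static Measurement parse(String raw) {
        String[] parts = raw.split(";");
        return new Measurement(parts[0], Double.parseDouble(parts[1]));
    }

    // Convert + aggregate: average temperature across all records.
    static double averageCelsius(List<String> rawRecords) {
        return rawRecords.stream()
                .map(TemperatureProcessing::parse)
                .mapToDouble(Measurement::celsius)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        List<String> raw = List.of("s1;20.0", "s2;22.0", "s1;24.0");
        System.out.println("avg = " + averageCelsius(raw)); // prints avg = 22.0
    }
}
```

Because the processing step only sees raw records, the same pipeline could be fed from Stream Ops, Kafka, or any other record source - which is exactly why the two components can be used independently.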
Please note that the stream processing API is currently a bit on hold, as some of our most recent analysis of the currently popular designs (functional stream processing, observer-based stream processing) shows that these designs are not optimal. We need to analyze these designs in more detail before we decide on a design for the stream processing API. You can follow this analysis as it unfolds in my tutorial Stream Processing API Designs.
Stream Ops GitHub Repository
You can find the Stream Ops for Java GitHub repository here: