Tech and Media Labs
This site uses cookies to improve the user experience.

Stream Processing API Designs

Jakob Jenkov
Last update: 2019-09-11

Stream processing has become quite popular as a technique to process streams of incoming events or records as soon as they are received. There are many benefits of, and use cases for, stream processing (aka data streaming), and therefore many stream processing APIs have been developed to help developers process streams, more easily.

By itself, a large selection of stream processing APIs is positive. However, the design of these APIs have a quite large impact on just how "easy" processing streams with them becomes. This is not obvious from looking at the surface of these APIs, unless you know what to look for. Most of the limitations of the different designs you don't discover until you try to solve various problems with them - and by then it is often too late to change to a different API.

Having burned my own hands on various stream processing API designs, I decided to write this analysis of stream processing API designs to help anyone who is about to venture into stream processing, and who would like a bit of guidance in what to look for when choosing an API.

Functional Stream Processing Designs

Functional stream processing is one of the more well-known stream processing API designs. It is, however, not one of the best designs. Functional stream processing comes in two flavours:

  • Functional Batch Processing
  • Functional Stream Processing

Some of the most well-known implementations using the term functional stream processing are actually not stream processing APIs, but batch processing APIs. I will explain the design of the Java Streams API which, despite its name, is an example of a functional batch processing API.

An example of a truly functional stream processing API is the Kafka Streams API which comes with the Kafka platform. I will explain how the Kafka Streams design is different from the design of the Java Stream API, and also explain its limitations as a stream processing API design.

Functional Batch Processing

Functional batch processing is, as I mentioned above, often called "functional stream processing" although as we will soon see, it is not. The Java Streams API is an example of a functional batch processing API. The design rationale behind functional batch processing is this:

Imagine you have a large collection of objects which you need to iterate and perform various calculations on. For instance, imagine you have to calculate the total sum, average, min and max of a list of objects that contain a price (or maybe even just the prices themselves as numbers). Since the operations to be carried out are very light weight, the heaviest part of performing these calculations on a list of objects is the time taken merely to iterate the objects.

Rather than iterating all the objects once for each operation (sum, average, min, max), you want to iterate the objects only once, and perform all the calculations on these objects during this one iteration. Functional batch processing APIs are designed to enable you to do exactly that.

Functional batch processing APIs are designed so that you configure a set of element listeners and then start the iteration of all the elements. During iteration of all the elements each listener is called once for each element in the collection. It is then up to the listener to perform any operations it need on the given element, and keep track of any information needed in that regard (e.g. sum and number of elements to calculate an average value).

The Java Streams API

The Java Streams API is an example of a functional batch processing API. The documentation refers to these batches as "bounded streams" - but a bounded stream is a batch. A bounded stream has a first element and a last element - and is thus a batch.

I won't go into too much detail about the actual methods available in the Java Stream API. I will focus on what the resulting "structure" of your configured stream processing pipeline looks like.

On the surface, when you have configured a Java Streams API processing pipeline, you will have a processing graph that looks similar to this:

Functional batch processing pipeline - at first glance.

Each element in the source collection (batch) is passed through the chain of non-terminal operations to finally be processed by the terminal operation at the end of the chain. Of course, if elements are "filtered out" before they reach the terminal operation, they won't be processed by it.

However, when you study the "structure" a bit closer, you realize that the various non-terminal operations are functions, and that these functions do not directly call the next function in the chain. Instead, something else calls each function, and pass their return value on to the next step in the chain. Here is a more precise illustration of what is going on:

Functional batch processing pipeline - more precise illustration.

All the non-terminal operations are functions (or classes implementing a functional interface) which are called by the internal iterating component(s) of the Java Stream API. When a terminal operation is called, the internal iteration is started, the non-terminal operations are called and their return values end up in the terminal operation.

Functional Batch Processing - Pros

The advantages of functional batch processing is, that these kinds of APIs tend to be easy and quick to use when you need to do some light transformations / calculations on a collection of objects.

Functional Batch Processing - Cons

For anything beyond the simplest transformations / calculations on a collection of objects, the functional batch processing API design starts to break down.

Terminal Operations are for Batch Processing - Not Stream Processing

First of all, terminal operations have no place in stream processing. The terminal operation typically aggregates the result of the various non-terminal transformations applied to each element during iteration. After iteration finishes, the aggregated result is returned. Returning a final result is only possible because your data has a last element. In true stream processing there is no last element! You don't know when you have received the last element. Thus, terminal operations only make sense in batch processing - not in stream processing.

Chain - Not Graph

When you configure an element processing topology with the Java Stream API (functional batch processing), the result is a chain. In other words, the output of one processing function is the input of the next processing function in the chain. At the end of the chain you have the terminal operation which aggregates and produces the final result.

In real-life stream processing you will often need a processing graph, not just a simple chain.

The reason the functional batch processing uses a chain topology is because the Java Streams API is designed around terminal operations. A terminal operation produces a single, final result. A chain topology is the easiest way to produce a single, final result. All elements flow through the chain and end in the terminal operation.

With a graph topology there could be many different "leaf processors" - each of which could hold final results when the iteration ends. With a graph topology the terminal operation would have to collect the final results from all of these leaf processors and aggregate them into a single, final result. Though possible to implement, this would probably be harder to implement, and the resulting API might be a bit less easy to use. But - it would be much more powerful.

Intended to be Stateless

One of the main ideas in functional programming is, that the functional topologies are intended to be stateless functions that get some input and return a single result.

In real life stream processing, however, you will often need state in your processors. For instance, counting the number of elements that has passed through a processor is useful for monitoring the topology. That count is state.

No Feedback Mechanism

Since processors (non-terminal operations) are intended to be stateless functions, there is no way to provide feedback back up the chain. For instance, in a chain that contains these two operations:

  • Convert from object of class A to object of class B
  • Calculate based on numbers in object of class B.

... there is no way for the second processor to tell the first processor that it doesn't need to convert anymore objects from class A to class B. Let's say, that the second operator only needs the first 10 elements to do its job - then all elements will still be converted from class A to class B by the first processor, even if their values will be ignored by the second processor. A waste of CPU resources.

In real life stream processing you can benefit from a feedback mechanism which feed information back up the processing topology. In many cases, such feedback requires stateful processors.

Observer Based Stream Processing

Observer based stream processing API designs has several advantages over functional stream processing designs. I will here explain what they are.

The Store and Process Design Pattern

For many simpler, and even advanced stream processing requirements, both the functional stream processing and observer based stream processing designs are unnecessarily complex, in my opinion. The store and process design pattern offers a simple, yet scalable alternative.

More coming soon...

Jakob Jenkov

Copyright  Jenkov Aps
Close TOC