Batch Processing vs Stream Processing

Part III of Series: Data-Intensive Applications

When we encounter systems with large amounts of data, there are two main ways we can crunch that data and turn it into something useful for our organisation.

One of those approaches is batch processing, where the system takes a large amount of data accumulated over a period of time and applies operations to transform it into something else.

The second approach is stream processing, where data is flowing through an event stream and processing happens in (near) real time.

Both approaches are important, as they address different needs of an organisation.

Batch Processing

This is the classic approach, one that has probably existed for as long as computing itself. Before digital computers were invented, punch-card tabulating machines implemented semi-mechanised batch processing to aggregate statistics from large inputs.

In modern applications, batch processing is performed either in data warehouses or by aggregating large amounts of data using MapReduce-style operations.

Either way, the goal of this processing is to take data at rest and transform it into something useful, for example aggregated revenues, sales projections or betting statistics.

The traditional data-warehousing approach typically uses a lot of processing power in one place to crunch large amounts of data. There are of course distributed data warehouses, but the strategy is fairly similar.

The MapReduce approach, found in tools such as Hadoop and Spark, introduced new ways to store data in a distributed fashion, allowing these systems to leverage distributed file systems for better performance and scalability.
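
To make the model concrete, here is a minimal sketch of the MapReduce idea in plain, single-machine Python: the map phase emits key-value pairs and the reduce phase aggregates them per key. In a real framework both phases would run in parallel across many machines; the function names here are purely illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each key (word)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(docs)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```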

Nowadays we can use a small Spark cluster to process huge amounts of data in minutes, where traditional approaches could take days.

A typical setup for Apache Spark could look as follows:

[Figure: Apache Spark, simplified]

In this setup, the data processing is planned and distributed across all the nodes in the cluster, allowing for super fast batch processing.

In summary, Spark partitions the data into chunks and sends them to the executors, which perform operations in parallel.
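
As a rough illustration of what such a job can look like, here is a small PySpark sketch that aggregates revenue per product from a CSV file. The file paths and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point; on a real cluster the master URL would point at the
# cluster manager instead of running locally.
spark = SparkSession.builder.appName("revenue-batch").getOrCreate()

# Spark splits the file into partitions and distributes them to executors.
orders = spark.read.csv("/data/orders.csv", header=True, inferSchema=True)

# The aggregation runs in parallel on each partition and the partial
# results are then combined, much like a map followed by a reduce.
revenue = (
    orders.groupBy("product")
          .agg(F.sum("amount").alias("total_revenue"))
)

revenue.write.mode("overwrite").parquet("/data/revenue_by_product")
spark.stop()
```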

This approach is still very relevant in today’s world, and given that we seem to be collecting more and more data, it is vital that these systems evolve with the processing needs.

Stream Processing

Stream processing, on the other hand, is a more recent approach. It is based on the fact that (unless something unexpected happens) data never stops flowing in, so datasets are never complete.

Batch processing needs to logically split data in order to set boundaries and apply transformations.

In stream processing, we work with the premise that datasets are never complete and that data will keep entering the system constantly, so we need to process it as it arrives.

The main benefit of this is that we get immediate feedback. The transformations we apply are processed right away, and we can see the results instantly. This setup may also let us get away with less powerful machines, as processing smaller incoming datasets can require fewer resources per node.
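
The difference can be sketched in a few lines of Python: rather than waiting for a complete dataset, a stream processor keeps some running state and updates its output with every incoming event. The event source below is just a stand-in for a real stream.

```python
def running_average(events):
    """Consume an unbounded stream of numbers, emitting the
    up-to-date average after each event instead of at the end."""
    total, count = 0.0, 0
    for value in events:
        total += value
        count += 1
        yield total / count  # immediate feedback, one event at a time

# In reality `events` would be an unbounded source such as a Kafka topic.
for avg in running_average(iter([10, 20, 30])):
    print(avg)  # 10.0, 15.0, 20.0
```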

Transmission

In batch processing, the unit of storage is a file, which contains a series of items. During processing, each file is broken down into items that are processed individually.

In stream processing, these items arrive individually and are commonly known as “events”, but they’re effectively the same thing: a self-contained, immutable object that stores the details of what happened, usually along with a timestamp.

Hence, event streaming refers to the transmission of these events, usually generated by a “producer” and read by “subscribers” or “consumers”.

A typical event could be the action performed by a user in an online shop, for example placing an order.
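
In code, such an event might be modelled as a small immutable record. This is only a sketch, using a frozen Python dataclass with invented field names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen makes the event immutable
class OrderPlaced:
    order_id: str
    customer_id: str
    amount: float
    timestamp: datetime

event = OrderPlaced(
    order_id="o-123",
    customer_id="c-456",
    amount=49.99,
    timestamp=datetime.now(timezone.utc),
)
```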

Messaging systems

Typically these events are transmitted through a messaging system, such as Kafka. A producer sends a message containing the event to the messaging system, which is then responsible for delivering it to the consumers. This is also known as the pub/sub model.
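
Here is a rough sketch of both sides of this model, using the kafka-python client against a local broker; the topic name, broker address and JSON serialisation are assumptions for the example.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish an order event to the "orders" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": "o-123", "amount": 49.99})
producer.flush()

# Consumer side: subscribe to the topic and process events as they
# arrive. This loop blocks, waiting for new messages indefinitely.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # {'order_id': 'o-123', 'amount': 49.99}
```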

Message log

In a traditional streaming system we find a data structure right at the centre, typically called the log, which stores all the incoming events as they arrive.

[Figure: a message log]

A log is an append-only sequence of records on disk. The producer appends new events to the end, and each consumer reads newly appended events in order.

To scale this setup further, some systems implement partitioned logs, where different partitions can be hosted on different machines, making each partition a separate log that can be read and written independently of the others. This is the idea behind the partitions of Kafka topics.
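
A toy in-memory model of a partitioned log, ignoring persistence and replication, might look like the following. It captures the essentials: producers only ever append, events within a partition keep their order, and a key determines which partition an event lands in.

```python
class PartitionedLog:
    """A toy partitioned log: each partition is an independent
    append-only list; a key is hashed to pick its partition."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, event):
        # A real system would use a stable hash; Python's hash() of
        # strings is randomised per process.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Consumers track their own offset and read from it onwards.
        return self.partitions[partition][offset:]

log = PartitionedLog()
print(log.append("customer-42", {"order_id": "o-1"}))
print(log.append("customer-42", {"order_id": "o-2"}))  # same key, same
                                                       # partition: order kept
```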

This is, broadly, what you find at the heart of event-driven architectures.

Where is it relevant?

Stream processing has many use cases and it plays a vital role in modern architectures. Some of the use cases are:

  • Fraud detection. Real-time fraud detection has become a staple of modern architectures and plays an important role in many organisations. Getting real-time feedback on potentially fraudulent activity can save your organisation a lot of trouble.
  • Trading systems. Stock trading nowadays is, by its nature, a stream processing problem. Being able to analyse price changes in financial markets in real time is without doubt a critical capability.
  • Military and intelligence. These systems track potentially dangerous activities in real time and are critical in the prevention of terrorist attacks and other illicit activities.

At the same time, real-time processing of system analytics is also vital for modern systems, where reacting to faults quickly is very important.

Conclusion

Both batch and stream processing play a vital role in everyday applications, each targeting a specific use case.

However, the tendency nowadays is to go more and more real-time. With systems like Kafka becoming ever more powerful, and processing power and storage getting cheaper, companies are increasingly designing their systems with data streaming as the backbone, ditching batch processing almost entirely.

With this section, I conclude my series on Data-Intensive Applications.

I hope you’ve enjoyed it, and if you have any questions just reach out.

Happy architecting? 🤔