What Is Stream Processing? Introduction to OSS Engines

What is stream processing?

**Stream processing** continuously processes streaming data, which is data generated endlessly over time, in real time.

For example, imagine using sensors to sort T-shirts moving along a conveyor belt by color and pack them into boxes.

In this case, “streaming data” and “stream processing” mean the following:

Streaming data: T-shirt images sent by the sensor
Stream processing: AI determines the color in real time based on the T-shirt images received from the sensor

Stream processing vs batch processing

This section compares stream processing and batch processing to clarify the characteristics of stream processing.

The difference between stream processing and batch processing is the difference in the time axis.

	Stream processing	Batch processing
Goal	Prioritizes real time	Prioritizes throughput
Processing timing	When streaming data occurs	When hardware resources are available
Processing time	A few milliseconds to a few seconds	A few minutes to a few hours
Use cases	Credit card fraud detection Real-time game world rankings IoT device data analysis	Nightly batch jobs Monthly store processing

As these use cases show, stream processing is used when data generated endlessly over time must be processed in real time.

For example, fraudulent credit card use must be detected and stopped as quickly as possible, even one second earlier. Processing it next month would be too late.

What is streaming data?

**Streaming data** is an unbounded data set. "Unbounded" here means that new records continue to arrive over time.

Streaming data is also called an “event stream” or “data stream.”

Besides being unbounded, streaming data has the following three characteristics:

Ordered
Immutable data records
Replayable (a desirable characteristic)

Ordered

Events, which are the individual records in streaming data, have an order.

For example, event 1, “transfer 2,000,000 won as salary,” and event 2, “withdraw 500,000 won,” have an order. If event 2 is processed first for an account with a balance of 0 won, the account will become overdrawn.

Immutable data records

Once an event occurs, it cannot be deleted or changed.

For example, the event “withdraw 500,000 won” cannot be deleted or modified later. To cancel this event, you create a new event such as “deposit 500,000 won.”

Replayable

Because streaming data is ordered and consists of immutable data records, it can be reproduced.

For example, from the following event list, you can find the current balance as well as the balance at any point in time.

Event 1: Open an account with a balance of 0 won
Event 2: Deposit 500,000 won
Event 3: Deposit 1,000,000 won
Event 4: Withdraw 500,000 won
Event 5: Deposit 500,000 won
Event 6: Withdraw 1,000,000 won

Reconstructing data in a past state from an event list is called materialization. If you materialize the “balance data” at the end of event 4, the result is “1,000,000 won.”

Time concepts in stream processing

Because stream processing can process streaming data based on time, it is important to define accurate time concepts. Stream processing uses the following three time concepts:

Event Time
Ingest Time
Processing Time

Stream processing time

Event Time

**Event Time** is the time when the record was created as the event occurred.

Event Time is used to analyze streaming data.

Peak access times for a website
Product types whose sales increase by hour

Ingest Time

**Ingest Time** is the time when an event arrives at the streaming data source.

Because stream processing runs in real time, “Event Time” and “Ingest Time” are usually almost the same.

If arrival is delayed by a network failure or a similar issue, a gap appears between “Event Time” and “Ingest Time.”

Processing Time

**Processing Time** is the time when the stream processing application actually processes the event.

If there is a large gap between “Processing Time” and “Ingest Time,” the application side may not be keeping up with stream processing.

Time Window

In stream processing, you can operate on streaming data using time-based windows.

A window operation works on records within a limited range. This limited range is called the window size.

In a time window, the window size is determined based on time, such as processing records accumulated every 10 seconds.

Time windows are mainly divided into the following three types according to how often the window moves, or advances:

	Window movement frequency (advance interval)	Duplicate processing
Tumbling window	Same as the window size.	No
Hopping window	Smaller than the window size.	Yes
Sliding window	Every time an event in the window changes.	May occur

Tumbling window

A tumbling window is a window type where the movement frequency and the window size are the same.

Its characteristic is that processing for events does not overlap.

Tumbling window

Hopping window

A hopping window is a window type where the window size is larger than the movement frequency.

Its characteristic is that processing for events overlaps.

Hopping window

Sliding window

A sliding window is a window type where movement occurs whenever an event within the window changes.

When no event in the window changes, no operation is performed. In the case of an SQL query, no result is output.

Sliding window

OSS and services for stream processing

Stream processing can be performed using the following two components:

A distributed streaming data source, which queues events
A distributed stream processing application, which performs stream processing on events in the queue

Distributed streaming data sources

Distributed streaming data sources mainly use the following Pub/Sub messaging queues:

Apache Kafka
Amazon Kinesis Data Streams

Engines used in distributed stream processing applications

Distributed stream processing applications mainly use the following distributed processing engines:

Kafka Streams (Apache Kafka Streams API)
Kinesis Data Analytics
Apache Flink
Apache Spark Streaming
Apache Storm
Apache Samza

https://databaseline.tech/an-overview-of-apache-streaming-technologies/

Company stream processing examples

Examples of companies using stream processing with Apache Kafka and Kafka Streams include the following: