What Is Stream Processing? Introduction to OSS Engines

What is stream processing?

For example, imagine using sensors to sort T-shirts moving along a conveyor belt by color and pack them into boxes.

In this case, “streaming data” and “stream processing” mean the following:

  • Streaming data: T-shirt images sent by the sensor
  • Stream processing: AI determines the color in real time based on the T-shirt images received from the sensor

Stream processing vs batch processing

This section compares stream processing and batch processing to clarify the characteristics of stream processing.

The difference between stream processing and batch processing is the difference in the time axis.

Stream processing Batch processing
Goal Prioritizes real time Prioritizes throughput
Processing timing When streaming data occurs When hardware resources are available
Processing time A few milliseconds to a few seconds A few minutes to a few hours
Use cases Credit card fraud detection
Real-time game world rankings
IoT device data analysis
Nightly batch jobs
Monthly store processing

As these use cases show, stream processing is used when data generated endlessly over time must be processed in real time.

For example, fraudulent credit card use must be detected and stopped as quickly as possible, even one second earlier. Processing it next month would be too late.

What is streaming data?

Streaming data is also called an “event stream” or “data stream.”

Besides being unbounded, streaming data has the following three characteristics:

  • Ordered
  • Immutable data records
  • Replayable (a desirable characteristic)

Ordered

Events, which are the individual records in streaming data, have an order.

For example, event 1, “transfer 2,000,000 won as salary,” and event 2, “withdraw 500,000 won,” have an order. If event 2 is processed first for an account with a balance of 0 won, the account will become overdrawn.

Immutable data records

Once an event occurs, it cannot be deleted or changed.

For example, the event “withdraw 500,000 won” cannot be deleted or modified later. To cancel this event, you create a new event such as “deposit 500,000 won.”

Replayable

Because streaming data is ordered and consists of immutable data records, it can be reproduced.

For example, from the following event list, you can find the current balance as well as the balance at any point in time.

  1. Event 1: Open an account with a balance of 0 won
  2. Event 2: Deposit 500,000 won
  3. Event 3: Deposit 1,000,000 won
  4. Event 4: Withdraw 500,000 won
  5. Event 5: Deposit 500,000 won
  6. Event 6: Withdraw 1,000,000 won

Reconstructing data in a past state from an event list is called materialization. If you materialize the “balance data” at the end of event 4, the result is “1,000,000 won.”

Time concepts in stream processing

Because stream processing can process streaming data based on time, it is important to define accurate time concepts. Stream processing uses the following three time concepts:

  • Event Time
  • Ingest Time
  • Processing Time

Stream processing time

Event Time

Event Time is used to analyze streaming data.

  • Peak access times for a website
  • Product types whose sales increase by hour

Ingest Time

Because stream processing runs in real time, “Event Time” and “Ingest Time” are usually almost the same.

If arrival is delayed by a network failure or a similar issue, a gap appears between “Event Time” and “Ingest Time.”

Processing Time

If there is a large gap between “Processing Time” and “Ingest Time,” the application side may not be keeping up with stream processing.

Time Window

In stream processing, you can operate on streaming data using time-based windows.

In a time window, the window size is determined based on time, such as processing records accumulated every 10 seconds.

Time windows are mainly divided into the following three types according to how often the window moves, or advances:

Window movement frequency (advance interval) Duplicate processing
Tumbling window Same as the window size. No
Hopping window Smaller than the window size. Yes
Sliding window Every time an event in the window changes. May occur

Tumbling window

A tumbling window is a window type where the movement frequency and the window size are the same.

Its characteristic is that processing for events does not overlap.

Tumbling window

Hopping window

A hopping window is a window type where the window size is larger than the movement frequency.

Its characteristic is that processing for events overlaps.

Hopping window

Sliding window

A sliding window is a window type where movement occurs whenever an event within the window changes.

When no event in the window changes, no operation is performed. In the case of an SQL query, no result is output.

Sliding window

OSS and services for stream processing

Stream processing can be performed using the following two components:

  • A distributed streaming data source, which queues events
  • A distributed stream processing application, which performs stream processing on events in the queue

Distributed streaming data sources

Distributed streaming data sources mainly use the following Pub/Sub messaging queues:

  • Apache Kafka
  • Amazon Kinesis Data Streams

Engines used in distributed stream processing applications

Distributed stream processing applications mainly use the following distributed processing engines:

  • Kafka Streams (Apache Kafka Streams API)
  • Kinesis Data Analytics
  • Apache Flink
  • Apache Spark Streaming
  • Apache Storm
  • Apache Samza

apache-streaming-technologies
https://databaseline.tech/an-overview-of-apache-streaming-technologies/

Company stream processing examples

Examples of companies using stream processing with Apache Kafka and Kafka Streams include the following: