Designing Data-Intensive Applications | Chapter 10. Batch Processing

Presenters: Jungsoo Kim, Sumin Park

Presentation materials

Alternatives to batch processing

MapReduce

  • Advantage: a fairly simple and clearly abstracted model on top of a distributed file system.
  • Disadvantage: complex operations are very difficult.

Materializing intermediate state in MapReduce

This is a method of storing, or materializing, the output (intermediate state) of each task in a batch processing workflow as files in a distributed file system.

(Complex operations are divided into multiple MapReduce jobs.)

  • Distributed file system: the mechanism that stores task output as files.
    • These files are used as input for other tasks, mainly to pass data between tasks.
  • Intermediate state: the state stored in those files.
    • Complex workflows can have a large number of intermediate states.
  • Materialization: the process of storing intermediate state as files.
    • Rather than generating a result only when a request arrives, it precomputes the result of a specific operation.

Disadvantages compared with Unix pipes:

  • A downstream task can start only after the preceding task has completed.
    • In Unix pipes, tasks proceed at the same time.
      • Data is consumed as soon as it is produced.
    • The overall workflow takes longer to complete.
  • Duplicate mapper work
    • A mapper reads files created by a reducer, then partitions and sorts them.
      • This causes unnecessary I/O.
      • It can be addressed by letting the reducer also perform the mapper’s role.
  • Intermediate state files are replicated across all machines in the distributed file system.
    • Although intermediate state files are temporary data, replication is an excessive cost caused by using a distributed file system.

Dataflow Engine

Engines for distributed batch processing appeared to solve MapReduce’s problems.

They model the flow of data explicitly.

  • Spark: framework
  • Tez: lightweight library
  • Flink: framework

Framework: provides its own network communication layer, scheduler, APIs, and so on.

https://trends.google.com/trends/explore?cat=32&q=Spark,Tez,Flink

Stream-table join

Characteristics:

  • A workflow is handled as a single job rather than being split into independent subtasks.
  • Like MapReduce, it calls user-defined functions in a single thread and processes records one at a time.
  • It partitions input data for parallel processing.
  • The output of one function is passed to another function as input through network copies.

Operator:

  • A function used for data processing.

Advantages over the MapReduce model:

  • High-cost operations such as sorting can be performed only when needed.
    • In MapReduce, sorting always occurs between map and reduce by default.
  • Locality optimization is possible.
    • Because the workflow is explicit, the engine can know which data is needed at which point.
    • Data can be passed between functions through shared memory instead of network copies. This reduces network I/O.
  • Intermediate state is stored on local disk, reducing I/O consumption.
    • If it is stored in HDFS (Hadoop Distributed File System), replication across multiple servers is required.
  • Operators run as soon as input appears.
    • They do not have to wait for the entire preceding function to finish.
  • A JVM does not need to be started for every operator execution.
    • MapReduce starts a new JVM for each task.

Dataflow engines can implement the same operations as MapReduce workflows and are generally much faster thanks to several optimizations.
Because operators generalize maps and reduces, a MapReduce workflow can often be converted into a dataflow workflow with little change. Check compatibility before conversion.

Fault tolerance

MapReduce can easily provide fault tolerance and durability because it materializes all intermediate state.
Because dataflow engines do not materialize intermediate state in HDFS, they use another method. This does not mean they do not use HDFS; the original data is in HDFS.

  • If input data remains locally, the engine recomputes it. If not, it fetches the original data from HDFS and computes it.

The engine tracks operations for recomputation.

  • It must track which input partition was used and which operator was used.
  • Spark uses the RDD (Resilient Distributed Dataset) abstraction.
    • RDD: an immutable distributed collection of multiple data elements.
  • Flink checkpoints operator state.

Idempotence must be guaranteed during recomputation.

Things to watch for when guaranteeing idempotence:

  • If a hash table is used for data lookup, the hash table does not guarantee a particular order.
  • If operation logic depends on random numbers, control randomness by using a fixed seed when random numbers are required.
  • If system time or external data is used, remove the causes that make idempotence impossible.

However, if the intermediate data is small or the operation uses a lot of CPU resources, materializing intermediate data may be more effective than recomputing it.

Discussion on materialization

Summary:

  • It is not necessary to wait for the preceding task to complete. Pipeline execution is possible.
  • Not all intermediate state has to be stored in HDFS.

Graphs and iterative processing

The need for batch processing on graph data models is growing.

  • PageRank: an algorithm for measuring the popularity of web pages.

In a dataflow engine, the data itself consists of typical relational tuples. The data flow from one operator to another is structured as a graph.

  • Operators are generally arranged as a directed acyclic graph.
  • The name “dataflow” can be misleading because it may sound as if it is not a graph.

Because data can be represented as relational tuples, such as edges and vertices, algorithms can be implemented in an iterative form.

  • This can be implemented with MapReduce, but it is inefficient.
  • MapReduce does not consider the iterative nature of algorithms.
    • It processes data as a one-time operation.
    • For example, it cannot naturally express logic such as “repeat this task until a condition is satisfied.”

The Pregel processing model

The bulk synchronous parallel (BSP) computation model is widely used as a way to optimize batch graph processing.

BSP implementations:

  • Apache Giraph
  • Spark GraphX
  • Flink Gelly API

BSP is also called the Pregel model. It was introduced in Google’s Pregel paper as a graph processing methodology and then became widely adopted.

Pregel = a graph parallel/distributed processing framework.

  • Handling large-scale graphs is very difficult. Even with an optimal algorithm, processing cost commonly grows exponentially.
  • It is used for distributed graph processing.
  • Because Pregel itself is a very broad topic, detailed explanations of Pregel and the book’s related content are omitted here.

High-level APIs and languages

Development of distributed batch processing engines:

  • They have become robust enough to store and process petabytes of data (1 PB = 1,000 TB) on clusters made up of tens of thousands of machines.
  • As the technology matured, efforts shifted toward expanding the range of problems that can be solved with it.

Spark and Flink also have their own high-level dataflow APIs.

  • High-level APIs reduce the amount of code required to implement operations.
  • They also support interactive use, allowing users to check code behavior immediately.

Conclusion: high-level APIs can improve system productivity and make work more efficient.

Moving toward declarative query languages

For operations that require joins, expressing joins as relational operators has the advantage that the framework can automatically decide which join algorithm is appropriate, rather than requiring the user to write join code. This is similar to joins in an RDB.

Hive, Spark, and Flink include these query optimizers.

  • Users do not need to know every join algorithm or decide which algorithm to use.
  • Through a declarative approach, the optimizer determines the best execution method.
  • MapReduce has the advantage that users can write code directly. However, it also has disadvantages such as CPU resource overhead. Understand these as characteristics of MapReduce and dataflow engines respectively.

Specialization for supporting diverse domains

There are many common cases where standardized processing patterns repeatedly appear. Therefore, implementing reusable common building blocks is important.

Batch processing is also becoming more important in statistics and numerical algorithms.

  • Statistics and numerical algorithms are needed to build machine learning applications such as classification and recommendation systems.

Technologies that provide reusable implementations in these fields:

  • Mahout: provides implementations of various machine learning algorithms that run on MapReduce, Spark, and Flink.
  • MADlib: an open source library that adds machine learning features to relational MPP (Massively Parallel Processing) databases.
  • K-nearest neighbor algorithm: a type of similarity search algorithm that finds items close to a given item in multidimensional space.
    • It is important in genetic analysis algorithms.

Batch processing systems have both built-in features and high-level declarative operators, so they are used to run algorithms required across a wide range of domains in a distributed way.