Apache Spark

kc@example.com (kc kim) — Sat, 24 Dec 2022 10:22:54 +0900

Apache Spark Overview

Apache Spark is a distributed processing framework for cluster computing, built for large-scale workloads such as big data analytics and machine learning. Development began in 2009 at the AMPLab at the University of California, Berkeley, led by Matei Zaharia, who was also a Hadoop committer. Spark is now maintained and developed as a top-level project of the Apache Software Foundation.

Spark was developed to improve the slow processing speed of traditional MapReduce and to support a more flexible processing style that is not bound to repeatedly using Map and Reduce.

Spark can run independently as a distributed processing framework, so it is attracting attention as a post-Hadoop technology. At the same time, it can also be used as a replacement for MapReduce within the Hadoop core system, which consists of MapReduce, HDFS, YARN, and related components.

Key Features of Apache Spark

Spark’s major features include the ability to easily program flexible processing models using the concise APIs provided by Spark, and the ability to process large-scale data in far less time than traditional MapReduce.

In traditional MapReduce, Map and Reduce had to be performed as one processing model, so applications running on Hadoop had to be developed according to that style. This made it difficult to develop flexible processing models.

MapReduce also writes processing results to disk after each Map and Reduce operation, making it difficult to improve processing speed. In contrast, Spark can run multiple Map operations consecutively on datasets loaded into memory (RDDs), reduce the results, and then run the next Map operation on the same dataset while keeping it in memory rather than writing it to disk. For this reason, Spark can sometimes achieve more than 100 times faster processing than MapReduce, although it can also write results to disk like traditional MapReduce.
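The difference in intermediate storage can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not how Hadoop or Spark are implemented: the function names `run_with_disk_between_stages` and `run_in_memory` are hypothetical, JSON files in a temporary directory stand in for HDFS, and a Python list stands in for an RDD.

```python
import json
import tempfile
from pathlib import Path


def run_with_disk_between_stages(data, stages):
    """MapReduce-style toy: persist every intermediate result before the next stage."""
    disk_writes = 0
    with tempfile.TemporaryDirectory() as workdir:
        current = list(data)
        for i, stage in enumerate(stages):
            result = [stage(x) for x in current]
            path = Path(workdir) / f"stage_{i}.json"
            path.write_text(json.dumps(result))      # write the stage output to disk
            disk_writes += 1
            current = json.loads(path.read_text())   # the next stage reads it back
    return current, disk_writes


def run_in_memory(data, stages):
    """Spark-style toy: chain the stages on a dataset that stays in memory."""
    current = list(data)
    for stage in stages:
        current = [stage(x) for x in current]
    return current, 0


# Three consecutive map-like stages (illustrative only).
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

disk_result, disk_writes = run_with_disk_between_stages(range(5), stages)
memory_result, memory_writes = run_in_memory(range(5), stages)

print(disk_result == memory_result, disk_writes, memory_writes)
```

Both functions produce the same values; the only difference is the number of round trips to storage, which is the cost Spark avoids by keeping the dataset in memory between operations.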

Spark’s features are as follows.

Speed
- Fast in-memory processing
Ease of Use
- Supports multiple languages, including Java, Scala, Python, R, and SQL
Generality
- Provides various components such as SQL, Streaming, machine learning, and graph computation
Run Everywhere
- Can run on various clusters such as YARN, Mesos, and Kubernetes
- Supports various file formats and storage systems such as HDFS, Cassandra, and HBase

Apache Spark Component Structure

Spark is a distributed processing framework consisting of the following components.

Spark Core, including Scala, Java, Python, and R APIs
Spark SQL + DataFrames
Spark Streaming
MLlib
GraphX

Spark Core

Spark keeps data to be processed in the form of RDDs (Resilient Distributed Datasets).
An RDD is an immutable collection that is distributed across the machines of a cluster and can be processed in parallel.
In the Spark programming model, processing is performed by applying the methods provided by Spark Core to these RDDs. When developers manipulate RDDs through the Spark Core APIs, they can perform distributed processing without having to manage how the data is distributed.
This is one of Spark’s strengths: it makes flexible processing easy to program.
Spark Core APIs are provided not only for Scala, Spark’s development language, but also for Java, Python, and R as standard APIs. Some third-party libraries also provide Spark API access from Clojure, a functional language that runs on the JVM like Scala, and APIs for other languages are expected to continue increasing.

Spark SQL + DataFrames

In addition to manipulating RDDs through Spark APIs, Spark can also use a SQL-like language called Spark SQL to manipulate abstract datasets called DataFrames, which have named columns like database tables.
This is an interface that allows users who have not learned languages such as Scala, Java, Python, or R to process data with Spark if they have SQL knowledge.

Spark Streaming

Spark Streaming is an engine that provides real-time distributed processing for streaming data continuously sent to Spark.
Apache Storm is a similar framework for processing streaming data. While Apache Storm is specialized for stream processing, Spark Streaming is one engine within the general-purpose Spark framework that handles real-time data.
Apache Flink is another streaming processing framework. Because it can also perform batch processing and includes machine learning and graph processing libraries, it has a structure quite similar to Spark and is considered a Spark competitor.

Apache Storm
http://storm.apache.org/

Apache Flink
http://flink.apache.org/introduction.html

MLlib

MLlib is Spark’s machine learning library. It allows users to write programs that perform machine learning using Spark’s flexible processing style.
Before this, Mahout existed as software for machine learning in cooperation with Hadoop, but Hadoop + Mahout required machine learning programs to be written using the MapReduce programming model, which caused slower processing.
Because Spark can process data faster than Hadoop MapReduce, machine learning with Spark and MLlib is attracting attention as a more efficient option.

Apache Mahout
http://mahout.apache.org/

GraphX

GraphX provides APIs for parallel processing of graph data through Spark.
It enables parallel processing of graph data with Spark’s fast processing speed.

The components described above do not include a storage component. Instead, Spark reads from and writes to various existing storage systems. The following are some of the storage systems that can be integrated with Spark, in some cases through third-party libraries.

HDFS, Cassandra, HBase, S3, MongoDB, Couchbase, Riak, Neo4j, OrientDB

Readable data sources also vary widely, from files such as CSV and XML to search engines such as Solr and Elasticsearch.

List of packages for integration between Spark and various data sources
https://spark-packages.org/?q=tags%3A%22Data%20Sources%22

In addition to packages that enable integration with various data sources, various packages for extending the existing Spark ecosystem are also provided. These packages are published as Spark Packages at the following site.
https://spark-packages.org/

Apache Spark Operating Environment

Spark officially supports the following OS environments. Java must also be installed to run Spark.

Windows
Linux (major distributions)
macOS

The language versions supported by Spark's APIs are as follows.

Java 8, 11, 17 (support for Java 8 versions before 8u201 is deprecated as of Spark 3.2.0)
Scala 2.12, 2.13 (the Spark 3.3.0 Scala API requires a compatible Scala 2.12.x version)
Python 3.7 or later (with Python 3.9, Apache Arrow optimization and pandas UDFs might not work)
R 3.5 or later

Apache Spark License Type

Spark is one of Apache’s top-level projects.
The license is Apache License 2.0, which allows users to use, distribute, and modify the software and to distribute derivative versions, subject to the terms of the license.

Apache Spark Reference Information

Reference information for Spark is provided by Databricks, a company founded by Spark's developers.

PySpark Concepts and Key Features

kc@example.com (kc kim) — Fri, 06 Jan 2023 12:36:13 +0900

What Is PySpark?

PySpark is the Python API for Apache Spark, an open source distributed computing framework and set of libraries for real-time, large-scale data processing. If you are already familiar with Python and libraries such as Pandas, PySpark is a good way to learn how to build more scalable analytics and pipelines.

Apache Spark is essentially a compute engine that handles huge datasets by processing them in parallel and in batches. Spark is written in Scala, and PySpark was released so that Spark could be used from Python. In addition to providing the Spark APIs, PySpark uses the Py4J library to interface with RDDs (Resilient Distributed Datasets).

The main data type used in PySpark is the Spark DataFrame. This object can be thought of as a table distributed across a cluster, and it has functionality similar to data frames in R and Pandas. To perform distributed computation with PySpark, you must run operations on Spark DataFrames rather than other Python data types.

One of the main differences between Pandas and Spark DataFrames is eager execution versus lazy execution. In PySpark, operations are delayed until results are actually requested in the pipeline. For example, you can specify a job that loads a dataset from Amazon S3 and applies multiple transformations to a DataFrame, but those operations are not applied immediately. Instead, the transformation graph is recorded, and when the data is actually needed, such as when writing results back to S3, the transformations are applied as a single pipeline job. This approach is used to avoid bringing the entire DataFrame into memory and to enable more effective processing across a cluster of systems. With Pandas DataFrames, everything is brought into memory and all Pandas operations are applied immediately.
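The deferred execution described above can be modeled with a small toy class in plain Python. This is a sketch of the idea only and not Spark's implementation: `LazyPipeline` and `load_rows` are hypothetical names, and a Python `range` stands in for a dataset loaded from storage such as S3.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark class)."""

    def __init__(self, load, steps=()):
        self._load = load            # callable that produces the source rows
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    def map(self, fn):
        # Record the transformation and return a new pipeline; nothing runs yet.
        return LazyPipeline(self._load, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._load, self._steps + (("filter", pred),))

    def collect(self):
        """Action: run the whole recorded plan in a single streaming pass."""
        rows = iter(self._load())
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


load_calls = []


def load_rows():
    load_calls.append("loaded")  # record when the source is actually read
    return range(10)


pipeline = LazyPipeline(load_rows).map(lambda x: x * 10).filter(lambda x: x >= 50)
calls_before_action = len(load_calls)  # the source has not been read yet
result = pipeline.collect()            # the action triggers loading and all steps
calls_after_action = len(load_calls)

print(calls_before_action, result, calls_after_action)
```

Building the pipeline only records the steps; the source is read once, when `collect()` asks for results. An eager library such as Pandas would instead load the data and apply each step as soon as it is written.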

PySpark Features and Libraries

Py4J is a widely used library integrated into PySpark that enables Python to dynamically interface with JVM (Java Virtual Machine) objects. PySpark provides many libraries for writing efficient programs. It is also compatible with various external libraries, including the following.

PySparkSQL

PySparkSQL is a PySpark library that applies SQL-like analysis to large volumes of structured or semi-structured data. SQL queries can also be used with PySparkSQL.

MLlib

MLlib is Spark's machine learning (ML) library, which PySpark exposes through a Python wrapper. MLlib supports many machine learning algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as basic optimization primitives.

GraphFrames

GraphFrames is a graph processing library that provides a set of APIs for efficiently performing graph analysis using PySpark Core and PySparkSQL. It is optimized for fast distributed computing.

Conclusion

For data engineers who know Python but not Scala, PySpark is much easier to use than Spark's native Scala API, but it also has drawbacks. PySpark error messages mix Java stack traces with references to Python code, so debugging PySpark applications can be very difficult.

Spark also has more processing overhead and a more complex setup than some other data processing options. Alternatives such as Ray and Dask have emerged recently. Because Dask is a pure Python framework, most data engineers can start using it immediately.
