<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>devkuma – BigData</title>
    <link>https://www.devkuma.com/en/tags/bigdata/</link>
    <image>
      <url>https://www.devkuma.com/en/tags/bigdata/logo/180x180.jpg</url>
      <title>BigData</title>
      <link>https://www.devkuma.com/en/tags/bigdata/</link>
    </image>
    <description>Recent content in BigData on devkuma</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <managingEditor>kc@example.com (kc kim)</managingEditor>
    <webMaster>kc@example.com (kc kim)</webMaster>
    <copyright>The devkuma</copyright>
    
	  <atom:link href="https://www.devkuma.com/en/tags/bigdata/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Apache Hadoop</title>
      <link>https://www.devkuma.com/en/docs/apache-hadoop/</link>
      <pubDate>Sat, 24 Dec 2022 10:22:54 +0900</pubDate>
      <author>kc@example.com (kc kim)</author>
      <guid>https://www.devkuma.com/en/docs/apache-hadoop/</guid>
      <description>
        
        
        &lt;h2 id=&#34;apache-hadoop-overview&#34;&gt;Apache Hadoop Overview&lt;/h2&gt;
&lt;p&gt;Hadoop is a framework implemented in Java for distributed storage and analysis of large-scale data.
Hadoop originated from Google&amp;rsquo;s MapReduce and Google File System, which were distributed processing foundations for efficiently processing large amounts of data.&lt;/p&gt;
&lt;p&gt;Google published papers about these systems in 2003 (Google File System) and 2004 (MapReduce), and Doug Cutting and Mike Cafarella developed Hadoop based on them. The name Hadoop comes from the name that Doug Cutting&amp;rsquo;s son gave to his yellow stuffed toy elephant. It was adopted because it had no meaning, was simple, and was not used elsewhere. The yellow elephant is also Hadoop&amp;rsquo;s mascot.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.devkuma.com/docs/apache-hadoop/apache-hadoop.png&#34; alt=&#34;Hadoop character&#34;&gt;&lt;/p&gt;
&lt;p&gt;Because Hadoop is a distributed processing platform, each job is split across the machines in a cluster (Map), and the results processed by each machine are aggregated (Reduce) to obtain the final result.&lt;br&gt;
Recently, demand for data mining has increased, such as extracting target data from large amounts of data (BigData) or reading trends from stored data. There is also growing demand not only to process BigData, but also to produce such information in a shorter time.&lt;br&gt;
Previously, dedicated products such as data warehouses were needed to process BigData. Hadoop makes this kind of data processing possible by connecting multiple ordinary server machines together, that is, by scaling out.&lt;/p&gt;
&lt;p&gt;A Hadoop system consists of multiple servers, and distributing work across them makes the system flexible. To improve processing performance, you only need to add machines to the Hadoop cluster. Because a Hadoop cluster can be composed of ordinary server machines, hardware procurement is easy. On the software side, a Hadoop cluster can be scaled out simply by installing and configuring Hadoop on the servers added to the cluster. These characteristics make it highly scalable in both hardware and software.&lt;/p&gt;
&lt;p&gt;Because cloud services now make it easy to start multiple servers, a Hadoop cluster can be built in the cloud only when data processing is needed. If performance is insufficient, servers can be added; if capacity is left over, servers can be removed; and when a job is complete, all machines in the Hadoop cluster can be released. For this reason, Hadoop is expected to be used in more and more scenarios.&lt;/p&gt;
&lt;p&gt;In Hadoop version 1, MapReduce was the only parallel processing framework, but from Hadoop version 2 onward, other parallel processing frameworks such as Storm, Spark, Tez, and Impala became available. Interfaces for processing data in Hadoop other than MapReduce (Java) also increased. For example, Hive running on Tez, or Impala, lets users access data with queries almost equivalent to familiar SQL, and Pig provides its own scripting language, Pig Latin. In addition, Storm and Spark enable real-time data processing through streaming, making it possible to use Hadoop systems even with data outside HDFS.&lt;/p&gt;
&lt;h2 id=&#34;hadoop-features&#34;&gt;Hadoop Features&lt;/h2&gt;
&lt;p&gt;Hadoop consists of the following four core modules.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hadoop Distributed File System (HDFS)&lt;/li&gt;
&lt;li&gt;Hadoop MapReduce&lt;/li&gt;
&lt;li&gt;Hadoop Common&lt;/li&gt;
&lt;li&gt;Hadoop YARN&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also the following related projects, which started within Hadoop and are now separate projects.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Apache Ozone&lt;/li&gt;
&lt;li&gt;Apache Submarine&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;hdfs-hadoop-distributed-file-system&#34;&gt;HDFS (Hadoop Distributed File System)&lt;/h3&gt;
&lt;p&gt;HDFS is Hadoop&amp;rsquo;s own distributed file system. To users it appears as one large file system, but it stores files across nodes. To prevent data loss if one node fails, the same data is stored on three nodes by default.&lt;/p&gt;
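&lt;p&gt;The storage cost of replication can be estimated with simple arithmetic. The following is a minimal Python sketch, assuming a 128 MB block size (an assumption, since the text above does not state one) and the default replication factor of three. The function names hdfs_footprint and place_replicas are hypothetical, and the round-robin placement is only a toy stand-in for the rack-aware placement that HDFS actually performs.&lt;/p&gt;

```python
import math

# Assumed values for illustration: 128 MB is an assumed block size,
# and 3 is the default replication factor mentioned in the text.
BLOCK_SIZE_MB = 128
REPLICATION = 3


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block count, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


def place_replicas(block_index, nodes, replication=REPLICATION):
    """Toy round-robin placement; real HDFS placement is rack-aware."""
    return [nodes[(block_index + i) % len(nodes)] for i in range(replication)]


blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)
print(place_replicas(0, ["node1", "node2", "node3", "node4"]))
```

&lt;p&gt;For a 1,024 MB file, the sketch reports 8 blocks, 24 block replicas, and 3,072 MB of raw storage, which shows why usable capacity is roughly one third of raw disk capacity at the default setting.&lt;/p&gt;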
&lt;h3 id=&#34;mapreduce&#34;&gt;MapReduce&lt;/h3&gt;
&lt;p&gt;MapReduce is a framework for processing distributed data in parallel. In the Map step, each worker node processes its own portion of the data, and in the Reduce step, the results produced across the nodes in the Map step are aggregated.&lt;/p&gt;
&lt;h3 id=&#34;hadoop-common&#34;&gt;Hadoop Common&lt;/h3&gt;
&lt;p&gt;Hadoop Common is a set of utilities that support Hadoop functionality.&lt;/p&gt;
&lt;h3 id=&#34;yarn-yet-another-resource-negotiator&#34;&gt;YARN (Yet Another Resource Negotiator)&lt;/h3&gt;
&lt;p&gt;In Hadoop version 1, resource management was built into MapReduce and YARN did not exist as an independent component. In Hadoop version 2, YARN became an independent module dedicated to resource management. It can manage MapReduce resources and job scheduling, as well as resources for other distributed processing frameworks such as Giraph, Storm, Spark, Tez, and Impala.&lt;/p&gt;
&lt;h3 id=&#34;apache-ozone&#34;&gt;Apache Ozone&lt;/h3&gt;
&lt;p&gt;Apache Ozone is a project for implementing distributed object storage in Hadoop. It is designed to scale to hundreds of billions of files and blocks, and also supports operation in container environments such as YARN and Kubernetes. It can be accessed using multiple protocols, including S3 and the Hadoop File System API. It was originally a Hadoop subproject, but became an independent Apache top-level project.&lt;/p&gt;
&lt;h3 id=&#34;apache-submarine&#34;&gt;Apache Submarine&lt;/h3&gt;
&lt;p&gt;Apache Submarine is a project that enables deep learning applications such as TensorFlow, PyTorch, and MXNet to run on resource management platforms such as YARN. It was originally a Hadoop subproject, but became an independent Apache top-level project. It can be used with Hadoop 2.7.3 or later.&lt;/p&gt;
&lt;h2 id=&#34;hadoop-use-cases&#34;&gt;Hadoop Use Cases&lt;/h2&gt;
&lt;p&gt;Hadoop can use Apache Spark, which can process faster than MapReduce. For details, see &lt;a href=&#34;https://openstandia.jp/solution/hadoop-spark/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://openstandia.jp/solution/hadoop-spark/&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;hadoop-operating-environment&#34;&gt;Hadoop Operating Environment&lt;/h2&gt;
&lt;p&gt;Hadoop is written in Java, so it requires a JVM. As of April 2022, Hadoop 3.3.2, the stable version at that time, supports Java 8 and Java 11. Any OS is acceptable as long as the JVM runs on it.&lt;/p&gt;
&lt;h3 id=&#34;operating-systems-where-hadoop-runs&#34;&gt;Operating Systems Where Hadoop Runs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Major Linux distributions&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;macOS&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hadoop has been confirmed to work normally on OpenJDK. Verification results for each JDK can be checked on the Hadoop Wiki page below.&lt;br&gt;
&lt;a href=&#34;https://cwiki.apache.org/confluence/display/HADOOP/Hadoop&amp;#43;Java&amp;#43;Versions&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hadoop provides both compiled binary packages and source versions that users compile themselves.&lt;br&gt;
Compiled binary packages can be used immediately, but some settings cannot be extended, so users may need to build from source to enable the required features.&lt;/p&gt;
&lt;h2 id=&#34;hadoop-license&#34;&gt;Hadoop License&lt;/h2&gt;
&lt;p&gt;Hadoop is one of Apache&amp;rsquo;s top-level projects.&lt;br&gt;
The license is Apache License 2.0, which permits users to use, modify, and distribute the software and derivative works, subject to the notice and attribution conditions of the license.&lt;/p&gt;
&lt;h2 id=&#34;hadoop-official-site&#34;&gt;Hadoop Official Site&lt;/h2&gt;
&lt;p&gt;Hadoop&amp;rsquo;s official site is the URL below.
&lt;a href=&#34;http://hadoop.apache.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;http://hadoop.apache.org/&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The official Hadoop Wiki page also contains various information about Hadoop.
&lt;a href=&#34;https://cwiki.apache.org/confluence/display/HADOOP/Home&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://cwiki.apache.org/confluence/display/HADOOP/Home&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;hadoop-download&#34;&gt;Hadoop Download&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://hadoop.apache.org/releases.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://hadoop.apache.org/releases.html&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;three-layers-that-make-up-hadoop&#34;&gt;Three Layers That Make Up Hadoop&lt;/h2&gt;
&lt;p&gt;The Hadoop architecture mainly consists of the following three layers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Distributed processing engine (Hadoop uses Hadoop MapReduce)&lt;/li&gt;
&lt;li&gt;Resource manager (Hadoop uses Hadoop YARN)&lt;/li&gt;
&lt;li&gt;Distributed file system (Hadoop uses HDFS)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hadoop also often uses query engines to access data.&lt;/p&gt;
&lt;p&gt;Hadoop installs the above components on every computer and distributes data reading, writing, and processing.&lt;/p&gt;
&lt;h3 id=&#34;distributed-processing-engine&#34;&gt;Distributed Processing Engine&lt;/h3&gt;
&lt;p&gt;The distributed processing engine is the software group responsible for parallel distributed processing in Hadoop.&lt;/p&gt;
&lt;p&gt;By default, a distributed processing engine called MapReduce runs.&lt;/p&gt;
&lt;div class=&#34;alert alert-primary&#34; role=&#34;alert&#34;&gt;&lt;div class=&#34;h4 alert-heading&#34; role=&#34;heading&#34;&gt;MapReduce Processing Overview&lt;/div&gt;
&lt;p&gt;MapReduce performs distributed processing as follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Map: Converts input records into key-value pairs. Map tasks can be distributed across nodes.&lt;/li&gt;
&lt;li&gt;Shuffle: Sorts the Map output and groups it by key.&lt;/li&gt;
&lt;li&gt;Reduce: Aggregates the values that share the same key.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
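&lt;p&gt;The three steps above can be sketched in plain Python as a word count. This is only a single-process illustration of the data flow, not Hadoop code: the function names map_phase, shuffle_phase, reduce_phase, and word_count are hypothetical, and on a real cluster the Map and Reduce tasks would run on different nodes.&lt;/p&gt;

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    # On a real cluster each line (or input split) would be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(mapped))


print(word_count(["big data big cluster", "data node"]))
```

&lt;p&gt;Because each call to map_phase depends only on its own line, the Map step can be spread across machines freely. Only the Shuffle step needs to bring together the pairs that share a key.&lt;/p&gt;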

&lt;p&gt;Representative distributed processing engines have the following characteristics. Engines lower in the list are generally faster.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MapReduce: Writes intermediate results to disk (Map output to local disk, and results passed between chained jobs to HDFS)&lt;/li&gt;
&lt;li&gt;Tez: Writes intermediate results to YARN container storage&lt;/li&gt;
&lt;li&gt;Spark: Keeps intermediate results in memory where possible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MapReduce is unlikely to disappear, but using Tez or Spark is generally recommended.&lt;/p&gt;
&lt;h3 id=&#34;resource-manager&#34;&gt;Resource Manager&lt;/h3&gt;
&lt;p&gt;The resource manager is responsible for managing resources such as CPU and memory in Hadoop.&lt;/p&gt;
&lt;p&gt;The resource manager used by MapReduce is Hadoop YARN, which manages application-level containers.&lt;/p&gt;
&lt;p&gt;Apache Mesos also exists and manages OS-level containers. It uses technologies such as Docker and Linux containers.&lt;/p&gt;
&lt;h3 id=&#34;distributed-file-system&#34;&gt;Distributed File System&lt;/h3&gt;
&lt;p&gt;The distributed file system is responsible for distributed data reading and writing in Hadoop. Distributed file systems used in Hadoop include the following.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HDFS: Hadoop&amp;rsquo;s standard file system&lt;/li&gt;
&lt;li&gt;EMRFS: A file system that uses Amazon S3 as storage&lt;/li&gt;
&lt;li&gt;MapR-FS: An HDFS-compatible file system implemented in native code (C/C++) rather than Java, designed for speed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Object storage services such as Google Cloud Storage and Azure Blob Storage can also be used as storage through Hadoop-compatible file system connectors, rather than through a separate distributed file system such as HDFS.&lt;/p&gt;
&lt;h2 id=&#34;hadoop-ecosystem-list&#34;&gt;Hadoop Ecosystem List&lt;/h2&gt;
&lt;p&gt;Software that composes Hadoop beyond the defaults, or related surrounding software, is called the Hadoop ecosystem.&lt;/p&gt;
&lt;p&gt;The Hadoop ecosystem can be combined as follows to perform various kinds of distributed processing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data warehouse configuration example: Hadoop + Tez + Hive
&lt;ul&gt;
&lt;li&gt;Hive enables Hadoop to be operated with SQL.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Machine learning configuration example: Hadoop + Spark
&lt;ul&gt;
&lt;li&gt;Spark&amp;rsquo;s in-memory processing is efficient for iterative processing that often occurs in machine learning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Full-text search configuration example: Hadoop + Elasticsearch
&lt;ul&gt;
&lt;li&gt;Elasticsearch for Apache Hadoop can be used to implement a full-text search service.&lt;/li&gt;
&lt;li&gt;The Elasticsearch cluster serves as the data store that Hadoop jobs read from and write to, rather than as a file system.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Stream processing configuration example: each server and IoT device &amp;ndash;&amp;gt; Kafka &amp;ndash;&amp;gt; Hadoop
&lt;ul&gt;
&lt;li&gt;Kafka is used to perform stream processing from multiple servers and IoT devices and aggregate data into Hadoop.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Representative Hadoop ecosystem and related systems and their functions are introduced below.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Hadoop ecosystem&lt;/th&gt;
          &lt;th&gt;Function implemented&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Accumulo&lt;/td&gt;
          &lt;td&gt;KVS-type NoSQL. Emphasizes security&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Atlas&lt;/td&gt;
          &lt;td&gt;Governance control and compliance support&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cascading&lt;/td&gt;
          &lt;td&gt;API that makes MapReduce easier to handle&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Drill&lt;/td&gt;
          &lt;td&gt;Schema-free distributed SQL query engine for files, NoSQL stores, and cloud storage&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Falcon&lt;/td&gt;
          &lt;td&gt;Data lifecycle management&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Flume&lt;/td&gt;
          &lt;td&gt;Aggregates unstructured data from multiple data sources into Hadoop (stream data processing)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache HBase&lt;/td&gt;
          &lt;td&gt;KVS-type NoSQL&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Hive&lt;/td&gt;
          &lt;td&gt;Manipulates data with SQL-like HiveQL queries. Emphasizes fault tolerance. Implements DWH&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Hue&lt;/td&gt;
          &lt;td&gt;Works with Hadoop and the Hadoop ecosystem through a GUI&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Impala&lt;/td&gt;
          &lt;td&gt;Manipulates data with SQL-like Impala SQL queries. Emphasizes speed. Implements real-time processing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Kafka&lt;/td&gt;
          &lt;td&gt;Aggregates unstructured data from multiple data sources into Hadoop (stream data processing). Difference from Flume is noted separately&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Knox&lt;/td&gt;
          &lt;td&gt;Centralized authentication and access management&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Mahout&lt;/td&gt;
          &lt;td&gt;Linear algebra, statistical analysis, and machine learning library&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Mesos&lt;/td&gt;
          &lt;td&gt;Resource manager that manages OS-level containers&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Oozie&lt;/td&gt;
          &lt;td&gt;Job scheduler&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Phoenix&lt;/td&gt;
          &lt;td&gt;Real-time RDB that uses HBase as a datastore&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Pig&lt;/td&gt;
          &lt;td&gt;Data processing (ETL) tool&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Ranger&lt;/td&gt;
          &lt;td&gt;Grants attribute-based access rights to authenticated users&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Sentry&lt;/td&gt;
          &lt;td&gt;Grants role-based access rights to authenticated users&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Slider&lt;/td&gt;
          &lt;td&gt;Deploys and controls long-running applications on YARN, including stopping them&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Solr&lt;/td&gt;
          &lt;td&gt;Full-text search engine, widely used alongside Elasticsearch as an alternative&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Spark&lt;/td&gt;
          &lt;td&gt;Processes machine learning, SQL operations, R language, and graphs in memory&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Sqoop&lt;/td&gt;
          &lt;td&gt;Imports and exports structured data between RDBMS and Hadoop&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apache Tez&lt;/td&gt;
          &lt;td&gt;Distributed processing framework faster than MapReduce&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Presto&lt;/td&gt;
          &lt;td&gt;SQL query engine that outputs intermediate results to memory&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

      </description>
      
      <category>BigData</category>
      
      <category>Apache Hadoop</category>
      
    </item>
    
    <item>
      <title>Apache Spark</title>
      <link>https://www.devkuma.com/en/docs/apache-spark/</link>
      <pubDate>Sat, 24 Dec 2022 10:22:54 +0900</pubDate>
      <author>kc@example.com (kc kim)</author>
      <guid>https://www.devkuma.com/en/docs/apache-spark/</guid>
      <description>
        
        
        &lt;h2 id=&#34;apache-spark-overview&#34;&gt;Apache Spark Overview&lt;/h2&gt;
&lt;p&gt;Apache Spark is a &lt;strong&gt;distributed processing framework for cluster computing&lt;/strong&gt; that handles large-scale workloads such as big data analytics and machine learning. Spark development began in 2009 at AMPLab at the University of California, Berkeley, by Matei Zaharia, who was also a Hadoop committer. It is now managed and developed as one of the top-level projects of the Apache Software Foundation.&lt;/p&gt;
&lt;p&gt;Spark was developed to improve the slow processing speed of traditional MapReduce and to support a more flexible processing style that is not bound to repeatedly using Map and Reduce.&lt;/p&gt;
&lt;p&gt;Spark can run independently as a distributed processing framework, so it is attracting attention as a post-Hadoop technology. At the same time, it can also be used as a replacement for MapReduce within the Hadoop core system, which consists of MapReduce, HDFS, YARN, and related components.&lt;/p&gt;
&lt;h2 id=&#34;key-features-of-apache-spark&#34;&gt;Key Features of Apache Spark&lt;/h2&gt;
&lt;p&gt;Spark&amp;rsquo;s major features include the ability to easily program flexible processing models using the concise APIs provided by Spark, and the ability to process large-scale data in far less time than traditional MapReduce.&lt;/p&gt;
&lt;p&gt;In traditional MapReduce, Map and Reduce had to be performed as one processing model, so applications running on Hadoop had to be developed according to that style. This made it difficult to develop flexible processing models.&lt;/p&gt;
&lt;p&gt;MapReduce also writes processing results to disk after each Map and Reduce operation, making it difficult to improve processing speed. In contrast, Spark can run multiple Map operations consecutively on datasets loaded into memory (RDDs), reduce the results, and then run the next Map operation on the same dataset while keeping it in memory rather than writing it to disk. For this reason, Spark can sometimes achieve more than 100 times faster processing than MapReduce, although it can also write results to disk like traditional MapReduce.&lt;/p&gt;
&lt;p&gt;Spark&amp;rsquo;s features are as follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Speed
&lt;ul&gt;
&lt;li&gt;Fast in-memory processing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Ease of Use
&lt;ul&gt;
&lt;li&gt;Ease of use through support for various languages such as Java, Scala, Python, R, and SQL&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Generality
&lt;ul&gt;
&lt;li&gt;Provides various components such as SQL, Streaming, machine learning, and graph computation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Run Everywhere
&lt;ul&gt;
&lt;li&gt;Can run on various clusters such as YARN, Mesos, and Kubernetes&lt;/li&gt;
&lt;li&gt;Supports various file formats and storage systems such as HDFS, Cassandra, and HBase&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;apache-spark-component-structure&#34;&gt;Apache Spark Component Structure&lt;/h2&gt;
&lt;p&gt;Spark is a distributed processing framework consisting of the following components.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spark Core, including Scala, Java, Python, and R APIs&lt;/li&gt;
&lt;li&gt;Spark SQL + DataFrames&lt;/li&gt;
&lt;li&gt;Spark Streaming&lt;/li&gt;
&lt;li&gt;MLlib&lt;/li&gt;
&lt;li&gt;GraphX&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;spark-core&#34;&gt;Spark Core&lt;/h3&gt;
&lt;p&gt;Spark keeps data to be processed in the form of RDDs (Resilient Distributed Datasets).&lt;br&gt;
An RDD is an immutable collection that is partitioned across the computers in a cluster and can be operated on in parallel.&lt;br&gt;
In the Spark programming model, processing is performed by applying various methods provided by Spark Core to these RDDs. When developers manipulate RDDs through the APIs provided by Spark Core, they can perform distributed processing without being conscious of distributed data.&lt;br&gt;
This is one of Spark&amp;rsquo;s strengths: it makes flexible processing easy to program.&lt;br&gt;
Spark Core APIs are provided not only for Scala, Spark&amp;rsquo;s development language, but also for Java, Python, and R as standard APIs. Some third-party libraries also provide Spark API access from Clojure, a functional language that runs on the JVM like Scala, and APIs for other languages are expected to continue increasing.&lt;/p&gt;
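&lt;p&gt;The idea that a caller works with one collection while the framework handles the partitions can be sketched in plain Python. This is a toy model that uses local threads instead of cluster nodes and is not the Spark API. The names partition and parallel_map_sum are hypothetical.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into roughly equal slices, one per (pretend) worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def parallel_map_sum(data, func, num_partitions=4):
    """Apply func to every element per partition, then combine partial sums."""
    parts = partition(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(lambda part: sum(func(x) for x in part), parts))
    return sum(partial_sums)


# The caller passes a whole collection and never touches the partitions.
print(parallel_map_sum(list(range(10)), lambda x: x * x))
```

&lt;p&gt;The caller only supplies the data and a function. How the data is split and where each slice is processed stays hidden inside parallel_map_sum, which mirrors how RDD operations hide data distribution from the developer.&lt;/p&gt;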
&lt;h3 id=&#34;spark-sql--dataframes&#34;&gt;Spark SQL + DataFrames&lt;/h3&gt;
&lt;p&gt;In addition to manipulating RDDs through Spark APIs, Spark can also use a SQL-like language called Spark SQL to manipulate abstract datasets called DataFrames, which have named columns like database tables.&lt;br&gt;
This is an interface that allows users who have not learned languages such as Scala, Java, Python, or R to process data with Spark if they have SQL knowledge.&lt;/p&gt;
&lt;h3 id=&#34;spark-streaming&#34;&gt;Spark Streaming&lt;/h3&gt;
&lt;p&gt;Spark Streaming is an engine that provides real-time distributed processing for streaming data continuously sent to Spark.&lt;br&gt;
Apache Storm is a similar framework for processing streaming data. While Apache Storm is specialized for processing streaming data record by record, Spark Streaming processes streams as a series of small batches (micro-batches) on the Spark engine.&lt;br&gt;
Apache Flink is another streaming processing framework. Because it can also perform batch processing and includes machine learning and graph processing libraries, it has a structure quite similar to Spark and is considered a Spark competitor.&lt;/p&gt;
&lt;p&gt;Apache Storm&lt;br&gt;
&lt;a href=&#34;http://storm.apache.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;http://storm.apache.org/&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Apache Flink&lt;br&gt;
&lt;a href=&#34;http://flink.apache.org/introduction.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;http://flink.apache.org/introduction.html&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;mllib&#34;&gt;MLlib&lt;/h3&gt;
&lt;p&gt;MLlib is Spark&amp;rsquo;s machine learning library. It allows users to write programs that perform machine learning using Spark&amp;rsquo;s flexible processing style.&lt;br&gt;
Before this, Mahout existed as software for machine learning in cooperation with Hadoop, but Hadoop + Mahout required machine learning programs to be written using the MapReduce programming model, which caused slower processing.&lt;br&gt;
Spark can process faster than Hadoop, and machine learning using Spark and MLlib is attracting attention because it is efficient.&lt;/p&gt;
&lt;p&gt;Apache Mahout&lt;br&gt;
&lt;a href=&#34;http://mahout.apache.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;http://mahout.apache.org/&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&#34;graphx&#34;&gt;GraphX&lt;/h3&gt;
&lt;p&gt;GraphX provides APIs for parallel processing of graph data through Spark.&lt;br&gt;
It enables parallel processing of graph data with Spark&amp;rsquo;s fast processing speed.&lt;/p&gt;
&lt;p&gt;The components described above do not include a storage component.
Instead, Spark can read from and write to various existing storage systems. The following are some storage systems that can be integrated with Spark, including through third-party libraries.&lt;/p&gt;
&lt;p&gt;HDFS, Cassandra, HBase, S3, MongoDB, Couchbase, Riak, Neo4j, OrientDB&lt;/p&gt;
&lt;p&gt;Readable data sources also vary widely, from files such as CSV and XML to search engines such as Solr and Elasticsearch.&lt;/p&gt;
&lt;p&gt;List of packages for integration between Spark and various data sources&lt;br&gt;
&lt;a href=&#34;https://spark-packages.org/?q=tags%3A%22Data%20Sources%22&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://spark-packages.org/?q=tags%3A%22Data%20Sources%22&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In addition to packages that enable integration with various data sources, various packages for extending the existing Spark ecosystem are also provided.
These packages are published as Spark Packages at the following site.&lt;br&gt;
&lt;a href=&#34;https://spark-packages.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;https://spark-packages.org/&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;apache-spark-operating-environment&#34;&gt;Apache Spark Operating Environment&lt;/h2&gt;
&lt;p&gt;Spark officially supports the following OS environments. Java must also be installed to run Spark.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Major Linux distributions&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;macOS&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The language versions supported by the Spark APIs are as follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Java 8, 11, 17 (versions below Java 8u201 are deprecated in Spark 3.2.0)&lt;/li&gt;
&lt;li&gt;Scala 2.12, 2.13 (applications using the Scala API with Spark 3.3.0 must use a compatible Scala 2.12.x version)&lt;/li&gt;
&lt;li&gt;Python 3.7 or later (with Python 3.9, Apache Arrow and pandas UDFs might not work)&lt;/li&gt;
&lt;li&gt;R 3.5 or later&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;apache-spark-license-type&#34;&gt;Apache Spark License Type&lt;/h2&gt;
&lt;p&gt;Spark is one of Apache&amp;rsquo;s top-level projects.&lt;br&gt;
The license is Apache License 2.0, which permits users to use, modify, and distribute the software and derivative works, subject to the notice and attribution conditions of the license.&lt;/p&gt;
&lt;h2 id=&#34;apache-spark-reference-information&#34;&gt;Apache Spark Reference Information&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://spark.apache.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Spark official site&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Spark official documentation&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.databricks.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Spark community site&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The community site is provided by Databricks, a company founded by the original developers of Spark.&lt;/p&gt;

      </description>
      
      <category>BigData</category>
      
      <category>Apache Spark</category>
      
    </item>
    
    <item>
      <title>PySpark Concepts and Key Features</title>
      <link>https://www.devkuma.com/en/docs/pyspark/</link>
      <pubDate>Fri, 06 Jan 2023 12:36:13 +0900</pubDate>
      <author>kc@example.com (kc kim)</author>
      <guid>https://www.devkuma.com/en/docs/pyspark/</guid>
      <description>
        
        
        &lt;h2 id=&#34;what-is-pyspark&#34;&gt;What Is PySpark?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/api/python/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PySpark&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt; is the Python API for Apache Spark, an open source distributed computing framework and set of libraries for large-scale and real-time data processing. If you are already familiar with Python and libraries such as Pandas, PySpark is a good way to learn how to build more scalable analytics and pipelines.&lt;/p&gt;
&lt;p&gt;Apache Spark is essentially a compute engine that handles huge datasets by processing them in parallel, in batches, across a cluster. Spark is written in Scala, and PySpark was released so that Spark can be used from Python. In addition to providing Python APIs for Spark, PySpark uses the Py4J library to let Python code work with RDDs (Resilient Distributed Datasets) that live in the JVM.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://www.devkuma.com/docs/apache-spark/python-spark-pyspark.png&#34; alt=&#34;Apache Spark and Python logos&#34;&gt;&lt;/p&gt;
&lt;p&gt;The main data type used in PySpark is the Spark DataFrame. This object can be thought of as a table distributed across a cluster, and it has functionality similar to data frames in R and Pandas. To perform distributed computation with PySpark, you must run operations on Spark DataFrames rather than other Python data types.&lt;/p&gt;
&lt;p&gt;One of the main differences between Pandas and Spark DataFrames is eager execution versus lazy execution. In PySpark, operations are delayed until results are actually requested in the pipeline. For example, you can specify a job that loads a dataset from Amazon S3 and applies multiple transformations to a DataFrame, but those operations are not applied immediately. Instead, the transformation graph is recorded, and when the data is actually needed, such as when writing results back to S3, the transformations are applied as a single pipeline job. This approach is used to avoid bringing the entire DataFrame into memory and to enable more effective processing across a cluster of systems. With Pandas DataFrames, everything is brought into memory and all Pandas operations are applied immediately.&lt;/p&gt;
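&lt;p&gt;The difference between recording transformations and executing them can be illustrated with a small toy class in plain Python. This is a sketch of the concept only, not the PySpark API. The class name LazyFrame and its methods are hypothetical, and real Spark plans are optimized and distributed rather than run in a simple loop.&lt;/p&gt;

```python
class LazyFrame:
    """Toy model of lazy execution; this is not the PySpark API."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []  # recorded transformations, not yet applied

    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + [("filter", predicate)])

    def select(self, func):
        return LazyFrame(self._rows, self._plan + [("select", func)])

    def collect(self):
        """Action: run the whole recorded plan in a single pass over the rows."""
        result = []
        for row in self._rows:
            keep = True
            for kind, func in self._plan:
                if kind == "filter":
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                result.append(row)
        return result


calls = []
frame = LazyFrame([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).select(
    lambda x: calls.append(x) or x * 10
)
print(calls)            # nothing has run yet
print(frame.collect())  # the plan executes only now
```

&lt;p&gt;Calling filter and select only extends the recorded plan. The functions run when collect is called, which plays the role of an action such as writing results back to storage.&lt;/p&gt;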
&lt;h2 id=&#34;pyspark-features-and-libraries&#34;&gt;PySpark Features and Libraries&lt;/h2&gt;
&lt;p&gt;Py4J is a widely used library integrated into PySpark that enables Python to dynamically interface with JVM (Java Virtual Machine) objects. PySpark provides many libraries for writing efficient programs. It is also compatible with various external libraries, including the following.&lt;/p&gt;
&lt;h3 id=&#34;pysparksql&#34;&gt;PySparkSQL&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PySparkSQL&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt; is a PySpark library that applies SQL-like analysis to large volumes of structured or semi-structured data. SQL queries can also be used with PySparkSQL.&lt;/p&gt;
&lt;h3 id=&#34;mllib&#34;&gt;MLlib&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://spark.apache.org/docs/latest/ml-guide.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;MLlib&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt; is Spark&amp;rsquo;s machine learning (ML) library, and PySpark provides a Python wrapper for it. MLlib supports many machine learning algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.&lt;/p&gt;
&lt;h3 id=&#34;graphframes&#34;&gt;GraphFrames&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://graphframes.github.io/graphframes/docs/_site/index.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GraphFrames&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt; is a graph processing library that provides a set of APIs for efficiently performing graph analysis using PySpark Core and PySparkSQL. It is optimized for fast distributed computing.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;For data engineers who know Python but not Scala, PySpark is much easier to use than Spark&amp;rsquo;s native Scala API, but it also has drawbacks. PySpark errors mix Java stack traces with references to Python code, so debugging PySpark applications can be very difficult.&lt;/p&gt;
&lt;p&gt;Spark also involves more processing overhead and a more complex setup than other data processing options. Alternatives such as Ray and Dask have emerged recently. Because Dask is a pure Python framework, most data engineers can start using it immediately.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.databricks.com/kr/glossary/pyspark&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PySpark – Databricks&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dominodatalab.com/data-science-dictionary/pyspark&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;What is PySpark? | Domino Data Science Dictionary&lt;i class=&#34;fas fa-external-link-alt&#34;&gt;&lt;/i&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
      
      <category>BigData</category>
      
      <category>Apache Spark</category>
      
      <category>PySpark</category>
      
    </item>
    
  </channel>
</rss>
