Apache Hadoop

Apache Hadoop Overview

Hadoop is a framework implemented in Java for distributed storage and analysis of large-scale data. Hadoop originated from Google’s MapReduce and Google File System, which were distributed processing foundations for efficiently processing large amounts of data.

Google published papers about these systems in 2004, and Doug Cutting and Mike Cafarella developed Hadoop based on them. The name Hadoop came from the name Doug’s son gave to a yellow elephant stuffed toy. It was adopted because it had no meaning, was simple, and was not used elsewhere. The yellow elephant is also Hadoop’s mascot.

Hadoop character

Because Hadoop is a distributed processing platform, each process is divided across machines in a cluster (Map), and the results processed by each machine are aggregated (Reduce) to obtain the final result.
Recently, demand for data mining has increased, such as extracting target data from large amounts of data (BigData) or reading trends from stored data. There is also growing demand not only to process BigData, but also to produce such information in a shorter time.
Previously, dedicated products such as data warehouses were needed to process BigData. Hadoop makes this kind of data processing possible by connecting multiple ordinary server machines together, or scaling out.

Although a Hadoop system consists of multiple servers, distribution across multiple machines increases system flexibility. To improve processing performance, you only need to add systems to the Hadoop cluster. Because a Hadoop cluster can be composed of ordinary server machines, hardware procurement is easy. On the software side, a Hadoop cluster can be scaled up simply by installing and configuring Hadoop on servers added to the cluster. These characteristics make it highly scalable in both hardware and software.

Because cloud services now make it easy to start multiple servers, a Hadoop cluster can be built in the cloud only when data processing is needed. If performance is insufficient, servers can be added; if resources remain, servers can be reduced; and when a job is complete, all machines in the Hadoop cluster can be released. For this reason, Hadoop is expected to be used in more and more scenarios.

Until Hadoop version 1, MapReduce was the only parallel processing framework, but from Hadoop version 2, other parallel processing frameworks such as Storm, Spark, Tez, and Impala became available. Interfaces for processing data in Hadoop other than MapReduce (Java) also increased. For example, through Hive and Pig running on Impala and Tez, users can access data using queries almost equivalent to familiar SQL. In addition, Storm and Spark enable real-time data processing through streaming, making it possible to use Hadoop systems even with data outside HDFS.

Hadoop Features

Hadoop consists of the following four core modules.

Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Hadoop Common
Hadoop YARN

There are also the following modules that are separate Hadoop projects.

Apache Ozone
Apache Submarine

HDFS (Hadoop Distributed File System)

HDFS is Hadoop’s own distributed file system. To users it appears as one large file system, but it stores files across nodes. To prevent data loss if one node fails, the same data is stored on three nodes by default.

MapReduce

MapReduce is a framework for processing distributed data in parallel. In the Map step, each slave node processes its data, and in the Reduce step, the processing results distributed and executed across multiple nodes in the Map step are aggregated.

Hadoop Common

Hadoop Common is a set of utilities that support Hadoop functionality.

YARN (Yet Another Resource Negotiator)

Until Hadoop version 1, YARN was not an independent component, but in Hadoop version 2 it became an independent module dedicated to resource management. It can manage MapReduce resources and job scheduling, as well as resources for other distributed processing frameworks such as Giraph, Storm, Spark, Tez, and Impala.

Apache Ozone

Apache Ozone is a project for implementing distributed object storage in Hadoop. It is designed to scale to hundreds of billions of files and blocks, and also supports operation in container environments such as YARN and Kubernetes. It can be accessed using multiple protocols, including S3 and the Hadoop File System API. It was originally a Hadoop subproject, but became an independent Apache top-level project.

Apache Submarine

Apache Submarine is a project that enables deep learning applications such as TensorFlow, PyTorch, and MxNet to run on resource management platforms such as YARN. It was originally a Hadoop subproject, but became an independent Apache top-level project. It can be used with Hadoop 2.7.3 or later.

Hadoop Use Cases

Hadoop can use Apache Spark, which can process faster than MapReduce. For details, see https://openstandia.jp/solution/hadoop-spark/.

Hadoop Operating Environment

Hadoop is written in Java, so it requires a JVM. As of April 2022, Hadoop 3.3.2, the stable version at that time, supports Java 8 and Java 11. Any OS is acceptable as long as the JVM runs on it.

Operating Systems Where Hadoop Runs

Major Linux distributions
Windows
MacOSX

Hadoop has been confirmed to work normally on OpenJDK. Verification results for each JDK can be checked on the Hadoop Wiki page below.
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions

Hadoop provides both compiled binary packages and source versions that users compile themselves.
Compiled binary packages can be used immediately, but some settings cannot be extended, so users may need to build from source to enable the required features.

Hadoop License

Hadoop is one of Apache’s top-level projects.
The license is Apache License 2.0, and users are not restricted in using, distributing, modifying, or distributing derivative versions of the software.

Hadoop Official Site

Hadoop’s official site is the URL below. http://hadoop.apache.org/

The official Hadoop Wiki page also contains various information about Hadoop. https://cwiki.apache.org/confluence/display/HADOOP/Home

Hadoop Download

https://hadoop.apache.org/releases.html

Three Layers That Make Up Hadoop

The Hadoop architecture mainly consists of the following three layers.

Distributed processing engine (Hadoop uses Hadoop MapReduce)
Resource manager (Hadoop uses Hadoop YARN)
Distributed file system (Hadoop uses HDFS)

Hadoop also often uses query engines to access data.

Hadoop installs the above components on every computer and distributes data reading, writing, and processing.

Distributed Processing Engine

The distributed processing engine is the software group responsible for parallel distributed processing in Hadoop.

By default, a distributed processing engine called MapReduce runs.

MapReduce Processing Overview

MapReduce performs distributed processing as follows. - Map: Outputs input in key-value format and can distribute each Map by node - Shuffle: Sorts Map output - Reduce: Aggregates identical keys

Representative distributed processing engines have the following characteristics. Lower items are faster.

MapReduce: Writes intermediate results to HDFS storage
Tez: Writes intermediate results to YARN container storage
Spark: Writes intermediate results to memory

MapReduce appears to be a technology that will not disappear, but using Tez or Spark is recommended.

Resource Manager

The resource manager is responsible for managing resources such as CPU and memory in Hadoop.

The resource manager used by MapReduce is Hadoop YARN, which manages application-level containers.

Apache Mesos also exists and manages OS-level containers. It uses technologies such as Docker and Linux containers.

Distributed File System

The distributed file system is responsible for distributed data reading and writing in Hadoop. Distributed file systems used in Hadoop include the following.

HDFS: Hadoop’s standard file system
EMRFS: A file system that uses Amazon S3 as storage
MapR-FS: A file system that rewrites HDFS in C. It is fast.

Cloud Storage or Blob Storage also appear to be usable as storage, but it is unclear which distributed file system is used internally.

Hadoop Ecosystem List

Software that composes Hadoop beyond the defaults, or related surrounding software, is called the Hadoop ecosystem.

The Hadoop ecosystem can be combined as follows to perform various kinds of distributed processing.

Data warehouse configuration example: Hadoop + Tez + Hive
- Hive enables Hadoop to be operated with SQL.
Machine learning configuration example: Hadoop + Spark
- Spark’s in-memory processing is efficient for iterative processing that often occurs in machine learning.
Full-text search configuration example: Hadoop + Elasticsearch
- Elasticsearch for Apache Hadoop can be used to implement a full-text search service.
- An Elasticsearch cluster is used as Hadoop’s distributed file system.
Stream processing configuration example: each server and IoT device –> Kafka –> Hadoop
- Kafka is used to perform stream processing from multiple servers and IoT devices and aggregate data into Hadoop.

Representative Hadoop ecosystem and related systems and their functions are introduced below.

Hadoop ecosystem	Function implemented
Apache Accumulo	KVS-type NoSQL. Emphasizes security
Apache Atlas	Governance control and compliance support
Cascading	API that makes MapReduce easier to handle
Apache Drill	Distributed SQL engine for manipulating edge device data
Apache Falcon	Data lifecycle management
Apache Flume	Aggregates unstructured data from multiple data sources into Hadoop (stream data processing)
Apache HBase	KVS-type NoSQL
Apache Hive	Manipulates data with SQL-like HiveQL queries. Emphasizes fault tolerance. Implements DWH
Apache Hue	Works with Hadoop and the Hadoop ecosystem through a GUI
Apache Impala	Manipulates data with SQL-like Impala SQL queries. Emphasizes speed. Implements real-time processing
Apache Kafka	Aggregates unstructured data from multiple data sources into Hadoop (stream data processing). Difference from Flume is noted separately
Apache Knox	Centralized authentication and access management
Apache Mahout	Linear algebra, statistical analysis, and machine learning library
Apache Mesos	Resource manager that manages OS-level containers
Apache Oozie	Job scheduler
Apache Phoenix	Real-time RDB that uses HBase as a datastore
Apache Pig	Data processing (ETL) tool
Apache Ranger	Grants attribute-based access rights to authenticated users
Apache Sentry	Grants role-based access rights to authenticated users
Apache Slider	Controls YARN applications, such as killing long-running applications
Apache Solr	Full-text search, also used with Elasticsearch
Apache Spark	Processes machine learning, SQL operations, R language, and graphs in memory
Apache Sqoop	Imports and exports structured data between RDBMS and Hadoop
Apache Tez	Distributed processing framework faster than MapReduce
Presto	SQL query engine that outputs intermediate results to memory