What is a Big Data Analytics Platform? Building a Data Pipeline

What is Big Data?

Big data is data at a scale that ordinary software cannot process, typically measured in terabytes, petabytes, or exabytes.

The goal here is to visualize and analyze big data.

Software designed for small data, such as Excel or a traditional RDB on a single machine, cannot handle big data because it runs into hardware limits: CPU time is insufficient, memory cannot hold the data, and storage may not be large enough. To process big data, use software that splits the work into parts, processes them in parallel, and spreads them across multiple computers.
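
As a rough illustration of splitting work across workers, the following sketch partitions a list of numbers, sums each partition in a separate worker, and then combines the partial results. Threads inside one process stand in for separate machines here, which is a simplification made for brevity. The names `partition` and `parallel_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split values into at most n_parts contiguous chunks of similar size."""
    if not values:
        return []
    size = -(-len(values) // n_parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_workers=4):
    """Sum each chunk in its own worker, then combine the partial sums."""
    chunks = partition(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


data = list(range(1, 1001))
total = parallel_sum(data)
print(total)
```

A real big data system applies the same split, process, and combine pattern, but the chunks live on different computers and the framework also handles failures and data movement.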

Big data pipeline

Data Source

Big data pipeline: data source

A data source is the origin of the data to be analyzed.

Examples include web servers, IoT devices such as factory or vehicle sensors, and mobile devices. What counts as a data source depends on the viewpoint. From the viewpoint of a BI tool, for example, a NoSQL database, a DWH, or a data mart may be treated as a data source.

Data can be divided into structured, unstructured, and semi-structured data.

Structured Data

Structured data is two-dimensional table data with a defined schema, such as rows, columns, and data types.

Structured data is easy for computers to handle through RDBs and SQL. However, rows and columns must be defined in advance, data must be converted into the fixed schema before it is stored, and nested data usually has to be split into separate tables and recombined with joins.
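
The points above can be sketched with Python's built-in `sqlite3` module. The `users` and `orders` tables and their contents are made up for illustration. The schema is declared before any data is stored, and the one-to-many relationship between a user and their orders is expressed as two tables that are recombined with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema (columns and types) must exist before any row is inserted.
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id),
                         amount INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 500), (11, 1, 300), (12, 2, 200)])

# Related data lives in a second table and is brought back with a JOIN.
rows = conn.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(rows)
```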

Unstructured Data

Unstructured data is data that is not in a two-dimensional table format and does not have a defined schema.

Examples include music, photos, and text logs.

<6>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test

Unstructured data is easy to store as-is, but difficult to query with RDB and SQL because there is no schema.

Semi-structured Data

Semi-structured data is data that has no fixed table schema but whose elements carry attribute names. It is not necessarily table data.

Examples include JSON, XML, Avro, Parquet, and ORC. A log line can be converted into JSON by assigning a name to each element.

{
  "jsonPayload": {
    "priority": "6",
    "host": "192.168.0.1",
    "ident": "fluentd",
    "pid": "11111",
    "message": "[error] Syslog test"
  }
}

Semi-structured data has attributes, can be queried by compatible databases, and allows schema changes later. It can also contain nested related data.
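
One possible way to turn the syslog line shown earlier into the JSON document above is a regular expression with named groups. This is only a sketch: the pattern is written for that single example line, and the `parse_syslog` helper is hypothetical rather than taken from Fluentd or any other tool.

```python
import json
import re

LINE = "<6>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test"

# Each named group becomes an attribute name in the resulting JSON.
PATTERN = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<time>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<ident>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Assign a name to each element of one syslog line."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    fields = match.groupdict()
    fields.pop("time")  # the JSON example above does not include the timestamp
    return {"jsonPayload": fields}


record = parse_syslog(LINE)
print(json.dumps(record, indent=2))
```

Once the elements have names, a compatible database can filter on attributes such as `host` or `priority` instead of scanning raw text.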

ETL and ELT

Big data pipeline: ETL/ELT

ETL

ETL means Extract, Transform, and Load.

ELT

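ELT means Extract, Load, and Transform. It swaps the order of Transform and Load in ETL: raw data is loaded into the destination first and transformed there afterwards.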
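
The difference in ordering can be sketched with `sqlite3` standing in for the destination datastore. The `raw_events` data, the table names, and the `run_etl` and `run_elt` helpers are all assumptions made for illustration. In the ETL path the cleanup happens in application code before loading. In the ELT path the raw strings are loaded as-is and cleaned with SQL inside the store.

```python
import sqlite3

# Raw extracted records: amounts arrive as untrimmed strings.
raw_events = [("alice", "  500 "), ("bob", "200"), ("alice", "300")]


def run_etl(events):
    """ETL: transform in application code, then load the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
    cleaned = [(name, int(amount.strip())) for name, amount in events]  # Transform
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)        # Load
    return conn


def run_elt(events):
    """ELT: load raw rows first, then transform with SQL inside the store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", events)     # Load
    conn.execute("""
        CREATE TABLE sales AS
        SELECT customer, CAST(TRIM(amount) AS INTEGER) AS amount
        FROM raw_sales
    """)                                                                # Transform
    return conn


QUERY = "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
etl_result = run_etl(raw_events).execute(QUERY).fetchall()
elt_result = run_elt(raw_events).execute(QUERY).fetchall()
print(etl_result, elt_result)
```

Both paths end with the same `sales` table. The ELT path also keeps `raw_sales`, so the transformation can be rerun later with different rules.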

Transformation is usually not performed on the data source itself: production web servers should not be given extra load, IoT devices may not have enough resources, and mobile devices belong to customers.

ETL can be batch-oriented or streaming-oriented. Batch ETL prioritizes throughput and runs at intervals such as hourly or nightly. Streaming ETL prioritizes low latency and processes data as it is generated.

Extraction tools include Embulk for batch ETL and Fluentd, Beats, or the Kafka Producer API for streaming ETL. A Pub/Sub messaging system can be placed between Extract and Transform to distribute processing and to buffer bursts of data temporarily.
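
The buffering role of a messaging system can be sketched with an in-process `queue.Queue`. This is only an analogy: a real Pub/Sub system is a separate distributed service that typically persists messages. The `extract` and `transform` functions and the `burst` data are hypothetical.

```python
import queue
import threading

buffer = queue.Queue(maxsize=100)  # stands in for a Pub/Sub topic
SENTINEL = None                    # marks the end of the stream
processed = []


def extract(events):
    """Producer: publish events as fast as they arrive."""
    for event in events:
        buffer.put(event)          # blocks only if the buffer is full
    buffer.put(SENTINEL)


def transform():
    """Consumer: read events at its own pace and transform them."""
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        processed.append(event.upper())


burst = [f"event-{i}" for i in range(50)]
consumer = threading.Thread(target=transform)
consumer.start()
extract(burst)
consumer.join()
print(len(processed))
```

Because the producer and consumer only share the buffer, a sudden burst from the producer is absorbed by the queue instead of overwhelming the transform step.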

Transform processing can be done with SQL, pandas, Fluentd, Logstash, Kafka Streams, or Spark Streaming. The transformed data is then loaded into a NoSQL database, a data lake, a data warehouse, or a data mart.

Data Lake

Big data pipeline: data lake

A data lake is a storage repository that accumulates all types of data, including structured, unstructured, and semi-structured data, for future use.

Examples include Hadoop HDFS and Amazon S3. Because a data lake keeps large amounts of data in various formats, scalable storage is usually selected.

NoSQL Database

Big data pipeline: NoSQL

A NoSQL database is a distributed database specialized for specific purposes.

NoSQL databases generally relax some constraints of RDBs to pursue performance. Compared with RDBs, they often target low-latency processing, use flexible data models such as key-value, document, or graph, prefer client-side joins or nested data, and scale by adding nodes.
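
The contrast between a client-side join and nested data can be sketched with plain Python dictionaries standing in for a key-value store and a document store. No real database API is used here, and all names and records are made up for illustration.

```python
# Key-value model: every read is a lookup by key; there is no server-side JOIN.
users = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}
orders = {
    "o10": {"user_id": "u1", "amount": 500},
    "o11": {"user_id": "u1", "amount": 300},
    "o12": {"user_id": "u2", "amount": 200},
}


def join_on_client(users, orders):
    """Client-side join: the application resolves user_id to a name itself."""
    return [
        {"order_id": order_id,
         "name": users[order["user_id"]]["name"],
         "amount": order["amount"]}
        for order_id, order in orders.items()
    ]


# Document model: related orders are nested inside the user document,
# so a single lookup returns everything without any join.
user_docs = {
    "u1": {"name": "alice", "orders": [{"id": "o10", "amount": 500},
                                       {"id": "o11", "amount": 300}]},
    "u2": {"name": "bob", "orders": [{"id": "o12", "amount": 200}]},
}

joined = join_on_client(users, orders)
alice_total = sum(order["amount"] for order in user_docs["u1"]["orders"])
print(joined[0], alice_total)
```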

Common NoSQL models include in-memory key-value caches such as Memcached and Redis, key-value stores such as DynamoDB, wide-column stores such as Cassandra and HBase, graph databases such as Neo4j, and document stores such as Elasticsearch and MongoDB.

Data Warehouse

Big data pipeline: data warehouse

Data Warehouse (DWH)

A data warehouse is a database that stores and analyzes structured data.

A DWH is used for analytics, that is, online analytical processing (OLAP), while an RDB is commonly used for online transaction processing (OLTP). DWH systems generally use column-oriented storage and are suited to large-scale analysis.

Columnar Database

A columnar database stores the values of each column together in storage blocks, instead of storing each row together.

Column-oriented storage formats include ORC and Parquet. Columnar databases read only the columns a query needs, which makes analytical reads efficient. Writes are less efficient because adding a single row means appending a value to every column block, but compression is often very effective because values in the same column have similar types and patterns.
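
The read and compression behavior can be sketched in a few lines of Python. The `rows` data and the helper names are hypothetical, and run-length encoding is used only as one simple example of a compression scheme that benefits from similar adjacent values. The text above does not say which schemes ORC, Parquet, or any particular database use.

```python
from itertools import groupby

# Row-oriented layout: each record keeps all of its columns together.
rows = [
    {"id": 1, "country": "JP", "amount": 500},
    {"id": 2, "country": "JP", "amount": 300},
    {"id": 3, "country": "JP", "amount": 200},
    {"id": 4, "country": "US", "amount": 100},
    {"id": 5, "country": "US", "amount": 400},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])

# Similar values sit next to each other, so the column compresses well.
encoded_country = run_length_encode(columns["country"])
print(total_amount, encoded_country)
```

The sketch also hints at the write cost: appending one new record to `columns` means appending to three separate lists, one per column.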

Examples include Snowflake, Amazon Redshift, and Hadoop with Hive or Presto using ORC or Parquet.

Compared with a data lake, a DWH is optimized for analyzing current structured data quickly, while a data lake stores all data for possible future use.

Data Mart

Big data pipeline: data mart

A data mart is a database that stores structured data for analysis.

A data mart can be considered a smaller data warehouse. It usually serves a single department, contains only necessary data, and is fast for small analysis workloads. Examples include an RDB, a small DWH cluster, or CSV files.
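
As a sketch of how a data mart might be derived from a larger warehouse table, the following uses `sqlite3` to copy only one department's data, pre-aggregated by day, into its own small table. The table names, the `marketing` department, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for a warehouse table that holds data for every department.
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, day TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "2024-01-01", 500),
        ("marketing", "2024-01-01", 300),
        ("marketing", "2024-01-02", 200),
        ("finance", "2024-01-01", 900),
    ],
)

# The mart keeps only what one department needs, already aggregated.
conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT day, SUM(amount) AS total_amount
    FROM warehouse_sales
    WHERE department = 'marketing'
    GROUP BY day
""")

mart_rows = conn.execute(
    "SELECT day, total_amount FROM marketing_mart ORDER BY day"
).fetchall()
print(mart_rows)
```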

SQL Query Engine

Big data pipeline: SQL query engine

A SQL query engine aggregates and manipulates data using SQL.

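SQL query engines were created because users wanted to manipulate data more easily without writing programs. Examples include ksqlDB, Apache Flink SQL, Elasticsearch SQL, Presto, and Apache Hive. PartiQL is a SQL-compatible query language rather than an engine, and SQL Workbench/J and DBeaver are SQL clients that connect to such engines rather than engines themselves.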

BI Tool

Big data pipeline: BI tool

A BI tool visualizes data stored in a datastore.

Representative BI tools include Tableau, Grafana, Kibana, and QuickSight.

Datastore Summary

RDBs are used primarily for transaction processing on structured data with a fixed schema. NoSQL databases are often used for performance-focused workloads and flexible schemas. DWH systems are used for OLAP on structured or semi-structured data. Data lakes store structured, unstructured, and semi-structured data as-is, either schemaless or with schemas tracked in a catalog.