Apache HBase Overview

What is HBase?

Apache HBase is an open source distributed database management system developed and published by the Apache Software Foundation. It is one of the NoSQL systems, with a structure different from mainstream relational databases.

It is modeled on Google’s Bigtable, which was developed for internal use at Google, and reimplements a similar database management system as open source software.

HBase builds a database on HDFS (Hadoop Distributed File System), a distributed file system also developed and published by the ASF. Like a relational database, it structures data in tables, but unlike a typical RDBMS, it stores the values of each column family together on disk. This approach is called column-oriented (more precisely, column-family-oriented) storage.

Each row in a table stores data as pairs of qualifiers, which correspond to column names, and cells, which correspond to field values. Multiple qualifiers are grouped into a column family. Cell data is automatically versioned by timestamp, so earlier values of a cell can be retrieved, up to the number of versions the column family is configured to retain.
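
As an illustration of the row, column family, qualifier, and versioned cell structure described above, here is a minimal in-memory sketch in Python. It is not the HBase API: the class ToyTable, its put and get methods, and the as_of parameter are hypothetical names chosen for this example, and the sketch keeps every version instead of enforcing a retention limit.

```python
class ToyTable:
    """In-memory stand-in for one HBase-style table (hypothetical, not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> (family, qualifier) -> list of (timestamp, value), newest first
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cells = self.rows.setdefault(row, {})
        versions = cells.setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # keep newest first

    def get(self, row, family, qualifier, as_of=None):
        """Return the newest value, or the newest value with timestamp <= as_of."""
        versions = self.rows.get(row, {}).get((family, qualifier), [])
        for ts, value in versions:
            if as_of is None or ts <= as_of:
                return value
        return None


table = ToyTable(["info"])
table.put("user1", "info", "email", "old@example.com", ts=100)
table.put("user1", "info", "email", "new@example.com", ts=200)

latest = table.get("user1", "info", "email")            # newest version
past = table.get("user1", "info", "email", as_of=150)   # value as of timestamp 150
```

Here latest holds the value written at timestamp 200, while past holds the earlier value, because each cell is addressed by row key, column family, qualifier, and timestamp together.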

HBase is designed for a distributed environment where multiple server computers manage one database. It automatically shards data across multiple nodes, so applications do not need to know where the data is physically located. It also provides strong consistency, so clients do not read stale data while a write is being processed.

Database operations are performed not with SQL, as in relational databases, but by calling APIs such as the Java API or the RESTful API from an application. HBase also works well as a data source and sink for large-scale distributed processing frameworks such as Apache Hadoop MapReduce.

HBase is not a replacement for SQL databases, but it can add an SQL layer by using software such as Apache Phoenix. With an SQL layer, JDBC drivers and similar tools can be used to handle various analytical workloads more easily and in a familiar way.


HBase History

| Date | Event |
|---|---|
| November 2006 | Google published the Bigtable paper. |
| February 2007 | The initial HBase prototype was created as a Hadoop contribution. |
| October 2007 | The first usable HBase was released with Hadoop 0.15.0. |
| January 2008 | HBase became a subproject of Hadoop. |
| October 2008 | HBase 0.18.1 was released. |
| January 2009 | HBase 0.19.0 was released. |
| September 2009 | HBase 0.20.0 was released. |
| May 2010 | HBase became an Apache top-level project. |
| June 2010 | The first developer version, HBase 0.89.20100621, was released. |
| January 2011 | HBase 0.90.0 was released. |
| January 2012 | HBase 0.92.0 was released. |

HBase Features

The following are some important characteristics of Apache HBase.

Distributed

Besides the default standalone mode, Apache HBase can run in two distributed modes: pseudo-distributed mode and fully distributed mode. Pseudo-distributed mode runs all daemons on a single node and is used for testing, while fully distributed mode runs on a cluster of nodes and is used in production.

Big Data Store

Apache HBase is designed to store very large tables with billions of rows and millions of columns. Although it runs on HDFS, which is optimized for batch access to large files, HBase adds low-latency, real-time random reads and writes on top of it, and reads stay fast even when tables hold massive amounts of data.

Non-Relational

As already noted, Apache HBase is a non-relational database, so it does not follow the relational model. A relational database stores data in tables with a fixed set of columns and exposes that data through SQL. HBase does not fix the storage format in this way: only column families are declared up front, and columns can be added per row as requirements change, so schemas are flexible and extensible.

Flexible Data Model

Apache HBase provides a flexible data model for storing data in tables. A table has one or more column families. User data is stored in rows, which are collections of key/value pairs. In a table, each row is uniquely identified by a row key.

Scalable

Apache HBase scales horizontally by dividing the rows of a table into regions. A table can be stored across multiple regions, and when a region grows beyond a configured size, it is split into two regions around the middle row key.
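
The split behavior described above can be sketched with a small in-memory model. This is a simplified illustration under stated assumptions, not HBase's implementation: ToyRegion, ToyRegionTable, and max_rows_per_region are hypothetical names, and the sketch splits on row count, whereas HBase decides splits by region size.

```python
import bisect


class ToyRegion:
    """A contiguous range of row keys, identified by its inclusive start key."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key  # "" stands for the start of the table
        self.rows = dict(rows or {})


class ToyRegionTable:
    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region  # assumption: split on row count
        self.regions = [ToyRegion("")]       # kept sorted by start_key

    def _locate(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        starts = [region.start_key for region in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key, value):
        idx = self._locate(row_key)
        region = self.regions[idx]
        region.rows[row_key] = value
        if len(region.rows) > self.max_rows:
            self._split(idx)

    def get(self, row_key):
        return self.regions[self._locate(row_key)].rows.get(row_key)

    def _split(self, idx):
        # Split around the middle row key: lower half stays, upper half moves.
        region = self.regions[idx]
        keys = sorted(region.rows)
        mid_key = keys[len(keys) // 2]
        upper = ToyRegion(mid_key, {k: region.rows[k] for k in keys if k >= mid_key})
        region.rows = {k: region.rows[k] for k in keys if k < mid_key}
        self.regions.insert(idx + 1, upper)


table = ToyRegionTable(max_rows_per_region=4)
for i in range(10):
    table.put(f"row{i:02d}", i)

start_keys = [region.start_key for region in table.regions]
```

Because regions are kept sorted by start key, locating the region for a row key is a binary search, and the caller never needs to know which region holds a given row.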

HBase Functions

Apache HBase provides the following functions.

  • Supports linear and modular scalability.
  • Provides strict consistency for reads and writes.
  • Provides automatic and configurable table sharding.
  • Supports automatic failover between RegionServers.
  • Provides convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Provides an easy-to-use Java API for client access.
  • Provides block cache and Bloom filters for real-time queries.
  • Provides query predicate pushdown through server-side filters.
  • Provides an extensible JRuby-based (JIRB) shell.
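
To illustrate why Bloom filters in the list above help real-time queries, the following Python sketch shows the general technique: a compact bit array that can say "definitely not present" for a row key, so a read can skip a store file without opening it. This is a generic toy Bloom filter, not HBase's implementation; ToyBloomFilter, store_file_rows, and read_row are hypothetical names, and the bit count and hash count are arbitrary.

```python
import hashlib


class ToyBloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical contents of one on-disk store file, with a filter built over its keys.
store_file_rows = {"row-a": 1, "row-b": 2}
bloom = ToyBloomFilter()
for key in store_file_rows:
    bloom.add(key)


def read_row(row_key):
    if not bloom.might_contain(row_key):
        return None  # the file cannot contain this key, so skip reading it
    return store_file_rows.get(row_key)  # may still miss on a false positive
```

A key that was added is always reported as possibly present, so the filter never causes a missed read; the only cost of a false positive is one unnecessary file lookup.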

HBase Pros and Cons

HBase Advantages

  • Effective for reliably handling large volumes of data.
    • The master, which coordinates the distributed system, manages data consistency and guarantees consistency among replicated data.
  • Suitable for supporting large-scale data analysis processing.
    • Easy to use as MapReduce input.
    • Optimized for use with HDFS, MapReduce, and related tools.
  • If performance issues occur, performance can be restored simply by adding RegionServers.
  • Failover is easy, and management is convenient.

HBase Disadvantages

  • It is convenient as MapReduce input, but when used together with file input, CPU usage can rise and RegionServers can go down easily.
  • Tuning guidance for HBase settings exists, but because cluster sizes and hardware specifications differ, it is difficult to apply directly.
  • Regions for a specific table can easily become concentrated on a single RegionServer, leading to performance degradation.
    • Proper HBase design is required during configuration.

Differences Between RDBMS and HBase

The following table summarizes the major differences between relational databases and HBase.

| RDBMS | HBase |
|---|---|
| Stores data in tables. | Stores table data in regions. |
| Runs on local file systems such as FAT, NTFS, and ext. | Runs on HDFS. |
| Uses commit logs to store logs. | Uses WALs (write-ahead logs) to store logs. |
| Uses its own coordinator system. | Uses ZooKeeper for coordination. |
| Uses a primary key. | Uses a row key. |
| Partitioning is supported. | Sharding is supported. |
| Uses Row, Column, and Cell. | Uses Row, Column family, Column, and Cell. |

Differences Between HDFS and HBase

The following table summarizes the major differences between HDFS and HBase.

| HDFS | HBase |
|---|---|
| Provides a file system for distributed storage. | Provides tabular, column-oriented data storage. |
| Provides storage optimized for large files. | Provides storage optimized for tabular data. |
| Uses flat files. | Uses key-value pair data. |
| The data model is not flexible. | Provides a flexible data model. |
| Is a file system used together with a processing framework. | Is tabular storage with built-in Hadoop MapReduce support. |
| Mostly optimized for write-once-read-many workloads. | Optimized for many reads and writes. |

Differences Between Row-Oriented and Column-Oriented Data Stores

The following table summarizes the major differences between row-oriented and column-oriented data stores.

| Row-oriented data store | Column-oriented data store |
|---|---|
| Efficient for adding and modifying records. | Efficient for reading data. |
| Reads pages containing entire rows. | Reads only the required columns. |
| Best suited for OLTP systems. | Less suited to OLTP systems. |
| Serializes all values of a row together. | Serializes all values of a column together. |
| Stores rows in contiguous pages. | Stores the values of a column in contiguous pages. |

What is OLTP (Online Transaction Processing)? It refers to processing many short transactions against a database in real time, as they are issued by online users over a network, rather than in batches.
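
The layout difference in the comparison above can be made concrete with a short Python sketch. It is only a conceptual model using made-up sample records: row_layout and column_layout are hypothetical names, and Python lists stand in for pages on disk.

```python
records = [
    {"id": 1, "name": "ann", "age": 30},
    {"id": 2, "name": "bob", "age": 45},
    {"id": 3, "name": "cho", "age": 27},
]
columns = ["id", "name", "age"]

# Row-oriented: all fields of one record sit next to each other.
row_layout = [rec[col] for rec in records for col in columns]

# Column-oriented: all values of one column sit next to each other.
column_layout = {col: [rec[col] for rec in records] for col in columns}


def average_age_row_store(flat, width, age_index):
    # Must step through every record, skipping over the fields it does not need.
    ages = flat[age_index::width]
    return sum(ages) / len(ages)


def average_age_column_store(cols):
    # Reads only the single column the query asks for.
    ages = cols["age"]
    return sum(ages) / len(ages)


row_avg = average_age_row_store(row_layout, len(columns), columns.index("age"))
col_avg = average_age_column_store(column_layout)
```

Both functions return the same average, but the column-oriented version touches only the age values, which is why aggregate reads favor that layout, while appending a complete record is a single contiguous write only in the row-oriented layout.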

Pros and Cons of Column-Oriented Databases

The following are the pros and cons of column-oriented databases.

Advantages

  • Built-in support for efficient data compression.
  • Supports fast data retrieval.
  • Management and configuration are simplified.
  • Performs well on aggregate queries such as COUNT, SUM, AVG, MIN, and MAX.
  • Provides an automatic sharding mechanism that splits large regions into smaller ones, making partitioning efficient.

Disadvantages

  • JOIN queries and queries that span multiple tables are not optimized.
  • Frequent deletes and updates reduce storage efficiency because they require new partitions to be created.
  • Because of the non-relational design, planning partitions and indexes is very difficult.

HBase Use Cases

  • Used to guarantee data consistency or for analysis.
    • Example: Facebook, eBay, Adobe, LINE, etc.
  • Monitoring systems.
  • Tracking user actions.
  • Audit logging systems.
  • Real-time analytics.
    • Cases where Hadoop must be used to analyze large amounts of data.
  • Used as the primary storage for large-scale social networking services.
    • Message-oriented systems, such as Twitter-like messages and statuses.
  • Content management systems that serve content from HBase.
  • Standard use cases such as storing web pages during web crawling.
