HBase Data Model

Data structure

HBase values have no fixed type; all data is stored as byte arrays (byte[]).
Rows in an HBase table are sorted in ascending order by a unique row key, and this row key is used when reading and writing table values.
As mentioned earlier, data managed by HBase is divided into Regions by contiguous row key ranges, and a table consists of multiple Regions. HBase columns are also grouped into units called column families.
Table data is split by Region and by column family and then written to files. The files are separate, but files that belong to the same Region are handled by the same RegionServer.

Figure: HBase table

Every cell value in HBase carries a timestamp, which is how versions are managed. Data is stored in files in the following format.

Row(Row Key): Column family: Column: timestamp: value

Data model

The Apache HBase data model is a distributed, multidimensional, persistent, sorted map indexed by row key, column key, and timestamp. This is why Apache HBase is also called a key-value storage system.
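As a rough illustration of the sorted map idea, the following Python sketch models a table as a dictionary keyed by (row key, column key, timestamp). The sample cells and the `sorted_cells` helper are invented for illustration and are not part of any HBase API. The sketch lists versions of the same cell newest first, which is an assumption about ordering made here for readability.

```python
# Toy model of the logical layout: a map from
# (row key, column key, timestamp) to an uninterpreted byte value.
cells = {
    (b"row2", b"Customer:Name", 1): b"Bob",
    (b"row1", b"Sales:Amount", 2): b"150",
    (b"row1", b"Customer:Name", 1): b"Alice",
    (b"row1", b"Sales:Amount", 1): b"100",
}

def sorted_cells(cells):
    """Return (key, value) pairs ordered by row key, then column key,
    then timestamp descending (newest version first)."""
    return sorted(cells.items(),
                  key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]))

for (row, column, ts), value in sorted_cells(cells):
    print(row, column, ts, value)
```

Because the whole key is part of the sort order, all cells of one row are adjacent, and within a row all cells of one column are adjacent.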

The basic unit of HBase is a column. Columns form a Column Family, and Column Families form a table. Each Row in a table has a Row Key and can be identified by it. See the figure below.

Figure: HBase column family

This table has two column families: Customer and Sales.
The Customer column family has two columns: Name and City.
The Sales column family has two columns: Product and Amount.
A Row consists of Row Key, Customer CF, and Sales CF.
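One way to picture such a row is as a nested map: row key, then column family, then column. The sketch below is only a mental model in Python; the row key "101" and all values are made up, and `get_cell` and `columns_of` are hypothetical helper names.

```python
# Hypothetical row for the Customer/Sales table described above.
table = {
    "101": {                                   # Row Key
        "Customer": {"Name": "John", "City": "Seoul"},
        "Sales": {"Product": "Chair", "Amount": "400"},
    },
}

def get_cell(table, row_key, family, column):
    """Look up one value by row key, column family, and column."""
    return table[row_key][family][column]

def columns_of(table, row_key):
    """List the fully qualified 'family:column' names stored in a row."""
    return sorted(f"{family}:{column}"
                  for family, cols in table[row_key].items()
                  for column in cols)
```

For example, `get_cell(table, "101", "Customer", "City")` walks the three levels in the same order that a cell is addressed.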

The following are data model terms used in Apache HBase.

Table

An Apache HBase table consists of multiple Rows. Table names are made of characters that are safe to use in file system paths.

A set of Rows, with a Row Key and multiple column families.
In schema definition, only Column Families are defined.

Row

Apache HBase stores data based on Rows, and each Row has a unique Row Key. The Row Key is represented as a byte array.

Row Key

Because data is sorted and stored by Row Key, the Row Key must be designed so that data is distributed appropriately. A common goal of Row Key design is to store related Rows near each other.

Row Keys are sorted lexicographically as arbitrary bytes.
An empty byte array denotes both the start and the end of the table.
Anything can be a row key, including strings, integers, binary data, and serialized data structures.
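Because row keys are compared byte by byte, numeric keys written as decimal strings do not sort numerically. The following sketch, using only the Python standard library, contrasts string-encoded integers with a fixed-width big-endian encoding. The `encode_int` helper is a made-up name for illustration, not an HBase function.

```python
import struct

def encode_int(n):
    """Encode a non-negative integer as 8 big-endian bytes so that
    byte-wise order matches numeric order."""
    return struct.pack(">Q", n)

numbers = [9, 10, 100, 2]

# Decimal strings compare byte by byte, so b"10" sorts before b"2".
as_text = sorted(str(n).encode() for n in numbers)

# Fixed-width big-endian keys sort in numeric order.
as_binary = sorted(encode_int(n) for n in numbers)
decoded = [struct.unpack(">Q", key)[0] for key in as_binary]
```

The same reasoning applies to any key type: whatever encoding is chosen, its byte order decides where the row lands in the table.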

A common Row Key pattern is based on website domain names. If domains are stored in reverse order, such as org.apache.mail and org.apache.jira, data for Apache domains is stored close together.
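The effect of reversing domain names can be checked with a short Python sketch. The domain list and the `reverse_domain` helper are illustrative assumptions; only the sorting behavior matters here.

```python
def reverse_domain(domain):
    """Turn 'mail.apache.org' into the row key b'org.apache.mail'."""
    return ".".join(reversed(domain.split("."))).encode()

domains = [
    "mail.apache.org",
    "www.example.com",
    "jira.apache.org",
    "docs.example.com",
]

# Sorting the reversed keys groups hosts of the same domain together.
row_keys = sorted(reverse_domain(d) for d in domains)
```

With the original, non-reversed names, `jira.apache.org` and `mail.apache.org` would be separated by `docs.example.com` in sorted order.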

Column

A Column consists of a Column Family and a Column Qualifier.

Column Family

A Column Family groups related Columns and defines how their data is physically stored in Apache HBase.

A Column Family name consists of printable characters because it is used in file system paths. Every row in a table has the same column families, but a row does not need to store data in every column family.

Column Qualifier

A Column Qualifier provides an index for data stored in a Column Family. Column Qualifiers are not fixed in advance, so arbitrary qualifiers can be added when data is written. A Column Family can therefore be thought of as a map from qualifier to value.

A Column Family is a group of Columns whose members all share the same prefix.
For example, NOSQL:Cassandra and NOSQL:HBASE are member columns of the NOSQL column family.
The column family prefix must consist of printable characters.
Column Families are part of the table schema definition and must be specified up front.
All column family members are physically stored together in the file system.
New column family members (Columns) can be added dynamically.

Cell

A Cell is the unit of data identified by the combination of Row Key, Column Family, and Column Qualifier. The value stored in each column of a row is called a cell.

A data cell includes a value and a timestamp, which is the version of the value. Because there is a timestamp, previous values are stored together and maintained for a certain period.

A Cell is identified by the tuple of Row Key, Column, and Version.
The value is an arbitrary byte array, stored together with its Timestamp.
Versioning applies to Cells only.

Timestamp

Each value written to a Cell is assigned a version, so several versions of the same data can coexist in one Cell. A version is identified by the Timestamp assigned when the value is written. If no Timestamp is specified when writing data, the current time is used.
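The versioning behavior can be sketched with a small in-memory class. This is a toy model, not the HBase client API: `VersionedCell`, its methods, and the `max_versions` limit are names chosen for this illustration, and timestamps are assumed to be milliseconds since the epoch.

```python
import time

class VersionedCell:
    """Toy cell that keeps several timestamped versions of a value."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = {}           # timestamp (ms) -> value bytes

    def put(self, value, timestamp=None):
        # No explicit timestamp: fall back to the current time in ms.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value
        # Keep only the newest max_versions entries.
        for old in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old]
        return timestamp

    def get(self, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        candidates = [ts for ts in self.versions
                      if timestamp is None or ts <= timestamp]
        return self.versions[max(candidates)] if candidates else None
```

A read without a timestamp returns the latest version, while a read with a timestamp returns the value as it was at that point, provided that version has not been trimmed.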

HBase data types

Apache HBase has no concept of data types. Everything is a byte array. It is a bytes-in, bytes-out database: values are converted to byte arrays when they are written through the Put interface and are returned as byte arrays through the Result interface. A serialization framework is used to convert user data into byte arrays.
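The bytes-in, bytes-out idea can be sketched in Python with the standard `struct` module. The encodings chosen here (UTF-8 for strings, 8-byte big-endian for numbers) and the helper names are assumptions for illustration. The store itself would keep no record of which type a value had, so the reader must know which decoder to apply.

```python
import struct

def to_bytes(value):
    """Serialize a str, int, or float into bytes (illustrative encoding)."""
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, int):
        return struct.pack(">q", value)   # 8-byte big-endian signed integer
    if isinstance(value, float):
        return struct.pack(">d", value)   # 8-byte big-endian IEEE 754 double
    raise TypeError(f"unsupported type: {type(value).__name__}")

def int_from_bytes(data):
    """Decode bytes that the caller knows were written as an integer."""
    return struct.unpack(">q", data)[0]

def str_from_bytes(data):
    """Decode bytes that the caller knows were written as a string."""
    return data.decode("utf-8")
```

Decoding with the wrong function does not necessarily fail; it can silently return a different value, which is why applications usually fix the encoding per column.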

As a practical guideline, Apache HBase cells can store values up to about 10 to 15 MB. Larger values can be stored in Hadoop HDFS, with the file path stored in Apache HBase as metadata.

HBase data storage

The following introduces how Apache HBase data is organized, first conceptually and then physically.

Conceptual view

You can see that a table is represented as a series of Rows at the conceptual level.

The following is a conceptual diagram of how data is stored in HBase.

Figure: Conceptual view

Physical view

In the physical view, a table is stored separately for each column family.

The following example shows how a table is stored per column family.

Figure: Physical view
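The step from the conceptual view to the physical view can be sketched as a grouping of cells by column family. The sample cells and the `to_physical` function are invented for this illustration and say nothing about the actual file format.

```python
from collections import defaultdict

# Conceptual view: one logical row spans several column families.
conceptual = [
    # (row key, "family:qualifier", timestamp, value)
    (b"row1", b"Customer:Name", 1, b"Alice"),
    (b"row1", b"Sales:Amount", 1, b"100"),
    (b"row2", b"Customer:City", 1, b"Busan"),
    (b"row2", b"Sales:Product", 1, b"Desk"),
]

def to_physical(cells):
    """Group cells by column family, keeping each group sorted by key."""
    stores = defaultdict(list)
    for row, column, ts, value in cells:
        family = column.split(b":", 1)[0]
        stores[family].append((row, column, ts, value))
    return {family: sorted(group) for family, group in stores.items()}

physical = to_physical(conceptual)
```

Cells that a row does not have simply do not appear in any group, so missing columns take up no space in this model.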

Namespace

A namespace is a logical group of tables. It is similar to grouping related tables in a relational database.

The representation of a namespace is as follows.

HBase Namespaces
- Table
- RegionServer group
- Permission
- Quota

Let’s look at each component of a namespace.

Table

All tables are part of a namespace. If no namespace is defined, the table is assigned to the default namespace.

RegionServer group

A default RegionServer group can be assigned to a namespace. In that case, tables created in the namespace become members of that RegionServer group.

Permission

Namespaces can be used to define access control lists (ACLs), such as read, delete, and update permissions. For example, a user with write permission on a namespace can create tables in it.

Quota

This component defines the quota for a namespace, that is, how many tables and Regions it can contain.

Predefined namespaces

There are two predefined special namespaces.

hbase: A system namespace that holds HBase internal tables.
default: The namespace for all tables for which no namespace is specified.
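Table names are written as `namespace:table`, and a name without a prefix belongs to the default namespace. The following Python sketch shows that rule; `split_table_name` is a hypothetical helper, not an HBase function, and the table names are examples.

```python
def split_table_name(name):
    """Split 'namespace:table' into (namespace, table).

    A name without a namespace prefix falls into 'default'.
    """
    namespace, sep, table = name.partition(":")
    if not sep:
        return "default", name
    return namespace, table

examples = {name: split_table_name(name)
            for name in ("sales:orders", "orders", "hbase:meta")}
```

Here `orders` and `sales:orders` are two different tables, because they live in different namespaces.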

Data model operations

The main data model operations are Get, Put, Scan, and Delete. These operations can be used to read, write, and delete records from tables.

Let’s look at each operation in detail.

Get

The Get operation is similar to a Select statement in a relational database. It fetches the contents of a specified row from an HBase table.

You can run the Get command in the HBase shell as follows.

hbase(main):001:0> get 'table name', 'row key' <filters>

Put

The Put operation is used to write data to a table. It adds a new row if the row key is new and updates the row if it already exists.

Scan

The Scan operation is used to read multiple rows from a table. Unlike Get, where the rows to read must be specified by row key, Scan can iterate over a row range or over all rows in a table.
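The difference between Get and Scan can be sketched with a small sorted in-memory table. `SortedTable` is a toy class written for this illustration, not the HBase client API. It assumes that the start row is inclusive and the stop row is exclusive, and that an empty byte string means "from the first row" or "to the last row".

```python
from bisect import bisect_left

class SortedTable:
    """Toy table that keeps row keys in sorted order."""

    def __init__(self):
        self.keys = []       # row keys in ascending order
        self.rows = {}       # row key -> row data

    def put(self, key, row):
        if key not in self.rows:
            self.keys.insert(bisect_left(self.keys, key), key)
        self.rows[key] = row

    def get(self, key):
        """Point lookup: the exact row key must be known."""
        return self.rows.get(key)

    def scan(self, start=b"", stop=b""):
        """Yield rows with start <= key < stop in row key order."""
        i = bisect_left(self.keys, start)
        while i < len(self.keys) and (not stop or self.keys[i] < stop):
            yield self.keys[i], self.rows[self.keys[i]]
            i += 1
```

Because the keys are sorted, a range scan only needs to find the start position once and then reads neighboring rows, which is why row key design affects scan performance.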

Delete

The Delete operation is used to delete a row or set of rows from an HBase table. It can be executed through Table.delete() (HTable.delete() in older client versions).

When a delete is executed, the data is not removed immediately. It is only marked as deleted with a tombstone marker, and it is physically removed from the table when a major compaction occurs.

Internal delete types are as follows.

Delete: Removes a specific version of a column.
Delete column: Removes all versions of a column.
Delete family: Removes all columns in a specific Column Family.
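The three marker types and the role of compaction can be sketched as follows. This is a simplified model written for illustration: the tuple layout, the `visible` and `major_compact` functions, and the rule that column and family markers cover every version at or before the marker's timestamp are assumptions of this sketch, not a description of HBase internals.

```python
# Cells:      (row, family, qualifier, timestamp) -> value
# Tombstones: (kind, row, family, qualifier, timestamp),
#             where kind is "version", "column", or "family".

def visible(cells, tombstones):
    """Return the cells that are not masked by any tombstone."""
    def masked(row, fam, qual, ts):
        for kind, t_row, t_fam, t_qual, t_ts in tombstones:
            if t_row != row or t_fam != fam:
                continue
            if kind == "family" and ts <= t_ts:
                return True
            if kind == "column" and qual == t_qual and ts <= t_ts:
                return True
            if kind == "version" and qual == t_qual and ts == t_ts:
                return True
        return False
    return {k: v for k, v in cells.items() if not masked(*k)}

def major_compact(cells, tombstones):
    """Physically drop masked cells and discard the tombstones."""
    return visible(cells, tombstones), []
```

Until `major_compact` runs, the deleted cells still exist in `cells`; reads go through `visible`, so they are hidden rather than gone.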