Apache HBase Regions
Apache HBase stores large amounts of data in tables sorted by row key in lexicographic order. Tables are distributed across multiple Regions, and Regions are further distributed across multiple RegionServers. When you create a table in Apache HBase, a default Region is assigned.
Apache HBase Regions are horizontally scalable. They include a start key and an end key, and rows are sorted and stored in a continuous format based on those keys. Because HBase provides strong consistency, it does not store the same row key in multiple Regions. Regions distribute load across multiple RegionServers and also perform load balancing and failover according to requirements. As data grows, Regions are split manually or automatically.
Number of Regions
It is recommended to use a small number of medium to large Regions, 5 to 20 GB, per RegionServer, usually 20 to 200 Regions. The standard number of Regions is 100.
Some factors to consider for Regions are as follows.
- The limiting factor when choosing the number of Regions is available heap space. MemStore requires almost 2 MB per column family per Region. If there are 100 Regions and 3 column families per Region, MemStore heap space requirements are 600 MB. Fewer Regions require less MemStore heap.
- A large number of Regions creates many small flushes. If each flush creates a StoreFile, many StoreFiles are created, requiring more compaction. MemStore and StoreFile indexes also require more heap space.
- A large number of Regions creates load on the master because the master must assign and reassign Regions to RegionServers. The master must also move Regions for load balancing.
Region Assignment
In Apache HBase, Regions are assigned by the master as follows.
- The master starts the assignment manager.
- The assignment manager checks existing assignments in hbase:meta metadata.
- Assignments are stored when RegionServers are available.
- If a RegionServer is not online, the load balancer is called to assign Regions to another RegionServer.
- The hbase:meta metadata is updated with the new assignment.
Region failover
Because a RegionServer can fail and multiple RegionServers provide data, a RegionServer’s Regions can become unavailable. ZooKeeper detects RegionServer failures, and the master starts failover on another RegionServer with a similar Row Key for the Region.
Region Locality
Region Locality indicates the proximity of a Region to a RegionServer. It is achieved through HDFS block replication across the cluster.
The replica placement policy is the HDFS replica placement policy, as follows.
- The first replica is placed on the local node.
- The second replica is placed on a random node in a different rack.
- The third replica is placed in the same rack as the second replica, but on a different node.
Benefits of Regions
The benefits of Regions include distributed data storage, partitioning, automatic sharding and scalability, and Region splitting.
Let’s look at each in detail.
Distributed data store
The design of a distributed data store matches Apache HBase’s design of using multiple Regions for a table. Distributing Regions of larger tables across a cluster of nodes provides high availability.
Partitioning
Data tables are stored in multiple Regions, and Regions partition the data. To access data in that table, access must now happen from different Regions. Having multiple Regions provides the benefit of delivering data quickly.
Automatic sharding and scalability
The automatic sharding process is used to split a Region into roughly two halves when the number of row keys in the Region becomes too large. In HBase, the basic unit of horizontal scalability is the Region, and rows are shared by Region.
Region splitting
When a threshold is exceeded, a Region is split. This is handled by the RegionServer, which splits the Region and takes the split Region offline. After that, two split Regions are added to hbase:meta, opened by the RegionServer, and reported to the master. Region splitting is automatic by default, but it can also be run manually. The HBase Region split policy is configured with hbase.regionserver.region.split.policy.