HBase is a NoSQL database which has experienced a tremendous increase in popularity in recent years. There are many great sources which explain details of the architecture or guide you through the installation of your own HBase cluster. But when I started my research, I missed a simple overview which answers questions like: what are the requirements for HBase, which data structure is used to meet these requirements, and how is this data structure integrated into the architecture of HBase? I am convinced that everyone who can answer these questions will find it much easier to understand the details of the HBase architecture, development, configuration and data modeling. That was my motivation for answering these questions in this blog post.
This blog post gives an introduction to HBase which
- comes to the point
- uses bullet points where possible
- uses intuitive drawings
The goal is that the reader knows afterwards
- why this technology is needed
- the basic concepts of HBase
- how it can be used
- which architecture is used
- how the underlying data structure works
- why architecture and data structure fulfill all requirements
Requirements
The requirements for database systems have changed over the past decade with respect to the following factors:
- Volume
- Variety
- Velocity
Some typical applications which produce this kind of data are
- Internet, Social Media
- Natural sciences: Genome Data, Large Hadron Collider (CERN)
- Logistics
- Production
- (And in the future) Internet of things, wearables
Besides the new challenges, it is also important not to forget some classical Database System requirements
- Atomicity
- Consistency
- Isolation
- Durability
The need for HBase
Hadoop is a framework for storing, processing and managing large amounts of data. Among others, it has the following tools and features
- Resource management
- Fault tolerance
- Distributed file system HDFS
- Large scale batch processing with MapReduce
- Runs on commodity Hardware
- Hadoop is a growing platform with many tools and a good integration into other systems
Out of the box, Hadoop can handle a high volume of multi-structured data. But it cannot handle a high velocity of random reads and writes, and it is unable to change a file without completely rewriting it.
- HBase is a NoSQL database
- It is designed on top of Hadoop, dealing with the drawbacks of HDFS
- It can also be used with other file systems
- HBase allows fast random reads and writes.
- Although HBase allows fast random writes, it is read optimized
Index Data Structure
Requirements for the index data structure
- Fast random reads
- Fast random writes
- Consistency and fail-safety
- Based on HDFS
The problem of Hadoop
- Fast random reads require the data to be stored structured (ordered).
- The only way to modify a file stored on HDFS without rewriting it is to append to it.
- Fast random writes into sorted files seem to be impossible if appending is the only operation available.
- The solution to this problem is the Log-Structured Merge Tree (LSM Tree).
- The HBase data structure is based on LSM Trees
The Log-Structured Merge Tree
The LSM Tree works the following way
- All puts (insertions) are appended to a write ahead log (can be done fast on HDFS, can be used to restore the database in case anything goes wrong)
- An in memory data structure (MemStore) stores the most recent puts (fast and ordered)
- From time to time MemStore is flushed to disk.
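The three steps above can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not HBase code: the class name `ToyStore`, the `flush_threshold` parameter and the in-memory lists standing in for the log and the files on HDFS are all made up for illustration.

```python
import bisect


class ToyStore:
    """Toy LSM write path: log append, in-memory MemStore, flush to sorted files."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only list, stands in for the write ahead log
        self.memstore = {}   # most recent puts, kept in memory
        self.files = []      # immutable sorted files, newest last
        self.flush_threshold = flush_threshold  # hypothetical flush trigger

    def put(self, key, value):
        self.wal.append((key, value))   # 1. append to the log first (cheap)
        self.memstore[key] = value      # 2. then update the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the MemStore out as one sorted, immutable file
        if self.memstore:
            self.files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for f in reversed(self.files):           # newest file first
            i = bisect.bisect_left(f, (key,))    # binary search in a sorted file
            if i < len(f) and f[i][0] == key:
                return f[i][1]
        return None


store = ToyStore(flush_threshold=2)
store.put("row3", "c")
store.put("row1", "a")    # second put reaches the threshold and triggers a flush
store.put("row2", "b")
store.put("row1", "a2")   # triggers a second flush
```

After these four puts the toy store holds two sorted files and an empty MemStore, and `store.get("row1")` has to consult the newest file first to find the latest value. The real MemStore keeps its entries ordered at all times; the toy version only sorts at flush time to stay short.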
Flushing the MemStore results in the following structure
- Every flush creates a new small file, so many small files accumulate on HDFS.
- HDFS works better with a few large files than with many small ones.
- A get or scan potentially has to look into all small files. So fast random reads are not possible as described so far.
- That is why HBase constantly checks if it is necessary to combine several small files into one larger one
- This process is called compaction. There are two different kinds of compactions.
- Minor Compactions merge a few small ordered files into one larger ordered file without touching the data.
- Major Compactions merge all files into one file. During this process outdated or deleted values are removed.
- Guarantees on the maximum number of compactions per entry can be made because of the way HBase triggers compactions.
- Bloom Filters (stored in the Metadata of the files on HDFS) can be used for a fast exclusion of files when looking for a specific key.
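To make the difference between the two kinds of compaction more concrete, here is a minimal sketch. It assumes that a file is a list of `(key, timestamp, value)` entries sorted by key and then by newest timestamp first, and that a hypothetical `TOMBSTONE` object marks a delete. The function names and the `max_versions` parameter are my own, and the real tombstone rules in HBase are more fine-grained than this.

```python
import heapq

TOMBSTONE = object()  # hypothetical delete marker


def sort_key(entry):
    key, timestamp, _ = entry
    return (key, -timestamp)  # order by key, newest version first


def minor_compact(files):
    """Merge sorted files into one sorted file without dropping anything."""
    return list(heapq.merge(*files, key=sort_key))


def major_compact(files, max_versions=1):
    """Merge all files and drop deleted and outdated values."""
    result = []
    current_key, kept, deleted = None, 0, False
    for key, timestamp, value in minor_compact(files):
        if key != current_key:
            current_key, kept, deleted = key, 0, False
        if deleted:
            continue              # older than a tombstone: drop
        if value is TOMBSTONE:
            deleted = True        # drop the marker and everything older
            continue
        if kept < max_versions:
            result.append((key, timestamp, value))
            kept += 1
    return result


older = [("row1", 1, "a"), ("row2", 1, "b"), ("row3", 1, "c")]
newer = [("row1", 2, "a2"), ("row2", 3, TOMBSTONE)]

minor = minor_compact([older, newer])
major = major_compact([older, newer])
```

Here `minor` still contains all five entries, just in one ordered file, while `major` only keeps the latest value of `row1` and the untouched `row3`.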
Data Model and Properties
HBase uses the following data model
- Every entry in a Table is indexed by a RowKey
- For every RowKey an unlimited number of attributes can be stored in Columns
- There is no strict schema with respect to the Columns. New Columns can be added during runtime
- HBase Tables are sparse. A missing value doesn’t need any space
- Different versions can be stored for every attribute. Each with a different Timestamp.
- Once a value is written to HBase it cannot be changed. Instead another version with a more recent Timestamp can be added.
- To delete a value from HBase a Tombstone value has to be added.
- The Columns are grouped into ColumnFamilies. The ColumnFamilies have to be defined up front, typically at table creation time. Changing them later requires altering the table schema.
- HBase is a distributed system. It is guaranteed that all values belonging to the same RowKey and ColumnFamily are stored together.
Alternatively HBase can also be seen as a sparse, multidimensional, sorted map with the following structure:
- (Table, RowKey, ColumnFamily, Column, Timestamp) → Value
Or in an object oriented way:
- Table ← SortedMap<RowKey, Row>
- Row ← List<ColumnFamily>
- ColumnFamily ← SortedMap<Column, List<Entry>>
- Entry ← Tuple<Timestamp,Value>
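The nested structure above can be mimicked with plain Python dictionaries. This is only an illustration of the shape of the data, not of how HBase stores it; the helper names `new_table`, `put` and `get_latest` are made up, and Python dicts are not sorted, so the ordering of RowKeys is left out here.

```python
from collections import defaultdict


def new_table():
    # table[row_key][column_family][column] -> list of (timestamp, value)
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def put(table, row_key, family, column, value, timestamp):
    versions = table[row_key][family][column]
    versions.append((timestamp, value))
    versions.sort(key=lambda entry: entry[0], reverse=True)  # newest first


def get_latest(table, row_key, family, column):
    # .get() avoids creating empty entries for missing cells (sparse table)
    versions = table.get(row_key, {}).get(family, {}).get(column, [])
    return versions[0][1] if versions else None


table = new_table()
put(table, "user42", "info", "name", "Ada", timestamp=1)
put(table, "user42", "info", "name", "Ada L.", timestamp=2)  # new version, old one stays
put(table, "user42", "stats", "logins", 7, timestamp=1)

latest_name = get_latest(table, "user42", "info", "name")
missing = get_latest(table, "user42", "info", "email")       # never written
```

Both versions of the name remain stored under the same Column, and the missing `email` cell takes no space at all because nothing was ever created for it.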
HBase supports the following operations:
- Get: Returns the values for a given RowKey. Filters can be used to restrict the results to specific ColumnFamilies, Columns or versions.
- Put: Adds a new entry. The Timestamp can be set automatically or manually.
- Scan: Returns the values for a range of RowKeys. Scans are very efficient in HBase. Filters can also be used to narrow down the results. HBase 0.98.0 (which was released last week) also allows backward scans.
- Delete: Adds a Tombstone marker
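The four operations can be tried out on a small in-memory stand-in. The sketch below is a toy under my own assumptions: `ToyTable` is a hypothetical class, Columns are written as plain `"family:column"` strings, and a counter replaces real Timestamps. It is not the HBase client API. The scan treats the start row as inclusive and the stop row as exclusive.

```python
import bisect

TOMBSTONE = object()  # hypothetical delete marker


class ToyTable:
    def __init__(self):
        self.row_keys = []  # kept sorted, which is what makes range scans cheap
        self.cells = {}     # (row_key, column) -> list of (timestamp, value)
        self.clock = 0      # stands in for automatically assigned Timestamps

    def put(self, row_key, column, value, timestamp=None):
        if timestamp is None:
            self.clock += 1
            timestamp = self.clock
        else:
            self.clock = max(self.clock, timestamp)
        i = bisect.bisect_left(self.row_keys, row_key)
        if i == len(self.row_keys) or self.row_keys[i] != row_key:
            self.row_keys.insert(i, row_key)
        self.cells.setdefault((row_key, column), []).append((timestamp, value))

    def delete(self, row_key, column):
        self.put(row_key, column, TOMBSTONE)  # a delete is just another write

    def get(self, row_key, column):
        versions = self.cells.get((row_key, column), [])
        if not versions:
            return None
        _, value = max(versions, key=lambda entry: entry[0])  # newest version
        return None if value is TOMBSTONE else value

    def scan(self, start_row, stop_row, column):
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        for row_key in self.row_keys[lo:hi]:
            value = self.get(row_key, column)
            if value is not None:
                yield row_key, value


t = ToyTable()
t.put("row-a", "cf:x", "1")
t.put("row-c", "cf:x", "3")
t.put("row-b", "cf:x", "2")
t.put("row-b", "cf:x", "2b")   # newer version of the same cell
t.delete("row-c", "cf:x")      # adds a tombstone, nothing is removed

full_scan = list(t.scan("row-a", "row-d", "cf:x"))
partial_scan = list(t.scan("row-b", "row-c", "cf:x"))
```

The deleted `row-c` no longer shows up in `full_scan`, although its old value and the tombstone are both still stored in `t.cells`.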
Architecture
- HBase is a distributed database
- The data is partitioned based on the RowKeys into Regions.
- Each Region contains a range of RowKeys based on their binary order.
- A RegionServer can contain several Regions.
- All Regions contained in a RegionServer share one write ahead log (WAL).
- Regions are automatically split if they become too large.
- Every Region creates a Log-Structured Merge Tree for every ColumnFamily. That’s why fine tuning like compression can be done on ColumnFamily level. This should be considered when defining the ColumnFamilies.
- HBase uses ZooKeeper to coordinate all required services.
- The assignment of Regions to RegionServers and the splitting of Regions is managed by a separate service, the HMaster
- The ROOT and the META table are two special kinds of HBase tables which are used for efficiently identifying which RegionServer is responsible for a specific RowKey in case of a read or write request.
- When performing a get or scan, the client asks ZooKeeper where to find the ROOT Table. Then the client asks the ROOT Table for the correct META Table. Finally it can ask the META Table for the correct RegionServer. (Since HBase 0.96 the ROOT Table no longer exists and ZooKeeper points directly to the META Table.)
- The client stores information about ROOT and META Tables to speed up future lookups.
- Using these three layers is efficient for a practically unlimited number of RegionServers.
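The core of this lookup is finding the Region whose RowKey range contains a given key, and remembering the answer on the client side. A minimal sketch, assuming each Region is described only by its start key and its server: the classes `Meta` and `Client` and the server names are hypothetical and heavily simplified compared to the real catalog tables.

```python
import bisect


class Meta:
    """Hypothetical stand-in for the catalog: maps RowKey ranges to servers."""

    def __init__(self, regions):
        # regions: list of (start_key, server); the first start key is ""
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]
        self.lookups = 0  # counts how often the catalog is actually asked

    def find_region(self, row_key):
        self.lookups += 1
        i = bisect.bisect_right(self.start_keys, row_key) - 1
        start, server = self.regions[i]
        end = self.start_keys[i + 1] if i + 1 < len(self.start_keys) else None
        return start, end, server


class Client:
    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # cached (start_key, end_key, server) entries

    def locate(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server                    # answered from the cache
        region = self.meta.find_region(row_key)  # cache miss: ask the catalog
        self.cache.append(region)
        return region[2]


meta = Meta([("", "rs1"), ("g", "rs2"), ("p", "rs3")])
client = Client(meta)

first = client.locate("hbase")       # catalog lookup
second = client.locate("hadoop")     # same Region, served from the cache
third = client.locate("zookeeper")   # different Region, second catalog lookup
```

Only two of the three requests reach the catalog, because `"hbase"` and `"hadoop"` fall into the same Region.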
Mission completed?
Does HBase fulfill all “new” requirements?
- Volume: By adding new servers to the cluster HBase scales horizontally to an arbitrary amount of data.
- Variety: The sparse and flexible table structure is optimal for multi-structured data. Only the ColumnFamilies have to be predefined.
- Velocity: By adding new servers, HBase scales horizontally to arbitrary rates of read or write requests. The key to this is the LSM Tree structure.
What about ACID (Atomicity, Consistency, Isolation, Durability)?
- HBase is not fully ACID.
- ACID guarantees can only be made to changes within the same row.
- Lars Hofhansl has written a nice blog post where he explains how and when ACID can be guaranteed. [1]
Whether this is a limitation depends on the use case. For many Big Data use cases it is not.
The CAP Theorem?
- A distributed system cannot be consistent, available and tolerant to network partitions at the same time.
- Since every distributed system has to be tolerant to network partitions (communication between the nodes may be disturbed), one has to choose between availability (the system will always accept and process read and write requests) and consistency (an update is applied to all relevant nodes at the same time).
- HBase is partition tolerant and consistent (CP). System failures may result in unprocessed requests.
- Coda Hale wrote a really good article on the CAP theorem [2]
When to use HBase?
HBase should be considered in the following cases
- Existing Hadoop cluster
- Huge amount of data
- Fast random reads and/or writes
- Well known access patterns
Don’t use HBase in the following cases
- New data only needs to be appended
- Batch processing instead of random reads
- Complicated access patterns (such as joins)
- Full ANSI SQL support required
- A single node can deal with the volume and the velocity of the complete data set
How to use HBase?
- The possibilities for SQL querying and data interaction are restricted. Simple SQL access patterns are possible using Hive.
- HBase has a Java API
- HBase tables can be used as input to MapReduce jobs (including Pig and Hive)
- HBase is a perfect candidate for the serving layer in a Lambda Architecture which combines real time analytics with batch processing.