A Beginner's Guide to HBase
This is an initial go-to reference for anyone who wants to learn a bit about HBase and its internals.
Ambari
The Ambari console provides a one-stop solution for all your HBase management. First, you'll have to install the Ambari console on a server. It then provides everything needed for managing an HBase cluster, including the installation of the various components (HBase, Ambari Metrics, HDFS, ZooKeeper, etc.), adding/removing a node, restarting a node, altering HBase process configs, and so on.
You can refer to this official doc for all the Ambari How-Tos.
HBase
Overview
HBase is an append-only store that provides random, real-time read/write access to your Big Data.
HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. You can create huge, sparse tables with HBase.
HBase's data model is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and provides read and write access.
Following are the components of an HBase cluster, as the image below represents:
HBase Components
Ambari Console
As mentioned above, the Ambari console is for monitoring & managing the HBase cluster. It runs on port 8080.
Link — http://<server_ip_where_ambari_is_installed>:8080/
Below is the Ambari dashboard UI. On the left-hand side, you can choose between HDFS, HBase, ZooKeeper, etc., depending on what you want to monitor. You can refer to this link for all the Ambari How-Tos.
Ambari Metrics/Grafana
It provides you with all the graphs/metrics needed to debug any HBase-related issues & to ensure that your cluster is in a healthy state.
It runs on port 3000 of the same server where Ambari runs.
Link — http://<server_ip_where_ambari_is_installed>:3000/
Here you can find all the documentation regarding the HBase metrics including what each graph represents.
HBase Client
When the client (the Java/Python library) makes a connection to HBase for reading/writing, it first contacts the HBase Master. The Master provides the IP addresses of all the Region Servers and the regions assigned to each of them (the assignment of regions to region servers can be arbitrary), along with the start & end row keys of every region.
From then on, when a query takes place, the client makes the call directly to the appropriate region server where the data is/should be present. If HBase HA is enabled, the client also knows where the secondary region servers are and calls them after the query timeout.
When a Region Server goes down, or regions are reassigned to other region servers, an Incorrect Region Exception occurs; the client then asks the master for the corrected information, makes the correct call, and fetches the data.
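To make this concrete, below is a minimal sketch of a read using the standard HBase Java client. The ZooKeeper quorum, table name, and row key are placeholder values; note that the client code only supplies the quorum — the master lookup, region-location caching, and retries after region moves all happen inside the library.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder quorum: the client bootstraps everything else from ZooKeeper.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            // Region lookup, and the retry after an "incorrect region" response,
            // are handled transparently inside this call.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(result);
        }
    }
}
```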
HBase Master Node
The HBase Master Node can be termed the orchestrator for managing region servers: it knows the mapping of regions to region servers, along with other essential details.
If a region server crashes, the Master detects the absence of its heartbeat after the heartbeat timeout and reassigns its regions to the other available region servers.
All the communication between the region servers and the master needed for this is driven by the Master, with the help of HBase's ZooKeeper.
HBase Master Node Dashboard (runs on port 16010) can be accessed via this URL — http://<hbase_master_node_ip>:16010
You can also find the IP by opening the HBase tab in the left menu & selecting the Active HBase Master link.
You can find the IP of the master node in the summary section below:
The master node dashboard provides all the info regarding regions & their region-server mappings, how many regions each server has, requests per second, info regarding HBase tables, etc.
HBase Zookeeper
This can be considered the persistence store of the HBase Master: whatever state the Master maintains, it keeps a copy here.
HBase Region Servers
Region servers are the contact points for all read/write queries; the client calls these servers directly to fetch data. Since the client knows exactly where each region resides, it identifies which server to query and executes the query on that server, and no other.
Each region server has many regions assigned to it. Any query outside of those regions will result in an Incorrect Region Exception.
A region server consists of the following things:
- Memstore
- Block cache
- Write Ahead Log (WAL)
The region server/region doesn't hold the actual data: the actual files are stored in HDFS in the form of HFiles, and the region server is just the mediator between the HFiles & the client. The region server maintains a cache, called the Block Cache, which holds recently accessed blocks (data is stored in, and read from, HFiles as blocks).
For write requests, the data is first written to the WAL, then to the Memstore, and then the call returns to the client. The Memstore is flushed to HDFS, in the form of HFiles, when it fills up or after a regular interval. After the flush, since the data is now persisted in HDFS, the WAL entries before that point are no longer needed and are hence deleted.
HBase Region Servers Dashboard (Port 16030) can be accessed via this URL — http://<server_ip_of_region_server>:16030/
You can also find the IPs by opening the HBase tab in the left menu & selecting the Region Servers link.
This page will list all the region servers with their IP addresses.
The region server dashboard provides stats about that region server, e.g. requests/sec, block cache size, number of regions, block locality, etc.
HBase Region Server Memstore
The Memstore is where all the writes take place. Whenever a region server receives an insert/update request, it first appends to the WAL, then immediately writes to the Memstore and returns success to the client.
The Memstore can be thought of as an in-memory sorted map. There is one Memstore per column family per region, so the total number of Memstores on a region server equals the number of regions it hosts times the number of column families. Usually, one Memstore is 256 MB.
Normally, we allocate 40% of the total heap to the Memstores (hbase.regionserver.global.memstore.size).
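As an illustration — a toy model, not HBase's actual implementation — the essential behaviour of a Memstore is that of a concurrent sorted map keyed by row key, which is what makes the eventual flush a cheap sequential write:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model of a Memstore: inserts land in a sorted in-memory map, so a
// flush can iterate in row-key order and write sequentially to HDFS.
public class ToyMemstore {
    private final ConcurrentSkipListMap<String, String> data = new ConcurrentSkipListMap<>();

    public void put(String rowKey, String value) {
        data.put(rowKey, value); // kept sorted on insert
    }

    public void flush() {
        // Row-key iteration order is exactly what an HFile write wants.
        data.forEach((row, value) -> System.out.println(row + " -> " + value));
        data.clear();
    }
}
```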
Here is more detailed info on the Memstore.
HBase Region Server Block Cache
Similar in spirit to the Memstore, the Block Cache is the read cache, and it serves as the first place where read lookups happen. (Unlike the Memstore, there is a single Block Cache per region server.) If the data is not present there (a cache miss), the lookup goes to HDFS (the main persistent store); the result is stored back in the cache & returned to the client.
The Block Cache has an LRU-based eviction policy. Normally, we allocate 40% of the total heap to the Block Cache (hfile.block.cache.size), and the remaining 20% is left for general HBase/JVM overhead.
HBase Region Server Write Ahead Log(WAL)
The WAL is the first place where writes land; after that, the write goes to the Memstore. The purpose of the WAL is to recover all the writes that have happened in case the region server crashes & the Memstore data is lost. There is only one WAL per region server.
WAL is enabled by default, but you can choose to disable it on a per-query basis.
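For instance, with the Java client the WAL can be skipped per mutation through its durability setting — a hedged sketch with placeholder names, where `table` is a handle obtained as in the connection sketch earlier:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

void putWithoutWal(Table table) throws IOException {
    Put put = new Put(Bytes.toBytes("row-1"));
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    // Skip the WAL for this mutation only: faster, but the write is lost if
    // the region server crashes before the Memstore is flushed.
    put.setDurability(Durability.SKIP_WAL);
    table.put(put);
}
```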
More reading here.
HBase Region Server L2 Cache/Off-Heap Cache
This is a second-level cache used by the Block Cache. If it is enabled, a lookup that misses the Block Cache goes to this cache before going to HDFS.
Another reason for its existence is that garbage collection doesn't happen here, which prevents GC pauses: inside a JVM process, GC covers only the main heap memory, while this cache is managed by the process itself (in this case, HBase) and is beyond the scope of the JVM GC.
HDFS
HDFS is the main data store where all the data lives. HDFS generally stores data in the form of sequence files, but HBase uses a special file structure, known as HFiles, which allows HBase to do real-time reads/writes against the HDFS file store.
HDFS HFile
The HFile is the final, persistent form of the data in HDFS. Whenever a Memstore flush happens, the data is converted into an HFile and flushed down to the HDFS file system. Two processes, known as major compaction & minor compaction, combine multiple HFiles into a bigger HFile.
An HFile is immutable.
Very simplistically, an HFile consists of 3 parts:
- The data — sorted key-value pairs, laid out in blocks
- A Bloom filter — to check whether the HFile contains the sought data or not
- An index — if the data is present, the correct block is fetched on the basis of this index
The detailed architecture of the same is here — HFiles.
HDFS Name Node
Similar to HBase's Master Node, the NameNode is the main managing server for the HDFS data nodes (which store the actual data). It takes care of data node availability, data replication, whether the nodes are up or not, etc.
HDFS Name Node Dashboard (Port 50070) — http://<server_ip_of_name_node>:50070
Similar to the HBase Master dashboard, this dashboard provides all the data related to the HDFS data nodes, their IPs, server configs, HDFS configs, etc.
HDFS Data Node
A data node is the server where the actual data resides, in the form of HFiles. Usually there is a replication factor of 3, meaning the data is replicated at 2 more places apart from this node, so that there are always 3 copies of any piece of data for redundancy/fail-safe scenarios.
When an HDFS data node goes down, these copies ensure the data remains available.
HDFS Data Node Dashboard (Port 50075) — http://<server_ip_of_data_node>:50075
This dashboard provides the info of that particular data node, its disk volumes, etc.
HBase Processes
Read Path
In very simple terms, the read path is as follows:
- 1st lookup is in the Memstore, in case the sought data has not yet been flushed to HDFS (and hence cannot be in the Block Cache)
- 2nd lookup is in the Block Cache
- 3rd lookup is in the off-heap cache (if enabled)
- 4th lookup is in the HDFS data node (HFile) — HBase checks each HFile of that region via its Bloom filter to see whether the data could be in that particular file, until it locates the required one. The data is then accessed with the help of that HFile's index.
- More details here.
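In client code, this whole lookup chain is hidden behind a single get — a minimal sketch with placeholder names, `table` as in the earlier connection sketch:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

byte[] readCell(Table table) throws IOException {
    Get get = new Get(Bytes.toBytes("row-1"));
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col")); // restrict to one cell
    // Memstore -> Block Cache -> off-heap cache -> HFile resolution all
    // happens on the region server inside this one call.
    Result result = table.get(get);
    return result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
}
```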
Write Path
In very simple terms, the write path is as follows:
- First, the data is written to the WAL
- Then the data is written to the Memstore
- When the Memstore is full, the data is flushed to HDFS in the form of HFiles.
The diagram below is a very apt explanation of the process:
Follow this link for a more detailed walkthrough of the HBase write path.
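The corresponding client-side write is equally simple — a sketch with placeholder names; the WAL append and the Memstore write both complete server-side before the call returns:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

void writeCell(Table table) throws IOException {
    Put put = new Put(Bytes.toBytes("row-1"));
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    // Server side: append to the WAL, write to the Memstore, then ack.
    // The flush to an HFile happens later, asynchronously.
    table.put(put);
}
```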
Memstore Flushing
When a RegionServer (RS) receives a write request, it directs the request to a specific region. Each region stores a set of rows. Row data can be separated into multiple column families (CFs). Data of a particular CF is stored in an HStore, which consists of a Memstore and a set of HFiles. The Memstore is kept in RS main memory, while HFiles are written to HDFS. When a write request is processed, data is first written into the Memstore. Then, when certain thresholds are met (main memory is, obviously, limited), the Memstore data gets flushed into an HFile.
The main reason for using a Memstore is the need to store data on HDFS ordered by row key. As HDFS is designed for sequential reads/writes, with no file modifications allowed, HBase cannot efficiently write data to disk as it is being received: the written data would not be sorted (when the input is not sorted), which means it would not be optimized for future retrieval. To solve this problem, HBase buffers the last received data in memory (in the Memstore), "sorts" it before flushing, and then writes it to HDFS using fast sequential writes. Note that in reality an HFile is not just a simple list of sorted rows; it is much more than that.
Apart from solving the “non-ordered” problem, Memstore also has other benefits, e.g.:
- It acts as an in-memory cache that keeps recently added data. This is useful in the numerous cases where the last written data is accessed more frequently than older data.
- Certain optimizations can be applied to rows/cells while they are held in memory, before being written to the persistent store. E.g. when a CF is configured to keep a single version of a cell and the Memstore contains multiple updates for that cell, only the most recent one needs to be kept; the older ones can be omitted (and never written to an HFile).
The important thing to note is that every Memstore flush creates one HFile per CF.
Refer to the following diagram for reference:
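Flushes are normally triggered by the thresholds above, but you can also request one explicitly through the Admin API — a hedged sketch with a placeholder table name:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

void flushTable(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        // Asks every region of the table to flush its Memstore(s) into HFiles.
        admin.flush(TableName.valueOf("my_table"));
    }
}
```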
Minor Compaction
Several times a day, an internal HBase process initiates this task of merging several smaller HFiles into bigger ones. This process is not as intensive as a major compaction.
You can see whether a minor compaction is running by visiting the following link (change the IP & table name accordingly):
— http://<hbase_master_node_ip>:16010/table.jsp?name=<table_name>
Alternatively, you can follow the following steps:
- Open HBase Master node dashboard
- Scroll down to Tables section
- Click on the table name you are interested in.
- See Compaction under the Table Attributes section. It will show NONE (no compaction running), MINOR (minor compaction), or MAJOR (major compaction) — based on which compaction is running at that time.
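Besides waiting for HBase's internal process, you can also queue a compaction yourself through the Admin API — a sketch with a placeholder table name:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

void requestCompaction(Connection connection) throws IOException {
    try (Admin admin = connection.getAdmin()) {
        // Queues a (minor) compaction request; the region servers pick the
        // HFiles to merge and do the work asynchronously.
        admin.compact(TableName.valueOf("my_table"));
    }
}
```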
Major Compaction
Major compaction merges all the small HFiles into one big HFile per region. The total number of HFiles that remain afterwards also depends on the config setting that defines the minimum number of files that must be present for a major compaction to run: if that setting is two, either the major compaction won't run until more than 2 files are present, or the final output may be 2 files.
Major compaction uses significant resources & affects reads/writes, so it should be done when the load is minimal. It is normally run by an internal HBase process.
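If you want to trigger it yourself during a low-traffic window, the Admin API offers a major-compaction counterpart to the minor-compaction sketch above (same placeholder table name; runs asynchronously):

```java
// Reusing the Admin handle from the sketch above:
admin.majorCompact(TableName.valueOf("my_table"));
```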
You can check whether a major compaction is running in exactly the same way as described above for minor compaction: the Compaction attribute under the Table Attributes section of the master dashboard will show MAJOR while it runs.
Major compaction also corrects locality: after the compaction, the data (HFiles) is moved to the data nodes that are collocated with the region servers the regions are mapped to.
More detailed read here.
Data Locality
The Hadoop DataNode stores the data that the region server is managing. All HBase data is stored in HDFS files. Region servers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the region servers. HBase data is local when it is written, but when a region is moved (which happens when a region is split, or when an HBase process goes down — via a stop/restart or a crash, etc.), the data is not local again until compaction.
The NameNode maintains metadata information for all the physical data blocks that comprise the files.
At the time of a read query, the region server queries its HDFS data node, which in turn makes a network call to fetch the required data if it is not present locally. (All of this is abstracted away from HBase — it doesn't deal with this low-level HDFS work.)
HBase HA Configuration Essentials
For configuring HA (High Availability) for HBase read queries, there are three things you need to do:
- Make HBase HA-ready
- Make the table HA-ready
- Make the client HA-ready — by passing the HA-related configs to the client while making connections (see the sketch below).
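A hedged sketch of the table- and client-side pieces (HBase 2.x API; the table name, column family, and timeout value are placeholders — the server-side replica settings still need to be enabled in the cluster config):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

void makeTableHaReady(Admin admin) throws IOException {
    // Table HA: keep one read-only secondary replica of every region.
    admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
            .setRegionReplication(2)
            .build());
}

void makeClientHaReady(Configuration conf) {
    // Client HA: how long to wait on the primary before trying a secondary
    // (microseconds; placeholder value).
    conf.set("hbase.client.primaryCallTimeout.get", "10000");
}
```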
HBase HA read path
Under HA, each region has a secondary region assigned to it for read queries. This means the secondary region server also keeps Block Cache entries for the reads of the secondary region; thus it's a fair assumption that we would require roughly double the amount of memory for the Block Cache than before.
Under HA, if the read from the primary region times out, the client makes the same call to the secondary region server. This ensures that if the primary region is stuck in some other activity and unable to respond, we have another server at our disposal to read the data from.
HA reads are eventually consistent: the data we get from the secondary region server might be a little old. This is because the changes on the primary region are replicated to the secondary by an internal process, so the data you read may be stale by roughly the replication delay.
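With the Java client, HA reads are opted into per request via timeline consistency, and `Result.isStale()` reports whether a secondary replica served the data — a sketch with placeholder names:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Result timelineRead(Table table) throws IOException {
    Get get = new Get(Bytes.toBytes("row-1"));
    // Allow the read to fall back to a secondary replica on timeout.
    get.setConsistency(Consistency.TIMELINE);
    Result result = table.get(get);
    if (result.isStale()) {
        System.out.println("Served by a secondary replica; may lag the primary.");
    }
    return result;
}
```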
Further reading: Timeline-consistent High Available Reads.
HBase HA write path
There is no HA for writes. Writes happen only on the assigned (primary) regions, and if they time out, you'll have to handle those cases yourself.
Since HBase is a consistent (CP, in CAP terms) database, and introducing HA for writes (as for reads) would introduce eventually consistent behaviour, there is no HA on the write path.
HBase: Add a Region Server/Data Node
We can add a new region server via Ambari (Ambari → HBase → Region Servers → Actions → Add New Host). After the VM host is up, we configure the parameters in Ambari, and Ambari sets up the new host for us.
After the region server is added, the Master redistributes the regions equally across all the available hosts. Since the newly added region server receives regions taken from other region servers, its data locality will be near zero, and its collocated HDFS data node will have to make network calls to fetch the data. Hence, latencies may shoot up a little.
Generally, we do this at low traffic, just before the major compaction, so that the localities are fixed once the major compaction completes.
HBase: Remove a Region Server/Data Node
Similar to adding the region server, we can decommission a region server via the following steps:
Ambari → HBase → Region Servers → Click on the region server which needs to be removed
Here you'll find the options to decommission the HDFS data node & the HBase region server. You can monitor the progress in the NameNode & HBase Master dashboards, and also check the region server & data node dashboards.
After removal, the regions of that region server are reassigned to the remaining region servers. Again, locality is going to suffer until major compaction, and in turn latencies may shoot up a little.
The localities should go back to normal after major compaction.
HBase Region Server Crash
In the case of a region server crash, the WAL comes to the rescue. Since it contains all the writes that have happened but haven't yet been flushed to HDFS (as HFiles), it is replayed into the Memstore — just as the normal writes happened, following the normal HBase write path. Note that while the WAL is being replayed, the corresponding regions are not available for reads/writes; after the recovery, the regions become available again.
Please refer to the image below for reference:
More reading here, here & on log splitting.
HDFS NameNode HA
Please refer to the following docs:
- https://hortonworks.com/blog/namenode-high-availability-in-hdp-2-0/
- https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
- https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_hag_hdfs_ha_config.html
References
Though I’ve mentioned the corresponding links for each topic above, I’ve mainly referred to Hortonworks/Cloudera Docs.
I hope this blog was useful, and I look forward to suggestions for improvements.
Feel free to comment or reach out over LinkedIn.