HBase Tutorials for Beginners
HBase is an open-source, distributed database developed by the Apache Software Foundation. It is modeled after Google's Bigtable and is written primarily in Java.
HBase can store massive amounts of data, from terabytes to petabytes.
HBase Unique Features
Tutorial: HBase Architecture, Data Flow, and Use Cases
Tutorial: How to Download & Install HBase
Tutorial: HBase Shell and General Commands
Tutorial: Create, Insert, Read Tables in HBase
Tutorial: HBase: Limitations, Advantages & Problems
Tutorial: HBase Troubleshooting
Tutorial: HBase vs Hive
Tutorial: HBase Interview Questions & Answers
A table for a popular web application may consist of billions of rows. If we want to search for a particular row in such a huge amount of data, HBase is the ideal choice because its query fetch time is low. Most online analytics applications use HBase.
Traditional relational data models fail to meet the performance requirements of very big databases. These performance and processing limitations can be overcome with HBase.
In big data analytics, Hadoop plays a vital role in solving typical business problems by managing large data sets and delivering strong solutions in the analytics domain.
In the Hadoop ecosystem, each component plays its unique role in data storage, processing, and analysis.
Relational databases are less useful for storing and retrieving unstructured and semi-structured data. Also, fetching results by running queries on huge data sets stored in Hadoop is a challenging task. NoSQL storage technologies provide the best solution for faster querying on huge data sets.
Some of the NoSQL models present in the market are Cassandra, MongoDB, and CouchDB. Each of these models has a different storage mechanism.
For example, MongoDB is a document-oriented database from the NoSQL family. Compared to traditional databases, it provides strong features in terms of performance, availability, and scalability. It is an open-source document-oriented database written in C++.
Cassandra is a distributed database from the Apache Software Foundation designed to handle a huge amount of data stored across commodity servers. Cassandra provides high availability with no single point of failure.
CouchDB, meanwhile, is a document-oriented database in which each document's fields are stored in key-value maps.
HBase's storage model is different from the other NoSQL models discussed above.
MongoDB, CouchDB, and Cassandra are NoSQL databases that are feature-specific and used as per business needs. Here, we have listed different NoSQL databases by their use case.
Database Type Based on Feature | Example Databases | Use Case (When to Use)
Key/Value | Redis, MemcacheDB | Caching, queueing, distributing information
Column-Oriented | Cassandra, HBase | Scaling; keeping unstructured, non-volatile data
Document-Oriented | MongoDB, Couchbase | Nested information, JavaScript-friendly
Graph-Based | OrientDB, Neo4j | Handling complex relational information; modeling and handling classification
Telecom Industry
Problem Statement: The telecom industry stores billions of call detail records and needs to query them quickly; adding around 20 TB of data per month to an existing RDBMS causes performance to deteriorate.
Solution:
HBase is used to store billions of rows of call detail records. Because HBase performs fast querying and display of records, it is the best fit for handling this volume of data.
Banking Industry
Problem Statement:
The banking industry generates millions of records on a daily basis. In addition, the banking industry needs an analytics solution that can detect fraud in money transactions.
Solution:
To store, process, and update huge volumes of data and perform analytics, an ideal solution is HBase integrated with several Hadoop ecosystem components.
Apart from this, HBase can be used wherever fast, random read/write access to big data is required.
Summary:
HBase provides unique features and solves typical industrial use cases. As column-oriented storage, it provides fast querying, fast fetching of results, and high-volume data storage.
HBase is an open-source, column-oriented, distributed database system that runs in a Hadoop environment. Apache HBase is needed for real-time Big Data applications. A table in HBase can consist of billions of rows and millions of columns.
HBase is built for low-latency operations and has some specific features compared to traditional relational models.
HBase is a column-oriented database, and its data is stored in tables. The tables are sorted by RowId. As shown below, each table has a RowId followed by several column families.
The column families that are present in the schema are key-value pairs. On closer observation, each column family has multiple columns. The column values are stored on disk. Each cell of the table has its own metadata, such as a timestamp and other information.
In HBase, the following key terms represent the table schema: Table (a collection of rows), Row (a collection of column families identified by a row key), Column Family (a collection of columns), Column (a column family name plus a column qualifier), Cell (a row key, column, and version holding a value), and Timestamp (the version written alongside each value).
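As a quick illustration of these terms, here is a minimal, hypothetical shell session (the table name 'guru99' and the column family 'education' are the same example names used later in this tutorial):

hbase> create 'guru99', 'education'                      # a table with one column family
hbase> put 'guru99', 'r1', 'education:col1', 'BigData'   # row key r1, column qualifier col1
hbase> get 'guru99', 'r1'                                # returns the cell value with its timestamp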
Column-oriented and Row oriented storages
Column-oriented and row-oriented storages differ in their storage mechanism. Traditional relational models store data in a row-based format, i.e., as rows of data. Column-oriented storages store data tables in terms of columns and column families.
The following table gives some key differences between these two storage types:
Column-oriented Database | Row-oriented Database
Data is stored and retrieved one column at a time | Data is stored and retrieved one row at a time
Efficient when a query reads a few columns across many rows | Efficient when an entire row is read or written at once
Best suited for online analytical processing (OLAP) | Best suited for online transaction processing (OLTP)
Permits high compression rates because columns hold few distinct values | Typical compression mechanisms give less dramatic results
What HBase consists of
HBase consists of the following elements, described in the sections below.
The HBase architecture consists mainly of four components: HMaster, HRegionServer, HRegions, and ZooKeeper, with HDFS as the underlying storage layer.
HMaster is the implementation of the Master server in the HBase architecture. It acts as a monitoring agent for all Region Server instances present in the cluster and serves as the interface for all metadata changes. In a distributed cluster environment, the Master runs on the NameNode, and it runs several background threads.
The following are important roles performed by HMaster in HBase: assigning regions to region servers and re-assigning them for recovery and load balancing, monitoring all region server instances in the cluster, and handling DDL operations such as creating and deleting tables.
Some of the methods exposed by the HMaster interface are primarily metadata-oriented.
The client communicates bi-directionally with both HMaster and ZooKeeper. For read and write operations, it contacts HRegion servers directly. HMaster assigns regions to region servers and, in turn, checks the health status of the region servers.
The architecture contains multiple region servers. The HLog present in each region server stores all the log files.
When a Region Server receives write and read requests from the client, it assigns the request to the specific region where the actual column family resides. The client can contact HRegion servers directly; it does not need HMaster's permission to communicate with them. The client requires HMaster's help only for operations related to metadata and schema changes.
HRegionServer is the Region Server implementation. It is responsible for serving and managing the regions, i.e., the data present in the distributed cluster. The region servers run on the Data Nodes present in the Hadoop cluster.
HMaster can communicate with multiple HRegion servers; each region server performs the following functions: hosting and managing the regions assigned to it, splitting regions automatically when they grow too large, handling read and write requests, and communicating with clients directly.
HRegions are the basic building blocks of an HBase cluster; they hold the distribution of tables and are made up of column families. Each region contains multiple stores, one for each column family, and consists of two main components: the Memstore and the HFile.
Write and Read operations
The read and write operations from the client into an HFile can be shown in the diagram below.
Step 1) The client wants to write data; it first communicates with the Region server and then with the region.
Step 2) The region contacts the Memstore associated with the column family to store the data.
Step 3) Data is first stored in the Memstore, where it is kept sorted, and is then flushed into an HFile. The Memstore accumulates sorted edits keyed by row key before they are persisted to the distributed file system; the Memstore lives in the Region server's main memory, while HFiles are written into HDFS.
Step 4) The client wants to read data from the regions.
Step 5) In turn, the client can directly access the Memstore and request data from it.
Step 6) The client approaches the HFiles to get the data. The data is fetched and returned to the client.
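A flush from the Memstore into an HFile normally happens automatically once the Memstore fills up, but it can also be triggered by hand from the HBase shell. A minimal sketch, assuming a table named 'guru99' already exists:

hbase> flush 'guru99'

This forces the Memstore contents of every region of 'guru99' to be written out as HFiles on HDFS.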
Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase regions is shown from top to bottom in the table below.
Table | HBase table present in the HBase cluster
Region | HRegions for the presented tables
Store | The store for each column family of each region of the table
Memstore | The Memstore for each store of each region of the table; it holds sorted in-memory edits before they are flushed to StoreFiles
StoreFile | StoreFiles for each store of each region of the table
Block | Blocks present inside the StoreFiles
In HBase, ZooKeeper is a centralized monitoring server that maintains configuration information and provides distributed synchronization. Distributed synchronization means coordinating the distributed applications running across the cluster, with ZooKeeper responsible for providing coordination services between nodes. If a client wants to communicate with regions, the client has to approach ZooKeeper first.
ZooKeeper is an open-source project, and it provides many important services.
Services provided by ZooKeeper include maintaining configuration information, providing distributed synchronization, tracking which region servers are alive and available, electing the active HMaster, and notifying about server failures.
The Master and the HBase slave nodes (region servers) register themselves with ZooKeeper. The client needs access to the ZooKeeper (ZK) quorum configuration to connect with the master and region servers.
During a failure of nodes present in the HBase cluster, the ZK quorum will trigger error messages and start repairing the failed nodes.
HDFS:
HDFS (Hadoop Distributed File System), as the name implies, provides a distributed environment for storage; it is a file system designed to run on commodity hardware. It stores each file in multiple blocks, and to maintain fault tolerance, the blocks are replicated across the Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. By adding nodes to the cluster and performing processing and storage on cheap commodity hardware, it gives the client better results than the existing setup.
Here, the data stored in each block is replicated to 3 nodes by default, so if any node goes down there is no loss of data, and a proper backup recovery mechanism is in place.
HDFS interfaces with the HBase components and stores a large amount of data in a distributed manner.
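You can inspect the files HBase keeps on HDFS directly from the command line. A minimal sketch, assuming a distributed setup where hbase.rootdir points at /hbase (the exact path depends on your configuration):

$ hdfs dfs -ls /hbase

This lists the table directories, write-ahead logs, and other data HBase stores on HDFS.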
HBase runs on top of HDFS and Hadoop. Some key differences between HDFS and HBase concern data operations and processing:
HBASE | HDFS
Provides low-latency access to small amounts of data within large data sets | Built for high-latency batch operations
Supports random reads and writes | Supports write-once, read-many sequential access
Accessed through shell commands and client APIs in Java, REST, Avro, or Thrift | Primarily accessed through MapReduce jobs
Stores data as tables of rows and column families | Stores data as flat files
Some typical IT industrial applications use HBase operations along with Hadoop. Applications include stock exchange data and online banking data operations; for their storage and processing, HBase is a well-suited solution.
Conclusion:
HBase is a NoSQL, column-oriented, distributed database available from the Apache Software Foundation. HBase gives better performance than Hadoop or Hive when retrieving a small number of records. It is easy to search for any given input value because it supports indexing, transactions, and updating.
We can perform online real-time analytics using HBase integrated with the Hadoop ecosystem. It has automatic and configurable sharding for data sets or tables, and it provides RESTful APIs as well as integration with MapReduce jobs.
HBase can be installed in three modes: Standalone mode, Pseudo-Distributed mode, and Fully Distributed mode. The features of these modes are mentioned below.
Hadoop must be installed first; refer to the Hadoop installation tutorial.
Step 1) Go to the HBase download link. It will open a webpage as shown below.
Step 2) Select the stable version, here version 1.1.2, as shown below.
Step 3) Click on hbase-1.1.2-bin.tar.gz. It will download a tar file. Copy the tar file to the installation location.
Installation is performed on Ubuntu with Hadoop already installed.
Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser
Step 2) Unzip it by executing the command $tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the contents and create hbase-1.1.2 in the location /home/hduser.
Step 3) Open hbase-env.sh as below and mention the JAVA_HOME path in the location.
Step 4) Open the ~/.bashrc file and mention the HBASE_HOME path as shown below:
export HBASE_HOME=/home/hduser/hbase-1.1.2
export PATH=$PATH:$HBASE_HOME/bin
Step 5) Open hbase-site.xml and place the following properties inside the file
hduser@ubuntu$ gedit hbase-site.xml (code as below)
<property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/HBASE/hbase</value>
</property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/HBASE/zookeeper</value>
</property>
Here we are placing two properties: hbase.rootdir, the directory where HBase stores its data, and hbase.zookeeper.property.dataDir, the directory where ZooKeeper stores its data.
All HMaster and ZooKeeper activities point to this hbase-site.xml.
Step 6) Open the hosts file present in the /etc location and mention the IPs as shown below.
Step 7) Now run start-hbase.sh from the hbase-1.1.2/bin location as shown below.
We can then check with the jps command whether HMaster is running or not.
Step 8) The HBase shell can be started using "hbase shell", which enters interactive shell mode as shown in the screenshot below. Once in shell mode, we can perform all types of commands.
The standalone mode does not require Hadoop daemons to start. HBase can run independently.
This is another method for Hbase Installation, known as Pseudo Distributed mode of Installation. Below are the steps to install HBase through this method.
Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser
Step 2) Unzip it by executing the command $tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the contents and create hbase-1.1.2 in the location /home/hduser.
Step 3) Open hbase-env.sh as shown below, mention the JAVA_HOME path and the region servers' path, and export the command as shown.
Step 4) In this step, open the ~/.bashrc file and mention the HBASE_HOME path as shown in the screenshot.
Step 5) Open hbase-site.xml and mention the properties below in the file. (Code as below)
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
<property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/hbase/zookeeper</value>
</property>
In fully distributed mode, multiple data nodes are present, so we can increase replication by placing a value greater than 1 in the dfs.replication property.
Step 6) Start the Hadoop daemons first, and after that start the HBase daemons as shown below.
Here you first have to start the Hadoop daemons by using the "./start-all.sh" command, as shown below.
After that, start the HBase daemons using start-hbase.sh.
Now check with jps; a sketch of the expected output follows.
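If everything started correctly, jps should list both the Hadoop and HBase daemons. A sketch of typical output on a pseudo-distributed node (the process IDs will differ, and the exact daemon list varies with your Hadoop version):

$ jps
2345 NameNode
2456 DataNode
2567 HQuorumPeer
2678 HMaster
2789 HRegionServer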
After a successful installation of HBase on top of Hadoop, we get an interactive shell to execute various commands and perform several operations. Using these commands, we can perform multiple operations on data tables that give better data-storage efficiency and flexible interaction for the client.
We can interact with HBase in two ways,
In HBase, interactive shell mode is used to interact with HBase for table operations, table management, and data modeling. Using the Java API model, we can perform all types of table and data operations in HBase. We can interact with HBase using both of these methods.
The only difference between the two is that the Java API uses Java code to connect with HBase, while shell mode uses shell commands.
A quick recap of HBase before we proceed: HBase is a column-oriented NoSQL database running on top of HDFS; tables are sorted by row key, and columns are grouped into column families. The examples below use small tables such as 'education' and 'guru99'.
In HBase, general commands are categorized into the following commands: status, version, table_help, and whoami.
To enter the HBase shell, first of all we have to execute the command as mentioned below.
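Assuming HBASE_HOME/bin is on your PATH (as set up in the installation tutorial above), the shell is started with:

$ hbase shell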
Once we enter the HBase shell, we can execute all the shell commands mentioned below. With the help of these commands, we can perform all types of table operations in HBase shell mode.
Let us look at all of these commands and their usage one by one, with examples.
Syntax: hbase(main):001:0> status
This command gives details about the system status, such as the number of servers present in the cluster, the active server count, and the average load value. You can also pass particular parameters depending on how detailed a status you want about the system. The parameters can be 'summary', 'simple', or 'detailed'; the default parameter is 'summary'.
Below we have shown how you can pass different parameters to the status command.
hbase(main):002:0>status 'simple'
hbase(main):003:0>status 'summary'
hbase(main):004:0> status 'detailed'
If we observe the screenshot below, we will get a better idea.
Syntax: hbase(main):001:0> status
When we execute this status command, it gives information about the number of servers present, dead servers, and the average server load; here the screenshot shows 1 live server, 1 dead server, and an average load of 7.0000.
Syntax: hbase(main):005:0> version
This command displays the currently used HBase version.
Syntax: hbase(main):007:0> table_help
This command guides you on how to use table-referenced commands.
Syntax: hbase(main):006:0> whoami
The "whoami" command is used to return the current HBase user information from the HBase cluster.
It will provide information such as the current HBase user name and the groups the user belongs to.
In HBase, column families can be given a time-to-live (TTL) value in seconds. HBase will automatically delete rows once the expiration time is reached. This attribute applies to all versions of a row, even the current version.
The TTL time encoded in HBase for the row is specified in UTC. This attribute is used with table management commands.
Important differences between cell-level TTL handling and column family TTLs: cell TTLs are expressed in units of milliseconds instead of seconds, and a cell TTL cannot extend the effective lifetime of a cell beyond a column family TTL setting.
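A minimal sketch of setting a column family TTL in the shell (the table name 'guru99_ttl' is hypothetical; the TTL value is in seconds):

hbase> create 'guru99_ttl', {NAME => 'cf1', TTL => 3600}

Cells written to 'cf1' will be deleted automatically one hour after their timestamp.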
These commands allow programmers to create tables and table schemas with rows and column families.
The following are the table management commands: create, list, describe, disable, disable_all, enable, show_filters, drop, drop_all, is_enabled, alter, and alter_status.
Let us look at the usage of each command in HBase, with an example.
Syntax: hbase> create <tablename>, <columnfamilyname>
Example:-
hbase(main):001:0> create 'education' ,'guru99'
0 row(s) in 0.312 seconds
=>Hbase::Table – education
The above example shows how to create a table in HBase with the specified name and column family. In addition, we can also pass some table-scope attributes into it.
In order to check whether the table 'education' has been created or not, we use the "list" command as mentioned below.
Syntax: hbase(main):001:0> list
Syntax: hbase> describe <'table name'>
This command describes the named table.
Syntax: hbase> disable <'table name'>
Here, in the above screenshot, we are disabling the table 'education'.
Syntax: hbase> disable_all '<matching regex>'
Syntax: hbase> enable <'table name'>
Syntax: hbase> show_filters
This command displays all the filters present in HBase, like ColumnPrefixFilter, TimestampsFilter, PageFilter, FamilyFilter, etc.
Syntax: hbase> drop <'table name'>
We have to observe the following points for the drop command: the table must be disabled before it can be dropped, and dropping a table removes both its schema and its data from HBase.
Syntax: hbase> drop_all '<regex>'
Syntax: hbase> is_enabled 'education'
This command verifies whether the named table is enabled or not. There is usually a little confusion between the "enable" and "is_enabled" commands, which we clear up here: "enable" actually enables a disabled table, whereas "is_enabled" only checks and reports whether the table is currently enabled.
Syntax: hbase> alter <'tablename'>, NAME => <'column family name'>, VERSIONS => 5
This command alters the column family schema. To understand what exactly it does, we have explained it here with an example.
Examples:
In these examples, we are going to perform alter command operations on tables and on their columns. We will perform operations like: changing the maximum number of cell versions of a column family, performing alter operations on several column families at once, and deleting a column family from a table, as the commands below show.
hbase> alter 'education', NAME=>'guru99_1', VERSIONS=>5
hbase> alter 'edu', 'guru99_1', {NAME => 'guru99_2', IN_MEMORY => true}, {NAME => 'guru99_3', VERSIONS => 5}
Use one of these commands below:
hbase> alter 'education', NAME => 'f1', METHOD => 'delete'
hbase> alter 'education', 'delete' => 'guru99_1'
Syntax: hbase(main):002:0> alter <'tablename'>, MAX_FILESIZE=>'132545224'
Step 1) You can change table-scope attributes like MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc. These can be put at the end; for example, to change the maximum size of a region to 128 MB or any other memory value, we use this command.
Usage:
NOTE: MAX_FILESIZE is a table-scope attribute; table-scope attributes are properties of the table as a whole rather than of a single column family.
Step 2) You can also remove a table-scope attribute using the table_att_unset method, as in the command below.
Syntax: hbase(main):003:0> alter 'education', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
Syntax: hbase>alter_status 'education'
These commands work on the data in a table, such as putting data into a table, retrieving data from a table, and deleting data, etc.
The commands in this category are: count, put, get, delete, deleteall, truncate, and scan.
Let us look at the usage of these commands, with examples.
Syntax: hbase> count <'tablename'>, CACHE =>1000
Example:
This example count command fetches 1000 rows at a time from the 'guru99' table.
We can set the cache to a lower value if the table consists of more rows.
By default, it fetches one row at a time.
hbase> count 'guru99', INTERVAL => 10, CACHE => 1000
Suppose the table 'guru99' has a table reference, say g.
We can run the count command on the table reference as well, like below:
hbase> g.count INTERVAL => 10, CACHE => 1000
Syntax: hbase> put <'tablename'>, <'rowname'>, <'columnname'>, <'value'>
This command is used for the following things: it puts a cell value at the specified table/row/column coordinate, inserting a new row if the row does not exist and writing a new version of the cell if it does.
Example:
hbase> put 'guru99', 'r1', 'c1', 'value', 10
Suppose the table 'guru99' has a table reference, say g. We can also run the command on the table reference, like hbase> g.put 'guru99', 'r1', 'c1', 'value', 10.
The output will be as shown in the above screenshot after placing values into 'guru99'.
To check whether the input value is correctly inserted into the table, we use the "scan" command. In the screenshot below, we can see the values were inserted correctly.
Code Snippet: For Practice
create 'guru99', {NAME=>'Edu', VERSIONS=>213423443}
put 'guru99', 'r1', 'Edu:c1', 'value', 10
put 'guru99', 'r1', 'Edu:c1', 'value', 15
put 'guru99', 'r1', 'Edu:c1', 'value', 30
In this code snippet, we are doing the following things: creating a table 'guru99' with a column family 'Edu' that keeps a very large number of versions, and then putting the value 'value' into cell 'Edu:c1' of row 'r1' three times with explicit timestamps 10, 15, and 30, so that three versions of the cell exist. A command to inspect those versions is sketched below.
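To verify that all three versions were kept, a likely follow-up command is the following (the VERSIONS parameter caps how many versions of the cell are returned):

hbase> get 'guru99', 'r1', {COLUMN => 'Edu:c1', VERSIONS => 3}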
Syntax: hbase> get <'tablename'>, <'rowname'>, {< Additional parameters>}
Here <Additional Parameters> include TIMERANGE, TIMESTAMP, VERSIONS and FILTERS.
By using this command, you get the contents of a row or cell present in the table. In addition to that, you can also add parameters to it, like TIMESTAMP, TIMERANGE, VERSIONS, FILTERS, etc., to get a particular row or cell's content.
Examples:-
For table "guru99' row r1 and column c1 values will display using this command as shown in the above screen shot
For table "guru99"row r1 values will be displayed using this command
For table "guru99" row r1 and column families' c1, c2, c3 values will be displayed using this command
Syntax: hbase> delete <'tablename'>, <'row name'>, <'column name'>
Example:
Syntax: hbase> deleteall <'tablename'>, <'rowname'>
Example:-
hbase>deleteall 'guru99', 'r1', 'c1'
This deletes all the cells in the given row of the table. Optionally, a column name can be mentioned to restrict the deletion.
Syntax: hbase> truncate <tablename>
After truncating an HBase table, the schema is still present but the records are gone. This command performs 3 functions: it disables the table, drops the table, and recreates the table with the same schema.
Syntax: hbase> scan <'tablename'>, {Optional parameters}
This command scans the entire table and displays the table contents.
Examples:-
The different usages of scan command
Command | Usage
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'} | Displays all the metadata information related to the columns present in the tables in HBase
hbase> scan 'guru99', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'} | Displays the contents of table guru99 for column families c1 and c2, limited to 10 rows starting from row 'xyz'
hbase> scan 'guru99', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]} | Displays the contents of guru99 for column c1, restricted to values written within the mentioned time range
hbase> scan 'guru99', {RAW => true, VERSIONS => 10} | RAW => true is an advanced feature that displays all cell versions (including delete markers) present in the table guru99
hbase(main):016:0> scan 'guru99'
The output is shown in the screenshot below.
In the above screenshot, the scan output shows, for each cell of table guru99, the row key, the column name with its timestamp, and the stored value.
Code Snippet:
First create the table and place values into it:
create 'guru99', {NAME=>'e', VERSIONS=>2147483647}
put 'guru99', 'r1', 'e:c1', 'value', 10
put 'guru99', 'r1', 'e:c1', 'value', 12
put 'guru99', 'r1', 'e:c1', 'value', 14
delete 'guru99', 'r1', 'e:c1', 11
Input Screenshot:
If we run the scan command Query: hbase(main):017:0> scan 'guru99', {RAW => true, VERSIONS => 1000}
It will display output shown in below.
Output screen shot:
The output shown in the above screenshot gives the following information: the raw scan returns every stored version of cell 'e:c1' (the puts at timestamps 10, 12, and 14) together with the delete tombstone marker written at timestamp 11, which would otherwise hide the version below it.
Command | Functionality
add_peer | Adds a peer cluster to replicate to, e.g. hbase> add_peer '3', "zk1,zk2,zk3:2182:/hbase-prod"
remove_peer | Stops the defined replication stream and deletes all the metadata information about the peer, e.g. hbase> remove_peer '1'
start_replication | Restarts all the replication features, e.g. hbase> start_replication
stop_replication | Stops all the replication features, e.g. hbase> stop_replication
Summary:
The HBase shell and general commands give complete information about the different types of data manipulation, table management, and cluster replication commands. We can perform various functions using these commands on the tables present in HBase.
HBase is a column-oriented NoSQL database for storing large amounts of data on top of the Hadoop ecosystem. Handling tables in HBase is crucial, because all the important functionality, such as data operations, data enhancements, and data modeling, is performed through tables in HBase.
Handling tables involves functions such as creating tables with column families, listing and describing tables, inserting data, and retrieving data.
In HBase, we can perform table operations in two ways: through the HBase shell and through the Java client API.
We already have seen how we can perform shell commands and operations in HBase. In this tutorial, we are going to perform some of the operations using Java coding through Java API.
Through Java API, we can create tables in HBase and also load data into tables using Java coding.
In this section, we are going to create tables with column families and rows by connecting to HBase through the Java API.
Establishing a connection through the Java API:
The following steps guide us in developing the Java code to connect to HBase through the Java API.
Step 1) In this step, we are going to create a Java project in Eclipse for the HBase connection.
Create a new project named "HbaseConnection" in Eclipse.
For Java-related project setup or program creation, refer to /java-tutorial.html.
The screenshot above shows the resulting project structure.
Step 2) On the Eclipse home page, follow these steps:
Right-click on the project -> Select Build Path -> Configure Build Path
These options are shown in the screenshot above.
After clicking Configure Build Path, another window opens, as shown in the screenshot below.
In this step, we add the relevant HBase jars to the Java project, as shown in the screenshot.
After adding these jars, they will show under the project "src" location. All the jar files that fall under the project are now ready for use with the Hadoop ecosystem.
Step 3) In this step, the HBase connection is established through Java coding, using HBaseConnection.java.
From the screenshot above, we are performing the functions described below.
The code below is going to establish a connection with HBase and create a table "guru99" with two column families, "education" and "projects".
Code placed under the HBaseConnection_Java document:
// Place this code inside HBaseConnection.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class HBaseConnection {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml / hbase-default.xml from the classpath
        HBaseConfiguration hc = new HBaseConfiguration(new Configuration());
        // Table descriptor for table "guru99" with two column families
        HTableDescriptor ht = new HTableDescriptor("guru99");
        ht.addFamily(new HColumnDescriptor("education"));
        ht.addFamily(new HColumnDescriptor("projects"));
        System.out.println("connecting");
        // Admin interface used to create the table
        HBaseAdmin hba = new HBaseAdmin(hc);
        System.out.println("Creating Table");
        hba.createTable(ht);
        System.out.println("Done......");
    }
}
This is the required code; place it in HBaseConnection.java and run the Java program.
After running this program, it will establish a connection with HBase and in turn create a table with the column families named above.
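The tutorial runs the program from Eclipse, but if you prefer the command line, here is a sketch using the hbase classpath helper (this assumes the hbase launcher script from the installation is on your PATH):

$ javac -cp "$(hbase classpath)" HBaseConnection.java
$ java -cp ".:$(hbase classpath)" HBaseConnection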
Step 4) We can check whether the "guru99" table has been created with its two column families in HBase by using the "list" command in HBase shell mode.
The "list" command gives information about all the tables that have been created in HBase.
Refer to the "HBase Shell and General Commands" article for more information on the "list" command.
In this screen, we run the "list" command in the HBase shell and confirm that the table "guru99" is present.
In this section, we are going to write data into the HBase table "guru99" and then read it back.
For example, we will insert values into the 'education' and 'projects' column families of row 'row1' and then retrieve and display those values.
Here is the Java code to be placed in HBaseLoading.java, as shown below, for both writing and retrieving data.
Code Placed under HBaseLoading_Java document
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLoading {
    public static void main(String[] args) throws IOException {
        // When you create an HBaseConfiguration, it reads in whatever you've set
        // in your hbase-site.xml and hbase-default.xml, as long as these can be
        // found on the CLASSPATH
        Configuration config = HBaseConfiguration.create();
        // This instantiates an HTable object that connects you to the "guru99" table
        HTable table = new HTable(config, "guru99");
        // To add to a row, use Put. A Put constructor takes the name of the row
        // you want to insert into as a byte array.
        Put p = new Put(Bytes.toBytes("row1"));
        // To set the value you'd like to update in the row 'row1', specify the
        // column family, column qualifier, and value of the table cell you'd like
        // to update. The column family must already exist in your table schema;
        // the qualifier can be anything.
        p.add(Bytes.toBytes("education"), Bytes.toBytes("col1"), Bytes.toBytes("BigData"));
        p.add(Bytes.toBytes("projects"), Bytes.toBytes("col2"), Bytes.toBytes("HBaseTutorials"));
        // Once you've adorned your Put instance with all the updates you want to
        // make, commit it as follows
        table.put(p);
        // Now retrieve the data we just wrote
        Get g = new Get(Bytes.toBytes("row1"));
        Result r = table.get(g);
        byte[] value = r.getValue(Bytes.toBytes("education"), Bytes.toBytes("col1"));
        byte[] value1 = r.getValue(Bytes.toBytes("projects"), Bytes.toBytes("col2"));
        String valueStr = Bytes.toString(value);
        String valueStr1 = Bytes.toString(value1);
        System.out.println("GET: " + "education: " + valueStr + " projects: " + valueStr1);
        // Scan the two columns and print every row found
        Scan s = new Scan();
        s.addColumn(Bytes.toBytes("education"), Bytes.toBytes("col1"));
        s.addColumn(Bytes.toBytes("projects"), Bytes.toBytes("col2"));
        ResultScanner scanner = table.getScanner(s);
        try {
            for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                System.out.println("Found row : " + rr);
            }
        } finally {
            // Make sure you close your scanners when you are done!
            scanner.close();
        }
    }
}
First of all, we will see how to write data, and then we will see how to read data from an HBase table.
Step 1) In this step, we are going to write data into the HBase table "guru99".
First we have to write the code for inserting and retrieving values from HBase, using the HBaseLoading.java program.
For creating and inserting values into a table at the column level, you have to code as below.
The screenshot above highlights the portion of the code that writes the values.
Step 2) Here we fetch and display the values that we placed in the HBase table in Step 1.
For retrieving the results stored in "guru99":
The above screenshot shows the data being read from the HBase table 'guru99'.
Once the code is done, run it as a Java application (in Eclipse: right-click the class -> Run As -> Java Application).
In this section, we check the output produced by the program.
From the screenshot above, we can see the values written by the program printed back: the GET line showing the 'education' and 'projects' values, followed by the rows found by the scanner.
Summary:
As discussed in this article, we are now familiar with how to create tables, load data into tables, and retrieve data from tables using the Java API. We can perform all types of shell command functionality through this Java API, and it establishes good client communication with the HBase environment.
In our next article, we will look at troubleshooting HBase problems.
The HBase architecture has a "single point of failure" in the HMaster, and there is no exception handling mechanism associated with it.
Problems with HBase
Advantages of HBase: it works well for analytics in association with Hadoop MapReduce, it can handle very large volumes of data, and it supports scaling out in coordination with Hadoop.
Limitations of HBase: it does not support SQL structures or joins, it does not support transactions spanning multiple rows, and it relies on a single HMaster, which can act as a single point of failure.
Cause: the region server's hostname is mapped to the loopback address in /etc/hosts, so the master and region servers cannot reach each other and HBase fails to start correctly.
Solution: edit /etc/hosts so that the loopback entry no longer contains the region server's name.
What to change:
Open /etc/hosts and go to this entry:
127.0.0.1 fully.qualified.regionservername regionservername localhost.localdomain localhost
::1 localhost3.localdomain3 localdomain3
Modify the above configuration as below (remove the region server name, as highlighted above):
127.0.0.1 localhost.localdomain localhost
::1 localhost3.localdomain3 localdomain3
Cause:
Solution:
Cause:
1) The root directory does not exist
2) A previous running instance of HBase initialized it before
Solution:
Step 1) Use Hadoop dfs to delete the HBase root directory (a command sketch follows these steps).
Step 2) HBase creates and initializes the directory by itself.
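A sketch of the deletion command for Step 1 (the path must match your hbase.rootdir setting; /hbase is the common default in distributed mode — do not run this if you need to keep existing table data):

$ hdfs dfs -rm -r /hbase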
Cause: the log line "We slept 79410ms, ten times longer than scheduled" points to a long JVM pause, after which the ZooKeeper session timeout is exceeded and the session expires.
The following shows the exceptions thrown because of the ZooKeeper session-expired event.
The highlighted events are some of the exceptions that occurred in the log file.
Log file contents are displayed below:
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
java.io.IOException: TIMED OUT
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000
INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT
INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]
INFO org.apache.zookeeper.ClientCnxn: Server connection successful
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
java.io.IOException: Session Expired
    at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
Solution: increase the ZooKeeper session timeout and tick time in hbase-site.xml, as below.
<property>
    <name>zookeeper.session.timeout</name>
    <value>1200000</value>
</property>
<property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
</property>
Hive is an open-source data warehouse initially developed by Facebook for analyzing and querying datasets; it is now under the Apache Software Foundation.
Hive is developed on top of Hadoop as a data warehouse framework for querying and analyzing data stored in HDFS.
Hive is useful for performing operations like data encapsulation, ad-hoc queries, and analysis of huge datasets. Hive's design reflects its targeted use as a system for managing and querying structured data.
Features | HBase | Hive |
Database model | Wide column store | Relational DBMS
Data Schema | Schema-free | With schema
SQL Support | No | Yes, it uses HQL (Hive Query Language)
Partition methods | Sharding | Sharding |
Consistency Level | Immediate Consistency | Eventual Consistency |
Secondary indexes | No | Yes |
Replication Methods | Selectable replication factor | Selectable replication factor |
While comparing HBase with Traditional Relational databases, we have to take three key areas into consideration. Those are data model, data storage, and data diversity.
HBASE | RDBMS
Schema-less; only column families are defined, not fixed columns | Governed by a fixed schema that describes the whole structure of the tables
Column-oriented data store | Row-oriented data store
Designed for wide, sparsely populated tables | Designed for thin tables
Stores denormalized data | Stores normalized data
Scales horizontally by adding commodity servers | Typically scales vertically; scaling out is hard
Suited to structured as well as semi-structured data | Suited to structured data only
No built-in support for SQL, joins, or multi-row transactions | Supports SQL, joins, and ACID transactions
Handles billions of rows with fast random read/write access | Performance degrades as data volumes become very large
Partitioning (sharding) of tables is automatic | Partitioning must usually be managed manually
1) Explain what is Hbase?
HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.
In HBase, a master node regulates the cluster, and region servers store portions of the tables and perform the work on the data.
2) Explain why to use HBase?
HBase is used because it provides high-capacity storage, random real-time read/write access on top of HDFS, a column-oriented data model, and horizontal scalability across commodity hardware.
3) Mention what are the key components of HBase?
The key components of HBase are ZooKeeper, the catalog tables, HMaster, the region servers, the regions, and HDFS as the underlying storage.
4) Explain what HBase consists of?
HBase consists of a set of tables; each table contains rows identified by a row key and one or more column families, and each column family in turn holds the columns.
5) Mention how many operational commands there are in HBase?
There are about five types of operational commands in HBase: Get, Put, Delete, Scan, and Increment.
6) Explain what is WAL and Hlog in Hbase?
WAL (Write Ahead Log) is similar to the MySQL BIN log; it records all the changes that occur in the data. It is a standard Hadoop sequence file, and it stores HLogKeys. These keys consist of a sequential number as well as the actual data and are used to replay data that has not yet been persisted after a server crash. So, in case of server failure, the WAL works as a lifeline and retrieves the lost data.
7) When should you use HBase?
Use HBase when the data volume is huge (hundreds of millions or billions of rows), when random, real-time read/write access to big data is needed, when you can do without RDBMS features such as typed columns, secondary indexes, and transactions, and when enough hardware is available for a Hadoop cluster.
8) In HBase, what are column families?
Column families comprise the basic unit of physical storage in HBase, to which features like compression are applied; a shell sketch follows.
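For example, a minimal sketch of enabling compression when defining a column family in the shell (GZ is used for illustration; codecs such as SNAPPY or LZO require the corresponding native libraries):

hbase> create 'guru99', {NAME => 'cf1', COMPRESSION => 'GZ'}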
9) Explain what is the row key?
Row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
10) Explain deletion in HBase. What are the three types of tombstone markers in HBase?
When you delete a cell in HBase, the data is not actually deleted; instead, a tombstone marker is set, making the deleted cells invisible. The deleted data is actually removed during compactions.
The three types of tombstone markers are: the version delete marker (marks a single version of a column for deletion), the column delete marker (marks all versions of a column), and the family delete marker (marks all columns of a column family).
11) Explain how does Hbase actually delete a row?
In HBase, whatever you write is stored from RAM to disk, and these disk writes are immutable, barring compaction. During the deletion process in HBase, the major compaction process removes the delete markers, while minor compactions don't. A normal delete results in a delete tombstone marker; the deleted data it represents is removed during compaction.
Also, if you delete data and then add more data with an earlier timestamp than the tombstone's timestamp, subsequent Gets may be masked by the delete/tombstone marker, so you will not receive the inserted value until after the next major compaction.
12) Explain what happens if you alter the block size of a column family on an already occupied database?
When you alter the block size of a column family, the new data occupies the new block size while the old data remains within the old block size. During data compaction, the old data will take on the new block size. New files, as they are flushed, have the new block size, whereas existing data will continue to be read correctly. After the next major compaction, all data should be transformed to the new block size. An example of such an alter is sketched below.
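A minimal sketch of changing a column family's block size in the shell (the table and column family names are illustrative; BLOCKSIZE is given in bytes):

hbase> alter 'education', {NAME => 'guru99', BLOCKSIZE => '65536'}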
13) Mention the difference between Hbase and Relational Database?
Hbase | Relational Database
It is schema-less and column-oriented, storing denormalized data in wide, sparsely populated tables | It is schema-based and row-oriented, storing normalized data in thin tables
It scales out automatically through built-in partitioning (sharding) | It has no built-in support for partitioning and typically scales vertically