HBase Tutorials for Beginners
Posted by Superadmin on January 02 2019 07:03:10

HBase Tutorials for Beginners

 

HBase is an open source, distributed database, developed by Apache Software foundation.

Initially, it was Google Big Table, afterwards it was re-named as HBase and is primarily written in Java.

HBase can store massive amounts of data from terabytes to petabytes.

HBase Unique Features

Here is what we cover in the Tutorial series

  Tutorial HBase Architecture, Data Flow, and Use cases
  Tutorial How to Download & Install Hbase
  Tutorial HBase Shell and General Commands
  Tutorial Create, Insert, Read Tables in HBase
  Tutorial HBase: Limitations, Advantage & Problems
  Tutorial HBase Troubleshooting
 Tutorial HBase Vs Hive
 Tutorial Hbase Interview Questions & Answers

Why to Choose HBase?

A table for a popular web application may consist of billions of rows. If we want to search particular row from such huge amount of data, HBase is the ideal choice as query fetch time in less. Most of the online analytics applications uses HBase.

Traditional relational data models fail to meet performance requirements of very big databases. These performance and processing limitations can be overcomed by HBase.

Importance of NoSQL Databases in Hadoop

In big data analytics, Hadoop plays a vital role in solving typical business problems by managing large data sets and gives best solutions in analytics domain.

In Hadoop ecosystem, each component plays its unique role for the

In terms of storing unstructured, semi-structured data storage as well as retrieval of such data's, relational databases are less useful. Also, fetching results by applying query on huge data sets that are stored in Hadoop storage is a challenging task. NoSQL storage technologies provide the best solution for faster querying on huge data sets.

Other NoSQL storage type Databases

Some of the NoSQL models present in the market are Cassandra, MongoDB, and CouchDB. Each of these models has different ways of storage mechanism.

For example, MongoDB is a document-oriented database from NoSQL family tree. Compared to traditional databases it provides best features in terms of performance, availability and scalability. It is an open source document-oriented database, and it's written in C++.

Cassandra is also a distributed database from open source Apache software which is designed to handle a huge amount of data stored across commodity servers. Cassandra provides high availability with no single point of failure.

While, CouchDB is a document-oriented database in which each document fields are stored in key-value maps.

How HBase different from other NoSQL model

HBase storage model is different from other NoSQL models discussed above. This can be stated as follow

Which NoSQL Database to choose?

MongoDB, CouchDB, and Cassandra are of NoSQL type databases that are feature specific and used as per their business needs. Here, we have listed out different NoSQL database as per their use case.

Data Base Type Based on Feature Example of Database Use case (When to Use)
Key/ Value Redis, MemcacheDB Caching, Queue-ing, Distributing information
Column Oriented Cassandra, HBase Scaling, Keeping Unstructured, non-volatile
Document Oriented MongoDB, Couchbase Nested Information, JavaScript friendly
Graph Based OrientDB, Neo4J Handling Complex relational information. Modeling and Handling classification.

Where is HBase used?

Telecom Industry

Problem Statement:

Solution:

HBase is used to store billions of rows of call detailed records. If 20TB of data is added per month to the existing RDBMS database, performance will deteriorate. To handle a large amount of data in this use case, HBase is the best solution. HBase performs fast querying and display records.

Banking Industry

Problem Statement:

The Banking industry generates millions of records on a daily basis. In addition to this, banking industry also needs analytics solution that can detect Fraud in money transactions.

Solution:

To store, process and update huge volumes of data and performing analytics, an ideal solution is - HBase integrated with several Hadoop eco system components.

That apart, HBase can be used -

Summary:-

HBase provides unique features and will solve typical industrial use cases. As a column-oriented storage, it provides fast querying, fetching of results and high amount of data storage.

 

 

 

HBase is an open-source, column-oriented distributed database system in Hadoop environment.Apache HBase is needed for real-time Big Data applications. The tables present in HBase consists of billions of rows having millions of columns.

Hbase is built for low latency operations, which is having some specific features compared to traditional relational models. 




In this tutorial- you will learn,

Storage Mechanism in Hbase

HBase is a column-oriented database and data is stored in tables. The tables are sorted by RowId. As shown below, HBase has RowId, which is the collection of several column families that are present in the table.

HBase Architecture, Data Flow, and Use cases

The column families that are present in the schema are key-value pairs. If we observe in detail each column family having a multiple numbers of columns. The column values stored in to disk memory. Each cell of the table has its own Meta data like time stamp and other information.

Coming to HBase the following are the key terms representing table schema

Column-oriented and Row oriented storages

Column and Row oriented storages differ in their storage mechanism. As we all know traditional relational models store data in terms of row-based format like in terms of rows of data. Column-oriented storages store data tables in terms of columns and column families.

The following Table gives some key differences between these two storages

Column-oriented Database Row oriented Database
  • When the situation comes to process and analytics we use this approach. Such as Online Analytical Processing and it's applications.
  • Online Transactional process such as banking and finance domains use this approach.
  • The amount of data that can able to store in this model is very huge like in terms of petabytes
  • It is designed for a small number of rows and columns.

What HBase consists of

HBase consists of following elements,

HBase Architecture and its Important Components

HBase Architecture, Data Flow, and Use cases

HBase architecture consists mainly of four components

HMaster:

HMaster is the implementation of Master server in HBase architecture. It acts like monitoring agent to monitor all Region Server instances present in the cluster and acts as an interface for all the metadata changes. In a distributed cluster environment, Master runs on NameNode. Master runs several background threads.

The following are important roles performed by HMaster in HBase.

Some of the methods exposed by HMaster Interface are primarily Metadata oriented methods.

The client communicates in a bi-directional way with both HMaster and ZooKeeper. For read and write operations, it directly contacts with HRegion servers. HMaster assigns regions to region servers and in turn check the health status of region servers.

In entire architecture, we have multiple region servers. Hlog present in region servers which are going to store all the log files.

HRegions Servers:

When Region Server receives writes and read requests from the client, it assigns the request to a specific region, where actual column family resides. However, the client can directly contact with HRegion servers, there is no need of HMaster mandatory permission to the client regarding communication with HRegion servers. The client requires HMaster help when operations related to metadata and schema changes are required.

HRegionServer is the Region Server implementation. It is responsible for serving and managing regions or data that is present in distributed cluster. The region servers run on Data Nodes present in the Hadoop cluster.

HMaster can get into contact with multiple HRegion servers and performs the following functions.

HRegions:

HRegions are the basic building elements of HBase cluster that consists of the distribution of tables and are comprised of Column families. It contains multiple stores, one for each column family. It consists of mainly two components, which are Memstore and Hfile.

Data flow in HBase

HBase Architecture, Data Flow, and Use cases

Write and Read operations

The Read and Write operations from Client into Hfile can be shown in below diagram.

Step 1) Client wants to write data and in turn first communicates with Regions server and then regions

Step 2) Regions contacting memstore for storing associated with the column family

Step 3) First data stores into Memstore, where the data is sorted and after that it flushes into HFile. The main reason for using Memstore is to store data in Distributed file system based on Row Key. Memstore will be placed in Region server main memory while HFiles are written into HDFS.

Step 4) Client wants to read data from Regions

Step 5) In turn Client can have direct access to Mem store, and it can request for data.

Step 6) Client approaches HFiles to get the data. The data are fetched and retrieved by the Client.

Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase Regions is as shown from top to bottom in below table.

Table HBase table present in the HBase cluster
Region HRegions for the presented tables
Store It store per ColumnFamily for each region for the table
Memstore
  • Memstore for each store for each region for the table
  • It sorts data before flushing into HFiles
  • Write and read performance will increase because of sorting
StoreFile StoreFiles for each store for each region for the table
Block Blocks present inside StoreFiles

ZooKeeper:

In Hbase, Zookeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. Distributed synchronization is to access the distributed applications running across the cluster with the responsibility of providing coordination services between nodes. If the client wants to communicate with regions, the servers client has to approach ZooKeeper first.

It is an open source project, and it provides so many important services.

Services provided by ZooKeeper

Master and HBase slave nodes ( region servers) registered themselves with ZooKeeper. The client needs access to ZK(zookeeper) quorum configuration to connect with master and region servers.

During a failure of nodes that present in HBase cluster, ZKquoram will trigger error messages, and it starts to repair the failed nodes.

HDFS:-

HDFS is Hadoop distributed file system, as the name implies it provides distributed environment for the storage and it is a file system designed in a way to run on commodity hardware. It stores each file in multiple blocks and to maintain fault tolerance, the blocks are replicated across Hadoop cluster.

HDFS provides a high degree of fault –tolerance and runs on cheap commodity hardware. By adding nodes to the cluster and performing processing & storing by using the cheap commodity hardware, it will give client better results as compared to existing one.

In here, the data stored in each block replicates into 3 nodes any in case when any node goes down there will be no loss of data, it will have proper backup recovery mechanism.

HDFS get in contact with the HBase components and stores large amount of data in distributed manner.

HBase vs. HDFS

HBase runs on top of HDFS and Hadoop. Some key differences between HDFS and HBase are in terms of data operations and processing.

HBASE

HDFS

  • Low latency operations
  • High latency operations
  • Random reads and writes
  • Write once Read many times
  • Accessed through shell commands, client API in java, REST, Avro or Thrift
  • Primarly accessed through MR (Map Reduce) jobs
  • Storage and process both can be perform
  • It's only for storage areas

Some typical IT industrial applications use HBase operations along with Hadoop. Applications include stock exchange data, online banking data operations, and processing Hbase is best-suited solution method.

Conclusion:-

Hbase is one of NoSql column-oriented distributed database available in apache foundation. HBase gives more performance for retrieving fewer records rather than Hadoop or Hive. It's very easy to search for given any input value because it supports indexing, transactions, and updating.

We can perform online real-time analytics using Hbase integrated with Hadoop eco system. It has an automatic and configurable sharding for datasets or tables, and provides restful API's to perform the MapReduce jobs.

 

 

HBase can be installed in three modes. The features of these modes are mentioned below.

  1. Standalone mode installation (No dependency on Hadoop system)
  1. Pseudo-Distributed mode installation ( Single node Hadoop system + HBase installation)
  1. Fully Distributed mode installation ( MultinodeHadoop environment + HBase installation)

For Hadoop installation Refer this URL Here

In this tutorial- you will learn,

How to Download Hbase tar file stable version

Step 1) Go to the link here to download HBase. It will open a webpage as shown below.

How to Download & Install Hbase

Step 2) Select stable version as shown below 1.1.2 version 

How to Download & Install Hbase

Step 3) Click on the hbase-1.1.2-bin.tar.gz. It will download tar file. Copy the tar file into an installation location.

How to Download & Install Hbase

Hbase - Standalone mode installation:

Installation is performed on Ubuntu with Hadoop already installed.

Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser 

Step 2) Unzip it by executing command $tar -xvf hbase-1.1.2-bin.tar.gzIt will unzip the contents, and it will create hbase-1.1.2 in the location /home/hduser 

Step 3) Open hbase-env.sh as below and mention JAVA_HOME path in the location. 

How to Download & Install Hbase

Step 4) Open ~/.bashrc file and mention HBASE_HOME path as shown in below 

export HBASE_HOME=/home/hduser/hbase-1.1.1 export PATH= $PATH:$HBASE_HOME/bin

How to Download & Install Hbase

Step 5) Open hbase-site.xml and place the following properties inside the file

hduser@ubuntu$ gedit hbase-site.xml(code as below)

<property>

<name>hbase.rootdir</name>

<value>file:///home/hduser/HBASE/hbase</value>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/home/hduser/HBASE/zookeeper</value>

</property>

How to Download & Install Hbase

Here we are placing two properties

All HMaster and ZooKeeper activities point out to this hbase-site.xml.

Step 6) Open hosts file present in /etc. location and mention the IPs as shown in below. 

How to Download & Install Hbase

Step 7) Now Run Start-hbase.sh in hbase-1.1.1/bin location as shown below. 

And we can check by jps command to see HMaster is running or not.

How to Download & Install Hbase

Step8) HBase shell can start by using "hbase shell" and it will enter into interactive shell mode as shown in below screenshot. Once it enters into shell mode, we can perform all type of commands. 

How to Download & Install Hbase

The standalone mode does not require Hadoop daemons to start. HBase can run independently.

Hbase - Pseudo Distributed mode of installation:

This is another method for Hbase Installation, known as Pseudo Distributed mode of Installation. Below are the steps to install HBase through this method.

Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser 

Step 2) Unzip it by executing command$tar -xvf hbase-1.1.2-bin.tar.gzIt will unzip the contents, and it will create hbase-1.1.2 in the location /home/hduser 

Step 3) Open hbase-env.sh as following below and mention JAVA_HOME path and Region servers' path in the location and export the command as shown 

How to Download & Install Hbase

Step 4) In this step, we are going to open ~/.bashrc file and mention the HBASE_HOME path as shown in screen-shot. 

How to Download & Install Hbase

Step 5) Open HBase-site.xml and mention the below properties in the file.(Code as below) 

<property>

<name>hbase.rootdir</name>

<value>hdfs://localhost:9000/hbase</value>

</property>

<property>

<name>hbase.cluster.distributed</name>

<value>true</value>

</property>

<property>

<name>hbase.zookeeper.quorum</name>

<value>localhost</value>

</property>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>hbase.zookeeper.property.clientPort</name>

<value>2181</value>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/home/hduser/hbase/zookeeper</value>

</property>

How to Download & Install Hbase

How to Download & Install Hbase

  1. Setting up Hbase root directory in this property
  2. For distributed set up we have to set this property
  3. ZooKeeper quorum property should be set up here
  4. Replication set up done in this property. By default we are placing replication as 1.

    In the fully distributed mode, multiple data nodes present so we can increase replication by placing more than 1 value in the dfs.replication property

  5. Client port should be mentioned in this property
  6. ZooKeeper data directory can be mentioned in this property

Step 6) Start Hadoop daemons first and after that start HBase daemons as shown below 

Here first you have to start Hadoop daemons by using"./start-all.sh" command as shown in below.

How to Download & Install Hbase

After starting Hbase daemons by hbase-start.sh

How to Download & Install Hbase

Now check jps

How to Download & Install Hbase

Hbase - Fully Distributed mode installation:-

After successful installation of HBase on top of Hadoop, we get an interactive shell to execute various commands and perform several operations. Using these commands, we can perform multiple operations on data-tables that can give better data storage efficiencies and flexible interaction by the client.

We can interact with HBase in two ways,

In HBase, interactive shell mode is used to interact with HBase for table operations, table management, and data modeling. By using Java API model, we can perform all type of table and data operations in HBase. We can interact with HBase using this both methods.

The only difference between these two is Java API use java code to connect with HBase and shell mode use shell commands to connect with HBase.

Quick overcap of HBase before we proceed-

For examples,

In this tutorial- you will learn,

General commands

In Hbase, general commands are categorized into following commands

To get enter into HBase shell command, first of all, we have to execute the code as mentioned below

HBase Shell and General Commands

Once we get to enter into HBase shell, we can execute all shell commands mentioned below. With the help of these commands, we can perform all type of table operations in the HBase shell mode.

Let us look into all of these commands and their usage one by one with an example.

This command will give details about the system status like a number of servers present in the cluster, active server count, and average load value. You can also pass any particular parameters depending on how detailed status you want to know about the system. The parameters can be 'summary', 'simple', or 'detailed', the default parameter provided is "summary".

Below we have shown how you can pass different parameters to the status command. 

hbase(main):002:0>status 'simple'

hbase(main):003:0>status 'summary'

hbase(main):004:0> status 'detailed'

If we observe the below screen shot, we will get a better idea.

Syntax: hbase(main):001:0>status

When we execute this command status, it will give information about number of server's present, dead servers and average load of server, here in screenshot it shows the information like- 1 live server, 1 dead servers, and 7.0000 average load.

HBase Shell and General Commands

Tables Managements commands

These commands will allow programmers to create tables and table schemas with rows and column families.

The following are Table Management commands

Let us look into various command usage in HBase with an example.

This command describes the named table.

Here, in the above screenshot we are disabling table education

HBase Shell and General Commands 

This command alters the column family schema. To understand what exactly it does, we have explained it here with an example. 

Examples:

In these examples, we are going to perform alter command operations on tables and on its columns. We will perform operations like

  1. To change or add the 'guru99_1' column family in table 'education' from current value to keep a maximum of 5 cell VERSIONS,

hbase> alter 'education', NAME=>'guru99_1', VERSIONS=>5

  1. You can also operate the alter command on several column families as well. For example, we will define two new column to our existing table "education". 

hbase> alter 'edu', 'guru99_1', {NAME => 'guru99_2', IN_MEMORY => true}, {NAME => 'guru99_3', VERSIONS => 5}

HBase Shell and General Commands 

  1. In this step, we will see how to delete column family from the table. To delete the 'f1' column family in table 'education'. 

Use one ofthese commands below,

hbase> alter 'education', NAME => 'f1', METHOD => 'delete'

hbase> alter 'education', 'delete' =>' guru99_1'

HBase Shell and General Commands 

  1. As shown in the below screen shots, it shows two steps – how to change table scope attribute and how to remove the table scope attribute.

 Syntax: hbase(main):002:0> alter <'tablename'>, MAX_FILESIZE=>'132545224' 

HBase Shell and General Commands

Step 1) You can change table-scope attributes like MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc. These can be put at the end;for example, to change the max size of a region to 128MB or any other memory value we use this command. 

Usage:

NOTE: MAX_FILESIZE Attribute Table scope will be determined by some attributes present in the HBase. MAX_FILESIZE also come under table scope attributes.

Step 2) You can also remove a table-scope attribute using table_att_unset method. If you see the command 

Syntax: hbase(main):003:0> alter 'education', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE' 

Data manipulation commands

These commands will work on the table related to data manipulations such as putting data into a table, retrieving data from a table and deleting schema, etc.

The commands come under these are

Let look into these commands usage with an example.

Example:

HBase Shell and General Commands

  1. hbase> count 'guru99', CACHE=>1000

This example count fetches 1000 rows at a time from "Guru99" table.

We can make cache to some lower value if the table consists of more rows.

But by default it will fetch one row at a time.

  1. hbase>count 'guru99', INTERVAL => 100000

    hbase> count 'guru99', INTERVAL =>10, CACHE=> 1000

    If suppose if the table "Guru99" having some table reference like say g.

    We can run the count command on table reference also like below

  2. hbase>g.count INTERVAL=>100000

    hbase>g.count INTERVAL=>10, CACHE=>1000

To check whether the input value is correctly inserted into the table, we use "scan" command. In the below screen shot, we can see the values are inserted correctly

HBase Shell and General Commands

Code Snippet: For Practice

create 'guru99', {NAME=>'Edu', VERSIONS=>213423443}

put 'guru99', 'r1', 'Edu:c1', 'value', 10

put 'guru99', 'r1', 'Edu:c1', 'value', 15

put 'guru99', 'r1', 'Edu:c1', 'value', 30

From the code snippet, we are doing these things

By using this command, you will get a row or cell contents present in the table. In addition to that you can also add additional parameters to it like TIMESTAMP, TIMERANGE,VERSIONS, FILTERS, etc. to get a particular row or cell content.

HBase Shell and General Commands

Examples:-

       For table "guru99"row 1 values in the time range ts1 and ts2 will be displayed using this command

Command

Usage

hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}

It display all the meta data information related to columns that are present in the tables in HBase

hbase> scan 'guru99', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}

It display contents of table guru99 with their column families c1 and c2 limiting the values to 10

hbase> scan 'guru99', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}

It display contents of guru99 with its column name c1 with the values present in between the mentioned time range attribute value

hbase> scan 'guru99', {RAW => true, VERSIONS =>10}

In this command RAW=> true provides advanced feature like to display all the cell values present in the table guru99

hbase(main):016:0> scan 'guru99'

The output as below shown in screen shot

HBase Shell and General Commands

In the above screen shot

Code Snippet:-

First create table and place values into table

create 'guru99', {NAME=>'e', VERSIONS=>2147483647}

put 'guru99', 'r1', 'e:c1', 'value', 10

put 'guru99', 'r1', 'e:c1', 'value', 12

put 'guru99', 'r1', 'e:c1', 'value', 14

delete 'guru99', 'r1', 'e:c1', 11

Input Screenshot:

HBase Shell and General Commands

If we run scan command Query:hbase(main):017:0> scan 'guru99', {RAW=>true, VERSIONS=>1000}

It will display output shown in below.

Output screen shot:

HBase Shell and General Commands

The output shown in above screen shot gives the following information

Cluster Replication Commands

Command

Functionality

add_peer

Add peers to cluster to replicate

hbase> add_peer '3', zk1,zk2,zk3:2182:/hbase-prod

remove_peer

Stops the defined replication stream.

Deletes all the metadata information about the peer

hbase> remove_peer '1'

start_replication

Restarts all the replication features

hbase> start_replication

stop_replication

Stops all the replication features

hbase>stop_replication

Summary:

HBase shell and general commands give complete information about different type of data manipulation, table management, and cluster replication commands. We can perform various functions using these commands on tables present in HBase.

 

 

Hbase is a column oriented NoSql database for storing a large amount of data on top of Hadoop eco system. Handling tables in Hbase is a very crucial thing because all important functionalities such as Data operations, Data enhancements and Data modeling we can perform it through tables only in HBase.

Handling tables performs the following functions

In HBase, we can perform table operations in two ways

We already have seen how we can perform shell commands and operations in HBase. In this tutorial, we are going to perform some of the operations using Java coding through Java API.

Through Java API, we can create tables in HBase and also load data into tables using Java coding.

In this tutorial - we will learn,

HBase create table with Rows and Column names

In this section, we are going to create tables with column families and rows by

Establishing connection through Java API:

The Following steps guide us to develop Java code to connect HBase through Java API.

Step 1) In this step, we are going to create Java project in eclipse for HBase connection.

Creation of new project name "HbaseConnection" in eclipse.

For Java related project set up or creation of program: Refer /java-tutorial.html

Create, Insert, Read Tables in HBase

If we observe the screen shot above.

  1. Give project name in this box. In our case, we have project name "HbaseConnection"
  2. Check this box for default location to be saved. In this /home/hduser/work/HbaseConnection is the path
  3. Check the box for Java environment here. In this JavaSE-1.7 is the Java edition
  4. Choose your option where you want to save file. In our case, we have selected option second "Create separate folder for sources and class files"
  5. Click on finish button.

Step 2) On eclipse home page follow the following steps

Right click on project -> Select Build Path -> Configure build path

Create, Insert, Read Tables in HBase

From above screenshot

  1. Right click on project
  2. Select build path
  3. Select configure build path

After clicking Configure Build path, it will open another window as shown in below screen shot

In this step, we will add relevant HBase jars into java project as shown in the screenshot.

Create, Insert, Read Tables in HBase

  1. Come to libraries
  2. Press option - Add External Jars
  3. Select required important jars
  4. Press finish button to add these files to 'src' of java project under libraries

After adding these jars, it will show under project "src" location. All the Jar files that fall under the project are now ready for usage with Hadoop ecosystem.

Step 3) In this step by using HBaseConnection.java, the HBase Connection would be established through Java Coding

  1. Select Run
  2. Select Run as Java Application

Create, Insert, Read Tables in HBase

From screen shot above we are performing following functions.

  1. Using HTableDescriptor we can able to create "guru99" table in HBase
  2. Using addFamily method, we are going to add "education" and "projects" as column names to "guru99" table.

The below coding is going to

Code Placed under HBaseConnection_Java document

// Place this code inside Hbase connection
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;						
import org.apache.hadoop.hbase.HBaseConfiguration;							
import org.apache.hadoop.hbase.HColumnDescriptor;							
import org.apache.hadoop.hbase.HTableDescriptor;		
Import org.apache.hadoop.hbase.client.HBaseAdmin;							

public class HBaseConnection							
{							
        public static void main(String[] args) throws IOException						
        {							
HBaseConfigurationhc = new HBaseConfiguration(new Configuration());										
HTableDescriptorht = new HTableDescriptor("guru99"); 										

ht.addFamily( new HColumnDescriptor("education"));					
ht.addFamily( new HColumnDescriptor("projects"));										
System.out.println( "connecting" );										
HBaseAdminhba = new HBaseAdmin( hc );								

System.out.println( "Creating Table" );								
hba.createTable( ht );							
System.out.println("Done......");										
            }						
}

This is required code you have to place in HBaseConnection.java and have to run java program

After running this program, it is going to establish a connection with HBase and in turn it will create a table with column names.

Step 4) We can check whether "guru99" table is created with two columns in HBase or not by using HBase shell mode with "list" command.

The "list" command gives information about all the tables that is created in HBase.

Refer "HBase Shell and General Commands" article for more information on "list" command.

In this screen, we going to do

Create, Insert, Read Tables in HBase

Placing values into tables and retrieving values from table:

In this section, we are going to

For example, we will

Here is the Java Code to be placed under HBaseLoading.java as shown below for both writing and retrieving data.

Code Placed under HBaseLoading_Java document

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLoading
{
    public static void main(String[] args) throws IOException
    {
        // When you create an HBaseConfiguration, it reads in whatever you've set
        // in your hbase-site.xml and hbase-default.xml, as long as these can be
        // found on the CLASSPATH.
        org.apache.hadoop.conf.Configuration config = HBaseConfiguration.create();

        // This instantiates an HTable object that connects you to the "guru99" table.
        HTable table = new HTable(config, "guru99");

        // To add to a row, use Put. A Put constructor takes the name of the row
        // you want to insert into as a byte array.
        Put p = new Put(Bytes.toBytes("row1"));

        // To set the value you'd like to update in the row "row1", specify the
        // column family, column qualifier, and value of the table cell you'd like
        // to update. The column family must already exist in your table schema;
        // the qualifier can be anything.
        p.add(Bytes.toBytes("education"), Bytes.toBytes("col1"), Bytes.toBytes("BigData"));
        p.add(Bytes.toBytes("projects"), Bytes.toBytes("col2"), Bytes.toBytes("HBaseTutorials"));

        // Once you've adorned your Put instance with all the updates you want to
        // make, commit it as follows:
        table.put(p);

        // Now, retrieve the data we just wrote.
        Get g = new Get(Bytes.toBytes("row1"));
        Result r = table.get(g);
        byte[] value = r.getValue(Bytes.toBytes("education"), Bytes.toBytes("col1"));
        byte[] value1 = r.getValue(Bytes.toBytes("projects"), Bytes.toBytes("col2"));
        String valueStr = Bytes.toString(value);
        String valueStr1 = Bytes.toString(value1);
        System.out.println("GET: education: " + valueStr + " projects: " + valueStr1);

        // Scan the table for the same two columns.
        Scan s = new Scan();
        s.addColumn(Bytes.toBytes("education"), Bytes.toBytes("col1"));
        s.addColumn(Bytes.toBytes("projects"), Bytes.toBytes("col2"));
        ResultScanner scanner = table.getScanner(s);
        try
        {
            for (Result rr = scanner.next(); rr != null; rr = scanner.next())
            {
                System.out.println("Found row : " + rr);
            }
        } finally
        {
            // Make sure you close your scanners when you are done!
            scanner.close();
        }
        table.close();
    }
}

First of all, we are going to see how to write data, and then we will see how to read data from an HBase table.

Write Data to HBase Table:

Step 1) In this step, we are going to write data into the HBase table "guru99".

First we have to write the code for inserting and retrieving values from HBase, using the HBaseLoading.java program shown above.

For creating and inserting values into a table at the column level, the code works as follows.


From the code above:

  1. When we create the HBase configuration, it points to whatever configurations we set in the hbase-site.xml and hbase-default.xml files during HBase installation
  2. Connecting to the table "guru99" using the HTable class
  3. Adding row1 to the table "guru99"
  4. Specifying the column families "education" and "projects" and inserting values into row1. The values inserted here are "BigData" and "HBaseTutorials".

Read Data from HBase Table:

Step 2) Here we fetch and display the values that we placed in the HBase table in Step (1).

For retrieving the results stored in "guru99", the code above does the following:

  1. It fetches the values stored in the column families "education" and "projects"
  2. Using a "Get", it fetches the stored values from the HBase table
  3. Using a "Scan", it scans the table; the values stored in row1 are displayed on the console

Once the code is written, you run the Java application and the output is printed on the console.

Retrieving Inserted Values in HBase shell mode

In this section, we will verify from the HBase shell that the values inserted through the Java API can be read back.

Summary:

As discussed in this article, we are now familiar with creating tables, loading data into tables, and retrieving data from tables using the Java API. We can perform all types of shell-command functionality through this Java API, and it establishes good client communication with the HBase environment.

In our next article, we will look at troubleshooting common HBase problems.

 

 

HBase: Limitations, Advantage & Problems

 

HBase architecture has a single point of failure (the master server), and there is no exception-handling mechanism associated with it.

Problems with HBase

 

 

Advantage of HBase:

Limitations with HBase:

  1. Problem Statement: Master server initializes, but region servers do not initialize

Cause:

Solution:

What to change:

Open /etc/hosts and go to this location

127.0.0.1 fully.qualified.regionservername regionservername localhost.localdomain localhost

::1 localhost3.localdomain3 localdomain3

Modify the above configuration as below (remove the region server name entries):

127.0.0.1 localhost.localdomain localhost

::1 localhost3.localdomain3 localdomain3

  2. Problem Statement: Couldn't find my address: XYZ in list of Zookeeper quorum servers

Cause:

Solution:-

  3. Problem Statement: Created Root Directory for HBase through Hadoop DFS

Cause:

Solution:

  4. Problem Statement: Zookeeper session expired events

Cause:

The following shows an exception thrown because of a Zookeeper session expired event.

These events are some of the exceptions that occurred in the log file.

The log file contents are displayed below:

WARN org.apache.zookeeper.ClientCnxn: Exception

closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec

java.io.IOException: TIMED OUT

at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000

INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT

INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]

INFO org.apache.zookeeper.ClientCnxn: Server connection successful

WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e

java.io.IOException: Session Expired

at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)

at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)

at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired

Solution:

<property>
    <name>zookeeper.session.timeout</name>
    <value>1200000</value>
</property>
<property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
</property>
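The two values above interact: by default, ZooKeeper clamps the negotiated session timeout to between 2 × tickTime and 20 × tickTime, so a very large zookeeper.session.timeout may be silently reduced unless tickTime is raised as well. A minimal sketch of that arithmetic (the class name and printed format are our own, not part of any HBase or ZooKeeper API):

```java
public class SessionTimeoutBounds {
    public static void main(String[] args) {
        int tickTimeMs = 6000;             // hbase.zookeeper.property.tickTime from above
        long requestedTimeoutMs = 1200000; // zookeeper.session.timeout from above

        // ZooKeeper's default bounds on the negotiated session timeout.
        long minMs = 2L * tickTimeMs;
        long maxMs = 20L * tickTimeMs;
        long negotiatedMs = Math.max(minMs, Math.min(requestedTimeoutMs, maxMs));

        System.out.println("min=" + minMs + " max=" + maxMs + " negotiated=" + negotiatedMs);
        // prints: min=12000 max=120000 negotiated=120000
    }
}
```

With tickTime = 6000 ms, even a requested 1,200,000 ms session timeout would be negotiated down to 120,000 ms, which is why both properties are usually increased together.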

 

Hive is an open source data warehouse, initially developed by Facebook for analyzing and querying datasets; it is now under the Apache Software Foundation.

Hive is developed on top of Hadoop as a data warehouse framework for querying and analyzing data stored in HDFS.

Hive is useful for performing operations like data encapsulation, ad-hoc queries, and analysis of huge datasets. Hive's design reflects its targeted use as a system for managing and querying structured data.


HBase Vs Hive

Features              HBase                           Hive
Database model        Wide column store               Relational DBMS
Data schema           Schema-free                     With schema
SQL support           No                              Yes, it uses HQL (Hive Query Language)
Partition methods     Sharding                        Sharding
Consistency level     Immediate consistency           Eventual consistency
Secondary indexes     No                              Yes
Replication methods   Selectable replication factor   Selectable replication factor

HBase VS RDBMS

While comparing HBase with traditional relational databases, we have to take three key areas into consideration: the data model, data storage, and data diversity.

HBASE                                            RDBMS
Schema-less database                             Fixed schema database
Column-oriented data store                       Row-oriented data store
Designed to store de-normalized data             Designed to store normalized data
Wide, sparsely populated tables                  Thin tables
Supports automatic partitioning                  No built-in support for partitioning
Well suited for OLAP systems                     Well suited for OLTP systems
Reads only the relevant data                     Retrieves one row at a time, so may read
                                                 unnecessary data if only part of a row is needed
Stores structured and semi-structured data       Stores structured data
Enables aggregation over many rows and columns   Aggregation is an expensive operation
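The row-oriented vs column-oriented distinction above can be sketched in plain Java. This is only an illustration of the storage idea, not HBase's actual on-disk format, and all names in it are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StorageLayoutDemo {
    public static void main(String[] args) {
        // Three records, each with a row key and one field.
        String[][] records = { {"row1", "BigData"}, {"row2", "Spark"}, {"row3", "Hive"} };

        // Row-oriented layout: each record's fields are stored together, so
        // reading one field still drags in the whole row.
        List<String[]> rowStore = new ArrayList<>(Arrays.asList(records));

        // Column-oriented layout: each column's values are stored together, so
        // a query over one column touches only that column's data.
        Map<String, List<String>> columnStore = new LinkedHashMap<>();
        columnStore.put("rowkey", new ArrayList<String>());
        columnStore.put("education:col1", new ArrayList<String>());
        for (String[] r : records) {
            columnStore.get("rowkey").add(r[0]);
            columnStore.get("education:col1").add(r[1]);
        }

        // Reading just one column from the column store:
        System.out.println(columnStore.get("education:col1")); // [BigData, Spark, Hive]

        // From the row store, the same read has to visit every full record:
        List<String> values = new ArrayList<String>();
        for (String[] r : rowStore) values.add(r[1]);
        System.out.println(values); // [BigData, Spark, Hive]
    }
}
```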

1) Explain what is Hbase?

HBase is a column-oriented database management system which runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.

In HBase, a master node manages the cluster, and region servers store portions of the tables and perform the work on the data.

2) Explain why to use Hbase?

3) Mention what are the key components of Hbase?

4) Explain what does Hbase consists of?

5) Mention how many operational commands in Hbase?

There are five types of operational commands in HBase: Get, Put, Delete, Scan, and Increment.

6) Explain what is WAL and Hlog in Hbase?

WAL (Write Ahead Log) is similar to the MySQL BIN log; it records all the changes that occur to the data. It is a standard Hadoop sequence file, and it stores HLogKeys. These keys consist of a sequential number as well as the actual data, and are used to replay data that has not yet been persisted after a server crash. So, in case of server failure, the WAL works as a lifeline and recovers the lost data.
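The append-before-apply and replay-after-crash idea can be sketched in a few lines of plain Java. This is a toy model only, not HBase's actual WAL implementation, and all names in it are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalSketch {
    public static void main(String[] args) {
        // Each WAL entry pairs a sequence number with the actual edit,
        // loosely mirroring HBase's HLogKey plus edit data.
        List<String[]> wal = new ArrayList<>();          // durable log (on disk in real HBase)
        Map<String, String> memstore = new HashMap<>();  // in-memory store, lost on a crash
        long seq = 0;

        // Every write is appended to the WAL *before* it is applied in memory.
        String[][] puts = { {"row1", "BigData"}, {"row2", "Hive"} };
        for (String[] put : puts) {
            wal.add(new String[] { String.valueOf(++seq), put[0], put[1] });
            memstore.put(put[0], put[1]);
        }

        // Simulate a crash: the memstore is gone, but the WAL survives.
        memstore = new HashMap<>();

        // Recovery: replay the WAL in sequence order to rebuild the
        // not-yet-persisted data.
        for (String[] entry : wal) {
            memstore.put(entry[1], entry[2]);
        }

        System.out.println(memstore.get("row1")); // BigData
    }
}
```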

7) When you should use Hbase?

8) In Hbase what is column families?

Column families comprise the basic unit of physical storage in HBase, to which features like compression are applied.

9) Explain what is the row key?

The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
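Because HBase sorts rows by the raw bytes of the row key, applications often encode keys so that the byte order matches the intended logical order. A plain-Java illustration, where String sorting stands in for HBase's byte-wise comparison and the key names are made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RowKeyOrderDemo {
    public static void main(String[] args) {
        // HBase stores rows in byte-lexicographic order, not numeric order.
        List<String> keys = new ArrayList<>(List.of("user-10", "user-2", "user-1"));
        Collections.sort(keys); // lexicographic order for ASCII strings
        System.out.println(keys); // [user-1, user-10, user-2]  -- "10" sorts before "2"!

        // Zero-padding the numeric part restores the intended scan order.
        List<String> padded = new ArrayList<>(List.of("user-10", "user-02", "user-01"));
        Collections.sort(padded);
        System.out.println(padded); // [user-01, user-02, user-10]
    }
}
```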

10) Explain deletion in Hbase? Mention what are the three types of tombstone markers in Hbase?

When you delete a cell in HBase, the data is not actually deleted; instead, a tombstone marker is set, making the deleted cells invisible. Deleted cells are actually removed during compactions.

There are three types of tombstone markers: version delete marker, column delete marker, and family delete marker.

11) Explain how does Hbase actually delete a row?

In HBase, whatever you write is flushed from RAM to disk, and these disk writes are immutable, barring compaction. During the deletion process in HBase, major compactions remove the delete markers while minor compactions don't. A normal delete results in a delete tombstone marker; the deleted data it covers is removed during compaction.

Also, if you delete data and then add more data, but with an earlier timestamp than the tombstone's timestamp, subsequent Gets may be masked by the delete/tombstone marker, and hence you will not receive the inserted value until after the major compaction.
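That masking behaviour can be simulated in a few lines of plain Java. This is a toy model of one cell's versions, not HBase API:

```java
import java.util.Map;
import java.util.TreeMap;

public class TombstoneDemo {
    public static void main(String[] args) {
        // Versions of a single cell, keyed by timestamp (newest wins on read).
        TreeMap<Long, String> versions = new TreeMap<>();
        versions.put(90L, "old-value");

        long tombstoneTs = 100L; // a delete marker covering everything at ts <= 100

        // A later put, but with an *earlier* timestamp than the tombstone:
        versions.put(95L, "new-value");

        // A read before major compaction: anything at or below the tombstone
        // timestamp is masked by the delete marker.
        String visible = null;
        for (Map.Entry<Long, String> e : versions.descendingMap().entrySet()) {
            if (e.getKey() > tombstoneTs) { visible = e.getValue(); break; }
        }
        System.out.println(visible); // null -- the insert at ts=95 is masked

        // After a major compaction, the tombstone and the cells it covers are
        // physically removed, so the masked value is gone for good.
        versions.headMap(tombstoneTs, true).clear();
        System.out.println(versions.isEmpty()); // true
    }
}
```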

12) Explain what happens if you alter the block size of a column family on an already occupied database?

When you alter the block size of a column family, the new data occupies the new block size while the old data remains within the old block size. During compaction, old data takes on the new block size. New files, as they are flushed, have the new block size, whereas existing data continues to be read correctly. After the next major compaction, all data will have been converted to the new block size.

13) Mention the difference between Hbase and Relational Database?

Hbase                                    Relational Database
It is schema-less                        It is a schema-based database
It is a column-oriented data store       It is a row-oriented data store
It is used to store de-normalized data   It is used to store normalized data
It contains sparsely populated tables    It contains thin tables
Automated partitioning is done in Hbase  There is no built-in support for partitioning