Hadoop Tutorial: Master BigData
Posted by Superadmin on January 02 2019 12:36:48


 

Big Data is the latest buzzword in the IT industry. Apache's Hadoop is a leading Big Data platform used by IT giants Yahoo, Facebook & Google. This course is geared to make you a Hadoop expert.

What should I know?


This is an absolute beginner guide to Hadoop, but knowledge of 1) Java and 2) Linux will help.

Syllabus

Introduction to BIG DATA: Types, Characteristics & Benefits
Hadoop Tutorial: Features, Components, Cluster & Topology
Hadoop Setup Tutorial - Installation & Configuration
HDFS Tutorial: Read & Write Commands using Java API
What is MapReduce? How it Works - Hadoop MapReduce Tutorial
Hadoop & Mapreduce Examples: Create your First Program
Hadoop MapReduce Tutorial: Counters & Joins with Example
What is Sqoop? What is FLUME - Hadoop Tutorial
Sqoop vs Flume vs HDFS in Hadoop
Create Your First FLUME Program - Beginner's Tutorial
Hadoop PIG Tutorial: Introduction, Installation & Example
Learn OOZIE in 5 Minutes - Hadoop Tutorial
Big Data Testing: Functional & Performance
Hadoop & MapReduce Interview Questions & Answers

Check!

Top 15 Big Data Tools
11 Best Big Data Analytics Tools

 

 

 

 

Introduction to BIG DATA: Types, Characteristics & Benefits

 

In order to understand 'Big Data', we first need to know what 'data' is. The Oxford dictionary defines 'data' as:

"The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. "


So, 'Big Data' is also data, but of enormous size. 'Big Data' is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

 

 


Examples Of 'Big Data'

Following are some examples of 'Big Data':

• The New York Stock Exchange generates about one terabyte of new trade data per day.

• Social Media Impact

Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.

• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Categories Of 'Big Data'

'Big Data' can be found in three forms:

  1. Structured
  2. Unstructured
  3. Semi-structured

Structured

Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value out of it. However, nowadays, we foresee issues when the size of such data grows to a huge extent; typical sizes today are in the range of multiple zettabytes.

Do you know? 10^21 bytes equal 1 zettabyte; equivalently, one billion terabytes form a zettabyte.

Looking at these figures one can easily understand why the name 'Big Data' is given and imagine the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of a 'structured' data.

Examples Of Structured Data

An 'Employee' table in a database is an example of structured data.

Employee_ID   Employee_Name    Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni  Male     Finance      650000
3398          Pratibha Joshi   Female   Admin        650000
7465          Shushil Roy      Male     Admin        500000
7500          Shubhojit Das    Male     Finance      500000
7699          Priya Sane       Female   Finance      550000

Unstructured

Any data with unknown form or structure is classified as unstructured data. In addition to the size being huge, unstructured data poses multiple challenges when it comes to processing it for deriving value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them, but unfortunately they don't know how to derive value out of it, since this data is in its raw form or unstructured format.

Examples Of Unstructured Data

The output returned by 'Google Search' is a typical example of unstructured data.

 

Semi-structured

Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file:

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>

Data Growth over years

[Figure: data growth over the years]

Please note that web application data, which is unstructured, consists of log files, transaction history files, etc. OLTP systems are built to work with structured data, wherein data is stored in relations (tables).

Characteristics Of 'Big Data'

(i) Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.

(ii) Variety – The next aspect of 'Big Data' is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing the data.

(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which the data can show at times, hampering the process of handling and managing the data effectively.

Benefits of Big Data Processing

The ability to process 'Big Data' brings in multiple benefits, such as:

• Businesses can utilize outside intelligence while taking decisions

Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

• Improved customer service

Traditional customer feedback systems are getting replaced by new systems designed with 'Big Data' technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.

• Early identification of risk to the product/services, if any

• Better operational efficiency

'Big Data' technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of 'Big Data' technologies and a data warehouse helps an organization to offload infrequently accessed data.

 

Hadoop Tutorial: Features, Components, Cluster & Topology

Apache HADOOP is a framework used to develop data processing applications which are executed in a distributed computing environment.


Similar to data residing in a local file system of a personal computer, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System (HDFS).

The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data.

This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop HDFS.
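As a first taste of such a program, below is a minimal sketch that reads a file from HDFS using the Hadoop Java API. The path /user/hduser_/input.txt is a hypothetical example, and the sketch assumes the Hadoop client libraries are on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: open a file stored in HDFS and print it line by line.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml, e.g. fs.defaultFS
        FileSystem fs = FileSystem.get(conf);            // handle to the distributed file system
        Path file = new Path("/user/hduser_/input.txt"); // hypothetical HDFS path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}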

HADOOP is an open-source software framework. Applications built using HADOOP are run on large data sets distributed across clusters of commodity computers.

Commodity computers are cheap and widely available. They are mainly useful for achieving greater computational power at low cost.

Do you know? A computer cluster consists of a set of multiple processing units (storage disk + processor) which are connected to each other and act as a single system.

Components of Hadoop

The diagram below shows the various components in the Hadoop ecosystem:

[Diagram: components of the Hadoop ecosystem]

Apache Hadoop consists of two sub-projects –

  1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications which run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes (a short sketch of such a program follows this list).
  2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
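To make the MapReduce model concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce Java API. The class names are illustrative, and the driver class (job setup and submission) is omitted for brevity.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}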

Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume and ZooKeeper.

Features Of 'Hadoop'

• Suitable for Big Data Analysis

As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for the analysis of Big Data. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This is called the data locality concept, and it helps increase the efficiency of Hadoop-based applications.

• Scalability

HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes, and thus allow for the growth of Big Data. Also, scaling does not require modifications to application logic.

• Fault Tolerance

The HADOOP ecosystem has a provision to replicate the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed by using data stored on another cluster node (the replication factor is configurable; see the dfs.replication setting in the setup section below).

Network Topology In Hadoop

The topology (arrangement) of the network affects the performance of the Hadoop cluster as the cluster grows in size. In addition to performance, one also needs to care about high availability and the handling of failures. In order to achieve this, the formation of the Hadoop cluster makes use of network topology.


Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth can be difficult, in Hadoop the network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered an important factor in the formation of the Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.

A Hadoop cluster consists of a data center, racks, and the nodes which actually execute the jobs. Here, a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending upon the location of the processes. That is, the available bandwidth becomes lesser as we go from:

• processes on the same node, to
• different nodes on the same rack, to
• nodes on different racks in the same data center, to
• nodes in different data centers.

For example, under this tree representation two nodes on the same rack are at a distance of 2 from each other (one hop up to the rack, one hop down), while two nodes on different racks in the same data center are at a distance of 4.

Hadoop Setup Tutorial - Installation & Configuration

Prerequisites:

You must have Ubuntu installed and running.

You must have Java installed.

Step 1) Add a Hadoop system user using the below commands

sudo addgroup hadoop_


sudo adduser --ingroup hadoop_ hduser_


Enter your password, name and other details.

NOTE:

There is a possibility of the below-mentioned error occurring during this setup and installation process.

"hduser is not in the sudoers file. This incident will be reported."


This error can be resolved by the following steps:

Log in as the root user.

Execute the command:

sudo adduser hduser_ sudo


Then re-login as hduser_.


Step 2) Configure SSH

In order to manage the nodes in a cluster, Hadoop requires SSH access.

First, switch the user by entering the following command:

su - hduser_


The below command will create a new key:

ssh-keygen -t rsa -P ""


Enable SSH access to the local machine using this key:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


Now test the SSH setup by connecting to localhost as the 'hduser_' user.

ssh localhost


Note:

Please note: if you see an error like "ssh: connect to host localhost port 22: Connection refused" in response to 'ssh localhost', there is a possibility that SSH is not available on this system.

To resolve this:

Purge SSH using

sudo apt-get purge openssh-server

(It is good practice to purge before the start of an installation.)


Then install SSH using the command:

sudo apt-get install openssh-server


Step 3) Download Hadoop from the Apache Hadoop releases page (https://hadoop.apache.org/releases.html).

Select a stable release.

Select the tar.gz file (not the file with src).

Once the download is complete, navigate to the directory containing the tar file and enter:

sudo tar xzf hadoop-2.2.0.tar.gz

Now, rename hadoop-2.2.0 to hadoop:

sudo mv hadoop-2.2.0 hadoop

Then change the ownership of the files to the hduser_ user and hadoop_ group:

sudo chown -R hduser_:hadoop_ hadoop

Step 4) Modify ~/.bashrc file

Add the following lines to the end of the file ~/.bashrc:

#Set HADOOP_HOME
export HADOOP_HOME=<Installation Directory of Hadoop>
#Set JAVA_HOME
export JAVA_HOME=<Installation Directory of Java>
# Add bin/ directory of Hadoop to PATH
export PATH=$PATH:$HADOOP_HOME/bin



Now, source this environment configuration using the below command:

. ~/.bashrc


Step 5) Configurations related to HDFS

Set JAVA_HOME inside the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh by replacing the line

export JAVA_HOME=${JAVA_HOME}

with the actual path of your Java installation (the exact path depends on your system), for example:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

There are two parameters in $HADOOP_HOME/etc/hadoop/core-site.xml which need to be set:

1. 'hadoop.tmp.dir' – specifies the directory which Hadoop will use to store its data files.

2. 'fs.defaultFS' (formerly 'fs.default.name') – specifies the default file system.

To set these parameters, open core-site.xml

sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml


Copy the lines below in between the tags <configuration></configuration>:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>Parent directory for other temporary directories.</description>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>


Navigate to the directory $HADOOP_HOME/etc/hadoop.


Now, create the directory mentioned in core-site.xml:

sudo mkdir -p <Path of Directory used in above setting>


Grant permissions to the directory:

sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>


sudo chmod 750 <Path of Directory created in above step>


Step 6) MapReduce Configuration

Before you begin with these configurations, let's set the HADOOP_HOME path:

sudo gedit /etc/profile.d/hadoop.sh

And Enter

export HADOOP_HOME=/home/guru99/Downloads/Hadoop


Next, enter:

sudo chmod +x /etc/profile.d/hadoop.sh


Exit the terminal and restart it.

Type echo $HADOOP_HOME to verify the path.


Now copy the files:

sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml


Open the mapred-site.xml file

sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml


Add the below lines of settings in between the tags <configuration> and </configuration>:

<property>
  <name>mapreduce.jobtracker.address</name>
  <value>localhost:54311</value>
  <description>MapReduce job tracker runs at this host and port.</description>
</property>


Open $HADOOP_HOME/etc/hadoop/hdfs-site.xml as below:

sudo gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

 

 


Add the below lines of settings between the tags <configuration> and </configuration>:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hduser_/hdfs</value>
</property>


Create the directory specified in the above setting:

sudo mkdir -p <Path of Directory used in above setting>

sudo mkdir -p /home/hduser_/hdfs


sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>

sudo chown -R hduser_:hadoop_ /home/hduser_/hdfs


sudo chmod 750 <Path of Directory created in above step>

sudo chmod 750 /home/hduser_/hdfs


Step 7) Before we start Hadoop for the first time, format HDFS using the below command:

$HADOOP_HOME/bin/hdfs namenode -format


Step 8) Start the Hadoop single-node cluster using the below commands:

$HADOOP_HOME/sbin/start-dfs.sh

Then start YARN:

$HADOOP_HOME/sbin/start-yarn.sh

Using the 'jps' tool/command, verify whether all the Hadoop-related processes are running.


If Hadoop has started successfully, the output of jps should show NameNode, NodeManager, ResourceManager, SecondaryNameNode and DataNode.
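For instance, on a healthy single-node setup the jps output looks something like the below (the process IDs will differ on your machine):

2628 NameNode
2786 DataNode
2980 SecondaryNameNode
3148 ResourceManager
3284 NodeManager
3412 Jps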

 

Step 9) Stop Hadoop using the below commands:

 

$HADOOP_HOME/sbin/stop-dfs.sh



 

$HADOOP_HOME/sbin/stop-yarn.sh

