
Hadoop Tutorial: Master BigData

Apache Hadoop is a framework used to develop data processing applications that are executed in a distributed computing environment.

In this tutorial, we will learn how Hadoop works, what its components are, and how to install and configure it.

Similar to data residing in the local file system of a personal computer, in Hadoop, data resides in a distributed file system called the Hadoop Distributed File System (HDFS).

The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data.

This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop HDFS.

Hadoop is an open-source software framework. Applications built using Hadoop run on large data sets distributed across clusters of commodity computers.

Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.

Do you know? A computer cluster consists of a set of multiple processing units (storage disk + processor) that are connected to each other and act as a single system.

Components of Hadoop

The diagram below shows the various components in the Hadoop ecosystem-

[Figure: Hadoop Tutorial: Features, Components, Cluster & Topology]

Apache Hadoop consists of two sub-projects –

  1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes.
  2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in the cluster. This distribution enables reliable and extremely rapid computations.
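The map and reduce phases above can be illustrated with a tiny word-count sketch. This is plain Python that simulates the two phases locally; it is not the Hadoop API, and the function names are ours:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop mapper would
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts per word, as a Hadoop reducer would
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop stores data in HDFS",
         "MapReduce processes data in parallel"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = reduce_phase(pairs)
print(result["data"], result["in"])  # 2 2
```

In a real cluster, the pairs emitted by many mappers are shuffled across the network so that all counts for the same word reach the same reducer.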

Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

Features Of 'Hadoop'

• Suitable for Big Data Analysis

As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are best suited for analysis of Big Data. Since it is processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.

• Scalability

Hadoop clusters can easily be scaled out by adding additional cluster nodes, thus allowing for the growth of Big Data. Scaling does not require modifications to application logic.

• Fault Tolerance

The Hadoop ecosystem replicates the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the copies stored on other cluster nodes.
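A toy sketch of this idea in plain Python (HDFS's real placement policy is rack-aware and more sophisticated; the names here are ours):

```python
import random

def place_replicas(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes
    # (toy stand-in for HDFS block placement)
    return {b: random.sample(nodes, replication) for b in blocks}

def readable(block, placement, live_nodes):
    # A block survives as long as at least one replica is on a live node
    return any(n in live_nodes for n in placement[block])

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1"], nodes)
live = set(nodes) - {"node1"}  # simulate one node failing
print(all(readable(b, placement, live) for b in placement))  # True
```

With 3 replicas spread over 4 nodes, losing any single node still leaves at least two live copies of every block, so processing can continue.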

Network Topology In Hadoop

The topology (arrangement) of the network affects the performance of a Hadoop cluster as the cluster grows in size. In addition to performance, one also needs to care about high availability and handling of failures. To achieve this, Hadoop cluster formation makes use of the network topology.


Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth can be difficult, in Hadoop the network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered an important factor in the formation of a Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.

A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs. A data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending on where the processes run. That is, the available bandwidth decreases as we move from-

  • Processes on the same node
  • Different nodes on the same rack
  • Nodes on different racks of the same data center
  • Nodes in different data centers
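Counting hops in this tree can be sketched as follows (plain Python; the level names are illustrative, not Hadoop's actual API):

```python
# Toy hop-count model of Hadoop's network distance: each step up or down
# the tree (data center -> rack -> node) costs one hop.
def distance(a, b):
    # a, b are paths from the root, e.g. ("dc1", "rack1", "node1")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # hops from each node up to the closest common ancestor
    return (len(a) - common) + (len(b) - common)

n1 = ("dc1", "rack1", "node1")
print(distance(n1, n1))                          # same node: 0
print(distance(n1, ("dc1", "rack1", "node2")))   # same rack: 2
print(distance(n1, ("dc1", "rack2", "node3")))   # same data center: 4
print(distance(n1, ("dc2", "rack3", "node4")))   # different data centers: 6
```

The increasing hop counts mirror the bandwidth ordering in the list above: the farther apart two processes sit in the tree, the less bandwidth is available between them.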

Hadoop Setup Tutorial - Installation & Configuration

Prerequisites:

You must have Ubuntu installed and running.

You must have Java installed.

Step 1) Add a Hadoop system user using the commands below

sudo addgroup hadoop_


sudo adduser --ingroup hadoop_ hduser_


Enter your password, name, and other details.

NOTE:

There is a possibility of the below-mentioned error in this setup and installation process.

"hduser is not in the sudoers file. This incident will be reported."

This error can be resolved by logging in as the root user and executing the command

sudo adduser hduser_ sudo

Then re-login as hduser_

Step 2) Configure SSH

In order to manage the nodes in a cluster, Hadoop requires SSH access.

First, switch user by entering the following command

su - hduser_

This command will create a new key pair with an empty passphrase.

ssh-keygen -t rsa -P ""

Enable SSH access to the local machine using this key.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now test the SSH setup by connecting to localhost as the 'hduser_' user.

ssh localhost

Note:

If 'ssh localhost' fails (for example, with a "Connection refused" error), there is a possibility that SSH is not available on this system.

To resolve this -

Purge SSH using,

sudo apt-get purge openssh-server

It is good practice to purge before the start of installation.

Then install SSH using the command-

sudo apt-get install openssh-server

Step 3) Download Hadoop from the Apache Hadoop releases page.

Select a stable release.

Select the tar.gz file (not the file with src).

Once the download is complete, navigate to the directory containing the tar file and extract it:

sudo tar xzf hadoop-2.2.0.tar.gz

Now, rename hadoop-2.2.0 to hadoop

sudo mv hadoop-2.2.0 hadoop

Change the ownership of the directory to the Hadoop user and group

sudo chown -R hduser_:hadoop_ hadoop

Step 4) Modify ~/.bashrc file

Add the following lines to the end of the file ~/.bashrc

#Set HADOOP_HOME
export HADOOP_HOME=<Installation Directory of Hadoop>
#Set JAVA_HOME
export JAVA_HOME=<Installation Directory of Java>
# Add bin/ directory of Hadoop to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Now, source this environment configuration using the command below

. ~/.bashrc


Step 5) Configurations related to HDFS

Set JAVA_HOME inside the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh by replacing its JAVA_HOME line with

export JAVA_HOME=<Installation Directory of Java>

There are two parameters in $HADOOP_HOME/etc/hadoop/core-site.xml which need to be set-

1. 'hadoop.tmp.dir' - Specifies the directory which will be used by Hadoop to store its data files.

2. 'fs.defaultFS' - Specifies the default file system (this replaces the deprecated name 'fs.default.name').

To set these parameters, open core-site.xml

sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml


Copy the lines below in between the tags <configuration></configuration>

<property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Parent directory for other temporary directories.</description>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system.</description>
</property>

Navigate to the directory $HADOOP_HOME/etc/hadoop

Now, create the directory mentioned in core-site.xml

sudo mkdir -p <Path of Directory used in above setting>


Grant permissions to the directory

sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>

sudo chmod 750 <Path of Directory created in above step>


Step 6) Map Reduce Configuration

Before you begin with these configurations, let's set the HADOOP_HOME path

sudo gedit /etc/profile.d/hadoop.sh

And enter

export HADOOP_HOME=/home/guru99/Downloads/Hadoop


Next enter

sudo chmod +x /etc/profile.d/hadoop.sh


Exit the terminal and restart it.

Type echo $HADOOP_HOME to verify the path


Now copy the mapred-site.xml template file

sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml


Open the mapred-site.xml file

sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml


Add the lines below in between the tags <configuration> and </configuration>

<property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:54311</value>
    <description>MapReduce job tracker runs at this host and port.</description>
</property>


Open $HADOOP_HOME/etc/hadoop/hdfs-site.xml as below,

sudo gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the lines below between the tags <configuration> and </configuration>

<property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.</description>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hduser_/hdfs</value>
</property>


Create the directory specified in the above setting-

sudo mkdir -p <Path of Directory used in above setting>

sudo mkdir -p /home/hduser_/hdfs


sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>

sudo chown -R hduser_:hadoop_ /home/hduser_/hdfs


sudo chmod 750 <Path of Directory created in above step>

sudo chmod 750 /home/hduser_/hdfs


Step 7) Before we start Hadoop for the first time, format HDFS using the command below

$HADOOP_HOME/bin/hdfs namenode -format


Step 8) Start the Hadoop single-node cluster using the commands below

$HADOOP_HOME/sbin/start-dfs.sh


$HADOOP_HOME/sbin/start-yarn.sh


Using the 'jps' tool/command, verify whether all the Hadoop-related processes are running.

If Hadoop has started successfully then output of jps should show NameNode, NodeManager, ResourceManager, SecondaryNameNode, DataNode.

 

Step 9) Stop Hadoop using the commands below

$HADOOP_HOME/sbin/stop-dfs.sh

$HADOOP_HOME/sbin/stop-yarn.sh
