Hadoop Tutorial: Master BigData

In this tutorial we will discuss Pig & Hive

INTRODUCTION TO PIG

In the MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model that data analysts are familiar with. To bridge this gap, an abstraction called Pig was built on top of Hadoop.

Pig is a high-level programming language useful for analyzing large data sets. Pig was the result of a development effort at Yahoo!

Pig enables people to focus more on analyzing bulk data sets and to spend less time writing MapReduce programs.

Just as pigs eat anything, the Pig programming language is designed to work on any kind of data. Hence the name, Pig!

Pig consists of two components:

Pig Latin, which is the language
A runtime environment, for running Pig Latin programs

A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow, which the Pig execution environment translates into an executable representation. Underneath, these transformations are turned into a series of MapReduce jobs that the programmer does not need to be aware of. So, in a way, Pig allows the programmer to focus on the data rather than on the nature of the execution.

Pig Latin is a relatively rigid language which uses familiar keywords from data processing, e.g., JOIN, GROUP and FILTER.
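To make the data-flow idea concrete, here is a minimal sketch of a Pig Latin script, written out from the shell and run in local mode only if Pig is already installed. The sample file, the field names and the relation names are made up for illustration and are not part of this tutorial's data set.

```shell
#!/bin/sh
# Sketch only: file names, fields, and relation names are illustrative assumptions.
workdir=$(mktemp -d)

# A tiny two-column sample: product,country
cat > "$workdir/sample_sales.csv" <<'EOF'
Product1,United Kingdom
Product2,United States
Product1,United States
EOF

# Each statement is one step of the data flow: load, filter, group, aggregate.
cat > "$workdir/count_by_country.pig" <<'EOF'
sales      = LOAD 'sample_sales.csv' USING PigStorage(',')
             AS (product:chararray, country:chararray);
valid      = FILTER sales BY country IS NOT NULL;
by_country = GROUP valid BY country;
counts     = FOREACH by_country GENERATE group AS country, COUNT(valid) AS sold;
DUMP counts;
EOF

if command -v pig >/dev/null 2>&1; then
  # In local mode the relative input path is resolved against the current directory.
  (cd "$workdir" && pig -x local -f count_by_country.pig)
else
  echo "pig not found on PATH; script written to $workdir/count_by_country.pig"
fi
```

Each relation (sales, valid, by_country, counts) is defined in terms of the previous one, which is what lets Pig plan the whole flow before running any job.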

Execution modes:

Pig has two execution modes:

Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for analyzing small data sets.
MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (the cluster may be pseudo-distributed or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.
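The mode is selected with Pig's -x option when the shell is started. The small helper below is a hypothetical convenience function, not something shipped with Pig; it only prints the command line for each mode so the two can be compared side by side.

```shell
#!/bin/sh
# Hypothetical helper: print the pig command line for a given execution mode.
pig_mode_command() {
  case "$1" in
    local)     echo "pig -x local" ;;      # single JVM, local file system
    mapreduce) echo "pig -x mapreduce" ;;  # Hadoop cluster, HDFS
    *)         echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

pig_mode_command local
pig_mode_command mapreduce
```

When -x is omitted, Pig starts in MapReduce mode, which is why the demo later in this tutorial simply runs pig.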

Create your First PIG Program

Problem Statement:

Find out Number of Products Sold in Each Country.

Input: Our input data set is a CSV file, SalesJan2009.csv

Prerequisites:

This tutorial was developed on the Ubuntu Linux operating system.

You should have Hadoop (version 2.2.0 is used for this tutorial) already installed and running on the system.

You should have Java (version 1.8.0 used for this tutorial) already installed on the system.

You should have set JAVA_HOME accordingly.
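A quick way to confirm these prerequisites is sketched below. It assumes that HADOOP_HOME points at the Hadoop installation (the variable is used later in this tutorial); the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: check that a *_HOME directory contains an executable.
# $1: label to print, $2: directory value, $3: executable path relative to $2
check_tool_home() {
  if [ -n "$2" ] && [ -x "$2/$3" ]; then
    echo "ok: $1=$2"
  else
    echo "missing: $1 is unset or has no $3"
    return 1
  fi
}

# Check both prerequisites; returns non-zero if either is missing.
check_prereqs() {
  status=0
  check_tool_home JAVA_HOME "${JAVA_HOME:-}" bin/java || status=1
  check_tool_home HADOOP_HOME "${HADOOP_HOME:-}" bin/hadoop || status=1
  return "$status"
}

check_prereqs || echo "Fix the items marked 'missing' before installing Pig."
```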

This guide is divided into 2 parts

Pig Installation
Pig Demo

PART 1) Pig Installation

Before we start with the actual process, change user to 'hduser' (user used for Hadoop configuration).

Step 1) Download the latest stable release of Pig (version 0.12.1 is used for this tutorial) from any one of the mirror sites available at

http://pig.apache.org/releases.html

Select tar.gz (and not src.tar.gz) file to download.

Step 2) Once the download is complete, navigate to the directory containing the downloaded tar file and move the tar file to the location where you want to set up Pig. In this case we will move it to /usr/local

Change to the directory containing the Pig tar file

cd /usr/local

Extract contents of tar file as below

sudo tar -xvf pig-0.12.1.tar.gz

Step 3) Modify ~/.bashrc to add Pig-related environment variables

Open the ~/.bashrc file in any text editor of your choice and add the lines below:

export PIG_HOME=<Installation directory of Pig>

export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH
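As a concrete sketch, the lines below assume that pig-0.12.1.tar.gz was extracted under /usr/local, giving /usr/local/pig-0.12.1; adjust the path if your installation differs. The path_prepend helper is a hypothetical addition that keeps PATH from growing each time ~/.bashrc is sourced again.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to PATH only if it is not already there.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present: leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
}

# Assumed location: pig-0.12.1.tar.gz extracted under /usr/local.
export PIG_HOME=/usr/local/pig-0.12.1
path_prepend "$PIG_HOME/bin"
export PATH
```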

Step 4) Now, source this environment configuration using the command below

. ~/.bashrc

Step 5) We need to recompile Pig to support Hadoop 2.2.0

Here are the steps to do this:

Go to the Pig home directory

cd $PIG_HOME

Install ant

sudo apt-get install ant

Note: The download will start, and how long it takes depends on your internet speed.

Recompile Pig

sudo ant clean jar-all -Dhadoopversion=23

Please note that multiple components are downloaded during this recompilation, so the system should be connected to the internet.

Also, if this process gets stuck and you don't see any progress on the command prompt for more than 20 minutes, press Ctrl + C and rerun the same command.

In our case, the recompilation took about 20 minutes.
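If you would rather not watch the build, the interrupt-and-rerun advice can be approximated with the GNU coreutils timeout command. This is only a sketch: timeout limits the total run time of each attempt rather than detecting inactivity, and the 40-minute limit and three attempts in the example are assumptions chosen to leave headroom over the 20-minute build time mentioned above. The function name is hypothetical.

```shell
#!/bin/sh
# Run a command under a wall-clock limit, retrying a fixed number of times.
# Usage: retry_with_timeout <limit> <attempts> <command> [args...]
retry_with_timeout() {
  limit=$1
  attempts=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt $i of $attempts failed or exceeded $limit" >&2
    i=$((i + 1))
  done
  return 1
}

# Example (not run here): up to 40 minutes per attempt, 3 attempts.
# retry_with_timeout 40m 3 sudo ant clean jar-all -Dhadoopversion=23
```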

Step 6) Test the Pig installation using command

pig -help

PART 2) Pig Demo

Step 7) Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh

Step 8) In MapReduce mode, Pig reads its input file from HDFS and stores the results back to HDFS.

Copy the file SalesJan2009.csv (stored on the local file system at ~/input/SalesJan2009.csv) to the root directory (/) of HDFS (Hadoop Distributed File System)

Here the file is in the folder input. If the file is stored in some other location, use that path instead

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /

Verify whether the file was actually copied or not.

$HADOOP_HOME/bin/hdfs dfs -ls /
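Running -copyFromLocal a second time fails once the file already exists in HDFS, so the copy can be guarded with hdfs dfs -test -e. The function below is a sketch under that assumption; upload_if_missing and the HDFS_CMD override (which allows trying the function with a stub instead of a real cluster) are hypothetical names.

```shell
#!/bin/sh
# Copy a local file to an HDFS directory only when the target does not exist yet.
# HDFS_CMD is a hypothetical override; by default $HADOOP_HOME/bin/hdfs is used.
upload_if_missing() {
  local_file=$1
  hdfs_dir=$2
  hdfs_cmd=${HDFS_CMD:-${HADOOP_HOME:-}/bin/hdfs}
  # Build the target path, e.g. "/" + "SalesJan2009.csv" -> "/SalesJan2009.csv"
  target="${hdfs_dir%/}/$(basename "$local_file")"
  if "$hdfs_cmd" dfs -test -e "$target"; then
    echo "already in HDFS: $target"
  else
    "$hdfs_cmd" dfs -copyFromLocal "$local_file" "$hdfs_dir" &&
      echo "copied to HDFS: $target"
  fi
}

# Example (not run here):
# upload_if_missing ~/input/SalesJan2009.csv /
```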

Step 9) Pig Configuration

First, navigate to $PIG_HOME/conf and make a backup copy of pig.properties

cd $PIG_HOME/conf

sudo cp pig.properties pig.properties.original

Open pig.properties using a text editor of your choice, and specify the log file path using the pig.logfile property

sudo gedit pig.properties

The logger will use this file to log errors.

Step 10) Run the command 'pig', which starts the Pig command prompt, an interactive shell for Pig queries.
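The same change can be made without opening an editor. The sketch below sets a key in a properties file, replacing an existing active line or appending a new one; set_property is a hypothetical helper and the log path in the example is an arbitrary choice. It assumes the value contains no '|' or '&' characters, since those are special in the sed replacement.

```shell
#!/bin/sh
# Hypothetical helper: set key=value in a properties file.
# Replaces an existing uncommented line for the key, otherwise appends one.
set_property() {
  file=$1
  key=$2
  value=$3
  if grep -q "^[[:space:]]*$key=" "$file"; then
    tmp=$(mktemp)
    # '|' is the sed delimiter so that '/' in paths needs no escaping.
    sed "s|^[[:space:]]*$key=.*|$key=$value|" "$file" > "$tmp" &&
      cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example (not run here; needs write access to the file, and the path is hypothetical):
# set_property "$PIG_HOME/conf/pig.properties" pig.logfile /home/hduser/pig-logs/pig.log
```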

pig

Step 11) At the Grunt command prompt for Pig, execute the Pig commands below in order.

-- A. Load the file containing data.

salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray,Product:chararray,Price:chararray,Payment_Type:chararray,Name:chararray,City:chararray,State:chararray,Country:chararray,Account_Created:chararray,Last_Login:chararray,Latitude:chararray,Longitude:chararray);

Press Enter after this command.