Hadoop MapReduce Tutorial: Counters & Joins with Example
A counter in MapReduce is a mechanism used for collecting statistical information about a MapReduce job. This information can be useful for diagnosing problems in MapReduce job processing. Counters are similar to putting a log message in the map or reduce code.
Typically, these counters are defined in the program (map or reduce) and are incremented during execution when a particular event or condition (specific to that counter) occurs. A very good application of counters is to track valid and invalid records in an input dataset.
There are two types of counters:
1. Hadoop Built-In counters: There are some built-in counters that exist per job. Below are the built-in counter groups-
- MapReduce Task Counters - Collect task-specific information (e.g., number of input records) during execution.
- FileSystem Counters - Collect information such as the number of bytes read or written by a task.
- FileInputFormat Counters - Collect the number of bytes read through FileInputFormat.
- FileOutputFormat Counters - Collect the number of bytes written through FileOutputFormat.
- Job Counters - These counters are used by the JobTracker. Statistics they collect include, for example, the number of tasks launched for a job.
2. User Defined Counters
In addition to built-in counters, users can define their own counters using facilities provided by the programming language. For example, in Java, an 'enum' is used to define user-defined counters.
An example MapClass with Counters to count the number of missing and invalid values:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    static enum SalesCounters { MISSING, INVALID };

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Input string is split using ',' and stored in the 'fields' array
        // (a negative limit keeps trailing empty fields)
        String fields[] = value.toString().split(",", -20);

        // Value at 4th index is country; it is stored in the 'country' variable
        String country = fields[4];

        // Value at 8th index is sales data; it is stored in the 'sales' variable
        String sales = fields[8];

        if (country.length() == 0) {
            reporter.incrCounter(SalesCounters.MISSING, 1);
        } else if (sales.startsWith("\"")) {
            reporter.incrCounter(SalesCounters.INVALID, 1);
        } else {
            output.collect(new Text(country), new Text(sales + ",1"));
        }
    }
}
The above code snippet shows an example implementation of counters in MapReduce.
Here, SalesCounters is a counter defined using 'enum'. It is used to count MISSING and INVALID input records.
In the code snippet, if the 'country' field has zero length, its value is missing, and hence the corresponding counter SalesCounters.MISSING is incremented.
Next, if the 'sales' field starts with a ", the record is considered INVALID. This is indicated by incrementing the counter SalesCounters.INVALID.
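For completeness, here is a minimal driver-side sketch of how these counter values could be read back after the job finishes, using the same old 'mapred' API. The driver class name SalesDriver and the job wiring are assumptions for illustration, not part of the original tutorial:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SalesDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SalesDriver.class);
        // ... set mapper/reducer classes and input/output paths here ...

        RunningJob job = JobClient.runJob(conf);  // blocks until the job finishes

        // Read back the user-defined counters incremented by MapClass
        Counters counters = job.getCounters();
        long missing = counters.getCounter(MapClass.SalesCounters.MISSING);
        long invalid = counters.getCounter(MapClass.SalesCounters.INVALID);
        System.out.println("Missing records: " + missing);
        System.out.println("Invalid records: " + invalid);
    }
}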
MapReduce Join
Joining two large datasets can be achieved using a MapReduce join. However, this process involves writing a lot of code to perform the actual join operation.
Joining two datasets begins by comparing the size of each dataset. If one dataset is smaller than the other, the smaller dataset is distributed to every data node in the cluster. Once it is distributed, either the Mapper or the Reducer uses the smaller dataset to look up matching records from the large dataset and then combines those records to form output records. A code sketch of this lookup-based approach follows.
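As an illustration, below is a minimal sketch of that lookup performed in the mapper (a "replicated" join), where the smaller file is shipped to every node with DistributedCache from the old 'mapred' API. The class name ReplicatedJoinMapper and the assumed file layout (Dept_ID in the first column) are hypothetical, not the tutorial's actual program:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReplicatedJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> deptNames = new HashMap<String, String>();

    @Override
    public void configure(JobConf conf) {
        try {
            // The smaller dataset was added in the driver with
            // DistributedCache.addCacheFile(...), so it is on local disk here.
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader reader =
                new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                deptNames.put(parts[0], parts[1]);  // Dept_ID -> Dept_Name
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Failed to load cached file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(",");
        String deptId = fields[0];
        String name = deptNames.get(deptId);  // in-memory lookup
        if (name != null) {                   // inner join: skip non-matches
            output.collect(new Text(deptId), new Text(name + "," + fields[1]));
        }
    }
}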
Depending on where the actual join is performed, joins are classified into-
1. Map-side join - When the join is performed by the mapper, it is called a map-side join. In this type, the join is performed before the data is actually consumed by the map function. It is mandatory that the input to each map be in the form of a partition and in sorted order. Also, there must be an equal number of partitions, and each must be sorted by the join key.
2. Reduce-side join - When the join is performed by the reducer, it is called a reduce-side join. There is no necessity in this join for the datasets to be in a structured form (or partitioned).
Here, the map-side processing emits the join key and the corresponding tuples from both tables. As an effect of this processing, all tuples with the same join key fall into the same reducer, which then joins the records having the same join key.
The overall process flow is depicted in the diagram below. A minimal code sketch of a reduce-side join follows it.
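To make the flow concrete, here is a reduce-side join sketch in the same old 'mapred' API. The tags ("NAME"/"STRENGTH"), class names, and field layout are assumptions for illustration, not the tutorial's actual program; a symmetric mapper for the second file would tag its values with "STRENGTH":

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper for the first file: emits (Dept_ID, "NAME<TAB>department name").
public class DeptNameMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(",");
        // Tag the value with its source table so the reducer can tell them apart
        output.collect(new Text(fields[0]), new Text("NAME\t" + fields[1]));
    }
}

// Reducer: all tagged values sharing one Dept_ID arrive together and are joined.
class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        List<String> names = new ArrayList<String>();
        List<String> strengths = new ArrayList<String>();
        while (values.hasNext()) {
            String[] tagged = values.next().toString().split("\t", 2);
            if ("NAME".equals(tagged[0])) {
                names.add(tagged[1]);
            } else {
                strengths.add(tagged[1]);
            }
        }
        // Emit the cross product of matching records from the two inputs
        for (String name : names) {
            for (String strength : strengths) {
                output.collect(key, new Text(name + "," + strength));
            }
        }
    }
}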
MapReduce Hadoop Program To Join Data
Problem Statement:
There are two sets of data in two different files.
The key Dept_ID is common to both files.
The goal is to use a MapReduce join to combine these files.
Input: Our input data set consists of two txt files, DeptName.txt & DeptStrength.txt
Download Input Files From Here
Prerequisites:
- This tutorial was developed on the Linux Ubuntu operating system.
- You should have Hadoop (version 2.2.0 is used in this tutorial) already installed.
- You should have Java (version 1.8.0 is used in this tutorial) already installed on the system.
Before we start with the actual process, change the user to 'hduser_' (the user used for Hadoop).
su - hduser_
Steps:
Step 1) Copy the file MapReduceJoin.tar.gz to a location of your choice
Step 2) Extract the archive
sudo tar -xvf MapReduceJoin.tar.gz
Step 3) Go to the directory MapReduceJoin/
cd MapReduceJoin/
Step 4) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
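Optionally, you can confirm that the HDFS and YARN daemons came up by listing the running Java processes with jps (a standard JDK tool); NameNode, DataNode, ResourceManager, and NodeManager should appear.
jps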
Step 5) DeptStrength.txt and DeptName.txt are the input files used for this program.
These files need to be copied to HDFS using the command below-
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal DeptStrength.txt DeptName.txt /
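If you want to verify that both files reached HDFS, you can list the root directory:
$HADOOP_HOME/bin/hdfs dfs -ls /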
Step 6) Run the program using the command below-
$HADOOP_HOME/bin/hadoop jar MapReduceJoin.jar MapReduceJoin/JoinDriver /DeptStrength.txt /DeptName.txt /output_mapreducejoin
Step 7) After execution, the output file (named 'part-00000') will be stored in the directory /output_mapreducejoin on HDFS
Results can be seen using the command line interface
$HADOOP_HOME/bin/hdfs dfs -cat /output_mapreducejoin/part-00000
Results can also be seen via the web interface-
Select 'Browse the filesystem' and navigate up to /output_mapreducejoin
Open the file part-00000 to see the results
NOTE: Please note that before running this program again, you will need to delete the output directory /output_mapreducejoin
$HADOOP_HOME/bin/hdfs dfs -rm -r /output_mapreducejoin
Alternatively, use a different name for the output directory.