MapReduce Programming Hello World Job

In the Hadoop and MapReduce tutorial we will see how to create hello world job and what are the steps to creating a mapreduce program. There are following steps to creating mapreduce program.

Step1- Creating a file
J$ cat>file.txt
hi how are you
how is your job
how is your family
how is your brother
how is your sister
what is the time now
what is the strength of hadoop
Step2- loading file.txt from local file system to HDFS
J$ hadoop fs -put file.txt  file

Step3- Writing programs
  1.  DriverCode.java
  2.  MapperCode.java
  3.  ReducerCode.java

Step4- Compiling all above .java files

J$ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java

Step5- Creating jar file

J$ jar cvf job.jar *.class

Step6- Running above job.jar on file (which there in HDFS)

J$ hadoop jar job.jar DriverCode file TestOutput

Lets start with actual code for these steps above.

Hello World Job -> WordCountJob

1. DriverCode (WordCount.java)
package com.doj.hadoop.driver;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

 * @author Dinesh Rajput
public class WordCount extends Configured implements Tool {
 public int run(String[] args) throws Exception {
   if (args.length != 2) {
       System.err.printf("Usage: %s [generic options] <input> <output>\n",
       return -1;
   JobConf conf = new JobConf(WordCount.class);
   conf.setJobName("Word Count");
   FileInputFormat.addInputPath(conf, new Path(args[0]));
   FileOutputFormat.setOutputPath(conf, new Path(args[1]));
   return conf.waitForCompletion(true) ? 0 : 1;
 public static void main(String[] args) throws Exception {
   int exitCode = ToolRunner.run(new WordCount(), args);
2. MapperCode (WordMapper.java)
package com.doj.hadoop.driver;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

 * @author Dinesh Rajput
public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();
 public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
  throws IOException
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens())
   output.collect(word, one);

3. ReducedCode (WordReducer.java)
package com.doj.hadoop.driver;

 * @author Dinesh Rajput
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
 public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
   Reporter reporter) throws IOException
  int sum = 0;
  while (values.hasNext())
   sum += values.next().get();
  output.collect(key, new IntWritable(sum));

MapReduce Flow Chart Sample Example

In this mapreduce tutorial we will explain mapreduce sample example with its flow chart. How to work mapreduce for a job.

  • We have a large collection of text documents in a folder. (Just to give a feel size.. we have 1000 documents each with average of 1 Millions words)
  • What we need to calculate:-
    • Count the frequency of each distinct word in the documents?
  • How would you solve this using simple Java program?
  • How many lines of codes will u write?
  • How much will be the program execution time?
To overcome listed above problems into some line using mapreduce program. Now we look into below mapreduce function for understanding how to its work on large dataset.

  • Map Functions operate on every key, value pair of data and transformation logic provided in the map function.
  • Map Function always emits a Key, Value Pair as output
       Map(Key1, Valiue1) --> List(Key2, Value2)
  • Map Function transformation is similar to Row Level Function in Standard SQL
  • For Each File
    • Map Function is
      • Read each line from the input file
        • Tokenize and get each word
          • Emit the word, 1 for every word found
The emitted word, 1 will from the List that is output from the mapper

So who take ensuring the file is distributed and each line of the file is passed to each of the map function?-Hadoop Framework take care about this, no need to worry about the distributed system.

  • Reduce Functions takes list of value for every key and transforms the data based on the (aggregation) logic provided in the reduce function.
  • Reduce Function
        Reduce(Key2, List(Value2)) --> List(Key3, Value3)
  • Reduce Functions is similar to Aggregate Functions in Standard SQL
Reduce(Key2, List(Value2)) --> List(Key3, Value3)

For the List(key, value) output from the mapper Shuffle and Sort the data by key
Group by Key and create the list of values for a key
  • Reduce Function is
    • Read each key (word) and list of values (1,1,1..) associated with it.
      • For each key add the list of values to calculate sum
        • Emit the word, sum for every word found
So who is ensuring the shuffle, sort, group by etc?

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);

context.write(word, one);

public void reduce(Text key, Iterable <IntWritable> values, Context context) throws IOException, InterruptedException{

int sum = 0;
for(IntWritable val : values){
sum += val.get();
context.write(key, new IntWritable(sum));



Suppose we have a file with size about 200 MB, suppose content as follows

_______File(200 MB)____________
hi how are you
how is your job (64 MB) 1-Split
how is your family
how is your brother (64 MB) 2-Split
how is your sister
what is the time now (64 MB) 3-Split
what is the strength of hadoop (8 MB) 4-Split

In above file we have divided this file into 4 splits with sizes three splits with size 64 MB and last fourth split with size 8 MB.

Input File Formats:
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat


Lets see in another following figure to understand the process of MAPREDUCE.


Saving Objects using Hibernate APIs

Introduction to Hibernate
  • An ORM tool.
  • Used in the data layer of application.
  • Implement JPA.
  • JPA (Java persistence api):- this is set of standards for and java persistence api.

The problem with JDBC
  • Mapping member variables to the column.
  • Mapping relationships.
  • Handling data types.
  • Example Boolean most of the databases do not have Boolean datatype.
  • Managing the state of object.

Saving with an object with hibernate.
  1. JDBC database configuration >> Hibernate configuration .cfg.xml file.
  2. The model object >> Annotations or Hibernate mapping file .hbm.xml files.
  3. Service method to create the model object >> Use the hibernate.
  4. Database design. >> No needed.
  5. DAO method to save the object using sql queries >> No needed.

Now, i am going to create a simple POJO class Book.java with some properties and their setters and getters.
package com.bloggers.model;
import java.util.Date;
public class Book
 /* for primary key in database table */
 private int bookId;
 private String bookName;
 private Date date;
 public int getBookId()
  return bookId;
 public void setBookId(int bookId)
  this.bookId = bookId;
 public String getBookName()
  return bookName;
 public void setBookName(String bookName)
  this.bookName = bookName;
 public Date getDate()
  return date;
 public void setDate(Date date)
  this.date = date;

Few point while creating a model object or JavaBeans (Book.java) class.
  • bookId property holds a unique identifier value for a book object. Provides an identifier property if you want to use the full feature set of Hibernate
  • Application need to distinguish objects by identifier.
  • Hibernate can access public, private, and protected accessor methods, as well as public, private and protected fields directly.
  • no-argument constructor is a requirement for all persistent classes. Hibernate create objects using Java Reflection.

Mapping file :-hbm file (Hibernate Mapping file)
Now second step is we have to create a hbm file with extension .hbm.xml. HBM file means Hibernate Mapping File. Mapping file gives all the information to hibernate about table columns etc which are mapped to object.
All persistent entity classes need a mapping to a table in the SQL database.
Basic structor of hbm file
<hibernate -mapping="-mapping">

  • Hibernate not loads the DTD file from the web, it first look it up from the classpath of the application.
  • DTD file is included in hibernate-core.jar

Mapping file of Book.java class(Book.hbm.xml)
I will complete this step by step and try to explain all the elements.
1. Add a class element between two hibernate-mapping tags.
<?xml version="1.0"?>
   <!DOCTYPE hibernate-mapping PUBLIC
    "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
   <class name="com.bloggers.model.Book" >

class="com.bloggers.model.Book" .Specify the qualified class name in name attribute of class element.

2. Next, we need to tell Hibernate about the remaining entity class properties. By default, no properties of the class are considered persistent.
<?xml version="1.0"?>
   <!DOCTYPE hibernate-mapping PUBLIC
     "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
    <class name="com.bloggers.model.Book" >
     <id name="bookId"/>
     <property name="date" column="PUBLISH_DATE"/>
name="bookId" attribute of the property element tells Hibernate which getter and setter methods to use in our case hibernate will search for getBookId() , setBookId() etc methods.

Question : Why does the date property mapping include the column attribute, but the bookID does not?
Answer : Without the column attribute, Hibernate by default uses the property name as the column name. This works for bookId, however, date is a reserved keyword in most databases so you will need to map it to a different name.

Create a Hibernate Configuration File (hibernate.cfg.xml)
Hibernate required a configuration file for making the connection with database by default the file name is hibernate.cfg.xml

We can do hibernate configuration in 3 different ways and these are :-
  • we can use a simple hibernate.properties file
  • a more sophisticated hibernate.cfg.xml file
  • or even complete programmatic setup

I am going to use the 2 option means by using hibernate.cfg.xml file.
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE hibernate-configuration PUBLIC
  "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
  <!-- Database connection settings -->
  <property name="connection.driver_class">com.mysql.jdbc.Driver</property>
  <property name="connection.url">jdbc:mysql://localhost:3306/hibernatedb</property>
  <property name="connection.username">root</property>
  <property name="connection.password">root</property>
  <!-- JDBC connection pool (use the built-in) -->
  <property name="connection.pool_size">1</property>
  <!-- SQL dialect -->
  <property name="dialect">org.hibernate.dialect.MySQL5Dialect</property>
  <!-- Disable the second-level cache  -->
  <property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
  <!-- Echo all executed SQL to stdout -->
  <property name="show_sql">true</property>
  <!-- Drop and re-create the database schema on startup -->
  <property name="hbm2ddl.auto">create</property>
  <mapping resource="com/bloggers/model/Book.hbm.xml"/>

  1. SessionFactory is a global factory responsible for a particular database. If you have several databases, for easier startup you should use several configurations in several configuration files.
  2. The first four property elements contain the necessary configuration for the JDBC connection.
  3. The dialect property element specifies the particular SQL variant Hibernate generates.
  4. In most cases, Hibernate is able to properly determine which dialect to use.
  5. The hbm2ddl.auto option turns on automatic generation of database schema directly into the database. This can also be turned off by removing the configuration option, or redirected to a file with the help of the SchemaExport Ant task.
  6. Finally, add the mapping file(s) for persistent classes to the configuration.

Utility class for getting the sessionFactory.
package com.bloggers.util;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;
public class HibernateUtil
 private static final SessionFactory sessionFactory = buildSessionFactory();
 private static SessionFactory buildSessionFactory()
   // Create the SessionFactory from hibernate.cfg.xml
   return new Configuration().configure().buildSessionFactory();
  catch (Throwable ex)
   // Make sure you log the exception, as it might be swallowed
   System.err.println("Initial SessionFactory creation failed." + ex);
   throw new ExceptionInInitializerError(ex);
 public static SessionFactory getSessionFactory()
  return sessionFactory;

Run the example
Create a simple class for running the example.
package com.bloggers;
import java.util.Date;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import com.bloggers.model.Book;
import com.bloggers.util.HibernateUtil;
public class HibernateTest
 public static void main(String[] args)
  HibernateTest hibernateTest = new HibernateTest();
  /* creating a new book object */
  Book book = new Book();
  book.setDate(new Date());
  book.setBookName("Hibernate in action");
  /* saving the book object */
 public void saveBook(Book book)
  SessionFactory sessionFactory = HibernateUtil.getSessionFactory();
  Session session = sessionFactory.openSession();
  /* saving the book object here */

Before running the HibernateTest class make sure you have created a schema named hibernatedb in MySQL database .

Question : Why schema named hibernatedb ?
Answer : Because in hibenate.cfg.xml

<property name="connection.url">jdbc:mysql://localhost:3306/hibernatedb</property>

we have specify hibernatedb. You are free to change whatever name you want. Create schema using mysql>create schema hibernate; command.

Saving Objects using Hibernate APIs

Question : Did you see the table it does not contains the column name bookName?
Answer : If yes then you can easily found that why we do not have bookName in table book and the answer is because we did not map the bookName property in our hbm file so map the property and run the main method again and you will observe that.
  • Hibernate automatically add a new column in book table.
  • When you run the select query on database again you see that it also contain bookName

Introduction to MapReduce

In this hadoop tutorial we will introduce map reduce, what is map reduce. Before map reduce how to analyze the bigdata. Please look into following picture.

Introduction to MapReduce

Here bigdata split into equal size and grep it using linux command and matches with some specific characters like high temperature of any large data set of weather department. But this way have some problems as follows.

Problems in the Traditional way analysis-

1. Critical path problem (Its amount of time to take to finish the job without delaying the next milestone or actual completion date).
2. Reliability problem
3. Equal split issues
4. Single split may failure
5. Sorting problem

For overcome these all problems Hadoop introduce mapreduce in picture for analyzing such amount of data in fast.


What is MapReduce
  • MapReduce is a programming model for processing large data sets.
  • MapReduce is typically used to do distributed computing on clusters of computers.
  • The model is inspired by the map and reduce functions commonly used in functional programming.
  • Function output is dependent purely on the input data and not on any internal state. So for a given input the output is always guaranteed.
  • Stateless nature of the functions guarantees scalability.  
Key Features of MapReduce Systems
  • Provides Framework for MapReduce Execution
  • Abstract Developer from the complexity of Distributed Programming
  • Partial failure of the processing cluster is expected and tolerable.
  • Redundancy and fault-tolerance is built in, so the programmer doesn't have to worry
  • MapReduce Programming Model is Language Independent
  • Automatic Parallelization and distribution
  • Fault Tolerance
  • Enable Data Local Processing
  • Shared Nothing Architecture Model
  • Manages inter-process communication
MapReduce Explained
  • MapReduce consist of 2 Phases or Steps
    • Map
    • Reduce
The "map" step takes a key/value pair and produces an intermediate key/value pair.

The "reduce" step takes a key and a list of the key's values and outputs the final key/value pair.

map reduce

  • MapReduce Simple Steps
    • Execute map function on each input received
    • Map Function Emits Key, Value pair
    • Shuffle, Sort and Group the outputs
    • Executes Reduce function on the group
    • Emits the output per group
Map Reduce WAY-

mapreduce way

1. Very big data convert in to splits
2. Splits are processed by mapper
3. Some partitioning functionality operated on the output of mapper
4. After that data move to Reducer and produce desire output

Anatomy of a MapReduce Job Run-
  • Classic MapReduce (MapReduce 1)
    A job run in classic MapReduce is illustrated in following Figure. At the highest level, there
    are four independent entities:
    • The client, which submits the MapReduce job.
    • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
    • The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

  • YARN (MapReduce 2)
    MapReduce on YARN involves more entities than classic MapReduce. They are:
    • The client, which submits the MapReduce job.
    • The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
    • The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
    • The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
    The process of running a job is shown in following Figure and described in the following sections.


Hadoop Confiuration

Hadoop Configuration
 I have to do in the following layers.

  • HDFS Layer
    • NameNode-Master
    • DataNode-Store Data(Actual Storage)
  • MapReduce Layer
    • JobTracker
    • TaskTracker
  • Secondary Namenode- storing backup of NameNode it will not work as an alternate namenode, it just stored namenode metadata

Types of Hadoop Configurations
  • Standalone Mode
    • All processes runs as single process
    • Preferred in development
  • Pseudo Cluster Mode
    • All processes run in different process but on a single machine
    • Simulate cluster
  • Fully Cluster Mode
    • All processes running on different boxes
    • Preferred in production Mode

What are important files to be configure
  • hadoop-env.sh (set java environment and logging file)
  • core-site.xml (configure namenode)
  • hdfs-site.xml (configure datanode)
  • mapred-site.xml (map reduce here taking responsibility of configuring jobTracker and taskTracker)
  • yarn-site.xml
  • master (file configured on each datanodes telling about its namenode)
  • slave (file configured on namenode telling what all slave of datanode it has to manage)

Hadoop Installation Tutorial (Hadoop 2.x)

This chapter explains how to set up Hadoop to run on a cluster of machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work they need to run on multiple nodes.

Hadoop 2 or YARN is the new version of Hadoop. It adds the yarn resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce designed and implemented by Google initially for processing and generating large data sets. HDFS is Hadoop’s underlying data persistency layer, loosely modeled after the Google file system (GFS). Many cloud computing services, such as Amazon EC2, provide MapReduce functions. Although MapReduce has its limitations, it is an important framework to process large data sets.

How to set up a Hadoop 2.x (YARN) environment in a cluster is introduced in this tutorial. In this tutorial, we set up a Hadoop (YARN) cluster, one node runs as the NameNode and the ResourceManager and many other nodes runs as the NodeManager and DataNode (slaves).

Enable “hadoop” user to password-less SSH login to slaves

Just for our convenience, make sure the “hadoop” user from the NameNode and ResourceManager can ssh to the slaves without password so that we need not to input the password every time.

Details about password-less SSH login can be found in Enabling Password-less ssh Login.

Install software needed by Hadoop

The software needed to install Hadoop is Java (we use JDK here) besides of Hadoop itself.

Install Java JDK on UBUNTU

Oracle Java JDK can be downloaded from JDK’s webpage. You need to install (actually just copy the JDK directory) Java JDK on all nodes of the Hadoop cluster.

 user@ubuntuvm:~$ sudo apt-get java-7-openjdk-i386
As an example in this tutorial, the JDK is installed into


You may need to make soft link to /usr/java/default from the actual location where you installed JDK.

Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$JAVA_HOME/bin:$PATH

Hadoop 2.x.x Configuration-

Step1. Hadoop software can be downloaded from Hadoop website. In this tutorial, we use Hadoop-2.5.2.

You can unpack the tar ball to a directory. In this example, we unpack it to


which is a directory under the hadoop Linux user’s home directory.

Hadoop Installation Tutorial (Hadoop 2.x)

The Hadoop directory need to be duplicated to all nodes after configuration. Remember to do it after the configuration.

Step2. Configure environment variables for the “hadoop” user

We assume the “hadoop” user uses bash as its shell.

Add these lines at the bottom of ~/.bashrc on all nodes:
goto terminal >> sudo gedit .bashrc >> press enter >> put password "password"

Step3. Put following path to .bashrc file
 export HADOOP_HOME=$HOME/hadoop-2.5.2
export HADOOP_CONF_DIR=$HOME/hadoop-2.5.2/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.5.2
export HADOOP_COMMON_HOME=$HOME/hadoop-2.5.2
export HADOOP_HDFS_HOME=$HOME/hadoop-2.5.2
export YARN_HOME=$HOME/hadoop-2.5.2

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

export PATH=$PATH:$HOME/hadoop-2.5.2

Hadoop Installation Tutorial

Step4- Configure Hadoop-2.5.2 Important files

The configuration files for Hadoop is under /home/user/hadoop-2.5.2/etc/hadoop for our installation here. Here the content is added to the .xml files between <configuration> and </configuration>.

i- core-site.xml-

Here the NameNode runs on localhost.


ii- yarn-site.xml -

The YARN ResourceManager runs on localhost and supports MapReduce shuffle.

<!-- Site specific YARN configuration properties -->

iii- hdfs-site.xml-

The configuration here is optional. Add the following settings if you need them. The descriptions contain the purpose of each configuration.

1. First property create number of replication for datanode with name dfs.replication
2. Second property give the name of namenode directory dfs.namenode.name.dir with file directory you have to create at "/home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode" 
3. Third property give the name of datanode directory dfs.datanode.name.dir with file directory you have to create "/home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode" 

iv- mapred-site.xml-

First copy mapred-site.xml.template to mapred-site.xml and add the following content.

At terminal of Ubuntu $>> cp  mapred-site.xml.template  mapred-site.xml >> press enter



v- hadoop-env.sh Here we set java environment variable


 Step5. After configuring all five important files you have format the namenode with following command on ubuntu terminal.
user@ubuntuvm:~$ cd hadoop-2.5.2/bin/
user@ubuntuvm:~/hadoop-2.5.2/bin$ ./hadoop namenode -format


Step6. After formatting namenode we have to start all daemon services of hadoop-2.5.2
move to first /home/user/hadoop-2.5.2/sbin

 user@ubuntuvm:~$ cd hadoop-2.5.2/sbin/

i- datanode daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./hadoop-daemon.sh start datanode

ii- namenode daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./hadoop-daemon.sh start namenode
iii- resourcemanager daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./yarn-daemon.sh start resourcemanager

iv- nodemanager daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./yarn-daemon.sh start nodemanager

v- jobhistoryserver daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./mr-jobhistory-daemon.sh start historyserver

Step7. To verify all daemon services please write following command at terminal

 user@ubuntuvm:~/hadoop-2.5.2/sbin$ jps
3869 DataNode
4067 ResourceManager
4318 NodeManager
4449 JobHistoryServer
4934 NameNode
5389 Jps


Suppose if some of the services are not started yet please verify logs in logs folder of hadoop-2.5.2 at following location

 Here you can check each every file for logs

Step8. To verify all services at browser goto filefox of virtaul machine and open following url...



We can check two more files also
1. slaves-

For other Delete localhost and add all the names of the TaskTrackers, each in on line. For example:


2. master-

Hadoop Installation Tutorial (Hadoop 1.x)

Software Required-
Setup Virtual Machine

Step1. >goto traffictool.net->goto ubuntu(Ubuntu1404)->download it->extract it

Step2. Suppose your directory after extract it
"D:\personal data\hadoop\Ubuntu1404"

Step3. >goto google->search VMWARE PLAYER->goto result select DESKTOP & END USER->download it->install it

Step4. After installation of virtual machine goto-"D:\personal data\hadoop\Ubuntu1404\"

Step5. double click "Ubuntu.vmx" then virtual is running after that open now as following.


Step6. ->>Through the VM machine download hadoop release Hadoop-1.2.1(61M) >> extract it "hadoop-1.2.1.tar.gz"

Step7. In this tutorial Hadoop install into following location

Step8. ->>install Java in Linux
sudo apt-get install openjdk-7-jdk

Step9. In this tutorial JDK install into following location

Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that is initially designed and implemented by Google for processing and generating large data sets. HDFS is Hadoop’s underlying data persistency layer, which is loosely modelled after Google file system GFS. Hadoop has seen active development activities and increasing adoption. Many cloud computing services, such as Amazon EC2, provide MapReduce functions, and the research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science. Although MapReduce has its limitations, it is an important framework to process large data sets.

How to set up a Hadoop environment in a cluster is introduced in this tutorial. In this tutorial, we set up a Hadoop cluster, one node runs as the NameNode, one node runs as the JobTracker and many nodes runs as the TaskTracker (slaves).

Step10. Enable "hadoop" user to password-less SSH login to slaves-
Just for our convenience, make sure the "hadoop" user from NameNode and JobTracker can ssh to the slaves without password so that we need not to input the password every time.

Details about password-less SSH login can be found Enabling Password-less ssh Login.

Step11. Hadoop Configuration
Configure environment variables of “hadoop” user
Open terminal of command prompt and set environment  variable as follows

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export JAVA_INSTALL=/home/user/hadoop-1.2.1
 and Hadoop Path assign follows
export HADOOP_COMMON_HOME="/home/hadoop/hadoop/"

The HADOOP_COMMON_HOME environment variable is used by Hadoop’s utility scripts, and it must be set, otherwise the scripts may report an error message "Hadoop common not found".

The second line adds hadoop’s bin directory to the PATH sothat we can directly run hadoop’s commands without specifying the full path to it.

Step12. Configure Important files for Hadoop
A. /home/user/hadoop-1.2.1/conf/hadoop-env.sh
Add or change these lines to specify the JAVA_HOME and directory to store the logs:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export HADOOP_LOG_DIR=/home/user/hadoop-1.2.1/logs


B. /home/user/hadoop-1.2.1/conf/core-site.xml (configuring NameNode)
Here the NameNode runs on or localhost

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

C. /home/user/hadoop-1.2.1/conf/hdfs-site.xml (Configuring DataNode)
dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. dfs.data.dir is comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->




D. /home/user/hadoop-1.2.1/conf/mapred-site.xml (Configuring JobTracker)
Here the JobTracker runs on or localhost

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->





mapreduce.jobtracker.address is host or IP and port of JobTracker. mapreduce.jobtracker.system.dir is the path on the HDFS where where the Map/Reduce framework stores system files. mapreduce.cluster.local.dir is comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

E. /home/user/hadoop-1.2.1/conf/slaves

Delete localhost and add all the names of the TaskTrackers, each in on line. For example:

F. Start Hadoop
We need to start both the HDFS and MapReduce to start Hadoop.

1. Format a new HDFS
On NameNode
$ hadoop namenode -format
Remember to delete HDFS’s local files on all nodes before re-formating it:
$ rm /home/hadoop/data /tmp/hadoop-hadoop -rf

2. Start HDFS
On NameNode :

$ start-dfs.sh

3.Check the HDFS status:
On NameNode :

$ hadoop dfsadmin -report
There may be less nodes listed in the report than we actually have. We can try it again.

4. Start mapred:
On JobTracker:

$ start-mapred.sh

5.Check job status:

$ hadoop job -list

Shut down Hadoop cluster

We can stop Hadoop when we no long use it.

Stop HDFS on NameNode:

$ stop-dfs.sh

Stop JobTracker and TaskTrackers on JobTracker:

$ stop-mapred.sh

Enabling Password less ssh login

Enabling Linux Automatic Password-less SSH Login

Automatic passwrod-less ssh login can make our life easier. To enable this, we need to copy our SSH public keys to the remote machines for automatic password-less login. We introduce two methods in this post: using ssh-copy-id command and the manual way.

Generate SSH key pair

If you do not have a SSH private/public key pair, let’s generate one first.

$ ssh-keygen -t rsa

By default on Linux, the key pair is stored in ~/.ssh (id_rsa and id_rsa.pub for private and public key).

Copy public SSH key to the remote machine

You have two choices here. Unless that you can not use the ssh-copy-id method, you can try the “manual” way.

The easiest way

Let ssh-copy-id do it automatically:

$ ssh-copy-id username@remotemachine

If you have multiple keys in your ~/.ssh directory, you may need to use -i key_file to specify which key you will use.

The manual way

Copy the public SSH key to remote machine

$ scp .ssh/id_rsa.pub username@remotemachine:/tmp/

Log on the remote machine

$ ssh username@remotemachine

Append your public SSH key to ~/.ssh/authorized_keys

$ cp ~/.ssh/authorized_keys ~/.ssh/authorized_keys.bak # backing up before changing is a good habit
$ cat /dev/shm/id_rsa.pub >> ~/.ssh/authorized_keys # append pub key to authorized keys list

Make sure the mode of ~/.ssh/authorized_keys is 755:

$ chmod 755 ~/.ssh/authorized_keys

Possible Problems

Home directory permission

Check the home directory’s permission which may cause the key-based login fail (suppose the home directory is /home/zma):

# chmod 700 /home/zma/

JobTracker and TaskTracker Design

JobTracker and TaskTracker are coming into picture when we required processing to data set. In hadoop system there are five services always running in background (called hadoop daemon services).
Daemon Services of Hadoop-
  1. Namenodes
  2. Secondary Namenodes
  3. Jobtracker
  4. Datanodes
  5. Tasktracker

Above three services 1, 2, 3 can talk to each other and other two services 4,5 can also talk to each other. Namenode and datanodes are also talking to each other as well as Jobtracker and Tasktracker are also.
Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.

JobTracker and TaskTrackers Work Flow-


1. User copy all input files to distributed file system using namenode meta data.
2. Submit jobs to client which applied to input files fetched stored in datanodes.
3. Client get information about input files from namenodes to be process.
4. Client create splits of all files for the jobs
5. After splitting files client stored meta data about this job to DFS.
6. Now client submit this job to job tracker.


7. Now jobtracker come into picture and initialize job with job queue.
8. Jobtracker read job files from DFS submitted by client.
9. Now jobtracker create maps and reduces for jobs and input splits applied to mappers. Same number of mapper are there as many input splits are there. Every map work on individual split and create output.


10. Now tasktrackers come into picture and jobs submitted to every tasktrackers by jobtracker and receiving heartbeat from every TaskTracker for confirming tasktracker working properly or not. This heartbeat frequently sent to JobTracker in 3 second by every TaskTrackers. If suppose any task tracker is not sending heartbeat to jobtracker in 3 second then JobTracker wait for 30 second more after that jobtracker consider those tasktracker as a dead state and upate metadata about those task trackers.
11. Picks tasks from splits.
12. Assign to TaskTracker.


Finally all tasktrackers create outputs and number of reduces generate as number of outputs created by task trackers. After all reducer give us final output.

HDFS Architecture

Hi in this hadoop tutorial we will describing now HDFS Architecture. There are following are two main components of HDFS.

Main Components of HDFS-
  • NameNodes
    • master of the system
    • maintain and manage the blocks which are present on the datanodes
    • Namenode is like above pic "lamborghini car" strong with body, is single point failure point

  • DataNodes
    • slaves which are deployed on each machine and provide the actual storage
    • responsible for serving read and write request for the clients
    • Datanodes are like above pic where like "ambassador cars" some less strong compare to "lamborghini" but have actual point of service providers, is commodity hardware.
HDFS Architecture-
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS Architecture 
Rack-storage area where we store multiple datanodes
client-is application which you used to intract with NameNode and DataNode

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

NameNode Metadata-

Meta data in memory
  • the entire meta-data in main memory
  • no demand paging for FS meta-data
Types of Meta-data
  • List of files
  • List of blocks of each file
  • List of data nodes of each block
  • File attributes e.g. access time, replication factors
A Transaction Log
  • Recode file creation and deletion

Secondary Name Node-storing backup of NameNode it will not work as an alternate of namenode, it just stored namenode metadata.

secondary namenodes

HDFS Client Create a New File-

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. The list is sorted by the network topology distance from the client. The client contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in following figure.

HDFS client

Unlike conventional filesystems, HDFS provides an API that exposes the locations of a file blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves tolerance against faults and increases read bandwidth.

Anatomy of File Write by HDFS Client-


1. Create input splits by HDFS client.
2. After that it goes to namenode and namenode give the information back to which datanode to be selected.
3. Then client to be write the data pack to Datanode no one else, Namenode does not write any thing datanodes.
4. Data written to datanodes in a pipeline by HDFS client.
5. and every datanodes return ack packet i.e. acknowledgement back to HDFS client (Non-posted write here all writes are asynchronous).
6. close connection with datanodes.
7. confirm about completion to namenode.

Anatomy of File Read by HDFS Client-

HDFS read Anatomy

1. User ask to HDFS client to read a file and Client move request to NameNode.
2. NameNode give block information which data node has the file.
3. and then client goes to read data from datanodes.
4. Client reading data from all datanodes in parallel(for fast accessing data in case of any failure of any datanode that is why  hadoop read data in parallel way) way not in pipeline.
5.  Reading data from every datanodes where same file exists.
6. After reading is complete then close the connection with datanode cluster.

What is HDFS?

What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters on commodity hardware.
  • Highly fault-tolerant 
    "Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS."
  • High throughput
    "HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future."
  • Suitable for application with large data sets
    "Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance."
  • Streaming access to file system data
    "Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access."
  • can be built out of commodity hardware

Areas where HDFS is Not a Good Fit Today-
  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications

HDFS Components-
  • NameNodes
    • -associated with JobTracker
    • -master of the system
    • maintain and manage the blocks which are present on the DataNodes

    The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.
  • DataNodes
    • - associated with TaskTracker
    • -salves which are deployed on each machine and provide the actual storage
    • -responsible for serving read and write request f
    • or the clients

    Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.

Hadoop Architecture

Here we will describe about Hadoop Architecture. In high level of hadoop architecture there are two main modules HDFS and MapReduce.

               Means HDFS + MapReduce = Hadoop Framework

Following pic have high level architecture of hadoop version 1 and version 2-
Hadoop Architecture

Hadoop provides a distributed filesystem(HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.
The Apache Hadoop framework is composed of the following modules :
1] Hadoop Common - contains libraries and utilities needed by other Hadoop modules

2] Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.

3] Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

4] Hadoop MapReduce - a programming model for large scale data processing.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers.

Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is now commonly considered to consist of a number of related projects as well – Apache Pig, Apache Hive, Apache HBase, and others
For the end-users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig, Apache Hive among other related projects expose higher level user interfaces like Pig latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell-scripts.

Core Components of Hadoop 1.x(HDFS & MapReduce) :
There are two primary components at the core of Apache Hadoop 1.x : the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. These open source projects, inspired by technologies created inside Google.
 Hadoop Distributed File System (HDFS)-Storage
  • Distributed across "nodes"
  • Natively redundant
  • NameNode track location

  • Split a task across processors
  • near data and assembles results
  • self healing and high bandwidth
  • clustered storage
  • JobTracker manages the TaskTracker

NameNode is admin node, is associated with Job Tracker, is master slave architecture.

JobTracker is associated with NameNode with multiple task tracker for processing of data sets.

What is Hadoop?

What is Hadoop? first of all we are understanding what is DFS(Distributed File System), Why DFS?

 DFS(Distributed File Systems)-

A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.

In above pics there are different physical machines in different location but in one logical machine have a common file system for all physical machine.

  • System that permanently store data
  • Divided into logical units (files, shards, chunks, blocks etc)
  • A file path joins file and directory names into a relative or absolute relative address to identify a file
  • Support access to files and remote servers
  • Support concurrency 
  • Support Distribution
  • Support Replication
  • NFS, GPFS, Hadoop DFS, GlusterFS, MogileFS...


why dfs

What is Hadoop?
Apache Hadoop is a framework that allow for the distributed processing for large data sets across clusters of commodity computers using simple programing model.

It is design to scale up from a single server to thousands of machines each offering local computation and storage.

Apache Hadoop is simply a framework, it is library which build using java with objective of providing capability of managing huge amount of data.

Hadoop is a java framework providing by Apache hence to manage huge amount of data by providing certain components which have capability of understanding data providing the right storage capability and providing right algorithm to do analysis to it.

Open Source Software + Commodity Hardware = IT Costs reduction

What is Hadoop used for?
  • Searching
  • Log Processing
  • Recommendation systems
  • Analytics
  • Video and Image Analysis
  • Data Retention

Company Using Hadoop:
  • Yahoo
  • Google
  • Facebook
  • Amazon
  • AOL
  • IBM
  • other mores