REST and SOAP Web Service Interview Questions

In this tutorial we cover the most frequently asked interview questions on web services such as SOAP and REST, and the protocols they support. REST is growing in popularity and is replacing SOAP web services, which were the earlier standard, so interviewers expect you to know what REST is and how it works.

Define Web Service?
A web service is a piece of software that is accessible over the Internet. It uses an XML messaging system and offers an easy-to-understand interface to its end users.

What are REST and RESTful web services?
REST stands for REpresentational State Transfer. It is an architectural style for writing web services that enforces a stateless client-server design, in which web services are treated as resources that are identified and accessed by their URLs, unlike SOAP web services, which are described by a WSDL document.

Web services written by applying the REST architectural style are called RESTful web services. They focus on system resources and on how the state of a resource is transferred over HTTP to clients written in different languages. In RESTful web services, HTTP methods such as GET, POST, PUT and DELETE are used to perform CRUD operations.
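
The mapping from CRUD operations to HTTP methods can be sketched with the JDK's own HTTP client types. This is only an illustration: the base URL and the JSON body are hypothetical, and the requests are built but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

// Sketch only: one request per CRUD operation against a hypothetical "books" resource.
class BookRequests {
    // Hypothetical base URL, not taken from the article
    static final String BASE = "http://example.com/api/books";

    // Create: POST a new representation to the collection URL
    static HttpRequest create(String json) {
        return HttpRequest.newBuilder(URI.create(BASE))
                .header("Content-Type", "application/json")
                .POST(BodyPublishers.ofString(json))
                .build();
    }

    // Read: GET the resource identified by its URL
    static HttpRequest read(int id) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + id)).GET().build();
    }

    // Update: PUT a replacement representation to the resource URL
    static HttpRequest update(int id, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + id))
                .header("Content-Type", "application/json")
                .PUT(BodyPublishers.ofString(json))
                .build();
    }

    // Delete: DELETE the resource identified by its URL
    static HttpRequest delete(int id) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + id)).DELETE().build();
    }

    public static void main(String[] args) {
        String json = "{\"bookName\":\"Hibernate in action\"}";
        HttpRequest[] requests = {create(json), read(1), update(1, json), delete(1)};
        for (HttpRequest request : requests) {
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

Each resource keeps a single URL, and only the HTTP method changes with the operation.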

What are the differences between RESTful web services and SOAP web services?
Though both RESTful and SOAP web services can operate across platforms, they are architecturally different. Here are some of the differences between REST and SOAP:

1) REST is simpler and easier to use than SOAP.
2) REST is an architectural style that uses HTTP directly to produce and consume web services, while SOAP is an XML-based messaging protocol that can run over HTTP and other transports.
3) REST is lightweight compared to SOAP and is the preferred choice on mobile devices and PDAs.
4) REST supports different formats such as plain text, JSON and XML, while SOAP supports only XML.
5) REST web service calls can be cached to improve performance.

What is the Restlet framework?
Restlet is a leading RESTful web framework for Java applications, used to build RESTful web services. It has two parts, the Restlet API and a Restlet implementation, much like the Servlet specification. Many implementations of the Restlet framework are available; you just need to add their JAR files to your classpath to use them. With the Restlet framework you can write both the client and the server.

What is Resource in the REST framework?
It represents a “resource” in the REST architecture. In the Restlet API it has life cycle methods such as init(), handle() and release(), and it contains the Context, Request and Response corresponding to a specific target resource. It is now deprecated in favor of the ServerResource class, which you should use instead. See the Restlet documentation for more details.

Can you use Restlet without any web container?
Yes, the Restlet framework provides a default server that can be used to handle service requests if a web container is not available.

What are the tools used for creating RESTful web services?
You can use AJAX (Asynchronous JavaScript and XML) and Direct Web Remoting (DWR) to consume web services in a web application. Both Eclipse and NetBeans also support development of RESTful services.

How to display custom error pages using RESTful web services?
To customize error pages in Restlet, extend StatusService and implement the getRepresentation(Status, Request, Response) method with your custom code, then assign an instance of your CustomStatusService to the appropriate “statusService” property.

Which HTTP methods are supported by RESTful web services?
This is another common REST interview question. In a RESTful web service each resource supports the GET, POST, PUT and DELETE HTTP methods. In the Restlet Resource class, GET is mapped to represent(), POST to acceptRepresentation(), PUT to storeRepresentation() and DELETE to removeRepresentations().

What is the difference between the top-down and bottom-up approaches to developing web services?
In the top-down approach the WSDL document is created first and the Java classes are then developed from the WSDL contract, so if the WSDL contract changes you have to change your Java classes. In the bottom-up approach you first write the Java code and then use annotations such as @WebService to specify the contract or interface, and the WSDL file is generated automatically by your build.

Define SOAP?
SOAP is an XML-based protocol for exchanging information between computers.

Define WSDL?
WSDL stands for Web Services Description Language. It is the service description layer in the web service protocol stack. The service description layer describes the public interface to a web service.

Differentiate between a SOA and a Web service?
SOA is a design and architecture approach for implementing services. SOA can be implemented using various protocols such as HTTP, HTTPS, JMS, SMTP, RMI, IIOP and RPC, whereas a web service is itself an implemented technology. In fact, one can implement SOA using web services.

Discuss various approaches to develop SOAP based web service?
We can develop a SOAP-based web service with two different approaches: contract-first and contract-last. In the first approach, the contract is defined first and the classes are then derived from the contract, while in the latter the classes are defined first and the contract is then derived from these classes.

If you have to choose one approach, then what will be your choice?
In my view, the contract-first approach is more feasible than the contract-last approach, but the choice still depends on other factors too.

What are the types of information included in the SOAP header?
The SOAP header contains information such as:
1. How the client should handle authentication and transactions.
2. How the client should process the SOAP message.
3. The encodingStyle attribute, which can also appear in the header.
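
To make the envelope structure concrete, here is a small sketch that parses a SOAP 1.1 message with the JDK's XML parser and reads the first header block. The AuthToken header and the GetBook body are hypothetical and exist only for illustration; a real service defines its own header blocks.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SoapHeaderSketch {
    // Namespace of the SOAP 1.1 envelope
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Hypothetical message: AuthToken and GetBook are invented for this example
    static final String MESSAGE =
          "<soap:Envelope xmlns:soap=\"" + SOAP_NS + "\">"
        + "<soap:Header><AuthToken soap:mustUnderstand=\"1\">abc123</AuthToken></soap:Header>"
        + "<soap:Body><GetBook><bookId>1</bookId></GetBook></soap:Body>"
        + "</soap:Envelope>";

    // Parses the XML text with a namespace-aware DOM parser
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Local name of the first element inside soap:Header, or null if there is none
    static String firstHeaderBlock(Document doc) {
        NodeList headers = doc.getElementsByTagNameNS(SOAP_NS, "Header");
        if (headers.getLength() == 0) {
            return null;
        }
        NodeList children = headers.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return child.getLocalName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(MESSAGE);
        System.out.println("root: " + doc.getDocumentElement().getLocalName());
        System.out.println("first header block: " + firstHeaderBlock(doc));
    }
}
```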

What are the disadvantages of SOAP?
Some disadvantages of SOAP:
1. It is much slower than other middleware technologies.
2. When HTTP is used to transport messages, without WS-Addressing or an ESB, the roles of the interacting parties are fixed.
3. Using HTTP for a purpose other than its intended one is problematic at the application protocol level.

What’s New in Spring Batch 3.0

The Spring Batch 3.0 release has five major themes:

  • JSR-352 Support
  • Upgrade to Support Spring 4 and Java 8
  • Promote Spring Batch Integration to Spring Batch
  • JobScope Support
  • SQLite Support

JSR-352 Support

JSR-352 is the new Java specification for batch processing. Heavily inspired by Spring Batch, this specification provides functionality similar to what Spring Batch already supports. Spring Batch 3.0 has implemented the specification and now supports the definition of batch jobs in compliance with the standard. An example of a batch job configured using JSR-352’s Job Specification Language (JSL) would look like the one below:

<?xml version="1.0" encoding="UTF-8"?>
<job id="myJob3" xmlns="" version="1.0">
    <step id="step1" >
        <batchlet ref="testBatchlet" />
    </step>
</job>

It is important to point out that Spring Batch does not just implement JSR-352. It goes much further than the spec in a number of ways:

  • Components – Spring Batch provides 17 different ItemReader implementations, 16 ItemWriter implementations, and many other components that have years of testing in production environments under their belts.
  • Scalability – JSR-352 provides scaling options for a single JVM only (partitioning and splits both via threads). Spring Batch provides multi-JVM scalability options including remote partitioning and remote chunking.
  • Spring dependency injection – While JSR-352 provides a form of “dependency injection light”, there are a number of limitations that it places on the construction of batch artifacts (must use no-arg constructors for example). Spring Batch is built on Spring and benefits from the power of the Spring Framework’s capabilities.
  • Java based configuration – While Spring’s XML based configuration options are well known, Spring and specifically Spring Batch, provide the option to configure your jobs using the type safety of java based configuration.
  • Hadoop/Big Data integration – Spring Batch is a foundational tool for interacting with Hadoop and other big data stores in the Spring ecosystem. Spring for Apache Hadoop provides a number of batch related extensions to use Spring Batch to orchestrate work on a Hadoop cluster. Spring XD builds on Spring Batch by providing both execution capabilities, but also management functionality similar to Spring Batch Admin for any environment.

Spring will continue to participate in the evolution of JSR-352 as it goes through maintenance revisions, and looks forward to further contributions to the JCP process.

Upgrade to Support Spring 4 and Java 8

With the promotion of Spring Batch Integration to be a module of the Spring Batch project, it has been updated to use Spring Integration 4. Spring Integration 4 moves the core messaging APIs to Spring core. Because of this, Spring Batch 3 will now require Spring 4 or greater.

As part of the dependency updates that have occurred with this major release, Spring Batch now supports being run on Java 8. It will still execute on Java 6 or higher as well.

Promote Spring Batch Integration to Spring Batch

Spring Batch Integration has been a sub module of the Spring Batch Admin project now for a few years. It provides functionality to better integrate the capabilities provided in Spring Integration with Spring Batch. Specific functionality includes:

  • Asynchronous ItemProcessor/ItemWriter – Executes the ItemProcessor logic on another thread, returning a Future to the ItemWriter. Once the Future returns, the result is written.
  • JobLaunchingMessageHandler/JobLaunchingMessageGateway – Provides the ability to launch jobs via Spring Messages received over channels.
  • Remote Chunking – Provides the ability to execute ItemProcessor logic remotely (across multiple JVMs) via a master/slave configuration.
  • Remote Partitioning – Provides the ability to execute full chunks remotely (read/process/write across multiple JVMs) via a master/slave configuration.
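
The asynchronous processor/writer idea from the first bullet can be illustrated with plain java.util.concurrent types. This is a concept sketch only and does not use Spring Batch Integration's classes; the process method is a made-up stand-in for ItemProcessor logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class AsyncProcessingSketch {
    // Hypothetical stand-in for ItemProcessor logic
    static String process(String item) {
        return item.toUpperCase();
    }

    static List<String> run(List<String> items) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // "Processor" side: hand each item to another thread and keep the Future
            List<Future<String>> futures = new ArrayList<>();
            for (String item : items) {
                Callable<String> task = () -> process(item);
                futures.add(pool.submit(task));
            }
            // "Writer" side: wait for each Future, then write the result
            List<String> written = new ArrayList<>();
            for (Future<String> future : futures) {
                written.add(future.get());
            }
            return written;
        } catch (Exception e) {
            throw new IllegalStateException("processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c")));
    }
}
```

Because the writer unwraps the Futures in submission order, the output order matches the input order even though processing runs on other threads.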

JobScope Support

The Spring scope “step” used in Spring Batch has had a pivotal role in batch applications, providing late binding functionality for a long time now. With the 3.0 release Spring Batch now supports a “job” scope. This new scope allows for the delayed construction of objects until a Job is actually launched as well as providing a facility for new instances for each execution of a job.

SQLite Support

SQLite has been added as a newly supported database option for the JobRepository by adding job repository ddl for SQLite. This provides a useful, file based, data store for testing purposes.


Hadoop Tutorial

In this Hadoop tutorial we describe what Hadoop is, why to use it, the Hadoop architecture, Big Data, MapReduce and some of the ecosystem projects.

Applications such as Facebook, Twitter, LinkedIn, Google and Yahoo hold huge amounts of data, so they need a framework that can handle data at that scale. These companies need to process that data for purposes such as: 1. data analysis, 2. proper handling of the data, and 3. converting the data into an understandable custom format.

Apache Hadoop’s MapReduce and HDFS components originally derived respectively from Google’s MapReduce and Google File System (GFS) papers.

In 2003-2004 Google introduced two new techniques for its search engine: 1. a file system, GFS (Google File System), and 2. a framework for data analysis called MapReduce, to make searching and analyzing data fast. Google published these techniques as white papers.

In 2005-2006 Yahoo took these techniques from Google and implemented them in a single framework named Hadoop. Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant.

There is a curious story behind the name, and no one knows it better than Doug Cutting, chief architect of Cloudera. When he was creating the open source software that supports the processing of large data sets, Cutting knew the project would need a good name. Fortunately, he had one up his sleeve, thanks to his son. Cutting’s son, then 2, was just beginning to talk and called his beloved stuffed yellow elephant “Hadoop” (with the stress on the first syllable). The son, who is now 12, is frustrated with this. He is always saying, “Why don’t you say my name, and why don’t I get royalties? I deserve to be famous for this.” 🙂

Big Data Hadoop 

In this Hadoop tutorial we will cover the following modules.

Module 1-

  1. Understanding Big Data
  2.  Introduction to Hadoop
  3.  Hadoop Architecture

Module 2-

  1. Hadoop Distributed File System(HDFS)
  2. HDFS Architecture
  3. JobTracker and TaskTracker Architecture 
  4. Hadoop Configuration 
  5. Hadoop Environment Setup(Hadoop 1.x)
  6. Hadoop Environment Setup(Hadoop 2.x)
  7. Hadoop installation on ubuntu
  8. Data Loading Technique

Module 3-

  1. Introduction to MapReduce
  2. MapReduce Flow Chart with Sample Example 
  3. MapReduce Programming Hello World or WordCount Program 

Module 4-

  1. Advanced MapReduce
  2. YARN
  3. YARN Programming

Module 5-

  1. Introduction to Sqoop
  2. Programming with Sqoop

Module 6-

  1. Analytic Using HIVE
  2. Understanding HIVE QL

Module 7-

  1. NoSQL Databases
  2. Understanding HBASE
  3. Zookeeper

Module 8-

  1. Real World Data sets and analysis
  2. Project Discussion

MapReduce Programming Hello World Job

In this Hadoop and MapReduce tutorial we will see how to create a hello world job and which steps are involved in creating a MapReduce program. The steps are as follows.

Step1- Creating a file
$ cat > file.txt
hi how are you
how is your job
how is your family
how is your brother
how is your sister
what is the time now
what is the strength of hadoop

Step2- Loading file.txt from the local file system to HDFS
$ hadoop fs -put file.txt file

Step3- Writing programs


Step4- Compiling all the above .java files (the -d option places the class files in their package directories)

$ javac -classpath $HADOOP_HOME/hadoop-core.jar -d . *.java

Step5- Creating the jar file

$ jar cvf job.jar com

Step6- Running the above job.jar on file (which is in HDFS)

$ hadoop jar job.jar com.doj.hadoop.driver.WordCount file TestOutput

Let's start with the actual code for the steps above.

Hello World Job -> WordCountJob

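1. DriverCode

package com.doj.hadoop.driver;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author Dinesh Rajput
 */
public class WordCount extends Configured implements Tool {
 public int run(String[] args) throws Exception {
   if (args.length != 2) {
       System.err.printf("Usage: %s [generic options] <input> <output>\n",
           getClass().getSimpleName());
       return -1;
   }
   JobConf conf = new JobConf(getConf(), WordCount.class);
   conf.setJobName("Word Count");
   FileInputFormat.addInputPath(conf, new Path(args[0]));
   FileOutputFormat.setOutputPath(conf, new Path(args[1]));
   conf.setInputFormat(TextInputFormat.class);
   conf.setOutputFormat(TextOutputFormat.class);
   conf.setMapperClass(WordMapper.class);
   conf.setReducerClass(WordReducer.class);
   conf.setOutputKeyClass(Text.class);
   conf.setOutputValueClass(IntWritable.class);
   // The old-API JobConf has no waitForCompletion(); runJob() blocks until the job ends
   JobClient.runJob(conf);
   return 0;
 }

 public static void main(String[] args) throws Exception {
   int exitCode = ToolRunner.run(new WordCount(), args);
   System.exit(exitCode);
 }
}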

2. MapperCode

package com.doj.hadoop.driver;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * @author Dinesh Rajput
 */
public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
   throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
   // Emit (word, 1) for every token in the line
   word.set(tokenizer.nextToken());
   output.collect(word, one);
  }
 }
}

3. ReducerCode

package com.doj.hadoop.driver;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * @author Dinesh Rajput
 */
public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
 public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
   Reporter reporter) throws IOException {
  int sum = 0;
  while (values.hasNext()) {
   // Add up the 1s emitted by the mapper for this word
   sum += values.next().get();
  }
  output.collect(key, new IntWritable(sum));
 }
}

MapReduce Flow Chart Sample Example

In this MapReduce tutorial we explain a sample MapReduce example along with its flow chart, to show how MapReduce works for a job.


  • We have a large collection of text documents in a folder. (To give a feel for the size: 1000 documents, each with an average of 1 million words.)
  • What we need to calculate:
    • The frequency of each distinct word in the documents.
  • How would you solve this using a simple Java program?
  • How many lines of code would you write?
  • How long would the program take to execute?

A MapReduce program overcomes the problems listed above in a few lines of code. Now let us look at the map and reduce functions below to understand how they work on a large dataset.


  • Map functions operate on every key, value pair of the data and apply the transformation logic provided in the map function.
  • A map function always emits a key, value pair as output:
       Map(Key1, Value1) –> List(Key2, Value2)
  • The map function transformation is similar to a row-level function in standard SQL.
  • For each file
    • The map function is:
      • Read each line from the input file
        • Tokenize and get each word
          • Emit the word, 1 for every word found

The emitted word, 1 pairs form the list that is output from the mapper.

So who ensures that the file is distributed and that each line of the file is passed to a map function? The Hadoop framework takes care of this, so there is no need to worry about the distributed system.


  • Reduce functions take the list of values for every key and transform the data based on the (aggregation) logic provided in the reduce function.
  • A reduce function has the form:
        Reduce(Key2, List(Value2)) –> List(Key3, Value3)
  • Reduce functions are similar to aggregate functions in standard SQL.

The List(key, value) output from the mapper is shuffled and sorted by key, then grouped by key to create the list of values for each key.

  • The reduce function is:
    • Read each key (word) and the list of values (1,1,1..) associated with it.
      • For each key, add up the list of values to calculate the sum
        • Emit the word, sum for every word found
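
The map, shuffle/sort/group and reduce steps described above can be simulated in memory with plain Java collections. This is only a single-process sketch with hypothetical class and method names; it uses no Hadoop classes and shows the data flow, not how the framework distributes the work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSimulation {
    // Map: one input line -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Shuffle, sort and group: (word, 1) pairs -> word -> [1, 1, ...], sorted by word
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: (word, [1, 1, ...]) -> sum of the values
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "hi how are you", "how is your job", "how is your family",
            "how is your brother", "how is your sister",
            "what is the time now", "what is the strength of hadoop");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```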

So who is ensuring the shuffle, sort, group by etc?


private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        // Emit (word, 1) for every token in the line
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}



Suppose we have a file of about 200 MB with the following content.

_______File(200 MB)____________
hi how are you
how is your job (64 MB) 1-Split
how is your family
how is your brother (64 MB) 2-Split
how is your sister
what is the time now (64 MB) 3-Split
what is the strength of hadoop (8 MB) 4-Split

The file above is divided into 4 splits: three splits of 64 MB each and a fourth split of 8 MB.
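
The split arithmetic can be checked with a few lines of Java. This is a simplified sketch that only divides the file size by a fixed split size; the class and method names are hypothetical, and the real input formats apply additional rules when computing splits.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSizes {
    // Cuts a file of fileSizeMb into pieces of at most splitSizeMb each
    static List<Long> splits(long fileSizeMb, long splitSizeMb) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            long size = Math.min(splitSizeMb, remaining);
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 200 MB file with 64 MB splits, as in the example above
        System.out.println(splits(200, 64));
    }
}
```

For the 200 MB file this gives 64 + 64 + 64 + 8 = 200, so one map task would be started for each of the four splits.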

Input File Formats:
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat


Let's look at the following figure to understand the MapReduce process.


Saving Objects using Hibernate APIs

Introduction to Hibernate

  • An ORM (object-relational mapping) tool.
  • Used in the data layer of an application.
  • Implements JPA.
  • JPA (Java Persistence API): a set of standards for persisting Java objects.

The problem with JDBC

  • Mapping member variables to columns.
  • Mapping relationships.
  • Handling data types. For example, most databases do not have a Boolean data type.
  • Managing the state of objects.

Saving an object with Hibernate.

  1. JDBC database configuration >> Hibernate configuration file (.cfg.xml).
  2. The model object >> Annotations or Hibernate mapping files (.hbm.xml).
  3. Service method to create the model object >> Use the Hibernate API.
  4. Database design >> Not needed.
  5. DAO method to save the object using SQL queries >> Not needed.

Now I am going to create a simple POJO class with some properties and their getters and setters.

package com.bloggers.model;

import java.util.Date;

public class Book
{
 /* for primary key in database table */
 private int bookId;
 private String bookName;
 private Date date;

 public int getBookId()
 {
  return bookId;
 }
 public void setBookId(int bookId)
 {
  this.bookId = bookId;
 }
 public String getBookName()
 {
  return bookName;
 }
 public void setBookName(String bookName)
 {
  this.bookName = bookName;
 }
 public Date getDate()
 {
  return date;
 }
 public void setDate(Date date)
 {
  this.date = date;
 }
}

A few points to note when creating a model object or JavaBean class:

  • The bookId property holds a unique identifier value for a book object. Provide an identifier property if you want to use the full feature set of Hibernate.
  • Applications need to distinguish objects by identifier.
  • Hibernate can access public, private, and protected accessor methods, as well as public, private and protected fields directly.
  • A no-argument constructor is a requirement for all persistent classes, because Hibernate creates objects using Java reflection.
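
The role of the no-argument constructor and of direct field access can be seen in a small reflection sketch. This is not Hibernate's code; it only shows the general mechanism, using a simplified stand-in for the Book class and a hypothetical instantiate helper.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;

class ReflectionSketch {
    // Simplified stand-in for the Book model class
    static class Book {
        private int bookId;
        private String bookName;

        Book() {
            // no-argument constructor, found and invoked through reflection
        }

        public int getBookId() {
            return bookId;
        }

        public String getBookName() {
            return bookName;
        }
    }

    // Creates a Book without calling "new Book()" directly and fills its private fields
    static Book instantiate(int id, String name) {
        try {
            Constructor<Book> constructor = Book.class.getDeclaredConstructor();
            constructor.setAccessible(true);
            Book book = constructor.newInstance();

            Field idField = Book.class.getDeclaredField("bookId");
            idField.setAccessible(true);
            idField.setInt(book, id);

            Field nameField = Book.class.getDeclaredField("bookName");
            nameField.setAccessible(true);
            nameField.set(book, name);
            return book;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not instantiate Book", e);
        }
    }

    public static void main(String[] args) {
        Book book = instantiate(1, "Hibernate in action");
        System.out.println(book.getBookId() + " " + book.getBookName());
    }
}
```

Without a no-argument constructor, getDeclaredConstructor() would fail with NoSuchMethodException, which is why persistent classes need one.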

Mapping file: hbm file (Hibernate Mapping file)
The second step is to create an hbm file with the extension .hbm.xml. HBM stands for Hibernate Mapping. The mapping file tells Hibernate which table and columns the object is mapped to.
All persistent entity classes need a mapping to a table in the SQL database.
Basic structure of an hbm file

<hibernate-mapping>
 [...]
</hibernate-mapping>


  • Hibernate does not load the DTD file from the web; it first looks it up on the classpath of the application.
  • The DTD file is included in hibernate-core.jar.

Mapping file of class(Book.hbm.xml)
I will complete this step by step and try to explain all the elements.
1. Add a class element between the two hibernate-mapping tags.

<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
    "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
    "http://www.hibernate.org/dtd/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
   <class name="com.bloggers.model.Book">
   </class>
</hibernate-mapping>

name="com.bloggers.model.Book": specify the fully qualified class name in the name attribute of the class element.

2. Next, we need to tell Hibernate about the remaining entity class properties. By default, no properties of the class are considered persistent.

<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
    "-//Hibernate/Hibernate Mapping DTD 3.0//EN"
    "http://www.hibernate.org/dtd/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="com.bloggers.model.Book">
     <id name="bookId"/>
     <property name="date" column="PUBLISH_DATE"/>
    </class>
</hibernate-mapping>

The name="bookId" attribute of the id element tells Hibernate which getter and setter methods to use. In our case Hibernate will look for the getBookId() and setBookId() methods.

Question : Why does the date property mapping include the column attribute, but the bookId mapping does not?
Answer : Without the column attribute, Hibernate by default uses the property name as the column name. This works for bookId, however, date is a reserved keyword in most databases so you will need to map it to a different name.

Create a Hibernate Configuration File (hibernate.cfg.xml)
Hibernate requires a configuration file to connect to the database. By default the file name is hibernate.cfg.xml.

We can configure Hibernate in 3 different ways:

  • a simple hibernate.properties file
  • a more sophisticated hibernate.cfg.xml file
  • a complete programmatic setup

I am going to use the second option, the hibernate.cfg.xml file.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE hibernate-configuration PUBLIC
  "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
  "http://www.hibernate.org/dtd/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
 <session-factory>
  <!-- Database connection settings -->
  <property name="connection.driver_class">com.mysql.jdbc.Driver</property>
  <property name="connection.url">jdbc:mysql://localhost:3306/hibernatedb</property>
  <property name="connection.username">root</property>
  <property name="connection.password">root</property>
  <!-- JDBC connection pool (use the built-in) -->
  <property name="connection.pool_size">1</property>
  <!-- SQL dialect -->
  <property name="dialect">org.hibernate.dialect.MySQL5Dialect</property>
  <!-- Disable the second-level cache  -->
  <property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
  <!-- Echo all executed SQL to stdout -->
  <property name="show_sql">true</property>
  <!-- Drop and re-create the database schema on startup -->
  <property name="hbm2ddl.auto">create</property>
  <mapping resource="com/bloggers/model/Book.hbm.xml"/>
 </session-factory>
</hibernate-configuration>

  1. SessionFactory is a global factory responsible for a particular database. If you have several databases, for easier startup you should use several configurations in several configuration files.
  2. The first four property elements contain the necessary configuration for the JDBC connection.
  3. The dialect property element specifies the particular SQL variant Hibernate generates.
  4. In most cases, Hibernate is able to properly determine which dialect to use.
  5. The hbm2ddl.auto option turns on automatic generation of the database schema directly into the database. This can also be turned off by removing the configuration option, or redirected to a file with the help of the SchemaExport Ant task.
  6. Finally, add the mapping file(s) for persistent classes to the configuration.

Utility class for getting the SessionFactory.

package com.bloggers.util;

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class HibernateUtil
{
 private static final SessionFactory sessionFactory = buildSessionFactory();

 private static SessionFactory buildSessionFactory()
 {
  try
  {
   // Create the SessionFactory from hibernate.cfg.xml
   return new Configuration().configure().buildSessionFactory();
  }
  catch (Throwable ex)
  {
   // Make sure you log the exception, as it might be swallowed
   System.err.println("Initial SessionFactory creation failed." + ex);
   throw new ExceptionInInitializerError(ex);
  }
 }
 public static SessionFactory getSessionFactory()
 {
  return sessionFactory;
 }
}
Run the example
Create a simple class for running the example.

package com.bloggers;

import java.util.Date;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import com.bloggers.model.Book;
import com.bloggers.util.HibernateUtil;

public class HibernateTest
{
 public static void main(String[] args)
 {
  HibernateTest hibernateTest = new HibernateTest();
  /* creating a new book object */
  Book book = new Book();
  book.setDate(new Date());
  book.setBookName("Hibernate in action");
  /* saving the book object */
  hibernateTest.saveBook(book);
 }

 public void saveBook(Book book)
 {
  SessionFactory sessionFactory = HibernateUtil.getSessionFactory();
  Session session = sessionFactory.openSession();
  session.beginTransaction();
  /* saving the book object here */
  session.save(book);
  session.getTransaction().commit();
  session.close();
 }
}

Before running the HibernateTest class, make sure you have created a schema named hibernatedb in the MySQL database.

Question : Why is the schema named hibernatedb?
Answer : Because in hibernate.cfg.xml

<property name="connection.url">jdbc:mysql://localhost:3306/hibernatedb</property>

we have specified hibernatedb. You are free to change it to whatever name you want. Create the schema using the mysql> create schema hibernatedb; command.

Question : Did you notice that the table does not contain a bookName column?
Answer : The book table has no bookName column because we did not map the bookName property in our hbm file. Map the property and run the main method again, and you will observe that:

  • Hibernate automatically adds a new column to the book table.
  • When you run the select query on the database again, the result also contains bookName.

Introduction to MapReduce

In this Hadoop tutorial we introduce MapReduce: what it is, and how big data was analyzed before it. Please look at the following picture.

Here the big data is split into equal-sized pieces, and each piece is searched with the Linux grep command for specific patterns, for example high temperatures in a large data set from a weather department. But this approach has some problems, as follows.

Problems in the traditional way of analysis-

1. Critical path problem (the amount of time taken to finish the job without delaying the next milestone or the actual completion date).
2. Reliability problem
3. Equal split issues
4. A single split may fail
5. Sorting problem

To overcome all these problems, Hadoop introduced MapReduce for analyzing such amounts of data quickly.


What is MapReduce

  • MapReduce is a programming model for processing large data sets.
  • MapReduce is typically used to do distributed computing on clusters of computers.
  • The model is inspired by the map and reduce functions commonly used in functional programming.
  • Function output is dependent purely on the input data and not on any internal state. So for a given input the output is always guaranteed.
  • Stateless nature of the functions guarantees scalability.  
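
The functional-programming origin of the model can be shown with Java streams, whose map and reduce operations play the same roles on an in-memory list. This is a minimal sketch with hypothetical names; it runs in a single JVM and is unrelated to the Hadoop API.

```java
import java.util.List;

class FunctionalMapReduce {
    // map: each word -> its length; reduce: add the lengths together
    static int totalLength(List<String> words) {
        return words.stream()
                .map(String::length)        // stateless transformation per element
                .reduce(0, Integer::sum);   // aggregation over all mapped values
    }

    public static void main(String[] args) {
        System.out.println(totalLength(List.of("hi", "how", "are", "you")));
    }
}
```

Because neither function keeps internal state, the same input always yields the same output, which is the property that lets a framework run the map step on many machines in parallel.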

Key Features of MapReduce Systems

  • Provides a framework for MapReduce execution
  • Abstracts the developer from the complexity of distributed programming
  • Partial failure of the processing cluster is expected and tolerable.
  • Redundancy and fault tolerance are built in, so the programmer doesn’t have to worry
  • MapReduce Programming Model is Language Independent
  • Automatic Parallelization and distribution
  • Fault Tolerance
  • Enable Data Local Processing
  • Shared Nothing Architecture Model
  • Manages inter-process communication

MapReduce Explained

  • MapReduce consists of 2 phases or steps
    • Map
    • Reduce

The “map” step takes a key/value pair and produces an intermediate key/value pair.

The “reduce” step takes a key and a list of the key’s values and outputs the final key/value pair.

  • MapReduce Simple Steps
    • Execute map function on each input received
    • Map Function Emits Key, Value pair
    • Shuffle, Sort and Group the outputs
    • Executes Reduce function on the group
    • Emits the output per group

Map Reduce WAY-

1. Very big data is converted into splits
2. Splits are processed by mappers
3. A partitioning function is applied to the output of the mappers
4. After that the data moves to the reducers, which produce the desired output
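
The partitioning step in point 3 decides which reducer receives each key. A common choice is hash partitioning, sketched below; the formula mirrors the one used by Hadoop's default HashPartitioner, but the class here is hypothetical and uses String.hashCode() rather than Hadoop's key types.

```java
import java.util.List;

class PartitionSketch {
    // Maps a key to a reducer index in the range [0, numReducers)
    static int partition(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 3; // assumed number of reduce tasks
        for (String word : List.of("hi", "how", "is", "your", "hadoop")) {
            System.out.println(word + " -> reducer " + partition(word, numReducers));
        }
    }
}
```

Because the result depends only on the key, every occurrence of the same word is sent to the same reducer, which is what makes the per-key sum correct.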

Anatomy of a MapReduce Job Run-

  • Classic MapReduce (MapReduce 1)
    A job run in classic MapReduce is illustrated in following Figure. At the highest level, there
    are four independent entities:
    • The client, which submits the MapReduce job.
    • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
    • The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

  • YARN (MapReduce 2)
    MapReduce on YARN involves more entities than classic MapReduce. They are:
    • The client, which submits the MapReduce job.
    • The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
    • The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
    • The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
    The process of running a job is shown in the following figure and described in the following sections.


Hadoop Configuration

Hadoop has to be configured in the following layers.

  • HDFS Layer
    • NameNode: the master
    • DataNode: stores the data (the actual storage)
  • MapReduce Layer
    • JobTracker
    • TaskTracker
  • Secondary NameNode: keeps a checkpoint copy of the NameNode metadata. It does not act as an alternate NameNode; it only stores the NameNode metadata.

Types of Hadoop Configurations

  • Standalone Mode
    • All processes run as a single process
    • Preferred in development
  • Pseudo Cluster (Pseudo-Distributed) Mode
    • All processes run as separate processes, but on a single machine
    • Simulates a cluster
  • Fully Cluster (Fully Distributed) Mode
    • All processes run on different machines
    • Preferred in production

Which files are important to configure?

  • (set java environment and logging file)
  • core-site.xml (configures the NameNode)
  • hdfs-site.xml (configures the DataNode)
  • mapred-site.xml (configures MapReduce: the JobTracker and TaskTrackers)
  • yarn-site.xml
  • masters (lists the hosts that run the Secondary NameNode)
  • slaves (lists the slave hosts that run the DataNode and TaskTracker daemons)

Hadoop Installation Tutorial (Hadoop 2.x)

This chapter explains how to set up Hadoop to run on a cluster of machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work they need to run on multiple nodes.

Hadoop 2 is the new version of Hadoop. It adds the YARN resource manager to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of the MapReduce system originally designed and implemented by Google for processing and generating large data sets. HDFS is Hadoop’s underlying data persistence layer, loosely modeled after the Google File System (GFS). Many cloud computing services, such as Amazon EC2, provide MapReduce functions. Although MapReduce has its limitations, it is an important framework for processing large data sets.

This tutorial shows how to set up a Hadoop 2.x (YARN) environment on a cluster. One node runs as the NameNode and the ResourceManager, and the other nodes run as NodeManagers and DataNodes (slaves).

Enable password-less SSH login to the slaves for the “hadoop” user

For convenience, make sure the “hadoop” user on the NameNode and ResourceManager can ssh to the slaves without a password, so that we do not need to type the password every time.

Details about password-less SSH login can be found in Enabling Password-less ssh Login.

Install software needed by Hadoop

Besides Hadoop itself, the only software needed is Java (we use the JDK here).

Install Java JDK on UBUNTU

The Oracle JDK can be downloaded from the JDK webpage. The JDK needs to be installed (in practice, just copy the JDK directory) on all nodes of the Hadoop cluster. On Ubuntu, OpenJDK can be installed from the package manager:

 user@ubuntuvm:~$ sudo apt-get install openjdk-7-jdk
As an example in this tutorial, the JDK is installed into


You may need to create a soft link at /usr/java/default that points to the actual location where you installed the JDK.

Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$JAVA_HOME/bin:$PATH

Hadoop 2.x.x Configuration

Step1. The Hadoop software can be downloaded from the Hadoop website. In this tutorial, we use Hadoop 2.5.2.

You can unpack the tarball to a directory. In this example, we unpack it to


which is a directory under the hadoop Linux user’s home directory.

The Hadoop directory needs to be copied to all nodes, but only after the configuration below is complete.

Step2. Configure environment variables for the “hadoop” user

We assume the “hadoop” user uses bash as its shell.

Add these lines at the bottom of ~/.bashrc on all nodes:
Open a terminal, run sudo gedit .bashrc, press Enter, and type the password (“password”) when prompted.

Step3. Add the following lines to the .bashrc file
export HADOOP_HOME=$HOME/hadoop-2.5.2
export HADOOP_CONF_DIR=$HOME/hadoop-2.5.2/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.5.2
export HADOOP_COMMON_HOME=$HOME/hadoop-2.5.2
export HADOOP_HDFS_HOME=$HOME/hadoop-2.5.2
export YARN_HOME=$HOME/hadoop-2.5.2

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Step4. Configure the important Hadoop-2.5.2 files

The configuration files for Hadoop are under /home/user/hadoop-2.5.2/etc/hadoop in our installation. In each .xml file, the content is added between <configuration> and </configuration>.

i- core-site.xml

Here the NameNode runs on localhost.


ii- yarn-site.xml

The YARN ResourceManager runs on localhost and supports MapReduce shuffle.


<!-- Site specific YARN configuration properties -->

iii- hdfs-site.xml

The configuration here is optional. Add the following settings if you need them. The descriptions contain the purpose of each configuration.

1. The first property, dfs.replication, sets the number of replicas kept for each block.
2. The second property gives the NameNode directory. You have to create this directory at /home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode
3. The third property gives the DataNode directory. You have to create this directory at /home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode
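
As a hedged illustration of what such a file contains, the Python sketch below builds a <configuration> element from name/value pairs and prints it. The helper build_configuration is hypothetical. The property names dfs.namenode.name.dir and dfs.datanode.data.dir are the usual Hadoop 2 names but are an assumption here, since the original snippet is not shown, and the replication value of 1 is only a typical single-machine choice.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Return the text of a Hadoop-style <configuration> element."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


hdfs_site = build_configuration({
    "dfs.replication": 1,  # assumed value for a single-machine setup
    "dfs.namenode.name.dir": "/home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode",
    "dfs.datanode.data.dir": "/home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode",
})
print(hdfs_site)
```

The printed element can be pasted into hdfs-site.xml in place of the empty <configuration> block; the script itself does not write any file.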



iv- mapred-site.xml

First copy mapred-site.xml.template to mapred-site.xml, then add the following content.

At the Ubuntu terminal: $ cp mapred-site.xml.template mapred-site.xml



v- Here we set java environment variable


Step5. After configuring all five important files, you have to format the NameNode with the following commands at the Ubuntu terminal.
user@ubuntuvm:~$ cd hadoop-2.5.2/bin/
user@ubuntuvm:~/hadoop-2.5.2/bin$ ./hadoop namenode -format


Step6. After formatting the NameNode, we have to start all the daemon services of hadoop-2.5.2.
First move to /home/user/hadoop-2.5.2/sbin

 user@ubuntuvm:~$ cd hadoop-2.5.2/sbin/

i- datanode daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./ start datanode

ii- namenode daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./ start namenode

iii- resourcemanager daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./ start resourcemanager

iv- nodemanager daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./ start nodemanager

v- jobhistoryserver daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./ start historyserver

Step7. To verify that all the daemon services are running, run the following command at the terminal

 user@ubuntuvm:~/hadoop-2.5.2/sbin$ jps
3869 DataNode
4067 ResourceManager
4318 NodeManager
4449 JobHistoryServer
4934 NameNode
5389 Jps
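
When several daemons are involved, it can help to check the jps listing automatically. The Python sketch below parses text in the two-column format shown above and reports which of the five expected daemons are missing. The function names are hypothetical, and the sample text is simply the listing from this step.

```python
EXPECTED = {"DataNode", "ResourceManager", "NodeManager", "JobHistoryServer", "NameNode"}


def running_daemons(jps_output):
    """Collect the process names from 'pid name' lines of jps output."""
    daemons = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            daemons.add(parts[1])
    return daemons


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemons that do not appear in the jps output."""
    return sorted(expected - running_daemons(jps_output))


sample = """3869 DataNode
4067 ResourceManager
4318 NodeManager
4449 JobHistoryServer
4934 NameNode
5389 Jps"""

print(missing_daemons(sample))  # []
```

On a live machine the text could come from running jps through subprocess; an empty list means every expected daemon was found.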


If some of the services have not started, check the logs in the logs folder of hadoop-2.5.2 at the following location

 Here you can check each file for log messages

Step8. To verify all the services in a browser, open Firefox in the virtual machine and go to the following URL…


Two more files are also worth checking
1. slaves

Delete localhost and add the names of all the slave nodes, one per line. For example:


2. master

Hadoop Installation Tutorial (Hadoop 1.x)

Software Required
Setup Virtual Machine

Step1. Download the Ubuntu image (Ubuntu1404) and extract it.

Step2. Suppose the directory after extraction is
“D:\personal data\hadoop\Ubuntu1404”

Step3. Search Google for VMWARE PLAYER, select DESKTOP & END USER in the results, then download and install it.

Step4. After installing the virtual machine software, go to “D:\personal data\hadoop\Ubuntu1404”


Step5. Double-click “Ubuntu.vmx” to start the virtual machine, then open it as follows.


Step6. Inside the virtual machine, download the Hadoop release hadoop-1.2.1 (61M) and extract “hadoop-1.2.1.tar.gz”

Step7. In this tutorial Hadoop is installed into the following location

Step8. Install Java in Linux
sudo apt-get install openjdk-7-jdk

Step9. In this tutorial the JDK is installed into the following location

Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of the MapReduce system originally designed and implemented by Google for processing and generating large data sets. HDFS is Hadoop’s underlying data persistence layer, loosely modelled after the Google File System (GFS). Hadoop has seen active development and increasing adoption. Many cloud computing services, such as Amazon EC2, provide MapReduce functions, and the research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science. Although MapReduce has its limitations, it is an important framework for processing large data sets.

This tutorial shows how to set up a Hadoop environment on a cluster. One node runs as the NameNode, one node runs as the JobTracker, and many nodes run as TaskTrackers (slaves).

Step10. Enable password-less SSH login to the slaves for the “hadoop” user
For convenience, make sure the “hadoop” user on the NameNode and JobTracker can ssh to the slaves without a password, so that we do not need to type the password every time.

Details about password-less SSH login can be found in Enabling Password-less ssh Login.

Step11. Hadoop Configuration
Configure environment variables of the “hadoop” user
Open a terminal and set the environment variables as follows

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export JAVA_INSTALL=/home/user/hadoop-1.2.1
and assign the Hadoop path as follows
export HADOOP_COMMON_HOME="/home/hadoop/hadoop/"

The HADOOP_COMMON_HOME environment variable is used by Hadoop’s utility scripts and must be set; otherwise the scripts may report the error message “Hadoop common not found”.

The second line adds Hadoop’s bin directory to the PATH so that we can run Hadoop’s commands directly without specifying the full path.

Step12. Configure Important files for Hadoop
A. /home/user/hadoop-1.2.1/conf/
Add or change these lines to specify the JAVA_HOME and directory to store the logs:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export HADOOP_LOG_DIR=/home/user/hadoop-1.2.1/logs


B. /home/user/hadoop-1.2.1/conf/core-site.xml (configuring NameNode)
Here the NameNode runs on or localhost

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


C. /home/user/hadoop-1.2.1/conf/hdfs-site.xml (Configuring DataNode)
dfs.replication is the number of replicas of each block. is the path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. is comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->





D. /home/user/hadoop-1.2.1/conf/mapred-site.xml (Configuring JobTracker)
Here the JobTracker runs on or localhost

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->






mapreduce.jobtracker.address is the host or IP and port of the JobTracker. mapreduce.jobtracker.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapreduce.cluster.local.dir is a comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

E. /home/user/hadoop-1.2.1/conf/slaves

Delete localhost and add the names of all the TaskTrackers, one per line. For example:

F. Start Hadoop
We need to start both the HDFS and MapReduce to start Hadoop.

1. Format a new HDFS
On NameNode
$ hadoop namenode -format
Remember to delete HDFS’s local files on all nodes before re-formatting it:
$ rm -rf /home/hadoop/data /tmp/hadoop-hadoop

2. Start HDFS
On NameNode :


3. Check the HDFS status:
On NameNode :

$ hadoop dfsadmin -report
The report may list fewer nodes than we actually have, because DataNodes can take a moment to register. If so, run the command again.

4. Start mapred:
On JobTracker:


5. Check job status:

$ hadoop job -list

Shut down Hadoop cluster

We can stop Hadoop when we no longer need it.

Stop HDFS on NameNode:


Stop JobTracker and TaskTrackers on JobTracker: