Introduction to MapReduce

In this hadoop tutorial we will introduce map reduce, what is map reduce. Before map reduce how to analyze the bigdata. Please look into following picture.

Introduction to MapReduce

Here bigdata split into equal size and grep it using linux command and matches with some specific characters like high temperature of any large data set of weather department. But this way have some problems as follows.

Problems in the Traditional way analysis-

1. Critical path problem (Its amount of time to take to finish the job without delaying the next milestone or actual completion date).
2. Reliability problem
3. Equal split issues
4. Single split may failure
5. Sorting problem

For overcome these all problems Hadoop introduce mapreduce in picture for analyzing such amount of data in fast.


What is MapReduce
  • MapReduce is a programming model for processing large data sets.
  • MapReduce is typically used to do distributed computing on clusters of computers.
  • The model is inspired by the map and reduce functions commonly used in functional programming.
  • Function output is dependent purely on the input data and not on any internal state. So for a given input the output is always guaranteed.
  • Stateless nature of the functions guarantees scalability.  
Key Features of MapReduce Systems
  • Provides Framework for MapReduce Execution
  • Abstract Developer from the complexity of Distributed Programming
  • Partial failure of the processing cluster is expected and tolerable.
  • Redundancy and fault-tolerance is built in, so the programmer doesn't have to worry
  • MapReduce Programming Model is Language Independent
  • Automatic Parallelization and distribution
  • Fault Tolerance
  • Enable Data Local Processing
  • Shared Nothing Architecture Model
  • Manages inter-process communication
MapReduce Explained
  • MapReduce consist of 2 Phases or Steps
    • Map
    • Reduce
The "map" step takes a key/value pair and produces an intermediate key/value pair.

The "reduce" step takes a key and a list of the key's values and outputs the final key/value pair.

map reduce

  • MapReduce Simple Steps
    • Execute map function on each input received
    • Map Function Emits Key, Value pair
    • Shuffle, Sort and Group the outputs
    • Executes Reduce function on the group
    • Emits the output per group
Map Reduce WAY-

mapreduce way

1. Very big data convert in to splits
2. Splits are processed by mapper
3. Some partitioning functionality operated on the output of mapper
4. After that data move to Reducer and produce desire output

Anatomy of a MapReduce Job Run-
  • Classic MapReduce (MapReduce 1)
    A job run in classic MapReduce is illustrated in following Figure. At the highest level, there
    are four independent entities:
    • The client, which submits the MapReduce job.
    • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
    • The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

  • YARN (MapReduce 2)
    MapReduce on YARN involves more entities than classic MapReduce. They are:
    • The client, which submits the MapReduce job.
    • The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
    • The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
    • The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
    The process of running a job is shown in following Figure and described in the following sections.