
Introduction to MapReduce

In this Hadoop tutorial we will introduce MapReduce: what it is, and how big data was analyzed before MapReduce existed. Please look at the following picture.

Here the big data is split into equal-sized chunks, and each chunk is searched with a Linux command such as grep for a specific pattern, for example the highest temperature in a large weather data set. But this approach has some problems, as follows.

Problems with the traditional way of analysis-

1. Critical path problem (the amount of time it takes to finish the job without delaying the next milestone or the actual completion date)
2. Reliability problem
3. Equal split issues
4. Failure of a single split
5. Sorting problem

To overcome all these problems, Hadoop introduced MapReduce, which can analyze such large amounts of data quickly.

What is MapReduce

  • MapReduce is a programming model for processing large data sets.
  • MapReduce is typically used to do distributed computing on clusters of computers.
  • The model is inspired by the map and reduce functions commonly used in functional programming.
  • A function's output depends purely on its input data and not on any internal state, so for a given input the output is always the same (see the small Java sketch below).
  • The stateless nature of the functions guarantees scalability.
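
As a rough illustration of that functional-programming inspiration, here is a minimal sketch using plain Java streams, with no Hadoop involved; the class name and sample data are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch of the map/reduce idea using plain Java streams.
public class MapReduceInspiration {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hot day", "cold night", "hot night");

        // "map": a stateless transformation applied to each record independently
        int totalWords = lines.stream()
                .map(line -> line.split(" ").length)
                // "reduce": fold the intermediate values into a single result
                .reduce(0, Integer::sum);

        // Neither step keeps internal state, so the same input always gives the same output.
        System.out.println("Total words: " + totalWords); // Total words: 6
    }
}
```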

Key Features of MapReduce Systems

  • Provides Framework for MapReduce Execution
  • Abstract Developer from the complexity of Distributed Programming
  • Partial failure of the processing cluster is expected and tolerable.
  • Redundancy and fault-tolerance is built in, so the programmer doesn’t have to worry
  • MapReduce Programming Model is Language Independent
  • Automatic Parallelization and distribution
  • Fault Tolerance
  • Enable Data Local Processing
  • Shared Nothing Architecture Model
  • Manages inter-process communication

MapReduce Explained

  • MapReduce consists of 2 phases, or steps:
    • Map
    • Reduce

The “map” step takes a key/value pair and produces an intermediate key/value pair.

The “reduce” step takes a key and a list of the key’s values and outputs the final key/value pair.
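
To make the two phases concrete, here is a minimal word-count sketch against the Hadoop Java API (org.apache.hadoop.mapreduce). The class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself, and in a real project they would usually live in separate files.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: input key = byte offset of the line, input value = the line itself.
// Emits an intermediate (word, 1) pair for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate key/value pair
        }
    }
}

// Reduce step: receives a word and the list of its values after shuffle/sort,
// and emits the final (word, total) pair.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum)); // final key/value pair
    }
}
```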

 

  • MapReduce Simple Steps (wired together in the driver sketch after this list)
    • Execute the map function on each input received
    • The map function emits a key/value pair
    • Shuffle, sort and group the outputs
    • Execute the reduce function on each group
    • Emit the output per group
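
The driver below shows one way to wire these steps together using the standard Job API. It assumes the hypothetical WordCountMapper and WordCountReducer classes from the sketch above, and it takes the input and output paths from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal driver, assuming the WordCountMapper/WordCountReducer sketches above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);   // execute map on each input, emit key/value pairs
        job.setReducerClass(WordCountReducer.class); // execute reduce on each group, emit output per group
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Shuffle, sort and group happen inside the framework between the two phases.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1); // submit the job and wait
    }
}
```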

The MapReduce way-

1. Very big data is converted into splits
2. The splits are processed by the mapper
3. A partitioning function operates on the output of the mapper (a sample partitioner is sketched after this list)
4. After that, the data moves to the reducer, which produces the desired output
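
As a rough sketch of the partitioning step, a custom Partitioner decides which reducer receives each intermediate key; by default Hadoop uses a hash partitioner that behaves much like this. The class name WordPartitioner is invented for the example.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer a given intermediate (key, value) pair is sent to.
// The same key always lands in the same partition, so all of its values
// are grouped at a single reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class), typically alongside job.setNumReduceTasks(n).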

Anatomy of a MapReduce Job Run-

  • Classic MapReduce (MapReduce 1)
    A job run in classic MapReduce is illustrated in the following figure. At the highest level, there
    are four independent entities:
    • The client, which submits the MapReduce job.
    • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
    • The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
  • YARN (MapReduce 2)
    MapReduce on YARN involves more entities than classic MapReduce. They are:
    • The client, which submits the MapReduce job.
    • The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
    • The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
    • The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
    • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
    The process of running a job is shown in the following figure and described in the following sections.
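
Which of the two runtimes a client submission goes to is controlled by the mapreduce.framework.name property, normally set in the cluster's mapred-site.xml. The tiny sketch below simply sets and reads it in code; the class name is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;

// A minimal sketch: choosing the MapReduce runtime programmatically.
// "local" runs in a single JVM, "classic" selected MapReduce 1 (jobtracker/tasktrackers)
// on older releases, and "yarn" selects MapReduce 2.
public class FrameworkSelection {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        System.out.println("Submitting via: " + conf.get("mapreduce.framework.name"));
    }
}
```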

     
