Hadoop

JobTracker and TaskTracker Design

The JobTracker and TaskTracker come into the picture when a data set needs to be processed. In a Hadoop system there are five services always running in the background (called Hadoop daemon services).

Daemon Services of Hadoop-

  1. NameNode
  2. Secondary NameNode
  3. JobTracker
  4. DataNode
  5. TaskTracker

Of the services above, services 1, 2, and 3 can talk to each other, and services 4 and 5 can also talk to each other. In addition, the NameNode talks to the DataNodes, and the JobTracker talks to the TaskTrackers.

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby.

If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few seconds to report its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
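To make the client-to-JobTracker path concrete, here is a minimal job-submission sketch using the classic org.apache.hadoop.mapred API from the JobTracker era (Hadoop 1.x). The job name and the input/output paths are placeholders; the identity mapper and reducer are used only to keep the example self-contained.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobExample {

    public static void main(String[] args) throws Exception {
        // The JobConf describes the job; the JobClient hands it to the JobTracker.
        JobConf conf = new JobConf(SubmitJobExample.class);
        conf.setJobName("identity-example");   // placeholder job name

        // HDFS input and output paths (placeholders for this sketch).
        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

        // Identity mapper/reducer: records pass through unchanged,
        // so the example compiles and runs without extra classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // If a task's JVM crashes or a TaskTracker dies, the JobTracker
        // reschedules the task; these limits cap how often it retries.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);

        // Submit the job to the JobTracker and block until it completes.
        JobClient.runJob(conf);
    }
}
```

Running this class on a client machine is what triggers the workflow described in the next section: the client inspects the input, computes splits, writes the job files to the DFS, and submits the job to the JobTracker.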

JobTracker and TaskTracker Workflow-

1. The user copies all input files to the distributed file system, using the NameNode metadata.
2. The job is submitted to the client, to be applied to the input files stored in the DataNodes.
3. The client gets information from the NameNode about the input files to be processed.
4. The client creates input splits of all the files for the job.
5. After splitting the files, the client stores the metadata about this job in the DFS.
6. The client then submits this job to the JobTracker.

7. Now the JobTracker comes into the picture and initializes the job in its job queue.
8. The JobTracker reads the job files submitted by the client from the DFS.
9. The JobTracker then creates map and reduce tasks for the job, and the input splits are assigned to the mappers. There are as many mappers as there are input splits; each map task works on an individual split and produces its own output.

10. Now the TaskTrackers come into the picture: the JobTracker submits tasks to the TaskTrackers and receives a heartbeat from every TaskTracker to confirm whether it is working properly. This heartbeat is sent to the JobTracker by every TaskTracker every 3 seconds. If a TaskTracker does not send a heartbeat within 3 seconds, the JobTracker waits a further 30 seconds, after which it considers that TaskTracker to be in a dead state and updates its metadata about that TaskTracker.
11. The JobTracker picks tasks based on the input splits.
12. It assigns each task to a TaskTracker.

Finally, all TaskTrackers create their outputs, and reduce tasks are run over the outputs created by the TaskTrackers. The reducers then give us the final output; a minimal mapper/reducer sketch follows below.
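As a concrete illustration of steps 9-12 and of the final reduce phase, here is a minimal word-count mapper and reducer sketch using the same classic org.apache.hadoop.mapred API. Each map task receives the records of one input split; each reduce task receives all values for a given key and writes part of the final output. The class names are placeholders chosen for this example.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One instance of this mapper runs per input split; it sees only that split's records.
class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Emit (word, 1) for every token in this line of the split.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}

// Each reducer receives all counts for a given word and writes part of the final output.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```

In a real job these classes would be registered on the JobConf (setMapperClass and setReducerClass) before the client hands the job to the JobTracker, as in the submission sketch earlier.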
