Hadoop Architecture

Here we will describe about Hadoop Architecture. In high level of hadoop architecture there are two main modules HDFS and MapReduce.

               Means HDFS + MapReduce = Hadoop Framework

Following pic have high level architecture of hadoop version 1 and version 2-

Hadoop provides a distributed filesystem(HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.

The Apache Hadoop framework is composed of the following modules :
1] Hadoop Common - contains libraries and utilities needed by other Hadoop modules

2] Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.

3] Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

4] Hadoop MapReduce - a programming model for large scale data processing.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers.

Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is now commonly considered to consist of a number of related projects as well – Apache Pig, Apache Hive, Apache HBase, and others
For the end-users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig, Apache Hive among other related projects expose higher level user interfaces like Pig latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell-scripts.

Core Components of Hadoop 1.x(HDFS & MapReduce) :
There are two primary components at the core of Apache Hadoop 1.x : the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. These open source projects, inspired by technologies created inside Google.
 Hadoop Distributed File System (HDFS)-Storage
  • Distributed across "nodes"
  • Natively redundant
  • NameNode track location

  • Split a task across processors
  • near data and assembles results
  • self healing and high bandwidth
  • clustered storage
  • JobTracker manages the TaskTracker

NameNode is admin node, is associated with Job Tracker, is master slave architecture.

JobTracker is associated with NameNode with multiple task tracker for processing of data sets.

No comments:

Post a Comment