What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

  • Highly fault-tolerant 
    Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
  • High throughput
    HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. Later Hadoop releases added support for appending to existing files.
  • Suitable for applications with large data sets
    Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
  • Streaming access to file system data
    Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
  • Can be built out of commodity hardware
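
The fault-tolerance argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming machines fail independently and with the same probability; both the independence assumption and the 0.1% figure are illustrative choices, not measurements of any real cluster.

```python
def prob_any_failure(num_machines: int, p_fail: float) -> float:
    """Probability that at least one of num_machines is down.

    Assumes (for illustration only) that every machine fails
    independently with the same probability p_fail.
    """
    return 1.0 - (1.0 - p_fail) ** num_machines


# Hypothetical per-machine failure probability of 0.1% over some interval.
for n in (10, 100, 1000):
    print(n, "machines:", round(prob_any_failure(n, 0.001), 4))
```

Even with a small per-machine failure probability, the chance that something in the cluster is broken grows quickly with cluster size, which is why automatic detection and recovery is treated as a core design goal.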

Areas where HDFS is Not a Good Fit Today-

  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications

HDFS Components-

  • NameNodes
    • associated with the JobTracker
    • master of the system
    • maintains and manages the blocks which are present on the DataNodes

    The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.
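
To illustrate how a file maps to blocks and replicas, here is a minimal sketch using the typical values quoted above (128 MB blocks, replication factor 3). The function name `block_layout` and its return shape are hypothetical and exist only for this example; they are not part of any HDFS API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size
REPLICATION = 3                 # typical replication factor


def block_layout(file_size: int,
                 block_size: int = BLOCK_SIZE,
                 replication: int = REPLICATION):
    """Return (blocks, block replicas, raw bytes stored cluster-wide).

    The last block may be only partly full, so the raw bytes are
    file_size * replication rather than a whole number of full blocks.
    """
    num_blocks = -(-file_size // block_size)  # integer ceiling division
    return num_blocks, num_blocks * replication, file_size * replication


one_gib = 1024 ** 3
blocks, replicas, raw_bytes = block_layout(one_gib)
print(blocks, "blocks,", replicas, "replicas,", raw_bytes, "raw bytes")
```

For a 1 GiB file this gives 8 blocks, and the NameNode would have to track 24 block replicas spread across the DataNodes.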

  • DataNodes
    • associated with the TaskTracker
    • slaves which are deployed on each machine and provide the actual storage
    • responsible for serving read and write requests for the clients

    Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block’s metadata including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.
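
The two-file representation of a replica can be sketched as follows. This is a toy model, not the real on-disk format: the 512-byte checksum chunk size, the use of `zlib.crc32`, the dictionary layout of the metadata, and the names `build_replica` and `verify_replica` are all assumptions made for illustration.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def chunk_checksums(data: bytes):
    """CRC32 of each fixed-size chunk of the block data."""
    return [
        zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(data), BYTES_PER_CHECKSUM)
    ]


def build_replica(block_data: bytes, generation_stamp: int):
    """Return (data file contents, metadata) for one block replica.

    The data part is exactly as long as the block; nothing is padded
    up to the nominal block size.
    """
    meta = {
        "generation_stamp": generation_stamp,
        "checksums": chunk_checksums(block_data),
    }
    return block_data, meta


def verify_replica(data: bytes, meta: dict) -> bool:
    """Recompute the checksums and compare them with the stored ones."""
    return chunk_checksums(data) == meta["checksums"]


data, meta = build_replica(b"x" * 1300, generation_stamp=7)
print(len(data), "data bytes,", len(meta["checksums"]), "checksums")
print("intact:", verify_replica(data, meta))
print("corrupted:", verify_replica(b"y" + data[1:], meta))
```

Keeping the checksums beside the data lets a reader detect a corrupted replica and fall back to another copy of the block.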