Categories: HadoopTutorial

What is HDFS?

What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters on commodity hardware.

  • Highly fault-tolerant 
    Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
  • High throughput
    HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
  • Suitable for application with large data sets
    Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
  • Streaming access to file system data
    Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
  • can be built out of commodity hardware

Areas where HDFS is Not a Good Fit Today-

  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications

HDFS Components-

  • NameNodes
    • -associated with JobTracker
    • -master of the system
    • maintain and manage the blocks which are present on the DataNodes

    The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.

  • DataNodes
    • – associated with TaskTracker
    • -salves which are deployed on each machine and provide the actual storage
    • -responsible for serving read and write request f
    • or the clients

    Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block’s metadata including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.

Previous
Next
Dinesh Rajput

Dinesh Rajput is the chief editor of a website Dineshonjava, a technical blog dedicated to the Spring and Java technologies. It has a series of articles related to Java technologies. Dinesh has been a Spring enthusiast since 2008 and is a Pivotal Certified Spring Professional, an author of a book Spring 5 Design Pattern, and a blogger. He has more than 10 years of experience with different aspects of Spring and Java design and development. His core expertise lies in the latest version of Spring Framework, Spring Boot, Spring Security, creating REST APIs, Microservice Architecture, Reactive Pattern, Spring AOP, Design Patterns, Struts, Hibernate, Web Services, Spring Batch, Cassandra, MongoDB, and Web Application Design and Architecture. He is currently working as a technology manager at a leading product and web development company. He worked as a developer and tech lead at the Bennett, Coleman & Co. Ltd and was the first developer in his previous company, Paytm. Dinesh is passionate about the latest Java technologies and loves to write technical blogs related to it. He is a very active member of the Java and Spring community on different forums. When it comes to the Spring Framework and Java, Dinesh tops the list!

Share
Published by
Dinesh Rajput

Recent Posts

Strategy Design Patterns using Lambda

Strategy Design Patterns We can easily create a strategy design pattern using lambda. To implement…

2 years ago

Decorator Pattern using Lambda

Decorator Pattern A decorator pattern allows a user to add new functionality to an existing…

2 years ago

Delegating pattern using lambda

Delegating pattern In software engineering, the delegation pattern is an object-oriented design pattern that allows…

2 years ago

Spring Vs Django- Know The Difference Between The Two

Technology has emerged a lot in the last decade, and now we have artificial intelligence;…

2 years ago

TOP 20 MongoDB INTERVIEW QUESTIONS 2022

Managing a database is becoming increasingly complex now due to the vast amount of data…

2 years ago

Scheduler @Scheduled Annotation Spring Boot

Overview In this article, we will explore Spring Scheduler how we could use it by…

2 years ago