
Hadoop Installation Tutorial (Hadoop 2.x)

This chapter explains how to set up Hadoop to run on a cluster of machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work they need to run on multiple nodes.

Hadoop 2, also called YARN, is the new version of Hadoop. It adds the YARN resource manager alongside the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which Google originally designed and implemented for processing and generating large data sets. HDFS is Hadoop’s underlying data persistence layer, loosely modeled after the Google File System (GFS). Many cloud computing services, such as Amazon EC2, provide MapReduce functionality. Although MapReduce has its limitations, it is an important framework for processing large data sets.

This tutorial shows how to set up a Hadoop 2.x (YARN) cluster: one node runs as the NameNode and the ResourceManager, and the other nodes run as DataNodes and NodeManagers (the slaves).

Enable password-less SSH login from the “hadoop” user to the slaves

For convenience, make sure the “hadoop” user on the NameNode/ResourceManager machine can SSH to the slaves without a password, so that we do not have to enter the password every time.

Details about password-less SSH login can be found in Enabling Password-less ssh Login.
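As a minimal sketch (assuming two slaves reachable as slave1 and slave2, hypothetical hostnames), generate a key pair on the master as the “hadoop” user and copy it to each slave:

 hadoop@namenode:~$ ssh-keygen -t rsa -P ""
 hadoop@namenode:~$ ssh-copy-id hadoop@slave1
 hadoop@namenode:~$ ssh-copy-id hadoop@slave2

After this, ssh slave1 should log in without prompting for a password.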

Install software needed by Hadoop

Besides Hadoop itself, the only software needed is Java (we use a JDK here).

Install the Java JDK on Ubuntu

The Oracle JDK can be downloaded from the JDK download page; alternatively, OpenJDK can be installed from Ubuntu’s package repositories, as we do here. You need to install the JDK (actually, just copying the JDK directory is enough) on all nodes of the Hadoop cluster.

 user@ubuntuvm:~$ sudo apt-get install openjdk-7-jdk
 
As an example in this tutorial, the JDK is installed into

/usr/lib/jvm/java-7-openjdk-i386

You may need to create a soft link at /usr/java/default pointing to the actual location where you installed the JDK, since some scripts expect Java there.
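For example (assuming the OpenJDK path above), the link can be created like this:

 user@ubuntuvm:~$ sudo mkdir -p /usr/java
 user@ubuntuvm:~$ sudo ln -s /usr/lib/jvm/java-7-openjdk-i386 /usr/java/default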

Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$JAVA_HOME/bin:$PATH
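After reloading ~/.bashrc, a quick check confirms that Java is on the PATH:

 user@ubuntuvm:~$ source ~/.bashrc
 user@ubuntuvm:~$ java -version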

Hadoop 2.x Configuration

Step 1. The Hadoop software can be downloaded from the Hadoop website. In this tutorial, we use Hadoop 2.5.2.

You can unpack the tarball to a directory. In this example, we unpack it to

/home/user/hadoop-2.5.2

which is a directory under the hadoop Linux user’s home directory.
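As a sketch, downloading and unpacking could look like this (the URL points at the Apache release archive and is an assumption here; any Hadoop mirror works):

 user@ubuntuvm:~$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
 user@ubuntuvm:~$ tar -xzf hadoop-2.5.2.tar.gz -C $HOME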

The Hadoop directory needs to be duplicated to all nodes after configuration. Remember to do this only after finishing the configuration steps below.
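Once the configuration is done, one way to duplicate the directory (again assuming the hypothetical slave hostnames slave1 and slave2) is:

 hadoop@namenode:~$ scp -r $HOME/hadoop-2.5.2 hadoop@slave1:~/
 hadoop@namenode:~$ scp -r $HOME/hadoop-2.5.2 hadoop@slave2:~/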

Step 2. Configure environment variables for the “hadoop” user

We assume the “hadoop” user uses bash as its shell.

Add these lines at the bottom of ~/.bashrc on all nodes. To edit the file, open a terminal and run:

 user@ubuntuvm:~$ gedit ~/.bashrc

Step 3. Put the following lines into the .bashrc file:

export HADOOP_HOME=$HOME/hadoop-2.5.2
export HADOOP_CONF_DIR=$HOME/hadoop-2.5.2/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.5.2
export HADOOP_COMMON_HOME=$HOME/hadoop-2.5.2
export HADOOP_HDFS_HOME=$HOME/hadoop-2.5.2
export YARN_HOME=$HOME/hadoop-2.5.2

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

export PATH=$PATH:$HOME/hadoop-2.5.2/bin:$HOME/hadoop-2.5.2/sbin
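After saving the file, reload it and confirm that the hadoop command resolves (this relies on the bin directory being on the PATH, as set above):

 user@ubuntuvm:~$ source ~/.bashrc
 user@ubuntuvm:~$ hadoop version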

Step 4. Configure the important Hadoop 2.5.2 files

The configuration files for Hadoop are under /home/user/hadoop-2.5.2/etc/hadoop in this installation. In each .xml file below, the shown content goes between the <configuration> and </configuration> tags.

i- core-site.xml

Here the NameNode runs on localhost. Note that fs.defaultFS is the Hadoop 2.x name for the deprecated fs.default.name property.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

ii- yarn-site.xml

The YARN ResourceManager runs on localhost and supports MapReduce shuffle.

<configuration>

<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

iii- hdfs-site.xml

The configuration here is optional; add the following settings if you need them. The purpose of each property:

1. dfs.replication sets the number of replicas kept for each HDFS block (1 here, since this example runs a single DataNode).
2. dfs.namenode.name.dir sets the NameNode’s storage directory; create /home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode beforehand.
3. dfs.datanode.data.dir sets the DataNode’s storage directory; create /home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode beforehand.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode</value>
</property>
</configuration>

iv- mapred-site.xml

First copy mapred-site.xml.template to mapred-site.xml, then add the following content.

 user@ubuntuvm:~/hadoop-2.5.2/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

</configuration>

v- hadoop-env.sh

Here we set the Java environment variable for Hadoop’s scripts.
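In etc/hadoop/hadoop-env.sh, the relevant line is the JAVA_HOME export; point it at the same JDK path used above:

 export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386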

Step 5. After configuring all five important files, you have to format the NameNode with the following commands in an Ubuntu terminal (./hdfs namenode -format is the preferred form in Hadoop 2.x, but the older command below still works):

 user@ubuntuvm:~$ cd hadoop-2.5.2/bin/
 user@ubuntuvm:~/hadoop-2.5.2/bin$ ./hadoop namenode -format

Step 6. After formatting the NameNode, start all the Hadoop 2.5.2 daemon services. First move to /home/user/hadoop-2.5.2/sbin:

 user@ubuntuvm:~$ cd hadoop-2.5.2/sbin/

i- datanode daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./hadoop-daemon.sh start datanode

ii- namenode daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./hadoop-daemon.sh start namenode

iii- resourcemanager daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./yarn-daemon.sh start resourcemanager

iv- nodemanager daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./yarn-daemon.sh start nodemanager

v- jobhistoryserver daemon service
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./mr-jobhistory-daemon.sh start historyserver
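Equivalently, the sbin directory provides combined scripts that start the HDFS and YARN daemons in one go (the history server is still started separately):

 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./start-dfs.sh
 user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./start-yarn.sh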

Step 7. To verify that all daemon services are running, enter the following command at the terminal:

 user@ubuntuvm:~/hadoop-2.5.2/sbin$ jps
3869 DataNode
4067 ResourceManager
4318 NodeManager
4449 JobHistoryServer
4934 NameNode
5389 Jps

If some of the services have not started, check the log files in the logs folder of hadoop-2.5.2 at the following location:

“/home/user/hadoop-2.5.2/logs/”

Each daemon writes its own .log and .out files there, for example:
hadoop-user-namenode-ubuntuvm.log
hadoop-user-namenode-ubuntuvm.out
hadoop-user-datanode-ubuntuvm.log
hadoop-user-datanode-ubuntuvm.out
etc.
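For example, to inspect the most recent errors in the NameNode log:

 user@ubuntuvm:~$ tail -n 50 /home/user/hadoop-2.5.2/logs/hadoop-user-namenode-ubuntuvm.log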

Step 8. To verify the services in a browser, go to Firefox inside the virtual machine and open the following URL:

 “http://localhost:50070/”
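This is the NameNode web UI. The ResourceManager serves its own web UI as well, by default at:

 “http://localhost:8088/”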

Two more files in the same configuration directory are worth checking:

1. slaves

localhost

For a multi-node cluster, delete localhost and add the hostnames of all the slave nodes (the machines running the DataNode and NodeManager daemons), one per line. For example:

hofstadter
snell

2. masters

localhost
