Wednesday, November 19, 2014

Hadoop Fully Distributed Mode Cluster

In this blog we will learn how to set up a Hadoop Fully Distributed Mode cluster with one Namenode (master) and 2 Datanodes (slaves) using your Windows 64-bit PC or laptop.

1.  For this we need to have 2 Virtual Machines (VMs) ready on the same Windows system, each with Hadoop already installed in Pseudo Distributed Mode.
For details on how to set up a Hadoop Pseudo Distributed Mode cluster, please refer to my blog http://hadooppseudomode.blogspot.in/ . In that blog I have recorded detailed steps, with supporting screenshots, to install and set up a Hadoop cluster in Pseudo Distributed Mode using your Windows 64-bit PC or laptop.


Note: Using VMware you can have 2 VMs running simultaneously. To do that, open VMware Player by double-clicking its desktop icon, select the 1st VM and play it. Then go back to your Windows desktop and double-click the VMware desktop icon to start a second session and play the 2nd VM.

2.    I will use the 1st virtual machine (named: Master) to work both as Namenode and Datanode. The 2nd virtual machine (named: Slave) will be used as a Datanode.

Note: The names given to these virtual machines in VMware Player are for identification purposes only and do not affect the cluster setup. You are free to name your 2 VMs as per your preference.


3.   Play both the VMs (Master and Slave systems) and log in using the hduser account.

4.   We need to stop all the Hadoop services in both the Master and Slave systems. On the Master system type the below command, then log in to the Slave system and run the same command.

/home/hduser/hadoop/bin/stop-all.sh


5.    Note down the IP address of both the VMs. For that, open LXTerminal on each and type the command
       ifconfig
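The output varies by system, but the address you need is in the inet addr field of your network interface (commonly eth0). The line looks similar to the below; the values shown here are illustrative:

inet addr:192.168.xx.xx2  Bcast:192.168.xx.255  Mask:255.255.255.0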


6.    My Master system's (1st VM) IP address is 192.168.xx.xx2 and my Slave system's (2nd VM) IP address is 192.168.xx.xx3.

7.    Log in to the Master system (1st VM). Open its LXTerminal. Type the below command to check if the Slave system (2nd VM) is on the same network and is accessible.
ping -c 5 <<ip address of the Slave system>>

ping -c 5 192.168.xx.xx3
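If the Slave system is reachable, each ping will get a reply and the summary will report no packet loss. The output looks similar to the below (sequence numbers and times will differ):

64 bytes from 192.168.xx.xx3: icmp_seq=1 ttl=64 time=0.48 ms
...
5 packets transmitted, 5 received, 0% packet loss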


The next step is to edit the /etc/hosts file, in which we will put the IP address and hostname of each system in the cluster. This has to be done on both the Master (1st VM) and Slave (2nd VM) systems.

8.    From Master system (1st VM) - Open LXTerminal and type the below command
sudo leafpad /etc/hosts
Enter the hduser password and the hosts file will open.
In the hosts file, add the IP address and hostname of both the systems (master and slave) right below the Ubuntu system name. The IP address and its hostname should be tab-separated.

192.168.xx.xx2             master
192.168.xx.xx3             slave
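As an optional check, you can now ping the Slave system by name from the Master; if the hosts entry is correct, the name will resolve to the Slave's IP address

ping -c 2 slave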


9.    From Slave system (2nd VM) – Repeat Step 8 as listed above.

The next step is to give the Master system passwordless shell access to the Slave system. That is, from the Master system you should be able to access the shell of the Slave system without needing a password. This is achieved by sharing the Master system's public key with the Slave system.

10.   From Master system (1st VM) – Type the following command
ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@slave
You will be prompted for confirmation. Type yes and press Enter. Then, when prompted for a password, enter the hduser password.
A confirmation message that slave has been added to the list of known hosts will show on screen.


Now enter the below command; you will not be prompted for any password, which confirms the key was shared properly.
ssh slave
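You are now logged in to the Slave system's shell. Type the below command to log out and return to the Master system before continuing

exit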


The next step is to record the Namenode system's name/IP address in Hadoop's conf/masters file. This has to be done on both the Master (1st VM) and Slave (2nd VM) systems.

11. From Master system (1st VM) – Type the following command to open the masters file
leafpad /home/hduser/hadoop/conf/masters

Enter the following value and save
master


12.  From Slave system (2nd VM) – Repeat Step 11 as listed above

Next we have to list the names of the slave systems in Hadoop's conf/slaves file. This is needed only on the Master (1st VM) system.

13.  In the Master system (1st VM) – Type the following command to open the slaves file
leafpad /home/hduser/hadoop/conf/slaves

Enter the following values and save (master is listed here too because the Master system also runs a Datanode)
master
slave


Next we will have to record the Namenode system's name in the core-site.xml file and the JobTracker system's name in the mapred-site.xml file. This has to be done on both the Master (1st VM) and Slave (2nd VM) systems.

14. From Master system (1st VM) – Type the following command to open the core-site.xml file

leafpad /home/hduser/hadoop/conf/core-site.xml

Replace the word localhost:54310 with master:54310 and save
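After the edit, the fs.default.name property in your core-site.xml should read similar to the below (your file may contain other properties, such as hadoop.tmp.dir, which stay unchanged)

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>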


15. From Master system (1st VM) – Type the following command to open the mapred-site.xml file

leafpad /home/hduser/hadoop/conf/mapred-site.xml

Replace the word localhost:54311 with master:54311 and save
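Similarly, after the edit the mapred.job.tracker property in your mapred-site.xml should read similar to

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>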


16. From Slave system (2nd VM) – Repeat Steps 14 and 15 as listed above and save the changes made to the core-site.xml and mapred-site.xml files.

17. In the Master system (1st VM) – Type the following command to open the hdfs-site.xml file. Here we will change the replication factor to 2, since the cluster now has 2 Datanodes: set the value of the dfs.replication property to 2 and save.

leafpad /home/hduser/hadoop/conf/hdfs-site.xml
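After the change, the dfs.replication property should read similar to

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>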


Since the Slave system was initially configured to function as a Pseudo Mode cluster, its Datanode will still be referring to the Namenode namespaceID of that same system instead of using the Namenode namespaceID of the Master system (the new Namenode). The namespaceID is stored in the VERSION file. This needs to be changed only on the Slave system, as the Master system continues to work as both master and slave.

18. From Master system (1st VM) – We need to find out the namespaceID of the Namenode. For that, type the following command to open the VERSION file

sudo leafpad /home/hduser/hadoop_tmp_files/dfs/data/current/VERSION

Note down the namespaceID. My Master system's namespaceID is 306023161.
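For reference, a Datanode's VERSION file looks similar to the below. The storageID, timestamp, and layoutVersion shown here are illustrative and will differ on your system; only the namespaceID line matters for this step

#Wed Nov 19 10:12:35 IST 2014
namespaceID=306023161
storageID=DS-1234567890-192.168.xx.xx2-50010-1416372155000
cTime=0
storageType=DATA_NODE
layoutVersion=-41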

Note:  The VERSION file is located inside the hadoop_tmp_files folder, whose path is specified (via the hadoop.tmp.dir property) in the hadoop/conf/core-site.xml file. If you have changed the storage path, use that path in the above command.


19. From Slave system (2nd VM) – We need to change the Datanode's stored namespaceID so that it matches the Master Namenode's namespaceID. For that, type the following command to open the VERSION file

sudo leafpad /home/hduser/hadoop_tmp_files/dfs/data/current/VERSION

Change the namespaceID to the value you noted down in the previous step. In my example I will change it to
namespaceID=306023161

Note:  The VERSION file is located inside the hadoop_tmp_files folder, whose path is specified (via the hadoop.tmp.dir property) in the hadoop/conf/core-site.xml file. If you have changed the storage path, use that path in the above command.


20.   From Master system (1st VM) – The next step is to start all Hadoop services

        /home/hduser/hadoop/bin/start-all.sh
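If the cluster is configured correctly, start-all.sh reports each daemon as it starts on the Master and Slave systems. The output looks similar to the below (the exact log file paths will differ on your system)

starting namenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-namenode-master.out
master: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-master.out
slave: starting datanode, logging to /home/hduser/hadoop/logs/hadoop-hduser-datanode-slave.out
master: starting secondarynamenode, logging to /home/hduser/hadoop/logs/hadoop-hduser-secondarynamenode-master.out
starting jobtracker, logging to /home/hduser/hadoop/logs/hadoop-hduser-jobtracker-master.out
master: starting tasktracker, logging to /home/hduser/hadoop/logs/hadoop-hduser-tasktracker-master.out
slave: starting tasktracker, logging to /home/hduser/hadoop/logs/hadoop-hduser-tasktracker-slave.out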


21. From Master system (1st VM) – Type the jps command to check if the below listed Java processes are running

NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
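A typical jps listing on the Master looks like the below; the process IDs are illustrative and will differ on your system

2583 NameNode
2712 DataNode
2845 SecondaryNameNode
2970 JobTracker
3120 TaskTracker
3350 Jps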

22. From Slave system (2nd VM) – Type the jps command to check if the below listed Java processes are running

TaskTracker
DataNode

23. From the Master system (1st VM) – Open the Chrome browser in your Lubuntu Master machine and browse to http://localhost:50070. The number of Live Nodes should show as 2. This confirms that you have set up a fully distributed Hadoop cluster with 2 Datanodes.
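Alternatively, you can confirm the live Datanode count from the command line. Run the below command on the Master system and look for a line similar to "Datanodes available: 2" in its output

/home/hduser/hadoop/bin/hadoop dfsadmin -report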


Hope this helped. Do reach out to me if you have any questions.
If you found this blog useful, please convey your thanks by posting your comments