Wednesday, November 19, 2014

Hadoop Fully Distributed Mode Cluster

In this blog we will learn how to set up a Hadoop Fully Distributed Mode cluster with one Namenode (master) and 2 Datanodes (slaves) using your Windows 64-bit PC or laptop.

1.  To begin, we need 2 Virtual Machines (VMs) running on the same Windows system, each with Hadoop already installed in Pseudo Distributed Mode.
For details on how to set up a Hadoop Pseudo Distributed Mode cluster, please refer to my blog http://hadooppseudomode.blogspot.in/ . There I have recorded detailed steps, with supporting screenshots, to install and set up a Hadoop cluster in Pseudo Distributed Mode using your Windows 64-bit PC or laptop.


Note: Using VMware Player you can have 2 VMs running simultaneously. To do that, open VMware Player by double clicking its desktop icon, select the 1st VM and play it. Then go back to your Windows desktop and double click the VMware desktop icon again to launch a second session, and play the 2nd VM.

2.    I will use the 1st virtual machine (named: Master) to work as both Namenode and Datanode. The 2nd virtual machine (named: Slave) will be used as a Datanode only.

Note: The names given to these virtual machines in VMware Player are for identification purposes only and do not affect the cluster setup. You are free to name your 2 VMs as per your preference.


3.   Play both the VMs (Master and Slave systems) and log in using the hduser account.

4.   We need to stop all the Hadoop services on both the Master and Slave systems. On the Master system type the command below, then log in to the Slave system and run the same command.

/home/hduser/hadoop/bin/stop-all.sh


5.    Note down the IP address of both the VMs. For that, open LXTerminal on each and type the command
       ifconfig
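Look for the inet addr field of the eth0 entry in the output, which will look roughly like the sketch below (the addresses shown here are only illustrative):

eth0      Link encap:Ethernet  HWaddr 00:0c:29:xx:xx:xx
          inet addr:192.168.xx.xx2  Bcast:192.168.xx.255  Mask:255.255.255.0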


6.    My Master system (1st VM) IP address is 192.168.xx.xx2 and my Slave system (2nd VM) IP address is 192.168.xx.xx3.

7.    Log in to the Master system (1st VM). Open its LXTerminal. Type the command below to check whether the Slave system (2nd VM) is on the same network and is accessible.
ping -c 5 <<ip address of the Slave system>>

ping -c 5 192.168.xx.xx3
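If the Slave system is reachable, all 5 packets are answered and the summary reports 0% packet loss, roughly like this (the times will vary):

64 bytes from 192.168.xx.xx3: icmp_seq=1 ttl=64 time=0.48 ms
...
5 packets transmitted, 5 received, 0% packet loss, time 4005ms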


The next step is to edit the /etc/hosts file, in which we will record the IP address and name of each system in the cluster. This has to be done on both the Master (1st VM) and Slave (2nd VM) systems.

8.    From Master system (1st VM) - Open LXTerminal and type the below command
sudo leafpad /etc/hosts
Enter the hduser password and the hosts file will open.
In the hosts file, add the IP address and name of both systems (master and slave) right below the existing ubuntu entry. Each IP address and its name should be tab separated.

192.168.xx.x2             master
192.168.xx.x3             slave
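After the edit, the complete /etc/hosts file should look roughly like the sketch below (assuming the default hostname is ubuntu; yours may differ, and the existing 127.x.x.x lines should be left untouched):

127.0.0.1       localhost
127.0.1.1       ubuntu
192.168.xx.x2   master
192.168.xx.x3   slave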


9.    From Slave system (2nd VM) – Repeat Step 8 as listed above.

The next step is to give the Master system passwordless shell access to the Slave system. That is, from the Master system you should be able to access the shell of the Slave system without needing a password. This is achieved by sharing the Master system's public key with the Slave system.

10.   From Master system (1st VM) – Type the following command
ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@slave
You will be prompted for confirmation. Type yes and press Enter. You will then be prompted for a password; give the password of hduser.
A confirmation message will show on screen saying slave has been added to the list of known hosts.


Now enter the command below; you will not be prompted for any password, which confirms the keys were shared properly.
ssh slave
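Once the passwordless login succeeds you will be in the Slave system's shell. You can verify where you are and then return to the Master system:

hostname     # prints the Slave VM's hostname
exit         # logs out and returns to the Master system's shell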


The next step is to record the master system's name in Hadoop's conf/masters file (in Hadoop 1.x this file actually determines where the Secondary Namenode runs). This has to be done on both the Master (1st VM) and Slave (2nd VM) systems.

11. From Master system (1st VM) – Type the following command to open masters file
leafpad /home/hduser/hadoop/conf/masters

Enter the following value and save
master


12.  From Slave system (2nd VM) – Repeat Step 11 as listed above

Next we have to list the names of the slave systems in Hadoop's conf/slaves file. This is needed only on the Master (1st VM) system.

13.  In the Master system (1st VM) – Type the following command to open slaves file
leafpad /home/hduser/hadoop/conf/slaves

Enter the following values and save
master
slave


Next we have to record the Namenode system's name in the core-site.xml file and the Job Tracker system's name in the mapred-site.xml file. This has to be done on both the Master (1st VM) and Slave (2nd VM) systems.

14. From Master system (1st VM) – Type the following command to open core-site.xml file

leafpad /home/hduser/hadoop/conf/core-site.xml

Replace localhost:54310 with master:54310 and save.
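Assuming your pseudo-distributed setup used the standard Hadoop 1.x property name, the relevant section of core-site.xml should read like this after the edit:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>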


15. From Master system (1st VM) – Type the following command to open the mapred-site.xml file

leafpad /home/hduser/hadoop/conf/mapred-site.xml

Replace localhost:54311 with master:54311 and save.
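Similarly, the Job Tracker property in mapred-site.xml should read like this after the edit (again assuming the standard Hadoop 1.x property name):

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>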


16. From Slave system (2nd VM) – Repeat Steps 14 and 15 as listed above and save the changes made to the core-site.xml and mapred-site.xml files.

17. In the Master system (1st VM) – Type the following command to open the hdfs-site.xml file. Here we will change the replication factor to 2, as shown below.

leafpad /home/hduser/hadoop/conf/hdfs-site.xml
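The replication factor is controlled by the dfs.replication property. Change its value from 1 to 2 and save, so that each HDFS block is stored on both Datanodes. The property should read like this after the edit:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>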


Since the Slave system was initially configured as a Pseudo Mode cluster, its Datanode will still be referring to the Namenode namespaceID of its own system instead of the Namenode namespaceID of the Master system (the new Namenode). The namespaceID is stored in the VERSION file. This needs to be changed only on the Slave system, as the Master system continues to work as both master and slave.

18. From Master system (1st VM) – We need to find out the namespaceID of the Namenode. For that, type the following command to open the VERSION file

sudo leafpad /home/hduser/hadoop_tmp_files/dfs/data/current/VERSION

Note down the namespaceID. My Master system's namespaceID is 306023161.
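For reference, a Hadoop 1.x datanode VERSION file looks roughly like the sketch below; every value here is illustrative except the namespaceID line, which is the one you care about:

#Wed Nov 19 10:15:02 IST 2014
namespaceID=306023161
storageID=DS-1234567890-192.168.xx.xx2-50010-1416372902000
cTime=0
storageType=DATA_NODE
layoutVersion=-32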

Note:  The VERSION file is located inside the hadoop_tmp_files folder, whose path is specified in the /hadoop/conf/core-site.xml file. If you have changed the storage path, use that path in the above command.


19. From Slave system (2nd VM) – We need to change the namespaceID stored by the Datanode so that it matches the Master Namenode's. For that, type the following command to open the VERSION file

sudo leafpad /home/hduser/hadoop_tmp_files/dfs/data/current/VERSION

Change the namespaceID to the value you noted down in the previous step. In my example I will change it to
namespaceID=306023161
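If you prefer a one-liner to editing the file by hand, a sed command like the sketch below makes the same change (assuming the default storage path used above; substitute the namespaceID you noted in Step 18):

sudo sed -i 's/^namespaceID=.*/namespaceID=306023161/' /home/hduser/hadoop_tmp_files/dfs/data/current/VERSION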

Note:  The VERSION file is located inside the hadoop_tmp_files folder, whose path is specified in the /hadoop/conf/core-site.xml file. If you have changed the storage path, use that path in the above command.


20.   From Master system (1st VM) – The next step is to start all the Hadoop services

        /home/hduser/hadoop/bin/start-all.sh
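You should see each daemon starting in turn, roughly like the output below (the log file paths will vary with your install):

starting namenode, logging to /home/hduser/hadoop/logs/...
master: starting datanode, logging to ...
slave: starting datanode, logging to ...
master: starting secondarynamenode, logging to ...
starting jobtracker, logging to ...
master: starting tasktracker, logging to ...
slave: starting tasktracker, logging to ...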


21.  From Master system (1st VM) – Type the jps command to check that the Java processes listed below are running

NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
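The jps output should look roughly like this (the process IDs will differ on your system):

hduser@master:~$ jps
2418 NameNode
2630 DataNode
2845 SecondaryNameNode
2957 JobTracker
3172 TaskTracker
3280 Jps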

22. From Slave system (2nd VM) – Type the jps command to check that the Java processes listed below are running

TaskTracker
DataNode

23.  From the Master system (1st VM) – Open the Chrome browser on your Lubuntu Master machine and browse to http://localhost:50070. The number of Live Nodes should show as 2. This confirms that you have set up a fully distributed Hadoop cluster with 2 Datanodes.
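You can also confirm this from the command line. The dfsadmin report lists the live Datanodes, and with both nodes up it should report 2 available Datanodes:

/home/hduser/hadoop/bin/hadoop dfsadmin -report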


Hope this helped. Do reach out to me if you have any questions.
If you found this blog useful, please convey your thanks by posting your comments
