Hadoop Environment Setup

Ubuntu Installation

Version: Ubuntu Desktop 18.04.2 LTS

Installation Tutorial: Install Ubuntu Desktop

Disable Auto Update

Disable Auto Shut Down and Sleep

Notice: On the login details screen, set each computer's name to master/slave1/slave2/slave3 respectively, and set the username to hadoop on all machines.

Hadoop Environment Setup

Pre-installation Setup

Checking Hostname

hadoop@slave1:~$ hostname
slave1

Checking Current IP Address

hadoop@slave1:~$ hostname -I
10.22.16.84 172.17.0.1

Install vim

hadoop@slave1:~$ cd /
hadoop@slave1:/$ cd etc
hadoop@slave1:/etc$ sudo apt install vim

Add IP Addresses

Insert the information from the table below into the /etc/hosts file.

IP Address      Hostname
10.22.17.39     master
10.22.16.84     slave1
10.22.17.150    slave2
10.22.17.79     slave3

Command to open the file and insert the information:

hadoop@slave1:/etc$ sudo vim hosts
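
A minimal sketch of what /etc/hosts might look like after the edit (assuming the default localhost entry is kept and the cluster entries from the table are appended):

127.0.0.1 localhost
10.22.17.39 master
10.22.16.84 slave1
10.22.17.150 slave2
10.22.17.79 slave3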

Check if connections to other machines can be established:

hadoop@slave1:/etc$ ping master

Java JDK Installation

Version: Java SE Development Kit 8u211 (Requires Registration)

Extract the Files to /usr/lib/jvm/
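
A possible way to do the extraction from the terminal (the archive name jdk-8u211-linux-x64.tar.gz and its location in ~/Downloads are assumptions; adjust them to match the file you downloaded):

#Create the target directory and extract the JDK archive into it
hadoop@slave1:~$ sudo mkdir -p /usr/lib/jvm
hadoop@slave1:~$ sudo tar -zxvf ~/Downloads/jdk-8u211-linux-x64.tar.gz -C /usr/lib/jvm/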

Add Java’s Path into $PATH

Open the /etc/profile file and add the Java path:

hadoop@slave1:~$ cd /etc
hadoop@slave1:/etc$ sudo vim profile

Insert the following code into the file:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_211
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin

Source the file to apply the changes:

hadoop@slave1:/etc$ source profile

Notice: You may need to restart your computer to apply the changes permanently.

Check Java Version and Path

hadoop@slave1:~$ java -version
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
hadoop@slave1:~$ which java
/usr/lib/jvm/jdk1.8.0_211/bin/java

Setup SSH

Set up SSH so that the machines can connect to each other without entering passwords.

#Install SSH
hadoop@slave1:~$ sudo apt-get install openssh-server

#Generate an SSH key
hadoop@slave1:~$ ssh-keygen -t rsa
hadoop@slave1:~$ cd .ssh/

#Copy the key into the authorized_keys file
hadoop@slave1:~/.ssh/$ cat id_rsa.pub >> authorized_keys

#Set the file so that only the owner can read and write it
hadoop@slave1:~/.ssh/$ chmod 0600 authorized_keys

#Configure settings in the sshd_config file
hadoop@slave1:~/.ssh/$ sudo vim /etc/ssh/sshd_config

Add these lines at the end of the sshd_config file:

RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile %h/.ssh/authorized_keys

Restart the SSH service and copy the SSH key to the other machines:

hadoop@slave1:~/.ssh/$ sudo service ssh restart
hadoop@slave1:~/.ssh/$ ssh-copy-id hadoop@master
hadoop@slave1:~/.ssh/$ ssh 'hadoop@master'
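
Repeat ssh-copy-id for every machine in the cluster. A quick way to confirm that passwordless login works to all nodes (a sketch, assuming the hostnames configured in /etc/hosts earlier):

#Each command should print the remote hostname without asking for a password
hadoop@slave1:~$ for host in master slave1 slave2 slave3; do ssh hadoop@$host hostname; done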

Hadoop Installation

Install Hadoop

  • Version: Hadoop 3.1.2 (Binary)
  • Move the Hadoop folder to /usr/
  • Rename the folder to hadoop
  • Create a tmp folder inside the hadoop folder
hadoop@slave1:/$ sudo mv hadoop-3.1.2/ /usr/
hadoop@slave1:/$ cd usr/
hadoop@slave1:/usr$ sudo mv hadoop-3.1.2/ hadoop

#Give ownership of the folder to the hadoop user
hadoop@slave1:/usr$ sudo chown -R hadoop:slave1 hadoop/
hadoop@slave1:/usr$ cd hadoop/
hadoop@slave1:/usr/hadoop$ mkdir tmp

Add Hadoop’s Path into $PATH

Open the /etc/profile file and add the Hadoop path:

hadoop@slave1:~$ cd /etc
hadoop@slave1:/etc$ sudo vim profile

Insert the following code into the file:

export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Source the file to apply the changes:

hadoop@slave1:/etc$ source profile

Notice: You may need to restart your computer to apply the changes permanently.

Configure Hadoop

Change the configuration settings in the following 5 files:

#Find the path of each file
hadoop@slave1:/usr$ find hadoop -name hadoop-env.sh
hadoop@slave1:/usr$ find hadoop -name core-site.xml
hadoop@slave1:/usr$ find hadoop -name hdfs-site.xml
hadoop@slave1:/usr$ find hadoop -name mapred-site.xml
hadoop@slave1:/usr$ find hadoop -name yarn-site.xml

5 paths:

  • hadoop-env.sh - hadoop/etc/hadoop/hadoop-env.sh
  • core-site.xml - hadoop/etc/hadoop/core-site.xml
  • hdfs-site.xml - hadoop/etc/hadoop/hdfs-site.xml
  • mapred-site.xml - hadoop/etc/hadoop/mapred-site.xml
  • yarn-site.xml - hadoop/etc/hadoop/yarn-site.xml

Configure hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_211

Configure core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

Configure hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

Configure mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configure yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>

Open the workers file and list the hostname of every worker, one per line:

hadoop@slave1:/usr/hadoop/etc/hadoop$ vim workers

Here is the content of the workers file:

master
slave1
slave2
slave3
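
In a fully distributed setup every machine needs the same configuration files. If only one machine has been configured, the files can be pushed to the others with scp, following the pattern listed under Useful Commands below (a sketch, not part of the original steps; it copies into the remote home directory first because scp cannot write to /usr directly):

hadoop@master:~$ scp -r /usr/hadoop/etc/hadoop hadoop@slave1:/home/hadoop/hadoop-conf
#Then, on slave1, copy the configuration into place
hadoop@slave1:~$ sudo cp -r hadoop-conf/. /usr/hadoop/etc/hadoop/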

Spark Environment Setup

Install Spark

  • Version: Spark 2.4.3 (Binary)
  • Move the Spark folder to /usr/hadoop/
  • Rename the folder to spark (a command sketch follows below)
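
A possible command sequence for these steps, mirroring the Hadoop installation above (the extracted folder name spark-2.4.3-bin-hadoop2.7 is an assumption; use the name of the archive you actually downloaded):

hadoop@master:/$ sudo mv spark-2.4.3-bin-hadoop2.7/ /usr/hadoop/
hadoop@master:/$ cd /usr/hadoop
hadoop@master:/usr/hadoop$ sudo mv spark-2.4.3-bin-hadoop2.7/ spark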

Add Paths into $PATH

Open the .bashrc file:

hadoop@master:~$ vim .bashrc

In .bashrc file, insert the following lines:

export HADOOP_HOME=/usr/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export SPARK_HOME=/usr/hadoop/spark
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:~/.local/bin:$PATH

Source the file to apply the changes:

hadoop@master:~$ source .bashrc

Notice: You may need to restart your computer to apply the changes permanently.

Configure Spark

Configure the spark-defaults.conf file:

# Rename the templates into actual config files
hadoop@master:/usr/hadoop/spark$ mv conf/spark-defaults.conf.template conf/spark-defaults.conf
hadoop@master:/usr/hadoop/spark$ mv conf/spark-env.sh.template conf/spark-env.sh

#Open and edit the file
hadoop@master:/usr/hadoop$ vim spark/conf/spark-defaults.conf

#Contents to put into the file
spark.master yarn
spark.eventLog.enabled true

#Create the log directory in HDFS
hadoop@master:/usr/hadoop$ hdfs dfs -mkdir /spark-logs
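
Note that spark.eventLog.enabled on its own does not point Spark at the HDFS directory created above; a typical setup also sets spark.eventLog.dir (an addition not in the original steps, shown here as an assumption):

#Also in spark-defaults.conf
spark.eventLog.dir hdfs://master:9000/spark-logs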

Checking Spark version

hadoop@master:/usr/hadoop$ spark-shell --version

Setup Public Jupyter Notebook
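
If Jupyter is not installed yet, one common way to get it is via pip (this install step is an assumption; any installation method that puts jupyter on the PATH works):

hadoop@master:~$ sudo apt install python3-pip
hadoop@master:~$ pip3 install jupyter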

After the installation of jupyter:

hadoop@master:~$ jupyter notebook --generate-config
hadoop@master:~$ cd .jupyter/
hadoop@master:~/.jupyter/$ vim jupyter_notebook_config.py

Uncomment the following lines and adjust their values:

c.NotebookApp.ip = 'master'
c.NotebookApp.port = 9999
c.NotebookApp.allow_password_change = True

Notice: After changing the password and logging in for the first time, set allow_password_change to False or comment it out.

c.NotebookApp.allow_password_change = False
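
The login password itself can also be set (or reset) from the command line, assuming a reasonably recent Jupyter Notebook version:

hadoop@master:~$ jupyter notebook password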

Extras

Change Machines’ Username

The username should be the same on all machines, because when Hadoop connects to another machine it uses the current username as the default username on the remote machine.

#The username@hostname pair on each machine should look like this:
hadoop@master
hadoop@slave1
hadoop@slave2
hadoop@slave3

If the username was set incorrectly when installing the system, it can be changed with the following commands:

hadoop@slave1:~$ sudo passwd root
hadoop@slave1:~$ su -
root@slave1:~# usermod -l hadoop -d /home/hadoop -m slave1

Notice: These commands can only be run while the target user is logged out. Create a temporary user, log out of the user you want to rename, log in as the temporary user, and run the commands above from the temporary user's terminal.

Docker (Suspended - Not in use)

Guide to Install Docker

Sign Up for Docker

Log in to Docker:

hadoop@slave1:~$ sudo docker login
hadoop@slave1:~$ mkdir images

Guide to Use Hadoop Image

Pull -> Run
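
A minimal sketch of that pull-then-run flow (the image name sequenceiq/hadoop-docker is only an example of a public Hadoop image, not necessarily the one intended here):

hadoop@slave1:~$ sudo docker pull sequenceiq/hadoop-docker
hadoop@slave1:~$ sudo docker run -it sequenceiq/hadoop-docker /bin/bash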

Useful Commands

General Commands

#Format the namenode - ONLY RUN ONCE
hadoop@master:~$ hdfs namenode -format

#Start Service
hadoop@master:~$ start-all.sh

#Stop Service
hadoop@master:~$ stop-all.sh

#Get hdfs Report
hadoop@master:~$ hdfs dfsadmin -report

#Copy Files to Remote Computer
hadoop@slave1:~$ scp -r <folder_name> <remote_username>@<remote_hostname>:<remote_path>

#Change the ownership of the copied folder on the remote computer (chown takes user:group)
hadoop@slave1:~$ chown -R <username>:<group> <folder_name>
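
Another handy check is jps, which ships with the JDK installed earlier and lists the running Hadoop/YARN daemons on a node (the exact daemons shown depend on the node's role):

#Expect NameNode/ResourceManager on the master and DataNode/NodeManager on the workers
hadoop@master:~$ jps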

Resetting $PATH (in case $PATH is overwritten by mistake)

#Export the default PATH directly, or re-source the system default
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
hadoop@slave1:~$ source /etc/environment

References

Hadoop

Hadoop Environment Configuration

一步步教你Hadoop多节点集群安装配置 (Step-by-step guide to installing and configuring a multi-node Hadoop cluster)

Hadoop分布式集群搭建 (Building a distributed Hadoop cluster)

Hadoop系统完全分布式集群搭建方法 (How to set up a fully distributed Hadoop cluster)

Java JDK

How to install the JDK on Ubuntu Linux (OpenJDK)

Differences between OpenJDK and Oracle JDK

Spark

How to set up PySpark for your Jupyter notebook

Install Spark 2.3.x on YARN with Hadoop 3.x

RDD Programming Guide

Jupyter Notebook

Tutorial on setting up public jupyter notebook

HDFS

Using hdfs command line to manage files and directories on Hadoop

Docker

How To Install Docker On Ubuntu

Author: Zilan Huang
Link: http://hoanjinan.github.io/2019/08/19/Hadoop-Environment-Setup/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.