Single-Node Hadoop Installation on Ubuntu 16.04

When embarking on just about any new build, I tend to lean on how-to guides published by other bloggers who have hit the same issues. I recently completed a single-node Hadoop installation on a Linode Ubuntu box, after several unsuccessful attempts at a three-node setup, and am posting my steps here in case they help someone attempting the same thing.

I relied heavily on Parth Goel’s work and this guide follows it nearly verbatim.

Part 1: Provision the server, harden, and install Java

In my case, I have a number of Linode boxes running already and added one more, running Ubuntu 16.04. I also followed the guide for securing a Linode server.

After provisioning and booting up, I completed the following steps:

Set hostname in /etc/hosts (hadoop-master)
sudo nano /etc/hosts

Add the following line:

<your server ip> hadoop-master
Set hostname in /etc/hostname (hadoop-master)
sudo nano /etc/hostname

Replace the contents with the following line:

hadoop-master
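A quick way to confirm the /etc/hosts mapping took. This is a sketch run against a temp hosts-style file, with the documentation address 203.0.113.10 standing in for your server IP; on the real box you would grep /etc/hosts directly.

```shell
# Sketch: verify that a hosts-style file maps hadoop-master to an address.
# 203.0.113.10 is a placeholder for your server IP.
hosts_file=$(mktemp)
printf '127.0.0.1 localhost\n203.0.113.10 hadoop-master\n' > "$hosts_file"
if grep -qw 'hadoop-master' "$hosts_file"; then
  hosts_ok=yes
  echo "hadoop-master mapping found"
else
  hosts_ok=no
fi
rm -f "$hosts_file"
```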
Harden per Linode recommendations

Login to hadoop-master as root.

adduser hduser
adduser hduser sudo
sudo addgroup hadoop
sudo usermod -a -G hadoop hduser

Create an SSH key pair on your local machine (OS X for me) and copy it to the Linode box. On OS X, after creating the key pair, run the following command:

ssh-copy-id -i <key name> hduser@<server ip>

Next, disable root login, change the SSH port, disable IPv6, and disable password logins. You would be surprised at how many brute-force attacks a server is subjected to every minute. If you wind up locking yourself out, there is emergency LISH access.

sudo nano /etc/ssh/sshd_config

In the config file, you’ll change a few lines to look like this:

Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no

Save and exit the text editor. One last line restricts sshd to IPv4 (effectively disabling IPv6 for SSH); append it, then restart sshd:

echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd
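Before restarting sshd, it's worth double-checking that every hardening directive actually made it into the file. This sketch runs against a temp copy of the config so it is safe to try anywhere; on the server you would point cfg at /etc/ssh/sshd_config instead.

```shell
# Sketch: sanity-check the hardening directives against a temp copy of
# the config (on the server, set cfg=/etc/ssh/sshd_config instead).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no
AddressFamily inet
EOF
missing=0
for directive in 'Port 2222' 'PermitRootLogin no' 'PasswordAuthentication no' 'AddressFamily inet'; do
  grep -qx "$directive" "$cfg" || { echo "missing: $directive"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all hardening directives present"
rm -f "$cfg"
```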

Next, install UFW. In my case, I have a static IP I can connect from, so I whitelisted it; nothing else can hit the server. I suppose I could have skipped some of the hardening since only one IP address is whitelisted, but better safe than sorry. The last thing I want is my little sandbox being drafted into some DoS botnet.

sudo apt-get install ufw
sudo ufw allow from <vpn ip>
sudo ufw enable
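After enabling, sudo ufw status should report the firewall as active with a single ALLOW rule for your address. The check below parses a captured sample of that output (with 203.0.113.25 as a placeholder for your VPN IP) so it can be tried anywhere.

```shell
# Sketch: parse a captured `sudo ufw status` sample; on the server you
# would pipe the live command output instead. 203.0.113.25 is a placeholder.
status_sample='Status: active

To                         Action      From
--                         ------      ----
Anywhere                   ALLOW       203.0.113.25'
echo "$status_sample" | grep -q '^Status: active' && firewall_active=yes
echo "$status_sample" | grep -q 'ALLOW' && rule_present=yes
echo "firewall_active=$firewall_active rule_present=$rule_present"
```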
Install Java and reboot

This was one of the biggest issues I ran into with various online how-to guides: it was nearly impossible to get the combination of Hadoop, Ubuntu, and Java versions right, particularly when many guides relied on the Oracle JDK. This step uses the default and doesn't mess around with custom packages.

sudo apt-get update
sudo apt-get install default-jdk
sudo reboot

Part 2: Create localhost SSH access for Hadoop and install

Once the server boots up again, log in as hduser. The next step creates a key pair on hadoop-master. Accept the default filename and leave the passphrase prompts blank, otherwise you'll have more work ahead of you. Note that we add the SSH port to our SSH command, as we changed it earlier, and will have to add it to the Hadoop environment variables as well.

ssh-keygen -t rsa 

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

chmod 0600 ~/.ssh/authorized_keys 

ssh -p 2222 localhost
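If typing -p 2222 gets tedious, ssh also reads per-host settings from ~/.ssh/config; a minimal entry (assuming the port change above) would look like this:

```
Host localhost
    Port 2222
```

With that entry in place, a plain ssh localhost uses the new port.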

Once you’ve confirmed that works, exit the SSH session and get back to your hadoop-master hduser command line. Now it’s time to install Hadoop.

cd

wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xvzf hadoop-2.7.3.tar.gz

sudo mkdir -p /usr/local/hadoop

cd hadoop-2.7.3/

sudo mv * /usr/local/hadoop

sudo chown -R hduser:hadoop /usr/local/hadoop

Part 3: Hadoop configuration

Variables configuration

Here, you have to do some checking to make sure your Java library is what is expected.

update-alternatives --config java

In this case we are looking for /usr/lib/jvm/java-8-openjdk-amd64
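The value JAVA_HOME wants is the path from update-alternatives with the trailing jre/bin/java stripped off. A sketch, shown against a sample path rather than a live install; on the server you could resolve the real binary with readlink -f "$(command -v java)".

```shell
# Sketch: derive JAVA_HOME by stripping the trailing jre/bin/java from the
# resolved java binary path (sample path shown; use readlink -f on the server).
java_bin=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
java_home=${java_bin%/jre/bin/java}
echo "JAVA_HOME=$java_home"
```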

Edit your bashrc file first.

nano ~/.bashrc

Add the following:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
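The new variables only take effect in a reloaded shell. On the server you would simply run source ~/.bashrc and echo the values; the sketch below sources a temp copy of the key exports instead so it can be tried anywhere.

```shell
# Sketch: source the exports and confirm they resolve (temp copy stands in
# for ~/.bashrc; on the server: source ~/.bashrc && echo $HADOOP_HOME).
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
. "$envfile"
echo "HADOOP_HOME=$HADOOP_HOME"
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) path_ok=yes ;;
  *) path_ok=no ;;
esac
echo "PATH contains HADOOP_HOME/bin: $path_ok"
rm -f "$envfile"
```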

Now edit your hadoop-env file.

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Edit the following line to look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Add this line:

export HADOOP_SSH_OPTS="-p 2222"
Hadoop XML configuration files
core-site.xml
sudo mkdir -p /app/hadoop/tmp

sudo chown hduser:hadoop /app/hadoop/tmp

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Add the following.

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:54311</value>
<description> The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
</description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

sudo chown -R hduser:hadoop /usr/local/hadoop_store

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
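A stray bracket in any of the four XML files will keep the daemons from starting, so a quick well-formedness check is worth it before moving on. This sketch validates an inline sample; on the server you would loop over the real files under /usr/local/hadoop/etc/hadoop/ instead.

```shell
# Sketch: well-formedness check via python3's stdlib XML parser (python3
# ships with Ubuntu 16.04). Inline sample shown; point it at the real
# *-site.xml files on the server.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
EOF
if python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" "$sample"; then
  xml_ok=yes
  echo "well-formed"
else
  xml_ok=no
fi
rm -f "$sample"
```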

Part 4: Reboot and fire it up!

After rebooting (sudo reboot) and logging back in as hduser, format HDFS.

hdfs namenode -format

Start service.

start-dfs.sh

start-yarn.sh
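A healthy single-node start should leave five daemons running, which jps will list. The loop below checks a captured sample of jps output (PIDs are illustrative) so it can be tried anywhere; on the server you would substitute the live output of jps.

```shell
# Sketch: confirm the expected Hadoop daemons appear in jps output
# (sample shown with illustrative PIDs; on the server use: sample=$(jps)).
sample='2401 NameNode
2534 DataNode
2712 SecondaryNameNode
2890 ResourceManager
3015 NodeManager
3120 Jps'
daemons_ok=yes
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  echo "$sample" | grep -qw "$daemon" || { echo "not running: $daemon"; daemons_ok=no; }
done
echo "all daemons present: $daemons_ok"
```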

Test with a simple job.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5

You will want to visit http://<server-ip>:50070/ (NameNode) and http://<server-ip>:8088 (ResourceManager) to see the web consoles.