When embarking on a new build of almost anything, I tend to lean on how-to guides published by other bloggers who have already run into the same issues. I recently completed a single-node Hadoop installation on a Linode Ubuntu box, after several unsuccessful attempts at a three-node setup, and am posting my steps here in case they help someone attempting the same thing.
I relied heavily on Parth Goel’s work and this guide follows it nearly verbatim.
Part 1: Provision the server, harden, and install Java
In my case, I have a number of Linode boxes running already and added one more, running Ubuntu 16.04. I also followed the guide for securing a Linode server.
After provisioning and booting up, I completed the following steps:
Set hostname in /etc/hosts (hadoop-master)
sudo nano /etc/hosts
Add the following line:
<your server ip> hadoop-master
Set hostname in /etc/hostname (hadoop-master)
sudo nano /etc/hostname
Replace the contents with the following line:
hadoop-master
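The new hostname won't take effect until a reboot. If you'd rather not wait for the reboot later in this guide, something like this should apply it immediately on Ubuntu 16.04 (you can confirm afterward with hostname):
sudo hostnamectl set-hostname hadoop-master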
Harden per Linode recommendations
Login to hadoop-master as root.
adduser hduser
adduser hduser sudo
sudo addgroup hadoop
sudo usermod -a -G hadoop hduser
Create an SSH key pair on your local machine (OS X for me) and copy the public key to the Linode box.
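If you don't already have a key pair, something along these lines will create one (the key name is a placeholder; pick whatever you like):
ssh-keygen -t rsa -b 4096 -f ~/.ssh/<key name>
Then copy the public key over to the server: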
ssh-copy-id -i <key name> hduser@<server ip>
Next, disable root login, change the SSH port, restrict SSH to IPv4, and disable password login. You would be surprised at how many brute-force attacks a server is subjected to every minute. If you wind up locking yourself out, there is always emergency Lish access.
sudo nano /etc/ssh/sshd_config
In the config file, you’ll change a few lines to look like this:
Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no
Save and exit the text editor. One last line to restrict SSH to IPv4, then restart sshd:
echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd
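Before closing this session, it's worth confirming from a second terminal that you can still get in under the new settings (the key name and IP are the same placeholders as above):
ssh -p 2222 -i ~/.ssh/<key name> hduser@<server ip>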
Next, install UFW. In my case, I have a static IP I can connect from, so I whitelisted it. Nothing else can hit that server. I suppose I could have avoided all the hardening since I was going to only whitelist from one IP address, but better safe than sorry. The last thing I want is my little sandbox being used for some DoS bot attack.
sudo apt-get install ufw
sudo ufw allow from <vpn ip>
sudo ufw enable
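To double-check what UFW is actually enforcing:
sudo ufw status verbose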
Install Java and reboot
This was one of the biggest issues I ran into with various online how-to guides. It was nearly impossible to get the combination of Hadoop, Ubuntu, and Java versions right, particularly when many guides relied on the Oracle JDK. This step uses the distribution's default JDK and doesn't mess around with custom packages.
sudo apt-get update
sudo apt-get install default-jdk
sudo reboot
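When the box comes back, it's worth verifying the JDK before moving on. On Ubuntu 16.04 the default JDK should be OpenJDK 8, which is what the paths later in this guide assume:
java -version
readlink -f $(which java)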
Part 2: Create localhost SSH access for Hadoop and install
Once this server boots up again, log in as hduser. The next step will create a key pair on hadoop-master. Leave the filename and passphrase prompts blank; otherwise you'll have more work ahead of you. Note that we specify the SSH port in the ssh command, since we changed it earlier, and we will have to add it to the Hadoop environment variables as well.
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh -p 2222 localhost
Once you’ve confirmed that works, exit the SSH session and get back to your hadoop-master hduser command line. Now it’s time to install Hadoop.
cd
wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xvzf hadoop-2.7.3.tar.gz
sudo mkdir -p /usr/local/hadoop
cd hadoop-2.7.3/
sudo mv * /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop
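Nothing is on the PATH yet (that comes in the next part), but a quick listing confirms everything landed in the right place; you should see directories like bin, etc, sbin, and share:
ls /usr/local/hadoop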
Part 3: Hadoop configuration
Variables configuration
First, check where your Java installation lives so it matches the paths used below.
update-alternatives --config java
In this case we are looking for /usr/lib/jvm/java-8-openjdk-amd64
Edit your bashrc file first.
sudo nano ~/.bashrc
Add the following:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
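Reload the file so the new variables take effect in the current session; the hadoop binary should then resolve on the PATH:
source ~/.bashrc
hadoop version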
Now edit your hadoop-env file.
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Edit the following line to look like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Add this line:
export HADOOP_SSH_OPTS="-p 2222"
Hadoop XML configuration files
core-site.xml
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following.
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
mapred-site.xml
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
Add the following.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
hdfs-site.xml
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_store
sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
yarn-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Add the following.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Part 4: Reboot and fire it up!
Reboot the server one more time (sudo reboot), log back in as hduser, and format HDFS.
hdfs namenode -format
Start service.
start-dfs.sh
start-yarn.sh
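A quick way to confirm everything came up is jps; on a single-node setup like this you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager listed:
jps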
Test with a simple job.
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5
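The job should finish by printing an estimated value of Pi. If you want to poke at HDFS directly as well, the usual client commands work now, for example:
hdfs dfs -mkdir -p /user/hduser
hdfs dfs -ls /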
You will want to visit http://<server-ip>:50070/ (the NameNode web UI) and http://<server-ip>:8088 (the YARN ResourceManager UI) to see the consoles.
Most content also appears on my LinkedIn page.