Running Hadoop on Ubuntu Linux (Single-Node Cluster)
In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.
Are you looking for the multi-node cluster tutorial? Just head over there.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates
features similar to those of the Google File System (GFS) and of the
MapReduce computing paradigm.
Hadoop’s HDFS is a highly fault-tolerant distributed file
system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to
application data and is suitable for applications that have large data sets.

The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software versions:
Prerequisites
Java 7 or above (the commands below install OpenJDK 7)
$ sudo apt-get update
$ sudo apt-get install openjdk-7-jdk
$ java -version
Adding a dedicated Hadoop system user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication.

First, we have to generate an SSH key for the hduser user.

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not
recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter
the passphrase every time Hadoop interacts with its nodes).

Second, you have to enable SSH access to your local machine with this newly created key.
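A minimal way to do this, assuming the key was saved to the default location in the previous step, is to append the new public key to the hduser user's authorized_keys file:

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys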
The final step is to test the SSH setup by connecting to your local machine as the hduser user. This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file. If you
have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific
SSH options in $HOME/.ssh/config
(see man ssh_config
for more information).
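To run the test, connect to localhost as the hduser user; a sketch of what the first connection typically looks like (the fingerprint and banner will differ on your machine):

hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is <fingerprint of your host key>.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
...
hduser@ubuntu:~$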
If the SSH connection fails, these general tips might help:
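For example (standard SSH troubleshooting steps, not specific to Hadoop):

# Run the SSH client in verbose mode to see where the connection or key authentication fails
hduser@ubuntu:~$ ssh -vvv localhost

# Check the SSH server configuration in /etc/ssh/sshd_config (e.g. that PubkeyAuthentication is set to yes) and reload sshd after any change
$ sudo service ssh restart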
Hadoop Installation
Download Hadoop from the
Apache Download Mirrors and extract the contents of the Hadoop
package to a location of your choice. I picked
/usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:
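A sketch of these steps, assuming a Hadoop 1.x tarball named hadoop-1.x.y.tar.gz has already been downloaded to /usr/local (substitute the file name of the release you actually downloaded):

$ cd /usr/local
$ sudo tar xzf hadoop-1.x.y.tar.gz
$ sudo mv hadoop-1.x.y hadoop
$ sudo chown -R hduser:hadoop hadoop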
Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.
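A minimal sketch of such additions; the JAVA_HOME path is an assumption for OpenJDK 7 on 64-bit Ubuntu and may differ on your system:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (assumed OpenJDK 7 path; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Add the Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin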
Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to your JDK/JRE installation directory. Change conf/hadoop-env.sh accordingly.
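For example, if you installed OpenJDK 7 as described above, the relevant line might look like this (the exact JVM directory is an assumption that depends on your system and architecture):

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64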
conf/*-site.xml

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens on, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

You can leave the settings below “as is”, with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don't be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

Now we create the directory and set the required ownerships and permissions:
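For example (the chmod step just tightens the permissions and is optional):

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, restrict the directory permissions:
$ sudo chmod 750 /app/hadoop/tmp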
(If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.)

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:
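A sketch of the relevant properties: hadoop.tmp.dir matches the directory created above, while the fs.default.name port (54310) is only an example value you may change:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system (HDFS on localhost).</description>
</property>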
In file conf/mapred-site.xml:
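A sketch of the JobTracker setting; the port (54311) is only an example value:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>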
In file conf/hdfs-site.xml:
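A sketch of the replication setting; a value of 1 is the usual choice for a single-node setup:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>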
Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir
variable), run the
command
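For the installation path used in this tutorial, the call looks like this:

hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format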
Starting your single-node cluster
Run the command:
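Assuming the installation path used in this tutorial:

hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh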
The output will show each daemon (NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker) starting up, together with the path of its log file.
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun's Java since v1.5.0). See also How to debug MapReduce programs.
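A sketch of what jps might report after a successful start (the process IDs are purely illustrative):

hduser@ubuntu:/usr/local/hadoop$ jps
1788 NameNode
1938 DataNode
2085 SecondaryNameNode
2149 JobTracker
2287 TaskTracker
2349 Jps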
You can also check with netstat whether Hadoop is listening on the configured ports.
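For example (the exact port list depends on your configuration):

hduser@ubuntu:~$ sudo netstat -plten | grep java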
If there are any errors, examine the log files in the logs/ directory of your Hadoop installation (/usr/local/hadoop/logs/ in this tutorial).

Stopping your single-node cluster
Run the command
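Assuming the installation path used in this tutorial, bin/stop-all.sh stops all the daemons running on your machine:

hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh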
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files.

By default, it's available at http://localhost:50070/.