Thursday, September 25, 2014

Hadoop Installation

Running Hadoop on Ubuntu Linux (Single-Node Cluster)

  In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.

Are you looking for the multi-node cluster tutorial? Just head over there.
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software versions:
  • Ubuntu Linux 10.04 LTS
  • Hadoop 1.0.3

Prerequisites

 Java 7 or above (this tutorial installs OpenJDK 7)

$ sudo apt-get update  
$ sudo apt-get install openjdk-7-jdk
$ java -version
 

Adding a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While this is not required, it helps to separate the Hadoop installation from other software running on the same machine:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

Configuring SSH

 Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication.
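If SSH is not installed or running yet, on Ubuntu it can typically be installed via the openssh-server package:

$ sudo apt-get install openssh-server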

First, we have to generate an SSH key for the hduser user.

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
 
The second command creates an RSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.


hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hduser user. This step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine, such as a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
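As a small sketch, a host-specific entry for a hypothetical non-standard SSH port 2222 would look like this in $HOME/.ssh/config:

Host localhost
    Port 2222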

hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
hduser@ubuntu:~$
 
If the SSH connection fails, these general tips might help:
  • Enable debugging with ssh -vvv localhost and investigate the error in detail.
  • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
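As a rough sketch, the relevant lines in /etc/ssh/sshd_config would look something like this (the AllowUsers line only matters if your server restricts logins to specific users):

PubkeyAuthentication yes
AllowUsers hduser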

Hadoop

Installation

Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example (assuming the downloaded hadoop-1.0.3.tar.gz archive is sitting in /usr/local):
 
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
 

Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.


# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64  # use java-7-openjdk-i386 on 32-bit systems

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
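
Keep in mind that these settings only take effect in newly started shells; either log out and back in as hduser, or reload the file and do a quick sanity check:

hduser@ubuntu:~$ source $HOME/.bashrc
hduser@ubuntu:~$ echo $HADOOP_HOME
/usr/local/hadoop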
 

Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the directory of the JDK we installed earlier (OpenJDK 7 in this tutorial).
Change

conf/hadoop-env.sh

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

conf/hadoop-env.sh

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

conf/*-site.xml

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.
You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter, which you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions:




$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp
If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
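A minimal sketch of the required properties for this setup is shown below; hadoop.tmp.dir matches the directory we created above, and fs.default.name points the default file system at HDFS on localhost, port 54310 (the same NameNode port that shows up in the netstat output later in this tutorial):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>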






In file conf/mapred-site.xml:
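A minimal sketch; localhost:54311 is the JobTracker address, matching the port that shows up in the netstat output later in this tutorial:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>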




 


  
 
In file conf/hdfs-site.xml:
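A minimal sketch; dfs.replication is set to 1 because our single-node “cluster” has only one DataNode:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>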



   





Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command



hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
The output will look like this:


hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$

Starting your single-node cluster

Run the command:



hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker on your machine.
The output will look like this:


hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.


hduser@ubuntu:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.




hduser@ubuntu:~$ sudo netstat -plten | grep java
tcp   0  0 0.0.0.0:50070   0.0.0.0:*  LISTEN  1001  9236  2471/java
tcp   0  0 0.0.0.0:50010   0.0.0.0:*  LISTEN  1001  9998  2628/java
tcp   0  0 0.0.0.0:48159   0.0.0.0:*  LISTEN  1001  8496  2628/java
tcp   0  0 0.0.0.0:53121   0.0.0.0:*  LISTEN  1001  9228  2857/java
tcp   0  0 127.0.0.1:54310 0.0.0.0:*  LISTEN  1001  8143  2471/java
tcp   0  0 127.0.0.1:54311 0.0.0.0:*  LISTEN  1001  9230  2857/java
tcp   0  0 0.0.0.0:59305   0.0.0.0:*  LISTEN  1001  8141  2471/java
tcp   0  0 0.0.0.0:50060   0.0.0.0:*  LISTEN  1001  9857  3005/java
tcp   0  0 0.0.0.0:49900   0.0.0.0:*  LISTEN  1001  9037  2785/java
tcp   0  0 0.0.0.0:50030   0.0.0.0:*  LISTEN  1001  9773  2857/java
hduser@ubuntu:~$
If there are any errors, examine the log files in the logs/ directory of your Hadoop installation (/usr/local/hadoop/logs/ in this tutorial).
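For example, to look at the most recent NameNode messages (the .log file names follow the same hadoop-hduser-<daemon>-ubuntu pattern as the .out files shown in the start-all.sh output above):

hduser@ubuntu:/usr/local/hadoop$ tail -n 50 logs/hadoop-hduser-namenode-ubuntu.log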

Stopping your single-node cluster

Run the command




hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
to stop all the daemons running on your machine.
 

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
  • http://localhost:50070/ - web UI of the NameNode daemon
  • http://localhost:50030/ - web UI of the JobTracker daemon
  • http://localhost:50060/ - web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.

NameNode Web Interface (HDFS layer)

The NameNode web UI shows you a cluster summary, including information about total/remaining capacity and live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.
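If you prefer the command line, a similar summary (configured capacity, DFS space used and remaining, live and dead DataNodes) can be printed with the dfsadmin tool:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfsadmin -report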



