Monday, November 4, 2013

Hadoop Core (HDFS and YARN) Components Explained

It's critical to understand the core components of Hadoop YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0, and how those components interact with each other in the system. The following tutorial explains each component, and there are reference links at the bottom you can follow to read up on more details.

If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide

NameNode (Hadoop FileSystem Component)

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
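As a quick illustration of what the NameNode keeps track of, here is a minimal sketch (it assumes HDFS is already running and that the file /user/hduser/someFile.txt from the HDFS tutorial below exists) that asks the NameNode for the blocks of a file and the DataNodes holding them:

# Ask the NameNode which blocks make up the file and where the replicas live
# (the path is only an example; use any file you have in HDFS)
$ hdfs fsck /user/hduser/someFile.txt -files -blocks -locations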


DataNode (Hadoop FileSystem Component)

A DataNode stores the actual data in HDFS. A functional filesystem typically has more than one DataNode in the cluster, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.
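To see which DataNodes the NameNode currently knows about, along with their capacity and usage, you can run the report command below (a minimal sketch, assuming HDFS is running and you are logged in as the HDFS user, e.g. hduser from the setup guide):

# Print the NameNode's view of the cluster: live/dead DataNodes, capacity, usage
$ hdfs dfsadmin -report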



A quick-start tutorial on HDFS can be found in Hadoop FileSystem (HDFS) Tutorial 1


Application Submission in YARN

1. The application submission client submits an application to the YARN ResourceManager. The client needs to provide sufficient information to the ResourceManager in order to launch the ApplicationMaster.

2. The YARN ResourceManager starts the ApplicationMaster.

3. The ApplicationMaster then communicates with the ResourceManager to request resource allocation.

4. After a container is allocated to it, the ApplicationMaster communicates with the NodeManager to launch the tasks in the container. You can watch this flow from the command line, as shown below.
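A hedged way to observe these steps is with the yarn command-line client (this assumes YARN is running and an application, for example the pi job from the setup guide, has been submitted; the application ID below is only a placeholder):

# List applications known to the ResourceManager and their current states
$ yarn application -list

# Show details (state, ApplicationMaster host, tracking URL) for one application;
# application_1382544444444_0001 is a placeholder id, use one from the list above
$ yarn application -status application_1382544444444_0001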


Resource Manager (YARN Component)

The function of the ResourceManager is simple: keeping track of available resources in the cluster. There is one ResourceManager per cluster, and it contains two main components: the Scheduler and the ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for the ApplicationMaster, and providing the service that restarts the ApplicationMaster container on failure.
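To see the nodes and resources the ResourceManager is tracking, you can list the NodeManagers registered with it (a small sketch, assuming the single-node cluster from the setup guide is running):

# Ask the ResourceManager for the registered nodes and their states
$ yarn node -list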


Application Master (YARN Component)

An ApplicationMaster is created for each application running in the cluster. It provides task-level scheduling and monitoring for that application.
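For a MapReduce job, the ApplicationMaster is the MRAppMaster process. A rough way to see it (assuming a job such as the pi example from the setup guide is currently running) is to run jps while the job is in progress; an extra MRAppMaster JVM appears alongside the regular daemons and disappears when the job finishes:

# Run while a MapReduce job is in progress; the MRAppMaster JVM is that job's ApplicationMaster
$ jps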


Node Manager (YARN Component)

The NodeManager is the per-machine framework agent that creates containers for tasks. The containers can have variable resource sizes, and the tasks can be any type of computation, not just map/reduce tasks. The NodeManager then monitors the resource usage (CPU, memory, disk, network) of its containers and reports it to the ResourceManager.
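To inspect a single NodeManager's containers and reported resource usage, you can query it through the ResourceManager (a sketch only; the node ID below is a placeholder, copy the real one from the output of yarn node -list):

# Show running containers, memory used and memory capacity for one NodeManager
# ubuntu:45454 is a placeholder node id taken from `yarn node -list`
$ yarn node -status ubuntu:45454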

Reference Links

Apache Hadoop NextGen MapReduce (YARN)
Yahoo Hadoop Tutorial
More reference links to be added...


Please feel free to leave me any comments or suggestions below.

Sunday, October 27, 2013

Hadoop FileSystem (HDFS) Tutorial 1

In this tutorial I will show some common commands for HDFS operations.
If you don't have Hadoop set up on your Linux machine, you can follow the Hadoop Setup Guide

Log into Linux; "hduser" is the login used in the following examples.

Start Hadoop if it's not running
$ start-dfs.sh
....
$ start-yarn.sh
Create someFile.txt in your home directory
hduser@ubuntu:~$ vi someFile.txt

Paste any text you want into the file and save it.
Create Home Directory In HDFS (If it doesn't exist)
hduser@ubuntu:~$ hadoop fs -mkdir -p /user/hduser
Copy file someFile.txt from local disk to the user’s directory in HDFS.
hduser@ubuntu:~$ hadoop fs -copyFromLocal someFile.txt someFile.txt
Get a directory listing of the user’s home directory in HDFS
hduser@ubuntu:~$ hadoop fs -ls


Found 1 items
-rw-r--r--   1 hduser supergroup          5 2013-10-27 17:57 someFile.txt

Display the contents of the HDFS file /user/hduser/someFile.txt
hduser@ubuntu:~$ hadoop fs -cat /user/hduser/someFile.txt
Get a directory listing of the HDFS root directory
hduser@ubuntu:~$ hadoop fs -ls /
Copy that file to the local disk, naming it someFile2.txt
hduser@ubuntu:~$ hadoop fs -copyToLocal /user/hduser/someFile.txt someFile2.txt
Delete the file from HDFS
hduser@ubuntu:~$ hadoop fs -rm someFile.txt

Deleted someFile.txt


For a full list of commands, please visit HDFS FileSystem Shell Commands. Please feel free to leave me any comments or suggestions.

Monday, October 21, 2013

Setup newest Hadoop 2.x (2.2.0) on Ubuntu

In this tutorial I am going to guide you through setting up a Hadoop 2.2.0 environment on Ubuntu.

Prerequisites

$ sudo apt-get install openjdk-7-jdk
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
$ cd /usr/lib/jvm
$ sudo ln -s java-7-openjdk-amd64 jdk

$ sudo apt-get install openssh-server

Add Hadoop Group and User

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
After the user is created, log back into Ubuntu as hduser

Setup SSH Certificate

$ ssh-keygen -t rsa -P ''
...
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

Download Hadoop 2.2.0

$ cd ~
$ wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.2.0 hadoop
$ sudo chown -R hduser:hadoop hadoop

Setup Hadoop Environment Variables

$ cd ~
$ vi .bashrc

Paste the following at the end of the file

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
###end of paste

$ cd /usr/local/hadoop/etc/hadoop
$ vi hadoop-env.sh

#modify JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk/
Log back into Ubuntu as hduser and check the Hadoop version
$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
At this point, Hadoop is installed.

Configure Hadoop

$ cd /usr/local/hadoop/etc/hadoop
$ vi core-site.xml
#Paste following between <configuration>


<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>

$ vi yarn-site.xml
#Paste following between <configuration>


<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>



$ mv mapred-site.xml.template mapred-site.xml
$ vi mapred-site.xml
#Paste following between <configuration>


<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>



$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode
$ cd /usr/local/hadoop/etc/hadoop
$ vi hdfs-site.xml
Paste the following between the <configuration> tags


<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/home/hduser/mydata/hdfs/namenode</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/home/hduser/mydata/hdfs/datanode</value>
</property>

Format Namenode

hduser@ubuntu40:~$ hdfs namenode -format

Start Hadoop Service

$ start-dfs.sh
....
$ start-yarn.sh
....

hduser@ubuntu40:~$ jps
If everything is successful, you should see the following services running
2583 DataNode
2970 ResourceManager
3461 Jps
3177 NodeManager
2361 NameNode
2840 SecondaryNameNode

Run Hadoop Example

hduser@ubuntu:~$ cd /usr/local/hadoop
hduser@ubuntu:/usr/local/hadoop$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5

Number of Maps  = 2
Samples per Map = 5
13/10/21 18:41:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/10/21 18:41:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/10/21 18:41:04 INFO input.FileInputFormat: Total input paths to process : 2
13/10/21 18:41:04 INFO mapreduce.JobSubmitter: number of splits:2
13/10/21 18:41:04 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
...

Note: ericduq has created a shell script (make-single-node.sh) for this setup; it is available in the git repo at https://github.com/ericduq/hadoop-scripts.

What to read next
Hadoop FileSystem (HDFS) Tutorial 1
Hadoop 2.x Core (HDFS and YARN) Components Explained
Hadoop Wordcount example

Feel free to leave comments below. I will add more Hadoop tutorials regularly.

Friday, October 18, 2013

Hadoop WordCount with new map reduce api

There are many versions of the WordCount Hadoop example floating around the web. However, a lot of them use the older version of the Hadoop API. The following is an example of word count using the newest Hadoop MapReduce API, which resides in the org.apache.hadoop.mapreduce package instead of org.apache.hadoop.mapred.

WordMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
 private Text word = new Text();
 private final static IntWritable one = new IntWritable(1);
 
 @Override
 public void map(Object key, Text value,
   Context context) throws IOException, InterruptedException {
  // Break line into words for processing
  StringTokenizer wordList = new StringTokenizer(value.toString());
  while (wordList.hasMoreTokens()) {
   word.set(wordList.nextToken());
   context.write(word, one);
  }
 }
}

SumReducer.java

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;



public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 
 private IntWritable totalWordCount = new IntWritable();
 
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
  int wordCount = 0;
  Iterator<IntWritable> it=values.iterator();
  while (it.hasNext()) {
   wordCount += it.next().get();
  }
  totalWordCount.set(wordCount);
  context.write(key, totalWordCount);
 }
}

WordCount.java (Driver)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
 public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.out.println("usage: [input] [output]");
          System.exit(-1);
        }
  
  
        Job job = Job.getInstance(new Configuration());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordMapper.class); 
        job.setReducerClass(SumReducer.class);  

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(WordCount.class);

        // submit() returns immediately; use job.waitForCompletion(true)
        // instead if you want the driver to block until the job finishes
        job.submit();
 }
}
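To try the job end to end, here is a hedged build-and-run sketch (the class directory, jar name and HDFS paths are just examples, and it assumes Hadoop 2.2 is on your PATH as in the setup guide):

# Compile the three classes against the Hadoop client libraries
$ mkdir wordcount_classes
$ javac -classpath "$(hadoop classpath)" -d wordcount_classes WordMapper.java SumReducer.java WordCount.java
$ jar cvf wordcount.jar -C wordcount_classes/ .

# Put some input into HDFS and run the job (paths are examples)
$ hadoop fs -mkdir -p /user/hduser/wc/input
$ hadoop fs -copyFromLocal someFile.txt /user/hduser/wc/input
$ hadoop jar wordcount.jar WordCount /user/hduser/wc/input /user/hduser/wc/output

# The driver uses job.submit(), so the command above returns as soon as the job
# is submitted; wait for it to finish (e.g. check `yarn application -list`)
# before reading the output
$ hadoop fs -cat /user/hduser/wc/output/part-r-00000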

Monday, April 15, 2013

Java IO vs NIO

Java NIO (New IO) was introduced in JDK 1.4. Java IO centers around Stream/Reader/Writer and uses the decorator pattern as its main design, where you decorate one stream type with another in the right order. NIO, on the other hand, uses Channel/Buffer/Selector.

Read a file using IO


import java.io.*;

File file = new File("C:\\test\\test.txt");
FileInputStream fis = new FileInputStream(file);

//decorate the InputStreamReader with a BufferedReader
BufferedReader d = new BufferedReader(new InputStreamReader(fis));
String line = null;
while((line = d.readLine()) != null){
         System.out.println(line);
}
d.close();

Read a file using NIO

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

RandomAccessFile rFile = new RandomAccessFile("C:\\test\\test.txt", "rw");

//get Channel
FileChannel inChannel = rFile.getChannel();

//Initialize byte buffer
ByteBuffer buf = ByteBuffer.allocate(100);

//read data into buffer
while(inChannel.read(buf) > 0){
 buf.flip(); //flip into read mode
 while(buf.hasRemaining()){
  System.out.print((char)buf.get());
 }
 buf.clear(); //switch back to write mode for the next read
}
rFile.close();

Monday, October 15, 2012

Customize Spring JSON output

Spring uses Jackson JSON from http://jackson.codehaus.org/ to serialize an object for JSON output. To change the default strategy used to serialize/format a field in a bean, you can easily create your own serializer. The following is an example of customizing the date format in JSON output.

1. Create the serializer

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.codehaus.jackson.JsonGenerator;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.map.JsonSerializer;
import org.codehaus.jackson.map.SerializerProvider;

public class CustomDateSerializer extends JsonSerializer<Date> {
 public static Log logger = LogFactory.getLog(CustomDateSerializer.class);

 @Override
 public void serialize(Date value, JsonGenerator gen, SerializerProvider provider)
   throws IOException, JsonProcessingException {

  SimpleDateFormat formatter = new SimpleDateFormat("MM-dd-yyyy");
  String formattedDate = formatter.format(value);

  gen.writeString(formattedDate);
 }
}

2. Annotate bean property to use the serializer


import java.util.Date;

import org.codehaus.jackson.map.annotate.JsonSerialize;

public class TransactionHistoryBean{
  private Date date;

  @JsonSerialize(using = CustomDateSerializer.class)
  public Date getDate() {
      return date;
  }
}

Monday, June 11, 2012

Spring Session Scope Bean

Since HTTP is stateless, in a web application a typical way to keep a user's state across multiple requests is through HttpSession. However, doing it this way gives your application a tight dependency on HttpSession and its application container. With Spring, carrying state across multiple requests can be done through its session-scoped beans, which is cleaner and more flexible. A Spring session-scoped bean is very easy to set up; I will try to explain it in a couple of steps.

Scenario: Whenever a user enters your site, he is asked to pick a theme to use while navigating the site.
Solution: Store the user's selection in a Spring session-scoped bean

1. Annotate your bean

@Component
@Scope(value = "session", proxyMode = ScopedProxyMode.TARGET_CLASS)
public class UserSelection{
   private String selectedTheme;
   //and other session values

   public void setSelectedTheme(String theme){
        this.selectedTheme = theme;
   }

   public String getSelectedTheme(){
       return this.selectedTheme;
   }

}

2. Inject your session scoped bean

@Controller
@RequestMapping("/")
public class SomeController{
   private UserSelection userSelection;

   @Resource
   public void setUserSelection(UserSelection userSelection){
         this.userSelection=userSelection;
   }

   @RequestMapping("saveThemeChoice")
   public void saveThemeChoice(@RequestParam String selectedTheme){
        userSelection.setSelectedTheme(selectedTheme);
   }

   
   public String doOperationBasedOnTheme(){
          String theme = userSelection.getSelectedTheme();

          //operation codes
          return theme;
   }

}
In the above code snippet, although 'SomeController' has singleton scope (the default scope of a Spring bean), each session will have its own instance of UserSelection. Spring makes the judgment and injects a new instance of UserSelection if the request belongs to a new session.

Required jars

Besides the Spring libraries, the aopalliance and cglib jars are also required:
http://mvnrepository.com/artifact/cglib/cglib-nodep
http://mvnrepository.com/artifact/aopalliance/aopalliance