Category Archives: Hadoop – MapReduce Code

Inverted Index – Mapreduce program

What is an inverted index? In computer science, an inverted index (also referred to as a postings file or inverted file) is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents. Read more here. Input files… Read More »
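To make the definition concrete, here is a minimal plain-Java sketch of the idea, without any Hadoop classes. It is not the code from the post (the excerpt is truncated before it); the class name `InvertedIndexSketch`, the file names and the sample text are all assumptions made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the original post.
final class InvertedIndexSketch {

    /** Maps each lower-cased word to the sorted set of document names that contain it. */
    static Map<String, Set<String>> build(Map<String, String> documents) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // Split on runs of non-word characters; skip the empty token a leading separator produces.
            for (String word : doc.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample documents (document name -> content).
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "cat dog apple");
        docs.put("file2.txt", "apple orange cat");
        build(docs).forEach((word, files) -> System.out.println(word + "\t" + files));
    }
}
```

In a MapReduce version, the inner loop would roughly correspond to the mapper emitting (word, document name) pairs, and the set-building step to the reducer collecting the document names for each word.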

Custom partitioner in mapreduce – using new hadoop api 2

This is an example of a custom partitioner for the classic wordcount program. Driver Class: We partition keys by their first letter, so there are 27 partitions: one for each of the 26 letters, plus one for all other characters. The Driver class needs these additional settings: job.setNumReduceTasks(27); job.setPartitionerClass(WordcountPartitioner.class); package org.puneetha.customPartitioner; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileSystem; import… Read More »
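The first-letter rule described above can be sketched as a plain Java function. This is only an illustration of the bucketing logic under the 27-partition assumption from the text, not the post's `WordcountPartitioner` (which is cut off in the excerpt); the class and method names are hypothetical. In a real job, the same logic would live in the `getPartition` method of a class extending Hadoop's `Partitioner`.

```java
// Hypothetical helper, not part of the original post.
final class FirstLetterPartitionSketch {

    /** 26 letter buckets (0..25) plus one bucket (26) for every other first character. */
    static final int NUM_PARTITIONS = 27;

    /**
     * Returns the partition index for a key.
     * The modulo keeps the result in range if the job runs with fewer reduce tasks.
     */
    static int partitionFor(String key, int numPartitions) {
        int bucket = 26; // default: "other characters" (also used for an empty key)
        if (!key.isEmpty()) {
            char first = key.charAt(0);
            if (first >= 'a' && first <= 'z') {
                bucket = first - 'a';
            } else if (first >= 'A' && first <= 'Z') {
                bucket = first - 'A';
            }
        }
        return bucket % numPartitions;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"apple", "Cat", "zebra", "42"}) {
            System.out.println(word + " -> partition " + partitionFor(word, NUM_PARTITIONS));
        }
    }
}
```

Note that a partitioner must never return an index greater than or equal to the number of reduce tasks, which is why the reducer count and the number of buckets need to agree.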

Pattern matching for files within a Mapreduce program – given hdfs path – using new api 2

Driver Class: package org.puneetha.patternMatching; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.util.GenericOptionsParser; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; public class WordcountDriver extends Configured implements Tool { public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf()); /* * … Other Driver class code …… Read More »
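Since the excerpt stops before the matching code, here is a small plain-Java sketch of the general idea: keep only the paths whose file name matches a regular expression. The class name, the sample paths and the pattern are assumptions for illustration, not the post's actual code. In a real driver, the candidate paths would presumably come from `FileSystem.listStatus(...)` and each match would be registered with `FileInputFormat.addInputPath(job, path)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class PathPatternFilterSketch {

    /** Returns the paths whose last component (the file name) fully matches the regex. */
    static List<String> matching(List<String> paths, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String path : paths) {
            // Everything after the last '/' is treated as the file name.
            String fileName = path.substring(path.lastIndexOf('/') + 1);
            if (pattern.matcher(fileName).matches()) {
                selected.add(path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up paths; a real job would list these from HDFS.
        List<String> paths = Arrays.asList(
                "/data/in/log_2014.txt",
                "/data/in/notes.md",
                "/data/in/log_2015.txt");
        System.out.println(matching(paths, "log_\\d{4}\\.txt"));
    }
}
```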

Rename reducer output part file – using Mapreduce code (with new hadoop api 2)

Below is the code to rename the reducer output part files from “part-*” to “customName-*”. I am using the classic wordcount example (you can check out the basic implementation here). Driver Class: In the Driver class: LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); – avoids creating empty default part files. MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, IntWritable.class); – for adding new name… Read More »

Wordcount Mapreduce program – using Hadoop new API 2

Below is the classic wordcount example, using the new API. If you are using Maven, you can use the pom.xml given here; change it to match the Hadoop distribution/version you are using. Input Text: $ vim input.txt cat dog apple cat horse orange apple $ hadoop fs -mkdir -p /user/dummyuser/wordcount/input $ hadoop fs -put input.txt /user/dummyuser/wordcount/input/ Driver Class: package… Read More »
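For readers who want to check what the job should produce before running it on a cluster, here is a plain-Java sketch of the counting logic applied to the sample words above. It is not the Hadoop driver, mapper or reducer from the post (the excerpt is truncated before them); the class name is hypothetical, and putting all the words on a single line is an assumption, since the excerpt does not preserve the line breaks of input.txt.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original post.
final class WordCountSketch {

    /** Counts whitespace-separated tokens across all lines; keys are sorted alphabetically. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: split the line into tokens.
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // blank line
                }
                // "Reduce" step: sum a 1 for every occurrence of the token.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "cat dog apple cat horse orange apple";
        // prints {apple=2, cat=2, dog=1, horse=1, orange=1}
        System.out.println(count(Collections.singletonList(sample)));
    }
}
```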