I am trying to get a summary of a CSV file whose first line is the header. Is there a way, from the Java code, to emit each column value with its header name as a key-value pair?
E.g., the input file looks like:
A,B,C,D
1,2,3,4
5,6,7,8
I want the output from the mapper to be (A,1),(B,2),(C,3),(D,4),(A,5),...
Note: I tried overriding the run function in the Mapper class to skip the first line, but as far as I know the run function gets called for each input split, so it does not suit my need. Any help on this will really be appreciated.
This is the way my mapper looks:
// columnName and header are instance fields; header is populated from the header line in run()
private Text columnName = new Text();
private String[] header;

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] splits = line.split(",", -1);
    for (int i = 0; i < splits.length; i++) {
        columnName.set(header[i]);
        context.write(columnName, new Text(splits[i]));
    }
}
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        // Treat the first record of this split as the header line
        if (context.nextKeyValue()) {
            Text columnHeader = context.getCurrentValue();
            header = columnHeader.toString().split(",");
        }
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}
I assume that the column headers are letters and the column values are numbers.
One of the ways to achieve this is to use DistributedCache.
Following are the steps:
Create a file containing the column headers.
In the Driver code, add this file to the distributed cache, by calling Job::addCacheFile()
In the setup() method of the mapper, access this file from the distributed cache. Parse and store the contents of the file in a columnHeader list.
In the map() method, check if the values in each record match the headers (stored in the columnHeader list). If yes, then ignore that record (because the record just contains the headers). If no, then emit the values along with the column headers.
This is how the Mapper and Driver code look:
Driver:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "HeaderParser");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(HeaderParserMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.addCacheFile(new URI("/in/header.txt#header.txt"));
    FileInputFormat.addInputPath(job, new Path("/in/in7.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/out/"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Driver Logic:
Copy "header.txt" (which contains just one line: A,B,C,D) to HDFS
In the Driver, add "header.txt" to distributed cache, by executing following statement:
job.addCacheFile(new URI("/in/header.txt#header.txt"));
Mapper:
public static class HeaderParserMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    String[] headerList;
    String header;

    @Override
    protected void setup(Mapper.Context context) throws IOException, InterruptedException {
        // "header.txt" is available in the task's working directory via the distributed cache
        BufferedReader bufferedReader = new BufferedReader(new FileReader("header.txt"));
        header = bufferedReader.readLine();
        headerList = header.split(",");
        bufferedReader.close();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] values = line.split(",");
        // Skip the record if it is the header line itself
        if (headerList.length == values.length && !header.equals(line)) {
            for (int i = 0; i < values.length; i++) {
                context.write(new Text(headerList[i] + "," + values[i]), NullWritable.get());
            }
        }
    }
}
Mapper Logic:
Override setup() method.
Read "header.txt" (which was put in distributed cache in the Driver) in the setup() method.
In the map() method, check if the line matches the header. If yes, then ignore that line. Else, output header and values as (h1,v1), (h2,v2), (h3,v3) and (h4,v4).
I ran this program on the following input:
A,B,C,D
1,2,3,4
5,6,7,8
I got the following output (where values are matched with respective header):
A,1
A,5
B,2
B,6
C,3
C,7
D,4
D,8
The accepted answer by @Manjunath Ballur works as a good hack. But MapReduce jobs should be kept simple, and checking the header for every line is not the recommended way to do this.
One way to go is to write a custom InputFormat that does this work for you.
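Here is a minimal sketch of such an input format (not from the original answer; the class names are illustrative). It wraps the standard LineRecordReader and drops the first record of the split that starts at byte offset 0, i.e. the split that contains the CSV header. The mapper would still need the header names themselves, e.g. from the distributed cache as in the accepted answer.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SkipHeaderTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new SkipHeaderRecordReader();
    }

    public static class SkipHeaderRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private boolean skipFirstLine;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            // Only the split that starts at the beginning of the file contains the header
            skipFirstLine = ((FileSplit) split).getStart() == 0;
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (skipFirstLine) {
                skipFirstLine = false;
                if (!delegate.nextKeyValue()) {
                    return false; // the file was empty
                }
            }
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

        @Override
        public Text getCurrentValue() { return delegate.getCurrentValue(); }

        @Override
        public float getProgress() throws IOException { return delegate.getProgress(); }

        @Override
        public void close() throws IOException { delegate.close(); }
    }
}
With this, every mapper sees only data records, regardless of how many input splits the file is divided into.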
Hi, I have an application that reads records from HBase and writes them into text files. The HBase table has 200 regions.
I am using MultipleOutputs in the mapper class to write into multiple files, and I am making the file name from the incoming records.
I am making 40 unique file names.
I am able to get the records properly, but my problem is that when the MapReduce job finishes it creates the 40 files and also 2k extra files with the proper name, but appended
with m-000 and so on.
This is because I have 200 regions and MultipleOutputs creates files per mapper, so there are 200 mappers and for each mapper there are 40 unique files; that is why it creates 40*200 files.
I don't know how to avoid this situation without a custom partitioner.
Is there any way to force records to be written only into the files they belong to, instead of being split into multiple files?
I have used a custom partitioner class and it works fine, but I don't want to use it, as I am just reading from HBase and not doing a reducer operation. Also, if I have to create any extra file name, then I have to change my code as well.
Here is my mapper code:
public class DefaultMapper extends TableMapper<NullWritable, Text> {

    private Text text = new Text();
    MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";

    @Override
    public void setup(Context context) throws java.io.IOException, java.lang.InterruptedException {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    // The map signature below was missing from the original snippet; it is the standard
    // TableMapper signature (ImmutableBytesWritable row key, Result value).
    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws java.io.IOException, java.lang.InterruptedException {
        String FILE_NAME = new String(value.getValue(
                Bytes.toBytes(HbaseBulkLoadMapperConstants.COLUMN_FAMILY),
                Bytes.toBytes(HbaseBulkLoadMapperConstants.FILE_NAME)));
        multipleOutputs.write(NullWritable.get(), new Text(text.toString()), FILE_NAME);
        //context.write(NullWritable.get(), text);
    }
}
No reducer class
This is how my output looks; ideally only one Japan.BUS.gz file should be created. The other files are very small files as well:
Japan.BUS-m-00193.gz
Japan.BUS-m-00194.gz
Japan.BUS-m-00195.gz
Japan.BUS-m-00196.gz
I encountered the same situation and worked out a solution for it as well.
MultipleOutputs<KEYOUT, VALUEOUT> multipleOutputs = null;
String keyToFind = new String();

public void setup(Context context) throws IOException, InterruptedException {
    this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
}

public void map(NullWritable key, Text values, Context context) throws IOException, InterruptedException {
    String valToFindInCol[] = values.toString().split(","); /** Let's say comma separated **/
    if (keyToFind == null || keyToFind.equals(valToFindInCol[2])) /** Say you need to match the element at position 2 **/
    {
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    else
    {
        // The key changed, so close the current MultipleOutputs and open a new one
        this.multipleOutputs.close();
        this.multipleOutputs = null;
        this.multipleOutputs = new MultipleOutputs<KEYOUT, VALUEOUT>(context);
        this.multipleOutputs.write(NullWritable.get(), <valToWrite>, valToFindInCol[2]);
    }
    keyToFind = valToFindInCol[2];
}
I have a Spring Batch job that extracts data from a database and writes it to a .CSV file.
I would like to add the names of the columns that are extracted as the headers of the file, without hard-coding them in the file.
Is it possible to write the header when I get the results, or is there another solution?
Thanks
fileItemWriter.setHeaderCallback(new FlatFileHeaderCallback() {
    @Override
    public void writeHeader(Writer writer) throws IOException {
        writer.write(Arrays.toString(names));
    }
});
names can be fetched using reflection from the domain class you created for the column names to be used by the rowMapper, something like below:
private String[] reflectFields() throws ClassNotFoundException {
    Class<?> job = Class.forName("DomainClassName");
    Field[] fields = FieldUtils.getAllFields(job);
    names = new String[fields.length];
    for (int i = 0; i < fields.length; i++) {
        names[i] = fields[i].getName();
    }
    return names;
}
I am trying to count the occurrences of a particular word in a file using Hadoop MapReduce programming in Java. Both the file and the word should be user input. So I am trying to pass the particular word as a third argument along with the i/p and o/p paths (In, Out, Word). But I am not able to find a way to pass the word to the map function.
I have tried the following way but it did not work:
- I created a static String variable in the mapper class and assigned the value of my 3rd argument (i.e., the word to be searched) to it, and then tried to use this static variable inside the map function. But inside the map function the static variable's value came as null.
I am unable to get the third argument's value inside the map function.
Is there anyway to set the value via JobConf object? Please help. I have pasted my code below.
public class MyWordCount {

    public static class MyWordCountMap extends Mapper<Text, Text, Text, LongWritable> {
        static String wordToSearch;
        private final static LongWritable ONE = new LongWritable(1L);
        private Text word = new Text();

        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            System.out.println(wordToSearch); // Here the value is coming as null
            if (value.toString().compareTo(wordToSearch) == 0) {
                context.write(word, ONE);
            }
        }
    }

    public static class SumReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            long sum = 0L;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] rawArgs) throws Exception {
        GenericOptionsParser parser = new GenericOptionsParser(rawArgs);
        Configuration conf = parser.getConfiguration();
        String[] args = parser.getRemainingArgs();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(MyWordCountMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(MyWordCountMap.class);
        job.setReducerClass(SumReduce.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        String MyWord = args[2];
        MyWordCountMap.wordToSearch = MyWord;
        job.waitForCompletion(true);
    }
}
There is a way to do this with Configuration (see the API here). As an example, the following code can be used, which sets "Tree" as the word to be searched:
// Create a new configuration
Configuration conf = new Configuration();
// Set the word to be searched
conf.set("wordToSearch", "Tree");
// Create the job
Job job = new Job(conf);
Then, in your mapper/reducer class you can get wordToSearch (i.e., "Tree" in this example) using the following:
// Get the job configuration from the context
Configuration conf = context.getConfiguration();
// Retrieve the wordToSearch variable
String wordToSearch = conf.get("wordToSearch");
See here for more details.
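For completeness, a minimal sketch of where this retrieval would typically go in the mapper (the class and field names here are illustrative, not from the original post): read the value once in setup() so it is available to every map() call.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SearchWordMapper extends Mapper<Text, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1L);
    private String wordToSearch;

    @Override
    protected void setup(Context context) {
        // Read the value that the driver stored with conf.set("wordToSearch", "Tree")
        wordToSearch = context.getConfiguration().get("wordToSearch");
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().equals(wordToSearch)) {
            context.write(value, ONE);
        }
    }
}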
I have a class which extends TreeMap with one external method.
The external method "open" is supposed to read lines from a given file in the format "word:meaning" and add them to the TreeMap with put("word", "meaning").
So I read the file with RandomAccessFile and put the key-value pairs in the TreeMap, and when I print the TreeMap I can see the proper keys and values, for example:
{AAAA=BBBB, CAB=yahoo!}
But for some reason when I do get("AAAA") I get null.
Any reason why it's happening and how to solve it?
Here is the code
public class InMemoryDictionary extends TreeMap<String, String> implements
        PersistentDictionary {

    private static final long serialVersionUID = 1L; // (because we're extending a serializable class)
    private File dictFile;

    public InMemoryDictionary(File dictFile) {
        super();
        this.dictFile = dictFile;
    }

    @Override
    public void open() throws IOException {
        clear();
        RandomAccessFile file = new RandomAccessFile(dictFile, "rw");
        file.seek(0);
        String line;
        while (null != (line = file.readLine())) {
            int firstColon = line.indexOf(":");
            put(line.substring(0, firstColon - 1),
                    line.substring(firstColon + 1, line.length() - 1));
        }
        file.close();
    }

    @Override
    public void close() throws IOException {
        dictFile.delete();
        RandomAccessFile file = new RandomAccessFile(dictFile, "rw");
        file.seek(0);
        for (Map.Entry<String, String> entry : entrySet()) {
            file.writeChars(entry.getKey() + ":" + entry.getValue() + "\n");
        }
        file.close();
    }
}
the "question marks" from a previous version of your question are important. they indicate that the strings you thought you were seeing are not in fact the strings you are using. RandomAccessFile is a poor choice to read a text file. You are presumably reading a text file with a text encoding which is not single byte (utf-16 perhaps)? the resulting strings are mis-encoded since RandomAccessFile does an "ascii" character conversion. this is causing your get() call to fail.
first, figure out the character encoding of your file and open it with the appropriately configured InputStreamReader.
second, extending TreeMap is a very poor design. Use aggregation here, not extension.
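A minimal sketch of both suggestions (the UTF-16 charset below is only an assumption; use whatever encoding the file was actually written in): it reads the file through an InputStreamReader and keeps the entries in a TreeMap field instead of extending TreeMap. The persistence side (close()) is left out.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

public class InMemoryDictionary {

    private final TreeMap<String, String> entries = new TreeMap<>(); // aggregation, not extension
    private final File dictFile;

    public InMemoryDictionary(File dictFile) {
        this.dictFile = dictFile;
    }

    public void open() throws IOException {
        entries.clear();
        // Assumed encoding: replace UTF_16 with the file's real character encoding
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(dictFile), StandardCharsets.UTF_16))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int firstColon = line.indexOf(':');
                if (firstColon > 0) {
                    entries.put(line.substring(0, firstColon), line.substring(firstColon + 1));
                }
            }
        }
    }

    public String get(String word) {
        return entries.get(word);
    }
}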
I wrote some Hadoop code to read the mapped file, split it into chunks, and write them to many files, as follows:
public void map(LongWritable key, Text value, OutputCollector<IntWritable, Text> output,
        Reporter reporter) throws IOException {
    String line = value.toString();
    int totalLines = 2000;
    int lines = 0;
    int fileNum = 1;
    String[] linesinfile = line.split("\n");
    while (lines < linesinfile.length) {
        // I do something like:
        if (lines == totalLines) {
            output.collect(new IntWritable(fileNum),
                    new Text(linesinfile[lines].toString()));
            fileNum++;
            lines = 0;
        }
        lines++;
    }
}
In reduce, I do:
public void reduce(IntWritable key, Iterator<Text> values,
        OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}
My MultiFile class is as follows:
public class MultiFileOutput extends MultipleTextOutputFormat<IntWritable, Text> {
protected String generateFileNameForKeyValue(IntWritable key, Text content, String
fileName) {
return key.toString() + "-" + fileName;
}
}
In main, I say:
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(MultiFileOutput.class);
apart from setting the OutKey/Value Class etc.
What am I doing wrong? My output directory is always empty.
Thanks
The program looks a bit complex. If the purpose is to split the file into multiple files, then it can be done in a couple of ways. There is no need for a map-and-reduce job; a map-only job would be enough.
Use o.a.h.mapred.lib.NLineInputFormat to read N lines at a time into the mapper from the input and then write those N lines to a file (see the sketch after this list).
Set dfs.blocksize to the required file size while uploading the file; then each mapper will process one InputSplit, which can be written to a file.
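A minimal sketch of the first option (the paths, the 2000-line chunk size, and the class names are assumptions, and it uses the newer o.a.h.mapreduce.lib.input.NLineInputFormat rather than the old mapred API): a map-only job where each input split holds N lines, so each mapper's part file contains exactly those N lines.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitFileDriver {

    public static class PassThroughMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit each line unchanged; it ends up in this mapper's own part-m-xxxxx file
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-file");
        job.setJarByClass(SplitFileDriver.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job: no reducer needed
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 2000); // 2000 lines per output file
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}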