Changing number of splits for Hadoop job - java

I am currently writing code to process a single image using Hadoop, so my input is a single .png file. I have working code that will run a job, but it only ever runs one mapper and never spawns additional mappers.
I have created my own extensions of the FileInputFormat and RecordReader classes in order to create (what I thought were) "n" custom splits -> "n" map tasks.
I've been searching the web like crazy for examples of this nature to learn from, but all I've been able to find are examples which deal with using entire files as a split (meaning exactly one mapper) or using a fixed number of lines from a text file (e.g., 3) per map task.
What I'm trying to do is send a pair of coordinates ((x1, y1), (x2, y2)) to each mapper where the coordinates correspond to the top-left/bottom-right pixels of some rectangle in the image.
Any suggestions, guidance, examples, or links to examples would be greatly appreciated.
Custom FileInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
public class FileInputFormat1 extends FileInputFormat
{
@Override
public RecordReader createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
return new RecordReader1();
}
@Override
protected boolean isSplitable(JobContext context, Path filename) {
return true;
}
}
Custom RecordReader
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
public class RecordReader1 extends RecordReader<KeyChunk1, NullWritable> {
private KeyChunk1 key;
private NullWritable value;
private ImagePreprocessor IMAGE;
public RecordReader1()
{
}
@Override
public void close() throws IOException {
}
@Override
public float getProgress() throws IOException, InterruptedException {
return IMAGE.getProgress();
}
@Override
public KeyChunk1 getCurrentKey() throws IOException, InterruptedException {
return key;
}
@Override
public NullWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
boolean gotNextValue = IMAGE.hasAnotherChunk();
if (gotNextValue)
{
if (key == null)
{
key = new KeyChunk1();
}
if (value == null)
{
value = NullWritable.get();
}
int[] data = IMAGE.getChunkIndicesAndIndex();
key.setChunkIndex(data[2]);
key.setStartRow(data[0]);
key.setStartCol(data[1]);
key.setChunkWidth(data[3]);
key.setChunkHeight(data[4]);
}
else
{
key = null;
value = null;
}
return gotNextValue;
}
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
Configuration config = taskAttemptContext.getConfiguration();
IMAGE = new ImagePreprocessor(
config.get("imageName"),
config.getInt("v_slices", 1),
config.getInt("h_slices", 1),
config.getInt("kernel_rad", 2),
config.getInt("grad_rad", 1),
config.get("hdfs_address"),
config.get("local_directory")
);
}
}
ImagePreprocessor Class (Used in custom RecordReader - only showing necessary information)
import java.awt.image.BufferedImage;
import java.io.IOException;
public class ImagePreprocessor {
private String filename;
private int num_v_slices;
private int num_h_slices;
private int minSize;
private int width, height;
private int chunkWidth, chunkHeight;
private int indexI, indexJ;
String hdfs_address, local_directory;
public ImagePreprocessor(String filename, int num_v_slices, int num_h_slices, int kernel_radius, int gradient_radius,
String hdfs_address, String local_directory) throws IOException{
this.hdfs_address = hdfs_address;
this.local_directory = local_directory;
// all "validate" methods throw errors if input data is invalid
checkValidFilename(filename);
checkValidNumber(num_v_slices, "vertical strips");
this.num_v_slices = num_v_slices;
checkValidNumber(num_h_slices, "horizontal strips");
this.num_h_slices = num_h_slices;
checkValidNumber(kernel_radius, "kernel radius");
checkValidNumber(gradient_radius, "gradient radius");
this.minSize = 1 + 2 * (kernel_radius + gradient_radius);
getImageData(); // loads image and saves width/height to class variables
validateImageSize();
chunkWidth = validateWidth((int)Math.ceil(((double)width) / num_v_slices));
chunkHeight = validateHeight((int)Math.ceil(((double)height) / num_h_slices));
indexI = 0;
indexJ = 0;
}
public boolean hasAnotherChunk()
{
return indexI < num_h_slices;
}
public int[] getChunkIndicesAndIndex()
{
int[] ret = new int[5];
ret[0] = indexI;
ret[1] = indexJ;
ret[2] = indexI*num_v_slices + indexJ;
ret[3] = chunkWidth;
ret[4] = chunkHeight;
indexJ += 1;
if (indexJ >= num_v_slices)
{
indexJ = 0;
indexI += 1;
}
return ret;
}
}
Thank you for your time!

You should override the getSplits() method in your FileInputFormat1 class: in the new org.apache.hadoop.mapreduce API its signature is public List<InputSplit> getSplits(JobContext context) (in the old org.apache.hadoop.mapred API it is public InputSplit[] getSplits(JobConf job, int numSplits)). Create your own InputSplit subclass that carries the rectangle coordinates, so the RecordReader can pull that information out of its split and return the correct key/value pairs to the mapper; a sketch follows below.
The implementation of getSplits in FileInputFormat itself might also help, see here.
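For illustration, here is a minimal, untested sketch of such a split plus a getSplits() override. The class names RectangleSplit/RectangleInputFormat and the configuration keys image_width/image_height are hypothetical (the driver has to know the image dimensions somehow); v_slices, h_slices, KeyChunk1 and RecordReader1 are taken from the question's code.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// One split per rectangle; the framework launches one map task per split it receives.
class RectangleSplit extends InputSplit implements Writable {
    int x1, y1, x2, y2;                  // top-left / bottom-right pixel coordinates

    public RectangleSplit() { }          // no-arg constructor required for deserialization

    public RectangleSplit(int x1, int y1, int x2, int y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }

    @Override
    public long getLength() { return (long) (x2 - x1) * (y2 - y1); }   // used only as a scheduling hint

    @Override
    public String[] getLocations() { return new String[0]; }           // no data-locality preference

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x1); out.writeInt(y1); out.writeInt(x2); out.writeInt(y2);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x1 = in.readInt(); y1 = in.readInt(); x2 = in.readInt(); y2 = in.readInt();
    }
}

public class RectangleInputFormat extends FileInputFormat<KeyChunk1, NullWritable> {
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        int vSlices = conf.getInt("v_slices", 1);
        int hSlices = conf.getInt("h_slices", 1);
        int width = conf.getInt("image_width", 0);    // hypothetical keys: the driver must set the image size
        int height = conf.getInt("image_height", 0);
        int chunkW = (int) Math.ceil((double) width / vSlices);
        int chunkH = (int) Math.ceil((double) height / hSlices);

        List<InputSplit> splits = new ArrayList<>();
        for (int i = 0; i < hSlices; i++) {
            for (int j = 0; j < vSlices; j++) {
                splits.add(new RectangleSplit(j * chunkW, i * chunkH,
                        Math.min((j + 1) * chunkW, width), Math.min((i + 1) * chunkH, height)));
            }
        }
        return splits;                                // => hSlices * vSlices map tasks
    }

    @Override
    public RecordReader<KeyChunk1, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader1();                   // the reader can cast its split to RectangleSplit in initialize()
    }
}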

Related

Read parquet data from ByteArrayOutputStream instead of file

I would like to convert this code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.simple.SimpleGroup;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class ParquetReaderUtils {
public static Parquet getParquetData(String filePath) throws IOException {
List<SimpleGroup> simpleGroups = new ArrayList<>();
ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), new Configuration()));
MessageType schema = reader.getFooter().getFileMetaData().getSchema();
//List<Type> fields = schema.getFields();
PageReadStore pages;
while ((pages = reader.readNextRowGroup()) != null) {
long rows = pages.getRowCount();
MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
for (int i = 0; i < rows; i++) {
SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
simpleGroups.add(simpleGroup);
}
}
reader.close();
return new Parquet(simpleGroups, schema);
}
}
(which is from https://www.arm64.ca/post/reading-parquet-files-java/)
to take a ByteArrayOutputStream parameter instead of a filePath.
Is this possible? I don't see a ParquetStreamReader in org.apache.parquet.hadoop.
Any help is appreciated. I am trying to write a test app for Parquet data coming from Kafka, and writing each of many messages out to a file is rather slow.
Without deeper testing, I would try the class below (the content of the output stream has to be Parquet-compatible, of course). I put a streamId in it to make identifying the processed byte array easier (ParquetFileReader prints the instance's toString() if something goes wrong).
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
public class ParquetStream implements InputFile {
private final String streamId;
private final byte[] data;
private static class SeekableByteArrayInputStream extends ByteArrayInputStream {
public SeekableByteArrayInputStream(byte[] buf) {
super(buf);
}
public void setPos(int pos) {
this.pos = pos;
}
public int getPos() {
return this.pos;
}
}
public ParquetStream(String streamId, ByteArrayOutputStream stream) {
this.streamId = streamId;
this.data = stream.toByteArray();
}
@Override
public long getLength() throws IOException {
return this.data.length;
}
@Override
public SeekableInputStream newStream() throws IOException {
return new DelegatingSeekableInputStream(new SeekableByteArrayInputStream(this.data)) {
@Override
public void seek(long newPos) throws IOException {
((SeekableByteArrayInputStream) this.getStream()).setPos((int) newPos);
}
@Override
public long getPos() throws IOException {
return ((SeekableByteArrayInputStream) this.getStream()).getPos();
}
};
}
@Override
public String toString() {
return "ParquetStream[" + streamId + "]";
}
}
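For what it's worth, a minimal usage sketch (the class ParquetStreamExample, method and variable names here are only illustrative): since ParquetStream implements InputFile, it can be passed directly to ParquetFileReader.open() in place of HadoopInputFile.fromPath() in the original getParquetData() code.
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.schema.MessageType;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
public class ParquetStreamExample {
    // 'stream' is assumed to already hold the bytes of a complete Parquet file (footer included).
    public static MessageType readSchema(ByteArrayOutputStream stream) throws IOException {
        ParquetStream inputFile = new ParquetStream("in-memory-parquet", stream);
        try (ParquetFileReader reader = ParquetFileReader.open(inputFile)) {
            // From here on, the row-group loop from getParquetData() works unchanged.
            return reader.getFooter().getFileMetaData().getSchema();
        }
    }
}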

How do wrapped types work in Hadoop?

I'm not a Java expert, but I know the basics of Java, and I always try to understand Java code in depth whenever I come across it.
It could be a really silly doubt, but I would love to get a clear understanding of it.
I'm posting in the Java community because my doubt is about Java only.
For the last couple of months I have been working with Hadoop, and I came across the fact that Hadoop uses its own types, which wrap Java's primitive types in order to send data across the network more efficiently using serialization and deserialization.
My confusion starts here. Let's say we have some data in HDFS to be processed using the following Java code running as a Hadoop job:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper
extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
for (String word : line.split(" ")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}
In this code, Hadoop's types are LongWritable, Text, and IntWritable.
Let's pick the Text type, which is wrapped around Java's String type (correct me if I'm wrong).
My doubt here is: when we pass these parameters to our map method in the code above, how do these parameters interact with the code in the imported package, i.e. org.apache.hadoop.io.Text?
Below is the Text class code
package org.apache.hadoop.io;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Arrays;
import org.apache.avro.reflect.Stringable;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;
@Stringable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Text
extends BinaryComparable
implements WritableComparable<BinaryComparable>
{
private static final Log LOG = LogFactory.getLog(Text.class);
private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY = new ThreadLocal()
{
protected CharsetEncoder initialValue() {
return Charset.forName("UTF-8").newEncoder().onMalformedInput(CodingErrorAction.REPORT).onUnmappableCharacter(CodingErrorAction.REPORT);
}
};
private static ThreadLocal<CharsetDecoder> DECODER_FACTORY = new ThreadLocal()
{
protected CharsetDecoder initialValue() {
return Charset.forName("UTF-8").newDecoder().onMalformedInput(CodingErrorAction.REPORT).onUnmappableCharacter(CodingErrorAction.REPORT);
}
};
private static final byte[] EMPTY_BYTES = new byte[0];
private byte[] bytes;
private int length;
public Text()
{
bytes = EMPTY_BYTES;
}
public Text(String string)
{
set(string);
}
public Text(Text utf8)
{
set(utf8);
}
public Text(byte[] utf8)
{
set(utf8);
}
public byte[] getBytes()
{
return bytes;
}
public int getLength()
{
return length;
}
public int charAt(int position)
{
if (position > length) return -1;
if (position < 0) { return -1;
}
ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
return bytesToCodePoint(bb.slice());
}
public int find(String what) {
return find(what, 0);
}
public int find(String what, int start)
{
try
{
ByteBuffer src = ByteBuffer.wrap(bytes, 0, length);
ByteBuffer tgt = encode(what);
byte b = tgt.get();
src.position(start);
while (src.hasRemaining()) {
if (b == src.get()) {
src.mark();
tgt.mark();
boolean found = true;
int pos = src.position() - 1;
while (tgt.hasRemaining()) {
if (!src.hasRemaining()) {
tgt.reset();
src.reset();
found = false;
}
else if (tgt.get() != src.get()) {
tgt.reset();
src.reset();
found = false;
}
}
if (found) return pos;
}
}
return -1;
}
catch (CharacterCodingException e) {
e.printStackTrace(); }
return -1;
}
public void set(String string)
{
try
{
ByteBuffer bb = encode(string, true);
bytes = bb.array();
length = bb.limit();
} catch (CharacterCodingException e) {
throw new RuntimeException("Should not have happened " + e.toString());
}
}
public void set(byte[] utf8)
{
set(utf8, 0, utf8.length);
}
public void set(Text other)
{
set(other.getBytes(), 0, other.getLength());
}
public void set(byte[] utf8, int start, int len)
{
setCapacity(len, false);
System.arraycopy(utf8, start, bytes, 0, len);
length = len;
}
public void append(byte[] utf8, int start, int len)
{
setCapacity(length + len, true);
System.arraycopy(utf8, start, bytes, length, len);
length += len;
}
public void clear()
{
length = 0;
}
private void setCapacity(int len, boolean keepData)
{
if ((bytes == null) || (bytes.length < len)) {
if ((bytes != null) && (keepData)) {
bytes = Arrays.copyOf(bytes, Math.max(len, length << 1));
} else {
bytes = new byte[len];
}
}
}
public String toString()
{
try
{
return decode(bytes, 0, length);
} catch (CharacterCodingException e) {
throw new RuntimeException("Should not have happened " + e.toString());
}
}
public void readFields(DataInput in)
throws IOException
{
int newLength = WritableUtils.readVInt(in);
setCapacity(newLength, false);
in.readFully(bytes, 0, newLength);
length = newLength;
}
public static void skip(DataInput in) throws IOException
{
int length = WritableUtils.readVInt(in);
WritableUtils.skipFully(in, length);
}
public void write(DataOutput out)
throws IOException
{
WritableUtils.writeVInt(out, length);
out.write(bytes, 0, length);
}
public boolean equals(Object o)
{
if ((o instanceof Text))
return super.equals(o);
return false;
}
Could you please explain: whenever we run the above Hadoop code, how does the data in HDFS flow into the parameters we have declared in the map method?
Once the first record from HDFS hits the Text parameter, how does it flow inside the org.apache.hadoop.io.Text class?
I mean, where does it start? (I'm assuming it starts from the set method in the class, because it has roughly the same parameters as the map method mentioned above; am I correct?)
Where in the code does it change from an ordinary String type to the Text type?
My second doubt is: when data is stored in a Text, who kicks off the serialization? I mean, who calls write(DataOutput out), and who calls readFields(DataInput in) once the data has reached its destination over the network?
How does it work, and where do I need to look?
I hope what I am asking is clear.
Like all network or disk operations, everything is transferred as bytes. The Text class deserializes bytes to UTF-8. The Writables determine how data is represented and Comparables determine how data is ordered.
The InputFormat set in the Job determines what Writables are given to a map or reduce Task.
An InputSplit determines how to split and read a raw byte stream into the Writables
One map task is started on each InputSplit
For more detail, refer to https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
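To make the serialization part concrete: the framework itself calls those methods. The map-side output collector calls write() on each key and value when it spills/shuffles them as bytes, and the reduce side calls readFields() to rebuild the objects before your reduce() sees them. Below is a small standalone sketch of that round trip; it is not framework code, just an illustration using the Text class discussed above.
import org.apache.hadoop.io.Text;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        Text original = new Text("hello hadoop");   // Text wraps a UTF-8 byte array, not a String

        // "Serialization": what the framework does when it writes map output as bytes.
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytesOut));

        // "Deserialization": what the framework does before handing the value to reduce().
        Text copy = new Text();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytesOut.toByteArray())));

        System.out.println(copy);   // prints "hello hadoop" via Text.toString()
    }
}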

org.apache.commons.logging.Log cannot be resolved

When I try to declare a byte array using private byte[] startTag;,
Eclipse shows the line as erroneous.
Hovering over it, I get this message:
The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files
I tried adding a jar file to the classpath after looking at other solutions, but I'm unable to remove the error.
What should I do now?
If any specific jar file needs to be added, please mention it.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class XmlInputFormat extends TextInputFormat {
public static final String START_TAG_KEY = "<student>";
public static final String END_TAG_KEY = "</student>";
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split, TaskAttemptContext context) {
return new XmlRecordReader();
}
public static class XmlRecordReader extends
RecordReader<LongWritable, Text> {
private byte[] startTag;
private byte[] endTag;
private long start;
private long end;
private FSDataInputStream fsin;
private DataOutputBuffer buffer = new DataOutputBuffer();
private LongWritable key = new LongWritable();
private Text value = new Text();
@Override
public void initialize(InputSplit is, TaskAttemptContext tac)
throws IOException, InterruptedException {
FileSplit fileSplit = (FileSplit) is;
String START_TAG_KEY = "<employee>";
String END_TAG_KEY = "</employee>";
startTag = START_TAG_KEY.getBytes("utf-8");
endTag = END_TAG_KEY.getBytes("utf-8");
start = fileSplit.getStart();
end = start + fileSplit.getLength();
Path file = fileSplit.getPath();
FileSystem fs =file.getFileSystem(tac.getConfiguration());
fsin = fs.open(fileSplit.getPath());
fsin.seek(start);
}
@Override
public boolean nextKeyValue() throws
IOException,InterruptedException {
if (fsin.getPos() < end) {
if (readUntilMatch(startTag, false)) {
try {
buffer.write(startTag);
if (readUntilMatch(endTag, true)) {
value.set(buffer.getData(), 0,
buffer.getLength());
key.set(fsin.getPos());
return true;
}
} finally {
buffer.reset();
}
}
}
return false;
}
@Override
public LongWritable getCurrentKey() throws IOException,
InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException,
InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException,
InterruptedException {
return (fsin.getPos() - start) / (float) (end - start);
}
@Override
public void close() throws IOException {
fsin.close();
}
private boolean readUntilMatch(byte[] match, boolean
withinBlock)throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
if (b == -1)
return false;
if (withinBlock)
buffer.write(b);
if (b == match[i]) {
i++;
if (i >= match.length)
return true;
} else
i = 0;
if (!withinBlock && i == 0 && fsin.getPos() >= end)
return false;
}
}
}
}
I have solved the issue by finding the .jar library inside $HADOOP_HOME and adding it to the project's build path.
I've also answered on this thread, for a similar problem:
https://stackoverflow.com/a/73427233/6685449

When writing to context in the Reducer.reduce method, why is the toString method invoked and not the write method?

I'm writing a map-reduce batch job, which consists of 3-4 chained jobs. In the second job I'm using a custom class as the output value class when writing to context via context.write().
When studying the behavior of the code, I noticed that the toString method of this custom class is invoked rather than the write method. Why does this happen if the class implements the Writable interface and I implemented the write method?
The custom class's code:
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class WritableLongPair implements Writable {
private long l1;
private long l2;
public WritableLongPair() {
l1 = 0;
l2 = 0;
}
public WritableLongPair(long l1, long l2) {
this.l1 = l1;
this.l2 = l2;
}
@Override
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeLong(l1);
dataOutput.writeLong(l2);
}
@Override
public void readFields(DataInput dataInput) throws IOException {
l1 = dataInput.readLong();
l2 = dataInput.readLong();
}
@Override
public String toString() {
return l1 + " " + l2;
}
}
The second job's code:
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class Phase2 {
private static final int ASCII_OFFSET = 97;
public static class Mapper2
extends Mapper<Object, Text, Text, LongWritable>{
@Override
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String[] valueAsStrings = value.toString().split("\t");
String actualKey = valueAsStrings[0];
LongWritable actualValue = new LongWritable(Long.parseLong(valueAsStrings[1]));
String[] components = actualKey.toString().split("[$]");
if (!components[1].equals("*")) {
context.write(new Text(components[1] + "$" + components[0]), actualValue);
context.write(new Text(components[1] + "$*"), actualValue);
}
context.write(new Text(actualKey), actualValue);
}
}
public static class Partitioner2 extends Partitioner<Text, LongWritable> {
@Override
public int getPartition(Text text, LongWritable longWritable, int i) {
return (int)(text.toString().charAt(0)) - ASCII_OFFSET;
}
}
public static class Reducer2
extends Reducer<Text, LongWritable, Text, WritableLongPair> {
private Text currentKey;
private long sum;
@Override
public void setup(Context context) {
currentKey = new Text();
currentKey.set("");
sum = 0l;
}
private String textContent(String w1, String w2) {
if (w2.equals("*"))
return w1 + "$*";
if (w1.compareTo(w2) < 0)
return w1 + "$" + w2;
else
return w2 + "$" + w1;
}
public void reduce(Text key, Iterable<LongWritable> counts,
Context context
) throws IOException, InterruptedException {
long sumPair = 0l;
String[] components = key.toString().split("[$]");
for (LongWritable count : counts) {
if (currentKey.equals(components[0])) {
if (components[1].equals("*"))
sum += count.get();
else
sumPair += count.get();
}
else {
sum = count.get();
currentKey.set(components[0]);
}
}
if (!components[1].equals("*"))
context.write(new Text(textContent(components[0], components[1])), new WritableLongPair(sumPair, sum));
}
}
public static class Comparator2 extends WritableComparator {
@Override
public int compare(WritableComparable o1, WritableComparable o2) {
String[] components1 = o1.toString().split("[$]");
String[] components2 = o2.toString().split("[$]");
if (components1[1].equals("*") && components2[1].equals("*"))
return components1[0].compareTo(components2[0]);
if (components1[1].equals("*")) {
if (components1[0].equals(components2[0]))
return -1;
else
return components1[0].compareTo(components2[0]);
}
if (components2[1].equals("*")) {
if (components1[0].equals(components2[0]))
return 1;
else
return components1[0].compareTo(components2[0]);
}
return components1[0].compareTo(components2[0]);
}
}
}
...and how I define my jobs:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Manager {
public static void main(String[] args) throws Exception {
Configuration conf1 = new Configuration();
if (args.length != 2) {
System.err.println("Usage: Manager <in> <out>");
System.exit(1);
}
Job job1 = Job.getInstance(conf1, "Phase 1");
job1.setJarByClass(Phase1.class);
job1.setMapperClass(Phase1.Mapper1.class);
job1.setPartitionerClass(Phase1.Partitioner1.class);
// job1.setCombinerClass(Phase1.Combiner1.class);
job1.setReducerClass(Phase1.Reducer1.class);
job1.setInputFormatClass(SequenceFileInputFormat.class);
// job1.setOutputFormatClass(FileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(LongWritable.class);
job1.setNumReduceTasks(12);
FileInputFormat.addInputPath(job1, new Path(args[0]));
Path output1 = new Path(args[1]);
FileOutputFormat.setOutputPath(job1, output1);
boolean result = job1.waitForCompletion(true);
Counter counter = job1.getCounters().findCounter("org.apache.hadoop.mapreduce.TaskCounter", "REDUCE_INPUT_RECORDS");
System.out.println("Num of pairs sent to reducers in phase 1: " + counter.getValue());
Configuration conf2 = new Configuration();
Job job2 = Job.getInstance(conf2, "Phase 2");
job2.setJarByClass(Phase2.class);
job2.setMapperClass(Phase2.Mapper2.class);
job2.setPartitionerClass(Phase2.Partitioner2.class);
// job2.setCombinerClass(Phase2.Combiner2.class);
job2.setReducerClass(Phase2.Reducer2.class);
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(LongWritable.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(WritableLongPair.class);
job2.setNumReduceTasks(26);
// job2.setGroupingComparatorClass(Phase2.Comparator2.class);
FileInputFormat.addInputPath(job2, output1);
Path output2 = new Path(args[1] + "2");
FileOutputFormat.setOutputPath(job2, output2);
result = job2.waitForCompletion(true);
counter = job2.getCounters().findCounter("org.apache.hadoop.mapreduce.TaskCounter", "REDUCE_INPUT_RECORDS");
System.out.println("Num of pairs sent to reducers in phase 2: " + counter.getValue());
// System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
}
If you use the default output format (TextOutputFormat), Hadoop calls the toString() method on the object when it writes it to disk. This is expected behavior: context.write() is being called, but it is the output format that controls how the data appears on disk.
If you're chaining jobs together, you would typically use SequenceFileInputFormat and SequenceFileOutputFormat for all of the jobs, since that makes reading the output of one job as the input of the next job easy, as sketched below.
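As a rough, untested sketch of that suggestion (the helper class ChainedJobsSketch and its method are illustrative; Text, LongWritable and WritableLongPair are the types from the question), the driver could wire the two jobs together like this:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import java.io.IOException;
public class ChainedJobsSketch {
    public static void configure(Job job1, Job job2, Path intermediate, Path finalOutput) throws IOException {
        // Job 1 writes its <Text, LongWritable> output as a binary SequenceFile...
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job1, intermediate);

        // ...and job 2 reads those Writables back directly, with no text parsing.
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);

        // Job 2's own output: WritableLongPair is stored via its write()/readFields(), not toString().
        job2.setOutputFormatClass(SequenceFileOutputFormat.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(WritableLongPair.class);
        FileOutputFormat.setOutputPath(job2, finalOutput);
    }
}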

Reading file in Java Hadoop

I am following a Hadoop tutorial from a website and trying to implement it in Java. The file that is provided contains data about a forum. I want to parse that file and use the data.
The code to set my configurations is as follows:
public class ForumAnalyser extends Configured implements Tool{
public static void main(String[] args) {
int exitCode = 0;
try {
exitCode = ToolRunner.run(new ForumAnalyser(), args);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
finally {
System.exit(exitCode);
}
}
@Override
public int run(String[] args) throws Exception {
JobConf conf = new JobConf(ForumAnalyser.class);
setStudentHourPostJob(conf);
JobClient.runJob(conf);
return 0;
}
public static void setStudentHourPostJob(JobConf conf) {
FileInputFormat.setInputPaths(conf, new Path("input2"));
FileOutputFormat.setOutputPath(conf, new Path("output_forum_post"));
conf.setJarByClass(ForumAnalyser.class);
conf.setMapperClass(StudentHourPostMapper.class);
conf.setOutputKeyClass(LongWritable.class);
conf.setMapOutputKeyClass(LongWritable.class);
conf.setReducerClass(StudentHourPostReducer.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapOutputValueClass(IntWritable.class);
}
}
Records in the file are separated by "\n", so in the mapper class each record is mostly returned correctly. The columns in every record are separated by tabs. The problem occurs with one specific column, "posts": it contains the posts written by people and hence also contains "\n", so the mapper incorrectly reads a line inside the "posts" column as a new record. Also, the "posts" column is enclosed in double quotes in the file. My question is:
1. How can I tell the mapper to identify each record correctly? Can I somehow tell it to read the columns by tab (I know how many columns each record has)?
Thanks in advance for the help.
By default, MapReduce uses TextInputFormat, in which each record is a line of input (it assumes records are delimited by a newline, "\n").
To achieve your requirements, you need to write your own InputFormat and RecordReader classes. For example, Mahout has an XmlInputFormat for reading an entire XML file as one record. Check the code here: https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
I took the code for XmlInputFormat and modified it to meet your requirements. Here is the code (I call them MultiLineInputFormat and MultiLineRecordReader):
package com.myorg.hadooptests;
import com.google.common.io.Closeables;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
/**
* Reads records that are delimited by a specific begin/end tag.
*/
public class MultiLineInputFormat extends TextInputFormat {
private static final Logger log = LoggerFactory.getLogger(MultiLineInputFormat.class);
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
try {
return new MultiLineRecordReader((FileSplit) split, context.getConfiguration());
} catch (IOException ioe) {
log.warn("Error while creating MultiLineRecordReader", ioe);
return null;
}
}
/**
* MultiLineRecordReader class to read through a given text document to output records containing multiple
* lines as a single line
*
*/
public static class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
private final long start;
private final long end;
private final FSDataInputStream fsin;
private final DataOutputBuffer buffer = new DataOutputBuffer();
private LongWritable currentKey;
private Text currentValue;
private static final Logger log = LoggerFactory.getLogger(MultiLineRecordReader.class);
public MultiLineRecordReader(FileSplit split, Configuration conf) throws IOException {
// open the file and seek to the start of the split
start = split.getStart();
end = start + split.getLength();
Path file = split.getPath();
FileSystem fs = file.getFileSystem(conf);
fsin = fs.open(split.getPath());
fsin.seek(start);
log.info("start: " + Long.toString(start) + " end: " + Long.toString(end));
}
private boolean next(LongWritable key, Text value) throws IOException {
if (fsin.getPos() < end) {
try {
log.info("Started reading");
if(readUntilEnd()) {
key.set(fsin.getPos());
value.set(buffer.getData(), 0, buffer.getLength());
return true;
}
} finally {
buffer.reset();
}
}
return false;
}
@Override
public void close() throws IOException {
Closeables.closeQuietly(fsin);
}
@Override
public float getProgress() throws IOException {
return (fsin.getPos() - start) / (float) (end - start);
}
private boolean readUntilEnd() throws IOException {
boolean insideColumn = false;
byte[] delimiterBytes = new String("\"").getBytes("utf-8");
byte[] newLineBytes = new String("\n").getBytes("utf-8");
while (true) {
int b = fsin.read();
// end of file:
if (b == -1) return false;
log.info("Read: " + b);
// We encountered a Double Quote
if(b == delimiterBytes[0]) {
if(!insideColumn)
insideColumn = true;
else
insideColumn = false;
}
// If we encounter a new line and we are not inside a column, it means end of record.
if(b == newLineBytes[0] && !insideColumn) return true;
// save to buffer:
buffer.write(b);
// see if we've passed the stop point:
if (fsin.getPos() >= end) {
if(buffer.getLength() > 0) // If buffer has some data, then return true
return true;
else
return false;
}
}
}
@Override
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return currentKey;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return currentValue;
}
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
currentKey = new LongWritable();
currentValue = new Text();
return next(currentKey, currentValue);
}
}
}
Logic:
I have assumed that the fields containing new lines ("\n") are delimited by double quotes (").
The record reading logic is in readUntilEnd() method.
In this method, if a newline appears while we are in the middle of reading a field (which is delimited by double quotes), we do not treat it as the end of the record.
To test this, I wrote an Identity Mapper (which writes the input as-is to the output). In the driver, you explicitly specify the input format as your custom input format.
For e.g., I have specified the input format as:
job.setInputFormatClass(MultiLineInputFormat.class); // This is my custom class for InputFormat and RecordReader
Following is the code:
package com.myorg.hadooptests;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MultiLineDemo {
public static class MultiLineMapper
extends Mapper<LongWritable, Text , Text, NullWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(value, NullWritable.get());
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "MultiLineMapper");
job.setInputFormatClass(MultiLineInputFormat.class);
job.setJarByClass(MultiLineDemo.class);
job.setMapperClass(MultiLineMapper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path("/in/in8.txt"));
FileOutputFormat.setOutputPath(job, new Path("/out/"));
job.waitForCompletion(true);
}
}
I ran this on the following input; the input records match the output records exactly. You can see that the 2nd field in each record contains newlines ("\n"), but the entire record is still returned in the output.
E:\HadoopTests\target>hadoop fs -cat /in/in8.txt
1 "post1 \n" 3
1 "post2 \n post2 \n" 3
4 "post3 \n post3 \n post3 \n" 6
1 "post4 \n post4 \n post4 \n post4 \n" 6
E:\HadoopTests\target>hadoop fs -cat /out/*
1 "post1 \n" 3
1 "post2 \n post2 \n" 3
1 "post4 \n post4 \n post4 \n post4 \n" 6
4 "post3 \n post3 \n post3 \n" 6
Note: I wrote this code for demo purposes. You need to handle corner cases (if any) and optimize the code (if there is scope for optimization).
