I have a file to be processed that is stored in HDFS in binary stream format.
Now I have to do some processing over the file using MapReduce.
The input file is split into a number of blocks (the file is still in its original format when it arrives at the input block).
My question is: when does this deserialization occur?
I have the Writable interface implemented in my code, and it has two methods, readFields and write. Are these methods responsible for the deserialization and serialization of the actual data stored in HDFS?
If yes, could you please explain the flow of data?
I've been stuck on this concept for the whole day. Please help.
Serialization occurs when the write method is called on the Context object in the map phase. When your code calls context.write(key, value) with your own object as the value, serialization starts. Once the map output has been written to the local disk, the shuffle and sort phase comes into the picture; there the framework processes the intermediate output, and that is where deserialization happens (via readFields). You can see the serialized data after the mapper.
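For reference, here is a minimal sketch of what such a custom Writable might look like (the class and field names are hypothetical, not from the question), showing the two methods Hadoop calls when it serializes your object on the map side and deserializes it again afterwards:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical value type: Hadoop calls write() when serializing it
// (e.g. after context.write in the mapper) and readFields() when
// deserializing it back later in the pipeline.
public class PersonWritable implements Writable {
    private String name;
    private int age;

    public PersonWritable() {            // required no-arg constructor
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);              // serialize fields in a fixed order
        out.writeInt(age);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();             // read fields back in the same order
        age = in.readInt();
    }
}

The key point is that write and readFields must handle the fields in exactly the same order.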
Say I have a large file with many objects already serialized (this is the easy part). I need to be able to have random access to the objects in the file when I go to deserialize. The only way I can think of to do this would be to somehow store the file pointer to each object.
Basically I will end up with a large file of serialized objects and don't want to deserialize the entire file when I go to retrieve just one object.
Can anyone point me in the right direction on this one?
You can't. Serialization is called serialization for a reason. It is serial. Random access into a stream of objects will not work, for several reasons including the stream header, object handles, ...
Straight serialization will never be the solution you want.
The serial portion of the name means that the objects are written linearly to the ObjectOutputStream.
The serialization format is well known; here is a link to the Java 6 serialization format.
You have several options:
Deserialize the entire file and go from there.
Write code to read the serialized file and generate an index (see the sketch after this list).
Maybe even store the index in a file for future use.
Abandon serialization to a file and store the objects in a database.
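As a rough illustration of the index option, one way to make random access workable is to repack the file into independently deserializable records plus an offset index. The sketch below assumes that approach and uses placeholder file names, so treat it as one possible shape rather than a drop-in solution:

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Sketch: read an existing file of serialized objects sequentially,
// repack each object as an independent [length][bytes] record, and build
// an offset index so individual records can be read back with a seek.
public class SerializedFileIndexer {

    public static List<Long> repackWithIndex(File in, File out) throws Exception {
        List<Long> offsets = new ArrayList<>();
        try (ObjectInputStream ois =
                 new ObjectInputStream(new BufferedInputStream(new FileInputStream(in)));
             RandomAccessFile raf = new RandomAccessFile(out, "rw")) {
            raf.setLength(0);                         // start a fresh packed file
            while (true) {
                Object obj;
                try {
                    obj = ois.readObject();           // sequential read of the old file
                } catch (EOFException eof) {
                    break;                            // end of the original stream
                }
                // Serialize this object on its own so it can be read in isolation.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                    oos.writeObject(obj);
                }
                offsets.add(raf.getFilePointer());    // remember where this record starts
                byte[] bytes = buf.toByteArray();
                raf.writeInt(bytes.length);           // [length][bytes] record
                raf.write(bytes);
            }
        }
        return offsets;                               // persist this list for future use
    }

    public static Object readAt(File packed, long offset) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(packed, "r")) {
            raf.seek(offset);                         // jump straight to one record
            byte[] bytes = new byte[raf.readInt()];
            raf.readFully(bytes);
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return ois.readObject();
            }
        }
    }
}

To fetch a single object later, seek to its recorded offset, read the length prefix, and deserialize just that record, as readAt does.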
I'm currently working on a MapReduce job processing XML data, and I think there's something about the data flow in Hadoop that I'm not getting correctly.
I'm running on Amazon's Elastic MapReduce service.
Input data: large files (significantly above 64 MB, so they should be splittable), consisting of a lot of small XML files that a previous s3distcp operation has concatenated into one.
I am using a slightly modified version of Mahout's XmlInputFormat to extract the individual XML snippets from the input.
As a next step I'd like to parse those XML snippets into business objects, which should then be passed to the mapper.
Now here is where I think I'm missing something: in order for that to work, my business objects need to implement the Writable interface, defining how to read/write an instance from/to a DataInput or DataOutput.
However, I don't see where this comes into play: the logic needed to read an instance of my object is already in the InputFormat's record reader, so why does the object have to be capable of reading/writing itself?
I have done quite some research already, and I know (or rather assume) WritableSerialization is used when transferring data between nodes in the cluster, but I'd like to understand the reasons behind that architecture.
The InputSplits are defined upon job submission, so if the name node sees that data needs to be moved to a specific node for a map task to work, would it not be sufficient to simply send the raw data as a byte stream? Why do we need to decode that into Writables if the RecordReader of our input format does the same thing anyway?
I really hope someone can show me the error in my thoughts above. Many thanks in advance!
I am learning MapReduce. I'm trying as a test to set up a 'join' algorithm that takes in data from two files (which contain the two data sets to join).
For this to work, the mapper needs to know which file each line is from; this way, it can tag it appropriately, so that the reducer doesn't (for instance) join elements from one data set to other elements from the same set.
To complicate the matter, I am using Hadoop Streaming, and the mapper and reducer are written in Python; I understand Java, but the documentation for the Hadoop InputFormat and RecordReader classes is gloriously vague, and I don't understand how I'd make a Streaming-compatible split so that some sort of file identifier could be bundled in along with the data.
Can anyone explain how to set up this input processing in a way that my Python programs can understand?
I found the answer, by the way. In Python, it's:
import os
# Hadoop Streaming passes job configuration to the script as environment
# variables; newer Hadoop versions name this one "mapreduce_map_input_file".
context = os.environ["map_input_file"]
And 'context' then has the input file name.
I have an application which needs to read a file that is the serialized result of an ArrayList (ArrayList<String>, 50,000 records in the list, size: 20 MB).
I don't know exactly how to read the data into the Hadoop platform. I only have a vague sense that I need to override InputFormat and OutputFormat.
I'm a beginner on the Hadoop platform. Could you give me some advice?
Thanks,
Zheng.
To start with, you'll need to extend FileInputFormat, notably implementing the abstract FileInputFormat.createRecordReader method.
You can look through the source of something like the LineRecordReader (which is what TextInputFormat uses to process text files).
From there you're pretty much on your own (i.e. it depends on how your ArrayList has been serialized). Look through the source for the LineRecordReader and try to relate that to how your ArrayList has been serialized.
Some other points of note: is your file format splittable? That is, can you seek to an offset in the file and recover the stream from there? (Text files can, as they just scan forward to the end of the current line and then start from there.) If your file format uses compression, you also need to take this into account (you cannot, for example, randomly seek to a position in a gzip file). By default, FileInputFormat.isSplitable will return true, which you may want to initially override to be false. If you do stick with 'unsplittable', then note that your file will be processed by a single mapper (no matter its size).
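To make that concrete, here is a rough skeleton of such an input format (class names are placeholders, and the record reader bodies are stubs you would fill in based on how your ArrayList was serialized):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Skeleton of a custom input format; the record reader is left as a stub
// because its logic depends on how the ArrayList was actually serialized.
public class SerializedListInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new SerializedListRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Start with an unsplittable format: the whole file goes to one mapper.
        return false;
    }

    public static class SerializedListRecordReader
            extends RecordReader<LongWritable, Text> {

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Open the file from the split and position the underlying stream.
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Read the next element of the serialized list; return false at the end.
            return false;
        }

        @Override
        public LongWritable getCurrentKey() { return null; }

        @Override
        public Text getCurrentValue() { return null; }

        @Override
        public float getProgress() { return 0f; }

        @Override
        public void close() throws IOException { }
    }
}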
Before processing data on Hadoop, you should upload the data to HDFS or another supported file system, of course, if it wasn't uploaded there by something else already. If you are controlling the uploading process, you can convert the data at the uploading stage into something you can easily process, like:
a simple text file (one line per array item)
a SequenceFile, if the array items can contain lines with '\n' (see the sketch below)
This is the simplest solution, since you don't have to interfere with Hadoop's internals.
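As a rough illustration of the SequenceFile option (the file paths and key/value types here are assumptions, not anything from the question), a small client-side converter might look like this:

import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: deserialize the ArrayList<String> locally, then write its
// elements into a SequenceFile so a plain MapReduce job can read them.
public class ArrayListToSequenceFile {
    public static void main(String[] args) throws Exception {
        ArrayList<String> list;
        try (ObjectInputStream ois =
                 new ObjectInputStream(new FileInputStream("list.ser"))) {
            @SuppressWarnings("unchecked")
            ArrayList<String> tmp = (ArrayList<String>) ois.readObject();
            list = tmp;
        }

        Configuration conf = new Configuration();
        Path out = new Path("/data/list.seq");          // hypothetical output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < list.size(); i++) {
                writer.append(new IntWritable(i), new Text(list.get(i)));
            }
        }
    }
}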
I want to write multiple objects to a file, but the problem is that I don't have all the objects to write at once. I have to write one object and then close the file, and then maybe after some time I want to add another object to the same file.
I am currently doing it as
new FileOutputStream("filename", true)
so that it will append the object to the end of the file and not overwrite it. But I get this error:
java.io.StreamCorruptedException: invalid type code: AC
Any ideas how I can solve this issue?
Thanks,
One option is to segment the file into individual messages. When you want to write a message, first serialize it to a ByteArrayOutputStream. Then open the file for appending with a DataOutputStream: write the length with writeInt, then write the data.
When you're reading from the stream, you'd open it with a DataInputStream, then repeatedly call readInt to find the length of the next message, then readFully to read the message itself. Put the message into a ByteArrayInputStream and then deserialize from that.
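A rough sketch of that approach (class and method names are just for illustration):

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Sketch of the length-prefixed approach: each object is serialized on its
// own and appended as a [length][bytes] record, so the file can be extended
// later without corrupting the records already written.
public class AppendableObjectFile {

    public static void appendObject(File file, Serializable obj) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(obj);                      // serialize to memory first
        }
        byte[] bytes = buf.toByteArray();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file, true)))) {
            out.writeInt(bytes.length);                // length prefix
            out.write(bytes);                          // then the record itself
        }
    }

    public static List<Object> readAll(File file)
            throws IOException, ClassNotFoundException {
        List<Object> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();             // EOFException signals the end
                } catch (EOFException eof) {
                    break;
                }
                byte[] bytes = new byte[length];
                in.readFully(bytes);
                try (ObjectInputStream ois =
                         new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                    result.add(ois.readObject());      // each record deserializes alone
                }
            }
        }
        return result;
    }
}

Because each record carries its own serialization stream header, appending new records later does not corrupt the ones already written.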
Alternatively, use a nicer serialization format than the built-in Java serialization - I'm a fan of Protocol Buffers but there are lots of alternatives available. The built-in serialization is too brittle for my liking.
You can't append different ObjectOutputStreams to the same file. You would have to use a different form of serialization, or read the file in and write out all the objects plus the new objects to a new file.
You need to serialize/deserialize the List<T>. Take a look at this Stack Overflow thread.