I am learning MapReduce. I'm trying as a test to set up a 'join' algorithm that takes in data from two files (which contain the two data sets to join).
For this to work, the mapper needs to know which file each line is from; this way, it can tag it appropriately, so that the reducer doesn't (for instance) join elements from one data set to other elements from the same set.
To complicate matters, I am using Hadoop Streaming, and the mapper and reducer are written in Python; I understand Java, but the documentation for the Hadoop InputFormat and RecordReader classes is gloriously vague, and I don't understand how I'd make a Streaming-compatible split so that some sort of file identifier could be bundled in along with the data.
Can anyone explain how to set up this input processing in a way that my Python programs can understand?
I found out the answer, by the way. In Python, it's:
import os
# Hadoop Streaming exposes the job configuration to the task as environment
# variables, with dots replaced by underscores.
context = os.environ["map_input_file"]
'context' then holds the name of the input file for the current split. (On newer Hadoop releases the same value should also be available as mapreduce_map_input_file, since the property was renamed to mapreduce.map.input.file.)
I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read the file(s) one by one with a BufferedReader and a custom Parser class that does some heavy computing on the input data. The input files are 1 to at most 2.5 GB in size each.
Store data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory data store as JSON. The JSON files are smaller in size.
I wrote a Scala application that processes my files with a single worker, but that is obviously not the performance I could be getting out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based. I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the Parser. For example it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages can also appear in between this set of needed messages. Most messages can be parsed by reading a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that ensures all my messages can be parsed and the Parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be contained in a single split.
I guess there are several ways to achieve a correct output, but I'll throw the ideas I had into this post as well:
Define a manual split algorithm for the file input? It would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also carry a fixed number of lines (let's say 50, which will definitely be enough) from the end of the previous split over to the next split. Duplicate data will be handled correctly by the Parser class and will not introduce any issues.
The second way might be easier, but I have no clue how to implement it in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines with the 0x033, 0x034, 0x035 and 0x036 messages, and Spark processes them separately, while these lines actually need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blog post, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure that will be the easiest solution, though.
You can try to solve this using a custom FileInputFormat that does the following: instead of handing out the file line by line like the default FileInputFormat does, you parse the file and keep track of the records you encounter (0x033, 0x034, etc.). Meanwhile you can filter out records like 0x0FE (not sure whether you want to use them elsewhere). The result is that Spark gets all of these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose.
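For illustration, here is a minimal sketch of that key-based idea using Spark's Java API, under the assumption that a group key can be derived from each line on its own (extractGroupKey and parseGroup are hypothetical helpers you would write against your message format):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("message-grouping"); // placeholder app name
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input"); // placeholder path
// Tag every physical line with the key of the logical message it belongs to.
// extractGroupKey is a hypothetical helper derived from your message format.
JavaPairRDD<String, Iterable<String>> grouped = lines
        .mapToPair(line -> new Tuple2<>(extractGroupKey(line), line))
        .groupByKey();
// Each group now holds all physical lines of one logical message and can be
// handed to the Parser as a unit. parseGroup is again a hypothetical helper.
JavaRDD<String> parsed = grouped.map(group -> parseGroup(group._2()));

Whether this works out depends on whether a single line carries enough information to compute its group key; if it doesn't, the custom FileInputFormat route is probably the way to go.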
There are certainly other options. Whichever you choose depends on your use case.
As we know, the number of mappers is determined by the input splits. Here is the problem: if I want to implement a random forest algorithm with MapReduce, where each mapper requires all of the data, what should I do in that case? Could we "reuse" the data for different mappers?
Could setNumMapTasks work? I am quite confused about that function, and I could hardly find any information on how it interacts with the natural number of mappers determined by the number of input splits.
Thank you so much.
Side data is data shared by all mappers. You will want to broadcast the data to the mappers as part of the Job setup.
This is accomplished via the DistributedCache https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/filecache/DistributedCache.html .
Here are some code starting points. First, place the files you want to share into the DistributedCache via the Job class:
job.addCacheFile(new URI("<your file location>"));
In the mapper/reducer you can then read the localized copy of the file; since the DistributedCache copies it to the local disk of each task node, plain java.io works:
File file = new File("<my file name>");
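To make that concrete, here is a minimal sketch of a mapper that loads the cached file once in setup(); it assumes the file was added with a URI fragment such as "#sidedata" so it is symlinked into the task's working directory (all names are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Loaded once per task from the distributed cache, shared by all map() calls.
    private final Map<String, String> sideData = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "sidedata" is the link name from job.addCacheFile(new URI(".../file.txt#sidedata"))
        try (BufferedReader reader = new BufferedReader(new FileReader("sidedata"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                sideData.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Every mapper now has the full side data available while processing its own split.
        String joined = sideData.get(value.toString());
        if (joined != null) {
            context.write(value, new Text(joined));
        }
    }
}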
I'm currently working on a MapReduce job processing XML data, and I think there's something about the data flow in Hadoop that I'm not getting right.
I'm running on Amazon's ElasticMapReduce service.
Input data: large files (significantly above 64 MB, so they should be splittable), consisting of a lot of small XML files that were concatenated into one by a previous s3distcp operation.
I am using a slightly modified version of Mahout's XmlInputFormat to extract the individual XML snippets from the input.
As a next step I'd like to parse those XML snippets into business objects, which should then be passed to the mapper.
Now here is where I think I'm missing something: in order for that to work, my business objects need to implement the Writable interface, defining how to read/write an instance from/to a DataInput or DataOutput.
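For illustration, that means each business object needs something along these lines (a minimal hypothetical example, not my actual class):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyBusinessObject implements Writable {
    private long timestamp;
    private String payload;

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order...
        out.writeLong(timestamp);
        out.writeUTF(payload);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and read them back in exactly the same order.
        timestamp = in.readLong();
        payload = in.readUTF();
    }
}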
However, I don't see where this comes into play: the logic needed to read an instance of my object is already in the InputFormat's record reader, so why does the object have to be capable of reading/writing itself?
I did quite a bit of research already and I know (or rather assume) that WritableSerialization is used when transferring data between nodes in the cluster, but I'd like to understand the reasons behind that architecture.
The InputSplits are defined upon job submission, so if the name node sees that data needs to be moved to a specific node for a map task to work, would it not be sufficient to simply send the raw data as a byte stream? Why do we need to decode it into Writables if the RecordReader of our input format does the same thing anyway?
I really hope someone can show me the error in my thoughts above. Many thanks in advance!
I'm a newbie to Hadoop. I did the setup and executed the basic word count Java program. The results look good.
My question: is it possible to parse an extremely big log file with the map/reduce classes to fetch only a few required lines? Or is some other step required?
Any pointers in this direction will be very useful.
Thanks, Aarthi
Yes, it is entirely possible, and if the file is sufficiently large, I believe Hadoop could prove to be a good way to tackle it, despite what nhahtdh says.
Your mappers could simply act as filters: check the values passed to them, and only call context.write() for the lines that fit your conditions.
You won't even need to write your own reducer; just use the default reduce() in the Reducer class.
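As a rough sketch (the "ERROR" check is just a placeholder for whatever condition identifies your required lines):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit only the lines that match the condition; everything else is dropped.
        if (line.toString().contains("ERROR")) {
            context.write(line, NullWritable.get());
        }
    }
}

If you don't need any aggregation at all, you can also set the number of reduce tasks to zero (job.setNumReduceTasks(0)) and the mapper output will be written out directly.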
I am trying to read an HBase table with TableMapReduceUtil and dump the data into HDFS (don't ask me why; it is weird, but I don't have any other option). To achieve that, I want to manipulate the final file names (emitted by the reducer) with respect to the reducer key.
On the mapper side I was able to dump the HBase rows to HDFS in the default order. But to override the reducer output file format (named per key), I found the MultipleOutputFormat class for the reducer (which is absent on 0.20 due to some interface mess-up, as I read somewhere), and the old one takes only JobConf. But if I try to write the code with the old JobConf, I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
I don't have much hands-on experience with Hadoop/HBase; I have spent some time modifying existing MR jobs.
It seems I am stuck with my approach.
Versions: Hadoop-Core-0.20.; HBase 0.90.1
Thanks
Pankaj
I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
There are org.apache.hadoop.hbase.mapred.TableMapReduceUtil and org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil classes. The first will take JobConf (old MR API) and the second will take Job (new MR API). Use the appropriate TableMapReduceUtil class.
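For example, staying on the new API, the job wiring looks roughly like this (a sketch; the table name, mapper class and output types are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "hbase-to-hdfs"); // new MR API uses Job, not JobConf
Scan scan = new Scan();
// MyTableMapper would extend org.apache.hadoop.hbase.mapreduce.TableMapper (placeholder class)
TableMapReduceUtil.initTableMapperJob("my_table", scan, MyTableMapper.class,
        Text.class, Text.class, job);

The old-API class in org.apache.hadoop.hbase.mapred offers equivalent helpers but works with JobConf, so you cannot mix the two APIs in one job; pick one side and stay with it throughout.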