As we know, the number of mappers is determined by the number of input splits. The problem is: if I want to implement a random forest algorithm with MapReduce, each mapper requires all of the data. What should I do in that case? Can the data be "reused" across different mappers?
Would setNumMapTasks work? I am quite confused about that method, and I can hardly find any information on how it interacts with the natural number of mappers determined by the number of input splits.
Thank you so much.
Side data is data shared by all mappers. You will want to broadcast the data to the mappers as part of the Job setup.
This is accomplished via the DistributedCache https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/filecache/DistributedCache.html .
Here are some code starting points. First, place the files you want to share in the DistributedCache via the Job class:
job.addCacheFile(new URI("<your file location>"));
In the mapper/reducer you then access the localized file via the normal Java file API:
File file = new File("<my file name>");
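To make this concrete, here is a minimal sketch of a mapper that loads a cached side-data file once per task. The file name lookup.txt, the HDFS path, and the key/value types are placeholders I made up, not something from the original question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Driver side (for reference): the "#lookup.txt" fragment asks Hadoop to create
    // a symlink with that name in each task's working directory.
    //   job.addCacheFile(new URI("hdfs:///shared/lookup.txt#lookup.txt"));

    private final List<String> sideData = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the localized copy once per task; every mapper sees the full file
        // regardless of which input split it was assigned.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sideData.add(line);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Placeholder logic: emit the input line together with the amount of side data seen.
        context.write(value, new Text("side rows: " + sideData.size()));
    }
}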
In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources next to the src folder.
The src-folder contains the main-class and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists())
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class I made called 'SensorValue':
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings into SensorValue objects using a ParDo function (going from a PCollection of Strings to a PCollection of SensorValues), the map is used in the constructor of the SensorValue class.
Running this code with a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice is for using static resource files to enrich a pipeline.
You're right that the problem is that the execution of the ParDo is distributed across multiple workers. They don't have the local file, and they don't have the contents of the map either.
There are a few options here:
Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side-input to your later processing.
Include the file in the resources for the pipeline and load that in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
You could serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of this file increases (since it can support splitting it into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service.
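For completeness, here is a minimal sketch of option 3 against the pre-Beam Dataflow 1.x SDK. The ParseSensorValueFn name and the SensorValue constructor that accepts the map directly are assumptions made for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import com.google.cloud.dataflow.sdk.transforms.DoFn;

public class ParseSensorValueFn extends DoFn<String, SensorValue> {

    // Non-static field: it is serialized together with the DoFn and shipped to every worker.
    private final HashMap<String, ArrayList<Double>> sensorToCoordinates;

    public ParseSensorValueFn(Map<String, ArrayList<Double>> sensorToCoordinates) {
        // Copy into a concrete serializable type so the data survives serialization.
        this.sensorToCoordinates = new HashMap<>(sensorToCoordinates);
    }

    @Override
    public void processElement(ProcessContext c) {
        // Assumes a SensorValue constructor that takes the map as an argument,
        // instead of reading the static field that is never set on the workers.
        c.output(new SensorValue(c.element(), sensorToCoordinates));
    }
}

You would then build the transform as ParDo.of(new ParseSensorValueFn(coordinateData.getSensorLocations())), so the map is captured before the pipeline is submitted.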
I want to store my blobs outside of the database in files; however, they are just random blobs of data and aren't directly linked to a file.
So for example I have a table called Data with the following columns:
id
name
comments
...
I can't just include a column called fileLink or something like that, because the blob is just raw data. I do, however, want to store it outside of the database. I would love to create a file called 3.dat, where 3 is the id number for that row entry. The only problem with this setup is that the main folder will quickly accumulate a large number of files, since an id-per-file layout is a flat folder structure, and that will run into OS/file-system issues. And no, the data is not grouped or structured; it's one massive list.
Is there a Java framework or library that will let me store and manage the blobs so that I can just do something like MyBlobAPI.saveBlob(id, data); and then MyBlobAPI.getBlob(id), and so on? In other words, something where all the file I/O is handled for me?
Simply use an appropriate database which implements blobs as you described, and use JDBC. You really are not looking for another API but for a specific implementation. It's up to the DB to take care of storing blobs efficiently.
I think a home-rolled solution will include something like a fileLink column in your table, and your API will create the file on the first save and then write to that file on update.
I don't know of any code base that will do this for you. There are a bunch that provide an in-memory file system for Java, but it's only a few lines of code to write something that writes and reads Java objects to and from a file.
You'll have to handle any file-system limitations yourself, though I doubt you'll ever run into the limits of modern file systems like btrfs or ZFS. FAT32 is limited to about 65K files per directory, but even last-generation file systems support something on the order of 4 billion files per directory.
So by all means, write a class with two functions: one to serialize an object to a file, giving it a unique key as a name, and another to deserialize the object by that key. If you are using a modern file system, you'll never run out of resources.
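As a rough illustration of that two-function class (the fan-out layout and the class name are my own assumptions, not an existing library):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileBlobStore {

    private final Path root;

    public FileBlobStore(Path root) {
        this.root = root;
    }

    // Shard ids across 100 subdirectories so no single folder collects millions of files,
    // e.g. id 123456 -> <root>/56/123456.dat
    private Path pathFor(long id) {
        return root.resolve(String.format("%02d", id % 100)).resolve(id + ".dat");
    }

    public void saveBlob(long id, byte[] data) throws IOException {
        Path target = pathFor(id);
        Files.createDirectories(target.getParent());
        Files.write(target, data);
    }

    public byte[] getBlob(long id) throws IOException {
        return Files.readAllBytes(pathFor(id));
    }
}

Usage would be along the lines of new FileBlobStore(Paths.get("/data/blobs")).saveBlob(3, data), which matches the MyBlobAPI.saveBlob(id, data) shape asked for above.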
As far as I can tell there is no framework for this. The closest I could find was Hadoop's HDFS.
That being said, the advice of just putting the BLOBs into the database, as in the other answers, is not always advisable. Sometimes it's good and sometimes it's not; it really depends on your situation. Here are a few links to such discussions:
Storing Images in DB - Yea or Nay?
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
I did find some additional really good links, but I can't remember them offhand. There was one in particular on StackOverflow, but I can't find it. If you believe you know the link, please add it in the comments so that I can confirm it's the right one.
I'm currently working on a MapReduce job processing XML data, and I think there's something about the data flow in Hadoop that I'm not getting.
I'm running on Amazon's Elastic MapReduce service.
Input data: large files (significantly above 64 MB, so they should be splittable), consisting of many small XML files that a previous s3distcp operation has concatenated into one file.
I am using a slightly modified version of Mahout's XmlInputFormat to extract the individual XML snippets from the input.
As a next step I'd like to parse those XML snippets into business objects, which should then be passed to the mapper.
Now here is where I think I'm missing something: in order for that to work, my business objects need to implement the Writable interface, defining how to read/write an instance from/to a DataInput or DataOutput.
However, I don't see where this comes into play. The logic needed to read an instance of my object is already in the InputFormat's record reader, so why does the object have to be capable of reading/writing itself?
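For reference, this is roughly the kind of boilerplate I mean; the class and its fields are just made up for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ProductRecord implements Writable {

    private String sku;
    private double price;

    // Writables need a no-arg constructor so Hadoop can instantiate them before readFields().
    public ProductRecord() { }

    public ProductRecord(String sku, double price) {
        this.sku = sku;
        this.price = price;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(sku);
        out.writeDouble(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sku = in.readUTF();
        price = in.readDouble();
    }
}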
I have already done quite a bit of research, and I know (or rather assume) that WritableSerialization is used when transferring data between nodes in the cluster, but I'd like to understand the reasons behind that architecture.
The InputSplits are defined upon job submission, so if the name node sees that data needs to be moved to a specific node for a map task to work, would it not be sufficient to simply send the raw data as a byte stream? Why do we need to decode it into Writables if the RecordReader of our input format does the same thing anyway?
I really hope someone can show me the error in my thoughts above, many thanks in advance!
I am learning MapReduce. As a test, I'm trying to set up a 'join' algorithm that takes in data from two files (which contain the two data sets to join).
For this to work, the mapper needs to know which file each line is from; this way, it can tag it appropriately, so that the reducer doesn't (for instance) join elements from one data set to other elements from the same set.
To complicate matters, I am using Hadoop Streaming, and the mapper and reducer are written in Python. I understand Java, but the documentation for the Hadoop InputFormat and RecordReader classes is gloriously vague, and I don't understand how I'd make a Streaming-compatible split so that some sort of file identifier could be bundled in along with the data.
Can anyone explain how to set up this input processing in a way that my Python programs can understand?
I found the answer, by the way. In Python, it's:
import os
# set by Hadoop Streaming; newer Hadoop versions expose it as "mapreduce_map_input_file"
context = os.environ["map_input_file"]
And 'context' then has the input file name.
I am trying to read an HBase table using TableMapReduceUtil and dump the data into HDFS (don't ask me why; it is weird, but I don't have any other option). To achieve that, I want to manipulate the final file names (emitted by the reducer) with respect to the reducer key.
On the mapper side I was able to dump HBase rows to HDFS in the default order. But to override the reducer's output file format (naming files per key), I figured I need the MultipleOutputFormat class for the reducer (which is absent in 0.20 due to some interface mess-up, as I read somewhere), and the old one takes only JobConf. But if I write the code against the old JobConf API, I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
I don't have much hands-on experience with Hadoop/HBase; I have only spent some time modifying existing MR jobs.
It seems I am stuck with my approach.
Versions: Hadoop Core 0.20; HBase 0.90.1
Thanks
Pankaj
I am not able to use HBase 0.90's TableMapReduceUtil, which only takes the Job class.
There are two classes: org.apache.hadoop.hbase.mapred.TableMapReduceUtil and org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil. The first takes a JobConf (the old MR API) and the second takes a Job (the new MR API). Use the TableMapReduceUtil class that matches the API you are writing against.
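To illustrate, a minimal map-only dump with the new API might look like the sketch below. The table name, output path, and the decision to emit only row keys are placeholders, and the per-key file naming part of your question is not covered here:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DumpTableToHdfs {

    // A trivial mapper that writes each row key out as text.
    static class RowKeyMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.getRow()), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "dump-hbase-table");  // new-API Job, as expected by o.a.h.hbase.mapreduce
        job.setJarByClass(DumpTableToHdfs.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching helps full-table MR scans
        scan.setCacheBlocks(false);  // don't pollute the block cache from an MR job

        // Note the package: org.apache.hadoop.hbase.mapreduce, the new-API helper.
        TableMapReduceUtil.initTableMapperJob(
                "my_table", scan, RowKeyMapper.class, Text.class, Text.class, job);

        job.setNumReduceTasks(0);  // map-only dump straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/my_table_dump"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}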