Matrix computation using Hadoop MapReduce - Java

I have a matrix with around 10,000 rows. I wrote code that takes one row per iteration, does some long matrix computations and returns one double per row of the matrix. Since the number of operations per row is very large, running the code takes a long time. I'm thinking of implementing it with MapReduce, but I'm not sure whether that is possible. The main idea is to split the matrix rows across different nodes, run the jobs independently and combine the outputs into a list of numbers. Based on my understanding, a mapper alone could do this job. Am I right? Is it possible, or is there a better idea? Thanks in advance. By the way, the code is in Java.

This seems possible - some points for consideration:
You might want to run an identity mapper (one which passes each input record to the reducer) and do the row calculation in the reducer. Doing the calculation map-side will probably still cause all the calculations to be done on a single node (it's quite possible that your 10,000-row matrix is smaller than a single input split).
You'll want to run a large number of reducers to ensure the job is parallelized across your cluster nodes. The default partitioner will handle sending the input rows to different reducers (assuming your rows are not fixed width, in which case you should run a custom mapper that uses a counter as the output key, instead of the default byte offset of the input row).
To bring all the results back together you'll need to run a second MR job with a single reducer.
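A minimal sketch of that layout, assuming the matrix is stored one row per text line and with computeRow() as a placeholder for your per-row math (the two classes would go in separate files):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// RowMapper.java - identity-style mapper: passes each row (one text line) straight
// through, keyed by its byte offset, so the default hash partitioner spreads the
// rows over the reducers.
public class RowMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text rowLine, Context context)
            throws IOException, InterruptedException {
        context.write(offset, rowLine);
    }
}

// RowReducer.java - does the expensive per-row computation and emits one double per row.
public class RowReducer extends Reducer<LongWritable, Text, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable rowKey, Iterable<Text> rows, Context context)
            throws IOException, InterruptedException {
        for (Text row : rows) {
            double result = computeRow(row.toString()); // placeholder for your matrix math
            context.write(rowKey, new DoubleWritable(result));
        }
    }

    private double computeRow(String row) {
        return 0.0; // hypothetical: parse the row and run the long computation here
    }
}
The heavy lifting then happens reduce-side, and job.setNumReduceTasks(...) controls how widely it is spread across the cluster.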

Related

How to count input and output rows on the Spark SQL API from Java?

I am trying to count the number of rows that a Java process reads and writes. The process uses the SQL API, dealing with Datasets of Row. Adding .count() at various points seems to slow it down a lot, even if I do a .persist() prior to those points.
I have also seen code that does a
.map(row -> {
    accumulator.add(1);
    return row;
}, SomeEncoder)
which works well enough, but the deserialization and re-serialization of the whole row seems unnecessary, and it isn't exactly hands-off since one has to come up with the correct SomeEncoder at each point.
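Spelled out, the version I have in mind looks roughly like this (paths and names are hypothetical, and it assumes Spark 2.x/3.x where RowEncoder.apply is available):
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.util.LongAccumulator;

public class RowCountingJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("row-counting").getOrCreate();
        LongAccumulator rowsRead = spark.sparkContext().longAccumulator("rowsRead");

        Dataset<Row> input = spark.read().parquet(args[0]);      // hypothetical input path
        Dataset<Row> counted = input.map(
                (MapFunction<Row, Row>) row -> {
                    rowsRead.add(1L);   // side effect only; the row passes through unchanged
                    return row;
                },
                RowEncoder.apply(input.schema()));               // reuse the existing schema

        counted.write().parquet(args[1]);                        // the write is the action that runs the job
        System.out.println("rows read: " + rowsRead.value());    // meaningful only after an action
        spark.stop();
    }
}
One caveat: accumulator updates can be applied more than once if tasks are retried or run speculatively, so treat the value as record-keeping rather than an exact audit.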
A third option might be to call a UDF0 that does the counting and then drop the dummy object it would return, but I'm not sure whether Spark would be allowed to optimize the whole thing away if it can tell the UDF0 isn't changing the output.
Is there a good way of counting without deserializing the rows? Or alternatively, is there a method that does the equivalent of Java's streams' .peek() where the returned data isn't important?
EDIT: to clarify, the job isn't just counting. The counting is just for record-keeping purposes. The job is doing other things. In fact, this is a pretty generic problem: I've got lots of jobs that do some transformations on data and save them somewhere, and I just want to keep a running record of how many rows these jobs read and wrote.
Thank you

How to make a matrix multiplication faster and managable?

I am trying to multiply two large matrices in the most efficient way. On one hand I have a matrix with dimensions 8,000 x 20,000, and on the other hand one with dimensions 35,000,000 x 20,000. The columns of the two matrices correspond: the 20,000 columns are in the same order and identical for both. Both matrices are very sparse and hold boolean (binary) values. By multiplying them, I am trying to get the total number of common 1s for each pair of rows.
I tried MATLAB for this, but it was not possible to multiply them due to an out-of-memory issue. So I partitioned the larger matrix into smaller chunks, say 1,000,000 x 200.
After applying this separation process I managed to multiply them, but it took about 5 hours, even though MATLAB multi-threads this multiplication automatically.
I then loaded these matrices into my Java code. I was wondering whether there might be a faster way to do this. For example, would it make sense to use Hadoop from Java and do the processing there? Or is there any other suggestion?
Thanks in advance.
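For reference, the per-row-pair operation I need boils down to a sorted-list intersection; in plain Java, assuming each sparse binary row were stored as a sorted int[] of the column indices that are 1, it would be something like:
public class SparseRowOps {
    // Counts how many column indices two sparse binary rows have in common,
    // assuming both index arrays are sorted in ascending order.
    public static int commonOnes(int[] rowA, int[] rowB) {
        int i = 0, j = 0, common = 0;
        while (i < rowA.length && j < rowB.length) {
            if (rowA[i] == rowB[j]) {
                common++;
                i++;
                j++;
            } else if (rowA[i] < rowB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return common;
    }
}
Looping this over the 8,000 rows of the smaller matrix for each row of the large one at least avoids materializing the huge dense product.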

In Hadoop Map-Reduce, does any class see the whole list of keys after sorting and before partitioning?

I am using Hadoop to analyze a very uneven distribution of data. Some keys have thousands of values, but most have only one. For example, network traffic associated with IP addresses would have many packets associated with a few talkative IPs and just a few with most IPs. Another way of saying this is that the Gini index is very high.
To process this efficiently, each reducer should either get a few high-volume keys or a lot of low-volume keys, in such a way as to get a roughly even load. I know how I would do this if I were writing the partition process: I would take the sorted list of keys (including all duplicate keys) that was produced by the mappers as well as the number of reducers N and put splits at
split[i] = keys[floor(i*len(keys)/N)]
Reducer i would get keys k such that split[i] <= k < split[i+1] for 0 <= i < N-1 and split[i] <= k for i == N-1.
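In code, what I have in mind is something like this hypothetical helper (assuming I could somehow get at the full sorted key list):
// Picks N split points from the sorted key list (duplicates included),
// following split[i] = keys[floor(i * len(keys) / N)].
static String[] computeSplits(java.util.List<String> sortedKeys, int numReducers) {
    String[] splits = new String[numReducers];
    for (int i = 0; i < numReducers; i++) {
        int idx = (int) Math.floor((double) i * sortedKeys.size() / numReducers);
        splits[i] = sortedKeys.get(idx);
    }
    return splits;
}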
I'm willing to write my own partitioner in Java, but the Partitioner<KEY,VALUE> class only seems to have access to one key-value record at a time, not the whole list. I know that Hadoop sorts the records that were produced by the mappers, so this list must exist somewhere. It might be distributed among several partitioner nodes, in which case I would do the splitting procedure on one of the sublists and somehow communicate the result to all other partitioner nodes. (Assuming that the chosen partitioner node sees a randomized subset, the result would still be approximately load-balanced.) Does anyone know where the sorted list of keys is stored, and how to access it?
I don't want to write two map-reduce jobs, one to find the splits and another to actually use them, because that seems wasteful. (The mappers would have to do the same job twice.) This seems like a general problem: uneven distributions are pretty common.
I've been thinking about this problem, too. This is the high-level approach I would take if someone forced me.
In addition to the mapper logic you have in place to solve your business problem, code some logic to gather whatever statistics you'll need in the partitioner to distribute key-value pairs in a balanced manner. Of course, each mapper will only see some of the data.
Each mapper can find out its own task ID and use that ID to build a unique file name in a specified HDFS folder to hold the gathered statistics. Write this file out in the cleanup() method, which runs at the end of the task.
Use lazy initialization in the partitioner to read all files in that HDFS directory. This gets you all of the statistics gathered during the mapper phase. From there you're left with implementing whatever partitioning logic you need to correctly partition the data.
This all assumes that the partitioner isn't called until all mappers have finished, but that's the best I've been able to do so far.
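A rough sketch of the mapper side of this, assuming a hypothetical shared stats directory /stats and a per-key count kept in a HashMap:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatsGatheringMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Long> keyCounts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = extractKey(line.toString());   // your business logic
        keyCounts.merge(key, 1L, Long::sum);         // local statistics for the partitioner
        context.write(new Text(key), line);          // normal map output
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // One stats file per map task, named by its task ID, in a shared folder.
        String taskId = context.getTaskAttemptID().getTaskID().toString();
        Path statsFile = new Path("/stats/" + taskId);   // hypothetical location
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataOutputStream out = fs.create(statsFile)) {
            for (Map.Entry<String, Long> e : keyCounts.entrySet()) {
                out.writeBytes(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
    }

    private String extractKey(String line) {
        return line.split("\t")[0];                  // placeholder
    }
}
The partitioner then lazily reads every file under /stats the first time getPartition() is called, as described above.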
To the best of my understanding, there is no single place in MR processing where all the keys are present. More than that, there is no guarantee that a single machine could store this data.
I think this problem does not have an ideal solution in the current MR framework. I think so because, to have an ideal solution, we would have to wait for the end of the last mapper and only then analyze the key distribution and parameterize the partitioner with this knowledge.
This approach would significantly complicate the system and raise latency.
I think a good approximation might be to do random sampling over the data to get an idea of the key distribution and then make the partitioner work according to it.
As far as I understand, the TeraSort implementation does something very similar: http://sortbenchmark.org/YahooHadoop.pdf
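For what it's worth, Hadoop already ships the sampling machinery TeraSort uses: InputSampler plus TotalOrderPartitioner. A driver-side sketch (assuming Text keys; note the sampler draws from the map input, so it fits best when the map output keys are essentially the input keys):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class BalancedPartitioningDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "balanced-partitioning");
        // ... set input/output paths and formats, mapper, reducer, key/value classes ...
        job.setNumReduceTasks(20);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Split points computed from a random sample of the keys land in this file.
        Path partitionFile = new Path("/tmp/_partitions.lst");   // hypothetical path
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

        // Sample with probability 0.01, up to 10,000 keys, from at most 100 input splits.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.01, 10000, 100);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This balances reducers by sampled key ranges rather than by exact counts, but in practice that is usually a good approximation.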

Mapper and Reducer for K means algorithm in Hadoop in Java

I am trying to implement k-means on hadoop-1.0.1 in Java. I am getting frustrated now. I found a GitHub link to a complete implementation of k-means, but as a newbie in Hadoop I want to learn it without copying other people's code. I have basic knowledge of the map and reduce functions available in Hadoop. Can somebody give me an idea of how to implement the k-means mapper and reducer classes? Does it require iteration?
OK, I'll give it a go and tell you what I thought about when implementing k-means in MapReduce.
This implementation differs from that of Mahout, mainly because it is meant to show how the algorithm could work in a distributed setup (and not for real production usage).
I also assume that you really know how k-means works.
That said, we have to divide the whole algorithm into three main stages:
Job level
Map level
Reduce level
The Job Level
The job level is fairly simple: it writes the input (key = a class called ClusterCenter, value = a class called VectorWritable), handles the iteration with Hadoop jobs and reads the output of the whole job.
VectorWritable is a serializable implementation of a vector, in this case from my own math library, but really nothing more than a simple double array.
The ClusterCenter is mainly a VectorWritable, but with the convenience functions that a center usually needs (averaging, for example).
In k-means you have a seed set of k vectors that are your initial centers and some input vectors that you want to cluster. That is exactly the same in MapReduce, but I write them to two different files. The first file contains only the vectors with a dummy center as the key, and the other file contains the real initial centers (namely cen.seq).
After all that is written to disk you can start your first job. This will of course first launch a Mapper which is the next topic.
The Map Level
In MapReduce it is always smart to know what is coming in and what is going out (in terms of objects).
So from the job level we know that we have ClusterCenter and VectorWritable as input, where the ClusterCenter is currently just a dummy. We want the same types as output, because the map stage is the famous assignment step of normal k-means.
You read the real centers file you created at the job level into memory, to compare the input vectors against the centers. For that you need a distance metric; in the mapper it is hardcoded to the Manhattan distance.
To be a bit more specific, you get a part of your input in the map stage and iterate over each input key-value pair (a pair or tuple consisting of a key and a value), comparing the vector with each of the centers. You track which center is the nearest and then assign the vector to it by writing the nearest ClusterCenter object, along with the input vector itself, to disk.
Your output is then: n vectors, each along with its assigned center (as the key).
Hadoop is now sorting and grouping by your key, so you get every assigned vector for a single center in the reduce task.
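Roughly, the mapper then looks like this sketch (ClusterCenter and VectorWritable stand in for the classes described above; the constructors, getVector() and the centroid.path setting are my assumptions, not the exact original code):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper
        extends Mapper<ClusterCenter, VectorWritable, ClusterCenter, VectorWritable> {

    private final List<ClusterCenter> centers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the current centers (cen.seq) from HDFS into memory, once per task.
        Configuration conf = context.getConfiguration();
        Path centerPath = new Path(conf.get("centroid.path"));   // set by the driver
        try (SequenceFile.Reader reader =
                     new SequenceFile.Reader(FileSystem.get(conf), centerPath, conf)) {
            ClusterCenter center = new ClusterCenter();
            IntWritable dummy = new IntWritable();
            while (reader.next(center, dummy)) {
                centers.add(new ClusterCenter(center));          // copy, since Hadoop reuses objects
            }
        }
    }

    @Override
    protected void map(ClusterCenter dummyKey, VectorWritable vector, Context context)
            throws IOException, InterruptedException {
        // Assignment step: find the nearest center by Manhattan distance.
        ClusterCenter nearest = null;
        double best = Double.MAX_VALUE;
        for (ClusterCenter c : centers) {
            double d = manhattanDistance(c.getVector(), vector.getVector());
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        context.write(nearest, vector);
    }

    private static double manhattanDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}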
The Reduce Level
As mentioned above, you will have a ClusterCenter and its assigned VectorWritables in the reduce stage.
This is the usual update step you have in normal k-means: you simply iterate over all vectors, sum them up and average them.
Now you have a new "mean", which you can compare to the mean it was assigned to before. Here you can measure the difference between the two centers, which tells you how much the center moved. Ideally it wouldn't have moved at all and would have converged.
A counter in Hadoop is used to track this convergence; the name is a bit misleading because it actually tracks how many centers have not converged to a final point, but I hope you can live with that.
Basically you now write the new center and all the vectors to disk again for the next iteration. In addition, in the cleanup step, you write all the newly gathered centers to the path used in the map step, so the next iteration has the new centers.
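And the reducer's update step in the same spirit (again a sketch with placeholder class names; the counter enum, distance() and the copy constructor are assumptions of mine):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer
        extends Reducer<ClusterCenter, VectorWritable, ClusterCenter, VectorWritable> {

    public enum KMeansCounter { NOT_CONVERGED }   // hypothetical counter enum

    @Override
    protected void reduce(ClusterCenter oldCenter, Iterable<VectorWritable> vectors,
                          Context context) throws IOException, InterruptedException {
        // Update step: average all vectors assigned to this center.
        List<VectorWritable> assigned = new ArrayList<>();
        double[] sum = null;
        for (VectorWritable v : vectors) {
            double[] vec = v.getVector();
            if (sum == null) {
                sum = new double[vec.length];
            }
            for (int i = 0; i < vec.length; i++) {
                sum[i] += vec[i];
            }
            assigned.add(new VectorWritable(v));   // copy, since Hadoop reuses the object
        }
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= assigned.size();
        }
        ClusterCenter newCenter = new ClusterCenter(sum);

        // If the center still moved, count it as not yet converged.
        if (newCenter.distance(oldCenter) > 0.0) {
            context.getCounter(KMeansCounter.NOT_CONVERGED).increment(1);
        }

        // Write the vectors out again, keyed by their updated center, for the next iteration.
        for (VectorWritable v : assigned) {
            context.write(newCenter, v);
        }
        // In cleanup() (not shown) the new centers are also written to the centers path,
        // so the next iteration's mappers pick them up.
    }
}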
Back at the job level, the MapReduce job should be done by now. We then inspect that job's counter to get the number of centers that haven't converged yet.
This counter is used in a while loop to determine whether the whole algorithm can come to an end or not.
If not, go back to the Map Level paragraph, but use the output of the previous job as the input.
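At the job level, the whole thing is then driven by a loop along these lines (only a sketch; the counter, class and path names are the placeholders from the sketches above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        long notConverged = Long.MAX_VALUE;
        int iteration = 0;
        Path input = new Path("files/clustering/depth_0");   // initial vectors with dummy centers

        while (notConverged > 0) {
            Configuration conf = new Configuration();
            conf.set("centroid.path", "files/clustering/cen.seq");   // read by the mapper's setup()
            Job job = new Job(conf, "k-means iteration " + iteration);
            job.setJarByClass(KMeansDriver.class);
            job.setMapperClass(KMeansMapper.class);
            job.setReducerClass(KMeansReducer.class);
            job.setOutputKeyClass(ClusterCenter.class);
            job.setOutputValueClass(VectorWritable.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);

            Path output = new Path("files/clustering/depth_" + (iteration + 1));
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            job.waitForCompletion(true);

            // How many centers still moved in this pass?
            notConverged = job.getCounters()
                    .findCounter(KMeansReducer.KMeansCounter.NOT_CONVERGED).getValue();
            input = output;   // the output of this pass is the input of the next one
            iteration++;
        }
    }
}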
Actually, that was the whole voodoo.
For obvious reasons this shouldn't be used in production, because its performance is horrible. Better to use the more tuned version in Mahout. But for educational purposes this algorithm is fine ;)
If you have any more questions, feel free to write me a mail or comment.

Hadoop: Is it possible to have in-memory structures in the map function and aggregate them?

I am currently reading a paper and I have come to a point where the writers say that they keep some arrays in memory for every map task, and when the map task ends, they output those arrays.
This is the paper that i am referring to : http://research.google.com/pubs/pub36296.html
This looks like a somewhat non-MapReduce thing to do, but I am trying to implement this project and I have come to a point where this is the only solution. I have tried many ways to follow the common MapReduce philosophy, which is to process each line and output a key-value pair, but that way I have many thousands of context writes for every line of input and it takes a long time to write them. So my map task is the bottleneck; these context writes cost a lot.
If I do it their way, I will have managed to reduce the number of key-value pairs dramatically. So I need to find a way to have in-memory structures for every map task.
I can define these structures as static in the setup function, but I can't find a way to tell when the map task ends, so that I can output that structure. I know it sounds a bit weird, but it is the only way to work efficiently.
This is what they say in that paper
On startup, each mapper loads the set of split points to be considered for each ordered attribute. For each node n ∈ N and attribute X, the mapper maintains a table T_{n,X} of key-value pairs. After processing all input data, the mappers output keys of the form (n, X) and values (v, T_{n,X}[v]).
Here are some edits after Sean's answer:
I am using a combiner in my job. The thing is that these context.write(Text, Text) calls in my map function are really time consuming. My input is CSV or ARFF files. Every line is one example, and my examples might have up to thousands of attributes. For every attribute I output key-value pairs of the form <(n,X,u),Y>, where n is the name of the node (I am building a decision tree), X is the name of the attribute, u is the value of the attribute and Y is some statistic in Text format. As you can tell, if I have 100,000 attributes, I have to make 100,000 context.write(Text, Text) calls for every example. Without these calls my map task runs like the wind; with them it takes forever, even for a 2,000-attribute training set. It really seems like I am writing to files and not to memory. So I really need to reduce those writes. Aggregating them in memory (in the map function and not in the combiner) is necessary.
Adding a different answer, since I think I see the point of the question now.
To know when the map task ends, you can override close(). I don't know if this is what you want. If you have 50 mappers, the 1/50th of the input that each one sees is not known or guaranteed. Is that OK for your use case -- you just need each worker to aggregate stats in memory for what it has seen and output them?
Then your procedure is fine, but I probably would not make the in-memory data structure static -- nobody said two Mappers won't run in the same JVM/classloader.
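Concretely, that in-memory pattern in a new-API Mapper would look something like this sketch (the key layout and the statistics kept per key are hypothetical, just to show the shape; cleanup() plays the role of close()):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AggregatingMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Per-task, in-memory aggregation table: (n, X, u) -> partial statistics.
    private final Map<String, long[]> stats = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] attributes = line.toString().split(",");
        for (int x = 0; x < attributes.length - 1; x++) {
            String key = "n0," + x + "," + attributes[x];          // hypothetical key layout
            long[] s = stats.computeIfAbsent(key, k -> new long[2]);
            s[0]++;                                                 // e.g. count of examples
            s[1] += "1".equals(attributes[attributes.length - 1]) ? 1 : 0;   // e.g. positive labels
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // One write per aggregated key instead of one per (example, attribute) pair.
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            context.write(new Text(e.getKey()),
                          new Text(e.getValue()[0] + "," + e.getValue()[1]));
        }
    }
}
If the number of distinct keys gets large you still have to watch heap usage, or flush the map periodically.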
A more common version of this pattern plays out in the Reducer, where you need to collect info over some known subset of the keys coming in before you can produce one record. You can use a partitioner, and the fact that the keys are sorted, to know that you are seeing all of that subset on one worker, and you know when it's done because a new, different subset appears. Then it's pretty easy to collect data in memory while processing a subset, output the result and clear it when a new subset comes in.
I am not sure that works here since the bottleneck happens before the Reducer.
Without knowing a bit more about the details of what you are outputting, I can't be certain this will help, but this sounds like exactly what a combiner is designed to help with. It is like a miniature reducer (in fact, a combiner implementation is just another implementation of Reducer) attached to the output of a Mapper. Its purpose is to collect map output records in memory and try to aggregate them before they are written to disk and then collected by the Reducer.
The classic example is counting values. You can output "key,1" from your map and then add up the 1s in a reducer, but this involves outputting "key,1" 1000 times from a mapper if the key appears 1000 times, when "key,1000" would suffice. A combiner does that. Of course it only applies when the operation in question is associative/commutative and can be run repeatedly with no side effects -- addition is a good example.
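For the counting example, a combiner really is just another Reducer wired into the job; a minimal sketch:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums partial counts map-side; the same class can also serve as the final reducer,
// since addition is associative and commutative.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// In the driver: job.setCombinerClass(SumCombiner.class);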
Another answer: in Mahout we implement a lot of stuff that is both weird, a bit complex, and very slow if done the simple way. Pulling tricks like collecting data in memory in a Mapper is a minor and sometimes necessary sin, so, nothing really wrong with it. It does mean you really need to know the semantics that Hadoop guarantees, test well, and think about running out of memory if not careful.
