I am working on a MapReduce program and was thinking about designing computations of the following form, where a1, b1, ... are the values associated with a key:
a1/b1, (a1+a2)/(b1+b2), (a1+a2+a3)/(b1+b2+b3), ...
So at every stage the reducer would require the previous values.
How would one design this as a MapReduce job, given that at every stage only the values associated with a particular key can be read?
If you feel the question is not clear, can you guide me towards this more general question?
More general question: how would one compute a Fibonacci series, which is defined by recursion, in MapReduce?
EDIT
Can you help me with my modified design? The input is:
key1, V1,V2,V3
Key2, V4,V5,V6
Mapper output
Key1_X V1
Key1_Y V2
Key2_X V4
Key2_Y V5
Reducer output
Key1_X {V1,.....}
Key1_Y {V2,.....}
and similarly for the other keys. Now, in the next mapper stage, can I create a list like this:
key1 {V1,....} {V2,....}
Key2 {V4,....} {V5,....}
My reason for doing this is to compute:
Key1 {V1/V2, (V1+V6)/(V2+V7), (V1+V6+...)/(V2+V7+...), ...}
Is it possible to do this? Because the data set is very large, I think it will be better to use MapReduce.
Will changing the design help make it more efficient?
The main problem with Fibonacci (and, as you indicated, with your specific problem too) is the dependence between all terms in the series.
You cannot calculate the later terms without calculating the earlier terms first.
MapReduce is very good if and only if you can split your job into independent pieces.
I don't see an easy way to do this.
So any construct "forcing" MapReduce to solve this will break the scalability advantages. Hence a simple, highly optimized loop in your favorite programming language will outperform any MapReduce algorithm.
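For reference, that "simple highly optimized loop" is nothing more than this (a plain iterative sketch, not anyone's production code):

    // Iterative Fibonacci: each term only needs the two terms before it,
    // so a single sequential loop is both trivial and fast.
    static long[] fibonacci(int n) {
        long[] fib = new long[n];
        for (int i = 0; i < n; i++) {
            fib[i] = (i < 2) ? i : fib[i - 1] + fib[i - 2];
        }
        return fib;
    }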
Write your mapper/reducer to calculate these three things (a rough sketch follows below the list):
the sum of a_i
the sum of b_i
their ratio
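A minimal Hadoop-style sketch of that idea (the tab-separated input format "key a_i b_i" and all class names here are my assumptions, not your actual data layout):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KeyRatio {

        // Mapper: emit (key, "a,b") for every record, so the reducer sees all pairs for a key.
        public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t"); // key, a_i, b_i
                context.write(new Text(fields[0]), new Text(fields[1] + "," + fields[2]));
            }
        }

        // Reducer: sum the a_i and b_i for a key and emit their ratio.
        public static class RatioReducer extends Reducer<Text, Text, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                double sumA = 0, sumB = 0;
                for (Text v : values) {
                    String[] ab = v.toString().split(",");
                    sumA += Double.parseDouble(ab[0]);
                    sumB += Double.parseDouble(ab[1]);
                }
                context.write(key, new DoubleWritable(sumA / sumB));
            }
        }
    }

Note that this only gives the final ratio sum(a_i)/sum(b_i) per key; the running prefix ratios in the original question still need the order-dependent partial sums, which is exactly the part that does not parallelize.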
Related
Started using Hadoop recently and struggling to make sense of a few things. Here is a basic WordCount example that I'm looking at (count the number of times each word appears):
Map(String docid, String text):
    for each word term in text:
        Emit(term, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
Firstly, what is Emit(w,1) supposed to be doing? I notice that in all of the examples I look at, the second parameter is always set to 1, but I can't seem to find an explanation of it.
Also, just to clarify: am I correct in saying that term is the key and sum is the value, so together they form the key-value pairs output by Reduce? If that is the case, is values simply a list of 1's for each term that got emitted from Map? That's the only way I can make sense of it, but these are just assumptions.
Apologies for the noob question. I have looked at tutorials, but a lot of the time I find that confusing terminology is used and basic things are made more complicated than they actually are, so I'm struggling a little to make sense of this.
Appreciate any help!
Take this input as an example word count input.
The mapper will split this sentence into words and emit each with a count of 1:
Take,1
this,1
input,1
as,1
an,1
example,1
word,1
count,1
input,1
Then the reducer receives "groups" of the same word (or key) together with a list of the grouped values, like so (the framework also sorts the keys, but that's not important for this example):
Take, (1)
this, (1)
input, (1, 1)
etc...
As you can see, the two occurrences of the key input have been "reduced" into a single entry, whose values you can loop over and sum, then emit like so:
Take,1
this,1
input,2
etc...
Good question.
As explained, the mapper outputs a sequence of (key, value) pairs, in this case of the form (word, 1) for each word, which the reducer receives grouped as (key, <1,1,...,1>), sums up the terms in the list and returns (key, sum). Note that it is not the reducer who does the grouping; it's the map-reduce environment.
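To make the pseudocode concrete: in Hadoop, Emit corresponds to context.write, and the grouping happens between the two phases. A minimal sketch using the standard org.apache.hadoop.mapreduce API (driver/job setup omitted; whitespace tokenization is an assumption):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: for every word in the line, write the pair (word, 1).
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String term : value.toString().split("\\s+")) {
                    if (term.isEmpty()) continue;
                    word.set(term);
                    context.write(word, ONE); // this is the Emit(term, 1) of the pseudocode
                }
            }
        }

        // Reduce phase: the framework has already grouped the 1s by word;
        // we just add them up and write (word, sum).
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }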
The map-reduce programming model is different from the one we're used to working in, and it's often not obvious how to implement an algorithm in this model. (Think, for example, about how you would implement k-means clustering.)
I recommend Chapter 2 of the freely-available Mining of Massive Data Sets book by Leskovec et al. See also the corresponding slides.
I ran across some code that was doing something like this:
Map<String,String> fullNameById = buildMap1(dataSource1);
Map<String,String> nameById = buildMap2(dataSource2);
Map<String,String> nameByFullName = new HashMap<String,String>();
Map<String,String> idByName = new HashMap<String,String>();
Set<String> ids = fullNameById.keySet();
for (String nextId : ids) {
    String name = nameById.get(nextId);
    String fullName = fullNameById.get(nextId);
    nameByFullName.put(fullName, name);
    idByName.put(name, nextId);
}
I had to stare at it for several minutes to figure out what was going on. All of that amounts to a join operation on id's and an inversion of one of the original maps. Since Id, FullName and Name are always 1:1:1 it seemed to me that there should be some way to simplify this. I also discovered that the first two maps are never used again, and I find that the above code is a bit hard to read. So I'm considering replacing it with something like this that (to me) reads much cleaner
Table<String, String, String> relations = HashBasedTable.create();
addRelationships1(dataSource1, relations);
addRelationships2(dataSource2, relations);
Map<String,String> idByName = relations.column("hasId");
Map<String,String> nameByFullName = relations.column("hasName");
relations = null; // not used hereafter
In addRelationships1 I do
relations.put(id, "hasFullName", fullname);
And in addRelationships2 where my query yields values for id and name I do
relations.put(relations.remove(id,"hasFullName"), "hasName", name);
relations.put(name, "hasId", id);
So my questions are these:
Is there a lurking inefficiency in what I have done, whether in processor time, memory, or GC load? I don't think so, but I'm not that familiar with the efficiency of Table. I am aware that the Table object won't be GC'd after relations = null; I just want to communicate that it's not used again in the rather lengthy section of code that follows.
Have I gained any efficiency? I keep convincing and unconvincing myself that I have and have not.
Do you find this more readable? Or is this only easy for me to read because I wrote it? I'm a tad worried on that front due to the fact Table is not well known. On the other hand, the top level now pretty clearly says, "gather data from two sources and make these two maps from it." I also like the fact that it doesn't leave you wondering if/where the other two maps are being used (or not).
Do you have an even better, cleaner, faster, simpler way to do it than either of the above?
Please, let's not have the optimize early/late discussion here. I'm well aware of that pitfall. If it improves readability without hurting performance, I am satisfied. A performance gain would be a nice bonus.
Note: my variable and method names have been sanitized here to keep the business area from distracting from the discussion, I definitely won't name them addRelationships1 or datasource1! Similarly, the final code will of course use constants not raw strings.
So I did some mini benchmarking myself and came to the conclusion that there is little difference between the two methods in terms of execution time. I kept the total size of the data being processed constant by trading runs for data-set size. I did 4 runs and chose the lowest time for each implementation from among all 4 runs. Reassuringly, both implementations were always fastest on the same run. My code can be found here. Here are my results:
Case                        Maps (ms)   Table (ms)   Table vs Maps
100000 runs of size 10         2931        3035           104%
10000 runs of size 100         2989        3033           101%
1000 runs of size 1000         3129        3160           101%
100 runs of size 10000         4126        4429           107%
10 runs of size 100000         5081        5866           115%
1 run of size 1000000          5489        5160            94%
So using Table seems to be slightly slower for small data sets. Something interesting happens around 100,000 and then by 1 million the table is actually faster. My data will hang out in the 100 to 1000 range, so at least in execution time the performance should be nearly identical.
As for readability, my opinion is that if someone working nearby in the code is trying to figure out what is happening and reads it, it will be significantly easier to see the intent. If they have to actually debug this bit of code it may be a bit harder, since Table is less common and requires some sophistication to understand.
Another thing I am unsure of is whether or not it's more efficient to create the hash maps, or to just query the table directly in the case where all keys of the map will subsequently be iterated. However that's a different question :)
And the comedic ending is that, as I analyzed the code further (hundreds of lines), I found that the only significant use of nameByFullname.get() outside of logging (of questionable value) was to pass its result to idByName.get(). So in the end I'll actually build an idByFullName map and an idByName map instead, with no need for any joining, and drop the whole table idea anyway. But it made for an interesting SO question, I guess.
tl;dr, but I'm afraid that you'd need to make a bigger step away from the original design. Simulating DB tables might be a nice exercise, but for me your code isn't really readable.
Is there a lurking inefficiency in what I have done... No idea.
Have I gained any efficiency? I'm afraid you need to measure it first. Removing some indirections surely helps, but using a more complicated data structure might offset it. And performance in general is simply too complicated.
Do you find this more readable? I'm afraid not.
Do you have an even better, cleaner, faster, simpler way to do it than either of the above? I hope so....
Where I get lost in such code is the use of strings for everything - it's just too easy to pass a wrong string as an argument. So I'd suggest aggregating them into an object and providing maps for accessing the objects via any part of them. Something as trivial as this should do:
class IdNameAndFullName {
    String id, name, fullName;
}

class IdNameAndFullNameMaps {
    Map<String, IdNameAndFullName> byId;
    Map<String, IdNameAndFullName> byName;
    Map<String, IdNameAndFullName> byFullName;
}
You could obviously replace the class IdNameAndFullNameMaps by a Table. However, besides using a nice pre-existing data structure I see no advantages therein. The disadvantages are:
loss of efficiency
loss of readability (I wouldn't use Table here for the very same reason Tuple should be avoided)
use of String keys (your "hasId" and "hasName").
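For illustration, filling those maps could look roughly like this (a sketch that reuses the question's hypothetical buildMap1/buildMap2 helpers and assumes every id appears in both sources; java.util imports as in the question):

    // Build one record per id, then index the same records three ways.
    Map<String, String> fullNameById = buildMap1(dataSource1); // id -> fullName
    Map<String, String> nameById = buildMap2(dataSource2);     // id -> name

    IdNameAndFullNameMaps maps = new IdNameAndFullNameMaps();
    maps.byId = new HashMap<>();
    maps.byName = new HashMap<>();
    maps.byFullName = new HashMap<>();

    for (Map.Entry<String, String> e : fullNameById.entrySet()) {
        IdNameAndFullName record = new IdNameAndFullName();
        record.id = e.getKey();
        record.fullName = e.getValue();
        record.name = nameById.get(record.id);

        maps.byId.put(record.id, record);
        maps.byName.put(record.name, record);
        maps.byFullName.put(record.fullName, record);
    }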
Edit: I should have probably mentioned that I am extremely new to Java programming. I just started with the language about two weeks ago.
I have tried looking for an answer to this question, but so far I haven't found one, so that is why I am asking it here.
I am writing Java code for a Dungeons and Dragons Initiative Tracker and I am using a TreeMap for its ability to sort on entry. I am still very new to Java, so I don't know everything that is out there.
My problem is that when I have two of the same keys, the map replaces the value, so one of the values no longer exists. I understand this can be desirable behavior, but in my case I cannot have that happen. I was hoping there would be an elegant solution to fix this behavior. So far what I have is this:
TreeMap<Integer,Character> initiativeList = new TreeMap<Integer,Character>(Collections.reverseOrder());
Character [] cHolder = new Character[3];
out.println("Thank you for using the Initiative Tracker Project.");
cHolder[0] = new Character("Fred",2);
cHolder[1] = new Character("Sam",3,23);
cHolder[2] = new Character("John",2,23);
for (int i = 0; i < cHolder.length; ++i)
{
    initiativeList.put(cHolder[i].getInitValue(), cHolder[i]);
}
out.println("Initiative List: " + initiativeList);
Character is a class that I have defined that keeps track of a player's character name and initiative values.
Currently the output is this:
Initiative List: {23=John, 3=Fred}
I considered using a TreeMap with some sort of sub-collection as the value, but I would run into a similar problem. What I really need is a way to keep both entries instead of having one overwrite the other. Thank you guys for any help you can give me.
EDIT: In Dungeons and Dragons, a character rolls a 20-sided die and then adds their initiative mod to the result to get their total initiative. Sometimes two players can get the same values. I've thought about having the key formatted like this:
Key = InitiativeValue.InitiativeMod
So for Sam his key would be 23.3 and John's would be 23.2. I understand that I would need to change the key type to float instead of int.
However, even with that, two players could have the same initiative mod and roll the same initiative value. In reality this happens more often than you might think. So, for example, say both Peter and Scott join the game. They both have an initiative modifier of 2, and they both roll a 10 on the 20-sided die. That would make both of their initiative values 12.
When I put them into the existing map, they both need to show up even though they have the same value.
Initiative List: {23=John, 12=Peter, 12=Scott, 3=Fred}
I hope that helps to clarify what I am needing.
If I understand you correctly, you have a bunch of characters and their initiatives, and want to "invert" this structure to key by initiative ID, with the value being all characters that have that initiative. This is perfectly captured by a MultiMap data structure, of which one implementation is the Guava TreeMultimap.
There's nothing magical about this. You could achieve something similar with a
TreeMap<Initiative,List<Character>>
This is not exactly how a Guava multimap is implemented, but it's the simplest data structure that could support what you need.
If I were doing this I would write my own class that wrapped the above TreeMap and provided an add(K key, V value) method that handled the duplicate detection and list management according to your specific requirements.
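A minimal sketch of that wrapper idea, built on the plain JDK (generic here; in your case K would be the Integer initiative value and V your Character class):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.TreeMap;

    // Keeps a list of values per key, so equal initiative rolls are appended
    // instead of overwriting each other.
    class SortedMultiMap<K, V> {
        private final TreeMap<K, List<V>> map;

        SortedMultiMap(Comparator<? super K> keyOrder) {
            this.map = new TreeMap<>(keyOrder);
        }

        void add(K key, V value) {
            // computeIfAbsent creates the list on the first insert for a key;
            // later inserts with the same key just append to it.
            map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }

        TreeMap<K, List<V>> asMap() {
            return map;
        }
    }

Used as SortedMultiMap<Integer, Character> initiativeList = new SortedMultiMap<>(Collections.reverseOrder());, printing asMap() would then give something like {23=[Sam, John], 3=[Fred]} instead of losing one of the characters with initiative 23.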
You say you are using "...a TreeMap for its ability to sort on entry..." - but maybe you could just use a TreeSet instead. You'll need to implement a suitable compareTo method on your Character class that performs the comparison you want; I strongly recommend that you implement hashCode and equals too.
Then, when you iterate through the TreeSet, you'll get the Character objects in the appropriate order. Note that Map classes are intended for lookup purposes, not for ordering.
First of all, let me tell you that I have read the question Java HashMap performance optimization / alternative that has been asked before, and I have a similar question.
What I want to do is take a LOT of dependencies from New York Times text, which will be processed by the Stanford parser to give dependencies, and store the dependencies in a HashMap along with their scores, i.e. if I see a dependency twice I will increment its score in the HashMap by 1.
The task starts off really quickly, about 10 sentences a second, but slows down quickly. At 30,000 sentences (assuming 10 words per sentence and about 3-4 dependencies per word that I'm storing), that is about 300,000 entries in my HashMap.
How will I be able to increase the performance of my HashMap? What kind of hash key can I use?
Thanks a lot
Martinos
EDIT 1:
OK guys, maybe I phrased my question wrongly. Well, the byte arrays are not used in MY project but in the similar question of another person above. I don't know what they are using them for, hence why I asked.
Secondly, I will not post code, as I consider it would make things very hard to understand, but here is a sample:
With the sentence "i am going to bed" I have the dependencies:
(i , am , -1)
(i, going, -2)
(i,to,-3)
(am, going, -1)
.
.
.
(to,bed,-1)
These dependencies for all sentences (1,000,000 sentences) will be stored in a HashMap.
If I see a dependency twice, I will get the score of the existing dependency and add 1.
And that is pretty much it. All is well, but the rate of adding sentences to the HashMap (or retrieving from it) scales down on this line:
dependancyBank.put(newDependancy, dependancyBank.get(newDependancy) + 1);
Can anyone tell me why?
Regards
Martinos
Trove has optimized hashmaps for the case where key or value are of primitive type.
However, much will still depend on smart choice of structure and hash code for your keys.
This part of your question is unclear: "The task starts off really quickly, about 10 sentences a second, but slows down quickly. At 30,000 sentences (...) is about 300,000 entries in my hashmap." You don't say what the performance is for the larger data. Your map grows, which is kind of obvious. Hashmaps are O(1) only in theory; in practice you will see some performance changes with size, due to less cache locality and due to occasional jumps caused by rehashing. So, put() and get() times will not be constant, but they should be close to constant. Perhaps you are using the hashmap in a way which doesn't guarantee fast access, e.g. by iterating over it? In that case your time will grow linearly with size, and you can't change that unless you change your algorithm.
Google 'fastutil' and you will find a superior solution for mapping object keys to scores.
Take a look at the Guava multimaps: http://www.coffee-bytes.com/2011/12/22/guava-multimaps They are designed to basically keep a list of things that all map to the same key. That might solve your need.
How will I be able to increase the performance of my HashMap?
If it's taking more than 1 microsecond per get() or put(), you have a bug IMHO. You need to determine why it's taking as long as it is. Even in the worst case, where every object has the same hashCode, you won't have performance this bad.
What kind of hash key can I use?
That depends on the data type of the key. What is it?
And finally, what are byte[] a = new byte[2]; byte[] b = new byte[3]; in the question that was posted above?
They are arrays of bytes. They can be used as values to look up, but it's likely that you need a different value type.
HashMap has an overloaded constructor which takes an initial capacity as input. The slowdown you see is because of rehashing, during which the HashMap is effectively unusable. To prevent frequent rehashing you need to start with a HashMap of greater initial capacity. You can also set a load factor, which indicates what fraction of the map may fill up before it is rehashed.
public HashMap(int initialCapacity).
Pass the initial capacity to the HashMap during object construction. It is preferable to set the capacity to almost twice the number of elements you expect to add to the map during the course of your program's execution.
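Putting both pieces of advice together - a key class with a proper hashCode/equals and a pre-sized map - could look roughly like this (the Dependency fields and the expected-entry count are illustrative guesses, not your actual code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    // A dependency key with cheap, well-distributed hashCode/equals.
    final class Dependency {
        final String governor;
        final String dependent;
        final int offset;

        Dependency(String governor, String dependent, int offset) {
            this.governor = governor;
            this.dependent = dependent;
            this.offset = offset;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Dependency)) return false;
            Dependency d = (Dependency) o;
            return offset == d.offset
                    && governor.equals(d.governor)
                    && dependent.equals(d.dependent);
        }

        @Override
        public int hashCode() {
            return Objects.hash(governor, dependent, offset);
        }
    }

    class DependencyBank {
        private static final int EXPECTED_ENTRIES = 10_000_000; // rough guess, tune to your data

        // Sizing for expected entries / load factor avoids rehashing while the map fills.
        private final Map<Dependency, Integer> scores =
                new HashMap<>((int) (EXPECTED_ENTRIES / 0.75f) + 1);

        void record(Dependency d) {
            // merge() handles both the first insert and later increments in one call.
            scores.merge(d, 1, Integer::sum);
        }
    }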
What would be the best approach to compare two hexadecimal file signatures against each other for similarity?
More specifically, what I would like to do is take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the file's (exe) hex representation into individual groups of N chars (i.e. 10 hex chars) and do the same with the virus signature. I am aiming to perform some sort of heuristic and therefore statistically check whether this exe file has X% similarity to a known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset will be graded statistically.
As you can realize, there would be a massive number of comparisons and hence it would be very, very slow. So I thought to ask whether you guys can think of a better approach to such a comparison, for example implementing different data structures together.
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version while having apparently different structures (variants). It achieves that by code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
Update: After the advice you guys gave me regarding what reading to do, I did that, but it somewhat confused me more. I found several distance algorithms that can apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm
But now I don't know which to use; they all seem to do the same thing in different ways. I will continue to do research so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can give it priority during my research and study it more deeply?
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
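For reference, the Levenshtein part of that amalgamation, turned into a similarity percentage over two hex strings, could be sketched like this (the chunking and LCS pieces, and any tuning, are left out):

    // Minimal sketch: classic dynamic-programming Levenshtein distance,
    // turned into a similarity percentage between two hex signatures.
    public class HexSimilarity {

        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;

            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // 100% means identical, 0% means nothing in common at all.
        static double similarityPercent(String exeHex, String virusHex) {
            int maxLen = Math.max(exeHex.length(), virusHex.length());
            if (maxLen == 0) return 100.0;
            return 100.0 * (1.0 - (double) levenshtein(exeHex, virusHex) / maxLen);
        }

        public static void main(String[] args) {
            // One substituted hex char out of twelve: roughly 92% similar.
            System.out.println(similarityPercent("4D5A90000300", "4D5A91000300"));
        }
    }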
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, for considering polymorphic malware, this sector should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
A number of papers have been published on finding near-duplicate documents in a large corpus in the context of web search. I think you will find them useful. For example, see
this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing. The difference is that you are using binary data. They are similar problems because you will be looking for strings that have the same basic pattern, even though the patterns may have some slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity with known strings and bioinformatics problems might help. Longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but one more efficient than Smith-Waterman. I would try looking at programs such as BLAST, BLAT or MUMMER3 to see if they can fit your needs.

Remember that the default parameters for these programs are based on a biology application (how much to penalize an insertion or a substitution, for instance), so you should probably look at re-estimating parameters based on your application domain, possibly based on a training set. This is a known problem because even in biology different applications require different parameters (based, for instance, on the evolutionary distance of two genomes to compare). It is also possible, though, that even at the defaults one of these algorithms might produce usable results. Best of all would be to have a generative model of how viruses change, which could guide you to an optimal choice of distance and comparison algorithm.