Trying to make sense of a basic WordCount MapReduce example - java

Started using Hadoop recently and struggling to make sense of a few things. Here is a basic WordCount example that I'm looking at (count the number of times each word appears):
Map(String docid, String text):
    for each word term in text:
        Emit(term, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
Firstly, what is Emit(term, 1) supposed to be doing? I notice that in all of the examples I look at, the second parameter is always set to 1, but I can't seem to find an explanation for it.
Also, just to clarify: am I correct in saying that term and sum in Reduce form the key-value pair (key and value, respectively)? If this is the case, is values simply a list of 1s for each term that got emitted from Map? That's the only way I can make sense of it, but these are just assumptions.
Apologies for the noob question. I have looked at tutorials, but a lot of the time I find that confusing terminology is used and basic things are made more complicated than they actually are, so I'm struggling a little to make sense of this.
Appreciate any help!

Take the sentence "Take this input as an example word count input" as the example input.
The mapper will split this sentence into words, emitting (word, 1) for each:
Take,1
this,1
input,1
as,1
an,1
example,1
word,1
count,1
input,1
Then the reducer receives "groups" of the same word (or key) together with a list of the grouped values, like so (the framework also sorts the keys, but that's not important for this example):
Take, (1)
this, (1)
input (1, 1)
etc...
As you can see, the key input has been "reduced" to a single entry whose list of values you can loop over, summing them and emitting the result like so:
Take,1
this,1
input,2
etc...
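For reference, here is a minimal sketch of how that pseudocode maps onto real Hadoop Java code; this is essentially the classic WordCount example, minus the driver class that configures and submits the job:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // this is the "Emit(term, 1)" from the pseudocode
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // values is the list of 1s the framework grouped for this word
            }
            result.set(sum);
            context.write(key, result); // this is the "Emit(term, sum)"
        }
    }
}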

Good question.
As explained, the mapper outputs a sequence of (key, value) pairs, in this case of the form (word, 1) for each word, which the reducer receives grouped as (key, <1, 1, ..., 1>); it sums up the terms in the list and returns (key, sum). Note that it is not the reducer that does the grouping; it's the map-reduce environment.
The map-reduce programming model is different from the one we're used to working in, and it's often not obvious how to implement an algorithm in this model. (Think, for example, about how you would implement k-means clustering.)
I recommend Chapter 2 of the freely available Mining of Massive Datasets book by Leskovec et al. See also the corresponding slides.

Related

Transformation algorithms for numerical values similar to functionality of Soundex, Metaphone, etc

I'm working on implementing probabilistic matching for person record searching. As part of this, I plan to have blocking performed before any scoring is done. Currently, there are a lot of good options for transforming strings so that they can be stored and then searched for, with similar strings matching each other (things like Soundex, Metaphone, etc.).
However, I've struggled to find something similar for purely numeric values. For example, it would be nice to be able to block on a social security number and not have numbers that are off by a digit, or that have transposed digits, excluded from the results. 123456789 should have blocking results for 123456780 or 213456789.
Now, there are certainly ways to simply compare two numerical values to determine how similar they are, but what could I do when there are millions of numbers in the database? It's obviously impractical to compare them all (and that would certainly defeat the point of blocking).
What would be nice would be something where those three SSNs above could somehow be transformed into some other value that would be stored. Purely for example, imagine those three numbers ended up as AAABBCCC after this magical transformation. However, something like 987654321 would be ZZZYYYYXX, and 123547698 would be AAABCCBC, or something like that.
So, my question is: is there a good transformation for numeric values like those that exist for alphabetical values? Or is there some other approach that might make sense (besides some highly complex or low-performing SQL or logic)?
The first thing to realize is that social security numbers are basically strings of digits. You really want to treat them like strings rather than numbers.
The second thing to realize is that your blocking function maps a record to a list of strings that identify comparison-worthy sets of items.
Here is some Python code to get you started. (I know you asked for Java, but I think the Python is clear, and you aren't paying me enough to write it in Java :P). The basic idea is to take your input record, simulate roughing it up in multiple ways (to get your blocking keys), and then group by any match on those blocking keys.
import itertools

def transpositions(s):
    # Swap each pair of adjacent characters.
    for pos in range(len(s) - 1):
        yield s[:pos] + s[pos + 1] + s[pos] + s[pos + 2:]

def substitutions(s):
    # Replace each single character with a wildcard.
    for pos in range(len(s)):
        yield s[:pos] + '*' + s[pos + 1:]

def all_blocks(s):
    # The record itself plus every roughed-up variant forms its blocking keys.
    return itertools.chain([s], transpositions(s), substitutions(s))

def are_blocked_candidates(s1, s2):
    # Two records are comparison candidates if any blocking key matches.
    return bool(set(all_blocks(s1)) & set(all_blocks(s2)))

assert not are_blocked_candidates('1234', '5555')
assert are_blocked_candidates('1234', '1239')
assert are_blocked_candidates('1234', '2134')
assert not are_blocked_candidates('1234', '1255')
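Since the question did ask for Java, here is a rough, illustrative port of the same idea; the class and method names are my own, not from any library:

import java.util.HashSet;
import java.util.Set;

class Blocking {
    static Set<String> allBlocks(String s) {
        Set<String> blocks = new HashSet<>();
        blocks.add(s);
        // Adjacent transpositions: swap each pair of neighboring digits.
        for (int pos = 0; pos < s.length() - 1; pos++) {
            blocks.add(s.substring(0, pos) + s.charAt(pos + 1)
                    + s.charAt(pos) + s.substring(pos + 2));
        }
        // Substitutions: wildcard out each single digit.
        for (int pos = 0; pos < s.length(); pos++) {
            blocks.add(s.substring(0, pos) + '*' + s.substring(pos + 1));
        }
        return blocks;
    }

    static boolean areBlockedCandidates(String s1, String s2) {
        Set<String> common = allBlocks(s1);
        common.retainAll(allBlocks(s2));
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(areBlockedCandidates("1234", "1239")); // true (one substitution)
        System.out.println(areBlockedCandidates("1234", "2134")); // true (one transposition)
        System.out.println(areBlockedCandidates("1234", "5555")); // false
    }
}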

Prevent treemap merging on collision

Edit: I should probably have mentioned that I am extremely new to Java programming. I just started with the language about two weeks ago.
I have tried looking for an answer to this question, but so far I haven't found one, so that is why I am asking it here.
I am writing Java code for a Dungeons and Dragons initiative tracker, and I am using a TreeMap for its ability to sort on entry. I am still very new to Java, so I don't know everything that is out there.
My problem is that when I have two of the same keys, the tree merges the values such that one of the values no longer exists. I understand this can be desirable behavior, but in my case I cannot have that happen. I was hoping there would be an elegant solution to fix this behavior. So far what I have is this:
TreeMap<Integer, Character> initiativeList = new TreeMap<Integer, Character>(Collections.reverseOrder());
Character[] cHolder = new Character[3];
out.println("Thank you for using the Initiative Tracker Project.");
cHolder[0] = new Character("Fred", 2);
cHolder[1] = new Character("Sam", 3, 23);
cHolder[2] = new Character("John", 2, 23);
for (int i = 0; i < cHolder.length; ++i)
{
    initiativeList.put(cHolder[i].getInitValue(), cHolder[i]);
}
out.println("Initiative List: " + initiativeList);
Character is a class that I have defined that keeps track of a player's character name and initiative values.
Currently the output is this:
Initiative List: {23=John, 3=Fred}
I considered using a TreeMap with some sort of subCollection but I would also run into a similar problem. What I really need to do is just find a way to disable the merge. Thank you guys for any help you can give me.
EDIT: In Dungeons and Dragons, a character rolls a 20-sided die and then adds their initiative mod to the result to get their total initiative. Sometimes two players can get the same values. I've thought about having the key formatted like this:
Key = InitiativeValue.InitiativeMod
So for Sam his key would be 23.3 and John's would be 23.2. I understand that I would need to change the key type to float instead of int.
However, even with that, two players could have the same initiative mod and roll the same initiative value. In reality this happens more often than you might think. So, for example:
Say both Peter and Scott join the game. They both have an initiative modifier of 2, and they both roll a 10 on the 20-sided die. That would make both of their initiative values 12.
When I put them into the existing map, they both need to show up even though they have the same initiative value.
Initiative List: {23=John, 12=Peter, 12=Scott, 3=Fred}
I hope that helps to clarify what I am needing.
If I understand you correctly, you have a bunch of characters and their initiatives, and want to "invert" this structure to key by initiative, with the value being all characters that have that initiative. This is perfectly captured by a MultiMap data structure, of which one implementation is Guava's TreeMultimap.
There's nothing magical about this. You could achieve something similar with a
TreeMap<Integer, List<Character>>
This is not exactly how a Guava multimap is implemented, but it's the simplest data structure that could support what you need.
If I were doing this I would write my own class that wrapped the above TreeMap and provided an add(K key, V value) method that handled the duplicate detection and list management according to your specific requirements.
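As an illustration, here is a minimal sketch of that wrapper idea, assuming your Character class exposes getInitValue(); computeIfAbsent needs Java 8 or later:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class InitiativeTracker {
    // Highest initiative first, and every character with that initiative kept.
    // Character here is the asker's own class (shadowing java.lang.Character),
    // assumed to expose getInitValue().
    private final TreeMap<Integer, List<Character>> byInitiative =
            new TreeMap<Integer, List<Character>>(Collections.reverseOrder());

    public void add(Character c) {
        // Create the list on first use, then append; nothing gets overwritten.
        byInitiative.computeIfAbsent(c.getInitValue(), k -> new ArrayList<>()).add(c);
    }

    @Override
    public String toString() {
        return byInitiative.toString(); // e.g. {23=[Sam, John], 3=[Fred]}
    }
}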
You say you are using "...a TreeMap for its ability to sort on entry..."; but maybe you could just use a TreeSet instead. You'll need to implement a suitable compareTo method on your Character class that performs the comparison you want, and I strongly recommend that you implement hashCode and equals too.
Then, when you iterate through the TreeSet, you'll get the Character objects in the appropriate order. Note that Map classes are intended for lookup purposes, not for ordering.
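A minimal sketch of what that compareTo might look like, assuming Character holds a name and an initiative value. The tie-break on name matters because a TreeSet treats compareTo == 0 as a duplicate and silently drops the element (hashCode and equals are omitted for brevity, but implement them consistently with compareTo):

class Character implements Comparable<Character> {
    private final String name;
    private final int initValue;

    Character(String name, int initValue) {
        this.name = name;
        this.initValue = initValue;
    }

    public int getInitValue() { return initValue; }

    @Override
    public int compareTo(Character other) {
        // Sort by initiative, highest first; break ties by name so two
        // characters with equal initiative are not treated as equal.
        int byInit = Integer.compare(other.initValue, this.initValue);
        return byInit != 0 ? byInit : this.name.compareTo(other.name);
    }

    @Override
    public String toString() { return name; }
}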

Recursive calculations using Mapreduce

I am working on a map reduce program and was thinking about designing computations of the form below, where a1, b1 are the values associated with a key:
a1/b1, (a1+a2)/(b1+b2), (a1+a2+a3)/(b1+b2+b3), ...
So at every stage of reducer I would require the previous values.
How would one design this as a map reduce job, given that at every stage only the values associated with a particular key can be read?
If you feel the question is not clear, can you guide me toward this more general question instead?
More general question: How would one develop a Fibonacci series using recursion in map reduce?
EDIT
Can you help me with my modified design?
key1, V1,V2,V3
Key2, V4,V5,V6
Mapper output
Key1_X V1
Key1_Y V2
Key2_X V4
Key2_Y V5
Reducer output
Key1_X {V1,.....}
Key1_Y {V2,.....}
Similarly, in the next mapper stage, can I create a list like this:
key1 {V1,....} {V2,....}
Key2 {V4,....} {V5,....}
My reason for doing this is to perform:
Key1 {V1/V2, (V1+V6)/(V2+V7), (V1+V6+...)/(V2+V7+...), ...}
Is it possible to do this? Because the data set is very large, I think it would be better to use map reduce.
Will changing the design help make it more efficient?
The main problem with Fibonacci (and, as you indicated, with your specific problem too) is the dependence between all terms in the series.
You cannot calculate the later terms without calculating the earlier terms first.
MapReduce is very good if and only if you can split your job into independent pieces.
I don't see an easy way to do that here.
So any construct "forcing" MapReduce to solve this will break its scalability advantages. Hence a simple, highly optimized loop in your favorite programming language will outperform any MapReduce algorithm.
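To make that concrete, here is the plain loop for Fibonacci: each step depends on the two previous terms, which is exactly the sequential dependence MapReduce cannot split into independent pieces.

static long fibonacci(int n) {
    long prev = 0, curr = 1; // F(0), F(1)
    for (int i = 0; i < n; i++) {
        long next = prev + curr;
        prev = curr;
        curr = next;
    }
    return prev; // F(n)
}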
Write your mapper/reducer to calculate these three things:
the sum of a_i
the sum of b_i
their ratio
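If you do need every intermediate ratio from your question, note that once all of a key's values are together in one reducer call, a single pass produces them all. Here is a minimal plain-Java sketch, not the full Hadoop wiring; Pair is a hypothetical value class, and the values are assumed to arrive in series order, which MapReduce only guarantees with a secondary sort:

import java.util.ArrayList;
import java.util.List;

// Hypothetical value class: one (a_i, b_i) observation for a key.
class Pair {
    final long a, b;
    Pair(long a, long b) { this.a = a; this.b = b; }
}

class PrefixRatios {
    // Running sums give a1/b1, (a1+a2)/(b1+b2), ... in one pass.
    static List<Double> prefixRatios(List<Pair> values) {
        List<Double> out = new ArrayList<>();
        long sumA = 0, sumB = 0;
        for (Pair p : values) {
            sumA += p.a;
            sumB += p.b;
            out.add((double) sumA / sumB);
        }
        return out;
    }
}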

Help me understand question related to HashMap in Java

I'm given a task which I am a little confused about. Here is the question statement:
The following program should read a file and store all its tokens in a member variable.
Your task is to write a single method that returns the number of items in tokenMap, the average length (as a double value) of the elements in tokenMap, and the number of tokens starting with the character "a".
Here tokenMap is an object of type HashMap<String, Integer>.
I do have some idea about HashMap, but what I want to know is whether the required key for tokenMap is a single character or the whole word.
Also, how can I compute the average length?
Looks like you have to use the entire word as the key.
The average length of tokens can be computed by summing the lengths of each token and dividing by the number of tokens.
In Java, you can find the number of tokens in the HashMap with tokenMap.size().
You can write a loop that visits each token (the keys of the map, not the values) like this:
for (String t : tokenMap.keySet()) {
    // t is a token
}
and if you look up String in the Java API docs you will see that it is easy to find the length of a String.
To compute the average length of the items in a hash map, you'll have to iterate over them all and count the length and calculate the average.
As for your other question about what to use for a key, how are we supposed to know? A hashmap can use practically any* value for a key.
*The value must be hashable, which is defined differently for different languages.
Reading the question closely, it seems that you have to read a file, extract each word and use it as the key, and store the length of each key as the integer value:
an example line
leads to a HashMap like this
an : 2
example : 7
line : 4
After you've built your map (made of keys mapping to values, the "elements" in the question), you'll need to run some statistics over it to find:
the number of keys (look at HashMap)
the average length of all keys (again, simple enough)
the number beginning with "a" (just look at the String)
Then make a value object containing these values and return it from the method that does the statistics.
I know I've given more information than you require, but someone else may benefit from a little extra help.
Guys, there is some confusion. I'm not asking for a solution. I'm just confused about one thing.
For the time being, I'm going to use the String type as the key type.
The only confusion I have is: once I read the file line by line, should I split it based upon words or based upon each character? That is, should the key be a single-character string or a string of the whole word?
If you can go through the question statement, what do you suggest? That's all I'm asking.
should I split it based upon words or based upon each character
The requirement is to make tokens, so you should split based on words. Each word becomes a unique String key. It would make sense for the value to be the count of each token.
If the file you are reading has these three lines:
int alpha;
int beta;
float delta;
Then you should have something like
<"int", 2>
<";", 3>
<"alpha", 1>
<"beta", 1>
<"float", 1>
<"delta", 1>
(The semicolon may or may not be considered a token.)
Your average length would be (3 + 1 + 5 + 4 + 5 + 5) / 6 ≈ 3.83, averaging over the six distinct keys.
Your number of tokens starting with "a" would be 1 ("alpha").
Look elsewhere on this forum for keySet and you should be good to go.
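Putting the pieces together, here is a minimal sketch of the single method the assignment asks for; the class and method names are illustrative rather than given by the assignment, and the three statistics come back in a small value object as suggested above:

import java.util.Map;

// Illustrative value object holding the three required statistics.
class TokenStats {
    final int itemCount;
    final double averageLength;
    final int startingWithA;

    TokenStats(int itemCount, double averageLength, int startingWithA) {
        this.itemCount = itemCount;
        this.averageLength = averageLength;
        this.startingWithA = startingWithA;
    }

    static TokenStats compute(Map<String, Integer> tokenMap) {
        int count = tokenMap.size();  // number of items in the map
        long totalLength = 0;
        int startingWithA = 0;
        for (String token : tokenMap.keySet()) {
            totalLength += token.length();
            if (token.startsWith("a")) {
                startingWithA++;
            }
        }
        double avg = (count == 0) ? 0.0 : (double) totalLength / count;
        return new TokenStats(count, avg, startingWithA);
    }
}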

Anagram Hash Function

I know something like this has been asked before, but the answer was sort of sidetracked.
I want to develop a hash function which will take a word and spit out an address in an array.
So, for example, if you input god:
sort the word: d o g
perform some sort of function on this to get an address: d o g -> some number
insert 'dog' into address some_number in array[].
I can't seem to make a function which doesn't get screwed up somehow.
public static int hashCode(String word) {
    char[] x = word.toCharArray();
    Arrays.sort(x);
    int hash = 0;
    for (int i = 0; i < x.length; i++) {
        hash += (x[i] - 96) * (x[i] - 96) * (x[i] - 96) * (i + 1) * (i + 1) + i;
    }
    hash %= size; // get a value that's inside the bounds of the array
    if (hash < 0)
        hash = hash + size;
    return hash;
}
This is my current algorithm, but there are two problems:
the array size has to be huge so that there aren't a ton of collisions
there still are a few collisions: the bucket for chair, for example, ends up holding smarminess, parr, and chair
What do you guys think? I really appreciate your help.
Your hash function looks totally arbitrary. Why are you using that?
There are a few common, well known and relatively good hash functions, see a description here:
http://www.azillionmonkeys.com/qed/hash.html
See also https://stackoverflow.com/questions/263400#263416
There is a lot of research on hash functions and collision resolution. Here's a place to start: Hash Table
I guess, from your title and from the Arrays.sort(x) call, that you're looking for a hash function that expressly collides when two strings are anagrams of each other. Is this correct? If so, you should specify that requirement inside the question.
The article that Vinko suggested is good. I also recommend Integer Hash Function for other algorithms that you might try.
Good luck!
If you really want to develop a "hash" that deliberately collides for all anagrams (in other words, one that's amenable to finding anagrams in a hash table), then why not split the string into an array of characters, filter out any characters you want to ignore (non-letters), sort the result, concatenate it, and then hash that string?
Thus "dog" and "god" both get munged into "dgo", and that's your key for all anagrams of "dog".
In modern versions of Python all that verbiage can be summarized in the following one-line function:
def anagrash(s):
    # note: the filter must test x.isalpha(), not s.isalpha()
    return ''.join(sorted([x for x in s.lower() if x.isalpha()]))
... which you might use as:
anagrams = dict()
for each in phrases:
    ahash = anagrash(each)
    if ahash not in anagrams:
        anagrams[ahash] = list()
    anagrams[ahash].append(each)
... to build a dictionary of possible anagrams from a list of phrases.
Then, to filter out all of the phrases for which no anagram was found:
# iterate over a copy of the items so it's safe to delete while looping
for key, val in list(anagrams.items()):
    if len(val) < 2:
        del anagrams[key]
So, there's your homework assignment: less than a dozen lines of Python. Porting that to whatever language your instructor is teaching, and wrapping it in logic to read in the phrases and write out the results, is left as an exercise for the student.
