hadoop distinct count of a field - java

I have a file whose format is like below:
1,5321234567
1,5324564321
1,5324564321
2,1234567643
2,1234567666
2,9875422345
3,5344435345
3,5344435345
3,5344435345
3,5344435345
3,5345345312
3,8767564564
At the end of the reduce process, I want the distinct count of the second field, with the first field as the key, e.g.
1,2
2,3
3,3
What are the simplest map and reduce functions in Java for this purpose?
Thanks.

If I understand your goal correctly, you'll need to:
Make the values per key unique
Count the number of distinct items per "key"
So the simplest way to get there would be something like this:
Assume the input is {A,B}
MAP 1:
Output Key : {A,B}
Output Value: 1
REDUCE 1:
Input Key : {A,B}
Input Values: {1,1,1,...}
Output Key : A
Output Value: B
MAP 2:
Output Key : A
Output Value: 1
REDUCE 2:
Input Key : A
Input Values: {1,1,1,...}
Output Key : A
Output Value: SUM of all the values
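A minimal sketch of this two-job chain in Java (using the org.apache.hadoop.mapreduce API; the class names and the tab-separated intermediate format are assumptions for illustration, not part of the original answer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Job 1: collapse duplicate (A,B) records
class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, ONE); // the whole "A,B" line is the key, so duplicates group together
    }
}

class PairReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        String[] parts = pair.toString().split(",");
        context.write(new Text(parts[0]), new Text(parts[1])); // emit A -> B exactly once
    }
}

// Job 2: count the now-unique B values per A (job 1's TextOutputFormat wrote "A<TAB>B")
class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String a = line.toString().split("\t")[0];
        context.write(new Text(a), ONE);
    }
}

class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text a, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : ones) {
            sum += one.get();
        }
        context.write(a, new IntWritable(sum)); // A, number of distinct B values
    }
}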

As I understand it, you need the count of unique values for a key, and you don't need to preserve the values themselves.
It would be simple: create the key from the whole record, and the Hadoop framework will take care of sorting the unique records for you.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // the whole record becomes the output key, so duplicate records are grouped together
    context.write(value, new IntWritable(1));
}

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable value : values) {
        count += value.get();
    }
    // emit IntWritable so this reducer can also run as a combiner (see below)
    context.write(key, new IntWritable(count));
}
Reducer can be used as combiner as well.
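A rough sketch of the corresponding driver wiring (RecordMapper and RecordReducer are placeholder names for the map and reduce above, not names from the original answer):

// in the driver, after creating the Job:
job.setMapperClass(RecordMapper.class);     // the map() above
job.setCombinerClass(RecordReducer.class);  // the same reducer doubles as a combiner
job.setReducerClass(RecordReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);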

Just do sorting. Get all the inputs into an ArrayList and sort it.
This would help you: Array
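If you take that route inside a single reducer, here is a rough sketch (assuming the map emits the first field as the key and the second field as the value; note that this buffers all values for a key in memory):

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> list = new ArrayList<String>();
    for (Text value : values) {
        list.add(value.toString());
    }
    Collections.sort(list);               // duplicates become adjacent after sorting
    int distinct = 0;
    String previous = null;
    for (String current : list) {
        if (!current.equals(previous)) {
            distinct++;                   // count each value only the first time it appears
            previous = current;
        }
    }
    context.write(key, new IntWritable(distinct));
}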

Related

comparing Hashmaps by different String Keys

I have two HashMaps and want to compare them as fast as possible, but the problem is that the String keys of mapA consist of two words joined by a space, while the String keys of mapB are single words.
I don't want to count the occurrences, that is already done; I want to compare the two different kinds of Strings.
mapA:
key: hello world, value: 10
key: earth hi, value: 20
mapB:
key: hello, value: 5
key: world, value: 15
key: earth, value: 25
key: hi, value: 35
The first key of mapA should find key "hello" and key "world" in mapB.
What I am trying to do is parse a long text to find co-occurrences and set a value for how often they occur relative to all words.
My first try:
for (String entry : mapA.keySet())
{
    String key = (String) entry;
    Integer mapAvalue = (Integer) mapA.get(entry);
    Integer tokenVal1 = 0, tokenVal2 = 0;
    String token1 = key.substring(0, key.indexOf(" "));
    String token2 = key.substring(key.indexOf(" "), key.length()).trim();
    for (String mapBentry : mapb.keySet())
    {
        String tokenkey = mapBentry;
        if (tokenkey.equals(token1)) {
            tokenVal1 = (Integer) tokens.get(tokenkey);
        }
        if (tokenkey.equals(token2)) {
            tokenVal2 = (Integer) tokens.get(tokenkey);
        }
        if (token1 != null && token2 != null && tokenVal1 > 1000 && tokenVal2 > 1000) {
            procedurecall(mapAvalue, token1, token2, tokenVal1, tokenVal2);
        }
    }
}
You shouldn't iterate over a HashMap (O(n)) if you are just trying to find a particular key; that's what the HashMap lookup (O(1)) is for. So eliminate your inner loop.
You can also eliminate a few unnecessary variables in your code (e.g. key, tokenkey), and you don't need a third tokens map; you can put the token values in mapb.
for (String entry : mapA.keySet())
{
    Integer mapAvalue = (Integer) mapA.get(entry);
    String token1 = entry.substring(0, entry.indexOf(" "));
    String token2 = entry.substring(entry.indexOf(" "), entry.length()).trim();
    if (mapb.containsKey(token1) && mapb.containsKey(token2))
    {
        // look up the tokens:
        Integer tokenVal1 = (Integer) mapb.get(token1);
        Integer tokenVal2 = (Integer) mapb.get(token2);
        if (tokenVal1 > 1000 && tokenVal2 > 1000)
        {
            procedurecall(mapAvalue, token1, token2, tokenVal1, tokenVal2);
        }
    }
}
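As a further cleanup along the same lines (just a sketch, reusing the names mapA, mapb and procedurecall from the code above), iterating over mapA.entrySet() also avoids the mapA.get(entry) lookup:

for (Map.Entry<String, Integer> entry : mapA.entrySet()) {
    String key = entry.getKey();
    Integer mapAvalue = entry.getValue();
    int space = key.indexOf(" ");
    String token1 = key.substring(0, space);
    String token2 = key.substring(space + 1).trim();
    Integer tokenVal1 = mapb.get(token1);   // null if the token is not in mapB
    Integer tokenVal2 = mapb.get(token2);
    if (tokenVal1 != null && tokenVal2 != null && tokenVal1 > 1000 && tokenVal2 > 1000) {
        procedurecall(mapAvalue, token1, token2, tokenVal1, tokenVal2);
    }
}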

Split a text file using delimiter \t and store in hashmap in java

I have a text file of n columns and n rows separated by tabs. How do I use the split function to store the columns and rows into a HashMap? Please help. My text file will be like:
Dept Id Name Contact
IT 1 zzz 678
ECE 2 ttt 789
IT 3 rrr 908
I tried the following, but it didn't work.
Map<String, String> map = new HashMap<String, String>();
while (lineReader != null)
{
    String[] tokens = lineReader.split("\\t");
    key = tokens[0];
    values = tokens[1];
    map.put(key, values);
    System.out.println("ID:" + map.get(key));
    System.out.println("Other Column Values:" + map.get(values));
}
This returns the key of the last entry (row) of the file and the value as null. But I need to store all rows and columns in the map. How do I do it?
If I understand your data correctly,
After
String[] tokens = lineReader.split("\\t");
is processed on the first line, you'd have 4 tokens in the array.
I think you are using the wrong logic. If you want to store the map in the following way:
IT -> (1 ZZZ 678)
... etc., then you need to process the data differently.
What you are storing in the map is as follows:
IT -> 1
ECE -> 2
...
and so on.
That's why you get null when you are trying to do:
map.get(value);
What you should instead print is the Key and map.get(key).
Actually, in any case I don't think Map is what you want (but I don't know what you really want).
For now though, for your understanding of this problem try printing:
System.out.println("Total collumns: "+ tokens.length);
Updated:
This should work for you. It isn't the most elegant implementation for what you want, but gets the job done. You should try improving it from here on.
Map<String, String> map = new HashMap<String, String>();
BufferedReader reader = new BufferedReader(new FileReader("input.txt")); // placeholder path
String lineReader = reader.readLine(); // note: the header row will be processed too unless you skip it
while (lineReader != null)
{
    String[] tokens = lineReader.split("\\t");
    String key = tokens[1];                      // the Id column
    String values = tokens[2] + " " + tokens[3]; // the Name and Contact columns
    map.put(key, values);
    System.out.println("ID:" + key);
    System.out.println("Other Column Values:" + map.get(key));
    lineReader = reader.readLine();              // advance to the next line
}
reader.close();
Good luck!
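If what you actually want is the Dept as the key with the full remaining row as the value (the "IT -> (1 zzz 678)" layout mentioned above), one way to handle repeated Dept keys is a Map of Lists. A rough sketch, assuming the same tab-separated file (the file name is a placeholder; uses java.io and java.util):

Map<String, List<String>> byDept = new HashMap<String, List<String>>();
BufferedReader reader = new BufferedReader(new FileReader("input.txt")); // placeholder path
reader.readLine();                               // skip the header row
String line;
while ((line = reader.readLine()) != null) {
    String[] tokens = line.split("\\t");
    String dept = tokens[0];                     // the Dept column
    String rest = tokens[1] + " " + tokens[2] + " " + tokens[3];
    List<String> rows = byDept.get(dept);
    if (rows == null) {                          // first time we see this Dept
        rows = new ArrayList<String>();
        byDept.put(dept, rows);
    }
    rows.add(rest);                              // e.g. IT -> ["1 zzz 678", "3 rrr 908"]
}
reader.close();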

search elements in an array in java

I'm wondering what kind of method I should use to search the elements in an array, and what data structure to use to store the return value.
For example, a txt file contains the following:
123 Name line Moon night table
124 Laugh Cry Dog
123 quote line make pet table
127 line array hello table
and the search elements are line + table.
I read every line as a string and then split it by space.
The output should look like this:
123 2 (ID 123 occurs twice in lines that contain the search elements)
127 1
I want some suggestions on what kind of method to use to search the elements in the array, and what data structure to store the return value in (the ID and the number of occurrences; I'm thinking HashMap).
Read the text file and store each line that ends with table in an ArrayList<String>. Then use contains for each element in the ArrayList<String>. Store the result in a HashMap<key, value> where the key is the ID and the value is an Integer representing the number of times the ID occurs.
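A rough sketch of that suggestion (the file name and the exact matching of the two search elements are assumptions based on the example above; uses java.io and java.util):

List<String> lines = new ArrayList<String>();
BufferedReader reader = new BufferedReader(new FileReader("input.txt")); // placeholder path
String line;
while ((line = reader.readLine()) != null) {
    if (line.endsWith("table")) {                // keep only lines ending with "table"
        lines.add(line);
    }
}
reader.close();

Map<String, Integer> counts = new HashMap<String, Integer>();
for (String l : lines) {
    if (l.contains("line")) {                    // the other search element
        String id = l.split(" ")[0];             // the ID is the first token
        Integer c = counts.get(id);
        counts.put(id, c == null ? 1 : c + 1);
    }
}
// counts now holds 123 -> 2 and 127 -> 1 for the example input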
First, I would keep reading through the file line by line; there's really no other way of going about it.
Second, to pick out the rows to save, you don't need to do the split (assumption: they all end in (space)table). You can just get them by using:
if (line.endsWith(" table"))
Then, I would suggest using a Map<String, Integer> datatype to store your information. This way, you have the number of the table (key) and how many times it was found in the file (value).
Map<String, Integer> map = new HashMap<String, Integer>();
....reading file....
if (line.endsWith(" table")) {
    String number = line.substring(0, line.indexOf(" "));
    if (!map.containsKey(number)) {
        map.put(number, 1);
    } else {
        Integer value = map.get(number);
        value++;
        map.put(number, value);
    }
}

Sorting by value in Hadoop from a file

I have a file containing a String, then a space and then a number on every line.
Example:
Line1: Word 2
Line2: Word1 8
Line3: Word2 1
I need to sort the numbers in descending order and then put the result in a file, assigning a rank to each number. So my output should be a file in the following format:
Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3
Does anyone have an idea how I can do this in Hadoop?
I am using Java with Hadoop.
You could organize your map/reduce computation like this:
Map input: default
Map output: "key: number, value: word"
(sorting phase by key)
Here you will need to override the default sorter to sort in decreasing order.
Reduce - 1 reducer
Reduce input: "key: number, value: word"
Reduce output: "key: word, value: (number, rank)"
Keep a global counter. For each key-value pair add the rank by incrementing the counter.
Edit: Here is a code snippet of a custom descending sorter:
public static class IntComparator extends WritableComparator {
    public IntComparator() {
        super(IntWritable.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        Integer v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        Integer v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return v1.compareTo(v2) * (-1);
    }
}
Don't forget to actually set it as the comparator for your job:
job.setSortComparatorClass(IntComparator.class);
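The ranking reducer itself is not shown above; a minimal sketch of the single-reducer rank assignment could look like this (class and field names are illustrative; it only produces a global rank if the job runs with a single reducer, e.g. job.setNumReduceTasks(1)):

public static class RankReducer extends Reducer<IntWritable, Text, Text, Text> {
    private long rank = 0;    // running position in the descending order

    @Override
    protected void reduce(IntWritable number, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
        for (Text word : words) {
            rank++;
            // output: word, number, rank  (e.g. "Word1  8 1")
            context.write(word, new Text(number.get() + " " + rank));
        }
    }
}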
Hadoop Streaming - Hadoop 1.0.x
According to this, after the
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar
you add a comparator
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
you specify the kind of sorting you want
-D mapred.text.key.comparator.options=-[options]
where the [options] are similar to Unix sort options. Here are some examples:
Reverse order
-D mapred.text.key.comparator.options=-r
Sort on numeric values
-D mapred.text.key.comparator.options=-n
Sort on value or whatever field
-D mapred.text.key.comparator.options=-kx,y
With the -k flag you specify the sort key. The x and y parameters define this key. So, if a line has more than one token, you can choose which token, or which combination of tokens, will be the sort key. See the references for more details and examples.
I devised the solution to this problem. It was simple actually.
For sorting by value you need to use
setOutputValueGroupingComparator(Class)
For sorting in decreasing order you need to use setSortComparatorClass(LongWritable.DecreasingComparator.class);
For ranking you need to use the Counter class, with its getCounter and increment functions.
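A rough sketch of how those calls fit together (the counter group/name and the output format are assumptions, and word/number stand for the reducer's current value and key):

// driver configuration
job.setSortComparatorClass(LongWritable.DecreasingComparator.class); // descending order of the LongWritable keys
job.setNumReduceTasks(1);                                            // a single reducer, so the rank counter is global

// inside the reducer, a Hadoop Counter provides the running rank
Counter rank = context.getCounter("ranking", "rank");
rank.increment(1);
context.write(word, new Text(number.get() + " " + rank.getValue()));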

how to find duplicate and unique string entries using Hashtable

Assume I'm taking a string as input from the command line and I want to find the duplicate and unique entries in the string by using a Hashtable.
eg:
i/p:
hi hello bye hi good hello name hi day hi
o/p:
Unique elements are: bye, good, name, day
Duplicate elements are:
hi 3 times
hello 2 times
You can break the input apart by calling split(" ") on the input String. This will return a String[] representing each word. Iterate over this array, and use each String as the key into your Hashtable, with the value being an Integer. Each time you encounter a word, either increment its value, or set the value to 0 if no value is currently there.
Hashtable<String, Integer> hashtable = new Hashtable<String, Integer>();
String[] splitInput = input.split(" ");
for (String inputToken : splitInput) {
    Integer val = hashtable.get(inputToken);
    if (val == null) {
        val = Integer.valueOf(0);   // first occurrence: start at 0, incremented below
    }
    ++val;
    hashtable.put(inputToken, val);
}
Also, you may want to look into HashMap rather than Hashtable. HashMap is not thread safe, but is faster. Hashtable is a bit slower, but is thread safe. If you are trying to do this in a single thread, I would recommend HashMap.
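The question also asks for the unique/duplicate listing itself; a small follow-on sketch over the populated table from the snippet above:

for (Map.Entry<String, Integer> entry : hashtable.entrySet()) {
    if (entry.getValue() == 1) {
        System.out.println("Unique: " + entry.getKey());
    } else {
        System.out.println("Duplicate: " + entry.getKey() + " " + entry.getValue() + " times");
    }
}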
Use a hashtable with string as key and a numeric type as counter.
Go through all the words and if they are not in the map, insert them; otherwise increase the count (the data part of the hashtable).
hth
Mario
You can convert each string into an integer and then use the generated integer as the hash value. To convert a string to an int, you can treat it as a base-256 number and convert it.
