Sorting by value in Hadoop from a file


I have a file containing a String, then a space and then a number on every line.
Example:
Line1: Word 2
Line2: Word1 8
Line3: Word2 1
I need to sort the numbers in descending order and then write the result to a file, assigning a rank to each number. So my output should be a file in the following format:
Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3
Does anyone have an idea how I can do this in Hadoop?
I am using Java with Hadoop.

You could organize your map/reduce computation like this:
Map input: default
Map output: "key: number, value: word"
Sorting phase: by key
Here you will need to override the default sort comparator so that the keys are sorted in decreasing order.
Reduce - 1 reducer
Reduce input: "key: number, value: word"
Reduce output: "key: word, value: (number, rank)"
Keep a counter in the single reducer. For each key-value pair, increment the counter and emit its value as the rank.
Edit: Here is a code snippet of a custom descending comparator:
public static class IntComparator extends WritableComparator {
public IntComparator() {
super(IntWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
Integer v1 = ByteBuffer.wrap(b1, s1, l1).getInt();
Integer v2 = ByteBuffer.wrap(b2, s2, l2).getInt();
return v1.compareTo(v2) * (-1);
}
}
Don't forget to actually set it as the comparator for your job:
job.setSortComparatorClass(IntComparator.class);
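To check the expected ranking before wiring up a job, the map, sort, and single-reducer steps can be mimicked locally in plain Java. This is only a sketch of the logic, not Hadoop code: the class name RankByValue and the method rank are made up for illustration, and it assumes every input line is a word, whitespace, and an integer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RankByValue {
    // Hypothetical helper: takes "word number" lines, returns "word number rank" lines.
    static List<String> rank(List<String> lines) {
        // "Map" step: split each line into a (word, number) pair.
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                pairs.add(parts);
            }
        }
        // "Sort" step: order by number, largest first (what the custom comparator does in the job).
        pairs.sort((x, y) -> Integer.compare(Integer.parseInt(y[1]), Integer.parseInt(x[1])));
        // "Reduce" step: a single counter hands out ranks in sorted order.
        List<String> ranked = new ArrayList<>();
        int rank = 0;
        for (String[] pair : pairs) {
            rank++;
            ranked.add(pair[0] + " " + pair[1] + " " + rank);
        }
        return ranked;
    }

    public static void main(String[] args) {
        for (String line : rank(Arrays.asList("Word 2", "Word1 8", "Word2 1"))) {
            System.out.println(line);
        }
    }
}
```

The counter only yields a correct global rank because a single consumer sees all pairs in sorted order, which is why the plan above uses exactly one reducer.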

Hadoop Streaming - Hadoop 1.0.x
According to this, after the
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.*.jar
you add a comparator
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
you specify the kind of sorting you want
-D mapred.text.key.comparator.options=-[ options]
where the [ options] are similar to Unix sort. Here are some examples,
Reverse order
-D mapred.text.key.comparator.options=-r
Sort on numeric values
-D mapred.text.key.comparator.options=-n
Sort on value or whatever field
-D mapred.text.key.comparator.options=-kx,y
With the -k flag you specify the sort key. The x and y parameters define this key, so if a line has more than one token you can choose which token, or which combination of tokens, is used as the sort key. See the references for more details and examples.

I worked out the solution to this problem. It was actually simple.
For sorting by value you need to use
setOutputValueGroupingComparator(Class)
For sorting in decreasing order you need to use
setSortComparatorClass(LongWritable.DecreasingComparator.class);
For ranking you need to use
the Counter class, together with the getCounter and increment functions.

Related

counting number of occurrences of words in a text java

So I'm building a TreeMap from scratch and I'm trying to count the number of occurrences of every word in a text using Java. The text is read from a text file, but I can easily read it from there. I really don't know how to count every word, can someone help?
Imagine the text is something like:
Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.
Output:
Over 1
time 1
computer 1
algorithms 5
...
If possible I want to ignore case, so that upper-case and lower-case occurrences are counted together.
EDIT: I don't want to use any sort of Map (e.g. HashMap) or something similar to do this.

Break down the problem as follows (this is one potential solution - not THE solution):
Split the text into words (create a list or array of words).
Remove punctuation marks.
Create your map to collect results.
Iterate over your list of words and add "1" to the value of each encountered key
Display results (Iterate over the map's EntrySet)
Split the text into words
My preference is to split words using a space as the delimiter. The reason is that if you split on non-word characters, you may miss some hyphenated words. Although hyphenation is becoming less common, there are still plenty of words that fall under this rule, for example middle-aged. If such a word is encountered, it MIGHT have to be treated as one word and not two.
Remove punctuation marks
Because of the decision above, you will first need to remove punctuation characters that might be attached to your words. Keep in mind that if you use a regular expression to split the words, you might be able to accomplish this step at the same time as the step above. In fact, that would be preferred, so that you don't have to iterate over the text twice: do both in a single pass. While you are at it, call toLowerCase() on the input string to eliminate the distinction between capitalized and lowercase words.
Create your map to collect results
This is where you are going to collect your counts, using the TreeMap implementation of the Java Map. One thing to be aware of with this particular implementation is that the map is sorted according to the natural ordering of its keys. In this case, since the keys are the words from the input text, the keys will be arranged in alphabetical order, not by the magnitude of the count. If sorting the entries by count is important, there is a technique where you "reverse" the map, making the values the keys and the keys the values. However, since two or more words could have the same count, you will need to create a new map of <Integer, Set<String>>, so that you can group together words with the same count.
Iterate over your list of words
At this point, you should have a list of words and a map structure to collect the counts. Using a lambda expression, you should be able to perform a count() of your words very easily. But if you are not familiar or comfortable with lambda expressions, you can use a regular loop to iterate over your list, do a containsKey() check to see whether the word was encountered before, get() the value if the map already contains the word, and then add 1 to the previous value. Lastly, put() the new count in the map.
Display results
Again, you can use a lambda expression to print the entry set's key-value pairs, or simply iterate over the entry set to display the results.
Based on all of the above points, a potential solution could look like this (not using lambdas, for the OP's sake):
public static void main(String[] args) {
String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
text = text.replaceAll("\\p{P}", ""); // replace all punctuations
text = text.toLowerCase(); // turn all words into lowercase
String[] wordArr = text.split(" "); // create list of words
Map<String, Integer> wordCount = new TreeMap<>();
// Collect the word count
for (String word : wordArr) {
if(!wordCount.containsKey(word)){
wordCount.put(word, 1);
} else {
int count = wordCount.get(word);
wordCount.put(word, count + 1);
}
}
Iterator<Entry<String, Integer>> iter = wordCount.entrySet().iterator();
System.out.println("Output: ");
while(iter.hasNext()) {
Entry<String, Integer> entry = iter.next();
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
This produces the following output
Output:
advantage: 1
algorithms: 5
and: 1
combine: 1
computer: 1
each: 1
engineers: 1
even: 1
for: 2
in: 1
invent: 1
more: 1
new: 1
of: 2
other: 2
others: 1
over: 1
producing: 1
results: 2
take: 1
the: 1
things: 1
time: 1
to: 1
turn: 1
utilize: 1
with: 1
work: 1
Why did I break down the problem like this for such a mundane task? Simple. I believe each of those discrete steps should be extracted into functions to improve code reusability. Yes, it is cool to use a lambda expression to do everything at once and make your code look much simpler. But what if you need to repeat some intermediate step over and over? Most of the time, code is duplicated to accomplish this. In reality, a better solution is often to break these tasks into methods. Some of these tasks, like transforming the input text, can be done in a single method, since those activities are related in nature. (There is such a thing as a method doing "too little.")
public String[] createWordList(String text) {
return text.replaceAll("\\p{P}", "").toLowerCase().split(" ");
}
public Map<String, Integer> createWordCountMap(String[] wordArr) {
Map<String, Integer> wordCountMap = new TreeMap<>();
for (String word : wordArr) {
if(!wordCountMap.containsKey(word)){
wordCountMap.put(word, 1);
} else {
int count = wordCountMap.get(word);
wordCountMap.put(word, count + 1);
}
}
return wordCountMap;
}
public void displayCount(Map<String, Integer> wordCountMap) {
Iterator<Entry<String, Integer>> iter = wordCountMap.entrySet().iterator();
while(iter.hasNext()) {
Entry<String, Integer> entry = iter.next();
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
Now, after doing that, your main method looks more readable and your code is more reusable.
public static void main(String[] args) {
WordCount wc = new WordCount();
String text = "...";
String[] wordArr = wc.createWordList(text);
Map<String, Integer> wordCountMap = wc.createWordCountMap(wordArr);
wc.displayCount(wordCountMap);
}
UPDATE:
One small detail I forgot to mention: if a HashMap is used instead of a TreeMap, the output is no longer alphabetical. A HashMap makes no guarantee about iteration order; the order depends on the hash codes of the keys (the words) and the size of the table, not on the counts. So although my output after switching to HashMap began as follows, with the highest counts first, that is a coincidence and should not be relied on. To order by count you still need to "reverse" the map or sort the entries explicitly.
Output:
algorithms: 5
other: 2
for: 2
turn: 1
computer: 1
producing: 1
...
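If ordering by count is what matters, the "reversed" map of counts to word sets mentioned earlier in this answer does not depend on any map's iteration order. A minimal sketch follows; the class name CountInverter and the method invert are invented for illustration, and the choice to sort the words within each count alphabetically is an assumption.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class CountInverter {
    // Hypothetical helper: turns word -> count into count -> words, highest count first.
    static NavigableMap<Integer, SortedSet<String>> invert(Map<String, Integer> counts) {
        NavigableMap<Integer, SortedSet<String>> byCount =
                new TreeMap<>(Comparator.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Words that share a count end up in the same (alphabetically sorted) set.
            byCount.computeIfAbsent(entry.getValue(), k -> new TreeSet<>()).add(entry.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("algorithms", 5);
        counts.put("for", 2);
        counts.put("other", 2);
        counts.put("time", 1);
        for (Map.Entry<Integer, SortedSet<String>> entry : invert(counts).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

The input map can be a TreeMap or a HashMap; the ordering of the result comes only from the reversed comparator on the counts.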

My suggestion is to use a regular expression with split, and a stream with grouping (see example 3).
EX1: this solution does not use a collection (List/Map), only an array; in my view it is not optimal.
@Test
public void testApp2() {
final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
final String lowerText = text.toLowerCase();
final String[] split = lowerText.split("\\W+");
System.out.println("Output: ");
for (String s : split) {
if (s == null) {
continue;
}
int count = 0;
for (int i = 0; i < split.length; i++) {
final boolean sameWorld = s.equals(split[i]);
if (sameWorld) {
count = count + 1;
split[i] = null;
}
}
System.out.println(s + " " + count);
}
}
EX2: I think this is what you mean, but I'm not sure whether I rely too much on the list.
@Test
public void testApp() {
final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
final String[] split = text.split("\\W+");
final List<String> list = new ArrayList<>();
System.out.println("Output: ");
for (String s : split) {
if(!list.contains(s.toUpperCase())){ // compare in the same case that is stored in the list
list.add(s.toUpperCase());
final long count = Arrays.stream(split).filter(s::equalsIgnoreCase).count();
System.out.println(s+" "+count);
}
}
}
EX3: below is a test for your example, but it uses a Map.
@Test
public void test() {
final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
Map<String, Long> result = Arrays.stream(text.split("\\W+")).collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
assertEquals(Long.valueOf(5), result.get("algorithms")); // expected value first, actual second
System.out.println("Output: ");
result.entrySet().stream().forEach(x -> System.out.println(x.getKey() + " " + x.getValue()));
}

Java - Search performantly for subset of String in String list

I want to search through a list of Strings and return the values which contain the search string.
The list could look like this (it can have up to 1000 entries), although it is not guaranteed that an entry is always letters followed by a digit. It could be digits only, words only, or even both mixed up:
entry 1
entry 2
entry 3
entry 4
test 1
test 2
test 3
tst 4
If the user searches for 1, these should be returned:
entry 1
test 1
The situation is that the user has a search bar and can enter a search string. This search string is used to search through the list.
How can this be done performantly?
Currently, I have got:
for (String s : strings) {
if (s.contains(searchedText)) result.add(s);
}
It is O(N) and really slow. Especially if the user types many characters at a time.

Maybe I don't understand your question, but as you know, in Java String objects are immutable, and a String can also be seen as a sequence (array) of chars. So one thing you can do is search with a better algorithm, such as binary search, Aho-Corasick, Rabin-Karp, Boyer-Moore string search, StringSearch or one of these. You may also consider abstract data types with better lookup performance (hashing, trees, etc.).

This is very simple if you use streams:
final List<String> items = Arrays.asList("entry 1", "entry 2", "entry 3", "test 1", "test 2", "test 3");
final String searchString = "1";
final List<String> results = items.parallelStream() // work in parallel
.filter(s -> s.contains(searchString)) // pick out items that match
.collect(Collectors.toList()); // and turn those into a result list
results.forEach(System.out::println);
Notice the parallelStream() which will cause the list to be filtered and traversed using all available CPUs.
In your case you can use the results when the user expands the search term (while typing) to reduce the amount of items that need to be filtered, because if 's' matches all items in result, all those that match 'se' will be a sub-list of result.
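The narrowing idea in the last paragraph can be sketched roughly as follows. The class name IncrementalSearch and its fields are made up for illustration; the sketch assumes a single user typing into one search bar, so the previous query and its results can simply be remembered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class IncrementalSearch {
    private final List<String> all;
    private String lastQuery = "";
    private List<String> lastResults;

    IncrementalSearch(List<String> items) {
        this.all = new ArrayList<>(items);
        this.lastResults = this.all; // the empty query matches everything
    }

    List<String> search(String query) {
        // If the new query contains the previous one, every match of the new query
        // also matched the previous query, so only the previous results need scanning.
        List<String> candidates = query.contains(lastQuery) ? lastResults : all;
        List<String> results = new ArrayList<>();
        for (String s : candidates) {
            if (s.contains(query)) {
                results.add(s);
            }
        }
        lastQuery = query;
        lastResults = results;
        return Collections.unmodifiableList(results);
    }

    public static void main(String[] args) {
        IncrementalSearch search = new IncrementalSearch(
                Arrays.asList("entry 1", "entry 2", "test 1", "test 2", "tst 4"));
        System.out.println(search.search("t"));  // scans the full list
        System.out.println(search.search("te")); // scans only the matches for "t"
    }
}
```

When the user deletes characters or types something unrelated, the sketch falls back to scanning the full list.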

If you don't use any additional structures, you cannot do better than looking through all of your data, which takes O(N).
If you can do some preparation, such as building a text index, you can improve search performance. General information: http://en.wikipedia.org/wiki/Full_text_search. If you can make assumptions about your data (for example, that the last symbol is a number and you will search only by it), it will be easy to create such an index.
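As one possible shape for such an index when no assumption about the data holds, here is a small character-level index: each character maps to the positions of the entries that contain it, and a query only verifies the entries listed under its rarest character. This is a sketch, not a full-text engine; the class name CharIndex and its methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CharIndex {
    private final List<String> items;
    // character -> indices of the entries containing that character, in list order
    private final Map<Character, List<Integer>> postings = new HashMap<>();

    CharIndex(List<String> source) {
        this.items = new ArrayList<>(source);
        for (int i = 0; i < items.size(); i++) {
            Set<Character> seen = new HashSet<>();
            for (char c : items.get(i).toCharArray()) {
                if (seen.add(c)) { // record each entry once per distinct character
                    postings.computeIfAbsent(c, k -> new ArrayList<>()).add(i);
                }
            }
        }
    }

    List<String> search(String query) {
        if (query.isEmpty()) {
            return new ArrayList<>(items);
        }
        // Pick the query character that occurs in the fewest entries.
        List<Integer> smallest = null;
        for (char c : query.toCharArray()) {
            List<Integer> candidates = postings.get(c);
            if (candidates == null) {
                return new ArrayList<>(); // a character that occurs nowhere: no match possible
            }
            if (smallest == null || candidates.size() < smallest.size()) {
                smallest = candidates;
            }
        }
        // Verify the remaining candidates with an ordinary substring check.
        List<String> results = new ArrayList<>();
        for (int index : smallest) {
            String item = items.get(index);
            if (item.contains(query)) {
                results.add(item);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        CharIndex index = new CharIndex(Arrays.asList("entry 1", "entry 2", "test 1", "tst 4"));
        System.out.println(index.search("1"));
    }
}
```

The worst case is still O(N) when every query character appears in every entry, but rare characters such as a digit cut the candidate set considerably.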

Depending on the upper limit of the number in the string and if you have no concerns about space, use an Array of ArrayLists where the array index is the number of the string:
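// new ArrayList<String>[1000] does not compile (generic array creation), so create a raw array
@SuppressWarnings("unchecked")
ArrayList<String>[] data = new ArrayList[1000];
for ( int i = 0; i < 1000; i++ )
data[i] = new ArrayList<String>();
//inserting data: the last character of the string is taken as its number
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
data[num].add(datastring);
//getting all data that has a 1
for ( String s: data[1] )
result.add(s);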
Using a Hashmap would overwrite previous mapped values when trying to put new values into it.
i.e. if 1 maps to entry, then you try to add 1 mapping to test, the entry would get replaced with test.
As another idea, you could just keep a count of the number of strings with each number, so when you're searching, you know how many to look for, so as soon as you find all of them, you stop searching:
int[] str_count = new int[1000];
for ( int i = 0; i < 1000; i++ )
str_count[i] = 0;
//when storing data into the list:
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
str_count[num]++;
//when searching the list for 1s:
int count = str_count[1];
for (String s : strings) {
if (s.contains(searchedText))
result.add(s);
if (result.size() == count)
break;
}
While the first idea would be much faster, it takes up more space. The second idea takes up less space, but in the worst case the search is still O(N).

search elements in an array in java

I'm wondering what kind of method I should use to search the elements in an array, and what data structure I should use to store the return value.
For example, a txt file contains the following:
123 Name line Moon night table
124 Laugh Cry Dog
123 quote line make pet table
127 line array hello table
and the search elements are line+table.
I read every line as a string and then split it by spaces.
The output should look like this:
123 2 (ID 123 occurs twice in lines that contain the search elements)
127 1
I would like suggestions on what kind of method to use to search the elements in the array, and what kind of data structure to use to store the return value (the ID and the number of occurrences; I'm thinking of a HashMap).

Read the text file and store each line that ends with table in an ArrayList<String>. Then use contains on each element of the ArrayList<String>. Store the result in a HashMap<key,value>, where the key is the ID and the value is an Integer representing the number of times the ID occurs.

First, I would keep reading through the file line by line, there's really no other way of going about it other than that.
Second, to pick out the rows to save, you don't need to do the split (assumption: they all end in (space)table). You can just get them by using:
if (line.endsWith(" table"))
Then, I would suggest using a Map<String, Integer> datatype to store your information. This way, you have the number of the table (key) and how many times it was found in the file (value).
Map<String, Integer> map = new HashMap<String, Integer>();
....reading file....
if (line.endsWith(" table")) {
String number = line.substring(0, line.indexOf(" "));
if (!map.containsKey(number)) {
map.put(number, 1);
} else {
Integer value = map.get(number);
value++;
map.put(number, value);
}
}
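If the lines cannot be assumed to end in " table", the question's "line+table" search can be handled by splitting each line and checking that all search terms are present. A minimal sketch under that reading of the question; the class name TermCounter and the method countIds are invented for illustration, and the first token of each line is assumed to be the ID.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TermCounter {
    // Hypothetical helper: counts, per ID, the lines that contain every search term.
    static Map<String, Integer> countIds(List<String> lines, Set<String> terms) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps IDs in first-seen order
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < 2) {
                continue; // blank line or an ID with no words
            }
            Set<String> words = new HashSet<>(Arrays.asList(tokens).subList(1, tokens.length));
            if (words.containsAll(terms)) {
                counts.merge(tokens[0], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "123 Name line Moon night table",
                "124 Laugh Cry Dog",
                "123 quote line make pet table",
                "127 line array hello table");
        Set<String> terms = new HashSet<>(Arrays.asList("line", "table"));
        for (Map.Entry<String, Integer> entry : countIds(lines, terms).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```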

intersection of two strings using Java HashSet

I am trying to learn Java by doing some assignments from a Stanford class and am having trouble answering this question.
boolean stringIntersect(String a, String b, int len): Given 2 strings,
consider all the substrings within them of length len. Returns true if
there are any such substrings which appear in both strings. Compute
this in O(n) time using a HashSet.
I can't figure out how to do it using a HashSet because you cannot store repeating characters. So stringIntersect("hoopla", "loopla", 5) should return true.
thanks!
Edit: Thanks so much for all your prompt responses. It was helpful to see explanations as well as code. I guess I couldn't see why storing substrings in a hashset would make the algorithm more efficient. I originally had a solution like :
public static boolean stringIntersect(String a, String b, int len) {
assert (len>=1);
if (len>a.length() || len>b.length()) return false;
String s1=new String(),s2=new String();
if (a.length()<b.length()){
s1=a;
s2=b;
}
else {
s1=b;
s2=a;
}
int index = 0;
while (index<=s1.length()-len){
if (s2.contains(s1.substring(index,index+len)))return true;
index++;
}
return false;
}

I'm not sure I understand what you mean by "you cannot store repeating characters". A HashSet is a Set, so it can do two things: you can add values to it, and you can check whether a value is already in it. In this case, the problem wants you to answer the question by storing strings, not chars, in the HashSet. To do this in Java:
Set<String> stringSet = new HashSet<String>();
Try breaking this problem into two parts:
1. Generate all the substrings of length len of a string
2. Use this to solve the problem.
The hint for part two is:
Step 1: For the first string enter the substrings into a hashset
Step 2: For the second string, check the values in the hashset
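For reference, the two steps can be sketched in Java as follows. This is one straightforward reading of the hint, not necessarily what the assignment intended; the class name SubstringIntersect is invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class SubstringIntersect {
    static boolean stringIntersect(String a, String b, int len) {
        if (len < 1 || len > a.length() || len > b.length()) {
            return false;
        }
        // Step 1: store every substring of length len of the first string.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + len <= a.length(); i++) {
            seen.add(a.substring(i, i + len));
        }
        // Step 2: look up every substring of length len of the second string.
        for (int i = 0; i + len <= b.length(); i++) {
            if (seen.contains(b.substring(i, i + len))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(stringIntersect("hoopla", "loopla", 5));
    }
}
```

Each substring is created and hashed in time proportional to len, so the total work grows with the number of substrings times len rather than with n alone.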
Note (Advanced): this problem is poorly specified. Entering and checking strings in a hash table takes time proportional to the length of the string. A string a of length n has n-k+1, i.e. O(n-k), substrings of length k. So for a string a of length n and a string b of length m the running time is O((n-k)*k + (m-k)*k). This is not really O(n), since for k = n/2 the running time is O((n/2)*(n/2)) = O(n^2).
Edit: So what if you actually want to do this in O(n) (or perhaps O(n+m+k))? My belief is that the original homework was asking for something like the algorithm I described above. But we can do better. What's more, we can do better and still make a HashSet the crucial tool for our algorithm. The idea is to perform our search using a "rolling hash". Wikipedia describes a couple: http://en.wikipedia.org/wiki/Rolling_hash, but we will implement our own.
A simple solution would be to XOR the values of the character hashes together. This would allow us to add a new char to the hash in O(1) and remove one in O(1), making computing the next hash trivial. But this simple algorithm won't work, for two reasons:
1. The character hashes may not provide sufficient entropy. Okay, we don't know whether we will have this problem, but let's solve it anyway, just for fun.
2. Permutations will hash to the same value: "abc" should not have the same hash as "cba".
To solve the first problem we can use an idea from AI, namely let's steal from Zobrist hashing. The idea is to assign every possible character a random value of a greater length. If we were using ASCII, we could easily create an array with all the ASCII characters, but that will run into problems when using Unicode characters. The alternative is to assign values lazily.
object LazyCharHash{
private val map = HashMap.empty[Char,Int]
private val r = new Random
def lHash(c: Char): Int = {
val d = map.get(c)
d match {
case None => {
map.put(c,r.nextInt)
lHash(c)
}
case Some(v) => v
}
}
}
This is Scala code. Scala tends to be less verbose than Java, but still allows me to use Java collections, so I will be using imperative-style Scala throughout. It wouldn't be that hard to translate.
The second problem can be solved as well. First, instead of using a pure XOR, we combine our XOR with a shift, so the hash function is now:
def fullHash(s: String) = {
var h = 0
for(i <- 0 until s.length){
h = h >>> 1
h = h ^ LazyCharHash.lHash(s.charAt(i))
}
h
}
Of course, using fullHash won't give a performance advantage. It is just a specification.
We need a way of using our hash function to store values in the HashSet (I promised we would use it). We can just create a wrapper class:
class HString(hash: Int, string: String){
def getHash = hash
def getString = string
override def equals(otherHString: Any): Boolean = {
otherHString match {
case other: HString => (hash == other.getHash) && (string == other.getString)
case _ => false
}
}
override def hashCode = hash
}
Okay, to make the hashing function rolling, we just have to XOR out the value associated with the character we will no longer be using. Doing that just takes shifting that value by the appropriate amount.
def stringIntersect(a: String, b: String, len: Int): Boolean = {
val stringSet = new HashSet[HString]()
var h = 0
for(i <- 0 until len){
h = h >>> 1
h = h ^ LazyCharHash.lHash(a.charAt(i))
}
stringSet.add(new HString(h,a.substring(0,len)))
for(i <- len until a.length){
h = h >>> 1
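// Int shifts use only the low 5 bits of the count, so for len >= 32 the old character has already been shifted out completely
if (len < 32) h = h ^ (LazyCharHash.lHash(a.charAt(i - len)) >>> len)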
h = h ^ LazyCharHash.lHash(a.charAt(i))
stringSet.add(new HString(h,a.substring(i - len + 1,i + 1)))
}
...
You can figure out how to finish this code on your own.
Is this O(n)? Well, it depends on what you mean. Big O, big Omega, and big Theta are all measures of bounds. They can describe the worst case of an algorithm, the best case, or something else. In this case these modifications give expected O(n) performance, but this only holds if we avoid hash collisions. It still takes O(n) to tell whether two Strings are equal. This random approach works pretty well, and you can scale up the size of the random bit arrays to make it work better, but it does not have guaranteed performance.
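For readers who prefer Java, here is a complete rolling-hash sketch of the same idea. Note that it swaps in a different technique from the Scala code above: a polynomial (Rabin-Karp style) hash with a fixed base and modulus instead of the XOR-and-shift hash over random per-character values, and a HashMap from hash value to start positions instead of the HString wrapper in a HashSet. The class name RollingIntersect and the constants BASE and MOD are arbitrary choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingIntersect {
    private static final long BASE = 257L;           // arbitrary base, an assumption
    private static final long MOD = 1_000_000_007L;  // arbitrary large prime, an assumption

    static boolean stringIntersect(String a, String b, int len) {
        if (len < 1 || len > a.length() || len > b.length()) {
            return false;
        }
        long high = 1; // BASE^(len-1) mod MOD, the weight of the oldest character in a window
        for (int i = 1; i < len; i++) {
            high = high * BASE % MOD;
        }
        // Hash every window of a and remember where each hash value starts.
        Map<Long, List<Integer>> starts = new HashMap<>();
        long h = 0;
        for (int i = 0; i < a.length(); i++) {
            if (i >= len) {
                h = (h - a.charAt(i - len) * high % MOD + MOD) % MOD; // drop the oldest character
            }
            h = (h * BASE + a.charAt(i)) % MOD;                        // append the newest character
            if (i >= len - 1) {
                starts.computeIfAbsent(h, k -> new ArrayList<>()).add(i - len + 1);
            }
        }
        // Roll the same hash over b and verify candidates to rule out collisions.
        h = 0;
        for (int i = 0; i < b.length(); i++) {
            if (i >= len) {
                h = (h - b.charAt(i - len) * high % MOD + MOD) % MOD;
            }
            h = (h * BASE + b.charAt(i)) % MOD;
            if (i >= len - 1) {
                List<Integer> candidates = starts.get(h);
                if (candidates != null) {
                    for (int start : candidates) {
                        if (a.regionMatches(start, b, i - len + 1, len)) {
                            return true;
                        }
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(stringIntersect("hoopla", "loopla", 5));
    }
}
```

Updating the hash is O(1) per character, so the running time is expected to be linear in the two string lengths as long as collisions are rare; the regionMatches check is only there so that a collision cannot produce a wrong answer.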

You should not store characters in the Hashset, but substrings.
When considering string "hoopla": if you store the substrings "hoopl" and "oopla" in the Hashset (linear operation), then it's linear again to find if one of the substrings of "loopla" matches.

I don't know how they're thinking you're supposed to use the HashSet but I ended up doing a solution like this:
public class StringComparator {
public static boolean compare( String a, String b, int len ) {
Set<String> pieces = new HashSet<String>();
for ( int x = 0; (x + len) <= a.length(); x++ ) { // the substrings are taken from a, so bound the loop by a's length
pieces.add( a.substring( x, x + len ) );
}
for ( String piece : pieces ) {
if ( b.contains(piece) ) {
return true;
}
}
return false;
}
}

how to find duplicate and unique string entries using Hashtable

Assume I'm taking a string as input from the command line and I want to find the duplicate and unique entries in the string by using a Hashtable.
eg:
i/p:
hi hello bye hi good hello name hi day hi
o/p:
Unique elements are: bye, good, name, day
Duplicate elements are:
hi 3 times
hello 2 times

You can break the input apart by calling split(" ") on the input String. This will return a String[] representing each word. Iterate over this array, and use each String as the key into your Hashtable, with the value being an Integer. Each time you encounter a word, either increment its value, or set the value to 0 if no value is currently there.
Hashtable<String, Integer> hashtable = new Hashtable<String, Integer>();
String[] splitInput = input.split(" ");
for(String inputToken : splitInput) {
Integer val = hashtable.get(inputToken);
if(val == null) {
val = 0; // autoboxed; the Integer constructor is deprecated
}
++val;
hashtable.put(inputToken, val);
}
Also, you may want to look into HashMap rather than Hashtable. HashMap is not thread safe, but is faster. Hashtable is a bit slower, but is thread safe. If you are trying to do this in a single thread, I would recommend HashMap.
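Once the counts are collected, the unique and duplicate entries can be separated in a second pass over the map. A small sketch along those lines; the class name DuplicateReport and its methods are invented for illustration, and it uses a LinkedHashMap instead of a Hashtable only so that words come out in first-seen order (a Hashtable would work the same way, with unspecified order).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateReport {
    // Count how often each whitespace-separated word occurs.
    static Map<String, Integer> count(String input) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Words that occur exactly once.
    static List<String> unique(Map<String, Integer> counts) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Words that occur more than once, with their counts.
    static Map<String, Integer> duplicates(Map<String, Integer> counts) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("hi hello bye hi good hello name hi day hi");
        System.out.println("Unique elements are: " + unique(counts));
        System.out.println("Duplicate elements are: " + duplicates(counts));
    }
}
```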

Use a hashtable with string as key and a numeric type as counter.
Go through all the words and if they are not in the map, insert them; otherwise increase the count (the data part of the hashtable).
hth
Mario

You can convert each string into an integer and then use the generated integer as the hash value. To convert a string to an integer, you can treat it as a base-256 number and convert it.
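One way to read the base-256 idea in Java is to treat each byte of the string as one digit. The sketch below assumes the characters fit in a single byte (Latin-1) and uses BigInteger because the value outgrows an int after four characters; the class name Base256 and the method toInteger are invented for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class Base256 {
    // Interprets the bytes of s as the digits of an unsigned base-256 number,
    // most significant digit first. Assumes every character fits in one byte.
    static BigInteger toInteger(String s) {
        byte[] digits = s.getBytes(StandardCharsets.ISO_8859_1);
        return new BigInteger(1, digits); // signum 1: non-negative magnitude
    }

    public static void main(String[] args) {
        System.out.println(toInteger("hi"));
    }
}
```

For example, "AB" becomes 65 * 256 + 66 = 16706. To use the result as a table index it still has to be reduced, for instance with mod against the table size.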
