I'm given a task which I'm a little confused about. Here is the question statement:
The following program should read a file and store all its tokens in a member variable.
Your task is to write a single method that returns the number of items in tokenMap, the average length (as double value) of the elements in tokenMap, and the number of tokens starting with character "a".
Here, tokenMap is an object of type HashMap<String, Integer>.
I do have some idea about HashMap, but what I want to know is: should the key that I store in tokenMap be a single character or the whole word?
Also, how can I compute the average length?
Looks like you have to use the entire word as the key.
The average length of tokens can be computed by summing the lengths of each token and dividing by the number of tokens.
In Java, you can find the number of tokens in the HashMap by tokenMap.size().
You can write loops that visit each member of the map like this:
for (String t : tokenMap.keySet()) {
    // t is a token; note keySet() here, since values() holds the Integers
}
and if you look up String in the Java API docs you will see that it is easy to find the length of a String.
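For example, a minimal sketch of the average-length computation (assuming the map is non-empty; the variable names are mine):
double totalLength = 0;
for (String t : tokenMap.keySet()) {
    totalLength += t.length();                  // String.length() gives each token's length
}
double average = totalLength / tokenMap.size(); // totalLength is already a double, so no cast needed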
To compute the average length of the items in a hash map, you'll have to iterate over them all, sum the lengths, and divide by the count.
As for your other question about what to use for a key, how are we supposed to know? A hashmap can use practically any* value for a key.
*The value must be hashable, which is defined differently for different languages.
Reading the question closely, it seems that you have to read a file, extract each word and use it as the key value, and store the length of each key as the integer:
an example line
leads to a HashMap like this
an : 2
example : 7
line : 4
After you've built your map (made of keys mapping to values, or "elements" as the question calls them), you'll need to run some statistics over it to find
the number of keys (look at HashMap)
the average length of all keys (again, simple enough)
the number beginning with "a" (just look at the String)
Then make a value object containing these values and return it from the method that does the statistics.
I know I've given more information than you require, but someone else may benefit from a little extra help.
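For example, here is a minimal sketch of that statistics method, assuming a hypothetical value class named TokenStats (these names are mine, not part of the assignment):
public class TokenStats {
    public int count;            // number of items in tokenMap
    public double averageLength; // average length of the keys
    public int startingWithA;    // number of tokens starting with "a"
}

public TokenStats computeStats() {
    TokenStats stats = new TokenStats();
    stats.count = tokenMap.size();
    double totalLength = 0;
    for (String token : tokenMap.keySet()) {
        totalLength += token.length();
        if (token.startsWith("a")) {
            stats.startingWithA++;
        }
    }
    stats.averageLength = stats.count == 0 ? 0.0 : totalLength / stats.count;
    return stats;
}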
Guys, there is some confusion. I'm not asking for a solution; I'm just confused about one thing.
For the time being, I'm going to use String as the key type.
The only confusion I have is: once I read the file line by line, should I split it based upon words or based upon each character? That is, should the key be a single-character String or a String for the whole word?
If you can go through the question statement, what do you suggest? That's all I'm asking.
should I split it based upon words or
based upon each character
The requirement is to make tokens, so you should split on words. Each word becomes a unique String key. It would make sense for the value to be the count of each token.
If the file you are reading has these three lines:
int alpha;
int beta;
float delta;
Then you should have something like
<"int", 2>
<";", 3>
<"alpha", 1>
<"beta", 1>
<"float", 1>
<"delta", 1>
(The semicolon may or may not be considered a token.)
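A minimal sketch of building those counts, assuming you split on non-word characters (which drops the semicolons; adjust if your tokenizer keeps them):
// java.io.* and java.util.* assumed
static Map<String, Integer> countTokens(String fileName) throws IOException {
    Map<String, Integer> tokenMap = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.split("\\W+")) { // splits off punctuation such as ";"
                if (!token.isEmpty()) {
                    tokenMap.merge(token, 1, Integer::sum); // count every occurrence
                }
            }
        }
    }
    return tokenMap;
}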
Your average length over the six distinct keys would be (3 + 1 + 5 + 4 + 5 + 5) / 6; if you instead average over all nine token occurrences, it would be (3×2 + 1×3 + 5 + 4 + 5 + 5) / 9.
Your count of tokens starting with "a" would be 1 ("alpha" is the only one).
Look elsewhere on this forum for keySet and you should be good to go.
How can I parse natural strings like these:
"10 meters"
"55m"
Into instances of this class:
public class Units {
    public String name;  // will be "meters"
    public int howMuch;  // will be 10 or 55
}
P.S. I want to do this with NLP libraries; I'm really a noob in NLP, and sorry for my bad English.
It is possible, but I recommend you don't do this. An array usually holds only one type of data, so it cannot hold an int and a String at the same time. If you did do it, you would have to use Object[][].
You could use the following algorithm:
Separate the text into words by looping through each character and breaking off a new word each time you encounter a space: this can be stored in a String array. Make sure that each word is stored lowercase.
Store a 2-dimensional String array as a database of all the units you want to recognize: this could be done with each sub-array representing one unit and all its equivalent representations: for example, the sub-array for meters might look like {"meter","meters","m"}.
Make two parallel ArrayLists: the first representing all numerical values and the second representing their corresponding units.
Loop through the list of words from step 1. For each word, check if it is in the format number+unit (without an adjoining space). If so, split the number off and put it in the first ArrayList, then find the unabbreviated unit corresponding to the abbreviated unit given in the text by referring to the 2-dimensional String array (this should be the first index of the sub-array), and add this unit to the second ArrayList. Otherwise, if the word is a single number, check if the next word corresponds with any of the units; if it does, find its unabbreviated unit (the first index of the sub-array), then add the number and its unit to their respective ArrayLists.
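If the input really is just these two shapes, a plain regex may be enough before reaching for an NLP library. Here is a minimal sketch (the pattern, the class name UnitsParser, and the tiny unit table are my own assumptions):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnitsParser {
    // Matches a number optionally followed by whitespace and a unit word, e.g. "10 meters" or "55m"
    private static final Pattern PATTERN = Pattern.compile("(\\d+)\\s*([a-zA-Z]+)");

    public static Units parse(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            return null; // no number+unit found
        }
        Units u = new Units();
        u.howMuch = Integer.parseInt(m.group(1));
        u.name = normalize(m.group(2));
        return u;
    }

    // Maps abbreviations and variants to a canonical name; extend with your own unit table
    private static String normalize(String unit) {
        switch (unit.toLowerCase()) {
            case "m": case "meter": case "meters": return "meters";
            default: return unit.toLowerCase();
        }
    }
}
UnitsParser.parse("55m") would then give name = "meters" and howMuch = 55.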
I have a certain String (from a radio talk show) which is an anagram with a length of 15.
What I want to do is to build all permutations efficiently and check them against a dictionary.
This way I want to find out the original word of the anagram.
I already wrote an algorithm, which works by merging one letter after another into the already-known permutations.
It is working, but it is too slow: no result is ever shown with 15 characters (no wonder, with 15! possibilities).
So my question is, how to do that faster?
For every word in your dictionary/array/set, sort the letters of the word and store the result in a separate dictionary/map, something like this:
// java.util.* assumed
Set<String> originalDictionary = new HashSet<>(Arrays.asList("word", "string"));
Map<String, String> sortedMap = new HashMap<>();
sortedMap.put("dorw", "word");     // letters of "word", sorted
sortedMap.put("ginrst", "string"); // letters of "string", sorted
For your input string, sort its letters the same way and check whether the result is a key in that map.
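Here is a minimal sketch of that idea, using a list of words per key so that several anagrams of the same letters can coexist (the tiny dictionary is made up):
import java.util.*;

public class AnagramSolver {
    // Sort the letters of a word so all anagrams share one canonical key
    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("word", "string", "drow");
        // One sorted key can map to several dictionary words
        Map<String, List<String>> sortedMap = new HashMap<>();
        for (String w : dictionary) {
            sortedMap.computeIfAbsent(sortLetters(w), k -> new ArrayList<>()).add(w);
        }
        String input = "dorw";
        System.out.println(sortedMap.getOrDefault(sortLetters(input), Collections.emptyList()));
        // -> [word, drow]
    }
}
This answers a 15-letter query with one sort and one map lookup instead of enumerating 15! permutations.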
I don't know how large your dictionary is, but assume it has far fewer than 15! entries. So I would go through all dictionary entries and find the ones with length 15. Checking whether each one is a permutation of your original string is then easy.
Started using Hadoop recently and struggling to make sense of a few things. Here is a basic WordCount example that I'm looking at (count the number of times each word appears):
Map(String docid, String text):
    for each word term in text:
        Emit(term, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
Firstly, what is Emit(w,1) supposed to be doing? I notice that in all of the examples I look at the second parameter is always set to 1, but I can't seem to find an explanation on it.
Also, just to clarify - am I correct in saying that term is the key and sum in Reduce form the key-value pairs (respectively)? If this is the case, is values simply a list of 1's for each term that got emitted from Map? That's the only way I can make sense of it, but these are just assumptions.
Apologies for the noob question. I have looked at tutorials, but they often use confusing terminology and make basic things more complicated than they actually are, so I'm struggling a little to make sense of this.
Appreciate any help!
Take this input as an example word count input.
Mapper will split this sentence into words.
Take,1
this,1
input,1
as,1
an,1
example,1
word,1
count,1
input,1
Then the reducer receives "groups" of the same word (or key) together with a list of the grouped values, like so (it additionally sorts the keys, but that's not important for this example):
Take, (1)
this, (1)
input, (1, 1)
etc...
As you can see, the key "input" has been "reduced" into a single element, which you can loop over to sum the values, and then emit like so:
Take,1
this,1
input,2
etc...
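If it helps to see that pseudocode as real code, here is a minimal sketch using the standard Hadoop MapReduce API (the class names are mine, and the job driver is omitted):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // this is the Emit(term, 1): one pair per word occurrence
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // values is exactly that list of 1s, grouped by the framework
            }
            result.set(sum);
            context.write(key, result); // this is the Emit(term, sum)
        }
    }
}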
Good question.
As explained, the mapper outputs a sequence of (key, value) pairs, in this case of the form (word, 1) for each word, which the reducer receives grouped as (key, <1,1,...,1>), sums up the terms in the list and returns (key, sum). Note that it is not the reducer who does the grouping; it's the map-reduce environment.
The map-reduce programming model is different from the one we're used to working in, and it's often not obvious how to implement an algorithm in this model. (Think, for example, about how you would implement k-means clustering.)
I recommend Chapter 2 of the freely available Mining of Massive Datasets book by Leskovec et al. See also the corresponding slides.
I need to examine millions of strings for abbreviations and replace them with the full version. Due to the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.
I have a lookup table that contains Abbreviation->Fullversion pairs, it contains about 600 pairs.
My current setup looks something like this. On startup I create a list of ShortForm instances from a CSV file using Jackson and hold them in a singleton:
public static class ShortForm {
    public String fullword;
    public String abbreviation;
}

List<ShortForm> shortForms = new ArrayList<ShortForm>();
// CSV code omitted
And some code that uses the list
for (ShortForm f : shortForms) {
    if (address.contains(f.abbreviation + ","))
        address = address.replace(f.abbreviation + ",", f.fullword + ",");
}
Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with commas in place, but what else could I do?
====== UPDATE
Changed the code to work the other way around: it splits strings into words and checks a map to see whether each word is an abbreviation.
StringBuilder fullFormed = new StringBuilder();
for (String s : Splitter.on(" ").split(add)) { // Guava's Splitter
    if (shortFormMap.containsKey(s))
        fullFormed.append(shortFormMap.get(s));
    else
        fullFormed.append(s);
    fullFormed.append(" ");
}
return fullFormed.toString().trim();
Testing shows this to be over 13x faster than the original approach. Cheers davecom!
It would already be a bit faster if you skip the contains() part :) since replace() simply does nothing when there is no match.
What could really improve performance would be to use a better data structure than a simple list for storing your ShortForms. If all of the shortForms were stored sorted alphabetically by abbreviation, you could reduce each lookup from O(n) to O(log n) with a binary search.
I haven't used it before, but perhaps the standard library's SortedMap fits the bill instead of using a custom object at all:
http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html
Here's what I'm thinking:
Put abbreviation/full word pairs into TreeMap
Tokenize the address into words.
Check each word to see if it is a key in the TreeMap
Replace it if it is
Put the corrected tokens back together as an address
I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.
This makes each lookup O(1) for a total of O(n) lookups where n is the number of abbreviations found and I don't think there's likely a more efficient method.
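A minimal sketch of that comma-anchored lookup (the class name and the map contents are made-up examples):
import java.util.HashMap;
import java.util.Map;

public class AbbreviationExpander {
    private static final Map<String, String> FULL_FORMS = new HashMap<>();
    static {
        FULL_FORMS.put("St", "Street");
        FULL_FORMS.put("Rd", "Road");
    }

    static String expand(String address) {
        StringBuilder out = new StringBuilder();
        for (String word : address.split(" ")) {
            if (word.endsWith(",")) {
                String bare = word.substring(0, word.length() - 1);
                String full = FULL_FORMS.get(bare); // O(1) lookup
                // only abbreviations terminated by a comma are replaced
                out.append(full != null ? full + "," : word);
            } else {
                out.append(word);
            }
            out.append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(expand("12 Main St, Springfield"));
        // -> 12 Main Street, Springfield
    }
}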
I need to implement a spell checker in Java. Let me give you an example: for the string "sch aproblm iseasili solved", my output should be "such a problem is easily solved". The maximum length of the string to correct is 64. As you can see, my string can have spaces inserted in the wrong places, or missing entirely, and even misspelled words. I need a little help finding an efficient algorithm for coming up with the corrected string.
I am currently deleting all spaces in my string and then inserting spaces in every possible position. So for the word "hot" (and this applies to a sentence as well) I generate the possible strings "h o t", "h ot", "ho t", "hot", to afterwards be corrected word by word using Levenshtein distance. As you can see, I have generated 2^(string.length() - 1) possible strings. So for a string with a length of 64 it will generate 2^63 possible strings, which is damn high, and afterwards I need to process them one by one and select the best one by a set of criteria:
- total edit distance (must take the smallest one)
- if more strings have the same edit distance, I have to choose the one with the fewest words
- if more strings have the same number of words, I need to choose the one whose words have the maximum total frequency (I have a dictionary of the 8000 most frequent words along with their frequencies)
- and finally, if more strings have the same total frequency, I have to take the lexicographically smallest one.
So basically I generate all possible strings (inserting spaces in all possible positions into the original string) and then, one by one, I calculate their total edit distance, number of words, etc., and choose the best one to output as the corrected string. I want to know if there is an easier (more efficient) way of doing this, i.e. one that does not require generating all possible combinations of strings.
EDIT: So I thought I should take another approach. Here is what I have in mind: I take the first letter of my string and extract from the dictionary all the words that begin with that letter. Then I process all of them and extract from my string all possible first words. To stay with my previous example: for the word "hot", generating all combinations gave 4 results, but with my new algorithm I obtain only 2, "hot" and "ho", so it's already an improvement. However, I need a little help creating a recursive or DP (dynamic programming) algorithm for doing this. I need a way to store all possible strings for the first word, then, for each of those, all possible strings for the second word, and so on, and finally to concatenate all possibilities and add them into an array or something. There will still be a lot of combinations for large strings, but not as many as having to do ALL of them. Can someone help me with pseudocode or something, as this is not my strong suit?
EDIT 2: Here is the code where I generate all the possible first words from my string: http://pastebin.com/d5AtZcth . I need to somehow do the same for the rest, combine each first word with each second word and so on, and store all these concatenations into an array or something.
A few tips for you:
try correcting just small parts of the string, not everything at once.
90% of errors (IIRC) are within edit distance 1 of the intended word.
you can use a phonetic index to match words against words that sound alike.
you can assume most typos are QWERTY errors (j=>k, h=>g), and try to check them first.
A few more ideas can be found in this nice article:
http://norvig.com/spell-correct.html
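For the segmentation part in your EDIT, here is a minimal sketch of the word-break dynamic programming idea. It does exact dictionary matches only; you would swap the dict.contains(word) test for your Levenshtein-based candidate lookup. The tiny dictionary is made up:
import java.util.*;

public class WordBreak {
    // Find every way to split a space-free string into dictionary words
    static List<List<String>> segment(String s, Set<String> dict) {
        // memo.get(i) caches all segmentations of the suffix starting at index i
        Map<Integer, List<List<String>>> memo = new HashMap<>();
        return segmentFrom(s, 0, dict, memo);
    }

    static List<List<String>> segmentFrom(String s, int start, Set<String> dict,
                                          Map<Integer, List<List<String>>> memo) {
        if (memo.containsKey(start)) return memo.get(start);
        List<List<String>> results = new ArrayList<>();
        if (start == s.length()) {
            results.add(new ArrayList<>()); // one empty segmentation ends the recursion
        } else {
            for (int end = start + 1; end <= s.length(); end++) {
                String word = s.substring(start, end);
                if (dict.contains(word)) { // replace with your fuzzy candidate lookup
                    for (List<String> rest : segmentFrom(s, end, dict, memo)) {
                        List<String> seg = new ArrayList<>();
                        seg.add(word);
                        seg.addAll(rest);
                        results.add(seg);
                    }
                }
            }
        }
        memo.put(start, results);
        return results;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("ho", "t", "hot"));
        System.out.println(segment("hot", dict)); // -> [[ho, t], [hot]]
    }
}
The memoization means each suffix is segmented once, so you avoid regenerating the full 2^(n-1) space of splits; you then score only the segmentations that consist entirely of plausible words.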