Parsing natural text to array - java

How can I parse natural strings like these:
"10 meters"
"55m"
Into instances of this class:
public class Units {
public String name; //will be "meters"
public int howMuch; //will be 10 or 55
}
P.S. I want to do this with NLP libraries. I'm really a noob in NLP, and sorry for my bad English.

It is possible, but I recommend you don't do this. An array usually holds only one type of data, so it cannot hold an int and a String at the same time. If you did do it, you would have to use an Object[][].

You could use the following algorithm:
1. Separate the text into words by looping through each character and breaking off a new word each time you encounter a space: this can be stored in a String array. Make sure that each word is stored lowercase.
2. Store a 2-dimensional String array as a database of all the units you want to recognize: this could be done with each sub-array representing one unit and all its equivalent representations: for example, the sub-array for meters might look like {"meter","meters","m"}.
3. Make two parallel ArrayLists: the first representing all numerical values and the second representing their corresponding units.
4. Loop through the list of words from step 1: for each word, check if it is in the format number+unit (without a separating space). If so, split the number off and put it in the first ArrayList. Then find the unabbreviated unit corresponding to the abbreviated unit given in the text by referring to the 2-dimensional String array (this should be the first index of the sub-array) and add this unit to the second ArrayList. Otherwise, if the word is a single number, check if the next word corresponds to any of the units; if it does, find its unabbreviated unit (the first index of the sub-array) and add the number and its unit to their respective ArrayLists.
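The steps above could be sketched roughly as follows. This is only one possible reading of the algorithm, with a few assumptions: a regular expression stands in for the manual character loop, the results are collected as Units objects instead of two parallel ArrayLists, and the UNITS table (including the "seconds" row and the choice of "meters" as the first entry, to match the comment in the question's class) is made up for illustration. Integer.parseInt will throw on numbers too large for an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Units {
    public String name;   // canonical unit name, e.g. "meters"
    public int howMuch;   // numeric amount, e.g. 10 or 55

    Units(String name, int howMuch) {
        this.name = name;
        this.howMuch = howMuch;
    }
}

class UnitParser {
    // Hypothetical unit table: the first entry of each row is the canonical
    // name, the remaining entries are other accepted spellings.
    private static final String[][] UNITS = {
        {"meters", "meter", "m"},
        {"seconds", "second", "s"}
    };

    // A run of digits, optional whitespace, then a run of lowercase letters.
    // This covers both "10 meters" and "55m".
    private static final Pattern AMOUNT = Pattern.compile("(\\d+)\\s*([a-z]+)");

    // Returns the canonical name for a spelling, or null if it is unknown.
    static String canonicalUnit(String word) {
        for (String[] row : UNITS) {
            for (String alias : row) {
                if (alias.equals(word)) {
                    return row[0];
                }
            }
        }
        return null;
    }

    // Finds every number+unit pair in the text whose unit is in the table.
    static List<Units> parse(String text) {
        List<Units> result = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text.toLowerCase());
        while (m.find()) {
            String unit = canonicalUnit(m.group(2));
            if (unit != null) {
                result.add(new Units(unit, Integer.parseInt(m.group(1))));
            }
        }
        return result;
    }
}
```

Calling UnitParser.parse("10 meters and 55m") should give two Units objects, both named "meters". Numbers followed by a word that is not in the table are skipped.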

Related

Match mixed list elements with scanner

Let's say I have a mixed list (String and integer data types) hundreds of lines long.
i.e.
Lines=
<Thanks For 44 55>
<Helping Me 43 66>
etc...
I want to use the Scanner class to match the two strings (and extract the corresponding numbers as well).
How can I write an isolate method to do this?
The goal is to feed the two corresponding integer values into a separate calculate method.
Here's what I have.
private List<String> lines = new ArrayList<String>();
public void isolate( String s, String l){
if(String s = this.line.matches(".s*") && String l=this.line.matches(".l*")){
lineNew = lines[s][l];
//// extract the integer values (index 2, index 3) from linesNew here }
}
You've stated in comments:
Updated the specifics above, it's a list with integer & string datatypes.
And I beg to differ, that no, it's a List of String, period. The String might hold representations of ints and sub strings, but they're all held within a String. The key to a decent solution is to not do this, not use String to represent something which logically could be represented in a much sounder fashion. So,...
Create a custom class, one with two private String and two private int fields,
with getters and setters, constructors,...
Make it implement Comparable<...> if one or both of the numeric fields represents its "natural" order.
When you read in your line with the Scanner, parse the line into the constituent field types of this custom class of yours, and create an object with it.
Then place it into your List<MyCustomClass>.
Do this and creating your methods becomes trivial.
Consider using a 2nd inner Scanner, one for the line, and that helps you parse the line into two Strings and two ints. Be sure to close this inner Scanner with each iteration of the loop so as not to waste resources.
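A minimal sketch of that suggestion, assuming every line has exactly the shape "word word int int". The class names Entry and EntryReader, the field names, and the find helper are all invented for illustration; Comparable is left out to keep it short, and a malformed line would make the inner Scanner throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical custom class for one line such as "Thanks For 44 55".
class Entry {
    private final String first;
    private final String second;
    private final int a;
    private final int b;

    Entry(String first, String second, int a, int b) {
        this.first = first;
        this.second = second;
        this.a = a;
        this.b = b;
    }

    String getFirst() { return first; }
    String getSecond() { return second; }
    int getA() { return a; }
    int getB() { return b; }
}

class EntryReader {
    // Parses the whole input, one Entry per non-empty line.
    static List<Entry> read(String input) {
        List<Entry> entries = new ArrayList<>();
        try (Scanner lines = new Scanner(input)) {
            while (lines.hasNextLine()) {
                String line = lines.nextLine().trim();
                if (line.isEmpty()) {
                    continue;
                }
                // Inner Scanner for the fields of one line; try-with-resources
                // closes it on every iteration.
                try (Scanner fields = new Scanner(line)) {
                    entries.add(new Entry(fields.next(), fields.next(),
                                          fields.nextInt(), fields.nextInt()));
                }
            }
        }
        return entries;
    }

    // Returns the first entry whose two strings match, or null if none does.
    static Entry find(List<Entry> entries, String first, String second) {
        for (Entry e : entries) {
            if (e.getFirst().equals(first) && e.getSecond().equals(second)) {
                return e;
            }
        }
        return null;
    }
}
```

With the list built this way, an isolate method reduces to a call to find, and the two int values can be passed straight to the calculate method through getA() and getB().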

Efficiently checking for substrings and replacing them - can I improve performance here?

I need to examine millions of strings for abbreviations and replace them with the full version. Due to the data, only abbreviations terminated by a comma should be replaced. Strings can contain multiple abbreviations.
I have a lookup table that contains Abbreviation->Fullversion pairs, it contains about 600 pairs.
My current setup looks something like this. On startup I create a list of ShortForm instances from a CSV file using Jackson and hold them in a singleton:
public static class ShortForm{
public String fullword;
public String abbreviation;
}
List<ShortForm> shortForms = new ArrayList<ShortForm>();
//csv code omitted
And some code that uses the list
for (ShortForm f: shortForms){
if (address.contains(f.abbreviation+","))
address = address.replace(f.abbreviation+",", f.fullword+",");
}
Now this works, but it's slow. Is there a way I can speed it up? The first step is to load the ShortForm objects with commas in place, but what else could I do?
====== UPDATE
Changed code to work the other way around. Splits strings into words and checks a set to see if the string is an abbreviation.
StringBuilder fullFormed = new StringBuilder();
for (String s: Splitter.on(" ").split(add)){
if (shortFormMap.containsKey(s))
fullFormed.append(shortFormMap.get(s));
else
fullFormed.append(s);
fullFormed.append(" ");
}
return fullFormed.toString().trim();
Testing shows this to be over 13x faster than the original approach. Cheers davecom!
It would already be a bit faster if you skip contains() part :)
What could really improve performance would be to use a better data structure than a simple array for storing your ShortForms. All of the shortForms could be stored sorted alphabetically by abbreviation. You could therefore reduce the lookup time from O(N) to something looking more like a binary search.
I haven't used it before, but perhaps the standard library's SortedMap fits the bill instead of using a custom object at all:
http://docs.oracle.com/javase/7/docs/api/java/util/SortedMap.html
Here's what I'm thinking:
Put abbreviation/full word pairs into TreeMap
Tokenize the address into words.
Check each word to see if it is a key in the TreeMap
Replace it if it is
Put the corrected tokens back together as an address
I think I'd do this with a HashMap. The key would be the abbreviation and the value would be the full term. Then just search through a string for a comma and see if the text that precedes the comma is in the dictionary. You could probably map all the replacements in a single string in one pass and then make all the replacements after that.
This makes each lookup O(1) for a total of O(n) lookups where n is the number of abbreviations found and I don't think there's likely a more efficient method.
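Here is a rough sketch of the single-pass HashMap idea. It assumes an abbreviation is a single word, delimited by a space or the start of the string on the left and by a comma on the right. That is stricter than the original replace() call, which also matches inside longer words. The class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

class AbbreviationExpander {
    // Walks the address once. Every word that ends at a comma is looked up
    // in the map and replaced if present; all other text is copied as-is.
    static String expand(String address, Map<String, String> shortForms) {
        StringBuilder out = new StringBuilder(address.length() + 16);
        int wordStart = 0; // index where the current word begins
        for (int i = 0; i < address.length(); i++) {
            char c = address.charAt(i);
            if (c == ',') {
                String word = address.substring(wordStart, i);
                String full = shortForms.get(word);
                out.append(full != null ? full : word).append(',');
                wordStart = i + 1;
            } else if (c == ' ') {
                // A word followed by a space is never replaced.
                out.append(address, wordStart, i + 1);
                wordStart = i + 1;
            }
        }
        // Copy whatever follows the last space or comma.
        out.append(address, wordStart, address.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> shortForms = new HashMap<>();
        shortForms.put("St", "Street");
        shortForms.put("Rd", "Road");
        System.out.println(expand("12 Main St, Springfield Rd, IL", shortForms));
    }
}
```

Each character is visited once and each comma triggers one hash lookup, so the cost no longer depends on the roughly 600 entries in the table.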

How do I make an array from inputted information (i.e. names) and then use it as objects within the code?

I've been reading up on it, but every question I've found has asked for slightly different things, such as only wanting a single letter for their array, or in a different language (I'm new and only learning java at the moment), so here I am.
I want to set up an array that uses the user's input for their names.
What I have so far is this, I'm assuming this is the declaration line, where later I use an input line to define a value within the array (which I also am unsure how to do)
String[] array = {"name"};
But I don't know how to for example print.out the object or keep up with which name will be what value. I appreciate your time taken to teach me!
EDIT for further clarification. I'm trying to write up a small app that asks the user for numerous names, addresses, and phone numbers (Type name -> Type name's address -> type name's phone number, ask if they want to add another person, if yes then go back to asking for another name)
I am unsure how to set up a String array or how to use it throughout. However, thanks to your input and coming back after some fresh air, I have a better idea how to word it for google. Thank you guys for your help, even if it was just to gesture a better articulated question.
An array is a sequence of values. You have created an array of Strings that is one String long. To access the value at a specific index of an array, use array subscript notation: the name of the array followed by a pair of square brackets ([]) with the index in between them.
String[] anArrayOfStrings = {"string0", "string1", "string2"};
String first = anArrayOfStrings[0]; //read the first element
System.out.println(anArrayOfStrings[1]); //print the second element
anArrayOfStrings[2] = "new string value"; //assign the third element a new value
if (anArrayOfStrings[0].equals("string0")) //evaluate the first element and call a method
{
//this block will execute if anArrayOfStrings[0] is "string0"
}
String fourth = anArrayOfStrings[3]; //error: throws ArrayIndexOutOfBoundsException, valid indices are 0 to 2
Simply declaring the array would be
String[] names;
In your code you both declare and assign it in the same line by using an initializer list.
To assign individual elements, use the [] notation. Note that once you have initialized your array to be only one String long, it cannot become longer than that unless you assign a new array to the variable. To declare an array of a given size, you can use:
String[] arrayWithInitialSize = new String[5]; //holds five strings, each null to begin with
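Since the number of names is not known in advance, one option (a swap from the plain array discussed above) is to collect the input in an ArrayList and convert it to an array at the end. The sketch below assumes one name per line and treats an empty line as "done"; the class name NameCollector and the prompt text are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NameCollector {
    // Reads one name per line until an empty line or the end of the input.
    static String[] readNames(Scanner in) {
        List<String> names = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                break;
            }
            names.add(line);
        }
        // Convert the growable list into a fixed-size array.
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        System.out.println("Enter names, one per line (empty line to finish):");
        String[] names = readNames(in);
        // The index tells you which name is which value.
        for (int i = 0; i < names.length; i++) {
            System.out.println(i + ": " + names[i]);
        }
    }
}
```

For names plus addresses and phone numbers, the same loop could fill a list of small objects with one field per piece of information.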

Spell checker solution in java

I need to implement a spell checker in Java. As an example, for the string "sch aproblm iseasili solved" my output should be "such a problem is easily solved". The maximum length of the string to correct is 64. As you can see, my string can have spaces inserted in the wrong places or missing altogether, and it can contain misspelled words. I need a little help finding an efficient algorithm for coming up with the corrected string. I am currently deleting all spaces in my string and inserting spaces in every possible position. For the word "hot" (the same applies to a sentence), I generate the following possible strings, which are afterwards corrected word by word using Levenshtein distance: h o t; h ot; ho t; hot. As you can see, I have generated 2^(string.length() - 1) possible strings. So a string of length 64 will generate 2^63 possible strings, which is damn high, and afterwards I need to process them one by one and select the best one by the following criteria, in order:
- total edit distance (must take the smallest one)
- if several strings have the same edit distance, I have to choose the one with the fewest words
- if several strings have the same number of words, I need to choose the one whose words have the maximum total frequency (I have a dictionary of the 8000 most frequent words along with their frequencies)
- and finally, if several strings have the same total frequency, I have to take the lexicographically smallest one.
So basically I generate all possible strings (inserting spaces in all possible positions in the original string), then one by one I calculate their total edit distance, number of words, etc., choose the best one, and output the corrected string. I want to know if there is a more efficient way of doing this, such as not having to generate all possible combinations of strings.
EDIT: So I thought I should take another approach. Here is what I have in mind: I take the first letter of my string and extract from the dictionary all the words that begin with that letter. After that I process all of them and extract from my string all possible first words. Sticking with my previous example, for the word "hot", generating all possible combinations gave 4 results, but with my new algorithm I obtain only 2, "hot" and "ho", so it's already an improvement. I still need a little help creating a recursive or DP (dynamic programming) algorithm for this. I need a way to store all possible strings for the first word, then for each of those all possible strings for the second word, and so on, and finally to concatenate all possibilities and add them to an array or something. There will still be a lot of combinations for large strings, but not as many as having to do ALL of them. Can someone help me with pseudocode or something, as this is not my strong suit?
EDIT2: Here is the code where I generate all the possible first words from my string (http://pastebin.com/d5AtZcth). I need to somehow extend this to do the same for the rest, combine each first word with each second word and so on, and store all these concatenations in an array or something.
A few tips for you:
try correcting just small parts of the string, not everything at once.
90% of errors (IIRC) are within edit distance 1 of the intended word.
you can use a phonetic index to match words against words that sound alike.
you can assume most typos are QWERTY errors (j=>k, h=>g), and try to check them first.
A few more ideas can be found in this nice article:
http://norvig.com/spell-correct.html
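To make the edit-distance tip concrete, here is a small sketch of the standard Levenshtein distance and a naive nearest-word lookup. It is not the full sentence corrector from the question: it ignores word frequencies and the tie-breaking rules, and it scans the whole dictionary for every query. The class name and the tiny word list in the usage note are made up.

```java
import java.util.List;

class EditDistance {
    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the dictionary word with the smallest distance to the input.
    // On ties the earliest word in the list wins; null for an empty list.
    static String closest(String word, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int d = distance(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

For example, EditDistance.distance("problm", "problem") is 1, so "problem" is the closest match for "problm" in a list such as such, a, problem, is, easily, solved.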

Help me understand question related to HashMap in Java

I've been given a task that I find a little confusing. Here is the question statement:
The following program should read a file and store all its tokens in a member variable.
Your task is to write a single method that returns the number of items in tokenMap, the average length (as double value) of the elements in tokenMap, and the number of tokens starting with character "a".
Here the tokenMap is an object of type HashMap<String, Integer>;
I do have some idea about HashMap, but what I want to know is whether the key for the HashMap should be a single character or the whole word that I store in tokenMap.
Also, how can I compute the average length?
Looks like you have to use the entire word as the key.
The average length of tokens can be computed by summing the lengths of each token and dividing by the number of tokens.
In Java, you can find the number of tokens in the HashMap by tokenMap.size().
You can write loops that visit each member of the map like this:
for(String t: tokenMap.keySet()){
//t is a token (the keys are the tokens; values() would give the Integer values instead)
}
and if you look up String in the Java API docs you will see that it is easy to find the length of a String.
To compute the average length of the items in a hash map, you'll have to iterate over them all and count the length and calculate the average.
As for your other question about what to use for a key, how are we supposed to know? A hashmap can use practically any* value for a key.
*The value must be hashable, which is defined differently for different languages.
Reading the question closely, it seems that you have to read a file, extract each word and use it as the key value, and store the length of each key as the integer:
an example line
leads to a HashMap like this
an : 2
example : 7
line : 4
After you've built your map (made of keys mapping to entries, or seemingly elements in the question), you'll need to run some statistics over it to find
the number of keys (look at HashMap)
the average length of all keys (again, simple enough)
the number beginning with "a" (just look at the String)
Then make a value object containing these values and return it from the method that does the statistics.
I know I've given more information than you require, but someone else may benefit from a little extra help.
Guys, there is some confusion. I'm not asking for a solution. I'm just confused about one thing.
For the time being, I'm going to use String as the key type.
The only confusion I have is: once I read the file line by line, should I split it into words or into single characters? That is, should the key be a single-character String or a String holding a whole word?
If you can go through the question statement, what do you suggest? That's all I'm asking.
should i split it based upon words or
based upon each character
The requirement is to make tokens, so you should split them based on words. Each word becomes a unique String key. It would make sense for the value to be the count of each token.
If the file you are reading has these three lines:
int alpha;
int beta;
float delta;
Then you should have something like
<"int", 2>
<";", 3>
<"alpha", 1>
<"beta", 1>
<"float", 1>
<"delta", 1>
(The semicolon may or may not be considered a token.)
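Counting each distinct key once, your average length would be (3 + 1 + 5 + 4 + 5 + 5) / 6 = 23 / 6, or about 3.83. If you instead weight every token by how often it occurs, it would be (3x2 + 1x3 + 5 + 4 + 5 + 5) / 9 = 28 / 9, or about 3.11.
Your number of tokens starting with "a" would be 1 ("alpha", which has length 5).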
Look elsewhere on this forum for keySet and you should be good to go.
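Putting the pieces together, a single method returning all three values might look like the sketch below. It assumes the average is taken over the distinct keys of tokenMap (each key counted once, ignoring the Integer values); the class name TokenStats and its field names are made up, since the task statement does not name them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value object holding the three requested statistics.
class TokenStats {
    final int count;            // number of items in tokenMap
    final double averageLength; // average length of the keys
    final int startingWithA;    // number of keys starting with "a"

    TokenStats(int count, double averageLength, int startingWithA) {
        this.count = count;
        this.averageLength = averageLength;
        this.startingWithA = startingWithA;
    }

    // Computes all three statistics in one pass over the keys.
    static TokenStats of(Map<String, Integer> tokenMap) {
        int totalLength = 0;
        int startingWithA = 0;
        for (String token : tokenMap.keySet()) {
            totalLength += token.length();
            if (token.startsWith("a")) {
                startingWithA++;
            }
        }
        int count = tokenMap.size();
        // Cast before dividing to avoid integer division; 0.0 for an empty map.
        double average = count == 0 ? 0.0 : (double) totalLength / count;
        return new TokenStats(count, average, startingWithA);
    }

    public static void main(String[] args) {
        Map<String, Integer> tokenMap = new HashMap<>();
        tokenMap.put("int", 2);
        tokenMap.put(";", 3);
        tokenMap.put("alpha", 1);
        tokenMap.put("beta", 1);
        tokenMap.put("float", 1);
        tokenMap.put("delta", 1);
        TokenStats stats = TokenStats.of(tokenMap);
        System.out.println(stats.count + " " + stats.averageLength + " " + stats.startingWithA);
    }
}
```

Note that startsWith("a") is case-sensitive, so a token such as "Alpha" would not be counted unless the tokens are lowercased first.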
