Text processing in Java - java

This is a tricky problem for which I can't figure out a good solution. Suppose we have a String in Java: "He ate 3 apples today." The digit 3 can easily be identified using a check such as Character.isDigit or a regular expression. But what if I have a String like "He ate three apples today."? How can I identify that three is actually a number? I used OpenNlp and its POS tagger, but it takes far too much time! Can anyone suggest a better solution for this? Also, among the ".bin" files of OpenNlp there is one called "num.bin", but I don't know how to use it, and the OpenNlp documentation says nothing about it. Can anyone tell me if this is exactly what I've been looking for, and if so, how to use it?
/*********************************************************************************************************************************/
I'm actually short of time here, so I've settled on a temporary solution here. Make a file/dictionary and take all the entries in a hashtable. Then I'll tokenize my sentence and check word by word for numbers, similar to what you guys suggested. I'll keep on updating the file as and when required. Thanks for your valuable suggestions guys, and if you have got something better than this I'd be really glad. OpenNlp implements this in a very good way, the only problem with it is time complexity and I want to do this in minimum time possible.

Create a dictionary of numbers. Search for elements from that dictionary in the text.
Check the asymptotic complexity; it may be cheaper to sort the text first.

You have to keep all those words in arrays and then use them. Here is an example of how to convert a number to a string; it may help you. I think you have to split your text into words and check whether a word is a number (three). If it is, check the next word, because it could be, say, "million", then check the word after that, and so on. It's not easy and amounts to a small library; I think you'll spend a lot of time writing it. Or try searching Google for a library like this. Maybe someone has already had this problem, written a library and shared it for free. Good luck.
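The dictionary approach suggested above could look roughly like the sketch below. The class name NumberWordFinder and the word list are assumptions made for illustration; they are not part of OpenNlp, and the list is deliberately incomplete. Note that a plain lookup also flags "one" in "no one came", so it is only a first approximation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: name and word list are illustrative only.
final class NumberWordFinder {

    // A small, incomplete dictionary of English number words.
    private static final Set<String> NUMBER_WORDS = new HashSet<>(Arrays.asList(
            "zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "twenty", "thirty",
            "hundred", "thousand", "million"));

    // Returns every token that is a digit sequence or a known number word.
    static List<String> findNumbers(String sentence) {
        List<String> found = new ArrayList<>();
        // Split on anything that is not an ASCII letter or digit, dropping punctuation.
        for (String token : sentence.split("[^A-Za-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // happens when the sentence starts with a delimiter
            }
            String lower = token.toLowerCase(Locale.ROOT);
            boolean allDigits = lower.chars().allMatch(Character::isDigit);
            if (allDigits || NUMBER_WORDS.contains(lower)) {
                found.add(token);
            }
        }
        return found;
    }
}
```

Each lookup in the HashSet takes constant time on average, so the whole pass is linear in the length of the sentence.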

Related

String Name; Method 1: At least 2 words, max 4 words, no special signs and only letters

I've been searching around and haven't quite found my answer.
At the moment my group and I have created a few classes resembling a Bank, with Customer, Account and so on.
I've been struggling lately with trying to improve and secure our code by making our variable called "name" accept only certain inputs.
In this case, I want to make it possible for the person to enter a name only as follows:
At least 2 words = (For the word part I've seen code where you count the white space between words, but I don't know yet what to do about the last word, since there won't be any white space after it)
Max 4 words = (Same thing here)
No special signs such as ,!%¤"#()=%/'¨. = (for this, I've read something about "Matcher and Pattern")
Now, I'm quite new to Java and I'm not asking for code from someone; I'm asking for someone to point me in the right direction, because a lot of what I've seen, like Matcher and Pattern, are things that you import from utility packages, and I reckon that's not needed and there should be a simpler, more basic way. I'm not trying to get ahead of myself by copying code just to get it done.
So yeah, the String "name" is used a lot in our main class "Banklogic", where almost every method that adds something has the variable "name" in it, so it's quite important that I get this done.
I hope I was clear enough, and any help would be appreciated! I'm going to set the alarm for 3 hours before school to see what you guys have come up with, so I can try to complete the code before our meeting! Thanks a lot in advance :)
Since you asked for hints: you can use a regex to enforce such rules.
For digits (and whitespace) only:
if (string.matches("[0-9\\s]+"))
// allow the data to be inserted, otherwise reject it
As for the rules related to word count:
string.trim().split("\\s+") will create an array of the words separated by whitespace. You can count the number of elements in this array and allow or disallow the input based on that.
As for no signs and only letters:
if (string.matches("[a-zA-Z\\s]+"))
// allow the input, otherwise reject it
You can use a DocumentFilter to implement these checks. A document filter only lets text be entered if you allow it.
I hope this helped as a hint.
Also, note that \\s matches whitespace (\\W, by contrast, matches any non-word character, including punctuation). If you don't want to allow whitespace, remove it from the character class.
This is a simple and effective way of doing the task.
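Putting the hints together, a minimal sketch of the name rule (two to four words, letters only) might look like this. The class NameValidator and its method names are made up for illustration, and the sketch assumes ASCII letters and single spaces between words. The second method does the same check without regex, for anyone who prefers the more basic route.

```java
// Hypothetical helper: the names here are illustrative, not from the question's code.
final class NameValidator {

    // Two to four words of ASCII letters, separated by single spaces.
    private static final String NAME_PATTERN = "[A-Za-z]+( [A-Za-z]+){1,3}";

    // Regex version: matches() must cover the whole string.
    static boolean isValidName(String name) {
        return name != null && name.matches(NAME_PATTERN);
    }

    // Same rule without regex: count the words, then check every character.
    static boolean isValidNameManual(String name) {
        if (name == null) {
            return false;
        }
        // The -1 limit keeps trailing empty strings, so a trailing space is detected.
        String[] words = name.split(" ", -1);
        if (words.length < 2 || words.length > 4) {
            return false;
        }
        for (String word : words) {
            if (word.isEmpty()) {
                return false; // leading, trailing or doubled space
            }
            for (char c : word.toCharArray()) {
                boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
                if (!asciiLetter) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Splitting on the space also settles the "last word" worry from the question: the last word is simply the final element of the array, so no trailing white space is needed.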
EDIT:
This is a Class I wrote a little while ago to achieve such tasks. Just in case if you are interested....

Obtaining the Subject of a String in Java

Suppose I tell my Java program to find the subject of the sentence:
I enjoy spending time with my family.
My program should output:
Tell me more about your family.
How would it go about doing this?
EDIT
I could do this by having an array of String and have that filled with every noun in the English dictionary, but is there a simpler way?
This is way too open-ended a question. But a good place to start would be to learn about Natural Language Processing concepts and then look at using a framework like CoreNLP. It breaks down sentences into a parse tree and you can use this to identify parts of speech and things like the subject of a sentence. This is probably your best bet if you want a reasonably-reliable method.

How to check if two Strings are approximately equal?

I'm making a chat responder for a game and I want to know if there is a way to compare two strings and see if they are approximately equal to each other, for example:
if someone typed:
"Strength level?"
it would do a function..
then if someone else typed:
"Str level?"
it would do that same function. But I also want it so that if someone made a typo or something like that, it would automatically detect what they're trying to type, for example:
"Strength tlevel?"
would also make the function get called.
Is what I'm asking here something simple, or will it require me to write a big, irritating function to check the Strings?
If you've been baffled by my explanation (not really one of my strong points), then this is basically what I'm asking:
How can I check if two strings are similar to each other?
See this question and answer: Getting the closest string match
Using some heuristics and the Levenshtein distance algorithm, you can compute the similarity of two strings and take a guess at whether they're equal.
Your only option other than that would be a dictionary of accepted words similar to the one you're looking for.
You can use Levenshtein distance.
I believe you should use one of the edit distance algorithms to solve your problem. Here is, for example, a Levenshtein distance implementation in Java. You can use it to compare the words in the sentences and, if the sum of their edit distances is less than, say, 10% of the sentence length, consider them equal.
Perhaps what you need is a large dictionary of similar words and common spelling mistakes, which you would use to "translate" each word to one single entry or key.
This would be useful for custom words, so you could add "str" under the same key as "strength".
However, you could also add a few automated methods: when your word isn't found in the dictionary, search recursively for entries that differ by 1 letter (either missing or replaced), and then recurse into deeper levels, i.e. 2 missing letters, etc.
I found a few projects that do text to phonemes translations, don't know which one is best
http://mary.dfki.de/
http://www2.eng.cam.ac.uk/~tpl/asp/source/Phoneme.java
http://java.dzone.com/announcements/announcing-phonemic-10
If you want to find similar word beginnings, you can use a stemmer. Stemmers reduce words to a common beginning. The best-known algorithm is the Porter Stemmer (http://tartarus.org/~martin/PorterStemmer).
Levenshtein, as pointed out above, is great, but computationally heavy for distances greater than one or two.
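For reference, the Levenshtein distance mentioned in several answers can be sketched as below. This is a plain two-row dynamic-programming version; the class name FuzzyMatch and the maxDistance threshold are assumptions for illustration, and a sensible threshold depends on the length of your phrases.

```java
import java.util.Locale;

// Hypothetical helper: class name and threshold parameter are illustrative.
final class FuzzyMatch {

    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating new ones
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive comparison against a caller-chosen threshold.
    static boolean approximatelyEqual(String a, String b, int maxDistance) {
        return levenshtein(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT)) <= maxDistance;
    }
}
```

With a threshold of 2, "Strength tlevel?" matches "Strength level?" (distance 1), but "Str level?" does not (distance 5), which is why the abbreviation case is better handled by a dictionary of aliases as suggested above.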

Concatenating RowFilter orFilters with andFilter in Java

None of the existing questions on this topic seem to answer the particular question I have.
My problem is this. I have a list of search terms, and for each term I find the edit distance to find possible misspelling of a word.
So for each word separated by a space, I have possible words each word could be.
For example: searching for green chilli might give us "fuzzy" words "green, greene and grain" and "chilli, chill and chilly".
Now I want the RowFilter to search for: "green OR greene OR grain" AND "chilli OR chill OR chilly".
I can't seem to find a way to do this in Java. I've looked all over the place but nothing talks about concatenating the OR and AND filters together in one RowFilter.
Would I have to roll my own solution based on the model? I suppose I can do this, but my method would most probably be naive at first and slow.
Any pointers as to how to roll my own solution for this or better yet, what's the Java way to do this right?
RowFilter.orFilter() and RowFilter.andFilter() seem apropos; each includes examples, and each accepts an arbitrary number of arguments.
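A rough sketch of how the two could be nested for the example in the question: one orFilter per search term, all wrapped in a single andFilter. The class FuzzyRowFilters and the method build are hypothetical names; the sketch assumes case-insensitive substring matching across all columns, which may not be what your table needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.swing.RowFilter;

// Hypothetical helper: class and method names are illustrative.
final class FuzzyRowFilters {

    // Each inner list holds the fuzzy alternatives for one search term.
    // A row is included when every term has at least one alternative
    // that occurs somewhere in the row.
    static RowFilter<Object, Object> build(List<List<String>> alternativesPerTerm) {
        List<RowFilter<Object, Object>> andParts = new ArrayList<>();
        for (List<String> alternatives : alternativesPerTerm) {
            List<RowFilter<Object, Object>> orParts = new ArrayList<>();
            for (String word : alternatives) {
                // Pattern.quote keeps user input from being read as regex syntax;
                // (?i) makes the match case-insensitive.
                orParts.add(RowFilter.regexFilter("(?i)" + Pattern.quote(word)));
            }
            andParts.add(RowFilter.orFilter(orParts));
        }
        return RowFilter.andFilter(andParts);
    }
}
```

The result can be passed to a TableRowSorter via setRowFilter. For the example above, the argument would be the two lists ["green", "greene", "grain"] and ["chilli", "chill", "chilly"].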

Is it possible to automate generation of wrong choices from a correct word?

The following list contains 1 correct word, "disastrous", and other incorrect words which sound like the correct word:
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate generation of wrong choices given a correct word, through some kind of java dictionary API?
No, there is nothing for this in the Java API. You can write a simple algorithm which will do the job.
Just make up some rules about permuting and doubling letters, and add the generated words to a Set until you have enough words.
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.
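The rule-based idea from the first answer could be sketched as follows. The three rules used here (swap a vowel for another vowel, drop a vowel that sits next to another vowel, double a consonant) are assumptions chosen for illustration, and the class name Distractors is hypothetical; you would tune the rules to get plausible-sounding results.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: class name and rules are illustrative.
final class Distractors {

    private static final String VOWELS = "aeiou";

    private static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    // Generates up to 'limit' misspellings of a lower-case word.
    static List<String> generate(String word, int limit) {
        // LinkedHashSet removes duplicates while keeping a stable order.
        Set<String> candidates = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            String head = word.substring(0, i);
            String tail = word.substring(i + 1);
            if (isVowel(c)) {
                // Rule 1: replace the vowel with each other vowel.
                for (char v : VOWELS.toCharArray()) {
                    if (v != c) {
                        candidates.add(head + v + tail);
                    }
                }
                // Rule 2: drop the vowel when it sits next to another vowel.
                boolean vowelBefore = i > 0 && isVowel(word.charAt(i - 1));
                boolean vowelAfter = i + 1 < word.length() && isVowel(word.charAt(i + 1));
                if (vowelBefore || vowelAfter) {
                    candidates.add(head + tail);
                }
            } else {
                // Rule 3: double the consonant.
                candidates.add(head + c + c + tail);
            }
        }
        candidates.remove(word); // never offer the correct spelling as a wrong choice
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (result.size() >= limit) {
                break;
            }
            result.add(candidate);
        }
        return result;
    }
}
```

For "disastrous", these rules produce "disastrus", "disasstrous" and "desastrous" among others. Combined variants such as "desastrus" would need the rules to be applied twice.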
