Normalizing a String by adding appropriate spacing - Java

This is probably a very broad question for Stack Overflow, but here it goes.
I'm trying to normalize words within a sentence, for example:
INPUT:
I developGeographicallydispersed teams through good ASDWEQ.
OUTPUT (notice the added spaces in "develop Geographically dispersed"):
I develop Geographically dispersed teams through good ASDWEQ.
Since using an external API is out of the question (e.g. using a Google API), I need to design an in-house Java API.
The obvious and naive solution would be something like this:

for each word in the sentence:
    if word is in the dictionary: ignore
    else if word is reducible to a set of dictionary words: split
    else: ignore
So before I start with such an approach, my question is: is there a better way of doing this? For example, an open-source library, or even a different approach?
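The naive approach above can be sketched as a small dynamic-programming word segmenter. This is only a sketch: the dictionary here is a tiny hard-coded set for illustration, and a real implementation would load a full word list.

```java
import java.util.*;

public class WordSegmenter {
    // Tiny illustrative dictionary; a real system would load a complete word list.
    private static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "develop", "geographically", "dispersed", "teams", "i"));

    // Splits a run of letters into dictionary words, if possible.
    // Returns null when no segmentation exists (e.g. "ASDWEQ" -> ignore).
    public static List<String> segment(String token) {
        String lower = token.toLowerCase();
        int n = lower.length();
        // prev[i] = start index j of the last word in a valid segmentation
        // of lower[0..i); -1 means lower[0..i) is not segmentable.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                if (prev[j] != -1 && DICT.contains(lower.substring(j, i))) {
                    prev[i] = j;
                    break;
                }
            }
        }
        if (prev[n] == -1) return null;
        // Walk back through the split points, keeping the original casing.
        LinkedList<String> words = new LinkedList<>();
        for (int i = n; i > 0; i = prev[i]) {
            words.addFirst(token.substring(prev[i], i));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("developGeographicallydispersed"));
    }
}
```

Running this splits "developGeographicallydispersed" into "develop Geographically dispersed", while unknown tokens like "ASDWEQ" come back null and can be left untouched, matching the pseudocode above.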

Did you have a look at Flex and Bison? They help you create a scanner and define your patterns for text processing; you should be able to find a way to map your parser to an existing dictionary in your case.

Related

How to select strings with the most keywords matches?

I'm trying to select the top 3 strings which contain the most keyword matches.
I'll explain it like this: assume that we have the following keywords: "pc, programming, php, java"
and the following sentences:
a[0]="what is java??"
a[1]="I love playing and programming on pc"
a[2]="I'm good at programming php and java"
a[3]="I'm programming php and java on my pc"
Only the last 3 strings should be selected, because they are the top 3 strings containing the most matches.
How can I do this in Java?
If your dataset is small and you only care about exact matches, you could do something like the following:
Loop over each of your sentences, performing an indexOf check for each keyword. If the check returns something other than -1, increment a counter for that sentence. Repeat for each keyword. At the end, take the 3 sentences with the highest counters.
This approach will have all kinds of issues, though, including:
Case sensitivity
Keywords matching partial words, e.g. "java" matching "javascript"
Ideally you would use a full-text engine like Lucene/Solr/ElasticSearch and let it do all the heavy lifting for you.
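The counting loop described above could be sketched like this; it is a minimal, case-sensitive version with exactly the issues just listed, using the sentences and keywords from the question (and assuming the sentences are distinct):

```java
import java.util.*;

public class TopMatches {
    // Counts exact indexOf hits per sentence and returns the k sentences
    // with the highest counts, best first. Assumes sentences are distinct.
    public static List<String> topMatches(String[] sentences, String[] keywords, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : sentences) {
            int c = 0;
            for (String kw : keywords) {
                if (s.indexOf(kw) != -1) c++; // exact, case-sensitive match
            }
            counts.put(s, c);
        }
        List<String> result = new ArrayList<>(Arrays.asList(sentences));
        result.sort((a, b) -> counts.get(b) - counts.get(a)); // highest count first
        return result.subList(0, Math.min(k, result.size()));
    }

    public static void main(String[] args) {
        String[] keywords = {"pc", "programming", "php", "java"};
        String[] sentences = {
            "what is java??",
            "I love playing and programming on pc",
            "I'm good at programming php and java",
            "I'm programming php and java on my pc"
        };
        System.out.println(topMatches(sentences, keywords, 3));
    }
}
```

On the sample data this selects a[3], a[2], and a[1], as expected; but note that "pc" would also match inside a word like "pcb", which is the partial-word issue mentioned above.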
Arguably the easiest method would be to use regex, an expression-based system for searching for patterns within strings.
Pick a website which teaches regex. I suggest this one for starters:
http://regexone.com/
Afterwards, familiarize yourself with Java regex. I suggest looking into capture groups.
I will not give you code for this, because there are many online examples you can look at, and it is in your best interest to learn how to do this yourself.

Defining words using Java

I was wondering if there is an API in Java that can define words and find the origins of words. I remember a while back searching for this and seeing "Apache Commons", but I am not sure.
Basically, the user will be able to enter a word such as "overflow", and the program will be able to define it. So I am looking for an API that can define words and find the origins of words. The word "recherche", for example, would have the origin "French".
WordNet will give you half of what you are looking for: you can look up the definition for a word. Note that there are several implementations of WordNet for Java: jwi, jaws, Dan Bikel's, WordnetAPI. Some of these might be easier to use for your purpose than jwordnet suggested by miku (I have only used jaws and jwi).
Note: WordNet will not give you origins (AFAIK). I'm not aware of software that does.
Note: You will have to provide the lemma of a word to be able to look it up in the dictionary. This means that you will have to apply some Natural Language Processing (NLP) techniques if you want to do this automatically on a free-text document (which can contain inflected forms). If you go this route, I'd suggest the GATE project's Morph plugin.
WordNet, maybe? There is a Java wrapper for it: http://sourceforge.net/projects/jwordnet/
Another list of NLP toolkits:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
To detect a language:
http://www.jroller.com/melix/entry/nlp_in_java_a_language
There is a website for etymology: http://www.etymonline.com/
It gives this result:
recherche
1722, from Fr. recherché "carefully sought out," pp. of rechercher "to seek out." Commonly used 19c. of food, styles, etc., to denote obscure excellence.
I don't know whether they offer an API, but you could use some sort of script to query the site.
Then find a good way of detecting "Fr." in the entry above.
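A heuristic sketch of that "detect the abbreviation" step, assuming entries follow the "from Fr. ..." shape shown above; real etymonline entries vary in format, so treat this only as a starting point:

```java
import java.util.regex.*;

public class OriginDetector {
    // Extracts the language abbreviation that follows "from" in an
    // etymonline-style entry, e.g. "from Fr. recherché ..." -> "Fr.".
    // Returns null when no such pattern is present.
    public static String detectOrigin(String entry) {
        Matcher m = Pattern.compile("\\bfrom\\s+([A-Z][a-z]*\\.?)").matcher(entry);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(detectOrigin("1722, from Fr. recherché \"carefully sought out.\""));
    }
}
```

A mapping from abbreviations like "Fr." to full language names ("French") would still be needed on top of this.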
Cheers,
Erik
Have you looked at JWKTL?
"Wiktionary is a multilingual, web-based, freely available dictionary,
thesaurus and phrase book, designed as the lexical companion to
Wikipedia. Lately, it has been recognized as a promising lexical
semantic resource for natural language processing applications."
Using it, you can look up the etymology of words.

Source for iterating through all words of english dictionary

I need to iterate through all the words of an English dictionary and filter certain ones based on whether they are nouns, verbs, or anything else, and on certain other traits. Is there anything I could use as a source for these words?
Just wanted to mention, with regard to WordNet, that there are 'stop words' which are not included. Some people online have made lists of stop words, but I'm not sure how complete they are.
Some stop words are: 'the', 'that', 'I', 'to', 'from', 'whose'.
A larger list is here:
http://www.d.umn.edu/~tpederse/Group01/WordNet/wordnet-stoplist.html
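As a minimal illustration of applying such a stop list before dictionary lookups (the stop set here is just the sample words above, not a complete list like the one linked):

```java
import java.util.*;

public class StopWordFilter {
    // Tiny illustrative stop list; real lists are much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "that", "i", "to", "from", "whose"));

    // Removes stop words so only content words go on to, e.g., a WordNet lookup.
    public static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("the word that I chose"));
    }
}
```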
For a list of words see this sourceforge project:
http://wordlist.sourceforge.net/
You may also want to search for the use cases of such a list, in order to find a suitable data source. For instance:
Spell-checking algorithms use a word list (standalone spell checkers, word-processing apps like OpenOffice, etc.).
Word-game algorithms use word lists (Scrabble-type games, vocabulary education games, crossword-puzzle generators).
Password-cracking algorithms use word lists to help find weak passwords: outpost9.com/files/WordLists.html
Also, there are several Java APIs to choose from, and only some work with the latest dictionary (3.1). The one from MIT uses Java 5 and works with WordNet 3.1.
I recommend WordNet from princeton.edu. It is a popular English lexical database with word attributes such as:
Short definition
Part of speech, e.g. noun, verb, adjective, etc.
Synonyms and groupings
There is a WordNet Java API from smu.edu that will simplify using WordNet in your application. You might also download the database and parse it yourself, as it's only 12 MB compressed.

Partial match on a dictionary

I am working with GATE (a Java-based NLP framework) and want to find words that partially match a dictionary.
For example, I have a disease dictionary with terms such as:
Congestive cardiac failure
Congestive Heart Failure
Colon Cancer
...and thousands more.
Let's assume I have the string "Father had cardiac failure last year". From this string I want to identify "cardiac failure" as a partial match, because it occurs as part of a term in the dictionary.
I have seen some discussion on a similar subject in Python, JS, and C#, but I am not sure what could help in such a case here.
I wonder if I can utilize Aho-Corasick here.
The UIMA ConceptMapper annotator add-on includes functionality similar to what you are looking for. You may consider:
including UIMA inside GATE: http://gate.ac.uk/userguide/chap:uima
developing a similar component using the main ideas from the add-on
Maybe you should use Lucene. Treat each line of the dictionary as a document, and each sentence in the text as a query.
One question that arises is which substrings you want to include in the search. If you included all substrings, just "Heart" would also be a match, but that is not really a disease.
Maybe all right-aligned (word-)substrings (perhaps with length > 1) would be acceptable.
So one thing you could do is train the Aho-Corasick pattern matcher on the substrings you want to include. To keep the information about which dictionary term each substring came from, you would probably need to modify the algorithm a bit (if keeping that information is important) or build another data structure to look it up afterwards.
In any case, I would convert the disease list and the documents you want to search to lower case before training/matching. If there is a chance of misspellings, there are also papers on fuzzy Aho-Corasick automata.
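A brute-force sketch of the right-aligned-substring idea, without Aho-Corasick: index every multi-word suffix (plus the full term) of each dictionary entry in a map back to its term, lower-casing both sides. Aho-Corasick would replace the per-suffix contains() scan with a single pass over the text; this version only shows the data-structure idea and will be slow for thousands of terms.

```java
import java.util.*;

public class PartialDictMatcher {
    // Maps each indexed suffix to the dictionary term it came from.
    private final Map<String, String> suffixToTerm = new HashMap<>();

    public PartialDictMatcher(List<String> terms) {
        for (String term : terms) {
            String[] words = term.toLowerCase().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                // Keep the full term and multi-word suffixes; skip single
                // words like "failure", which are not diseases on their own.
                if (i == 0 || words.length - i > 1) {
                    String suffix = String.join(" ",
                            Arrays.copyOfRange(words, i, words.length));
                    suffixToTerm.putIfAbsent(suffix, term);
                }
            }
        }
    }

    // Returns matched suffix -> originating dictionary term pairs.
    public Map<String, String> match(String text) {
        Map<String, String> hits = new LinkedHashMap<>();
        String lower = text.toLowerCase();
        for (Map.Entry<String, String> e : suffixToTerm.entrySet()) {
            if (lower.contains(e.getKey())) {
                hits.put(e.getKey(), e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PartialDictMatcher m = new PartialDictMatcher(Arrays.asList(
                "Congestive cardiac failure", "Congestive Heart Failure", "Colon Cancer"));
        System.out.println(m.match("Father had cardiac failure last year"));
    }
}
```

On the example this finds "cardiac failure" and reports that it came from "Congestive cardiac failure", while the single word "failure" alone is not indexed and so does not match.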

Java Regex, capturing groups with comma separated values

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .
ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries
Generalized Pattern Tried:
".[\s]?(\w+?)"+ // bruises.
"(?:(\s)?,(\s)?(\w+?))*"+ // wounds marks dislocations
"[\s]?(?:or|and) other (\w+)."; // Injuries
The pattern should be able to match other input strings like: A soldier may have bruises or other injuries that hurt him.
On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries
There is something wrong with the capturing group "(?:(\s)?,(\s)?(\w+?))*". The group repeats several times, but it returns only "dislocations"; "wounds" and "marks" are devoured.
Could you please suggest what the right pattern should be, and where the mistake is?
This question comes closest to this one, but that solution didn't help.
Thanks.
When a capture group is annotated with a quantifier [e.g. (foo)*], you will only get the last match. If you want all of them, you need the quantifier inside the capture, and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here, for any number of reasons... even if you weren't ultimately doing NLP.
How to fix (?:(\s)?,(\s)?(\w+?))*:
The quantifier basically covers the whole regex in that case, so you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words, then that's something like \w+(?:\s*,\s*\w+)*. Then don't bother with capture groups and just split the whole match.
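Putting that together, a minimal sketch: step through matches of \w+(?:\s*,\s*\w+)* with Matcher.find() and split each comma run, with a separate pattern for the "... or other Injuries" tail from the question:

```java
import java.util.*;
import java.util.regex.*;

public class CommaListExtractor {
    // Instead of a repeated capture group (which retains only its last
    // repetition), match the whole comma-separated run and split it.
    public static List<String> extract(String input) {
        List<String> words = new ArrayList<>();
        Matcher runs = Pattern.compile("\\w+(?:\\s*,\\s*\\w+)*").matcher(input);
        while (runs.find()) {
            if (runs.group().contains(",")) { // keep only real comma runs
                words.addAll(Arrays.asList(runs.group().split("\\s*,\\s*")));
            }
        }
        // The "... or other Injuries" tail needs its own pattern.
        Matcher tail = Pattern.compile("(?:or|and) other (\\w+)").matcher(input);
        if (tail.find()) {
            words.add(tail.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him ."));
    }
}
```

On the question's input this yields bruises, wounds, marks, dislocations, and Injuries; but as noted below, anything more linguistically varied than this fixed sentence shape quickly outgrows regex.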
And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/
Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.
You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.
EDIT
PSpeed posted a promising link to a third-party library, GATE, that can do many language-processing tasks. It's written in Java. I have not used it myself, but looking at the people and institutions working on it, it seems pretty solid.
The pattern that works is \w+(?:\s*,\s*\w+)*; then split the comma-separated values manually. There is no other way to do this with Java regex.
Ideally, Java regex is not suitable for NLP. A useful tool for text mining is gate.ac.uk.
Thanks to Bart K. and PSpeed.
