Build a natural language model that fixes misspellings - java

What books are there about how to build a natural language parsing program like this:
input: I got to TALL you
output: I got to TELL you
input: Big RAT box
output: Big RED box
in: hoo un thum zend three
out: one thousand three
It must have a language model that allows it to predict which words are misspelled!
What are the best books on how to build such a tool?
P.S. Are there free web services for spell checking? From Google, maybe?

Peter Norvig has written a terrific spell checker. Maybe that can help you.
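To make that concrete, here is a rough Java sketch of the core idea behind Norvig's corrector: count word frequencies in a large corpus, generate every string within one edit of the input word, and return the most frequent known candidate. The corpus file name big.txt is just a placeholder for whatever text you train on.

// Rough sketch of Norvig's idea: pick the most frequent known word within one edit.
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class SimpleSpellCorrector {
    private final Map<String, Integer> freq = new HashMap<>();

    public SimpleSpellCorrector(String corpusPath) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get(corpusPath))).toLowerCase();
        Matcher m = Pattern.compile("[a-z]+").matcher(text);
        while (m.find()) {
            freq.merge(m.group(), 1, Integer::sum);   // count word occurrences in the corpus
        }
    }

    /** All strings one edit (delete, transpose, replace, insert) away from w. */
    private Set<String> edits1(String w) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i <= w.length(); i++) {
            String head = w.substring(0, i), tail = w.substring(i);
            if (!tail.isEmpty()) out.add(head + tail.substring(1));                                      // delete
            if (tail.length() > 1) out.add(head + tail.charAt(1) + tail.charAt(0) + tail.substring(2));  // transpose
            for (char c = 'a'; c <= 'z'; c++) {
                if (!tail.isEmpty()) out.add(head + c + tail.substring(1));                              // replace
                out.add(head + c + tail);                                                                // insert
            }
        }
        return out;
    }

    public String correct(String word) {
        if (freq.containsKey(word)) return word;       // already a known word
        String best = word;
        int bestCount = 0;
        for (String cand : edits1(word)) {             // most frequent known candidate wins
            int count = freq.getOrDefault(cand, 0);
            if (count > bestCount) { bestCount = count; best = cand; }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        SimpleSpellCorrector sc = new SimpleSpellCorrector("big.txt");  // placeholder corpus file
        System.out.println(sc.correct("tel"));                          // likely prints "tell"
    }
}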

You have at least three options:
1) You can write a program which actually understands the language (i.e. what a word means). This is a research topic today; expect the first results when you can buy a computer fast enough to run such a program (probably in 10 years, when computers have become 1000 times faster than today).
2) Use a huge corpus (of text documents) to train a Hidden Markov Model.
3) Use a huge corpus and generate n-gram statistics, i.e. how often a tuple of N words appears. I don't have a link handy for this, but the idea is that some words always appear in the context of other words. So when you split your text into 4-grams and look them up in your database and one can't be found, chances are that there is something wrong with the current tuple. The next step is to find all possible matches (other 4-grams with a small Soundex or similar distance to the current one) and try the one with the highest frequency (see the sketch after this answer).
Google has this data for quite a few languages, and you might find more about it in Google Labs.
[EDIT] After some googling, I finally found the link: on that page, you can buy the English 1- to 5-grams which Google collected over the whole Internet, shipped on 6 DVDs.
Googling for "google spelling statistics n-grams" will also turn up some interesting links.
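Here is a minimal, hedged sketch of the 4-gram lookup described in the third option. The ngramCounts map is a stand-in for a real database of corpus counts (for example one built from the Google n-gram data); flagged windows would then be replaced by the highest-frequency alternative 4-gram, as described above.

// Slide a window over the sentence, look each 4-gram up in a frequency table,
// and flag windows that were never (or almost never) seen in the corpus.
import java.util.*;

public class NGramChecker {
    private final Map<String, Long> ngramCounts;   // 4-gram -> corpus frequency (stand-in for a real DB)
    private final long minCount;                   // below this, the window is "suspicious"

    public NGramChecker(Map<String, Long> ngramCounts, long minCount) {
        this.ngramCounts = ngramCounts;
        this.minCount = minCount;
    }

    /** Returns the starting indices of 4-grams whose corpus count is too low. */
    public List<Integer> suspiciousWindows(String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\s+");
        List<Integer> flagged = new ArrayList<>();
        for (int i = 0; i + 4 <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + 4));
            if (ngramCounts.getOrDefault(gram, 0L) < minCount) {
                flagged.add(i);   // something in tokens[i..i+3] is probably wrong
            }
        }
        return flagged;
    }
}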

Soundex (see the Wikipedia article) is one option.
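For illustration, here is a small self-contained sketch of the classic American Soundex algorithm. Libraries such as Apache Commons Codec also ship an implementation, so treat this only as a demonstration of the idea that similar-sounding words map to the same 4-character code.

// Classic American Soundex: "tall" and "tell" both map to T400, which makes it
// useful for generating phonetically close spelling candidates.
public class Soundex {
    // digit for each letter a..z; '0' means the letter is ignored (vowels, h, w, y)
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String w = word.toUpperCase().replaceAll("[^A-Z]", "");
        if (w.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        out.append(w.charAt(0));
        char prev = CODES.charAt(w.charAt(0) - 'A');
        for (int i = 1; i < w.length() && out.length() < 4; i++) {
            char code = CODES.charAt(w.charAt(i) - 'A');
            if (code != '0' && code != prev) out.append(code);
            if (w.charAt(i) != 'H' && w.charAt(i) != 'W') prev = code;  // h/w do not break runs
        }
        while (out.length() < 4) out.append('0');   // pad to 4 characters
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("tall"));  // T400
        System.out.println(encode("tell"));  // T400 - same code
    }
}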

There are quite a few Java libraries for natural language processing that would help you implement a spelling corrector, but you asked about a book. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze looks like a good option. The first author is a Stanford professor who leads a group that does natural language processing research and develops Java libraries and NLP resources that many people use.

At Dev Days London, Michael Sparks presented a Python script coded for exactly this. It was surprisingly simple! See if you can find it on Google. Maybe somebody here will have the link.

Related

Java text and keyword qualification

I have 140-character texts and a set of keywords.
What I want to do is write an algorithm that will help me compute a percentage match between my text and the keywords, in order to qualify a text as representing an IT event announcement.
For example:
Text: "Tomorrow will take place our weekly event which about computer. We will discuss about how to implement algorithms. This will be very great."
keyword: "event, computer, database, Software, algorithms"
Here the match is 3 words out of 5 keywords, which is 60%.
Does it make sense to use word counts and compare them to the number of keywords? Is this approach accurate?
Has anyone dealt with something like this before?
Thanks for your support.
Yes, this definitely makes sense. However, you will have to evaluate in practice whether it is precise enough for your purpose. It depends very much on the texts you are dealing with.
If you want to try something that is a bit more advanced but not too complex: cosine similarity is another common measure for comparing texts.
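As a rough illustration, here is a sketch of both measures in Java: the simple keyword-overlap percentage from the question and cosine similarity over word-count vectors. The class and method names are made up for the example.

// Keyword-overlap percentage (the 3/5 = 60% idea) plus cosine similarity over word counts.
import java.util.*;

public class TextMatcher {
    private static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** Fraction of keywords that occur at least once in the text. */
    public static double keywordMatch(String text, List<String> keywords) {
        Map<String, Integer> counts = wordCounts(text);
        long hits = keywords.stream().filter(k -> counts.containsKey(k.toLowerCase())).count();
        return (double) hits / keywords.size();
    }

    /** Cosine similarity between the word-count vectors of two texts. */
    public static double cosine(String a, String b) {
        Map<String, Integer> va = wordCounts(a), vb = wordCounts(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : vb.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        String text = "Tomorrow will take place our weekly event which about computer. "
                    + "We will discuss about how to implement algorithms.";
        List<String> keys = Arrays.asList("event", "computer", "database", "software", "algorithms");
        System.out.println(keywordMatch(text, keys));             // 0.6
        System.out.println(cosine(text, String.join(" ", keys))); // small positive value
    }
}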
There are tons of algorithms and libraries for text classification. LingPipe is a nice Java library that might help you.
If you are interested in using a library, you will find a good overview in the top answer to this Quora question.

Identify an English word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, doing it manually is not feasible. So far, using NLTK + Python, I have been able to identify some specific cases that use very similar keywords, but I have not been able to do the same with the kinds shown in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, evaluate and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your needs. I would give it a try with Python, NLTK and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
This task is an instance of the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating 1000 or so sentences containing product names and training some classifier such as a Conditional Random Field classifier with some basic features (there is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect, of course (no system will be perfect, but some solutions are better than others).
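For what it's worth, here is a rough, unverified sketch of what training such a CRF model looks like through the Stanford NER Java API. The property names follow the training examples in the Stanford NER documentation, but the file names (products-train.tsv, product-ner-model.ser.gz) are placeholders, so check the current documentation before relying on this.

// Sketch only: train a custom CRF model with Stanford NER. The training file is
// tab-separated, one token per line, with its label in the second column.
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.Properties;

public class TrainProductNer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("trainFile", "products-train.tsv");   // token \t label, one per line
        props.setProperty("map", "word=0,answer=1");            // column layout of the training file
        props.setProperty("useClassFeature", "true");
        props.setProperty("useWord", "true");
        props.setProperty("useNGrams", "true");
        props.setProperty("maxNGramLeng", "6");
        props.setProperty("usePrev", "true");
        props.setProperty("useNext", "true");

        CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
        crf.train();                                             // reads trainFile from the properties
        crf.serializeClassifier("product-ner-model.ser.gz");

        // Tag new text with the trained model.
        System.out.println(crf.classifyToString("A glove comprising an index finger receptacle ."));
    }
}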
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an example of step-by-step instructions for doing this with an open-source tool:
Download CRF++ and look at the provided examples; they are in a simple text format.
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%).
Use the following baseline feature templates (paste into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Then run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt: one column will contain your hand-labeled tags and another the machine-predicted labels. You can then compare them, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
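For example, a small Java sketch of the accuracy computation, assuming the usual crf_test output layout in which the gold label is the second-to-last column and the predicted label is the last column:

// Read CRF++'s result.txt and compute token-level accuracy.
import java.io.IOException;
import java.nio.file.*;

public class CrfAccuracy {
    public static void main(String[] args) throws IOException {
        int correct = 0, total = 0;
        for (String line : Files.readAllLines(Paths.get("result.txt"))) {
            if (line.trim().isEmpty()) continue;        // blank lines separate sentences
            String[] cols = line.trim().split("\\s+");
            String gold = cols[cols.length - 2];        // hand-labeled tag
            String predicted = cols[cols.length - 1];   // machine-predicted tag
            total++;
            if (gold.equals(predicted)) correct++;
        }
        System.out.printf("token accuracy: %.2f%% (%d/%d)%n",
                100.0 * correct / total, correct, total);
    }
}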
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices for solving such tasks, won't be good enough for academic research, and is not 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.

Which algorithm to use to classify data by using 3 different parameters?

I am trying to develop an Android (Java) project for my Artificial Intelligence thesis. In short, it is based on story reading and word quizzes. A person reads a story and marks the words he doesn't know. These words are registered in a WordPortfolio db that has "Word_id", "Seen" (how many times), "Asked" (how many times asked in a quiz), and "Right" (how many times answered correctly).
I have a "Words" table in my db that has 3 different parameters that make a word unique. Those are "Priority", "Level" and a specifier for whether it is a verb, noun, adjective, adverb, etc.
What I want to ask is:
Which algorithm can I use to classify these words so that a "word-meaning question" is asked wisely of the learner? I want the learner to see the words he encountered in the story-reading part more than once, to consolidate their meaning, and I also want him to learn new words.
There are many types of algorithms designed to do this. For instance, you could use linear regression, nearest neighbor, clustering, or a neural network. http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms provides a pretty comprehensive list of the options out there.
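As one concrete illustration, here is a minimal 1-nearest-neighbor sketch over numeric per-word features such as the seen/asked/right counts from your WordPortfolio table. The feature choice and the boolean "shouldAsk" label are assumptions made purely for the example; any of the other listed algorithms could be substituted.

// 1-nearest-neighbor: classify a new word by copying the label of its closest training example.
import java.util.*;

public class WordQuizKnn {
    static class Example {
        final double[] features;   // e.g. { seen, asked, right }
        final boolean shouldAsk;   // label assigned from past quiz behaviour
        Example(double[] f, boolean label) { features = f; shouldAsk = label; }
    }

    private final List<Example> training = new ArrayList<>();

    public void add(double[] features, boolean shouldAsk) {
        training.add(new Example(features, shouldAsk));
    }

    public boolean classify(double[] features) {
        Example nearest = Collections.min(training,
                Comparator.comparingDouble((Example e) -> distance(e.features, features)));
        return nearest.shouldAsk;
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        WordQuizKnn knn = new WordQuizKnn();
        knn.add(new double[]{5, 1, 0}, true);   // seen often, rarely answered right -> ask again
        knn.add(new double[]{2, 3, 3}, false);  // already mastered -> don't ask
        System.out.println(knn.classify(new double[]{4, 1, 1}));  // likely true
    }
}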
I would also check whether your library has the book "Programming Collective Intelligence" by Toby Segaran (http://shop.oreilly.com/product/9780596529321.do) or something similar.
Classification and clustering algorithms are implemented in many artificial intelligence packages such as MATLAB, WEKA, etc. You can see a sample of this in "WEKA Text Classification for First Time & Beginner Users", but I think your problem would also perform well on a map/reduce framework. I suggest you use Mahout for your problem; it has a parallel framework and can make up for the speed limitations of other platforms.

simple sentiment analysis with java

I am very new to sentiment analysis. How can I judge whether a given word or sentence is positive or negative? I have to implement it in Java. I tried to read things like the LingPipe and RapidMiner tutorials, but I do not understand them. In their examples they use a lot of data; in my case I do not have much data. All I have is a word or a sentence, let's say. I tried to read the questions on Stack Overflow too, but they do not help me much.
Thanks in advance.
Computers don't know about a human thing like sentiment unless they learn it from examples that a human has labeled as positive or negative.
The goal of Machine Learning is in fact to make the most informed decision about a new example based on the empirical data of previous examples. Statistically, the more data, the better.
To "judge" the sentiment of a sentence, you'll need to have trained a model or classifier on some sentences labeled for sentiment. The classifier takes an unlabeled sentence as input and outputs a label: positive or negative.
First get training examples. I'm sure you can find some labeled sentiment data in the public domain. One of the best data set repositories is the UCI KDD Archive. You may then train a classifier on the data to judge new examples. There are a host of learning algorithm resources available. My favorites are jBoost, which can output a classifier as Java code, and Rapidminer, which is better for visual analysis.
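To give a feel for what such a classifier involves, here is a minimal bag-of-words Naive Bayes sketch in Java. The two training sentences in main() are made up; in practice you would train it on a labeled corpus as described above.

// Two-class Naive Bayes over bag-of-words features, with add-one smoothing.
import java.util.*;

public class TinySentiment {
    private final Map<String, Integer> posCounts = new HashMap<>();
    private final Map<String, Integer> negCounts = new HashMap<>();
    private int posTotal = 0, negTotal = 0, posDocs = 0, negDocs = 0;

    private static String[] tokens(String text) {
        return text.toLowerCase().split("[^a-z]+");
    }

    public void train(String sentence, boolean positive) {
        for (String w : tokens(sentence)) {
            if (w.isEmpty()) continue;
            if (positive) { posCounts.merge(w, 1, Integer::sum); posTotal++; }
            else          { negCounts.merge(w, 1, Integer::sum); negTotal++; }
        }
        if (positive) posDocs++; else negDocs++;
    }

    /** Returns true if the sentence is more likely positive than negative. */
    public boolean isPositive(String sentence) {
        // log P(class) + sum of log P(word | class)
        double pos = Math.log((double) posDocs / (posDocs + negDocs));
        double neg = Math.log((double) negDocs / (posDocs + negDocs));
        Set<String> vocabSet = new HashSet<>(posCounts.keySet());
        vocabSet.addAll(negCounts.keySet());
        int vocab = vocabSet.size();
        for (String w : tokens(sentence)) {
            if (w.isEmpty()) continue;
            pos += Math.log((posCounts.getOrDefault(w, 0) + 1.0) / (posTotal + vocab));
            neg += Math.log((negCounts.getOrDefault(w, 0) + 1.0) / (negTotal + vocab));
        }
        return pos > neg;
    }

    public static void main(String[] args) {
        TinySentiment clf = new TinySentiment();
        clf.train("I love this great wonderful movie", true);
        clf.train("what a terrible awful boring film", false);
        System.out.println(clf.isPositive("a wonderful great experience"));  // true
        System.out.println(clf.isPositive("boring and awful"));              // false
    }
}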
You could use an existing web-service which is trained from prior data. For example:
Chatterbox Sentiment Detection API
It has libraries for Java & Android.
(Disclosure: I work for the company that builds this API)
This is not really programming related (neuro-linguistic programming is not programming), and in general there is no reliable solution.
My best idea is to make it work like Google's "PigeonRank": collect words and sentences, then collect human feedback on whether they are positive or negative, and then use Bayesian matching with this data.
You can try to use WordNet to find a word's semantic orientation (SO), based on a "distance" calculation between your word and "good" or "bad" seed words. A shorter distance gives you the word's SO. The results will likely be a bit weak, but not a lot of data (or time) is necessary for this approach.

Natural Language Processing: Find obscenities in English?

Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?
If it makes it easier: I don't want to filter anything out, just to get a count. So if there are some false positives, it's not the end of the world, as long as the rate is more or less uniformly over-counted.
A huge list, and think of the target audience. Is there a 3rd-party service you can use that specialises in this, rather than rolling your own?
Some quick thoughts:
The Scunthorpe problem (and follow the links to "Swear filter" for more)
British or American English? fanny, fag, etc.
Political correctness: "black" or "Afro-American"?
Edit:
Be very careful here, and again here: normal words can offend, whether by choice or through ignorance.
Is the phrase "I want to stick my long-necked Giraffe up your fluffy white bunny" obscene?
I'd make a huge list.
Regexes have the problem of misfiring when applied to natural language, especially given the number of exceptions English has.
Note that any NLP logic like this will be subject to attacks of "character replacement":
For example, I can write "hello" as "he11o", replacing Ls with ones. The same goes for obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up).
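A hedged sketch of that blacklist idea, including normalization of common character substitutions (1 -> l, 0 -> o, @ -> a, ...); the blacklist contents and the substitution table are placeholders you would need to flesh out.

// Count (don't filter) blacklisted words, mapping leet-speak substitutions back to letters first.
import java.util.*;

public class ObscenityCounter {
    private static final Map<Character, Character> SUBSTITUTIONS = Map.of(
            '1', 'l', '!', 'i', '0', 'o', '@', 'a', '$', 's', '3', 'e', '4', 'a', '5', 's');
    private final Set<String> blacklist;

    public ObscenityCounter(Set<String> blacklist) {
        this.blacklist = blacklist;
    }

    private static String normalize(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toLowerCase().toCharArray()) {
            sb.append(SUBSTITUTIONS.getOrDefault(c, c));
        }
        return sb.toString().replaceAll("[^a-z]", "");   // drop leftover punctuation
    }

    public int count(String text) {
        int hits = 0;
        for (String token : text.split("\\s+")) {
            if (blacklist.contains(normalize(token))) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        ObscenityCounter counter = new ObscenityCounter(Set.of("badword"));           // placeholder list
        System.out.println(counter.count("this is a b@dw0rd, and another badword"));  // 2
    }
}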
One problem with filters of this kind is their tendency to flag entirely proper English town names like Scunthorpe. While that can be reduced by checking the whole word rather than parts, you then find people taking advantage by merging their offensive words with adjacent text.
It depends on what your text source is, but I'd go for some kind of established and proven pattern-matching algorithm, using a trie for example.
Use the morphy lemmatizer built into WordNet, and then determine whether the lemma is an obscenity. This will solve the problem of different verb forms, plurals, etc...
I would advocate a large list of simple regexes: smaller than a list of all the variants, but not trying to capture anything more than letter alternatives in any given expression, like "f[u_-##$%^&*.]ck".
You want to use Bayesian Analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam/phishing messages in your email inbox. You can train your analysis engine so that it can improve over time. The ability to detect a legitimate email vs. a spam email sounds identical to the problem you are experiencing.
Here are a couple of useful links:
A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.
Data Mining (ppt) - This was written by a colleague of mine.
Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
There are webservices that do this kind of thing in English.
I'm sure there are others, but I've used WebPurify in a project for precisely this reason before.
At Melissa Data, when my manager (the director of Massachusetts Research and Development) and I refactored a data profiler targeted at relational databases, we counted profanities by the number of Levenshtein distance matches, where the number of insertions, deletions and substitutions is tunable by the user so as to allow for spelling mistakes, Germanic equivalents of English words, plurals, as well as whitespace and non-whitespace punctuation. We sped up the running time of the Levenshtein distance calculation by looking only at the diagonal band of the n-by-n matrix.
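This is not Melissa Data's actual code, but here is a sketch of that diagonal-band trick: if you only care whether two strings are within k edits of each other, you only need to fill the dynamic-programming cells within k of the main diagonal.

// Banded Levenshtein: cells outside the band are treated as "too far" (maxDist + 1).
import java.util.Arrays;

public class BandedLevenshtein {
    /** Returns true if edit distance(a, b) <= maxDist, filling only a diagonal band. */
    public static boolean withinDistance(String a, String b, int maxDist) {
        int n = a.length(), m = b.length();
        if (Math.abs(n - m) > maxDist) return false;   // lengths already too different
        final int INF = maxDist + 1;
        int[] prev = new int[m + 1], curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j <= maxDist ? j : INF;
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, INF);
            if (i <= maxDist) curr[0] = i;             // first column, if inside the band
            int from = Math.max(1, i - maxDist), to = Math.min(m, i + maxDist);
            for (int j = from; j <= to; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;                          // substitution / match
                if (prev[j] + 1 < best) best = prev[j] + 1;             // deletion
                if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;     // insertion
                curr[j] = Math.min(best, INF);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= maxDist;
    }

    public static void main(String[] args) {
        System.out.println(withinDistance("profanity", "prof4nity", 1));  // true
        System.out.println(withinDistance("profanity", "harmless", 2));   // false
    }
}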
