Natural language parsing, practical example - java

I am looking to use a natural language parsing library for a simple chat bot. I can get the part-of-speech (POS) tags, but I always wonder: what do you do with them? If I know the parts of speech, what then?
I guess it would help with the responses. But what data structures and architecture could I use?

A part-of-speech tagger assigns labels to the words in the input text. For example, the popular Penn Treebank tagset has some 40 labels, such as "plural noun", "comparative adjective", "past tense verb", etc. The tagger also resolves some ambiguity. For example, many English word forms can be either nouns or verbs, but in the context of other words, their part of speech is unambiguous.
So, having annotated your text with POS tags you can answer questions like: how many nouns do I have?, how many sentences do not contain a verb?, etc.
For a chatbot, you obviously need much more than that. You need to figure out the subjects and objects in the text, and which verb (predicate) they attach to; you need to resolve anaphors (which individual does a he or she point to), what is the scope of negation and quantifiers (e.g. every, more than 3), etc.
Ideally, you need to map your input text into some logical representation (such as first-order logic), which would let you bring in reasoning to determine if two sentences are equivalent in meaning, or in an entailment relationship, etc.
While a POS-tagger would map the sentence
Mary likes no man who owns a cat.
to such a structure
Mary/NNP likes/VBZ no/DT man/NN who/WP owns/VBZ a/DT cat/NN ./.
you would rather need something like this:
SubClassOf(
ObjectIntersectionOf(
Class(:man)
ObjectSomeValuesFrom(
ObjectProperty(:own)
Class(:cat)
)
)
ObjectComplementOf(
ObjectSomeValuesFrom(
ObjectInverseOf(ObjectProperty(:like))
ObjectOneOf(
NamedIndividual(:Mary)
)
)
)
)
Of course, while good POS taggers reach per-token accuracy in the high nineties (percent) on text similar to their training data, more complex automatic processing will perform much worse.
A good Java library for NLP is LingPipe. It doesn't, however, go much beyond POS-tagging, chunking, and named entity recognition.

Natural language processing is wide and deep, with roots going back at least to the 60s. You could start reading up on computational linguistics in general, natural language generation, generative grammars, Markov chains, chatterbots and so forth.
Wikipedia has a short list of libraries which I assume you might have seen. Java doesn't have a long tradition in NLP, though I haven't looked at the Stanford libraries.
I doubt you'll get very impressive results without diving fairly deeply into linguistics and grammar. Not everybody's favourite school subject (or so I've heard reported -- loved'em meself!).

I'll skip a lot of details and keep this simple. Part-of-speech tagging is a first step toward building a parse tree out of a sentence. Once you have the tree, you try to work out a meaning as unambiguously as possible. The result of this parsing step will greatly help you frame a suitable response for your chatterbot.

Once you have part of speech tags you can extract, for example, all nouns, so you know roughly what things or objects someone is talking about.
To give you an example:
Someone says "you can open a new window." When you have the POS tags you know they are not talking about a can (as in container, jar etc., which would even make sense in the context of open), but a window. You'll also know that open is a verb.
With this information, your chat bot can generate a much better reply that will have nothing to do with can openers etc.
Note: You don't need a parser to get POS tags. A simple POS tagger is enough. A parser will give you even more information (e.g. what is the subject, what the object of the sentence?)
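As a minimal sketch of the "extract all nouns" idea, assuming the tagger output has already been turned into the word/TAG text form shown earlier in this thread: the class name NounExtractor is made up for illustration, and the tags in the usage note are written by hand rather than produced by a real tagger.

```java
import java.util.ArrayList;
import java.util.List;

class NounExtractor {
    /** Returns the words whose Penn Treebank tag starts with "NN" (NN, NNS, NNP, NNPS). */
    static List<String> nouns(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip tokens that do not have the word/TAG shape
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (tag.startsWith("NN")) {
                result.add(word);
            }
        }
        return result;
    }
}
```

With a hand-tagged version of the example, NounExtractor.nouns("you/PRP can/MD open/VB a/DT new/JJ window/NN ./.") keeps only "window", because "can" carries the modal tag MD rather than a noun tag.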


Extract tags or relevant keywords from text

I need to extract relevant keywords or concepts similar to the AlchemyAPI's concept tagging method.
I would like to know if there's any tool that can provide something similar to the "concept tagging" of text or classification, not just stemming words or regex only.
A standalone solution is preferable in my case, as I have a lot of data and it quickly reaches the rate limit for Yahoo Term Extraction and AlchemyAPI.
E.g.
Input:
With that said Its the democratic publics decision on whether they agree or disagree
Outputs:
Decision making
This is called Text Classification
Here is a 5-part video series on doing what you need with a tool called RapidMiner
http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html

Tool for creating own rules for word lemmatization and similar tasks

I'm doing a lot of natural language processing with somewhat unusual requirements. Often I get tasks similar to lemmatization: given a word (or just a piece of text) I need to find some patterns and transform the word somehow. For example, I may need to correct misspellings, e.g. given the word "eatin" I need to transform it to "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha" and so on.
So I'm looking for some generic tool that allows me to define transformation rules for such cases. Rules may look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is I need to be able to use captured patterns from the left side on the right side.
I work with linguists who don't know programming at all, so ideally this tool should use external files and simple language for rules.
I'm doing this project in Clojure, so ideally this tool should be a library for one of JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, and I'm not expert in all of them, so I could miss the tool I need there. If so, please let me know.
Note, that I'm working with several languages and perform very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for concrete languages do not fit my needs - I really need more generic tool.
UPD. It seems like I need to give some more details/examples of what I need.
Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()) but with the possibility to use captured text in the replacement string. For example, in real-world text people often repeat characters to emphasise a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace these repetitive "oooo" with only a single character. So there may be a rule like this (in syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
that is, replace a word starting with some chars (chars1), followed by at least 3 repetitions of <char> and possibly ending with some other chars (chars2), with a similar string that contains only a single <char>. The key point here is that we capture <char> on the left side of a rule and use it on the right side.
I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for representing stemming algorithms. Its stemmer is used in the Lucene search engine.
I've found http://userguide.icu-project.org/transforms/general to be useful as well for some general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.
You can just load up rules from a file into a String and register them, etc.
http://userguide.icu-project.org/transforms/general/rules
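For comparison, the plain JDK can already reuse captured text in a replacement: java.util.regex supports back-references (\1) in the pattern and group references ($1) in the replacement string. The sketch below assumes a made-up rule file format of one "regex -> replacement" rule per line; the class name RewriteRules and that format are illustrative only, not part of ICU, Snowball or any other library mentioned here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RewriteRules {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> replacements = new ArrayList<>();

    /** Parses lines of the form "regex -> replacement"; blank lines and lines starting with # are ignored. */
    RewriteRules(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int arrow = trimmed.indexOf("->");
            if (arrow < 0) {
                throw new IllegalArgumentException("Missing '->' in rule: " + line);
            }
            patterns.add(Pattern.compile(trimmed.substring(0, arrow).trim()));
            replacements.add(trimmed.substring(arrow + 2).trim());
        }
    }

    /** Applies every rule, in file order, to the whole text. */
    String apply(String text) {
        String result = text;
        for (int i = 0; i < patterns.size(); i++) {
            result = patterns.get(i).matcher(result).replaceAll(replacements.get(i));
        }
        return result;
    }
}
```

A rule file for the examples in the question could contain the three lines (\w)\1{2,} -> $1, \b(\w+)in\b -> $1ing and \baha(ha)+\b -> ahaha. The lines can be read with Files.readAllLines, so linguists only ever edit the text file. Note that the second rule is deliberately naive and would also rewrite real words ending in "in".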

String analysis and classification

I am developing a financial manager in my freetime with Java and Swing GUI. When the user adds a new entry, he is prompted to fill in: Moneyamount, Date, Comment and Section (e.g. Car, Salary, Computer, Food,...)
The sections are created "on the fly". When the user enters a new section, it will be added to the section-jcombobox for further selection. The other point is, that the comments could be in different languages. So the list of hard coded words and synonyms would be enormous.
So, my question is, is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section.
My first thought was, do it with a neural network and learn from the input, if the user selects another section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR,...). But all of them are only using doubles as in/output.
Anyone could give me a hint how to start or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem, and requires a solid expertise. Besides, there is a lot of simpler and better solutions for your problem.
First obvious solution, build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonyms analysis, or use parse comments/sections provided by your users to statistically detect relations between words, etc...
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to define if this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you a 80% satisfying feature, while aiming for 90% would cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
import java.util.Locale;

// paramInput is the comment the user typed; upper-case it so "Fuel" and "fuel" match too
String comment = paramInput.toUpperCase(Locale.ROOT);
if (comment.contains("FUEL")) {
    // do the fuel functionality
}
In a simple app with only a few fixed sections, you can take the comment string, check whether it contains certain keywords, and set the value of Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduce a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so theoretical fit for a neural net, but why not take immediately the numerical best fit.
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. "which of the provided pre-trained classes does the input belong to?", not just "does the input represent an X?"), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input that your users will provide. This is impossible; in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps the inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections before hand,
an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is to then find a list of each Section that the typed keywords belong to and suggest multiple Sections and let the user pick one. The feedback that you get from user selections would enable improvements of suggestions in the future. Another would be to calculate a Bayesian probability (the probability that this word belongs to Section X given the previously stored mappings) for all keywords and Sections and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread.
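The "normalise over each unique keyword and take the mean" variant can be sketched as follows. This is only an illustration under stated assumptions: the class name SectionSuggester, the tiny stop-word list and the in-memory map are all made up here, and a real application would load the stored comment -> Section pairs from its database instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class SectionSuggester {
    // Assumed stop-word list; extend per language as needed.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "at", "a", "an", "for", "and", "or", "of", "in"));

    // keyword -> (section -> number of stored entries in which they appeared together)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    private static Set<String> keywords(String comment) {
        Set<String> words = new HashSet<>();
        for (String w : comment.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    /** Records one stored (comment, section) pair, e.g. whenever the user saves an entry. */
    void learn(String comment, String section) {
        for (String w : keywords(comment)) {
            counts.computeIfAbsent(w, k -> new HashMap<>()).merge(section, 1, Integer::sum);
        }
    }

    /** Sums P(section | keyword) over the known keywords; the argmax equals that of the mean. */
    Optional<String> suggest(String comment) {
        Map<String, Double> scores = new HashMap<>();
        for (String w : keywords(comment)) {
            Map<String, Integer> bySection = counts.get(w);
            if (bySection == null) {
                continue; // an unseen keyword carries no evidence
            }
            int total = bySection.values().stream().mapToInt(Integer::intValue).sum();
            bySection.forEach((section, n) -> scores.merge(section, (double) n / total, Double::sum));
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Because learn() is called on every saved entry, sections created on the fly need no special handling, and a correction by the user is simply one more stored pair. An empty Optional means no keyword was recognised, in which case the combo box can be left unselected.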

Natural Language Processing: Find obscenities in English?

Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?
If it makes it easier, I don't want to filter out, just to get a count. So if there are some false positives, it's not the end of the world, as long as there's a more or less uniformly over exaggerated rate.
Use a huge list, and think of the target audience. Is there a third-party service that specialises in this that you can use rather than rolling your own?
Some quick thoughts:
The Scunthorpe problem (and follow the links to "Swear filter" for more)
British or American English? fanny, fag etc
Political correctness: "black" or "Afro-American"?
Edit:
Again, be very careful here. Normal words can offend, whether by choice or ignorance.
Is the phrase I want to stick my long-necked Giraffe up your fluffy white bunny obscene?
I'd make a huge list.
Regexes have the problem of misfiring when applied to natural language, especially given the number of exceptions English has.
Note that any NLP logic like this will be subject to attacks of "character replacement":
For example, I can write "hello" as "he11o", replacing L's with One's. Same with obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up)
One problem with filters of this kind is their tendency to flag entirely proper English town names like Scunthorpe. While that can be reduced by checking the whole word rather than parts, you then find people taking advantage by merging their offensive words with adjacent text.
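A minimal sketch combining the points above: a blacklist lookup on whole tokens (so a town name that merely contains a listed word is not counted) after undoing a few character replacements. The class name BlacklistCounter and the substitution table are assumptions for illustration, and the usage note uses mild stand-in words rather than a real list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class BlacklistCounter {
    private final Set<String> blacklist;

    BlacklistCounter(String... words) {
        this.blacklist = new HashSet<>(Arrays.asList(words));
    }

    /** Lower-cases a token and undoes a few common digit/symbol substitutions (an assumed, incomplete table). */
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                .replace('1', 'l')
                .replace('0', 'o')
                .replace('3', 'e')
                .replace('@', 'a')
                .replace('$', 's');
    }

    /** Counts whitespace-separated tokens that equal a blacklisted word after normalisation. */
    int count(String text) {
        int hits = 0;
        for (String token : text.trim().split("\\s+")) {
            // Strip leading and trailing punctuation, then compare the whole token.
            String core = normalize(token).replaceAll("^[^a-z]+|[^a-z]+$", "");
            if (blacklist.contains(core)) {
                hits++;
            }
        }
        return hits;
    }
}
```

For example, new BlacklistCounter("heck", "darn").count("What the h3ck, Darn it! Heckington is a town.") counts "h3ck," and "Darn" but not "Heckington". As noted above, whole-token matching misses words merged with adjacent text, so this trades false positives for false negatives.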
It depends what your text source is, but I'd go for some kind of established and proven pattern matching algorithm, using a Trie for example.
Use the morphy lemmatizer built into WordNet, and then determine whether the lemma is an obscenity. This will solve the problem of different verb forms, plurals, etc...
I would advocate a large list of simple regexes. It would be smaller than a list of all the variants, but no single expression should try to capture anything more than letter alternatives, like "f[u_#$%^&*.-]ck" (with the hyphen placed last in the character class so that it is read literally).
You want to use Bayesian Analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam/phishing messages in your email inbox. You can train your analysis engine so that it can improve over time. The ability to detect a legitimate email vs. a spam email sounds identical to the problem you are experiencing.
Here are a couple of useful links:
A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.
Data Mining (ppt) - This was written by a colleague of mine.
Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
There are webservices that do this kind of thing in English.
I'm sure there are others, but I've used WebPurify in a project for precisely this reason before.
At Melissa Data, when my manager, the director of Massachusetts Research and Development, and I refactored a Data Profiler targeted at relational databases, we counted profanities by the number of Levenshtein distance matches, where the number of insertions, deletions and substitutions is tunable by the user so as to allow for spelling mistakes, Germanic equivalents of English words, plurals, as well as whitespace and non-whitespace punctuation. We sped up the Levenshtein distance calculation by looking only in the diagonal bands of the n by n matrix.
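The diagonal-band idea can be sketched like this. It is not the Melissa Data implementation, only a generic illustration with a made-up class name: if at most k edits are allowed, only cells within k of the main diagonal can hold a value of k or less, so everything outside that band is treated as "too far".

```java
import java.util.Arrays;

class BandedLevenshtein {
    /**
     * Returns the Levenshtein distance between a and b if it is at most k,
     * otherwise k + 1. Only cells with |i - j| <= k are computed.
     */
    static int boundedDistance(String a, String b, int k) {
        int n = a.length();
        int m = b.length();
        int tooFar = k + 1;
        if (Math.abs(n - m) > k) {
            return tooFar; // the length difference alone needs more than k edits
        }
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j <= k ? j : tooFar;
        }
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, tooFar);
            if (i <= k) {
                curr[0] = i;
            }
            int lo = Math.max(1, i - k);
            int hi = Math.min(m, i + k);
            for (int j = lo; j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;            // substitution or match
                best = Math.min(best, prev[j] + 1);       // deletion from a
                best = Math.min(best, curr[j - 1] + 1);   // insertion into a
                curr[j] = Math.min(best, tooFar);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[m];
    }
}
```

A token would then count as a match when boundedDistance(token, listedWord, k) <= k, with k being the user-tunable tolerance. Each row costs roughly 2k + 1 cell updates instead of m.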

Best practices in internationalizing text with lots of markup?

I'm working on a web project that will (hopefully) be available in several languages one day (I say "hopefully" because while we only have an English language site planned today, other products of my company are multilingual and I am hoping we are successful enough to need that too).
I understand that the best practice (I'm using Java, Spring MVC, and Velocity here) is to put all text that the user will see in external files, and refer to them in the UI files by name, such as:
#in messages_en.properties:
welcome.header = Welcome to AppName!
#in the markup
<title>#springMessage("welcome.header")</title>
But, having never had to go through this process on a project myself before, I'm curious what the best way to deal with this is when you have some segments of the UI that are heavy on markup, such as:
<p>We are excited to announce that Company1 has been acquired by
Division X,
a fast-growing division of Company 2, Inc.
(Nasdaq: BLAH), based in...
One option I can think of would be to store this "low-level" of markup in messages.properties itself for the message - but this seems like the worst possible option.
Other options that I can think of are:
Store each non-markup inner fragment in messages.properties, such as acquisitionAnnounce1, acquisitionAnnounce2, acquisitionAnnounce3. This seems very tedious though.
Break this message into more reusable components, such as Company1.name, Company2.name, Company2.ticker, etc., as each of these is likely reused in many other messages. This would probably account for 80% of the words in this particular message.
Are there any best practices for dealing with internationalizing text that is heavy with markup such as this? Do you just have to bite down and bear the pain of breaking up every piece of text? What is the best solution from any projects you've personally dealt with?
Typically if you use a template engine such as Sitemesh or Velocity you can manage these smaller HTML building blocks as subtemplates more effectively.
By so doing, you can incrementally boil down the strings which are the purely internationalized ones into groups and make them relevant to those markup subtemplates. Having done this sort of work using templates for an app which spanned multi-languages in the same locale, as well as multiple locales, we never ever placed markup in our message bundles.
I'd suggest that a key good practice would be to avoid placing markup (even at a low level, as you put it) inside message properties files at all costs! The potential this has for unleashing hell is not something to be overlooked: biting the bullet and breaking things up correctly is far less of a pain than having to manage many files with scattered HTML markup. It's important that you can visualise markup as holistic chunks, and scattering it everywhere would make everyday development a chore since:
You would lose IDE color highlighting and syntax validation
High possibility that one locale file or another can easily be missed when changes to designs / markup filter down
Breaking things down (to a realistic point, eg logical sentence structures but no finer) is somewhat hard work upfront but worth the effort.
Regarding string breakdown granularity, here's a sample of what we did:
comment.atom-details=Subscribe To Comments
comment.username-mandatory=You must supply your name
comment.useremail-mandatory=You must supply your email address
comment.email.notification=Dear {0}, the comment thread you are watching has been updated.
comment.feed.title=Comments on {0}
comment.feed.title.default=Comments
comment.feed.entry.title=Comment on {0} at {1,date,medium} {2,time,HH:mm} by {3}
comment.atom-details=Suscribir a Comentarios
comment.username-mandatory=Debes indicar tu nombre
comment.useremail-mandatory=Debes indicar tu direcci\u00f3n de correo electr\u00f3nico
comment.email.notification=La conversaci\u00f3n que estas viendo ha sido actualizada
comment.feed.title=Comentarios sobre {0}
comment.feed.title.default=Comentarios
comment.feed.entry.title=Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}
So you can do interesting things with placeholder substitution in the message bundle, which may help you preserve a string's logical meaning while still allowing you to manipulate it mid-sentence.
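The {0} and {1,date,medium} placeholders in the bundle entries above are java.text.MessageFormat syntax. As a rough sketch of how such an entry gets filled in, with the two patterns inlined as constants instead of being loaded from a ResourceBundle, and with a made-up class name:

```java
import java.text.MessageFormat;
import java.util.Date;
import java.util.Locale;

class FeedTitles {
    // Copied from the comment.feed.entry.title entries; normally read from messages_xx.properties.
    static final String EN = "Comment on {0} at {1,date,medium} {2,time,HH:mm} by {3}";
    static final String ES = "Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}";

    static String entryTitle(String pattern, Locale locale, String post, Date when, String author) {
        // The same Date is passed twice: once for the date part, once for the time part.
        return new MessageFormat(pattern, locale).format(new Object[] {post, when, when, author});
    }
}
```

Calling FeedTitles.entryTitle(FeedTitles.ES, Locale.forLanguageTag("es"), "My Post", new Date(), "Ann") renders the date in the Spanish medium format. The translator is free to move {0} or {3} anywhere in the sentence without any code change, which is what keeps the string whole.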
As others have said, please never split the strings into segments. You will cause translators grief as they have to coerce their language syntax to the ad-hoc rules you inadvertently create. Often it will not be possible to provide a grammatically correct translation, especially if you reuse certain segments in different contexts.
Do not remove the markup, either.
Please do not assume professional translators work in Notepad :) Computer-aided translation (CAT) tools, such as the Trados suite, know about markup perfectly well. If the tagging is HTML, rather than some custom XML format, no special preparation is required. Trados will protect the tags from accidental modification, while still allowing changes where necessary. Note that certain elements of tags often need to be localized, e.g. alt text or some query strings, so just stripping all the markup won't do.
Best of all, unless you're working on a zero-budget personal project, consider contacting a localization vendor. Localization is a service just like web design. A competent vendor will help you pick the optimal solution/format for your project and guide you through the preparation of the source material and incorporating the localized result. And of course they and their translators will have all the necessary tools. (Full disclosure: I am a translator / localization specialist. And don't split up strings :)
First off, don't split up your strings. This makes it much harder for localizers to translate text because they can't see the entire string to translate.
I would probably try to use placeholders around the links:
Division X
That's how I did it when I was localizing a site into 30 languages. It's not perfect, but it works.
I don't think it's possible (or easy) to remove all markup from strings; you need to have a way to insert the URLs and any extra markup.
You should avoid breaking up your strings. Not only does this become a nightmare to translate, but it also makes grammatical assumptions which may not be correct in the target language.
While placeholders can be helpful for many things, I would not recommend using placeholders for URLs. Keeping the URL inside the translatable string allows you to customize it for different locales. After all, there is no sense sending users to an English-language page when their locale is Argentine Spanish!
