Retrain Stanford CoreNLP Lemmatizer - Java

I am trying to work out how the lemmatizer could support the particular use case of identifying e.g. certain regional variants in a corpus. The two approaches I could follow are:
amend the existing dictionary to use my own material, or
retrain the lemmatizer
I am wondering if you can point me to documentation that I can follow for either approach. I am aware that the Lemma module works in this way:
StanfordCoreNLP pipeline = new StanfordCoreNLP(PropertiesUtils.asProperties(
        "annotators", "tokenize,ssplit,pos,lemma",
        "ssplit.isOneSentence", "true",
        "tokenize.language", "en"));
Apart from CoreAnnotations.LemmaAnnotation, I wasn't able to find anything relevant to what I would like to do. I'd appreciate any help you can provide.
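For the first approach (amending the dictionary), one low-effort alternative to retraining is to post-process the pipeline's lemmas with your own lookup table of regional variants. A minimal sketch of that idea using only a plain map; the variant entries here are made-up placeholders, and a real dictionary would be loaded from your own material:

```java
import java.util.HashMap;
import java.util.Map;

public class CustomLemmaDictionary {
    // Hypothetical regional-variant entries; replace with your own data.
    private final Map<String, String> overrides = new HashMap<>();

    public CustomLemmaDictionary() {
        overrides.put("colour", "color");
        overrides.put("colours", "color");
        overrides.put("analysed", "analyse");
    }

    // Return the custom lemma if the surface form is in the dictionary,
    // otherwise fall back to whatever the default lemmatizer produced.
    public String lemma(String word, String defaultLemma) {
        return overrides.getOrDefault(word.toLowerCase(), defaultLemma);
    }

    public static void main(String[] args) {
        CustomLemmaDictionary dict = new CustomLemmaDictionary();
        System.out.println(dict.lemma("Colours", "colour")); // color
        System.out.println(dict.lemma("running", "run"));    // run
    }
}
```

The map lookup would run over each token after the `lemma` annotator, so the built-in model stays untouched.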
Many thanks

Related

Scanning texts for specific words

I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm more looking for a general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.
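One lightweight middle ground before reaching for machine learning is plain whole-token matching plus a small alias table that maps known compounds to their base skill. A sketch under that assumption; the alias entries and the token pattern below are illustrative, not a fixed vocabulary:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkillMatcher {
    // Map each token we expect in job ads to the skill it counts as:
    // "JavaEE" counts as "Java", but "JavaScript" stays its own skill,
    // so a search for Java will not fire on JavaScript.
    private static final Map<String, String> ALIASES = Map.of(
            "Java", "Java",
            "JavaEE", "Java",
            "JavaScript", "JavaScript",
            "Angular", "Angular",
            "Docker", "Docker");

    public static Set<String> findSkills(String text) {
        Set<String> found = new TreeSet<>();
        // Grab whole tokens such as "JavaEE" (letters, digits, + and #).
        Matcher m = Pattern.compile("\\p{L}[\\p{L}\\p{N}+#]*").matcher(text);
        while (m.find()) {
            String skill = ALIASES.get(m.group());
            if (skill != null) found.add(skill);
        }
        return found;
    }
}
```

The alias table is the explicit case distinction you mentioned, but kept in one data structure that non-programmers can extend, rather than scattered through the matching logic.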
Train a simple word2vec language model on your whole job-description text data. Then use your own logic to find the keywords; when a match is not exact, consult your list of similar words.
For example, if you're searching for Java but also find JavaScript, use your word vectors to check whether there is any similarity between them (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, have not.
It may seem a bit like over-engineering, but it's not :).
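The "use your word vectors" step boils down to a cosine-similarity check between the two embeddings. A minimal sketch; the tiny vectors here are made-up stand-ins for real word2vec output:

```java
public class CosineSimilarity {
    // Cosine similarity between two equal-length vectors:
    // dot(a, b) / (|a| * |b|), in [-1, 1]; values near 1 mean the
    // two words appeared in very similar contexts.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Made-up 3-d vectors standing in for "Java", "JavaEE", "JavaScript".
        double[] java = {0.9, 0.1, 0.3};
        double[] javaEE = {0.8, 0.2, 0.4};
        double[] javascript = {0.1, 0.9, 0.2};
        System.out.println(cosine(java, javaEE));     // close to 1
        System.out.println(cosine(java, javascript)); // much lower
    }
}
```

With real embeddings you would pick a similarity threshold empirically and treat near-matches above it as the same skill.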
I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.

How to find features of a word variant (using SimpleNLG)?

My idea is to, given a word variant and a base form from another word, reproduce the features from the word variant in the base form.
I've been able to produce a word variant from a base form given a set of features; my problem lies in gathering these features from the original word variant.
So far, my workaround is to use Stanford Parser and filter the POS tags of the word variant, thus recovering some (but not most) features. Then using SimpleNLG I'm able to create the new word variant.
Any other tools or libraries for Java, that provide these functionalities, are also welcome.
Thanks in advance
SimpleNLG is, as the name suggests... simple. You may want to take a look at libraries that handle language at a more semantic level. A notable example is OpenCCG (http://openccg.sourceforge.net/). This is going to be a bit of work, though.

extracting the meaning of a sentence

Is there some java library that helps extract the content of a sentence/paragraph?
Essentially what I need to do is get a context of what is being said (such as whether the sentence is providing a positive or negative point and that sort of thing).
I don't know of such a system and have been looking around, but I have not been able to find anything useful. Does anyone know of something that might help with this?
thanks
Use GATE (https://gate.ac.uk/), an NLP & machine learning tool.
You can use ANNIE for splitting sentences and POS tagging.
You have to prepare a training dataset with sentiments already annotated manually, and then use the Batch Learning plugin to predict sentiment for new documents.
Step-by-step tutorial for this: https://gate.ac.uk/sale/talks/gate-course-may10/track-3/module-11-ml-adv/module-11-sentiment.pdf
And the example talked about in the pdf: https://gate.ac.uk/sale/talks/gate-course-may10/track-3/module-11-ml-adv/module-11-sentiment.zip

Tool for creating own rules for word lemmatization and similar tasks

I'm doing a lot of natural language processing with somewhat unusual requirements. Often I get tasks similar to lemmatization: given a word (or just a piece of text), I need to find some patterns and transform the word somehow. For example, I may need to correct misspellings, e.g. given the word "eatin" I need to transform it to "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha", and so on.
So I'm looking for some generic tool that allows me to define transformation rules for such cases. Rules may look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is, I need to be able to use captured patterns from the left side on the right side.
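Both example rules map directly onto regular expressions with capture groups, which Java's String.replaceAll already supports via $1 backreferences. A minimal sketch:

```java
public class RewriteRules {
    public static void main(String[] args) {
        // {w}in -> {w}ing : capture the stem and reuse it via $1.
        // (A real rule set would need guards so e.g. "thin" is not rewritten.)
        System.out.println("eatin".replaceAll("(\\w+)in\\b", "$1ing"));  // eating
        // aha(ha)+ -> ahaha : collapse any longer run to the base form.
        System.out.println("ahahahaha".replaceAll("aha(ha)+", "ahaha")); // ahaha
    }
}
```

So a generic engine mostly needs to translate the friendlier rule syntax into regex patterns and replacements.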
I work with linguists who don't know programming at all, so ideally this tool should use external files and simple language for rules.
I'm doing this project in Clojure, so ideally this tool should be a library for one of JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, and I'm not expert in all of them, so I could miss the tool I need there. If so, please let me know.
Note, that I'm working with several languages and perform very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for concrete languages do not fit my needs - I really need more generic tool.
UPD. It seems like I need to give some more details/examples of what I need.
Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()) but with possibility to use caught text in replacement string. For example, in real world text people often repeat characters to make emphasis on particular word, e.g. someoone may write "This film is soooo boooring...". I need to be able to replace these repetitive "oooo" with only single character. So there may be a rule like this (in syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
that is, replace a word starting with some chars (chars1), containing a run of at least three identical characters (<char>), and possibly ending with some other chars (chars2), with a similar string that keeps only a single <char>. The key point here is that we capture <char> on the left side of a rule and use it on the right side.
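The repeated-character case is also expressible with a backreference, and the "caught text" reused on the right side is just $1. A sketch that additionally loads rules from a plain "pattern -> replacement" list, so linguists only ever edit the rule text; the rule file format here is an assumption for illustration, not an existing tool:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RuleEngine {
    // Each line is "regex -> replacement"; in practice these lines would
    // be read from an external text file that non-programmers maintain.
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\\s*->\\s*", 2);
            rules.put(parts[0], parts[1]);
        }
        return rules;
    }

    // Apply every rule in order to the whole text.
    public static String apply(String text, Map<String, String> rules) {
        for (Map.Entry<String, String> r : rules.entrySet()) {
            text = text.replaceAll(r.getKey(), r.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> rules = parse(List.of(
                "(\\w)\\1{2,} -> $1",  // soooo -> so (collapse runs of 3+)
                "aha(ha)+ -> ahaha")); // ahahahaha -> ahaha
        System.out.println(apply("This film is soooo boooring", rules));
    }
}
```

Note that requiring three or more repeats ({2,} after the backreference) leaves legitimate double letters such as "cool" untouched.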
I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for representing stemming algorithms. Its stemmer is used in the Lucene search engine.
I've found http://userguide.icu-project.org/transforms/general to be useful as well for general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.
You can just load up rules from a file into a String and register them, etc.
http://userguide.icu-project.org/transforms/general/rules

simple sentiment analysis with java

I am very new to sentiment analysis. How can I judge whether a given word or sentence is positive or negative? I have to implement it in Java. I tried to read about LingPipe and the RapidMiner tutorial, but I do not understand them. In their examples they use a lot of data; in my case I do not have much data. All I have is a word or a sentence, let's say. I tried to read the questions on Stack Overflow too, but they do not help me much.
Thanks in advance.
Computers don't know about a human thing like sentiment unless they learn it from examples that a human has labeled as positive or negative.
The goal of Machine Learning is in fact to make the most informed decision about a new example based on the empirical data of previous examples. Statistically, the more data, the better.
To "judge" the sentiment of a sentence, you'll need to have trained a model or classifier on some sentences labeled for sentiment. The classifier takes an unlabeled sentence as input and outputs a label: positive or negative.
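To make that input/output shape concrete, here is a toy classifier over a hand-labeled word list. The tiny lexicon below is a made-up placeholder standing in for a trained model; it only illustrates the interface, not a production approach:

```java
import java.util.Set;

public class ToySentimentClassifier {
    // Stand-in "model": in a real system these word weights would be
    // learned from sentences labeled by humans, not listed by hand.
    private static final Set<String> POSITIVE = Set.of("good", "great", "love", "excellent");
    private static final Set<String> NEGATIVE = Set.of("bad", "awful", "hate", "boring");

    // Unlabeled sentence in, label out: count positive vs negative tokens.
    public static String classify(String sentence) {
        int score = 0;
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) score++;
            if (NEGATIVE.contains(token)) score--;
        }
        return score >= 0 ? "positive" : "negative";
    }
}
```

A trained classifier replaces the hand-written sets with learned weights, but the calling code looks the same.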
First get training examples. I'm sure you can find some labeled sentiment data in the public domain. One of the best data set repositories is the UCI KDD Archive. You may then train a classifier on the data to judge new examples. There are a host of learning algorithm resources available. My favorites are jBoost, which can output a classifier as Java code, and Rapidminer, which is better for visual analysis.
You could use an existing web-service which is trained from prior data. For example:
Chatterbox Sentiment Detection API
Which has libraries for Java & Android.
(Disclosure: I work for the company that builds this API)
This is not really programming related (neuro-linguistic programming is not programming), and in general there is no reliable solution.
My best idea is to make it work like Google's "PigeonRank", i.e. collect words and sentences, then collect human feedback on whether they are positive or negative, and then use Bayesian matching with this data.
You can try to use WordNet to find a word's semantic orientation (SO), based on a "distance" calculation between your word and "good" or "bad" words. A shorter distance gives the word's SO. The results may be a bit weak, but not a lot of data (or time) is necessary for this approach.