How to find features of a word variant (using SimpleNLG)? - java

My idea is to, given a word variant and a base form from another word, reproduce the features from the word variant in the base form.
I've been able to produce a word variant from a base form given a set of features; my problem lies in gathering these features from the original word variant.
So far, my workaround is to use Stanford Parser and filter the POS tags of the word variant, thus recovering some (but not most) features. Then using SimpleNLG I'm able to create the new word variant.
Any other tools or libraries for Java, that provide these functionalities, are also welcome.
Thanks in advance

SimpleNLG is, as the name suggests, simple. You may want to take a look at libraries that handle language at a more semantic level. One notable example is OpenCCG (http://openccg.sourceforge.net/). This is going to be a bit of work though.

Related

Scanning texts for specific words

I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm more looking for a general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.
Train a simple word2vec language model with your whole job description text data. Then use your own logic to find the keywords. When you find a match, if it's not an exact match use your similar words list.
For example, if you're searching for Java but also find JavaScript, use your word vectors to check whether there is any similarity between them (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, have not.
It may seem a bit like over-engineering, but it's not :).
I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.
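For comparison, the rule-based option mentioned above (listing synonyms for each keyword) can be sketched roughly as follows. The alias table, the class name `KeywordScanner` and the tokenization rule are all assumptions made for illustration, not something from the posts above. The text is split into whole tokens, so "JavaEE" counts as Java only because it is listed as an alias, while "JavaScript" does not match.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class KeywordScanner {
    // Hypothetical alias table: each canonical keyword lists the tokens that count as a hit.
    static final Map<String, List<String>> ALIASES = Map.of(
        "Java", List.of("java", "javaee", "javase", "jdk"),
        "Angular", List.of("angular", "angularjs"),
        "Docker", List.of("docker"));

    // Returns the canonical keywords whose aliases occur as whole tokens in the text.
    static Set<String> scan(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        // Split on anything that is not a letter, digit, '+' or '#' (an assumed tokenization rule).
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}+#]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        Set<String> found = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : ALIASES.entrySet()) {
            for (String alias : entry.getValue()) {
                if (tokens.contains(alias)) {
                    found.add(entry.getKey());
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan("We use JavaEE and JavaScript with Docker."));
    }
}
```

As the answer says, the alias table has to be maintained by hand, which is exactly the maintenance cost this approach carries.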

File names in web development: What should one use as a word separator in complex file names?

I ask myself what I should use as a separator for complex file names like, for example, "Monthly Project Report".
I see a lot of people using a hyphen, e.g. 'monthly-project-report.php'.
But I have some doubts about that, because a hyphen can be mistaken for an arithmetic minus in programming.
Wouldn't it be better to use an underscore (_)?
So which separator is appropriate to use?
That is not a problem. You'd better use (_) if you are not going to show the file in the Google search engine. Most developers normally use (-) for Google SEO targeting.
You should use (_), since, as you mentioned, someone may wonder whether it is a minus :)

Tool for creating own rules for word lemmatization and similar tasks

I'm doing a lot of natural language processing with somewhat unusual requirements. Often I get tasks similar to lemmatization: given a word (or just a piece of text) I need to find some patterns and transform the word somehow. For example, I may need to correct misspellings, e.g. given the word "eatin" I need to transform it to "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha" and so on.
So I'm looking for some generic tool that allows defining transformation rules for such cases. Rules may look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is I need to be able to use captured patterns from the left side on the right side.
I work with linguists who don't know programming at all, so ideally this tool should use external files and simple language for rules.
I'm doing this project in Clojure, so ideally this tool should be a library for one of JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, and I'm not expert in all of them, so I could miss the tool I need there. If so, please let me know.
Note, that I'm working with several languages and perform very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for concrete languages do not fit my needs - I really need more generic tool.
UPD. It seems like I need to give some more details/examples of what I need.
Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()) but with the possibility to use the captured text in the replacement string. For example, in real-world text people often repeat characters to put emphasis on a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace these repetitive "oooo" with only a single character. So there may be a rule like this (in syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
That is, replace a word starting with some chars (chars1), followed by at least 3 <char>s and possibly ending with some other chars (chars2), with a similar string that has only a single <char>. The key point here is that we catch <char> on the left side of a rule and use it on the right side.
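For what it's worth, Java's own regex replacement already accepts group references such as `$1` in the replacement string, and backreferences such as `\1` in the pattern. A minimal sketch of a rule table built on that follows. The `regex -> replacement` line format, the `#` comment convention and the class name `RewriteRules` are assumptions for illustration. In a real setup the lines would come from an external file, where the backslashes are written singly rather than doubled as in the Java string literals here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RewriteRules {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> replacements = new ArrayList<>();

    // Parses lines of the assumed form "regex -> replacement".
    // Blank lines and lines starting with '#' are skipped.
    static RewriteRules parse(List<String> lines) {
        RewriteRules rules = new RewriteRules();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Split at the first "->"; a regex that itself contains "->" is not supported here.
            String[] parts = trimmed.split("\\s*->\\s*", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.patterns.add(Pattern.compile(parts[0]));
            rules.replacements.add(parts[1]);
        }
        return rules;
    }

    // Applies every rule once, in file order.
    String apply(String text) {
        String result = text;
        for (int i = 0; i < patterns.size(); i++) {
            result = patterns.get(i).matcher(result).replaceAll(replacements.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        RewriteRules rules = RewriteRules.parse(List.of(
            "# collapse a letter repeated three or more times",
            "(\\p{L})\\1{2,} -> $1",
            "\\b(\\w+)in\\b -> $1ing",
            "\\baha(ha)+\\b -> ahaha"));
        System.out.println(rules.apply("This film is soooo boooring...")); // This film is so boring...
        System.out.println(rules.apply("eatin"));                          // eating
        System.out.println(rules.apply("ahahahaha"));                      // ahaha
    }
}
```

The rules are deliberately naive: the second one would also rewrite any other word ending in "in", so real rule sets need tighter patterns or a word list.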
I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for representing stemming algorithms. Its stemmer is used in the Lucene search engine.
I've found http://userguide.icu-project.org/transforms/general to be useful as well for some general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.
You can just load up rules from a file into a String, register them, etc.
http://userguide.icu-project.org/transforms/general/rules

preferred language/technique for sequence processing or parsing

I have come across similar problems a few times in the past and want to know what language (methodology) if any is used to solve similar problems (I am a J2EE/java developer):
problem: Out of a probable set of words, with a given rule (say the word can be a combination of A and X, and always starts with an X, each word is delimited by a space), you have to read a sequence of words and parse through the input to decide which of the words are syntactically correct. In a nutshell, these are problems that involve parsing techniques. Say, simulate the logic of a vending machine in Java.
So what I want to know is: what are the techniques/best approaches to solve problems pertaining to parsing inputs? Like the alien language processing problem in Google Code Jam.
Google code jam problem
Do we use something like ANTLR or some library in Java?
I know this question is slightly generic, but I had no other way of expressing it.
P.S: I do not want a solution, I am looking for best way to solve such recurring problems.
You can use JavaCC for complex parsing.
For relatively simple parsing and event processing I use enum(s) as a state machine, especially as a push parser.
For very simple parsing, you can use indexOf or split(" ") with equals, switch or startsWith.
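A rough sketch of the enum-as-state-machine idea, applied to the example rule from the question (a word is a mix of A and X and always starts with X). The state names and the class name `WordValidator` are made up for illustration. Characters are pushed in one at a time, which is what makes it a push parser.

```java
import java.util.ArrayList;
import java.util.List;

class WordValidator {
    // States of a hand-written automaton for the example rule: X followed by any mix of A and X.
    enum State {
        START {
            @Override State next(char c) { return c == 'X' ? IN_WORD : INVALID; }
        },
        IN_WORD {
            @Override State next(char c) { return (c == 'A' || c == 'X') ? IN_WORD : INVALID; }
        },
        INVALID {
            @Override State next(char c) { return INVALID; }
        };

        abstract State next(char c);
    }

    private State state = State.START;
    private final StringBuilder current = new StringBuilder();
    private final List<String> valid = new ArrayList<>();

    // Push one character; a space ends the current word.
    void push(char c) {
        if (c == ' ') {
            endWord();
            return;
        }
        current.append(c);
        state = state.next(c);
    }

    // Signal end of input so that the last word is checked as well.
    void end() {
        endWord();
    }

    private void endWord() {
        if (state == State.IN_WORD) {
            valid.add(current.toString());
        }
        current.setLength(0);
        state = State.START;
    }

    // Convenience wrapper: feed a whole string and return the words that satisfy the rule.
    static List<String> validate(String input) {
        WordValidator v = new WordValidator();
        for (char c : input.toCharArray()) {
            v.push(c);
        }
        v.end();
        return v.valid;
    }

    public static void main(String[] args) {
        System.out.println(validate("XAX AXX XXA X B")); // [XAX, XXA, X]
    }
}
```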
If you want to simulate the logic of something that is essentially a finite state automaton, you can simply code the FSA by hand. This is a standard computer science solution. A less obvious way to do this is to use a lexer-generator (there are lots of them) to generate the FSA from descriptions of the valid sequences of events (in lexer-generator speak, these are called "characters", but you can cheat and substitute event occurrences for characters).
If you have complex recursive rules about matching, you'll want a more traditional parser.
You can code these by hand, too, if the grammar isn't complicated; see my SO answer on "how to build a recursive descent parser". If your grammar is complex or it changes quickly, you'll want to use a standard parser generator. Other answers here suggest specific ones but there are many to choose from, all generally very capable.
[FWIW, I applied parser generators to recognizing valid transaction sequences in 1974 in TRW POS terminals for the May Company department store. Worked pretty well.]
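To give a feel for what a hand-coded recursive descent parser looks like, here is a minimal sketch for a small, made-up grammar (integers, `+`, `-` and parentheses). The grammar and the class name `ExprParser` are assumptions chosen for illustration, not taken from the answer above. Each grammar rule becomes one method, and nesting is handled by the methods calling each other recursively.

```java
class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) {
        // Whitespace carries no meaning in this toy grammar, so drop it up front.
        this.src = src.replaceAll("\\s+", "");
    }

    // Parses and evaluates the whole input; throws IllegalArgumentException on a syntax error.
    static int evaluate(String input) {
        ExprParser p = new ExprParser(input);
        int value = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private int expr() {
        int value = term();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            char op = src.charAt(pos++);
            int rhs = term();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // term := NUMBER | '(' expr ')'
    private int term() {
        if (pos < src.length() && peek() == '(') {
            pos++;
            int value = expr();
            expect(')');
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(peek())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Number expected at position " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() {
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c) {
            throw new IllegalArgumentException("'" + c + "' expected at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("1 + (2 - (3 + 4)) + 10")); // 6
    }
}
```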
You can use ANTLR, which is good and will help with complex problems. But you can also use regular expressions, e.g. split("\\s+").
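For the example rule from the question, the regular-expression route might look like the sketch below. The pattern `X[AX]*` is one reading of "a combination of A and X that always starts with X", and the class name `RegexWordFilter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexWordFilter {
    // Assumed encoding of the rule: an X followed by any number of A or X characters.
    private static final Pattern VALID = Pattern.compile("X[AX]*");

    // Splits the input on runs of whitespace and keeps the words that match the rule entirely.
    static List<String> validWords(String input) {
        List<String> result = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (VALID.matcher(word).matches()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(validWords("XAX AXX  XXA")); // [XAX, XXA]
    }
}
```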

Text processing / comparison engine

I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into a set of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality is different. This sort of technique seems to be used in full-text searches, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some work with finite state machines that was amazing, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like. And just splitting on white space doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non stop-words of the two documents might not be the best approach for good similarity depending on what you are actually trying to do because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms such that you can compare many documents at the same time in this way much more efficiently than doing a pairwise comparison between each document.
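The chunk-of-n-tokens idea can be sketched roughly like this: build the set of overlapping n-token chunks for each document and compare the sets. The Jaccard measure used for the comparison, the crude split-on-non-letters tokenizer and the class name `ShingleSimilarity` are assumptions for illustration. As noted above, real tokenization is trickier than this, and the efficient many-document algorithms are not shown here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ShingleSimilarity {
    // Builds the set of overlapping n-token chunks of the text.
    static Set<String> shingles(String text, int n) {
        List<String> words = new ArrayList<>();
        // Crude tokenizer (an assumption): lower-case and split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> first = shingles("The quick brown fox", 2);
        Set<String> second = shingles("The quick brown dog", 2);
        System.out.println(jaccard(first, second)); // 0.5
    }
}
```

Running the tokens through a stemmer before building the chunks would combine this with the root-matching the question asks about.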
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.
