What library can I use for parsing words in Java?

I'm trying to detect the type of words by fitting them into a number of categories (date, year, time, name, punctuation, email, etc.). I wrote my own code to detect this (and it worked), but then I found libraries like ANTLR and JavaCC.
Is what I want to do a task for these libraries? If so, which one should I use? If not, is there something else I can use for this?
What are the recommendations? JavaCC, ANTLR, or something else? I see that JavaCC generates some classes, but it does some things I don't want, like tokenization.

It depends on how powerful a parser you need. If you need something very powerful (such as JavaCC or ANTLR), go with one of them and don't spend too much time trying to make your own.
If you need something simple, you can build a simple dictionary-lookup parser with little more than regular expressions in Java, or maybe even StringTokenizer (if your input is very simplistic).
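For the categories in the question (date, year, time, email, punctuation and so on), a handful of regular expressions checked in order may already be enough. A rough sketch; the patterns below are illustrative only, nowhere near exhaustive:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    // A toy regex-based token classifier. Patterns are checked in insertion
    // order, so more specific categories must come before more general ones.
    public class WordClassifier {

        private static final Map<String, Pattern> CATEGORIES = new LinkedHashMap<>();
        static {
            CATEGORIES.put("email",       Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+"));
            CATEGORIES.put("date",        Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}"));
            CATEGORIES.put("time",        Pattern.compile("\\d{1,2}:\\d{2}(:\\d{2})?"));
            CATEGORIES.put("year",        Pattern.compile("(19|20)\\d{2}"));
            CATEGORIES.put("number",      Pattern.compile("\\d+"));
            CATEGORIES.put("punctuation", Pattern.compile("\\p{Punct}+"));
            CATEGORIES.put("word",        Pattern.compile("\\p{L}+"));
        }

        public static String classify(String token) {
            for (Map.Entry<String, Pattern> e : CATEGORIES.entrySet()) {
                if (e.getValue().matcher(token).matches()) {
                    return e.getKey();
                }
            }
            return "unknown";
        }

        public static void main(String[] args) {
            for (String t : "john@example.com 31/12/2024 10:30 1999 hello !".split("\\s+")) {
                System.out.println(t + " -> " + classify(t));
            }
        }
    }

The LinkedHashMap keeps the patterns in priority order, so "1999" is classified as a year rather than a plain number.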

Related

Tool for creating own rules for word lemmatization and similar tasks

I'm doing a lot of natural language processing with somewhat unusual requirements. I often get tasks similar to lemmatization: given a word (or just a piece of text), I need to find some pattern and transform the word somehow. For example, I may need to correct misspellings, e.g. given the word "eatin" I need to transform it to "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha", and so on.
So I'm looking for a generic tool that allows me to define transformation rules for such cases. Rules may look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is, I need to be able to use captured patterns from the left side on the right side.
I work with linguists who don't know programming at all, so ideally this tool should use external files and a simple language for the rules.
I'm doing this project in Clojure, so ideally the tool should be a library for one of the JVM languages (Java, Scala, Clojure), but other languages or command-line tools are OK too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, but I'm not an expert in all of them, so I may have missed the tool I need. If so, please let me know.
Note that I'm working with several languages and performing very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for specific languages do not fit my needs - I really need a more generic tool.
UPD. It seems I need to give some more details/examples of what I need.
Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()) but with the possibility of using the captured text in the replacement string. For example, in real-world text people often repeat characters to put emphasis on a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace such repeated "oooo"s with a single character. So there may be a rule like this (in a syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
that is, replace a word starting with some chars (chars1), followed by at least 3 repetitions of <char>, and possibly ending with some other chars (chars2), with a similar string, but with only a single <char>. The key point here is that we capture <char> on the left side of the rule and use it on the right side.
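For this particular case, Java's built-in String.replaceAll() can already do it, since a capture can be reused both in the pattern (as \1) and in the replacement (as $1):

    public class CollapseRepeats {
        public static void main(String[] args) {
            String input = "This film is soooo boooring...";
            // (\\w) captures one word character; \\1{2,} matches two or more
            // further repetitions of that same character; $1 reuses the capture
            // in the replacement, so "soooo" collapses to "so".
            String collapsed = input.replaceAll("(\\w)\\1{2,}", "$1");
            System.out.println(collapsed); // This film is so boring...
        }
    }

But I need this to be driven by external rule files that non-programmers can edit, not by patterns hard-coded in the source.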
I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for representing stemming algorithms. Its stemmer is used in the Lucene search engine.
I've found http://userguide.icu-project.org/transforms/general to be useful as well for general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.
You can just load up rules from a file into a String and register them, etc.
http://userguide.icu-project.org/transforms/general/rules
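For example, the aha(ha)+ -> ahaha rule from the question can be registered from a plain string (which you could just as easily read from a rules file your linguists maintain). A minimal sketch, assuming the ICU4J jar is on the classpath; see the linked pages for the exact rule syntax:

    import com.ibm.icu.text.Transliterator;

    // A minimal sketch using ICU4J's rule-based transforms. Unquoted spaces
    // in the rule string are ignored, so this reads: literal "aha" followed
    // by one or more "ha" groups is replaced by "ahaha".
    public class RuleTransform {
        public static void main(String[] args) {
            String rules = "aha (ha)+ > ahaha ;";
            Transliterator t = Transliterator.createFromRules(
                    "Collapse-Laughter", rules, Transliterator.FORWARD);
            System.out.println(t.transliterate("ahahahaha, so funny")); // ahaha, so funny
        }
    }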

preferred language/technique for sequence processing or parsing

I have come across similar problems a few times in the past and want to know what language (or methodology), if any, is used to solve such problems (I am a J2EE/Java developer):
Problem: out of a probable set of words, with a given rule (say each word is a combination of A and X, always starts with an X, and words are delimited by spaces), you have to read a sequence of words and parse through the input to decide which of the words are syntactically correct. In a nutshell, these are problems that involve parsing techniques. Say, simulate the logic of a vending machine in Java.
So what I want to know is: what are the techniques/best approaches for solving problems pertaining to parsing inputs, like the alien language processing problem in Google Code Jam?
Google Code Jam problem
Do we use something like ANTLR or some library in Java?
I know this question is slightly generic, but I had no other way of expressing it.
P.S.: I do not want a solution; I am looking for the best way to solve such recurring problems.
You can use JavaCC for complex parsing.
For relatively simple parsing and event processing, I use enum(s) as a state machine, especially as a push parser.
For very simple parsing, you can use indexOf or split(" ") combined with equals, switch or startsWith.
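A minimal sketch of the enum-as-state-machine idea, using the vending machine from the question as the toy domain (the states and events are invented for illustration):

    // Events are "pushed" into the machine one at a time; the enum itself
    // encodes which transitions are legal in each state.
    public class VendingMachine {

        enum Event { COIN, SELECT, DISPENSE }

        enum State {
            IDLE {
                State on(Event e) { return e == Event.COIN ? PAID : reject(e); }
            },
            PAID {
                State on(Event e) { return e == Event.SELECT ? DISPENSING : reject(e); }
            },
            DISPENSING {
                State on(Event e) { return e == Event.DISPENSE ? IDLE : reject(e); }
            };

            abstract State on(Event e);

            State reject(Event e) {
                throw new IllegalStateException("unexpected " + e + " in state " + this);
            }
        }

        private State state = State.IDLE;

        // Push one event into the machine.
        public void push(Event e) {
            state = state.on(e);
        }

        public static void main(String[] args) {
            VendingMachine vm = new VendingMachine();
            vm.push(Event.COIN);
            vm.push(Event.SELECT);
            vm.push(Event.DISPENSE);
            System.out.println("back to " + vm.state); // back to IDLE
        }
    }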
If you want to simulate the logic of something that is essentially a finite state automaton, you can simply code the FSA by hand. This is a standard computer science solution. A less obvious way to do this is to use a lexer generator (there are lots of them) to generate the FSA from descriptions of the valid sequences of events (in lexer-generator speak, these are called "characters", but you can cheat and substitute event occurrences for characters).
If you have complex recursive rules about matching, you'll want a more traditional parser.
You can code these by hand, too, if the grammar isn't complicated; see my SO answer on "how to build a recursive descent parser". If your grammar is complex or changes quickly, you'll want to use a standard parser generator. Other answers here suggest specific ones, but there are many to choose from, all generally very capable.
[FWIW, I applied parser generators to recognizing valid transaction sequences in 1974, in TRW POS terminals at the May Company department store. Worked pretty well.]
You can use ANTLR, which is good; it will help with complex problems. But you can also use regular expressions, e.g. split("\\s+").

Alternative to XSLT?

On my project I have a huuuuge XSLT stylesheet used to convert some XML files to HTML.
The problem is that this file is growing day by day; it's hard to read, debug and test.
So I was thinking about moving the whole transformation to Java.
Do you think that is a good idea? If so, what libraries for parsing XML and generating HTML (XML) do you suggest? Will performance be better or worse?
If it's not a good idea, any alternative ideas?
Thanks,
Randomize
Take a look at CDuce - it is a strictly typed, statically compiled XML processing language.
I once had a client with a similar problem - thousands of lines of XSLT, growing all the time. I spent an hour reading it with increasing incredulity, then rewrote it in 20 lines of XSLT.
Refactoring is often a good idea, and the worse the code is, the more worthwhile refactoring is. But there's no reason to believe that just because the code is bad and in need of refactoring, you need to change to a different programming language. XSLT is actually very good at handling variety and complexity if you know how to use it properly.
It's possible that the code is an accumulation of special handling of special cases, and each new special case discovered results in more rules being added. That's a tough problem to tackle in any language, but XSLT can deal with it better than most, provided you apply your mind all the time to finding abstract general rules that encompass all the special rules, so you only need to code the special rules as exceptions.
I'd consider Velocity as an alternative; I prefer it to XSLT. Transforms are harder to write than templates, because the latter look exactly like the XML I wish to produce. It's a simple thing to add in the markup to map in the data.
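A minimal sketch of what that looks like, assuming the Velocity jar is on the classpath; the inline template string here stands in for a .vm file on disk:

    import java.io.StringWriter;
    import org.apache.velocity.VelocityContext;
    import org.apache.velocity.app.Velocity;

    // The template reads like the HTML you want to produce, with
    // $placeholders and #foreach filled in from the context.
    public class VelocityExample {
        public static void main(String[] args) throws Exception {
            Velocity.init();
            VelocityContext ctx = new VelocityContext();
            ctx.put("title", "Report");
            ctx.put("items", java.util.Arrays.asList("one", "two"));

            String template =
                "<h1>$title</h1>\n" +
                "<ul>\n" +
                "#foreach($item in $items)  <li>$item</li>\n#end" +
                "</ul>";

            StringWriter out = new StringWriter();
            Velocity.evaluate(ctx, out, "example", template);
            System.out.println(out);
        }
    }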

Text processing / comparison engine

I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into a set of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality differs. This sort of technique seems to be used in full-text search, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like. And just splitting on whitespace doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non-stop-words of the two documents might not be the best approach for good similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those; see the sketch below. There are algorithms that let you compare many documents at the same time this way, much more efficiently than doing a pairwise comparison between each pair of documents.
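A minimal sketch of that chunking idea, using overlapping n-token shingles and Jaccard overlap between the shingle sets (real systems hash the shingles and use smarter multi-document indexes):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class ShingleSimilarity {

        // Break a text into overlapping n-token chunks ("shingles").
        static Set<String> shingles(String text, int n) {
            String[] tokens = text.toLowerCase().split("\\W+");
            Set<String> out = new HashSet<>();
            for (int i = 0; i + n <= tokens.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
            return out;
        }

        // Jaccard similarity: |intersection| / |union| of the two sets.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            String doc1 = "the quick brown fox jumps over the lazy dog";
            String doc2 = "a quick brown fox jumped over a lazy dog";
            System.out.println(jaccard(shingles(doc1, 3), shingles(doc2, 3)));
        }
    }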
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
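To make the idea concrete, here is a crude suffix-stripping toy - nothing like the real Porter algorithm, which has many more rules and conditions - just to show what getting rid of suffixes buys you before comparison:

    // A crude stand-in for a real stemmer: "jumps", "jumped", "jumping"
    // all reduce to "jump", so they compare as equal afterwards.
    public class CrudeStemmer {
        private static final String[] SUFFIXES = { "ing", "ed", "es", "s" };

        public static String stem(String word) {
            for (String suffix : SUFFIXES) {
                if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }

        public static void main(String[] args) {
            System.out.println(stem("jumping")); // jump
            System.out.println(stem("jumped"));  // jump
            System.out.println(stem("jumps"));   // jump
        }
    }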
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.

How do I make my own parser for java/jsf code?

Hi, I'd like to make my own 'parser', e.g. for computing (4+(3-4^2))*2 or for parsing Java, JSF or HTML code.
In fact I did something like this already, but I feel it's not good.
Is there anything good for me to use? I've tried to read more, but I'm a bit confused: LL, LR, AST, BNF, JavaCC, yacc, etc. :). I'm not sure which way to go when I'd like to compute 4+..., or if I'd like to parse Java/JSF code and produce something from it (other Java code).
Is there anything generally good enough, like an AST, or something I can use for both?
Thank you for your help.
Before anything else, you have to understand that everything about parsing is based on grammars.
Grammars describe the language you want to implement in terms of how to decompose the text into basic units and how to stack those units in some meaningful way. You may also want to look up the concepts of token, terminal and non-terminal.
Differences between LL and LR can be of two kinds: implementation differences, and grammar writing differences. If you use a standard tool you only need to understand the second part.
I usually use LL (top-down) grammars. They are simpler to write and to implement even using custom code.
LR grammars theoretically cover more kinds of languages, but in a normal situation they are just a hindrance when you need correct error detection.
Some random pointers:
javacc (java, LL),
antlr (java, LL),
yepp (smarteiffel, LL),
bison (C, LR, GNU version of the venerable yacc)
Parsers can be pretty intense to write. The standard tools are bison or yacc for the grammar and flex for the lexical analysis. These all output code in C or C++.
ANTLR is probably the way to go for Java. It is a little intense; the book is apparently very good (I have only struggled with the online docs).
If you can stretch to other languages, then lex/yacc (or flex/bison) is the standard for C, although I wouldn't particularly recommend either of those combinations (steep learning curve, and showing their age a little now).
Python has about a million parsers available (SimpleParse, Yapps), or there is TreeTop for Ruby - the developer even has a demo that does simple calculations as in your question - but note that this won't do everything that an LALR parser can accomplish.
If it is a learning exercise, try starting with a top-down parser - they are simple to write and don't require including/learning any other tools. The best place to research the basics is probably Wikipedia or CodeProject.
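To give a flavour of that, here is a minimal hand-written recursive descent parser for expressions like the (4+(3-4^2))*2 asked about above: one method per grammar rule, with operator precedence coming from the call structure (no error handling, to keep it short):

    public class Calc {
        private final String src;
        private int pos;

        Calc(String src) { this.src = src.replaceAll("\\s+", ""); }

        // expr := term (('+' | '-') term)*
        double expr() {
            double v = term();
            while (peek() == '+' || peek() == '-') {
                v = next() == '+' ? v + term() : v - term();
            }
            return v;
        }

        // term := factor (('*' | '/') factor)*
        double term() {
            double v = factor();
            while (peek() == '*' || peek() == '/') {
                v = next() == '*' ? v * factor() : v / factor();
            }
            return v;
        }

        // factor := base ('^' factor)?   -- '^' is right-associative
        double factor() {
            double v = base();
            if (peek() == '^') {
                next();
                v = Math.pow(v, factor());
            }
            return v;
        }

        // base := number | '(' expr ')'
        double base() {
            if (peek() == '(') {
                next();                     // consume '('
                double v = expr();
                next();                     // consume ')'
                return v;
            }
            int start = pos;
            while (Character.isDigit(peek()) || peek() == '.') pos++;
            return Double.parseDouble(src.substring(start, pos));
        }

        char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
        char next() { return src.charAt(pos++); }

        public static void main(String[] args) {
            System.out.println(new Calc("(4+(3-4^2))*2").expr()); // -18.0
        }
    }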
ANTLR, but make sure you read The Definitive ANTLR Reference, which will walk you through the creation of parsers. ANTLR does top-down, LL parsers, so the book doesn't address LALR and other types.
JavaCC, Yacc and SableCC are more traditional lexer/parser generators, and you'll find that they're a little more primitive and have steeper learning curves. ANTLR is equally powerful, but you don't have to learn it all at once. Wikipedia offers a comprehensive comparison of parser generators.
BNF is a syntax for specifying the grammar; ANTLR uses its own, which I find more aesthetic but which others often don't.
You might want to check out http://antlr.org/. It will output Java code. If I recall correctly, one of their samples is pretty much what you want.
You might want to check out Building Parsers With Java by Steven John Metsker. The book seems to cover exactly what you are looking to do.
Using tools that generate lexers and parsers is generally far easier than writing your own from scratch.
In addition to what's already been listed, you could use something like JLex with CUP to create a simple interpreter for things like arithmetic expressions very easily.
