Is anybody familiar with the RTF document format and parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:
Swing RTFEditorKit API
but it isn't that accurate when it comes to parsing RTF documents. In fact there's a comment in the API:
The RTF support was not written by the
Swing team. In the future we hope to
improve the support provided.
I don't think I'm going to wait for this to happen :)
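For reference, the RTFEditorKit approach mentioned above looks roughly like this (a minimal sketch; the file name is just an illustration):

```java
import java.io.FileInputStream;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfTextExtractor {
    public static void main(String[] args) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument doc = new DefaultStyledDocument();
        // Read the RTF stream into a styled document, then pull out the plain text.
        try (FileInputStream in = new FileInputStream("sample.rtf")) {
            kit.read(in, doc, 0);
        }
        System.out.println(doc.getText(0, doc.getLength()));
    }
}
```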
The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:
PMD Applied JavaCC Grammar
which is OK, and the following (which is the best so far):
Koders RTFParserDelegate and ETranslate Grammar
There are various implementations of the ETranslate grammar about (I know the Nutch API may use this). Does anybody know which is the most accurate grammar or whether there is a better approach to this?
I could start ploughing through the JavaCC docs to understand the .jj files and test them against the RTF files... this is my current approach, but it's taking a while... any help would be appreciated.
Does anybody know which is the most accurate grammar or whether there
is a better approach to this?
Many years ago I spent some time reading RTF (Wikipedia) with C#. I say reading because, if you understand RTF in detail and use it the way it was designed, you will realize that RTF is not meant to be read and parsed as a whole over and over again while editing. The documentation gives the syntax for RTF, but don't be misled into believing that you should use a lexer/parser; the documentation also provides a sample reader for RTF.
Remember that RTF was created many ages ago, when memory was measured in KB rather than MB, and editing long documents of several hundred pages in a conventional way would tax system resources. So RTF has the ability to be edited in smaller subsections without loading or modifying the entire document. This is what gives it the ability to work on such large documents with limited memory. It is also why the syntax may seem odd at first.
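To make that streaming idea concrete, here is a rough sketch of the kind of single-pass reader the RTF spec describes (not the spec's own sample reader); the handler hooks are hypothetical, and a real reader would also deal with control symbols, numeric parameters, destinations and \* groups:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

public class SimpleRtfReader {

    public void read(Reader source) throws IOException {
        PushbackReader in = new PushbackReader(source);
        int depth = 0;
        int c;
        while ((c = in.read()) != -1) {
            switch (c) {
                case '{':  depth++; break;                         // open a group
                case '}':  depth--; break;                         // close a group
                case '\\': handleControlWord(readControlWord(in)); break;
                default:   handleText((char) c, depth); break;     // plain document text
            }
        }
    }

    // Read the letters of a control word; numeric parameters, control symbols
    // (\{, \}, \\) and \* destinations are deliberately omitted from this sketch.
    private String readControlWord(PushbackReader in) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && Character.isLetter(c)) {
            word.append((char) c);
        }
        if (c != -1 && c != ' ') {
            in.unread(c); // a non-space delimiter belongs to the next token
        }
        return word.toString();
    }

    // Hypothetical hooks: override these to react to control words and text.
    protected void handleControlWord(String word) { }
    protected void handleText(char ch, int depth) { }
}
```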
Presumably, the source of OpenOffice contains what you're looking for.
I have very large XML files to process. I want to convert them to readable PDFs with colors, borders, images, tables and fonts. I don't have a lot of resources on my machine, so I need my application to be very efficient with memory and processor usage.
I did some humble research to make up my mind about the technology to use, but I could not decide which programming language and API would be best for my requirements. I believe DOM is not an option because it consumes a lot of memory, but would Java with a SAX parser fulfill my requirements?
Some people also recommended Python for XML parsing. Is it that good?
I would appreciate your kind advice.
SAX is a very good parser, but it is dated. Oracle has since released a newer parser, StAX, for parsing XML files efficiently:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
The linked page also shows a comparison of the parsers, including memory utilization and features.
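For illustration, a minimal StAX pull-parsing sketch (the file name and element name are placeholders):

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("large.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Pull one event at a time; the document is never held in memory as a whole.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    System.out.println("Found a record element");
                }
            }
            reader.close();
        }
    }
}
```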
Yes, I think SAX will work for you. DOM is not good for large XML files, as it keeps the whole XML file in memory. You can see a comparison I wrote in my blog here.
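Along the same lines, a minimal SAX sketch (again with placeholder file and element names), showing that only the current event is ever in hand:

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExample {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("large.xml"), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called as each element streams past; react here and keep nothing else around.
                if ("record".equals(qName)) {
                    System.out.println("Found a record element");
                }
            }
        });
    }
}
```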
Not sure if you're interested in using Perl, but if you're open to it, the following are all good options: LibXML, LibXSLT and XML-Twig, which is good for files too large to fit in memory (so is LibXML::Reader). Of course SAX is there too, but it can be slow. Most people recommend the first two options. Finally, CPAN is an amazing source with a very active community.
If you want the best of DOM without its memory overhead, vtd-xml is the best bet, here is the proof...
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
I am currently looking for a Java 6/7 parser which generates some (possibly standardized) form of abstract syntax tree.
I have already found that ANTLR has a Java 6 grammar, but it seems that it only generates a parse tree, not a syntax tree. I have also read about the Java Compiler API, but all the sources mentioned that it is overdesigned and poorly documented (and I haven't found whether it really generates an AST).
Do you know of any good parser library, with output as standardized as possible?
Thanks
Basically JavaCC and ANTLR are the best tools out there at the moment.
You can find a usable Java 6 grammar in the project's grammar repository. JavaCC is a bit old-school and rarely updated, but it is easy to start with, Java-oriented, and it generates an AST (search for JJTree). It's a bit, well... strange at first sight, but you can get used to it.
Both tools have a nice IDE support (e.g., Eclipse plug-ins), but I think (based on your description) what you need is JavaCC. Give it a try.
Our DMS Software Reengineering Toolkit with its Java front end can provide an AST (example at SO).
The distinction you draw between "needed for semantics" (AST) and "is an accident of the grammar" ("concrete" or "parse" tree) is interesting. It takes additional effort, somewhere, to drop the CST information to obtain an AST.
You can do that by hand-coding the AST construction as semantic actions on rules. That takes effort, and likely gives you a pretty good answer. But this process can be pretty much completely automated by observing that literal tokens don't need to be kept in the tree, that unary production chains are unnecessary (except where a unary production introduces semantics), and that lists can be formed automatically. (You can read more about this here: https://stackoverflow.com/a/5732290/120163)
This is the approach taken by DMS. You write the grammar; DMS parses and builds the AST using these ideas. No additional work/semantic actions on your part.
For a stone-stable grammar that already has this done for you, there's not a clear advantage, and if all you want is an AST then JavaCC or ANTLR will work. If the grammar can change, then it is easier with DMS's approach.
But nobody wants just an AST. It's the first step in a long series of steps that leads to whatever tool you are imagining. As a practical matter with real tools, you will almost surely need symbol tables and the ability to determine which symbol table entry an identifier node selects. You may need control and data flow analysis. You may need to modify the AST if your tool makes changes and is not just an analysis tool, and for that you might want something that can match/patch arbitrary chunks of the AST using the surface syntax of your language (e.g., Java). Finally, you may want to regenerate source code from your AST as legal, compilable text.
These are not easy mechanisms to build. We think we are competent engineers; it took us several months, on and off over the last 5 years, to get the Java grammars (1.3 to 6 and 7) right. It took us about a year to build the symbol table machinery for Java; how symbols are resolved is a lot more complicated than you think; go read the language standard.
DMS provides all of these capabilities for many languages, including Java, out of the box. For languages with lesser support, it has parsing, prettyprinting, tree transformations, and attribute evaluation out of the box.
I've been hearing "if I just had a parser..." for the last 20 years. My experience (and the reason I built DMS) is that an AST is just not enough, by a long shot.
And I think what DMS provides (far) above and beyond "mere parsing" sets it far apart from "JavaCC and ANTLR". I do not believe they are "the best tools out there at the moment", unless you are optimizing on "free" and not "getting the job done". (If you want a free tool closer to the mark, consider using Eclipse's Java parsing machinery. At least it has, AFAIK, symbol table lookup).
I know of two open source projects for creating and manipulating the Java AST:
javaparser
Eclipse JDT
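As an illustration of the javaparser route, a minimal sketch (the class names follow the 3.x API line and may differ in other versions):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;

public class JavaparserExample {
    public static void main(String[] args) {
        // Parse a small snippet into an AST...
        CompilationUnit cu = StaticJavaParser.parse(
                "class Hello { void greet() { System.out.println(\"hi\"); } }");
        // ...and walk it, printing every method declaration found.
        cu.findAll(MethodDeclaration.class)
          .forEach(m -> System.out.println("Method: " + m.getNameAsString()));
    }
}
```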
I want to start developing a project on NLP, but I don't know much about the tools available. After googling for about a month, I realized that OpenNLP could be my solution.
Unfortunately, I don't see any complete tutorial on using the API; all of them are missing some general steps. I need a tutorial from ground level. I have seen a lot of downloads on the site but don't know how to use them. Do I need to train something? Here is what I want to know:
How to install / set up an NLP system which can:
parse an English sentence into words
identify the different parts of speech
You say that you need to 'parse' each sentence. You probably already know this, but just to be explicit, in NLP, the term 'parse' usually means to recover some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of include POS tagging during parsing, and may provide higher accuracy tagging than finite-state POS taggers (Caveat - I'm much more familiar with constituent parsers than with dependency parsers. It's possible some or most dependency parsers would require POS tags as input).
The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.
If you do decide you need parse structure, a few recommendations:
I think the Stanford parser suggested by @aab includes both a constituent parser and a dependency parser.
The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a pretty well-known PCFG constituent parser, achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe), and is reasonably efficient (~3-5 sentences per second).
The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.
Warning: both of these parsers are research code, with all the problems that engenders. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.
And a couple Wikipedia references for background if needed:
Context-free grammars: http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
Dependency grammars: http://en.wikipedia.org/wiki/Dependency_grammar
Generally you'd do these two tasks in the other order:
Do part-of-speech tagging
Run a parser using the POS tags as input
OpenNLP's documentation isn't that thorough, and some of it has gotten hard to find due to the switch to Apache. Some (potentially slightly out-of-date) tutorials are available in the old SF wiki.
You might want to take a look at the Stanford NLP tools, in particular the Stanford POS Tagger and the Stanford Parser. Both have downloads that include pre-trained model files and they also have demo files in the top-level directory that show how to get started with the API and short shell scripts that show how to use the tools from the command-line.
LingPipe might be another good toolkit to check out. A quick search here will lead you to a number of similar questions with links to other alternatives, too!
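As a concrete starting point for the tagging step, a minimal sketch using the Stanford POS Tagger (the model path is a placeholder for whichever pre-trained model file you downloaded):

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TaggerExample {
    public static void main(String[] args) {
        // Load one of the pre-trained model files shipped with the tagger download
        // (the path below is a placeholder).
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");
        String tagged = tagger.tagString("The quick brown fox jumps over the lazy dog.");
        System.out.println(tagged); // e.g. The_DT quick_JJ brown_JJ fox_NN ...
    }
}
```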
See Illinois-Curator:
http://cogcomp.cs.illinois.edu/page/software_view/Curator
Demo:
http://cogcomp.cs.illinois.edu/curator/demo/
It gives you almost everything in one place.
The most popular are:
GATE: easy to use and fairly quick to start with
UIMA: slow learning curve but more efficient and more generic
I have been asked to develop software which should be able to create a flow chart / control-flow graph of input Java source code. So I started researching and arrived at the following solutions:
To create the flow chart / control-flow graph, I have to recognize the controlling statements and the function calls made in the given source code. Now I have two ways of recognizing them:
Parse the source code by writing my own grammars (a complex solution, I think). I am thinking of using ANTLR for this.
Read the input source code files as text and search for specific patterns (which may become inefficient).
Am I right here? Or am I missing something very fundamental and simple? Which approach would take less time and do the work efficiently? Any other suggestions in this regard will be welcome too, because the input source code may span multiple files and can be fairly complex.
I am good with .NET languages, but this is my first big project in Java. I have basic knowledge of compiler design, so writing grammars should not be impossible for me.
Sorry if I am being unclear. Please ask for any clarifications.
I'd go with Antlr and use an existing Java grammar: https://github.com/antlr/grammars-v4
All tools handling Java code usually decide first whether they want to process the Java language or Java byte code files. That is a strategic decision and depends on your use case; I could imagine both for flow chart generation. Once you have decided that question, there are already several frameworks or libraries which can help you. For byte code engineering there are ASM, Javassist, Soot, and BCEL (which seems to be dead). For Java language parsing and analysis there are Polyglot, the Eclipse compiler, and javac. All of these include a complete compiler frontend for Java and are open source.
I would try to avoid writing my own parser for Java. I did that once. Java has a rather complex grammar, but one that can be found elsewhere. The real work begins with name and type resolution, and you will need both if you want to generate graphs which cover more than one method body.
Eclipse has a library for parsing source code and creating an Abstract Syntax Tree from it, which would let you extract what you want.
See here for a tutorial
http://www.vogella.de/articles/EclipseJDT/article.html
See here for the API
http://help.eclipse.org/indigo/topic/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/package-summary.html#package_description
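For illustration, a minimal sketch of using the JDT ASTParser to visit method invocations, which is the kind of information a control-flow chart would start from (the JLS level constant is just an example; use whichever level your JDT version provides):

```java
import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.ASTParser;
import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.dom.MethodInvocation;

public class JdtExample {
    public static void main(String[] args) {
        ASTParser parser = ASTParser.newParser(AST.JLS8);
        parser.setSource("class A { void run() { System.out.println(\"hi\"); } }".toCharArray());
        CompilationUnit unit = (CompilationUnit) parser.createAST(null);
        // Visit every method invocation in the parsed source.
        unit.accept(new ASTVisitor() {
            @Override
            public boolean visit(MethodInvocation node) {
                System.out.println("Call: " + node.getName());
                return true;
            }
        });
    }
}
```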
Now I have two ways of recognizing:
You have many more ways than that. JavaCC ships with a Java 1.5 grammar already built. I'm sure other parser generators ditto. There is no reason for you to either have to write your own grammar or construct your own parser.
And specifically 'read[ing] input source code files as text and search for the specific patterns' isn't a viable choice at all, as it isn't parsing, and therefore cannot possibly recognize Java programs correctly.
Your input files are written in Java, and the software should be written in Java, but this is your first project in Java? First of all, I'd suggest learning the language with smaller projects. Also you need to learn how to use graphics in Java (there are various libraries). Then, you should focus on what you want to show on your graphs. Or is text sufficient?
The way I would do it is to analyse the compiled code. This would allow you to read JARs without source and avoid parsing the code yourself. I would use ObjectWeb's ASM to read the class files.
A smarter solution is to use Eclipse's Java parser. Read more here: http://www.ibm.com/developerworks/opensource/library/os-ast/
Or even easier: use reflection. You should be able to compile the sources, load the classes with a Java classloader, and analyse them from there. I think this is far easier than any parsing.
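As a sketch of the compiled-code approach, here is roughly what listing the methods of a class with ASM looks like (the class name is just an example):

```java
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class MethodLister {
    public static void main(String[] args) throws Exception {
        // Read a class from the classpath without needing any source code.
        ClassReader reader = new ClassReader("java.lang.String");
        reader.accept(new ClassVisitor(Opcodes.ASM9) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String descriptor,
                                             String signature, String[] exceptions) {
                System.out.println(name + " " + descriptor);
                return super.visitMethod(access, name, descriptor, signature, exceptions);
            }
        }, 0);
    }
}
```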
Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery, with built in capability for parsing, building ASTs, constructing symbol tables, extracting control and data flow, transforming the ASTs, prettyprinting ASTs back to text, etc.
DMS is parameterized by an explicit language definition, and has a large set of preexisting definitions.
DMS's Java Front End already computes control and data flow graphs, so your problem would be reduced to exporting them.
EDIT 7/19/2014: Now handles Java 8.
I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could make a parser to filter that sort of extraneous stuff out specifically for that site, but we are hoping to work on arbitrary sites that we may not have encountered before.
I feel like it's a bit much to hope for, so I won't be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages with others from the same site, but it's imperfect and leaves comments and such.
I am working in Java, but would welcome anything open source in any language that I can use for ideas.
I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.
I stumbled across a Java library that does exactly this. Performance, in my simple tests, is similar to Readability.
http://code.google.com/p/boilerpipe/
You could try an unofficial API of arc90's Readability.
Basically what Readability does is extract content on a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds content on a webpage is gone.
I'm also a bit late to this conversation, but...
The Java Boilerpipe extractors are probably what you want (ArticleSentencesExtractor, probably), although there is at least one port of the arc90 Readability to Java on GitHub.
If you want to build a poor man's Boilerpipe, you might try diffing two pages from the same site (assuming they use the same template, you will likely get an interesting result).
The main difference between Boilerpipe, Readability, and a diff-based hack is that Boilerpipe will strip out all HTML but preserve some structure.
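For illustration, a minimal Boilerpipe sketch (the HTML string stands in for a fetched page):

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div id='nav'>Home | About</div>"
                    + "<p>This is the actual article text we want to keep.</p></body></html>";
        // Strip the boilerplate and keep only the main content.
        String text = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(text);
    }
}
```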
I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.
There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.