We're starting to investigate a project that requires a tricky bit of XML parsing.
I like the look of Groovy's XmlSlurper (Groovy appears to be my Golden Hammer of choice at the moment). We'll need to process a pretty wide range of XML inputs and Groovy's dynamic nature might just let us work out a neat, concise solution. We'll see.
A concern is the cost of that flexibility and dynamism in terms of speed, though I've done no testing of that yet. Does anyone have any experience with this? Are Groovy and XmlSlurper particularly fast or slow compared to some of the Java alternatives for parsing XML?
I did not see serious performance problems with XmlSlurper but you should use it carefully:
If you need to parse few large XML-s you should have no problem with performance. According to this article XmlSlurper has been written to process large XML files.
If you need to parse many small XML-s you should use it in 'a Groovy way' and with pre-populated XML parser instance(s).
In my experience, the speed with which you can get something up and running in Groovy far outweighs any slowdown caused by its dynamic nature...
And in the rare instances it is severely impacting your application, you can always drop out the Groovy code, and write a Java class which adheres to the same Interface, and should plug straight in...
Hmmm...not really an answer this. I guess you could see it more as words of encouragement from the touch line ;-)
Related
I am working on a project that currently uses a bunch of very big xslt files.
we use those xslt's to translate an XML from our system to an XML that the other system can read.
Our system actually receives JSONs which we actually save as XMLs just for those xslts.
We are now thinking about a way to replace the xslt with something simpler, but we have a restriction:
Those xslt's are modified by outside people (which work on the other system), so just refactoring them is not an option, since its only a temporary solution until they will become ugly again. also, we still need to find a way to let those people change the way we transform the XML - preferably without teaching them how to code.
Since our system is written in java, we would also like our solution to be supported by one of the major java frameworks.
I was thinking about a sort of rule engine with XQuery for customization, but I am not sure if that is a valid solution.
Another idea I found was to just use ruby, since many people say that it does the job better. but I fear that the teaching overhead will be too great.
I would really appreciate any ideas you might have for solving this problem.
Thanks :)
I have very large XML files to process. I want to convert them to readable PDFs with colors, borders, images, tables and fonts. I don't have a lot of resources in my machine, thus, I need my application to be very optimal addressing memory and processor.
I did a humble research to make my mind about the technology to use but I could not decide what is the best programming language and API for my requirements. I believe DOM is not an option because it consumes a lot of memory, but, would Java with SAX parser fulfill my requirements?
Some people also recommended Python for XML parsing. Is it that good?
I would appreciate your kind advice.
SAX is very good parser but it is outdated.
Recently Oracle have launched new Parser to parse the xml files efficiently called Stax
*http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html*
Attached link will also shows comparisons of all parsers along with memory utilization and its features.
Thanks,
Pavan
Yes I think Sax will work for you. Dom is not good for large XML files as It keeps the whole XML file in memory. You can see a Comparison I wrote in my blog here
Not sure if you're interested in using Perl, but if you're open to it, the following are all good options: LibXML, LibXSLT and XML-Twig, which is good for files too large to fit in memory (so is LibXML::Reader). Of course as SAX is there, but it can be slow. Most people recommend the first two options. Finally, CPAN is an amazing source with a very active community.
If you want the best of DOM without its memory overhead, vtd-xml is the best bet, here is the proof...
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
This question already has answers here:
Is there a good natural language processing library [closed]
(3 answers)
Closed 8 years ago.
I am willing to start developing a project on NLP. I dont know much of the tools available. After googling for about a month. I realized that openNLP can be my solution.
Unfortunately i dont see any complete tutorial over using the API. All of them are lacking of some general steps. I need a tutorial from ground level. I have seen a lot of downloads over the site but dont know how to use them? do i need to train or something?.. Here is what i want to know-
How to install / set up a nlp system which can-
parse a English sentence words
identify the different parts of speech
You say that you need to 'parse' each sentence. You probably already know this, but just to be explicit, in NLP, the term 'parse' usually means to recover some hierarchical syntactic structure. The most common types are constituent structure (e.g., via a context-free grammar) and dependency structure.
If you need hierarchical structure, I'd recommend you consider just starting with a parser. Most parsers I'm aware of include POS tagging during parsing, and may provide higher accuracy tagging than finite-state POS taggers (Caveat - I'm much more familiar with constituent parsers than with dependency parsers. It's possible some or most dependency parsers would require POS tags as input).
The big downside to parsing is the time complexity. Finite-state POS taggers often run at thousands of words per second. Even greedy dependency parsers are considerably slower, and constituent parsers generally run at 1-5 sentences per second. So if you don't need hierarchical structure, you probably want to stick with a finite-state POS tagger for efficiency.
If you do decide you need parse structure, a few recommendations:
I think the Stanford parser suggested by #aab includes both a constituent parser and a dependency parser.
The Berkeley Parser ( http://code.google.com/p/berkeleyparser/ ) is a pretty well-known PCFG constituent parser, achieves state-of-the-art accuracy (equal or superior to the Stanford parser, I believe), and is reasonably efficient (~3-5 sentences per second).
The BUBS Parser ( http://code.google.com/p/bubs-parser/ ) can also run with the high-accuracy Berkeley grammar, and improves efficiency to around 15-20 sentences/second. Full disclosure - I'm one of the primary researchers working on this parser.
Warning: both of these parsers are research code, with all the problems that engenders. But I'd love to see people actually using BUBS, so if it's of use to you, give it a try and contact me with problems, comments, suggestions, etc.
And a couple Wikipedia references for background if needed:
Context-free grammars: http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
Dependency grammars: http://en.wikipedia.org/wiki/Dependency_grammar
Generally you'd do these two tasks in the other order:
Do part-of-speech tagging
Run a parser using the POS tags as input
OpenNLP's documentation isn't that thorough and some of it's gotten hard to find due to the switch to apache. Some (potentially slightly out-of-date) tutorials are available in the old SF wiki.
You might want to take a look at the Stanford NLP tools, in particular the Stanford POS Tagger and the Stanford Parser. Both have downloads that include pre-trained model files and they also have demo files in the top-level directory that show how to get started with the API and short shell scripts that show how to use the tools from the command-line.
LingPipe might be another good toolkit to check out. A quick search here will lead you to a number of similar questions with links to other alternatives, too!
See Illinois-Curator:
http://cogcomp.cs.illinois.edu/page/software_view/Curator
Demo:
http://cogcomp.cs.illinois.edu/curator/demo/
It gives you almost everything at one place.
The most popular are:
GATE: easy to use and fairly quick to start with
UIMA: slow learning curve but more efficient and more generic
I am asked to develop a software which should be able to create Flow chart/ Control Flow of the input Java source code. So I started researching on it and arrived at following solutions:
To create flow chart/control flow I have to recognize controlling statements and function calls made in the given source code Now I have two ways of recognizing:
Parse the Source code by writing my own grammars (A complex solution I think). I am thinking to use Antlr for this.
Read input source code files as text and search for the specific patterns (May become inefficient)
Am I right here? Or I am missing something very fundamental and simple? Which approach would take less time and do the work efficiently? Any other suggestions in this regard will be welcome too. Any other efficient approach would help because the input source code may span multiple files and can be fairly complex.
I am good in .NET languages but this is my first big project in Java. I have basic knowledge of Compiler Design so writing grammars should not be impossible for me.
Sorry If I am being unclear. Please ask for any clarifications.
I'd go with Antlr and use an existing Java grammar: https://github.com/antlr/grammars-v4
All tools handling Java code usually decide first whether they want to process the language Java or Java byte code files. That is a strategic decision and depends on your use case. I could image both for flow chart generation. When you have decided that question. There are already several frameworks or libraries, which could help you on that. For byte code engineering there are: ASM, JavaAssist, Soot, and BCEL, which seems to be dead. For Java language parsing and analyzing, there are: Polyglot, the eclipse compiler, and javac. All of these include a complete compiler frontend for Java and are open source.
I would try to avoid writing my own parser for Java. I did that once. Java has a rather complex grammar, but which can be found elsewhere. The real work begins with name and type resolution. And you would need both, if you want to generate graphs which cover more than one method body.
Eclipse has a library for parsing the source code and creating Abstract Syntax Tree from it which would let you extract what you want.
See here for a tutorial
http://www.vogella.de/articles/EclipseJDT/article.html
See here for api
http://help.eclipse.org/indigo/topic/org.eclipse.jdt.doc.isv/reference/api/org/eclipse/jdt/core/dom/package-summary.html#package_description
Now I have two ways of recognizing:
You have many more ways than that. JavaCC ships with a Java 1.5 grammar already built. I'm sure other parser generators ditto. There is no reason for you to either have to write your own grammar or construct your own parser.
And specifically 'read[ing] input source code files as text and search for the specific patterns' isn't a viable choice at all, as it isn't parsing, and therefore cannot possibly recognize Java programs correctly.
Your input files are written in Java, and the software should be written in Java, but this is your first project in Java? First of all, I'd suggest learning the language with smaller projects. Also you need to learn how to use graphics in Java (there are various libraries). Then, you should focus on what you want to show on your graphs. Or is text sufficient?
The way I would do it is to analyse compiled code. This would allow you to read jars without source and avoid parsing the code yourself. I would use Objectwebs ASM to read the class files.
Smarter solution is to use Eclipse's java parser. Read more here: http://www.ibm.com/developerworks/opensource/library/os-ast/
Or even more easy: Use reflection. You should be able to compile the sources, load the classes with java classloader and analyse them from there. I think this is far more easy than any parsing.
Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery, with built in capability for parsing, building ASTs, constructing symbol tables, extracting control and data flow, transforming the ASTs, prettyprinting ASTs back to text, etc.
DMS is parameterized by an explicit language definition, and has a large set of preexisting definitions.
DMS's Java Front End already computes control and data flow graphs, so your problem would be reduced to exporting them.
EDIT 7/19/2014: Now handles Java 8.
I'm about to port a smallish library from Java to Python and wanted some advice (smallish ~ a few thousand lines of code). I've studied the Java code a little, and noticed some design patterns that are common in both languages. However, there were definitely some Java-only idioms (singletons, etc) present that are generally not-well-received in Python-world.
I know at least one tool (j2py) exists that will turn a .java file into a .py file by walking the AST. Some initial experimentation yielded less than favorable results.
Should I even be considering using an automated tool to generate some code, or are the languages different enough that any tool would create enough re-work to have justified writing from scratch?
If tools aren't the devil, are there any besides j2py that can at least handle same-project import management? I don't expect any tool to match 3rd party libraries from one language to a substitute in another.
If it were me, I'd consider doing the work by hand. A couple thousand lines of code isn't a lot of code, and by rewriting it yourself (rather than translating it automatically), you'll be in a position to decide how to take advantage of Python idioms appropriately. (FWIW, I worked Java almost exclusively for 9 years, and I'm now working in Python, so I know the kind of translation you'd have to do.)
Code is always better the second time you write it anyway....
Plus a few thousand lines of Java can probably be translated into a few hundred of Python.
Have a look at Jython. It can fairly seamlessly integrate Python on top of Java, and provide access to Java libraries but still let you act on them dynamically.
Automatic translators (f2c, j2py, whatever) normally emit code you wouldn't want to touch by hand. This is fine when all you need to do is use the output (for example, if you have a C compiler and no Fortran compiler, f2c allows you to compile Fortran programs), but terrible when you need to do anything to the code afterwards. If you intend to use this as anything other than a black box, translate it by hand. At that size, it won't be too hard.
I would write it again by hand. I don't know of any automated tools that would generate non-disgusting looking Python, and having ported Java code to Python myself, I found the result was both higher quality than the original and considerably shorter.
You gain quality because Python is more expressive (for example, anonymous inner class MouseAdapters and the like go away in favor of simple first class functions), and you also gain the benefit of writing it a second time.
It also is considerably shorter: for example, 99% of getters/setters can just be left out in favor of directly accessing the fields. For the other 1% which actually do something you can use property().
However as David mentioned, if you don't ever need to read or maintain the code, an automatic translator would be fine.
Jython's not what you're looking for in the final solution, but it will make the porting go much smoother.
My approach would be:
If there are existing tests (unit or otherwise), rewrite them in Jython (using Python's unittest)
Write some characterization tests in Jython (tests that record the current behavior)
Start porting class by class:
For each class, subclass it in Jython and port the methods one by one, making the method in the superclass abstract
After each change, run the tests!
You'll now have working Jython code that hopefully has minimal dependencies on Java.
Run the tests in CPython and fix whatever's left.
Refactor - you'll want to Pythonify the code, probably simplifying it a lot with Python idioms. This is safe and easy because of the tests.
I've this in the past with great success.
I've used Java2Python. It's not too bad, you still need to understand the code as it doesn't do everything correctly, but it does help.