Use jsoup or gquery for plain XML

I was recently wondering about a good library for XML manipulation in Java: A nice Java XML DOM utility
Before re-inventing the wheel, porting jQuery to Java in jOOX, I checked out these libraries:
http://jsoup.org
http://code.google.com/p/gwtquery
But on closer inspection, I can see:
jsoup does not operate on a standard org.w3c.dom document structure. They rolled their own implementation. I checked out the code and I doubt that it is as efficient and tuned as Xerces, for instance. For my use-cases, performance is important
jsoup seems tightly coupled with HTML. I only want to operate on XML, no HTML structure, no CSS
gwtquery is coupled with GWT. I'm not sure how tightly
Does anyone have experience with these libraries when using them only for server-side XML, not for HTML?
I'm interested in
Performance benchmarks (maybe comparing it with standard DOM / XPath)
Compatibility experience (easy to import/export to standard DOM?)
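For reference, the "standard DOM / XPath" baseline mentioned above could look roughly like the sketch below. It uses only JDK classes; the class name DomXPathBaseline and the sample document are made up for illustration, and this is not a benchmark.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses XML into a standard org.w3c.dom.Document
// and queries it with the JDK's built-in XPath engine.
final class DomXPathBaseline {

    // Parses an XML string into a standard DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns how many nodes the XPath expression selects in the document.
    static int count(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<library><book id=\"1\"/><book id=\"2\"/><dvd/></library>");
        System.out.println(count(doc, "//book"));
    }
}
```

Any library that can import from or export to org.w3c.dom.Document could be compared against this kind of code on the same input.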

Without an answer after one month, I think that my own library will resolve my problems best:
http://www.jooq.org/products/jOOX

Related

Efficient Parser for large XMLs

I have very large XML files to process. I want to convert them to readable PDFs with colors, borders, images, tables and fonts. I don't have a lot of resources on my machine, so I need my application to be very efficient in its use of memory and CPU.
I did some modest research to make up my mind about which technology to use, but I could not decide on the best programming language and API for my requirements. I believe DOM is not an option because it consumes a lot of memory, but would Java with a SAX parser fulfill my requirements?
Some people also recommended Python for XML parsing. Is it that good?
I would appreciate your kind advice.

SAX is a very good parser, but it is outdated.
A newer API for parsing XML files efficiently is StAX, which Oracle documents here:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
The linked page also compares the parser APIs, including their memory utilization and features.
Thanks,
Pavan
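As a rough illustration of the StAX pull model, here is a minimal sketch using the javax.xml.stream API that ships with the JDK. The class name StaxElementCounter and the sample input are assumptions for the example, not part of the linked tutorial.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls events one at a time, so only the current
// event is held in memory rather than the whole document tree.
final class StaxElementCounter {

    // Counts start tags whose local name equals the given name.
    static int countElements(String xml, String localName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                // next() advances the cursor and returns the event type.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<rows><row/><row/><row/></rows>", "row"));
    }
}
```

For a large file, the StringReader would be replaced by a stream over the file, so the input is never loaded as a whole.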

Yes, I think SAX will work for you. DOM is not good for large XML files because it keeps the whole XML file in memory. You can see a comparison I wrote on my blog here
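To show what the SAX approach looks like in practice, here is a small sketch with the JDK's SAX parser. The class name SaxTitleCollector and the <title> element are hypothetical; the point is that the handler sees one event at a time and keeps only what it needs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every <title> element
// without building a tree of the whole document.
final class SaxTitleCollector {

    static List<String> titles(String xml) {
        final List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inTitle;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("title")) {
                    inTitle = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so append rather than assign.
                if (inTitle) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title")) {
                    inTitle = false;
                    result.add(text.toString());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(titles("<catalog><item><title>A</title></item></catalog>"));
    }
}
```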

Not sure if you're interested in using Perl, but if you're open to it, the following are all good options: LibXML, LibXSLT and XML-Twig, which is good for files too large to fit in memory (so is LibXML::Reader). Of course, SAX is available too, but it can be slow. Most people recommend the first two options. Finally, CPAN is an amazing source with a very active community.

If you want the best of DOM without its memory overhead, vtd-xml is the best bet, here is the proof...
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

What does it mean by implementing a DOM

I came across the phrase "implementing a DOM" and want to ask what it means exactly.
I think the DOM is implemented in C++ in most browsers, and the DOM API is exposed to users through JavaScript? So what does it mean to implement the DOM using PHP/Java, or even JavaScript, as jsdom did?
A more specific question is why would people want to re-implement DOM using other languages?
Thanks

I think the DOM is implemented in C++ in most browsers, and the DOM API is exposed to users through JavaScript?
Maybe. I'm sure that Internet Explorer uses MSXML for manipulating the DOM. As it's a set of COM components, it is available for use in many different languages. It's likely that this implementation is written in C++, but it doesn't really matter from the application's point of view.
So what does it mean to implement the DOM using PHP/Java, or even JavaScript, as jsdom did?
DOM is the recommended application programming interface for working with XML documents. Implementing a DOM basically means implementing an XML parser and tree structure library that complies with this interface.
This API is a convention. It allows people familiar with DOM manipulation to "feel at home" when they use a new library. This usually happens when you switch to another language (e.g. server-side Java, Python, Ruby and client-side JavaScript), but it may also happen when you start working on another project in the same language that uses a different library.
A more specific question is why would people want to re-implement DOM using other languages?
Because not everyone agrees on which programming language to use. If you really like Haskell and you choose to manipulate documents in XML format (e.g. for persisting data, or for communication with other software that understands XML, web scraping for instance), then you'll need to manipulate XML documents in Haskell. Then, you'll need a library for XML in Haskell.
Note that, even if people agreed on a single programming language, there would probably still be many different libraries as people disagree on political grounds, such as software licensing issues and programming style, desired features, etc.
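As a small illustration of the "feel at home" point, the sketch below uses the JDK's org.w3c.dom implementation. The method names (createElement, setAttribute, appendChild, getElementsByTagName) are the same ones a browser exposes to JavaScript; the class name DomApiDemo and the element names are made up for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo: builds a tiny tree through the standard DOM interfaces.
final class DomApiDemo {

    static Document buildPage() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            // Same calls as document.createElement / appendChild in a browser.
            Element root = doc.createElement("page");
            doc.appendChild(root);
            Element link = doc.createElement("a");
            link.setAttribute("href", "http://example.com");
            link.setTextContent("example");
            root.appendChild(link);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildPage();
        Element link = (Element) doc.getElementsByTagName("a").item(0);
        System.out.println(link.getAttribute("href"));
    }
}
```

The code only talks to interfaces; which parser and tree classes sit behind them is decided by the implementation that DocumentBuilderFactory finds at runtime.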

Any good Java HTML parsers?

I was using Cobra until now because of how easy it was but unfortunately it had some problem with a few test cases. Does anyone suggest a tried-and-tested library?
I've tried Cobra's built in one and HTMLCleaner without any luck.

TagSoup is really great when dealing with crappy HTML/XHTML.
Jericho (and NekoHTML) are also good at parsing invalid HTML.
TagSoup and Jericho: tried-and-tested. NekoHTML: feedback from a trusted source.

Mozilla HTML Parser looks rather interesting. By definition, it's supposed to be as good as Gecko engine itself, which is likely to cover your needs.

Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).

[Answering the title - the overall question and comments are not consistent]
JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful though I think development may have slowed/ceased.

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Stream-lined xml builder/parser in Android?

I'm learning the Android API from a book, and it seems like there isn't any mention of a streamlined API for dealing with raw XML (reading and writing). The author's suggestion for parsing is the XmlPullParser, and his examples look horrendous considering the kind of APIs I'm spoiled by on other platforms (LINQ to XML especially).
Is this the best available technique on the Android platform?
Obviously I can write a wrapper to avoid the repetitive stuff, but I'd be surprised if no such thing already exists.
Also, he doesn't even make mention of creating xml structures in code. What are my options for both?
On a side note, do any Java devs that are familiar with LINQ to XML in .Net know of anything equivalent in Java?

Since you probably don't want to load DOMs of any substantial size into Android's memory, pull and SAX parsers are the preferred way of dealing with XML on Android. I think it pays to invest in understanding how SAX works and to write a custom handler rather than rely on generic libraries that may be incompatible or bloated. I parse XML in my apps using SAX all the time and I'm very pleased with the speed (most of the time).

Well I'm pretty new to Java, but here's what I've gleaned so far about xml parsing on Android:
The XmlPullParser approach is recommended for Android due to resource constraints. There is a DOM parser available in Android, which would let you use XPath to navigate an xml document. Using the DOM means that you have to load the entire document into memory at once, however. The XmlPullParser method is much more efficient in terms of memory used.
The XmlPullParser method takes a little getting used to after being comfortable with LINQ to XML or XPath, but it's really not too bad IMHO (at least with the documents I was parsing). If you're working with small xml documents you could certainly use the DOM with XPath.
There's a decent article about the different methods for reading and writing XML with Android here:
http://www.ibm.com/developerworks/opensource/library/x-android/index.html

I had the same issues with parsing xml or xhtml and ended up writing a webservice doing it for me.
Android Device -> (Request URL) -> Webservice (Get and Parse) -> (Data) -> Android Device
You can transmit the data in JSON to work with it on the device.
The advantage of this is you can minimize the traffic on the slow mobile network and change the parsing without releasing a new android app.
Maybe this will work for you too.
regards
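The server-side "Get and Parse" step could be sketched roughly as follows, using only JDK classes. The class name XmlToJsonSketch, the feed shape and the flat string-only JSON output are assumptions for illustration; a real service would more likely use a JSON library than hand-written escaping.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns each <itemTag> element into a flat JSON object
// whose keys are the child element names and whose values are their text.
final class XmlToJsonSketch {

    static String toJson(String xml, String itemTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        NodeList items = doc.getElementsByTagName(itemTag);
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < items.getLength(); i++) {
            if (i > 0) json.append(',');
            json.append('{');
            NodeList children = items.item(i).getChildNodes();
            boolean first = true;
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace text nodes and comments between elements.
                if (child.getNodeType() != Node.ELEMENT_NODE) continue;
                if (!first) json.append(',');
                first = false;
                json.append(quote(child.getNodeName()))
                    .append(':')
                    .append(quote(child.getTextContent()));
            }
            json.append('}');
        }
        return json.append(']').toString();
    }

    // Minimal JSON string escaping for quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("<feed><entry><title>Hi</title></entry></feed>", "entry"));
    }
}
```

The device then only has to parse the smaller JSON payload, and the XML handling can change on the server without a new app release.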

Has anyone migrated from Struts 1 to another web framework?

On my current project, we've been using Struts 1 for the last few years, and ... ahem ... Struts is showing its age. We're slowly migrating our front-end code to an Ajax client that consumes XML from the servers. I'm wondering if any of you have migrated a legacy Struts application to a different framework, and what challenges you faced in doing so.

Sure. Moving from Struts to an AJAX framework is a very liberating experience. (Though we used JSON rather than XML. Much easier to parse.) However, you need to be aware that it's effectively a full rewrite of your application.
Instead of the classic Database/JSP/Actions scheme for MVC, you'll find yourself moving to a Servlet/Javascript scheme whereby the model is represented by HTTP GET requests, actions are represented by POST/PUT/DELETE requests, and the view is rendered on the fly by the web browser. This leads to interesting challenges in each area:
Server Side - On the server side you will need to develop a standard for exposing data to the client. The simplest and easiest method is to adopt a REST methodology that best matches your data's hierarchy. This is fairly simple to implement with servlets, but Sun has also developed a Java 1.6 scheme using annotations that looks pretty cool.
Another aspect of the server side is to choose a transmission protocol. I know you mentioned XML already, but you might want to reconsider. XML parsers vary greatly between browsers. One browser might make the document root the first child, another one might add a special content object, and they all parse whitespace differently. Even worse, the normalize() function doesn't seem to be correctly implemented by the major browsers. Which means that XML parsing is liable to be full of hacks.
JSON is much easier to parse and more consistent in its results. Javascript and Actionscript (Flash) can both translate JSON directly to objects. This makes accessing the data a simple matter of x.y or x[y]. There are also plenty of APIs to handle JSON in every language imaginable. Because it's so easy to parse, it's almost supported BETTER than XML!
Client Side - The first issue you're going to run into is the fact that no one understands how to write Javascript. ESPECIALLY those who think they do. If you have any books on Javascript, throw them out the window NOW. There are practically no good books on the language as they all follow the same "hacking" pattern without really diving into what they are doing.
From the lowest level, your team is going to need remedial training on Javascript development. Start with the Javascript Client Guide. It's the de facto source of information on the language. The next stop is Douglas Crockford's videos on Javascript. I don't agree with everything he has to say, but he's one of the few experts on the language.
Once you've got that down, consider what frameworks, if any, you want to use. Generally speaking, I dislike stuff like Prototype and Mootools. They tend to take a simple problem and make it worse. None the less, you can feel free to evaluate these tools and decide if they'll work for you.
If you absolutely feel that you cannot live without a framework because your team is too inexperienced, then GWT might fit the bill. GWT allows you to quickly write DHTML web apps in Java code, then compile them to Javascript. The PROBLEM is that you're giving up massive amounts of flexibility by doing this. The Javascript language is far more powerful than GWT exposes. However, GWT does let Java developers get up to speed faster. So pick your battles.
Those are the key areas I can think of. I can say that you'll heave a sigh of relief once you get struts out of your application. It can be a bit of a beast. Especially if you've had inexperienced developers working on your Struts model. :-)
Any questions?
Edit 1: I forgot to add that your team should study the W3C specs religiously. These are the APIs available to you in modern browsers. If you catch anyone using the DOM 0 APIs (e.g. document.forms['myform'].blah.value instead of document.getElementById("blah").value) force them to transcribe the entire DOM 1 specification until they understand it top to bottom.
Edit 2: Another key issue to consider is how to document your fancy new AJAX application. REST style interfaces lend themselves well to being documented in a Wiki. What I did was have a top-level page that listed each of the services and a description. By clicking on the service path, you would be taken to a document with detailed information on each of the sub-paths. In theory, this scheme can document as deep as you need the tree to go.
If you go with JSON, you will need to develop a scheme to document the objects. I just listed out the possible properties in the Wiki as documentation. That works well for simple object trees, but can get complex with larger, more sophisticated objects. You can consider supplementing with something like IDL or WebIDL in that case. (Can't be much worse than XML DTDs and Schemas. ;-))
The DHTML code is a bit more classical in its documentation. You can use a tool like JSDoc to create JavaDoc-style documentation. There's just one caveat. Javascript code does not lend itself well to being documented in-code, if for no other reason than the fact that it bloats the download. However, you may find yourself regularly writing code that operates as a cohesive object, but is not coded behind the scenes as such an object. Thus the best solution is to create JSDoc skeleton files that represent and document the Javascript objects.
If you're using GWT, documentation should be a no-brainer.

Check out the Stripes Framework. If you are familiar with struts then stripes will make sense to you, but it's so much better. They have a Stripes vs Struts section on their website. You could check that out and see if it interests you. It allows you to work with any ajax framework you want, and I don't think it would take long to migrate from struts to stripes.
