parsing/scanning/tokenizing "raw XML" - java

I have an application where I need to parse or tokenize XML and preserve the raw text (e.g. don't parse entities, don't convert whitespace in attributes, keep attribute order, etc.) in a Java program.
I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?
edit: why am I doing this? -- I have a large XML file that I want to make a small number of localized changes programmatically, that need to be reviewed. It is highly valuable to be able to use a diff tool. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. The application that produces the XML in the first place isn't something that I can easily have changed to produce "canonical XML", if there is such a thing.

I think you might have to generate your own grammar.
Some links:
Parsing XML with ANTLR Tutorial
ANTXR
XPA
http://www.google.com/search?q=antlr+xml

I don't think any XML parser will do what you want. Why ? For instance, the XML spec doesn't enforce attribute ordering. I think you're going to have to parse it yourself, and that is non-trivial.
Why do you have to do this ? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous fixes to try and accommodate this.

I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?
Also relying on attribute order is not a good idea - if I remember the XML standard correctly then order is never to be expected.
It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.

Related

What's the right way to produce a XML content in Java?

I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java I'm still a little bit confused. The XML content is written with String concatenation and that seems to me wrong. It is the same to concatenate Strings to produce a JSON object instead of using a JSONObject class. That's the way the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research I think that almost everything I've read pointed to SAX as the best solution but I never really found something really useful to read about it, almost everything states that we have to create a handler and override startElement, endElement and characters methods.
I don't have to serialize the XML file in hard disk or database or anything else, I just need to pass its content to a webservice.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change node texts? My XML without the files would be like that:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template Engine since I don't actually have to serialize the XML and that's a approach that I really like because I've already worked with this framework and it's a really simple framework.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.
It all depends on context. There is no single "right way".
The biggest issues with concatenation is the combination of escaping the XML in to a String constant (which is fiddly), but also escaping the values that you can using so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of the other several libraries may be quite effective. It helps keep the template "native" (you don't have to wrap it in "'s and escape it), plus it keeps the XML out of your code, but easily lets you stamp in the parts that you need to do.
I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html
The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant MemoryIf you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is know as Document Object Model (DOM). One better-known implementation of this approach is Apache Xerces. There are other good implementations as well.
Scarce MemoryIf you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (like 300+ Megs), and I need to parse them in order to add some of their values to the db. The structure of these files is also very complex. I want to use Stax Parser as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, and thus not loading the whole thing in memory, but on the other hand getting the values with Stax (at least on these XML files) is cumbersome, I need to write a ton of code. From this latter point of view it will immensly help me if I could marshall the XML file to Java objects (like JAX-B does) however this would load the whole file plus a ton of Object instances in memory all at once.
My question is, is there some way to pull-parse (or just partially parse) the file sequentially, and then marshall only those parts to Java objects so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem, if you give it the file name it would parse the whole thing. Although there are some options to reduce how much is loaded, but I didn't bother much as we run on machines with 96 GB RAM. :)
Anyway, If your XML format is well defined, then one workaround is to fool the EMF by breaking down the whole file into several smaller (but still well defined) XML snippets. Then feed each snippet one after the other. I don't know JAX-B, but perhaps the same workaround can be applied there as well. Which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML each starting with <tag2> and ending with </tag2>. And in java most parsers would accept a Stream, so just parse using whatever you want, create some StringStream or something for each <tag2> in a loop and pass to JAX-B or EMF.
HTH
Well, first off I wanna thank the two persons answering my questions, but I finally ended up not using those propositions partly because those proposed technologies are a bit far from the Java let's say "standard XML parsing" and it feels weird going so far when there's a similar tool already present in Java and partly also because in fact I did found a solution that only uses Java API's to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is the fact that if you have an XML document AND it's corresponding XSD schema, you can parse & marshall it with JAXB, and you can do it in chunks, and said chunks can be read with an even parser such as STAX and then passed to the JAXB Marshaller.
This practically means that you must first decide where's a good place in your XML file where you can say "this part here has A LOT of repetive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your STAX parser that is triggered at the start of each of those child tags, than stream over to JAXB the content of that child tag, marshall it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

Regex a xml string

What would be the correct way to find a string like this in a large xml:
<ser:serviceItemValues>
<ord1:label>Start Type</ord1:label>
<ord1:value>Loop</ord1:value>
<ord1:valueCd/>
<ord1:activityCd>iactn</ord1:activityCd>
</ser:serviceItemValues>
1st in this xml there will be a lot of repeats of the element above with different values (Loop, etc.) and other xml elements in this document. Mainly what I am concerned with is if there is a serviceItemValues that does not have 'Loop' as it's value. I tried this, but it doesn't seem to work:
private static Pattern LOOP_REGEX =
Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);
Thanks
Regular expressions are not the best option when parsing large amounts of HTML or XML.
There are a number of ways you could handle this without relying on Regular Expressions. Depending on the libraries you have at your disposal you may be able to find the elements you're looking for by using XPaths.
Heres a helpful tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html
Look up XPath, which is kinda like regex for XML. Sort of.
With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.
I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:
"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"
Regular expression is not the right tool for this job. You should be using an XML parser. It's pretty simple to setup and use, and will probably take you less time to code. It then will come up with this regular expression.
I recommend using JDOM. It has an easy syntax. An example can be found here:
http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html
If the documents that you will be parsing are large, you should use a SAX parser, I recommend Xerces.
When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX parsing based routine to check relevant contents or a DOM-like model (preferably pull-based if you're dealing with large documents).
Of course, if you're trying to validate the document's contents somehow, you should probably use some schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).
As mentioned by the other answers, regular expressions are not the tool for the job. You need a XPath engine. If you want to these things from the command line though, I recommend to install XMLStar. I have very good experience with this tool and solving various XML related tasks. Depending on your OS you might be able to just install the xmlstarlet RPM or deb package. Mac OS X ports includes the package as well I think.

XML template in Java

I need to generate XML and they differ only in the values, that the tags contain.
Is it possible to create a template XML and then write only the values each time? (I do not want to go the JAXB way as these are small XMLs and are not worth creating objects for them).
Is this a good approach?
Any thoughts?
You can use freemarker or velocity for templating in java -- or even just add PHP tags to a sample XML to generate from a template.
I think as a general rule, though, once you start conditionally adding elements or attributes, or looping to generate multiples, you're better of generating your XML -- though I agree sometimes getting it into a format you want (not what the generator wants) is sometimes a pain.
As long as the XML file to be produced is small, simple and mostly consistent in format, I tend to buck the trend: I simply create and write a text string.
writer.out.format("<?xml version='1.0'><root><tag1>%s</tag1></root>", value1)
kinda thing.
Despite the fact that you are against jaxb (which I have yet to use), I wish to recommend a comparable way to do this with Apache's XMLBeans.
This requires you to use an xml schema - but from my experience it worth it...

How to parse badly formed XML in Java?

I have XML that I need to parse but have no control over the creation of. Unfortunately it's not very strict XML and contains things like:
<mytag>This won't parse & contains an ampersand.</mytag>
The javax.xml.stream classes don't like this at all, and rightly error with:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[149,50]
Message: The entity name must immediately follow the '&' in the entity reference.
How can I work around this? I can't change the XML, so I guess I need an error-tolerant parser.
My preference would be for a fix that doesn't require too much disruption to the existing parser code.
Use libraries such as tidy or tagsoup.
TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
If it's not valid XML (like the above) then no XML parser will handle it (as you've identified). If you know the scope of the errors (such as the above entity issue), then the simplest solution may be to run a correcting process over it (fixing entities such as inserting entities) and then feed it to an existing parser.
Otherwise you'll have to code one yourself with built-in support for such anomalies. And I can't believe that's anything other than a tedious and error-prone task.
I believe JSoup can handle badly formed XML

Categories