I've got a requirement to take an XML file and replace an existing value with one I generate from user input. It must replace only the existing value in the document.
I was looking at SAX (https://docs.oracle.com/javase/tutorial/jaxp/sax/index.html), which seems to be the simplest library and is now part of the standard Java JDK, but since this is an old project I was wondering if I should use something else, like XSLT (https://docs.oracle.com/javase/tutorial/jaxp/xslt/transformingXML.html).
Can someone please advise the best (easiest) approach for this simple case?
The fact that it's old does not change that it's XML. Use the library best suited to your needs. The standard SAX parser should be fine.
Also, if it's just a matter of replacing text content, why can't you just do a simple textual replace?
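For a small document, a DOM round trip is probably less code than a SAX handler, so here is a rough sketch of that variant rather than of SAX itself. The class name `XmlValueReplacer`, the method `replaceText`, and the sample element names are made up for illustration; only JDK classes are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of any library.
final class XmlValueReplacer {

    /** Replaces the text of every element called tagName and returns the rewritten XML. */
    static String replaceText(String xml, String tagName, String newValue) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Overwrite the text content of each matching element.
            NodeList nodes = doc.getElementsByTagName(tagName);
            for (int i = 0; i < nodes.getLength(); i++) {
                nodes.item(i).setTextContent(newValue);
            }

            // Serialize the modified tree; the serializer escapes special characters.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<config><name>old</name><port>80</port></config>";
        System.out.println(replaceText(xml, "name", "a<b"));
    }
}
```

Unlike a plain textual replace, this escapes the user-supplied value for you. The trade-off is that the serializer may not reproduce the original file's formatting byte for byte.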
I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java, I'm still a little confused. The XML content is written with String concatenation, and that seems wrong to me. It is like concatenating Strings to produce a JSON object instead of using a JSONObject class. This is how the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research, almost everything I've read pointed to SAX as the best solution, but I never found anything really useful to read about it. Almost everything states that we have to create a handler and override the startElement, endElement and characters methods.
I don't have to serialize the XML file in hard disk or database or anything else, I just need to pass its content to a webservice.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change node texts? My XML without the files would be like that:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template engine, since I don't actually have to serialize the XML. That's an approach I really like, because I've already worked with this framework and it's really simple.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.
It all depends on context. There is no single "right way".
The biggest issue with concatenation is the combination of escaping the XML into a String constant (which is fiddly) and escaping the values that you are inserting so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or one of the several similar libraries are quite effective. A template keeps the XML "native" (you don't have to wrap it in quotes and escape it) and out of your code, but still lets you easily stamp in the parts that you need.
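If the escaping is what worries you, another option is to let the JDK's StAX `XMLStreamWriter` do it. The sketch below builds a cut-down version of the request from the question; the class name `RequestXmlWriter` and its parameters are invented, and the namespace declarations and some elements are left out to keep it short.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: a reduced version of the request from the question.
final class RequestXmlWriter {

    static String build(String usuario, String senha, String idPedido, String arquivoNfe) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("restenv");

            w.writeStartElement("autenticacao");
            writeElement(w, "usuario", usuario);
            writeElement(w, "senha", senha);
            w.writeEndElement(); // autenticacao

            w.writeStartElement("parametro");
            writeElement(w, "p_num_ped_compra", idPedido);
            w.writeStartElement("p_xml_sefaz");
            // CDATA is written verbatim; the payload must not itself contain "]]>".
            w.writeCData(arquivoNfe);
            w.writeEndElement(); // p_xml_sefaz
            w.writeEndElement(); // parametro

            w.writeEndElement(); // restenv
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    // writeCharacters escapes &, < and > in the text for us.
    private static void writeElement(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }

    public static void main(String[] args) {
        System.out.println(build("a&b", "secret", "42", "<nfe/>"));
    }
}
```

Nothing is written to disk here; the result is just a String you can hand to the web service client.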
I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html
The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant memory: If you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is known as the Document Object Model (DOM). One of the better-known implementations of this approach is Apache Xerces. There are other good implementations as well.
Scarce memory: If you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
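To give an idea of what the streaming style looks like in practice, here is a small sketch using the SAX parser bundled with the JDK. It collects the text of every element with a given name without ever holding the whole document in memory. `SaxTextCollector` and `textOf` are names I made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects element text while streaming through the document.
final class SaxTextCollector {

    static List<String> textOf(String xml, String elementName) {
        List<String> values = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a matching element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(elementName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text run, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && qName.equals(elementName)) {
                    values.add(current.toString());
                    current = null;
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(textOf("<a><v>1</v><x>n</x><v>2 &amp; 3</v></a>", "v"));
    }
}
```

For a real large file you would pass an `InputStream` or `File` to `parse` instead of a String; the handler stays the same.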
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.
I have a Java string like the one below, which has multiple lines and blank spaces. I need to remove all of them so that everything ends up on one line.
These are xml tags and the editor is not allowing to include less than symbol
<paymentAction>
Authorization
</paymentAction>
Should become
<paymentAction>AUTHORIZATION</paymentAction>
Thanks in advance
Calling theString.replaceAll("\\s+","") will replace all whitespace sequences with the empty string. Just be sure that the text between the tags doesn't contain spaces, otherwise they'll get removed too.
You essentially want to convert the XML you have to Canonical Form. Below is one way of doing it, but it requires you to use that library. If you don't want to depend on external libraries, another option is to use XSLT.
The Canonicalizer class at Apache XML Security project:
NOTE: Using non-XML-aware APIs (such as String.replaceAll()) is generally not recommended, as you end up dealing with special cases and exceptions.
This is a start. Probably not enough, but should be in the right direction.
xml.replaceAll(">\\s*", ">").replaceAll("\\s*<", "<");
However, I'm tempted to say there has to be a way to create a document from the XML and then serialize it in canonical form as Pangea suggested.
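A rough sketch of that document-based idea with plain JDK classes: parse the string, trim every text node (dropping the ones that are only whitespace), and serialize again. This is not canonicalization in the C14N sense, just whitespace cleanup, and `XmlWhitespaceStripper` is a made-up name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: trims text nodes instead of running a regex over the raw string.
final class XmlWhitespaceStripper {

    static String strip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            trimText(doc.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    private static void trimText(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so that removing a child does not shift the indexes still to visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);      // whitespace-only text between tags
                } else {
                    child.setNodeValue(trimmed);  // keep inner spaces, drop the edges
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                trimText(child);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(strip("<paymentAction>\n   Authorization\n</paymentAction>"));
    }
}
```

Spaces inside a value (say, "Start Type") survive, which is the case where a blanket `replaceAll("\\s+", "")` goes wrong.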
I have an application where I need to parse or tokenize XML and preserve the raw text (e.g. don't parse entities, don't convert whitespace in attributes, keep attribute order, etc.) in a Java program.
I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?
edit: why am I doing this? -- I have a large XML file to which I want to make a small number of localized changes programmatically, and those changes need to be reviewed. It is highly valuable to be able to use a diff tool. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. The application that produces the XML in the first place isn't something that I can easily have changed to produce "canonical XML", if there is such a thing.
I think you might have to generate your own grammar.
Some links:
Parsing XML with ANTLR Tutorial
ANTXR
XPA
http://www.google.com/search?q=antlr+xml
I don't think any XML parser will do what you want. Why? For instance, the XML spec doesn't enforce attribute ordering. I think you're going to have to parse it yourself, and that is non-trivial.
Why do you have to do this? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous hoops to try and accommodate this.
I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?
Also relying on attribute order is not a good idea - if I remember the XML standard correctly then order is never to be expected.
It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.
What would be the correct way to find a string like this in a large xml:
<ser:serviceItemValues>
<ord1:label>Start Type</ord1:label>
<ord1:value>Loop</ord1:value>
<ord1:valueCd/>
<ord1:activityCd>iactn</ord1:activityCd>
</ser:serviceItemValues>
First, in this XML there will be many repeats of the element above with different values (Loop, etc.), along with other XML elements. What I am mainly concerned with is whether there is a serviceItemValues that does not have 'Loop' as its value. I tried this, but it doesn't seem to work:
private static Pattern LOOP_REGEX =
Pattern.compile("[\\p{Print}]*?<ord1:label>Start Type</ord1:label>[\\p{Print}]+[^(Loop)][\\p{Print}]+</ser:serviceItemValues>[\\p{Print}]*?", Pattern.CASE_INSENSITIVE|Pattern.MULTILINE);
Thanks
Regular expressions are not the best option when parsing large amounts of HTML or XML.
There are a number of ways you could handle this without relying on Regular Expressions. Depending on the libraries you have at your disposal you may be able to find the elements you're looking for by using XPaths.
Here's a tutorial that may help you on your way: http://www.totheriver.com/learn/xml/xmltutorial.html
Look up XPath, which is kinda like regex for XML. Sort of.
With XPath you write expressions that extract information from XML documents, so extracting the nodes which don't have Loop as a sub-node is exactly the sort of thing it's cut out for.
I haven't tried this, but as a first stab, I'd guess the XPath expression would look something like:
"//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*"
A regular expression is not the right tool for this job. You should be using an XML parser. It's pretty simple to set up and use, and will probably take you less time to code than it took to come up with this regular expression.
I recommend using JDOM. It has an easy syntax. An example can be found here:
http://notetodogself.blogspot.com/2008/04/teamsite-dcr-java-parser.html
If the documents that you will be parsing are large, you should use a SAX parser; I recommend Xerces.
When dealing with XML, you should probably not use regular expressions to check the content. Instead, use either a SAX parsing based routine to check relevant contents or a DOM-like model (preferably pull-based if you're dealing with large documents).
Of course, if you're trying to validate the document's contents somehow, you should probably use some schema tool (I'd go with RELAX NG or Schematron, but I guess you could use XML Schema).
As mentioned in the other answers, regular expressions are not the tool for the job. You need an XPath engine. If you want to do these things from the command line, though, I recommend installing XMLStarlet. I have had very good experiences with this tool when solving various XML-related tasks. Depending on your OS, you might be able to just install the xmlstarlet RPM or deb package. I think Mac OS X ports includes the package as well.
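If you'd rather stay inside Java, the JDK ships an XPath engine in `javax.xml.xpath`. Here is a sketch that evaluates the expression proposed earlier in this thread. The question doesn't show the namespace declarations, so the two URIs below are pure placeholders, and `NonLoopFinder` is an invented name.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: counts serviceItemValues elements whose value is not 'Loop'.
final class NonLoopFinder {

    // Placeholder URIs: replace with whatever the real document binds ser: and ord1: to.
    static final String SER_NS = "urn:example:ser";
    static final String ORD1_NS = "urn:example:ord1";

    static int countNonLoop(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed names in XPath
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    if ("ser".equals(prefix)) return SER_NS;
                    if ("ord1".equals(prefix)) return ORD1_NS;
                    return XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String namespaceURI) {
                    return null; // not needed for evaluating expressions
                }

                @Override
                public Iterator<String> getPrefixes(String namespaceURI) {
                    return Collections.emptyIterator();
                }
            });

            NodeList hits = (NodeList) xpath.evaluate(
                    "//ser:serviceItemValues/ord1:value[text()!='Loop']/parent::*",
                    doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root xmlns:ser=\"" + SER_NS + "\" xmlns:ord1=\"" + ORD1_NS + "\">"
                + "<ser:serviceItemValues><ord1:value>Loop</ord1:value></ser:serviceItemValues>"
                + "<ser:serviceItemValues><ord1:value>Ground</ord1:value></ser:serviceItemValues>"
                + "</root>";
        System.out.println(countNonLoop(xml));
    }
}
```

This loads the whole document as a DOM tree, so for a really large file you may need more heap or a streaming approach instead.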
I need to generate XML and they differ only in the values, that the tags contain.
Is it possible to create a template XML and then write only the values each time? (I do not want to go the JAXB way as these are small XMLs and are not worth creating objects for them).
Is this a good approach?
Any thoughts?
You can use freemarker or velocity for templating in java -- or even just add PHP tags to a sample XML to generate from a template.
I think as a general rule, though, once you start conditionally adding elements or attributes, or looping to generate multiples, you're better off generating your XML -- though I agree that getting it into the format you want (not the one the generator wants) is sometimes a pain.
As long as the XML file to be produced is small, simple and mostly consistent in format, I tend to buck the trend: I simply create and write a text string.
writer.out.format("<?xml version='1.0'?><root><tag1>%s</tag1></root>", value1);
kinda thing.
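One thing to watch with the format-string approach is that the values still need escaping, or a stray `&` or `<` will produce a broken document. A minimal sketch, assuming a hand-rolled `escape` helper (the class and method names are invented):

```java
// Hypothetical helper: a format-string template plus manual escaping of the inserted value.
final class XmlTemplate {

    private static final String TEMPLATE =
            "<?xml version='1.0'?><root><tag1>%s</tag1></root>";

    /** Escapes the five characters that have special meaning in XML. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    static String render(String value1) {
        return String.format(TEMPLATE, escape(value1));
    }

    public static void main(String[] args) {
        System.out.println(render("a < b & c"));
    }
}
```

For a handful of values in a fixed layout this stays readable. Once elements become optional or repeated, a template engine or a writer API is likely the better fit.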
Despite the fact that you are against JAXB (which I have yet to use), I wish to recommend a comparable way to do this with Apache's XMLBeans.
This requires you to use an XML schema, but in my experience it is worth it.