Most efficient way to read and edit an xml file - java

I have an XML template file in which some fields are blank and need to be filled in by my application.
The result must be an XML string representation of that file, which will be passed to another routine.
So, let's take this simple xml as example:
<root>
<name anAttr=""></name>
<age></age>
</root>
As you can see, I'd have to read the XML and, during parsing, add some content to it.
I thought about using a SAX parser, and in the handler I would do something like this:
StringBuilder finalXml = new StringBuilder();
DefaultHandler handler = new DefaultHandler(){
public void startElement(String uri, String localName,String qName,
Attributes attributes) throws SAXException {
finalXml.append("<"+qName+">");
if(qName.equals("name")){
finalXml.append("donald");
}
}
Would this be correct and efficient? Or is there a better way?

I've used dom4j when I wanted to parse XML in Java, and it's quite efficient.

If you have a choice of technology, I would suggest using JAXB.
It will unmarshal the XML into a Java object; you make the modifications on that object and then marshal the modified object into a new XML file.
It has a bit of a learning curve, but the code will be readable and maintainable.
For a basic JAXB tutorial, please refer to URL
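For a template as small as the one in the question, the DOM API that ships with the JDK is another option that needs no extra library. The following is only a minimal sketch under that assumption: the class name `TemplateFiller` and the method `fill` are hypothetical, and the element and attribute names are taken from the example template in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: fills the blank fields of the question's template.
class TemplateFiller {

    static String fill(String templateXml, String name, String attrValue, String age) {
        try {
            // Parse the template from a string, no file I/O involved.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(templateXml)));

            // Fill in the <name> element and its anAttr attribute.
            Element nameEl = (Element) doc.getElementsByTagName("name").item(0);
            nameEl.setAttribute("anAttr", attrValue);
            nameEl.setTextContent(name);

            // Fill in the <age> element.
            doc.getElementsByTagName("age").item(0).setTextContent(age);

            // Serialize the modified tree back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not fill XML template", e);
        }
    }
}
```

Because the serializer takes care of escaping, values containing characters such as `<` or `&` stay well-formed, which hand-built `StringBuilder` output does not guarantee.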

Related

Parsing XML strings in MATLAB

I need to parse an XML string with MATLAB (caution: without file I/O, so I don't want to write the string to a file and then read it back). I'm receiving the strings over an HTTP connection and the parsing should be very fast. I'm mostly concerned with reading the values of certain tags in the string.
The net is full of death threats about parsing XML with regexp, so I didn't want to get into that just yet. I know MATLAB has seamless Java integration, but I'm not very Java-savvy. Is there a quick way to get certain values from XML very rapidly?
For example I want to get the 'volume' information from this string below and write this to a variable.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root>
<volume>256</volume>
<length>0</length>
<time>0</time>
<state>stop</state>
....
For what it's worth, below is Java code, executable from MATLAB, that performs the required task without writing to an intermediate file:
%An XML formatted string
strXml = [...
'<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>' char(10)...
'<root>' char(10) ...
' <volume>256</volume>' char(10) ...
' <length>0</length>' char(10) ...
' <time>0</time>' char(10) ...
' <state>stop</state>' char(10) ...
'</root>' ];
%"simple" java code to create a document from said string
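% StringBufferInputStream is deprecated, so wrap the string in a StringReader/InputSource instead
xmlDocument = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder.parse(org.xml.sax.InputSource(java.io.StringReader(strXml)));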
%"intuitive" methods to explore the xmlDocument
nodeList = xmlDocument.getElementsByTagName('volume');
numberOfNodes = nodeList.getLength();
firstNode = nodeList.item(0);
firstNodeContent = firstNode.getTextContent;
disp(firstNodeContent); %Returns '256'
As an alternative, if your application allows it, consider passing the URL directly into your XML parser. Untested java code is below, but that probably also opens up the Matlab built-in xslt function as well.
xmlDocument = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder.parse('URL_AS_A_STRING_HERE');
Documentation here. Start at the "javax.xml.parsers" package.
There's an entire class of functions for dealing with xml, including xmlread and xmlwrite. Those should be pretty useful for your problem.
I am not familiar with Matlab's APIs at all, but I would point out that using the DOM method outlined by Pursuit will take the most time/memory if you only want specific values out of the XML stream you are getting back over the HTTP connection.
While StAX will give you the fastest parsing approach in Java, using the API can be unwieldy, especially if you are not that familiar with Java. You could use SJXP, which is an extremely thin abstraction on top of StAX parsing in Java (disclaimer: I am the author). It lets you define paths to the elements you want; you then give the parser a stream (your HTTP stream in this case) and it pulls out all the values for you.
As an example, let's say you wanted the /root/state and /root/volume values out of the example XML you posted. The actual Java would look something like this:
// Create /root/state rule
IRule stateRule = new DefaultRule(Type.CHARACTER, "/root/state") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        System.out.println("State is: " + text);
    }
};
// Create /root/volume rule
IRule volRule = new DefaultRule(Type.CHARACTER, "/root/volume") {
    @Override
    public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
        System.out.println("Volume is: " + text);
    }
};
// Create the parser with the given rules
XMLParser parser = new XMLParser(stateRule, volRule);
You can do all of that initialization at program start. Later, when you are processing the stream from your HTTP connection, you would do something like:
parser.parse(httpConnection.getInputStream());
or the like; all of the handler code you defined in your rules then gets called as the parser runs through the stream of characters from the HTTP connection.
As I mentioned, I am not familiar with MATLAB and don't know the proper way to "Matlab-i-fy" this code, but from the first example it looks like you can more or less use the Java APIs directly. In that case, this solution will be both faster and use significantly less memory than the DOM approach, if that is important.
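For comparison, here is roughly what the same lookup looks like with the plain StAX API from the JDK (`javax.xml.stream`), with no additional library. It is a minimal sketch: the class name `VolumeReader` and the method `firstElementText` are hypothetical, and the parser stops as soon as the requested element has been read, so the rest of the stream is never parsed.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: returns the text of the first element with the given name.
class VolumeReader {

    static String firstElementText(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Pull events one at a time and stop at the first match.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        return reader.getElementText();
                    }
                }
                return null; // element not present
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
    }
}
```

`createXMLStreamReader` also accepts an `InputStream`, so the HTTP response stream could be passed in directly instead of a string.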

Ignore SOAP tags in XML file

I have an XML file with some SOAP tags that I want to ignore.
I was parsing the XML file with a pull parser, but it stopped working once those SOAP tags came along.
The XML file looks something like:
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:getAllUsersListResponse xmlns:ns1="http://webservice.business.ese.wiccore.myent.com/">
<return xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"><![CDATA[<User>
and inside the tag <User> come all the tags that I want to parse (and I know how with pull-parser) and then
</User>]]></return>
<return xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"><![CDATA[<User>
until
</User>]]></return>
</ns1:getAllUsersListResponse>
</soap:Body>
</soap:Envelope>
The thing is, I know how to parse normal tags, but I don't want to parse these SOAP tags; I want to IGNORE them! Does anyone know how to achieve this?
Not being overly familiar with pull parsing (I'm typically a SAX guy), I'm probably not the most authoritative source on such things, but here goes...
I believe most (if not all) Java pull parsers should expose CDATA sections using a specific CDATA node (I believe in StAX, for example, the relevant event type is XMLStreamConstants.CDATA). As such, you'll want to parse your document and pull out that CDATA section (inside the SOAP <return> element) and extract its contents.
The contents of that section are the document you are interested in, so then you'd want to in turn run a new pull-parse over the contents you just extracted.
I'm sorry I can't be more help. Hopefully there will be someone else out there that can flesh the details out a bit more for you.
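The two-pass idea can be sketched with the JDK's StAX pull parser. This is an illustration only, with a hypothetical class name `SoapPayloadExtractor`; it assumes, as in the question, that each embedded document sits in the text of an unprefixed `return` element. `getElementText()` returns the element's character data, including the contents of CDATA sections, so the SOAP wrapper elements never need to be handled explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: first pass over the SOAP envelope.
class SoapPayloadExtractor {

    // Returns the text of every <return> element, i.e. the embedded <User> documents.
    static List<String> returnPayloads(String soapXml) {
        List<String> payloads = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(soapXml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "return".equals(reader.getLocalName())) {
                        // Character data and CDATA content are concatenated here.
                        payloads.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed SOAP document", e);
        }
        return payloads;
    }
}
```

Each returned string is a standalone XML document, so it can be handed to the existing pull-parsing code as the second pass.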
EDIT: in response to comments, you can achieve this using SAX as follows (exception handling omitted for brevity):
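import java.io.InputStream;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;
class MyParsingApp extends DefaultHandler2 // see note 1
{
    private boolean inCdata, parsingSubDocument;
    private String subDocument;
    public static void main (String args[]) throws Exception
    {
        InputStream stream = ... // see note 2
        MyParsingApp handler = new MyParsingApp ( );
        XMLReader reader = XMLReaderFactory.createXMLReader(); // see note 3
        reader.setContentHandler (handler);
        // startCDATA/endCDATA are LexicalHandler callbacks, so the handler must be registered for them too
        reader.setProperty ("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse (new InputSource(stream));
        // main is static, so the parsing state is kept on the handler instance
        handler.parsingSubDocument = true;
        reader.parse (new InputSource(new StringReader(handler.subDocument)));
        ...
    }
    public MyParsingApp ( )
    {
        inCdata = parsingSubDocument = false;
        subDocument = "";
    }
    @Override
    public void startCDATA() throws SAXException
    {
        inCdata = true;
    }
    @Override
    public void endCDATA() throws SAXException
    {
        inCdata = false;
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException
    {
        // collect only the text that sits inside a CDATA section
        if (inCdata)
            subDocument += new String(ch, start, length); // see note 4
    }
}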
Some important notes:
1. Normally you would use a separate class as your content handler, probably one for the "main" document (including SOAP elements), and one for your "target" document (in the CDATA section). I've not done so here just to keep it as short as possible.
2. I'm not sure what format your XML is in, but I'm assuming it's in an InputStream here. The InputSource class will happily use an InputStream, a Reader or a String specifying a filename to read from. Use whatever suits you best.
3. You will need to use a SAX2 reader to be able to handle CDATA content. Your default SAX reader may or may not be SAX2 compliant. As such, you may need to (for example) manually create an instance of a particular SAX2 parser. You can find a list of some SAX2 parsers here, if that's the case.
4. There are probably more efficient ways of doing this too (StringBuffer/StringBuilder might be options). Again, I'm just doing it this way for simplicity.
5. I've not actually tested this code. Your mileage may vary.
If you've not used SAX before, it's probably also worth running through the SAX Quickstart Guide.

How to change values of some elements and attributes in an XML file [Java]?

I'm reading an XML file with a SAX parser (this part can be changed if there's a good reason for it).
When I find necessary properties I need to change their values and save the resulting XML-file as a new file.
How can I do that?
AFAIK, SAX is a parser only. You must choose a different library to write XML.
If you are only changing attributes or element names and NOT changing the structure of the XML, then this should be a relatively easy task. Use StAX as a writer:
// Set up the StAX writer
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);
Now, extend the SAX DefaultHandler:
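public void startDocument() throws SAXException {
    try {
        writer.writeStartDocument("UTF-8", "1.0");
    } catch (XMLStreamException e) {
        throw new SAXException(e);
    }
}
public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
    try {
        // qName is used because localName can be empty when the parser is not namespace-aware
        writer.writeStartElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            // change attribute values here before writing them out
            writer.writeAttribute(atts.getQName(i), atts.getValue(i));
        }
    } catch (XMLStreamException e) {
        throw new SAXException(e);
    }
}
public void endElement(String uri, String localName, String qName) throws SAXException {
    try {
        writer.writeEndElement();
    } catch (XMLStreamException e) {
        throw new SAXException(e);
    }
}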
If your document is relatively small, I'd recommend using JDOM. You can instantiate a SAXBuilder to create the Document from an InputStream, then use XPath to find the nodes/attributes you want to change, make your modifications, and then use XMLOutputter to write the modified document back out.
On the other hand, if your document is too large to hold in memory effectively (or you'd prefer not to use a third-party library), you'll want to stick with the SAX parser, streaming the nodes out to disk as you read them and making any changes on the way.
You may also want to take a look at XSLT.
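As a rough illustration of the XSLT route, the sketch below uses the JDK's `javax.xml.transform` API with an identity transform plus one overriding template. Everything specific in it is an assumption made for the example: the class name `XsltAttributeRewrite`, the `item` element, its `status` attribute and the new value `done` are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: copy a document unchanged except for one attribute.
class XsltAttributeRewrite {

    // Identity transform, with a more specific template that rewrites item/@status.
    private static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output omit-xml-declaration=\"yes\"/>"
          + "<xsl:template match=\"@*|node()\">"
          + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
          + "</xsl:template>"
          + "<xsl:template match=\"item/@status\">"
          + "<xsl:attribute name=\"status\">done</xsl:attribute>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String rewrite(String inputXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(inputXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

`StreamSource` and `StreamResult` also accept files and streams, so the same transformer can read one file and write the result to a new one.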

How can I parse a namespace using the SAX parser?

Using a Twitter search URL, i.e. http://search.twitter.com/search.rss?q=android, returns RSS that has an item that looks like:
<item>
<title>#UberTwiter still waiting for #ubertwitter android app!!!</title>
<link>http://twitter.com/meals69/statuses/21158076391</link>
<description>still waiting for an app!!!</description>
<pubDate>Sat, 14 Aug 2010 15:33:44 +0000</pubDate>
<guid>http://twitter.com/meals69/statuses/21158076391</guid>
<author>Some Twitter User</author>
<media:content type="image/jpg" height="48" width="48" url="http://a1.twimg.com/profile_images/756343289/me2_normal.jpg"/>
<google:image_link>http://a1.twimg.com/profile_images/756343289/me2_normal.jpg</google:image_link>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
</item>
Pretty simple. My code parses out everything (title, link, description, pubDate, etc.) without any problems. However, I'm getting null on:
<google:image_link>
I'm using Java to parse the RSS feed. Do I have to handle compound localnames differently than I would a more simple localname?
This is the bit of code that parses out Link, Description, pubDate, etc:
@Override
public void endElement(String uri, String localName, String name)
throws SAXException {
super.endElement(uri, localName, name);
if (this.currentMessage != null){
if (localName.equalsIgnoreCase(TITLE)){
currentMessage.setTitle(builder.toString());
} else if (localName.equalsIgnoreCase(LINK)){
currentMessage.setLink(builder.toString());
} else if (localName.equalsIgnoreCase(DESCRIPTION)){
currentMessage.setDescription(builder.toString());
} else if (localName.equalsIgnoreCase(PUB_DATE)){
currentMessage.setDate(builder.toString());
} else if (localName.equalsIgnoreCase(GUID)){
currentMessage.setGuid(builder.toString());
} else if (uri.equalsIgnoreCase(AVATAR)){
currentMessage.setAvatar(builder.toString());
} else if (localName.equalsIgnoreCase(ITEM)){
messages.add(currentMessage);
}
builder.setLength(0);
}
}
startDocument looks like:
@Override
public void startDocument() throws SAXException {
super.startDocument();
messages = new ArrayList<Message>();
builder = new StringBuilder();
}
startElement looks like:
@Override
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
super.startElement(uri, localName, name, attributes);
if (localName.equalsIgnoreCase(ITEM)){
this.currentMessage = new Message();
}
}
Tony
An element like <google:image_link> has the local name image_link belonging to the google namespace. You need to ensure that the XML parsing framework is aware of namespaces, and you'd then need to find this element using the appropriate namespace.
For example, a few SAX1 interfaces in package org.xml.sax have been deprecated and replaced by SAX2 counterparts that include namespace support (e.g. the SAX1 Parser is deprecated and replaced by the SAX2 XMLReader). Consult the documentation on how to specify the namespace URI or the qualified (prefixed) qName.
See also
Wikipedia/XML namespace
package org.xml.sax
saxproject.org - Namespaces
From the sample it is not actually clear which namespace the 'google' prefix binds to. The previous answer is slightly incorrect in that the element is NOT in a "google" namespace; rather, it is in whatever namespace the prefix "google" binds to. As such, you have to match the namespace (identified by URI), not the prefix. SAX does have a confusing way of reporting local name / namespace-prefix combinations, and it depends on whether namespace processing is even enabled.
You could also consider alternative XML processing libraries and APIs; while SAX implementations are performant, there are equally fast and more convenient alternatives. StAX (javax.xml.stream.*) implementations like Woodstox (and even the default one that JDK 1.6 comes with) are fast and a bit more convenient. The StaxMate library, which builds on top of StAX, is much simpler to use for both reading and writing, and is as fast as SAX implementations like Xerces. Plus, the StAX API has less baggage with respect to namespace handling, so it is easier to see the actual namespace of elements.
As user polygenelubricants said: generally, the parser needs to be namespace-aware to handle elements that belong to a prefixed namespace (like that <google:image_link> element).
This needs to be set as a "parser feature", which AFAIK can be done in a few different ways. The XMLReader interface itself has a setFeature() method that can be used to set features for a particular parser, but you can also use the same method on the SAXParserFactory class so that the factory generates parsers with those features already on or off. The SAX2 standard feature flags should be on the SAX project's website, but at least some of them are also listed in the Java API documentation of package org.xml.sax.
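A minimal sketch of such a namespace-aware setup is shown below, using `SAXParserFactory.setNamespaceAware(true)` and matching on namespace URI plus local name rather than on the prefix. The class name `NamespaceAwareLookup` is hypothetical, and since the feed excerpt in the question does not show which URI the `google` prefix is bound to, the URI is passed in as a parameter.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: finds the first element with the given namespace URI and local name.
class NamespaceAwareLookup {

    static String elementText(String xml, final String namespaceUri, final String localName) {
        final StringBuilder buffer = new StringBuilder();
        final String[] found = new String[1];

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                buffer.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                // Match on URI + local name; the prefix used in the document is irrelevant.
                if (found[0] == null && namespaceUri.equals(uri) && localName.equals(local)) {
                    found[0] = buffer.toString();
                }
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // otherwise uri and localName may be empty
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return found[0];
    }
}
```

Matching on the URI keeps working even if the feed later switches to a different prefix for the same namespace.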
For simple documents you can try to take a shortcut. If you don't actually care about namespaces and element names as a URI + local-name combination, and you can trust that the elements you are looking for (and only these) always have a certain prefix, and that there are no elements from other namespaces with the same local name, then you might solve your problem simply by using the qName parameter of the startElement() method instead of localName (or vice versa), or by adding/dropping the prefix in the tag name string you compare against.
According to the Java API docs, the contents of the namespaceUri, qName and localName parameters are actually optional: each may be an empty string when it is not available. Which of them are empty depends on the aforementioned "parser features" that affect namespaces. I don't know whether the parameter that is empty can vary between elements in a namespace and elements without one; I haven't investigated that behaviour.
PS. XML is case-sensitive, so ideally you don't need to ignore case when comparing tag name strings. First post, yay!
This might help someone using the Android SAX util. I was trying geo:lat to get the lat element from the geo namespace.
Sample XML:
<item>
<title>My Item title</title>
<geo:lat>40.720741</geo:lat>
</item>
First attempt returned null:
item.getChild("geo:lat");
As suggested above, I found passing the namespace URI to the getChild method worked.
item.getChild("http://www.w3.org/2003/01/geo/wgs84_pos#", "lat");
Using the startPrefixMapping method of my XML handler, I was able to parse out the text of a namespaced element.
I placed several calls to this method right after instantiating my handler.
GoogleReader xmlhandler = new GoogleReader();
xmlhandler.startPrefixMapping("dc", "http://purl.org/dc/elements/1.1/");
where dc is the namespace prefix, as in <dc:author>some text</dc:author>

Jumping between XML tags

This is a question about SAX.
I want to process the child tags in an XML file only if they are under a matching parent tag.
For example:
<version>
<parent tag-1>
<tag 1>
<tag 2>
</parent tag-1 >
<parent tag-2>
<tag 1>
<tag 2>
</parent tag-2>
</version>
In the above example, I want to match the parent tag first (i.e. parent tag-1 or parent tag-2, based on user input) and only then process the child tags under it.
Can this be done with a SAX parser, keeping in mind that SAX offers limited control compared with DOM and that I am a novice in both SAX and Java? If so, could you please point me to the corresponding method?
TIA
Surely, it can be done easily by remembering the parent tag.
In general, when parsing XML tags, people use a stack to keep track of the ancestry of those tags. Your case can be solved easily with the following code:
Stack<Tag> tagStack = new Stack<Tag>();
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    if(localName.toLowerCase().equals("parent")){
        tagStack.push(new ParentTag());
    }else if(localName.toLowerCase().equals("tag")){
        // isEmpty() guards against peek() throwing on an empty stack
        if(!tagStack.isEmpty() && tagStack.peek() instanceof ParentTag){
            //do your things here only when the parent tag is "parent"
        }
    }
}
public void endElement(String uri, String localName, String qName)
        throws SAXException{
    if(localName.toLowerCase().equals("parent")){
        tagStack.pop();
    }
}
Or you can simply remember which tag you are in by updating tagName:
String tagName = null;
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    if(localName.toLowerCase().equals("parent")){
        tagName = "parent";
    }else if(localName.toLowerCase().equals("tag")){
        if(tagName!= null && tagName.equals("parent")){
            //do your things here only when the parent tag is "parent"
        }
    }
}
public void endElement(String uri, String localName, String qName)
        throws SAXException{
    // reset only when the parent itself closes, not after every child
    if(localName.toLowerCase().equals("parent")){
        tagName = null;
    }
}
But I prefer the stack way, because it keeps track of all your ancestor tags.
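A complete, runnable variant of the stack approach might look like the sketch below. It is only an illustration: the class name `ParentScopedReader` is hypothetical, the stack holds element names rather than `Tag` objects, and it collects the text of the leaf children that sit directly under the parent chosen by the user.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: text of the leaf elements directly under one chosen parent.
class ParentScopedReader {

    static List<String> childTexts(String xml, final String wantedParent) {
        final List<String> results = new ArrayList<>();
        final Deque<String> openElements = new ArrayDeque<>(); // names of all open ancestors
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                openElements.push(qName);
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                openElements.pop();
                // After the pop, the top of the stack is the parent of the element that just ended.
                if (wantedParent.equals(openElements.peek())) {
                    results.add(text.toString().trim());
                }
                text.setLength(0);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return results;
    }
}
```

Since the full ancestor chain is on the stack, the same handler could be extended to match on a grandparent or a whole path instead of just the direct parent.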
SAX is going to spool through the entire document anyway, if you're looking at doing this for performance reasons.
However, from a code niceness perspective, you could have the SAX parser not return the non-matching children by wiring it up with an XMLFilter. You'd probably still have to write the logic yourself (something like that provided in Wing C. Chen's post), but instead of putting it in your application logic you could abstract it out into a filter implementation.
This would let you reuse the filtering logic more easily, and it would probably make your application code cleaner and easier to follow.
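One possible shape for such a filter, sketched on top of the JDK's `org.xml.sax.helpers.XMLFilterImpl`, is shown below. The class name `SubtreeFilter` and the helper `forwardedElements` are hypothetical; the filter forwards events only for elements nested inside the wanted parent, so the downstream content handler never sees anything else.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes on only the content inside one chosen parent element.
class SubtreeFilter extends XMLFilterImpl {
    private final String wantedParent;
    private int depth = 0; // 0 = outside the wanted parent, 1 = directly inside it

    SubtreeFilter(String wantedParent) {
        this.wantedParent = wantedParent;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0) {
            depth++;
            super.startElement(uri, localName, qName, atts); // forward descendants
        } else if (qName.equals(wantedParent)) {
            depth = 1; // entered the wanted parent; the parent itself is not forwarded
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 1) {
            super.endElement(uri, localName, qName);
            depth--;
        } else if (depth == 1) {
            depth = 0; // left the wanted parent
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 1) {
            super.characters(ch, start, length);
        }
    }

    // Demo: names of the elements that reach the application's handler.
    static List<String> forwardedElements(String xml, String wantedParent) {
        final List<String> seen = new ArrayList<>();
        try {
            SubtreeFilter filter = new SubtreeFilter(wantedParent);
            filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                        Attributes atts) {
                    seen.add(qName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return seen;
    }
}
```

The application handler stays free of any parent-tracking state, and the same filter can be reused with a different `wantedParent` per run.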
The solution proposed by @Wing C. Chen is more than decent, but in your case, I wouldn't use a stack.
A use case for a stack when parsing XML
A common use case for a stack with XML is, for example, verifying that XML tags are balanced when using your own lexer (i.e. a hand-made XML parser with error tolerance).
A concrete example of it would be building the outline of an XML document for the Eclipse IDE.
When to use SAX, pull parsers and the like
Memory efficiency when parsing a huge XML file
You don't need to navigate back and forth in the document.
However, using SAX to parse complex documents can become tedious, especially if you want to apply operations to nodes based on some conditions.
When to use DOM-like APIs
You want easy access to the nodes
You want to navigate back and forth in the document at any time
Speed is not the main requirement vs development time/readability/maintenance
My recommendation
If you don't have a huge XML document, use a DOM-like API and select the nodes with XPath.
I prefer dom4j personally, but I don't mind other APIs such as JDOM or even Xpp3, which has XPath support.
The SAX Parser will call a method in your implementation, every time it hits a tag. If you want different behavior depending on the parent, you have to save it to a variable.
If you want to jump to particular tags then you would need to use a DOM parser. This will read the entire document into memory and then provide various ways of accessing particular nodes of the tree, such as requesting a tag by name then asking for the children of that tag.
So if you are not restricted to SAX then I would recommend DOM. I think the main reason for using SAX over DOM is that DOM requires more memory since the entire document is loaded at once.
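For documents that fit in memory, the DOM-plus-XPath route recommended in this thread can be sketched with the JDK's own `javax.xml.xpath` package. The class name `XPathChildLookup` is hypothetical, the `/version/...` path follows the structure of the example in the question, and the parent name is assumed to be a plain, trusted element name because it is concatenated into the XPath expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: text of the child elements of one chosen parent under <version>.
class XPathChildLookup {

    static List<String> childTexts(String xml, String parentName) {
        try {
            // Load the whole document into memory.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Select all element children of the chosen parent in one expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/version/" + parentName + "/*", doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```

The selection logic lives in a single XPath string here, which is the readability gain over the SAX versions; the cost is holding the entire tree in memory.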
