How can I parse a namespace using the SAX parser?

Using a Twitter search URL, i.e. http://search.twitter.com/search.rss?q=android, returns RSS that has an item that looks like:
<item>
<title>#UberTwiter still waiting for #ubertwitter android app!!!</title>
<link>http://twitter.com/meals69/statuses/21158076391</link>
<description>still waiting for an app!!!</description>
<pubDate>Sat, 14 Aug 2010 15:33:44 +0000</pubDate>
<guid>http://twitter.com/meals69/statuses/21158076391</guid>
<author>Some Twitter User</author>
<media:content type="image/jpg" height="48" width="48" url="http://a1.twimg.com/profile_images/756343289/me2_normal.jpg"/>
<google:image_link>http://a1.twimg.com/profile_images/756343289/me2_normal.jpg</google:image_link>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
</item>
Pretty simple. My code parses out everything (title, link, description, pubDate, etc.) without any problems. However, I'm getting null on:
<google:image_link>
I'm using Java to parse the RSS feed. Do I have to handle compound local names differently than I would a simpler local name?
This is the bit of code that parses out Link, Description, pubDate, etc:
@Override
public void endElement(String uri, String localName, String name)
        throws SAXException {
    super.endElement(uri, localName, name);
    if (this.currentMessage != null){
        if (localName.equalsIgnoreCase(TITLE)){
            currentMessage.setTitle(builder.toString());
        } else if (localName.equalsIgnoreCase(LINK)){
            currentMessage.setLink(builder.toString());
        } else if (localName.equalsIgnoreCase(DESCRIPTION)){
            currentMessage.setDescription(builder.toString());
        } else if (localName.equalsIgnoreCase(PUB_DATE)){
            currentMessage.setDate(builder.toString());
        } else if (localName.equalsIgnoreCase(GUID)){
            currentMessage.setGuid(builder.toString());
        } else if (uri.equalsIgnoreCase(AVATAR)){
            currentMessage.setAvatar(builder.toString());
        } else if (localName.equalsIgnoreCase(ITEM)){
            messages.add(currentMessage);
        }
        builder.setLength(0);
    }
}
startDocument looks like:
@Override
public void startDocument() throws SAXException {
    super.startDocument();
    messages = new ArrayList<Message>();
    builder = new StringBuilder();
}
startElement looks like:
@Override
public void startElement(String uri, String localName, String name,
        Attributes attributes) throws SAXException {
    super.startElement(uri, localName, name, attributes);
    if (localName.equalsIgnoreCase(ITEM)){
        this.currentMessage = new Message();
    }
}
Tony

An element like <google:image_link> has the local name image_link belonging to the google namespace. You need to ensure that the XML parsing framework is aware of namespaces, and you'd then need to find this element using the appropriate namespace.
For example, several SAX1 interfaces in package org.xml.sax have been deprecated and replaced by SAX2 counterparts that include namespace support (e.g. the SAX1 Parser is deprecated and replaced by the SAX2 XMLReader). Consult the documentation on how to specify the namespace URI or the qualified (prefixed) qName.
See also
Wikipedia/XML namespace
package org.xml.sax
saxproject.org - Namespaces
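To make the advice above concrete, here is a minimal sketch of a namespace-aware SAX parse that matches the element by namespace URI plus local name. The class name ImageLinkExtractor and the URI in GOOGLE_NS are assumptions for illustration; the real URI is whatever the feed's xmlns:google attribute declares.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ImageLinkExtractor {
    // Assumed namespace URI for the "google" prefix; check the feed's xmlns:google.
    static final String GOOGLE_NS = "http://base.google.com/ns/1.0";

    /** Returns the text of the first image_link element in GOOGLE_NS, or null. */
    static String firstImageLink(String xml) {
        final StringBuilder text = new StringBuilder();
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Match on namespace URI + local name, never on the prefix.
                if (result[0] == null && GOOGLE_NS.equals(uri)
                        && "image_link".equals(localName)) {
                    result[0] = text.toString();
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories default to false
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse feed", e);
        }
        return result[0];
    }
}
```

Calling ImageLinkExtractor.firstImageLink(feedXml) should give the image URL regardless of which prefix the feed happens to use for that namespace.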

From the sample it is not actually clear what namespace the 'google' prefix binds to. The previous answer is slightly incorrect in that the element is NOT in a "google" namespace; rather, it is in whatever namespace the prefix "google" is bound to. As such you have to match the namespace (identified by its URI), and not the prefix. SAX does have a confusing way of reporting local name / namespace-prefix combinations, and it depends on whether namespace processing is even enabled.
You could also consider alternative XML processing libraries / APIs; while SAX implementations are performant, there are alternatives that are as fast and more convenient. StAX (javax.xml.stream.*) implementations like Woodstox (and even the default one that JDK 1.6 comes with) are fast and a bit more convenient. The StaxMate library, which builds on top of StAX, is much simpler to use for both reading and writing, and speed-wise as fast as SAX implementations like Xerces. Plus, the StAX API has less baggage with respect to namespace handling, so it is easier to see what the actual namespace of an element is.
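As a rough illustration of the StAX route, the sketch below uses only the JDK's javax.xml.stream API (not Woodstox or StaxMate). The class name StaxImageLink and the namespaceUri parameter are made up for the example; the caller passes whatever URI the feed binds to its prefix.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxImageLink {
    /** Returns the text of the first image_link element in the given namespace, or null. */
    static String firstImageLink(String xml, String namespaceUri) {
        try {
            // StAX readers are namespace-aware by default.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && namespaceUri.equals(reader.getNamespaceURI())
                        && "image_link".equals(reader.getLocalName())) {
                    return reader.getElementText(); // text-only content of the element
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }
}
```

Because the reader exposes getNamespaceURI() and getLocalName() separately, there is no qName/localName ambiguity to work around.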

As user polygenelubricants said: generally the parser needs to be namespace aware to handle elements that belong to a prefixed namespace (like that <google:image_link> element).
This needs to be set as a "parser feature", which AFAIK can be done in a few different ways: the XMLReader interface itself has a setFeature() method that can be used to set features for a particular parser, but you can also use the same method on the SAXParserFactory class so that the factory generates parsers with those features already on or off. The SAX2 standard feature flags should be listed on the SAX project's website, and at least some of them are also listed in the Java API documentation of package org.xml.sax.
For simple documents you can try to take a shortcut. If you don't actually care about namespaces and element names as a URI + local-name combination, and you can trust that the elements you are looking for (and only these) always have a certain prefix and that there aren't elements from other namespaces with the same local name, then you might just solve your problem by using the qName parameter of the startElement() method instead of localName (or vice versa), or by adding/dropping the prefix from the tag name string you compare to.
The contents of the namespaceUri, qName and localName parameters are, according to the Java specs, actually optional, and AFAIK they might be null or empty for this reason. Which ones are missing depends on the aforementioned "parser features" that affect namespaces. I don't know whether the missing parameter can vary between elements in a namespace and elements without a namespace; I haven't investigated that behaviour.
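One way to see what a given parser actually reports is to print the three values with namespace processing switched on and off. This is only a sketch: SaxNameReport is a made-up name, and it toggles the setting through SAXParserFactory.setNamespaceAware rather than setFeature. The ContentHandler Javadoc describes unavailable values as empty strings rather than null, but the exact output is parser-dependent, so run it against the parser you ship with.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxNameReport {
    /** Returns one "uri|localName|qName" entry per start tag, in document order. */
    static List<String> report(String xml, boolean namespaceAware) {
        final List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                String qName, Attributes attributes) {
                            seen.add(uri + "|" + localName + "|" + qName);
                        }
                    });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return seen;
    }
}
```

Comparing report(xml, true) with report(xml, false) for the same document shows directly which of the three parameters is safe to match on in each mode.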
PS. XML is case sensitive, so ideally you don't need to ignore case when comparing tag names. First post, yay!

This might help someone using the Android SAX util. I was trying geo:lat to get the lat element from the geo namespace.
Sample XML:
<item>
<title>My Item title</title>
<geo:lat>40.720741</geo:lat>
</item>
First attempt returned null:
item.getChild("geo:lat");
As suggested above, I found passing the namespace URI to the getChild method worked.
item.getChild("http://www.w3.org/2003/01/geo/wgs84_pos#", "lat");

Using the startPrefixMapping method of my XML handler, I was able to parse out the text of a namespaced element.
I placed several calls to this method beneath my handler instantiation.
GoogleReader xmlhandler = new GoogleReader();
xmlhandler.startPrefixMapping("dc", "http://purl.org/dc/elements/1.1/");
where dc is the namespace prefix, as in <dc:author>some text</dc:author>

Related

Most efficient way to read and edit an xml file

I have an xml template file, some fields are blank and need to be filled by my application.
This has to result in an xml string representation of that file which will be given to another routine.
So, let's take this simple xml as example:
<root>
<name anAttr=""></name>
<age></age>
</root>
As you can see I'd have to read the xml and, in the parsing process, add some contents to it.
I thought about using a SAX parser, and in the handler I would do something like this:
StringBuilder finalXml = new StringBuilder();
DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        finalXml.append("<" + qName + ">");
        if (qName.equals("name")) {
            finalXml.append("donald");
        }
    }
};
Would it be correct/efficient this way? Or is there a better way?

I've used dom4j when I have wanted to parse XML in Java, and it's quite efficient.

If you have a choice of technology, I would suggest using JAXB.
It will unmarshal the XML into a Java object; you make the modifications on that object and then marshal the modified object into a new XML file.
It has a bit of a learning curve, but the code will be readable and maintainable.
For a basic tutorial of JAXB please refer to URL
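For comparison, the fill-in-the-template task from the question can also be sketched with nothing but JDK classes (DOM plus an identity Transformer); this is neither dom4j nor JAXB. The class name TemplateFiller and its parameters are invented for the example, and it assumes the template is small enough to hold in memory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TemplateFiller {
    /** Fills the <name> and <age> elements of the template and returns the XML as a string. */
    static String fill(String templateXml, String name, String attrValue, String age) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(templateXml)));

            // The template from the question has exactly one <name> and one <age>.
            Element nameElement = (Element) doc.getElementsByTagName("name").item(0);
            nameElement.setTextContent(name);
            nameElement.setAttribute("anAttr", attrValue);
            doc.getElementsByTagName("age").item(0).setTextContent(age);

            // An identity transform serializes the modified tree back to text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException
                | TransformerException e) {
            throw new IllegalStateException("could not fill template", e);
        }
    }
}
```

Unlike appending tags to a StringBuilder by hand, the serializer takes care of escaping, attributes and closing tags.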

java sax parser mangles attributes for xml 1.1

I'm using java's sax classes to parse an xml file. If the xml file says version 1.0, everything works fine, but if it says version 1.1, then some of the attributes get mangled, giving me the wrong results but not throwing any kind of exception.
My xml file basically looks like this:
<?xml version="1.1" encoding="UTF-8" ?>
<gpx>
<trk>
<name>Name of the track</name>
<trkseg>
<trkpt lat="12.3456789" lon="1.2345678">
<ele>1234</ele>
<time>2013-03-26T12:34:56Z</time>
<speed>0</speed>
</trkpt>
... and then 419 further identical copies of this trkpt
</trkseg>
</trk>
</gpx>
So what I expect, when I use sax to parse this file, is to find 420 trkpt tags, and for each of them to have lat and lon attributes. In particular, I expect to find 420 "lat" attributes which are all "12.3456789".
For the parsing I construct a handler object and give it the stream to this local file:
SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
inStream = new FileInputStream(file);
saxParser.parse(inStream, handler);
System.out.println("done");
The handler class extends org.xml.sax.helpers.DefaultHandler and just has one method, startElement to react to the opening of the trkpt tag:
public void startElement(String uri, String localName, String qName, Attributes attributes)
        throws SAXException
{
    if (qName.equals("trkpt") && attributes != null
            && attributes.getLength() == 2
            && attributes.getValue(0).charAt(0) != '1')
    {
        // The trkpt tag has two attributes
        // but the value of the first one doesn't begin with '1'
        System.out.println(attributes.getQName(0) + " = " + attributes.getValue(0));
    }
    super.startElement(uri, localName, qName, attributes);
}
So what is the result?
If the xml file has version 1.0, then all I see is "done". 420 trkpt tags were found, all of them had two attributes, the first one was always called "lat" and the value of this attribute always started with '1', as I expect. Great!
If the xml file is changed to specify version="1.1" on the first line, then I get the following output:
lat = :34.56Z</t
lat = :56Z</time
done
So even though all my 420 points should be identical, two of them gave me a completely wrong attribute value. No exceptions are thrown. Still 420 trkpts were found, and all had two attributes called "lat" and "lon". Oddly the lon values are always ok.
I created this xml file in a text editor by direct copy/pasting the first trkpt, so I'm sure that all the values are identical, I'm sure there are no points in the xml file with funny attribute values, and I'm sure that there are no non-ascii character values or entity codes or anything else odd about the file.
I've tried it using Sun's JRE6, OpenJDK6 and OpenJDK7, on three different machines with two different OSs. So either I'm doing something wrong, or this particular xml file is incompatible with xml1.1 somehow, or there's a widespread sax bug (which seems unlikely as I would expect it to affect lots of people). Again, please note, with xml1.0 it all works fine. Also note, there's nothing special about the number 420, it's just that if the file only has 100 entries then they all get parsed properly. If you have several thousand entries then a certain number of them get their first attribute value mangled in this way. The length of the attribute value always seems to be correct but it's pulling characters out from the wrong point in the file. Index overflow perhaps?
I tried removing all the speed tags, but the problem still persists if you have enough trkpts. It's also sensitive to additional whitespace, so the problem occurs with different points or gives back different attribute values if I add line breaks between the trkpts.

This bug has been present in the JDK XML parser for years, and neither Sun nor Oracle has shown any interest in fixing it. I strongly advise using the Apache Xerces XML parser in preference.

Using XPath in XMLObject to query by namespace

I have a simple XML document
<abc:MyForm xmlns:abc='http://myform.com'>
<abc:Forms>
<def:Form1 xmlns:def='http://decform.com'>
....
</def:Form1>
<ghi:Form2 xmlns:ghi='http://ghiform.com'>
....
</ghi:Form2>
</abc:Forms>
</abc:MyForm>
I'm using XMLObjects from Apache and when I try to do the following xpath expression it works perfectly
object.selectPath("declare namespace abc='http://myform.com'
abc:Form/abc:Forms/*");
this gives me the 2 Form nodes (def and ghi). However I want to be able to query by specifying a namespace, so let's say I only want Form2. I've tried this and it fails
object.selectPath("declare namespace abc='http://myform.com'
abc:Form/abc:Forms/*
[namespace-uri() = 'http://ghiform.com']");
The selectPath returns 0 nodes. Does anyone know what is going on?
Update:
If I do the following in 2 steps, then I can get the result that I want.
XmlObject forms = object.selectPath("declare namespace abc='http://myform.com'
abc:Form/abc:Forms")[0];
forms.selectPath("*[namespace-uri() = 'http://ghiform.com']");
this gives me the ghi:Form node just like it should, I don't understand why it doesn't do it as a single XPath expression though.
Thanks

The simple answer is that you can't. The namespace prefix is just a shorthand for the namespace URI, which is all that matters.
For a namespace-aware parser, your two tags are identical.
If you really want to differentiate using the prefix (although you really, really shouldn't be doing it), you can use a non namespace-aware parser and just treat the prefix as if it was part of the element name.
But ideally you should read a tutorial on how namespaces work and try to use them as they were designed to be used.
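As a point of comparison, filtering child elements by namespace URI in a single expression can be sketched with the JDK's javax.xml.xpath API. This is not XmlBeans' selectPath, whose XPath engine may behave differently; NamespaceFilter is a made-up name, and the path /*/*/* simply assumes the three-level layout of the sample document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NamespaceFilter {
    /** Returns the local names of third-level elements that live in the given namespace. */
    static List<String> formsIn(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespace-uri() to be meaningful
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Assumes the URI contains no single quote, since it is spliced into the expression.
            String expression = "/*/*/*[namespace-uri() = '" + namespaceUri + "']";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);

            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getLocalName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```

The predicate compares namespace URIs, so it keeps working if the document author picks a different prefix for the same namespace.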

How to change values of some elements and attributes in an XML file [Java]?

I'm reading an XML file with a SAX parser (this part can be changed if there's a good reason for it).
When I find necessary properties I need to change their values and save the resulting XML-file as a new file.
How can I do that?

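AFAIK, SAX is a parser only. You must choose a different library to write XML.
If you are only changing attributes or element names and NOT changing the structure of the XML, then this should be a relatively easy task. Use StAX as a writer:
// Start StAX
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);
Now, extend the SAX DefaultHandler:
@Override
public void startDocument() throws SAXException {
    try {
        writer.writeStartDocument("UTF-8", "1.0");
    } catch (XMLStreamException e) {
        throw new SAXException(e); // handler methods may only throw SAXException
    }
}
@Override
public void startElement(String namespaceURI, String localName,
        String qName, Attributes atts) throws SAXException {
    try {
        // qName is always populated; localName is empty unless the parser is namespace-aware
        writer.writeStartElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            writer.writeAttribute(atts.getQName(i), atts.getValue(i));
        }
    } catch (XMLStreamException e) {
        throw new SAXException(e);
    }
}
@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    try {
        writer.writeEndElement();
    } catch (XMLStreamException e) {
        throw new SAXException(e);
    }
}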
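Putting the pieces together, a self-contained sketch of the read-with-SAX, write-with-StAX approach might look like the following. AttributeRewriter, attrName and newValue are hypothetical names for the example; it assumes a non-namespace-aware parser (the JAXP default), so qualified names and xmlns attributes are copied through as plain text, and it ignores comments and processing instructions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Copies SAX events to a StAX writer, replacing the value of one attribute. */
class AttributeRewriter extends DefaultHandler {
    private final XMLStreamWriter writer;
    private final String attrName; // hypothetical: attribute whose value is replaced
    private final String newValue; // hypothetical: replacement value

    AttributeRewriter(XMLStreamWriter writer, String attrName, String newValue) {
        this.writer = writer;
        this.attrName = attrName;
        this.newValue = newValue;
    }

    @Override
    public void startDocument() throws SAXException {
        try { writer.writeStartDocument(); } catch (XMLStreamException e) { throw new SAXException(e); }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        try {
            writer.writeStartElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                String name = atts.getQName(i);
                writer.writeAttribute(name, name.equals(attrName) ? newValue : atts.getValue(i));
            }
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try { writer.writeCharacters(ch, start, length); } catch (XMLStreamException e) { throw new SAXException(e); }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        try { writer.writeEndElement(); } catch (XMLStreamException e) { throw new SAXException(e); }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            writer.writeEndDocument();
            writer.flush();
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }

    /** Parses xml and returns a copy in which every attrName attribute has newValue. */
    static String rewrite(String xml, String attrName, String newValue) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new AttributeRewriter(writer, attrName, newValue));
            return out.toString();
        } catch (XMLStreamException | ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not rewrite document", e);
        }
    }
}
```

The same handler works for files by handing the factory a FileOutputStream instead of a StringWriter; nothing is held in memory beyond the current event.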

If your document is relatively small, I'd recommend using JDOM. You can instantiate a SAXBuilder to create the Document from an InputStream, then use XPath to find the nodes/attributes you want to change, make your modifications, and then use XMLOutputter to write the modified document back out.
On the other hand, if your document is too large to effectively hold in memory (or you'd prefer not to use a 3rd party library), you'll want to stick with your SAX parser, streaming the nodes out to disk as you read them and making any changes on the way.
You may also want to take a look at XSLT.

Jumping between XML tags

This is a question about SAX.
I want to process the child tags in an XML file only if they are under a matching parent tag.
For ex:
<version>
<parent tag-1>
<tag 1>
<tag 2>
</parent tag-1 >
<parent tag-2>
<tag 1>
<tag 2>
</parent tag-2>
</version>
In the above code, I want to match the parent tag first (i.e. parent tag-1 or parent tag-2, based on user input) and only then process the child tags under it.
Can this be done in SAX parser, keeping in mind that SAX has limited control over DOM and that I am a novice in both SAX and Java? If so, could you please quote the corresponding method?
TIA

Sure, it can be done easily by remembering the parent tag.
In general, when parsing XML tags, people use a stack to keep track of the ancestry of those tags. Your case can be solved easily with the following code:
// Tag and ParentTag are placeholder classes for your own tag model.
// If the parser is not namespace-aware, compare against qName instead of localName.
Stack<Tag> tagStack = new Stack<Tag>();
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    if (localName.toLowerCase().equals("parent")) {
        tagStack.push(new ParentTag());
    } else if (localName.toLowerCase().equals("tag")) {
        if (!tagStack.isEmpty() && tagStack.peek() instanceof ParentTag) {
            // do your things here only when the parent tag is "parent"
        }
    }
}
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.toLowerCase().equals("parent")) {
        tagStack.pop();
    }
}
Or you can simply remember which tag you are in by updating tagName:
String tagName = null;
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    if (localName.toLowerCase().equals("parent")) {
        tagName = "parent";
    } else if (localName.toLowerCase().equals("tag")) {
        if (tagName != null && tagName.equals("parent")) {
            // do your things here only when the parent tag is "parent"
        }
    }
}
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    // reset only when the parent itself closes, so later siblings still match
    if (localName.toLowerCase().equals("parent")) {
        tagName = null;
    }
}
But I prefer the stack way, because it keeps track of all your ancestor tags.
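A compact, runnable variant of the stack idea is sketched below. It keeps a stack of open element names instead of Tag objects; ChildCollector and wantedParent are names invented for the example, and it compares qName because the default JAXP parser is not namespace-aware.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the names of elements whose direct parent is wantedParent. */
class ChildCollector extends DefaultHandler {
    private final Deque<String> openElements = new ArrayDeque<>();
    private final List<String> children = new ArrayList<>();
    private final String wantedParent;

    ChildCollector(String wantedParent) {
        this.wantedParent = wantedParent;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // peek() is the innermost open element, i.e. the parent of this one
        // (null at the root, which never equals wantedParent).
        if (wantedParent.equals(openElements.peek())) {
            children.add(qName); // "process" the child; here we just record it
        }
        openElements.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
    }

    static List<String> childrenOf(String xml, String wantedParent) {
        ChildCollector collector = new ChildCollector(wantedParent);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return collector.children;
    }
}
```

Every element is pushed and popped, so the check stays correct for arbitrarily nested documents, which is the advantage of the stack over a single remembered name.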

SAX is going to spool through the entire document anyway, if you're looking at doing this for performance reasons.
However, from a code niceness perspective, you could have the SAX parser not return the non-matching children, by wiring it up with an XMLFilter. You'd probably still have to write the logic yourself - something like that provided in Wing C. Chen's post - but instead of putting it on your application logic you could abstract it out into a filter implementation.
This would let you reuse the filtering logic more easily, and it would probably make your application code cleaner and easier to follow.
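A filter along those lines could be sketched with org.xml.sax.helpers.XMLFilterImpl as below. ParentFilter and forwardedElements are names made up for the example; the filter forwards only events that occur strictly inside the chosen parent element and swallows everything else, so the downstream handler needs no parent-tracking logic of its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Passes on only the content nested inside elements named wantedParent. */
class ParentFilter extends XMLFilterImpl {
    private final String wantedParent;
    private int depth = 0; // 0 = outside wantedParent, 1 = directly inside it, and so on

    ParentFilter(String wantedParent) {
        this.wantedParent = wantedParent;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0) {
            depth++;
            super.startElement(uri, localName, qName, atts); // forward descendants
        } else if (qName.equals(wantedParent)) {
            depth = 1; // entered the parent; the parent tag itself is not forwarded
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 1) {
            depth--;
            super.endElement(uri, localName, qName);
        } else if (depth == 1) {
            depth = 0; // left the parent
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            super.characters(ch, start, length);
        }
    }

    /** Demo helper: names of the elements the downstream handler gets to see. */
    static List<String> forwardedElements(String xml, String wantedParent) {
        final List<String> seen = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ParentFilter filter = new ParentFilter(wantedParent);
            filter.setParent(reader);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                        Attributes atts) {
                    seen.add(qName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return seen;
    }
}
```

The parser still reads the whole document, as noted above; the filter only changes which events reach the application handler.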

The solution proposed by @Wing C. Chen is more than decent, but in your case, I wouldn't use a stack.
A use case for a stack when parsing XML
A common use case for a stack and XML is, for example, verifying that XML tags are balanced when using your own lexer (i.e. a hand-made XML parser with error tolerance).
A concrete example of it would be building the outline of an XML document for the Eclipse IDE.
When to use SAX, pull parsers and the like
You need memory efficiency when parsing a huge XML file
You don't need to navigate back and forth in the document
However, using SAX to parse complex documents can become tedious, especially if you want to apply operations to nodes based on some conditions.
When to use DOM-like APIs
You want easy access to the nodes
You want to navigate back and forth in the document at any time
Speed is not the main requirement compared with development time/readability/maintenance
My recommendation
If you don't have a huge XML, use a DOM-like API and select the nodes with XPath.
I prefer Dom4J personally, but I don't mind other APIs such as JDom or even Xpp3, which has XPath support.
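For a small document, the DOM-plus-XPath route can be sketched with JDK classes alone (this uses javax.xml.xpath, not Dom4J, JDom or Xpp3). XPathChildren is a made-up name, and the expression assumes the layout from the question: a version root with the candidate parents directly beneath it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPathChildren {
    /** Returns the names of the child elements of /version/<parent>. */
    static List<String> childNames(String xml, String parent) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // parent is assumed to be a plain element name chosen by the user.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/version/" + parent + "/*", doc, XPathConstants.NODESET);

            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("could not select children", e);
        }
    }
}
```

The selection logic collapses into one expression, at the cost of loading the whole document into memory first.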

The SAX Parser will call a method in your implementation, every time it hits a tag. If you want different behavior depending on the parent, you have to save it to a variable.

If you want to jump to particular tags then you would need to use a DOM parser. This will read the entire document into memory and then provide various ways of accessing particular nodes of the tree, such as requesting a tag by name then asking for the children of that tag.
So if you are not restricted to SAX then I would recommend DOM. I think the main reason for using SAX over DOM is that DOM requires more memory since the entire document is loaded at once.
