Create hierarchy in Sax Parser JAVA - java

I have written a SAX parser that parses an XML file and prints the tags to the console.
The problem is that the output does not reflect the hierarchy.
Look at this:
-------------------<GOT>
-------------------<character>
-------------------<id>
-------------------<name>
----------------------->Arya Stark
-------------------<gender>
----------------------->Female
-------------------<culture>
----------------------->Northmen
-------------------<born>
----------------------->In 289 AC, at Winterfell
-------------------<died>
-------------------<alive>
----------------------->TRUE
-------------------<titles>
-------------------<title>
----------------------->Princess
For example, character and id appear on the same level. Any ideas on how to detect whether a tag is a child of another?
Thanks!
public class Sax extends DefaultHandler {
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
System.out.println("-------------------<" + qName + ">");
}
public void characters(char ch[], int start, int length)
throws SAXException {
if( new String(ch,start,length).matches(".*[a-zA-Z0-9]+.*")){
System.out.println("----------------------->" + new String(ch, start, length));
} else {
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
System.out.println("</" + qName + ">");
}
}
This is the code of the SAX handler. I need a way to detect whether a tag has a child.
I am currently reading about SAX parsers, so if I find out I will post it!
package sax;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
public class ParseXMLFileSax {
private static final String xmlFilePath = "got.xml";
public static void main(String argv[]) {
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
saxParser.parse(xmlFilePath, new Sax());
} catch (Exception e) {
e.printStackTrace();
}
}
}
This class creates the parser with newSAXParser() and runs it with an instance of the Sax handler.

SAX is just a stream of events, so you have to maintain state in the handler to implement the logic you want. E.g. here a bunch of boolean flags is used:
How can I parse nested elements in SAX Parser in java?
It is not clear from your question what exactly your goal is.
If you just want to indent the tags in the output, you could keep an integer variable for the indentation level, incrementing it on element start and decrementing it on element end.
Try to find a tutorial and follow it, e.g. here https://www.informit.com/articles/article.aspx?p=26351&seqNum=5
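The indentation counter described above can be sketched as follows. This is only a minimal illustration, not the asker's final code: the class name IndentingHandler and the helper render() are made up for the example, and the output is collected in a StringBuilder instead of being printed directly.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: writes every tag indented by its nesting depth.
class IndentingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // number of currently open ancestor elements

    private void indent() {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        indent();
        out.append('<').append(qName).append(">\n");
        depth++; // everything up to the matching endElement is a descendant
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Note: a long text node may arrive in several chunks; each chunk gets its own line here.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            indent();
            out.append(text).append('\n');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--; // back on the parent's level
        indent();
        out.append("</").append(qName).append(">\n");
    }

    String result() {
        return out.toString();
    }

    // Convenience helper (an assumption for this example): parse a string and return the indented output.
    static String render(String xml) {
        try {
            IndentingHandler handler = new IndentingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.result();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render("<GOT><character><name>Arya Stark</name></character></GOT>"));
    }
}
```

At any moment the value of depth tells you how many ancestors the current element has, so an element started while depth is greater than zero is a child of the most recently opened, still unclosed element.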

Related

SAX - Read HTML content without CDATA

I'm using a SAX parser in Java, and that is mandatory. I need to parse an XML file containing HTML tags that I must read as content, and I can't use CDATA because I can't change the XML file. The XML file looks something like this:
<start id="123">
<tag1>text1</tag1>
<tag2>
This is an example
<span>
text inside an HTML tag
</span>
<p>
ABCDEFG<b>HIJK</b>LMNOP
</p>
</tag2>
</start>
What I need is that when I get the content of tag2, the content must be:
This is an example
<span>text inside an HTML tag</span>
<p>ABCDEFG<b>HIJK</b>LMNOP</p>
This is a test that I did, and the content doesn't show the HTML tags:
boolean istag2 = false;
StringBuilder text = new StringBuilder();
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equals("tag2")) {
istag2 = true;
}
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if (qName.equals("tag2")) {
istag2 = false;
String fullText = text.toString();
System.out.println("tag2 full_text: " + fullText);
}
}
public void characters(char ch[], int start, int length) throws SAXException {
if (istag2) {
text.append(new String(ch, start, length));
}
}
Thanks in advance
OK, I think I might understand where your expectations are wrong. I think you might be expecting that the strings "<span>" and "<p>" are passed to your application by calls on the characters() method. But that's not what happens: they are passed by calls on startElement() and endElement(). If you want to build up a string containing these tags in lexical form, you will need to do something like:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equals("tag2")) {
inTag2 = true;
} else if (inTag2) {
text.append("<" + qName);
// TODO: serialize any attributes
text.append(">");
}
}
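To round this off, here is a hedged sketch of a complete handler that also writes the attributes and the closing tags. The class name Tag2Handler and the helper extract() are invented for the example. It assumes tag2 is never nested inside itself, and it does not re-escape characters such as & or <, which the parser delivers already decoded.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: rebuilds the markup found inside <tag2> as a string.
class Tag2Handler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inTag2 = false;
    private String captured; // content of the last <tag2> seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("tag2")) {
            inTag2 = true;
            text.setLength(0);
        } else if (inTag2) {
            text.append('<').append(qName);
            // Serialize the attributes of the nested element.
            for (int i = 0; i < attributes.getLength(); i++) {
                text.append(' ').append(attributes.getQName(i))
                    .append("=\"").append(attributes.getValue(i)).append('"');
            }
            text.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTag2) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("tag2")) {
            inTag2 = false;
            captured = text.toString();
        } else if (inTag2) {
            text.append("</").append(qName).append('>');
        }
    }

    // Convenience helper (an assumption for this example).
    static String extract(String xml) {
        try {
            Tag2Handler handler = new Tag2Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.captured;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<start id=\"123\"><tag2>This is an example <p>ABCDEFG<b>HIJK</b>LMNOP</p></tag2></start>"));
    }
}
```

Whitespace inside tag2 is kept exactly as it appears in the file, so trim or normalize the result if you need the compact form shown in the question.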

Why are some characters missing when I parse an XML tag using SaxParser?

I am parsing an XML response of almost 90000 characters in my Android application using SaxParser. The XML looks like the following:
<Registration>
<Client>
<Name>John</Name>
<ID>1</ID>
<Date>2013:08:22T03:43:44</Date>
</Client>
<Client>
<Name>James</Name>
<ID>2</ID>
<Date>2013:08:23T16:28:00</Date>
</Client>
<Client>
<Name>Eric</Name>
<ID>3</ID>
<Date>2013:08:23T19:04:15</Date>
</Client>
.....
</Registration>
Sometimes the parser misses some characters from the Date tag: instead of returning 2013:08:23T19:04:15 it returns 2013:08:23T. I tried to strip all whitespace from the response XML string using the following line of code:
responseStr = responseStr.replaceAll("\\s","");
But then I get the following exception:
Parsing exception: org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 16: not well-formed (invalid token)
The following is the code I am using for parsing:
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
tagName = qName;
}
public void endElement(String uri, String localName, String qName) throws SAXException {
}
public void characters(char ch[], int start, int length) throws SAXException {
if(tagName.equals("Name")){
obj = new RegisteredUser();
String str = new String(ch, start, length);
obj.setName(str);
}else if(tagName.equals("ID")){
String str = new String(ch, start, length);
obj.setId(str);
}else if(tagName.equals("Date")){
String str = new String(ch, start, length);
obj.setDate(str);
users.add(obj);
}
}
public void startDocument() throws SAXException {
System.out.println("document started");
}
public void endDocument() throws SAXException {
System.out.println("document ended");
}
};
saxParser.parse(new InputSource(new StringReader(resp)), handler);
}catch(Exception e){
System.out.println("Parsing exception: "+e);
System.out.println("exception");
}
Any idea why the parser is skipping characters from a tag, and how I can solve this problem? Thanks in advance.
It's possible that characters() is called more than once for any given text node.
In that case you'll have to concatenate the results yourself!
This happens when some internal buffer of the parser ends while there is still content left in the text node. Instead of enlarging the buffer (which could require a lot of memory when the text node is large), the parser lets the client code handle it.
You want something like this:
StringBuilder textContent = new StringBuilder();
public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
tagName = qName;
textContent.setLength(0);
}
public void characters(char ch[], int start, int length) throws SAXException {
textContent.append(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
String text = textContent.toString();
// handle text here
}
Of course this code can be improved to only track the text content for nodes you actually care about.
As others mentioned, the characters method may be called multiple times; it is up to the SAX parser implementation whether it returns all contiguous character data in a single chunk or splits it into several chunks.
See the docs: SAX Parser characters
You're incorrectly assuming that all the characters in a text node will be read at once and sent to the characters() method. It's not the case. The characters() method can be called multiple times for a single text node.
You should append all the chars to a StringBuilder and then only convert to a String or Date when endElement() is called.
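Applied to the Registration document from the question, the advice could look roughly like the sketch below. The class name ClientHandler and the helper parse() are made up, and a plain String[] of {name, id, date} stands in for the asker's RegisteredUser class, which is not shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text and only reads it when the element ends.
class ClientHandler extends DefaultHandler {
    final List<String[]> clients = new ArrayList<>(); // each entry is {name, id, date}
    private final StringBuilder buffer = new StringBuilder();
    private String name;
    private String id;
    private String date;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        buffer.setLength(0); // start collecting fresh text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called several times per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = buffer.toString().trim(); // the complete text is available only now
        switch (qName) {
            case "Name":
                name = value;
                break;
            case "ID":
                id = value;
                break;
            case "Date":
                date = value;
                break;
            case "Client":
                clients.add(new String[] {name, id, date});
                break;
            default:
                break;
        }
        buffer.setLength(0); // drop whitespace between sibling elements
    }

    // Convenience helper (an assumption for this example).
    static ClientHandler parse(String xml) {
        try {
            ClientHandler handler = new ClientHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        ClientHandler h = parse("<Registration><Client><Name>Eric</Name><ID>3</ID>"
                + "<Date>2013:08:23T19:04:15</Date></Client></Registration>");
        System.out.println(h.clients.get(0)[2]);
    }
}
```

The sketch matches on qName, which the default desktop JDK parser fills in; on parsers that leave qName empty you would match on localName instead.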

Is there a way to use the Visitor pattern using a SAX Parser?

I'm curious about this: I need to use a SAX parser for efficiency (it's a big file). Usually I use something like this:
public class Example extends DefaultHandler
{
private Stack stack = new Stack ();
public void startElement (String uri, String local, String qName, Attributes atts) throws SAXException
{
stack.push (qName);
}
public void endElement (String uri, String local, String qName) throws SAXException
{
if ("line".equals (qName))
System.out.println ();
stack.pop ();
}
public void characters (char buf [], int offset, int length) throws SAXException
{
if (!"line".equals (stack.peek ()))
return;
System.out.print (new String (buf, offset, length));
}
}
example taken from here.
SAX is already an implementation of the Visitor pattern, but in my case I just need to take the content of every element and do something with it according to the nature of the element itself.
My typical XML file is something like:
<?xml version="1.0" encoding="utf-8"?>
<labs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<auth>
<uid> </uid>
<gid> </gid>
<key> </key>
</auth>
<campaign>
<sms>
<newsletter>206</newsletter>
<message>
<from>Da Definire</from>
<subject>Da definire</subject>
<body><![CDATA[Testo Da Definire]]></body>
</message>
<delivery method="manual"></delivery>
<recipients>
<db>276</db>
<filter>
<test>1538</test>
</filter>
<new_recipients>
<csv_file>Corso2012_SMS.csv</csv_file>
</new_recipients>
</recipients>
</sms>
</campaign>
</labs>
When I'm in the csv_file node I need to take the filename and upload users from that file, if I'm in the filter/test I need to check if the filter exists and so on.
Is there a way to apply the Visitor Pattern with SAX?
You could simply have a Map<String, ElementHandler> in your SAX parser, and allow registering ElementHandlers for element names. Supposing that you're only interested in leaf elements:
each time an element starts, you look if there is a handler for this element name in the map, and you clear a buffer.
each time characters() is called, you append the characters to the buffer (if there was a handler for the previous element start)
each time an element is ended, if there was a handler for the previous element start, you call the handler with the content of the buffer
Here's an example:
private ElementHandler currentHandler;
private StringBuilder buffer = new StringBuilder();
private Map<String, ElementHandler> handlers = new HashMap<String, ElementHandler>();
public void registerHandler(String qName, ElementHandler handler) {
handlers.put(qName, handler);
}
public void startElement (String uri, String local, String qName, Attributes atts) throws SAXException {
currentHandler = handlers.get(qName);
buffer.delete(0, buffer.length());
}
public void characters (char buf [], int offset, int length) throws SAXException {
if (currentHandler != null) {
buffer.append(buf, offset, length);
}
}
public void endElement (String uri, String local, String qName) throws SAXException {
if (currentHandler != null) {
currentHandler.handle(buffer.toString());
}
}
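The answer does not show ElementHandler itself, so the following self-contained sketch assumes it is a one-method interface. The names DispatchingHandler and demo() are invented for the example. One deliberate difference from the snippet above: currentHandler is reset after each call, so the end of an enclosing element does not trigger the leaf's handler a second time.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Assumed shape of the callback; the original answer leaves it undefined.
interface ElementHandler {
    void handle(String content);
}

// Hypothetical dispatcher: routes the text of leaf elements to registered callbacks.
class DispatchingHandler extends DefaultHandler {
    private final Map<String, ElementHandler> handlers = new HashMap<>();
    private final StringBuilder buffer = new StringBuilder();
    private ElementHandler currentHandler;

    void registerHandler(String qName, ElementHandler handler) {
        handlers.put(qName, handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        currentHandler = handlers.get(qName); // null when nobody is interested
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        if (currentHandler != null) {
            buffer.append(buf, offset, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentHandler != null) {
            currentHandler.handle(buffer.toString());
            currentHandler = null; // do not fire again when the parent element ends
        }
    }

    // Usage example: record which callbacks fired, in document order.
    static List<String> demo(String xml) {
        List<String> seen = new ArrayList<>();
        DispatchingHandler dispatcher = new DispatchingHandler();
        dispatcher.registerHandler("csv_file", content -> seen.add("csv_file=" + content));
        dispatcher.registerHandler("test", content -> seen.add("test=" + content));
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), dispatcher);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo("<recipients><filter><test>1538</test></filter>"
                + "<new_recipients><csv_file>Corso2012_SMS.csv</csv_file></new_recipients></recipients>"));
    }
}
```

In a real application the lambdas would be replaced by objects that upload the users from the CSV file or check that the filter exists.</filter>\n<new_recipients>\n<csv_file>Corso2012_SMS.csv</csv_file>\n</new_recipients></recipients>");
if (!seen.toString().equals("[test=1538, csv_file=Corso2012_SMS.csv]")) throw new AssertionError(seen.toString());
</test>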
Don't forget StAX. It probably won't make the Visitor pattern any easier, but if your documents are relatively simple and you're already planning on streaming them, it does have a simpler programming model than SAX. You just iterate over the events in the parsed stream, one at a time, ignoring or acting on them as you choose.
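For comparison, a pull-style loop with the standard javax.xml.stream API might look like the sketch below. The class name StaxExample and the method textOf() are invented for the illustration, and it assumes the elements of interest contain only text, because getElementText() throws if it meets a child element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: collect the text of every element with a given name.
class StaxExample {
    static List<String> textOf(String xml, String elementName) {
        List<String> values = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // we ask for the next event; nothing is pushed at us
                if (event == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // Reads up to the matching end tag and returns the complete text.
                    values.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(textOf(
                "<new_recipients><csv_file>Corso2012_SMS.csv</csv_file></new_recipients>", "csv_file"));
    }
}
```

Because the code drives the loop itself, there is no handler state to keep between callbacks, which is where most of the simplification comes from.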

xml parsing using SAXParser

I am working on an application that uses SAX parsing. To get the city and state name from a latitude and longitude I'm using the Google API. Google API URL: google api
I want to get long_name, short_name and type from the address_component tag.
I am getting all the information from this XML successfully, but there is a problem when I try to get the type tag value: there are two type tags in this element, and I always get the value of the second one.
Sample XML:
<address_component>
<long_name>Gujarat</long_name>
<short_name>Gujarat</short_name>
<type>administrative_area_level_1</type>
<type>political</type>
</address_component>
How can I get both type tag values, administrative_area_level_1 as well as political?
I came across the following link, which makes it really easy to get started:
http://javarevisited.blogspot.com/2011/12/parse-read-xml-file-java-sax-parser.html
I put the data into a file named location.xml (if you get it from the web, use your own logic to fetch the data, convert it to an InputStream and pass it to the following code). I wrote a method for this:
public void ReadAndWriteXMLFileUsingSAXParser(){
try
{
DefaultHandler handler = new MyHandler();
// parseXmlFile("infilename.xml", handler, true);
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
InputStream rStream = null;
rStream = getClass().getResourceAsStream("location.xml");
saxParser.parse(rStream, handler);
}catch (Exception e)
{
System.out.println(e.getMessage());
}
}
This is the MyHandler class. Your final data is stored in a vector called "data".
class MyHandler extends DefaultHandler {
String rootname;Attributes atr;
private boolean flag=false;private Vector data;
public void startElement(String namespaceURI, String localName,
String qName, Attributes atts) {
rootname=localName;
atr=atts;
if(rootname.equalsIgnoreCase("address_component")){
data=new Vector();
flag=true;
}
}
public void characters(char[] ch, int start, int length){
String value=new String(ch,start,length);
if(flag)
{
if(rootname.equalsIgnoreCase("type")){
data.addElement(value) ;
System.out.println("++++++++++++++"+value);
}
if(rootname.equalsIgnoreCase("long_name")){
data.addElement(value) ;
System.out.println("++++++++++++++"+value);
}
if(rootname.equalsIgnoreCase("short_name")){
data.addElement(value) ;
System.out.println("++++++++++++++"+value);
}
}
}
public void endElement(String uri, String localName, String qName){
rootname=localName;
if(rootname.equalsIgnoreCase("address_component")){
flag=false;
}
}
}
All the data ends up in the data vector, and it is also printed on the console as:
++++++++++++++Gujarat
++++++++++++++Gujarat
++++++++++++++administrative_area_level_1
++++++++++++++political
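A variation on the handler above, given only as a sketch: it buffers the text until the element ends and keeps every type value in a list, so neither value overwrites the other. The names AddressComponentHandler, nameOf() and parse() are invented, and for brevity the sketch handles a single address_component.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps all <type> values of one <address_component>.
class AddressComponentHandler extends DefaultHandler {
    final List<String> types = new ArrayList<>();
    String longName;
    String shortName;
    private final StringBuilder buffer = new StringBuilder();
    private boolean inComponent = false;

    // Some parsers fill localName, others qName; use whichever is present.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("address_component".equals(nameOf(localName, qName))) {
            inComponent = true;
        }
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = nameOf(localName, qName);
        String value = buffer.toString().trim();
        if (inComponent) {
            if ("type".equals(name)) {
                types.add(value); // called once per <type>, so both values are kept
            } else if ("long_name".equals(name)) {
                longName = value;
            } else if ("short_name".equals(name)) {
                shortName = value;
            } else if ("address_component".equals(name)) {
                inComponent = false;
            }
        }
        buffer.setLength(0);
    }

    // Convenience helper (an assumption for this example).
    static AddressComponentHandler parse(String xml) {
        try {
            AddressComponentHandler handler = new AddressComponentHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<address_component><long_name>Gujarat</long_name>"
                + "<type>administrative_area_level_1</type><type>political</type>"
                + "</address_component>").types);
    }
}
```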
Read this tutorial. This will help you to parse xml file using sax parser.

SAX parsing problem in Android... empty elements?

I am using SAX to parse an XML file I'm pulling from the web. I've extended DefaultHandler with code similar to:
public class ArrivalHandler extends DefaultHandler {
@Override
public void startElement(String namespaceUri, String localName, String qualifiedName, Attributes attributes) throws SAXException {
if (qualifiedName.equalsIgnoreCase("resultSet")) {
System.out.println("got a resultset");
} else if (qualifiedName.equalsIgnoreCase("location")) {
System.out.println("got a location");
} else if (qualifiedName.equalsIgnoreCase("arrival")) {
System.out.println("got an arrival");
} else {
System.out.println("There was an unknown XML element encountered: '" + qualifiedName + "'");
}
}
@Override
public void endElement(String namespaceUri, String localName, String qualifiedName) throws SAXException {
// we'll just ignore this for now
}
@Override
public void characters(char[] chars, int startIndex, int length) throws SAXException {
// ignore this too
}
}
The problem I'm having is that I'm just getting a series of empty elements. The log reads:
There was an unknown XML element encountered: ''
There was an unknown XML element encountered: ''
There was an unknown XML element encountered: ''
etc
This worked fine when I was just passing parser.parse a local file, but now I'm pulling it from the web with:
HttpClient httpClient = new DefaultHttpClient();
HttpResponse resp = httpClient.execute(new HttpGet("http://example.com/whatever"));
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
ArrivalHandler handler = new ArrivalHandler();
SAXParser parser = saxFactory.newSAXParser();
parser.parse(resp.getEntity().getContent(), handler);
and I get the (apparently) empty results described above.
What I've looked into so far:
I converted the InputStream from resp.getEntity().getContent() to a string and dumped it out and it looks like I'm getting the XML from the server correctly.
There are no exceptions thrown but there is a warning that reads "W/ExpatReader(232): DTD handlers aren't supported.".
Any other ideas for what I'm doing incorrectly or how to debug this?
From the docs for ContentHandler.startElement:
the qualified name is required when the namespace-prefixes property is true, and is optional when the namespace-prefixes property is false (the default).
So, do you have the namespace-prefixes property set to true?
Can you just cope with the uri and localName instead?
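One way to cope, sketched here under the assumption that you control the parser setup: make the factory namespace-aware so that localName is populated, and fall back to qName only when localName is empty. The names NameLoggingHandler and elementNames() are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records element names without relying on qName alone.
class NameLoggingHandler extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String namespaceUri, String localName, String qualifiedName,
                             Attributes attributes) {
        // Prefer the local name; use the qualified name only if the parser left localName empty.
        String name = (localName == null || localName.isEmpty()) ? qualifiedName : localName;
        names.add(name);
    }

    // Convenience helper (an assumption for this example).
    static List<String> elementNames(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // asks the parser to report uri and localName
            NameLoggingHandler handler = new NameLoggingHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<resultSet><location/><arrival/></resultSet>"));
    }
}
```

With this fallback the same handler behaves the same way whether the parser fills in qName, localName or both.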
