SAX - Read HTML content without CDATA


I'm using a SAX parser in Java, and that is mandatory. I need to parse an XML file containing HTML tags that I must read as content, and I can't use CDATA because I can't change the XML file. The XML file looks something like this:
<start id="123">
<tag1>text1</tag1>
<tag2>
This is an example
<span>
text inside an HTML tag
</span>
<p>
ABCDEFG<b>HIJK</b>LMNOP
</p>
</tag2>
</start>
What I need is that when I get the content of tag2, the content must be:
This is an example
<span>text inside an HTML tag</span>
<p>ABCDEFG<b>HIJK</b>LMNOP</p>
This is a test that I did, but the content doesn't include the HTML tags:
boolean istag2 = false;
StringBuilder text = new StringBuilder();
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equals("tag2")) {
istag2 = true;
}
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if (qName.equals("tag2")) {
istag2 = false;
String fullText = text.toString();
System.out.println("tag2 full_text: " + fullText);
}
}
public void characters(char ch[], int start, int length) throws SAXException {
if (istag2) {
text.append(new String(ch, start, length));
}
}
Thanks in advance

OK, I think I understand where your expectations go wrong. You may be expecting the strings "<span>" and "<p>" to be passed to your application through calls to the characters() method. That's not what happens: they are reported through calls to startElement() and endElement(). If you want to build up a string containing these tags in lexical form, you will need to do something like this in startElement(), with a matching branch in endElement() for the closing tags:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
System.out.println("Start Element :" + qName);
if (qName.equals("tag2")) {
istag2 = true;
} else if (istag2) {
// Inside tag2: re-create the start tag in lexical form.
text.append("<" + qName);
// TODO: serialize any attributes
text.append(">");
}
}
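The snippet above only covers start tags. A fuller sketch of the same idea, which also writes attributes and end tags, might look like the following. The class name Tag2Collector and the helper methods collect() and escape() are made up for this illustration, and the escaping is deliberately minimal. Note that whitespace inside tag2 is kept exactly as it appears in the file, so the result would still need trimming to match the output shown in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: rebuilds the markup found inside <tag2> as a string.
class Tag2Collector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inTag2 = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("tag2")) {
            inTag2 = true;
        } else if (inTag2) {
            // Re-create the start tag, including its attributes.
            text.append('<').append(qName);
            for (int i = 0; i < attributes.getLength(); i++) {
                text.append(' ').append(attributes.getQName(i))
                    .append("=\"").append(escape(attributes.getValue(i))).append('"');
            }
            text.append('>');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("tag2")) {
            inTag2 = false;
        } else if (inTag2) {
            text.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTag2) {
            // The parser hands over decoded text, so re-escape it.
            text.append(escape(new String(ch, start, length)));
        }
    }

    // Minimal escaping so that the collected string stays well-formed.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    String getText() {
        return text.toString();
    }

    // Convenience helper (an assumption of this sketch): parse a string, return tag2's content.
    static String collect(String xml) throws Exception {
        Tag2Collector handler = new Tag2Collector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }
}
```

This relies on the default, non-namespace-aware SAXParserFactory, where qName carries the element name.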

Related

Create hierarchy in Sax Parser JAVA

I have done a sax parser that parses a xml file and prints the tags on the console.
The problem is that they don't follow a hierarchy.
Look at this:
-------------------<GOT>
-------------------<character>
-------------------<id>
-------------------<name>
----------------------->Arya Stark
-------------------<gender>
----------------------->Female
-------------------<culture>
----------------------->Northmen
-------------------<born>
----------------------->In 289 AC, at Winterfell
-------------------<died>
-------------------<alive>
----------------------->TRUE
-------------------<titles>
-------------------<title>
----------------------->Princess
For example, character and id are on the same level. Any idea how to detect whether a tag is a child of another?
Thanks!
public class Sax extends DefaultHandler {
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
System.out.println("-------------------<" + qName + ">");
}
public void characters(char ch[], int start, int length)
throws SAXException {
if( new String(ch,start,length).matches(".*[a-zA-Z0-9]+.*")){
System.out.println("----------------------->" + new String(ch, start, length));
} else {
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
System.out.println("</" + qName + ">");
}
}
This is the code of the SAX handler. I need a way to detect whether a tag has a child.
I am currently reading about SAX parsers, so if I find out I will post it!
package sax;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
public class ParseXMLFileSax {
private static final String xmlFilePath = "got.xml";
public static void main(String argv[]) {
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
saxParser.parse(xmlFilePath, new Sax());
} catch (Exception e) {
e.printStackTrace();
}
}
}
This class creates the parser and runs it with the Sax handler shown above.

SAX is just a stream of events, so you have to maintain state in your handler to implement the logic you want. For example, the answers to this question use a set of boolean flags:
How can I parse nested elements in SAX Parser in java?
It is not clear from your question exactly what your goal is.
If you just want to indent the tags in the output, you could keep an integer variable for the indentation level, incrementing it when an element starts and decrementing it when an element ends.
Try to find a tutorial and follow it, e.g. https://www.informit.com/articles/article.aspx?p=26351&seqNum=5
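The depth-counter idea can be sketched roughly as follows. The class name IndentingHandler and the render() helper are invented for this example, and output goes to a StringBuilder instead of System.out so the result is easy to inspect. Trimming each characters() chunk on its own is a simplification that is fine for short values like the ones in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: prints each element indented by its nesting depth.
class IndentingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // number of currently open ancestor elements

    private void indent() {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        indent();
        out.append('<').append(qName).append(">\n");
        depth++; // everything until the matching end tag is one level deeper
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--; // back to the level of the element being closed
        indent();
        out.append("</").append(qName).append(">\n");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip whitespace-only chunks between tags.
        String s = new String(ch, start, length).trim();
        if (!s.isEmpty()) {
            indent();
            out.append(s).append('\n');
        }
    }

    // Convenience helper (an assumption of this sketch).
    static String render(String xml) throws Exception {
        IndentingHandler handler = new IndentingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }
}
```

An element is a child of another exactly when its start event arrives while depth is greater than zero; replacing the counter with a stack of names would also tell you which element is the parent.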

Parsing Mixed-Content XML with SAX

I have a sample mixed-content XML document (structure cannot be modified):
<items>
<item> ABC123 <status>UPDATE</status>
<units>
<unit Description="Each ">EA <saleprice>2.99</saleprice>
<saleprice2/>
</unit>
</units>
<warehouses>
<warehouse>100<availability>2987.000</availability>
</warehouse>
</warehouses>
</item>
</items>
I am attempting to use SAX parser on this XML document, but the mixed-content elements are causing some issues. Namely, I get an empty String returned when attempting to handle the <item/> node.
My handler:
@Override
public void startElement(final String uri,
final String localName, final String qName, final Attributes attributes) throws SAXException {
final String fixedQName = qName.toLowerCase();
switch (fixedQName) {
case "item":
prod = new Product();
//prod.setItem(content); <-- doesn't work, content is empty since element just started
break;
}
}
@Override
public void endElement(final String uri, final String localName, final String qName) throws SAXException {
final String fixedQName = qName.toLowerCase();
switch (fixedQName) {
case "item":
prod.setItem(content); // <-- doesn't work either, only returns an empty string
// end element, set item
productList.add(prod);
break;
case "status":
prod.setStatus(content);
break;
// ... etc....
}
}
@Override
public void characters(final char[] ch, final int start, final int length) throws SAXException {
content = "";
content = String.copyValueOf(ch, start, length).trim();
}
This handler works correctly for everything of interest, except the <item/> element. It always returns an empty string.
If I add a println() to the characters() method to print out the content, I can see that the parser does eventually print the contents of <item/>, but later than expected (on the next characters() invocation by the parser).
Referencing http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html, I know I should attempt to aggregate the strings returned from characters(), however I don't see how this can be since I do need to retrieve the other element's data, and hard-coding an exception for the first element into the characters() method seems like the wrong approach.
How can I use SAX to retrieve the mixed-content <item/>'s data 'ABC123'?

If the item content consists only of the text before the opening tag of the status element, then you could capture the item content in startElement:
public void startElement(final String uri,
final String localName, final String qName, final Attributes attributes) throws SAXException {
final String fixedQName = qName.toLowerCase();
switch (fixedQName) {
case "item":
prod = new Product();
break;
case "status":
prod.setItem(content);
break;
}
}
To understand why, consider the flow of events:
startElement item
characters "ABC123"
startElement status
characters "UPDATE"
endElement status
characters ""
endElement item
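Following that event flow, a more general sketch is to buffer text in characters() and hand it to the innermost open element whenever a start or end tag arrives. Everything named here (MixedContentHandler, flush(), the values map, parse()) is hypothetical, and keying the map by element name only works for a single item; real code would populate a Product per item instead.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the first piece of direct text of each element.
class MixedContentHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();   // currently open elements
    private final StringBuilder buffer = new StringBuilder(); // text seen since the last tag
    private final Map<String, String> values = new LinkedHashMap<>();

    // Assign the buffered text to the innermost open element, then reset the buffer.
    private void flush() {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (!text.isEmpty() && !path.isEmpty()) {
            values.putIfAbsent(path.peek(), text);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        flush();           // text before a child tag belongs to the parent
        path.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();           // text before an end tag belongs to the closing element
        path.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called several times per text node
    }

    // Convenience helper (an assumption of this sketch).
    static Map<String, String> parse(String xml) throws Exception {
        MixedContentHandler handler = new MixedContentHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

With the document from the question, the text " ABC123 " is flushed when the status start tag arrives, at which point item is still the innermost open element.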

Why some characters are missing when i parse a xml tag using SaxParser?

I am parsing an XML response of almost 90,000 characters in my Android application using SAXParser. The XML looks like the following:
<Registration>
<Client>
<Name>John</Name>
<ID>1</ID>
<Date>2013:08:22T03:43:44</Date>
</Client>
<Client>
<Name>James</Name>
<ID>2</ID>
<Date>2013:08:23T16:28:00</Date>
</Client>
<Client>
<Name>Eric</Name>
<ID>3</ID>
<Date>2013:08:23T19:04:15</Date>
</Client>
.....
</Registration>
Sometimes the parser misses some characters from the Date tag. Instead of returning 2013:08:23T19:04:15, it returns 2013:08:23T. I tried to strip all whitespace from the response XML string using the following line of code:
responseStr = responseStr.replaceAll("\\s","");
But then I get the following exception:
Parsing exception: org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 16: not well-formed (invalid token)
The following is the code I am using for parsing:
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
tagName = qName;
}
public void endElement(String uri, String localName, String qName) throws SAXException {
}
public void characters(char ch[], int start, int length) throws SAXException {
if(tagName.equals("Name")){
obj = new RegisteredUser();
String str = new String(ch, start, length);
obj.setName(str);
}else if(tagName.equals("ID")){
String str = new String(ch, start, length);
obj.setId(str);
}else if(tagName.equals("Date")){
String str = new String(ch, start, length);
obj.setDate(str);
users.add(obj);
}
}
public void startDocument() throws SAXException {
System.out.println("document started");
}
public void endDocument() throws SAXException {
System.out.println("document ended");
}
};
saxParser.parse(new InputSource(new StringReader(resp)), handler);
}catch(Exception e){
System.out.println("Parsing exception: "+e);
System.out.println("exception");
}
Any idea why the parser is skipping characters from a tag, and how I can solve this problem? Thanks in advance.

It's possible that characters() is called more than once for any given text node.
In that case you'll have to concatenate the results yourself!
This happens when an internal buffer of the parser runs out while there is still content left in the text node. Instead of enlarging the buffer (which could require a lot of memory when the text node is large), the parser lets the client code handle it.
You want something like this:
StringBuilder textContent = new StringBuilder();
public void startElement(String uri, String localName,String qName, Attributes attributes) throws SAXException {
tagName = qName;
textContent.setLength(0);
}
public void characters(char ch[], int start, int length) throws SAXException {
textContent.append(ch, start, length);
}
public void endElement(String uri, String localName, String qName) throws SAXException {
String text = textContent.toString();
// handle text here
}
Of course this code can be improved to only track the text content for nodes you actually care about.

As others mentioned, the characters() method may be called multiple times. It is up to the SAX parser implementation whether to return all contiguous character data in a single chunk or to split it into several chunks.
See the SAX documentation for ContentHandler.characters().

You're incorrectly assuming that all the characters in a text node will be read at once and sent to the characters() method. It's not the case. The characters() method can be called multiple times for a single text node.
You should append all the chars to a StringBuilder and then only convert to a String or Date when endElement() is called.
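Applied to the handler from the question, that advice could look roughly like this. The RegisteredUser class below is a stand-in with public fields (the real one in the question uses setters), and ClientHandler and getUsers() are names chosen for this sketch. It also assumes that Name, ID and Date only ever appear inside a Client element.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the RegisteredUser class from the question.
class RegisteredUser {
    String name;
    String id;
    String date;
}

// Hypothetical handler: collects text per element and assigns it in endElement().
class ClientHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<RegisteredUser> users = new ArrayList<>();
    private RegisteredUser current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // a new element starts: discard text collected so far
        if (qName.equals("Client")) {
            current = new RegisteredUser();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may run several times for one text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the element's text known to be complete.
        String value = text.toString().trim();
        switch (qName) {
            case "Name":   current.name = value; break;
            case "ID":     current.id = value;   break;
            case "Date":   current.date = value; break;
            case "Client": users.add(current);   break;
            default:       break;
        }
        text.setLength(0);
    }

    List<RegisteredUser> getUsers() {
        return users;
    }
}
```

Because the user is added to the list when the Client element closes rather than when a Date chunk arrives, a date delivered in two chunks no longer produces a truncated value or a duplicate entry.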

Is there a way to use the Visitor pattern using a SAX Parser?

I'm curious about this: I need to use a SAX parser for efficiency (it's a big file). Usually I use something like this:
public class Example extends DefaultHandler
{
private Stack<String> stack = new Stack<String> ();
public void startElement (String uri, String local, String qName, Attributes atts) throws SAXException
{
stack.push (qName);
}
public void endElement (String uri, String local, String qName) throws SAXException
{
if ("line".equals (qName))
System.out.println ();
stack.pop ();
}
public void characters (char buf [], int offset, int length) throws SAXException
{
if (!"line".equals (stack.peek ()))
return;
System.out.print (new String (buf, offset, length));
}
}
example taken from here.
SAX is already an implementation of the Visitor pattern, but in my case I just need to take the content of every element and do something with it according to the nature of the element itself.
My typical XML file is something like:
<?xml version="1.0" encoding="utf-8"?>
<labs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<auth>
<uid> </uid>
<gid> </gid>
<key> </key>
</auth>
<campaign>
<sms>
<newsletter>206</newsletter>
<message>
<from>Da Definire</from>
<subject>Da definire</subject>
<body><![CDATA[Testo Da Definire]]></body>
</message>
<delivery method="manual"></delivery>
<recipients>
<db>276</db>
<filter>
<test>1538</test>
</filter>
<new_recipients>
<csv_file>Corso2012_SMS.csv</csv_file>
</new_recipients>
</recipients>
</sms>
</campaign>
</labs>
When I'm in the csv_file node I need to take the filename and upload users from that file, if I'm in the filter/test I need to check if the filter exists and so on.
Is there a way to apply the Visitor Pattern with SAX?

You could simply have a Map<String, ElementHandler> in your SAX parser, and allow registering ElementHandlers for element names. Supposing that you're only interested in leaf elements:
Each time an element starts, you look up whether there is a handler for this element name in the map, and you clear a buffer.
Each time characters() is called, you append the characters to the buffer (if there was a handler for the previous element start).
Each time an element ends, if there was a handler for the previous element start, you call the handler with the content of the buffer.
Here's an example:
private ElementHandler currentHandler;
private StringBuilder buffer = new StringBuilder();
private Map<String, ElementHandler> handlers = new HashMap<String, ElementHandler>();
public void registerHandler(String qName, ElementHandler handler) {
handlers.put(qName, handler);
}
public void startElement (String uri, String local, String qName, Attributes atts) throws SAXException {
currentHandler = handlers.get(qName);
buffer.delete(0, buffer.length());
}
public void characters (char buf [], int offset, int length) throws SAXException {
if (currentHandler != null) {
buffer.append(buf, offset, length);
}
}
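public void endElement (String uri, String local, String qName) throws SAXException {
if (currentHandler != null) {
currentHandler.handle(buffer.toString());
// Reset so the enclosing element's end tag does not trigger the handler again.
currentHandler = null;
}
}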
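For completeness, here is a self-contained sketch of that approach. The answer does not define ElementHandler, so the single-method interface below is an assumption, as are the class name DispatchingHandler and the parse() helper. As in the answer, it only makes sense for leaf elements.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Assumed shape of the callback: receives the text content of one leaf element.
interface ElementHandler {
    void handle(String content);
}

// Hypothetical SAX handler that dispatches leaf element text to registered callbacks.
class DispatchingHandler extends DefaultHandler {
    private final Map<String, ElementHandler> handlers = new HashMap<>();
    private final StringBuilder buffer = new StringBuilder();
    private ElementHandler currentHandler;

    void registerHandler(String qName, ElementHandler handler) {
        handlers.put(qName, handler);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        currentHandler = handlers.get(qName); // null when nobody is interested
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        if (currentHandler != null) {
            buffer.append(buf, offset, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (currentHandler != null) {
            currentHandler.handle(buffer.toString());
            currentHandler = null; // do not fire again on the parent's end tag
        }
    }

    // Convenience helper (an assumption of this sketch).
    void parse(String xml) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), this);
    }
}
```

A caller could then register one callback per element of interest, for instance `handler.registerHandler("csv_file", name -> uploadUsers(name))`, where uploadUsers is whatever application method does the import.\n</filter><new_recipients>\n<csv_file>Corso2012_SMS.csv</csv_file>\n</new_recipients></recipients></labs>");
    if (!seen.equals(java.util.Arrays.asList("test:1538", "csv:Corso2012_SMS.csv"))) throw new AssertionError(seen.toString());
} catch (Exception e) {
    throw new RuntimeException(e);
}
</test>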

Don't forget StAX. It probably won't make the Visitor pattern any easier, but if your documents are relatively simple and you're already planning on streaming them, it does have a simpler programming model than SAX. You just iterate over the events in the parsed stream, one at a time, ignoring or acting on them as you choose.
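As a rough illustration of that pull-style loop, the sketch below uses the standard javax.xml.stream API to collect the text of every element with a given name. The class StaxExample and the method textOf() are invented for this example, and it assumes the matched elements contain only text (getElementText() throws otherwise).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxExample {
    // Returns the text of every element whose local name equals elementName.
    static List<String> textOf(String xml, String elementName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        try {
            // The caller pulls events one at a time instead of receiving callbacks.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Reads all text up to the matching end tag, CDATA included.
                    result.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }
}
```

Since the loop is ordinary code, dispatching on the element name with a switch or a map of callbacks works the same way as in the SAX version, without the need for a shared text buffer.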

SAX parsing problem in Android... empty elements?

I am using SAX to parse an XML file I'm pulling from the web. I've extended DefaultHandler with code similar to:
public class ArrivalHandler extends DefaultHandler {
@Override
public void startElement(String namespaceUri, String localName, String qualifiedName, Attributes attributes) throws SAXException {
if (qualifiedName.equalsIgnoreCase("resultSet")) {
System.out.println("got a resultset");
} else if (qualifiedName.equalsIgnoreCase("location")) {
System.out.println("got a location");
} else if (qualifiedName.equalsIgnoreCase("arrival")) {
System.out.println("got an arrival");
} else {
System.out.println("There was an unknown XML element encountered: '" + qualifiedName + "'");
}
}
@Override
public void endElement(String namespaceUri, String localName, String qualifiedName) throws SAXException {
// we'll just ignore this for now
}
@Override
public void characters(char[] chars, int startIndex, int length) throws SAXException {
// ignore this too
}
}
The problem I'm having is that I'm just getting a series of empty elements. The log reads:
There was an unknown XML element encountered: ''
There was an unknown XML element encountered: ''
There was an unknown XML element encountered: ''
etc
This worked fine when I was just passing parser.parse a local file, but now I'm pulling it from the web with:
HttpClient httpClient = new DefaultHttpClient();
HttpResponse resp = httpClient.execute(new HttpGet("http://example.com/whatever"));
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
ArrivalHandler handler = new ArrivalHandler();
SAXParser parser = saxFactory.newSAXParser();
parser.parse(resp.getEntity().getContent(), handler);
and I get the (apparently) empty results described above.
What I've looked into so far:
I converted the InputStream from resp.getEntity().getContent() to a string and dumped it out and it looks like I'm getting the XML from the server correctly.
There are no exceptions thrown but there is a warning that reads "W/ExpatReader(232): DTD handlers aren't supported.".
Any other ideas for what I'm doing incorrectly or how to debug this?

From the docs for ContentHandler.startElement:
"the qualified name is required when the namespace-prefixes property is true, and is optional when the namespace-prefixes property is false (the default)."
So, do you have the namespace-prefixes property set to true?
Can you just cope with the uri and localName instead?
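One way to cope with both cases is to prefer localName and fall back to qName only when localName is empty. The sketch below is illustrative only: NameAwareHandler and elementNames() are made-up names, and the call to setNamespaceAware(true) reflects how a standard JAXP parser is asked to report local names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that works whether the parser fills in localName or qName.
class NameAwareHandler extends DefaultHandler {
    private final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Depending on the parser's namespace settings, one of the two may be empty.
        String name = (localName != null && !localName.isEmpty()) ? localName : qName;
        seen.add(name);
    }

    // Convenience helper (an assumption of this sketch): lists element names in document order.
    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // ask the parser to report uri and localName
        NameAwareHandler handler = new NameAwareHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.seen;
    }
}
```

In the question's handler, the same fallback could replace the direct use of qualifiedName in the equalsIgnoreCase() checks.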
