Why does SaxParser fail at random?


I'm using a SAX parser in my Android application to read a few feeds at a time. The code is executed as follows.
// Begin FeedLezer
try {
/** Handling XML **/
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader xr = sp.getXMLReader();
/** Send URL to parse XML Tags **/
URL sourceUrl = new URL(
BronFeeds[i]);
/** Create handler to handle XML Tags ( extends DefaultHandler ) **/
Feed_XMLHandler myXMLHandler = new Feed_XMLHandler();
xr.setContentHandler(myXMLHandler);
xr.parse(new InputSource(sourceUrl.openStream()));
} catch (Exception e) {
System.out.println("XML Parsing Exception = " + e);
}
sitesList = Feed_XMLHandler.sitesList;
String titels = sitesList.getMergedTitles();
And here are Feed_XMLHandler.java and Feed_XMLList.java, which I basically both just took from the web.
However, this code fails at times. I'll show some examples.
http://imm.io/media/2I/2IAs.jpg
It goes very well here. It even recognizes and displays apostrophes. Even when clicking the articles open, almost all of the text shows, so that's all good. The source feed is here. I can't control the feed.
http://imm.io/media/2I/2IB1.jpg Here, it doesn't go so well. It does display the ï, but it chokes on the apostrophe (there's supposed to be 'NORAD' after the Waarom). Here
http://imm.io/media/2I/2IBQ.jpg This is the worst one. As you can see, the title only displays an apostrophe, whilst it is supposed to be a 'blablabla'. Also, the text ends in the middle of the line, without any special characters in the quote. The feed is here
In all cases, I have no control over the feed. I think the script does choke on special characters. How can I make sure SAX fetches all the strings correctly?
If anyone knows an answer to this, you would really help me out a LOT :D
Thanks in advance.

This is from the Xerces FAQ:
"Why does the SAX parser lose some character data or why is the data split into several chunks? If you read the SAX documentation, you will find that SAX may deliver contiguous text as multiple calls to characters, for reasons having to do with parser efficiency and input buffering. It is the programmer's responsibility to deal with that appropriately, e.g. by accumulating text until the next non-characters event."
Your code looks adapted from one of the many XML parsing tutorials (like this one here). The tutorial is good and all, but it fails to mention something very important...
Notice this part here...
public void characters(char[] ch, int start, int length)
throws SAXException
{
if(in_ThisTag){
myobj.setName(new String(ch,start,length))
}
}
I bet at this point you're checking booleans to mark which tag you're in and then setting a value in some kind of class you made, or something like that.
But the problem is that the SAX parser (which is buffered) will not necessarily give you all the characters between the tags in one go. Say you have <tag> Lorem Ipsum...really long sentence...</tag>: the SAX parser may call the characters function several times, once per chunk.
So the trick here is to keep appending the values to a string variable and only actually set (or commit) it to your structure when the tag ends (i.e. in endElement).
Example
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
currentElement = false;
/** set value */
if (localName.equalsIgnoreCase("tag"))
{
sitesList.setName(currentValue);
currentValue = ""; //reset the currentValue
}
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
if (in_Tag) {
currentValue += new String(ch, start, length); //keep appending string, don't set it right here....maybe there's more to come.
}
}
Also, it would be better to use a StringBuilder for the appending, since that will be more efficient.
Hope it makes sense! If it didn't, check this and here.
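To make the StringBuilder variant concrete, here is a minimal sketch of a handler that buffers text in characters() and only commits it in endElement(). The class name TitleCollector, the helper parseTitles and the element name "title" are made up for illustration; they are not part of the asker's Feed_XMLHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> element.
class TitleCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> titles = new ArrayList<>();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
            text.setLength(0); // start with an empty buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so only append here.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(qName)) {
            titles.add(text.toString()); // the text is complete only now
            inTitle = false;
        }
    }

    List<String> getTitles() {
        return titles;
    }

    // Convenience wrapper for the example; parses an XML string.
    static List<String> parseTitles(String xml) {
        try {
            TitleCollector handler = new TitleCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getTitles();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

Entity references such as &apos; are a typical place where a parser splits the text into several characters() calls, which would explain titles that stop at an apostrophe. With the buffer, a title like "Waarom &apos;NORAD&apos; volgt" comes out whole.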

Related

XMLStreamReader doesn't read complete tag

I'm parsing XML using XMLStreamReader. In <dbresponse> tag there are some data loaded from database (WebRowSet object). The problem is that the content of this tag is very long (let's say several hundred kilobytes - the data are encoded in Base64), but input.getText() reads only 16.394 characters out of it.
I'm 100 % sure data coming to XMLStreamReader are OK.
I found another answer here, but it doesn't solve my problem. I could of course read the data some other way, but I would like to know what the problem is with this one.
Does somebody know how to get the whole content?
My code:
input = xmlFactory.createXMLStreamReader(new ByteArrayInputStream(xmlData.getBytes("UTF-8")));
while(input.hasNext()){
if(input.getEventType() == XMLStreamConstants.START_ELEMENT){
element = input.getName().getLocalPart();
switch(element.toLowerCase()){
case "transactionresponse":
int transactionStatus = 0;
transactionResponse = new TransactionResponse();
for(int i=0; i<input.getAttributeCount(); i++){
switch(input.getAttributeLocalName(i)){
case "status": transactionStatus = TransactionResponse.getStatusFromName(input.getAttributeValue(i));
}
}
transactionResponse.setStatus(transactionStatus);
break;
case "dbresponse":
for(int i=0; i<input.getAttributeCount(); i++){
switch(input.getAttributeLocalName(i)){
case "request_id": id = Integer.parseInt(input.getAttributeValue(i)); break;
case "status": status = Response.getStatusFromName(input.getAttributeValue(i));
}
}
break;
}
}else if(input.getEventType() == XMLStreamConstants.CHARACTERS){
switch(element.toLowerCase()){
case "dbresponse":
String data = input.getText();
if(!data.equals("\n")){
data = new String(Base64.decode(data), "UTF-8");
}
Response response = new Response(data, status, id);
if(transactionResponse != null){
transactionResponse.addResponse(response);
}else{
this.addResponse(response);
}
id = -1;
status = -1;
break;
}
element = "";
}else if(input.getEventType() == XMLStreamConstants.END_ELEMENT){
switch(input.getLocalName().toLowerCase()){
case "transactionresponse": this.addTransactionResponse(transactionResponse); transactionResponse = null; break;
}
}
input.next();
}

Event-driven XML parsers such as XMLStreamReader are designed to allow you to parse the XML without having to read it into memory all at one go, which is pretty essential in case you have a very large XML.
The design is such that it reads a certain buffer of data, and gives you events as it runs into "interesting" stuff, such as the beginning of a tag, the end of a tag, and so on.
But the buffer it reads is not infinite, as it is meant to handle large XML files, exactly like the one you have. For this reason, a large text in a tag may be represented by several consecutive CHARACTERS events.
That is, when you get a CHARACTERS event, there is no guarantee that it contains the whole text. If the text is too long for the reader's buffer, you will simply get more CHARACTERS events that follow.
Since you are only reading the data from the first CHARACTERS event, it is not the whole data.
The proper way to work with such a file is:
When you get a START_ELEMENT event for the element you are interested in, you make preparations for storing the text. For example, create a StringBuilder, or open a file for writing, etc.
For each CHARACTERS event that follows, you append the text to your storage (the StringBuilder, the file).
Once you get the END_ELEMENT event for the same element, you finish accumulating your data, and do whatever you need to do with it.
In fact, this is what the getElementText() method does for you - accumulates the data in a StringBuffer while going through CHARACTERS events until it hits the END_ELEMENT.
Bottom line: you only know you got the whole data when you hit the END_ELEMENT event. There is no guarantee that the text will be in a single CHARACTERS event.
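The three steps above could be sketched roughly like this. ElementTextReader and readElementText are names invented for the example, and it reads from a String for brevity; the element name is whatever you are interested in (e.g. "dbresponse").

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: returns the full text of the first element with the given name.
class ElementTextReader {
    static String readElementText(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringBuilder buffer = null; // non-null only while inside the element
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    buffer = new StringBuilder(); // step 1: prepare storage
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && buffer != null) {
                    buffer.append(reader.getText()); // step 2: append every chunk
                } else if (event == XMLStreamConstants.END_ELEMENT && buffer != null
                        && reader.getLocalName().equals(elementName)) {
                    return buffer.toString(); // step 3: the text is complete
                }
            }
            return null; // element not found
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

If the element contains only text, calling reader.getElementText() while positioned on its START_ELEMENT should give the same result with less code.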

I think the XMLStreamReader chunks the data, so maybe try looping getText() to concatenate all chunks?
What about the getElementText() method?

Java Library to truncate html strings?

I need to truncate an HTML string that was already sanitized by my app before being stored in the DB and that contains only links, images and formatting tags. When presenting it to users, it needs to be truncated to give an overview of the content.
So I need to abbreviate html strings in java such that
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
when truncated does not return something like this
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/><a href="htt
but instead returns
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/>

Your requirements are a bit vague, even after reading all the comments. Given your example and explanations, I assume your requirements are the following:
The input is a string consisting of (x)html tags. Your example doesn't contain this, but I assume the input can contain text between the tags.
In the context of your problem, we do not care about nesting. So the input is really only text intermingled with tags, where opening, closing and self-closing tags are all considered equivalent.
Tags can contain quoted values.
You want to truncate your string such that the string is not truncated in the middle of a tag. So in the truncated string every '<' character must have a corresponding '>' character.
I'll give you two solutions, a simple one which may not be correct, depending on what the input looks like exactly, and a more complex one which is correct.
First solution
For the first solution, we first find the last '>' character before the truncate size (this corresponds to the last tag which was completely closed). After this character may come text which does not belong to any tag, so we then search for the first '<' character after the last closed tag. In code:
public static String truncate1(String input, int size)
{
if (input.length() < size) return input;
int pos = input.lastIndexOf('>', size);
int pos2 = input.indexOf('<', pos);
if (pos2 < 0 || pos2 >= size) {
return input.substring(0, size);
}
else {
return input.substring(0, pos2);
}
}
Of course this solution does not consider the quoted value strings: the '<' and '>' characters might occur inside a string, in which case they should be ignored. I mention the solution anyway because you say your input is sanitized, so possibly you can ensure that the quoted strings never contain '<' and '>' characters.
Second solution
To consider the quoted strings, we cannot rely on standard Java classes anymore, but we have to scan the input ourselves and remember if we are currently inside a tag and inside a string or not. If we encounter a '<' character outside of a string, we remember its position, so that when we reach the truncate point we know the position of the last opened tag. If that tag wasn't closed, we truncate before the beginning of that tag. In code:
public static String truncate2(String input, int size)
{
if (input.length() < size) return input;
int lastTagStart = 0;
boolean inString = false;
boolean inTag = false;
for (int pos = 0; pos < size; pos++) {
switch (input.charAt(pos)) {
case '<':
if (!inString && !inTag) {
lastTagStart = pos;
inTag = true;
}
break;
case '>':
if (!inString) inTag = false;
break;
case '\"':
if (inTag) inString = !inString;
break;
}
}
if (!inTag) lastTagStart = size;
return input.substring(0, lastTagStart);
}

A robust way of doing it is to use the hotsax code, which parses HTML while letting you interface with the parser through the traditional low-level SAX XML API. (Note that it is not an XML parser: it parses poorly formed HTML and only chooses to let you interface with it using a standard XML API.)
Here on github I have created a working quick-and-dirty example project which has a main class that parses your truncated example string:
XMLReader parser = XMLReaderFactory.createXMLReader("hotsax.html.sax.SaxParser");
final StringBuilder builder = new StringBuilder();
ContentHandler handler = new DoNothingContentHandler(){
StringBuilder wholeTag = new StringBuilder();
boolean hasText = false;
boolean hasElements = false;
String lastStart = "";
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
String text = (new String(ch, start, length)).trim();
wholeTag.append(text);
hasText = true;
}
@Override
public void endElement(String namespaceURI, String localName,
String qName) throws SAXException {
if( !hasText && !hasElements && lastStart.equals(localName)) {
builder.append("<"+localName+"/>");
} else {
wholeTag.append("</"+ localName +">");
builder.append(wholeTag.toString());
}
wholeTag = new StringBuilder();
hasText = false;
hasElements = false;
}
@Override
public void startElement(String namespaceURI, String localName,
String qName, Attributes atts) throws SAXException {
wholeTag.append("<"+ localName);
for( int i = 0; i < atts.getLength(); i++) {
wholeTag.append(" "+atts.getQName(i)+"='"+atts.getValue(i)+"'");
hasElements = true;
}
wholeTag.append(">");
lastStart = localName;
hasText = false;
}
};
parser.setContentHandler(handler);
//parser.parse(new InputSource( new StringReader( "<div>this is the <em>end</em> my <br> friend some link" ) ));
parser.parse(new InputSource( new StringReader( "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/><a href=\"htt" ) ));
System.out.println( builder.toString() );
It outputs:
<img src='http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg'></img><br/>
It is adding an </img> tag, but that's harmless for HTML, and it would be possible to tweak the code to exactly match the input in the output if you felt that necessary.
Hotsax is actually generated code, produced by running yacc/flex compiler tools over the HtmlParser.y and StyleLexer.flex files which define the low-level grammar of HTML. So you benefit from the work of the person who created that grammar; all you need to do is write some fairly trivial code and test cases to reassemble the parsed fragments as shown above. That's much better than trying to write your own regular expressions, or worse, a hand-coded string scanner, to try to interpret the string, as that is very fragile.

Now that I understand what you want, here is the simplest solution I could come up with.
Just work from the end of your substring towards the start until you find '>'. This is the end mark of the last tag. So you can be sure that you only have complete tags in the majority of cases.
But what if the > is inside texts?
Well, to be sure about this, just search on until you find < and ensure it is part of a tag (do you know the tag string, for instance? Since you only have links, images and formatting, you can easily check this). If you find another > before finding a < that starts a tag, this is the new end of your string.
Easy to do, correct and should work for you.
If you are not certain whether strings / attributes can contain < or >, you need to check for the appearance of " and =" to determine whether you are inside a string or not. (Remember that you can cut off attribute values.) But I think this is overengineering. I have never found an attribute with < and > in it, and within text they are usually escaped as &lt; and the like.

I don't know the context of the problem the OP needs to solve, but I am not sure if it makes a lot of sense to truncate html code by the length of its source code instead of the length of its visual representation (which can become arbitrarily complex, of course).
Maybe a combined solution could be useful, so you don't penalize html code with a lot of markup or long links, but also set a clear total limit which cannot be exceeded. Like others already wrote, the usage of a dedicated HTML parser like JSoup allows the processing of non well-formed or even invalid HTML.
The solution is loosely based on JSoup's Cleaner. It traverses the parsed dom tree of the source code and tries to recreate a destination tree while continuously checking, if a limit has been reached.
import org.jsoup.nodes.*;
import org.jsoup.parser.*;
import org.jsoup.select.*;
String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />" +
"<br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
//String html = "<b>foo</b>bar<p class=\"baz\">Some <img />Long Text</p><a href='#'>hello</a>";
Document srcDoc = Parser.parseBodyFragment(html, "");
srcDoc.outputSettings().prettyPrint(false);
Document dstDoc = Document.createShell(srcDoc.baseUri());
dstDoc.outputSettings().prettyPrint(false);
Element dst = dstDoc.body();
NodeVisitor v = new NodeVisitor() {
private static final int MAX_HTML_LEN = 85;
private static final int MAX_TEXT_LEN = 40;
Element cur = dst;
boolean stop = false;
int resTextLength = 0;
@Override
public void head(Node node, int depth) {
// ignore "body" element
if (depth > 0) {
if (node instanceof Element) {
Element curElement = (Element) node;
cur = cur.appendElement(curElement.tagName());
cur.attributes().addAll(curElement.attributes());
String resHtml = dst.html();
if (resHtml.length() > MAX_HTML_LEN) {
cur.remove();
throw new IllegalStateException("html too long");
}
} else if (node instanceof TextNode) {
String curText = ((TextNode) node).getWholeText();
String resHtml = dst.html();
if (curText.length() + resHtml.length() > MAX_HTML_LEN) {
cur.appendText(curText.substring(0, MAX_HTML_LEN - resHtml.length()));
throw new IllegalStateException("html too long");
} else if (curText.length() + resTextLength > MAX_TEXT_LEN) {
cur.appendText(curText.substring(0, MAX_TEXT_LEN - resTextLength));
throw new IllegalStateException("text too long");
} else {
resTextLength += curText.length();
cur.appendText(curText);
}
}
}
}
@Override
public void tail(Node node, int depth) {
if (depth > 0 && node instanceof Element) {
cur = cur.parent();
}
}
};
try {
NodeTraversor t = new NodeTraversor(v);
t.traverse(srcDoc.body());
} catch (IllegalStateException ex) {
System.out.println(ex.getMessage());
}
System.out.println(" in='" + srcDoc.body().html() + "'");
System.out.println("out='" + dst.html() + "'");
For the given example with max length of 85, the result is:
html too long
in='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br>'
out='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br>'
It also correctly truncates within nested elements, for a max html length of 16 the result is:
html too long
in='<i>f<b>oo</b>b</i>ar'
out='<i>f<b>o</b></i>'
For a maximum text length of 2, the result of a long link would be:
text too long
in='<b>foo</b>bar'
out='<b>fo</b>'

You can achieve this with the JSOUP library, an HTML parser.
You can download it from the link below.
Download JSOUP
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class HTMLParser
{
public static void main(String[] args)
{
String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
Document doc = Jsoup.parse(html);
doc.select("a").remove();
System.out.println(doc.body().children());
}
}

Well, whatever you want to do, there are two libraries out there, jSoup and HtmlParser, which I tend to use. Please check them out. Also, I barely see XHTML in the wild anymore. It's more about HTML5 (which does not have an XHTML counterpart) nowadays.
[Update]
I mention JSoup and HtmlParser since they are fault tolerant in the way a browser is. Please check whether they suit you, since they are very good at dealing with malformed and damaged HTML text. Create a DOM out of your HTML and write it back to a string, and you should get rid of the damaged tags. You can also filter the DOM yourself and remove even more content if you have to.
PS: I guess the XML decade is finally (and gladly) over. Today JSON is going to be overused.

A third potential solution I would consider is not to work with strings in the first place.
If I remember correctly, there are DOM tree representations that work closely with the underlying string representation. Therefore they are character exact. I wrote one myself, but I think jSoup has such a mode. Since there are a lot of parsers out there, you should be able to find one that does.
With such a parser you can easily see which tag runs from which string position to which. These parsers maintain a String of the document and alter it, but only store range information like start and stop positions within the document, which avoids duplicating that information for nested nodes.
Therefore you can find the outermost node for a given position, know exactly where it starts and ends, and easily decide whether this tag (including all its children) can be presented within your snippet. So you will have the chance to print complete text nodes and the like without the risk of presenting only partial tag information, headline text and the like.
If you do not find a parser that suits you for this, you can ask me for advice.

Java sax parser bug

I am using the Java SAX parser and I override
@Override
public void characters(char ch[], int start, int length) throws SAXException {
value = new String(ch, start, length);
}
In some cases the array ch contains the qName of the element but does not contain the entire value.
Example:
ch = [... , x, s, d, :, n, a, m, e, >, 1, 2, 3]
but the real value of xsd:name is 123456789
EDIT
String responseString = Utils.getXml(url);
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
handler = new SimpleHandler();
saxParser.parse(new InputSource(new StringReader(responseString)), handler);
List<Entit> list = handler.getList();
I have XML like this (of course the original XML is much bigger)
<root>
<el>
<xsd:name>11111111</xsd:name>
</el>
<el>
<xsd:name>22222222</xsd:name>
</el>
<el>
<xsd:name>123456789</xsd:name>
</el>
<el>
<xsd:name>333333333</xsd:name>
</el>
</root>
I get the error for just one value in the XML.
How can I fix that?

The characters method does not necessarily return the entire set of characters. You need to store the result each time characters is called, something like:
final StringBuilder sb = new StringBuilder();
@Override
public void characters(char ch[], int start, int length) throws SAXException {
sb.append(ch, start, length);
}
You then need to reset your StringBuilder (or whatever you are using) when you find an end element tag or a begin element tag or whatever the case may be.
Read the specification for characters:
"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information."
Generally, what you should do is delete the text buffer when you see startElement or endElement. Usually you will do something with the current buffer when these are seen.

Losing a Unicode character when parsing an HTML document with Jsoup

I ran into a strange behavior when I parsed an HTML page which contains a Unicode character. Here is the example: git://gist.github.com/2995626.git.
What I did is:
File layout = new File(html_file);
Document doc = Jsoup.parse(layout, "UTF-8");
System.out.println(doc.toString());
What I expected was the HTML triangle, but it is converted to "â–¼". Do you have any suggestions?
Thanks in advance.

Jsoup is perfectly capable of parsing HTML using UTF-8. Even more, it's its default character encoding already. Your problem is caused elsewhere. Based on the information provided so far, I can see two possible problem causes:
The HTML file was originally not saved using UTF-8 (or perhaps it's one step before; it's originally not been read using UTF-8).
The stdout (there where the System.out goes to) does not use UTF-8.
If you make sure that both are correctly set, then your problem should disappear. If not, then there's another possible cause which is not guessable based on the information provided so far in your question. At least, this blog should bring a lot of new insight: Unicode - How to get the characters right?
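As a rough illustration of both causes, the sketch below first reproduces the garbling by decoding UTF-8 bytes with the wrong charset, then prints through a PrintStream that is explicitly set to UTF-8. The class EncodingCheck and its methods are invented for this example, and windows-1252 is only an assumption about which wrong charset was involved (it is not one of the charsets Java guarantees, though common JDKs ship it).

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jsoup.
class EncodingCheck {
    // Simulates the bug: text is encoded as UTF-8 but decoded as windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Wraps System.out so that output is always written as UTF-8.
    static PrintStream utf8Stdout() {
        try {
            return new PrintStream(System.out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stdout();
        String triangle = "\u25BC";       // the black down-pointing triangle
        out.println(misdecode(triangle)); // three garbage characters
        out.println(triangle);            // the triangle, if the terminal shows UTF-8
    }
}
```

The three bytes of the triangle in UTF-8 (E2 96 BC) map to â, – and ¼ in windows-1252, which matches the garbage in the question. That suggests the file or the console is being treated as windows-1252 somewhere along the way.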

It is a problem caused by Unicode. Here is an example. You can try the code below. The result will show you why the code you wrote is not working.
public static void main(String[] argv) {
String test = "Ch\u00e0o bu\u1ed5i s\u00e1ng";
System.out.println(unicode2String(test));
}
/**
* Converts unicode escapes to a string
*/
public static String unicode2String(String unicode) {
StringBuffer string = new StringBuffer();
String[] hex = unicode.split("\\\\u");
string.append(hex[0]);
for (int i = 1; i < hex.length; i++) {
// Convert each code point
int data = Integer.parseInt(hex[i], 16);
// Append it to the string
string.append((char) data);
}
return string.toString();
}
Maybe your code should be as follows:
System.out.println(unicode2String(doc.toString()));

Get XML and encoded data in a file with hex zeros using Java

I have to read a file (existing format not under my control) that contains an XML document and encoded data. This file unfortunately includes MQ-related data around it including hex zeros (end of files).
So, using Java, how can I read this file, stripping or ignoring the "garbage" I don't need to get at the XML and encoded data. I believe an acceptable solution is to just leave out the hex zeros (are there other values that will stop my reading?) since I don't need the MQ information (RFH header) anyway and the counts are meaningless for my purposes.
I have searched a lot and only find really heinous complicated "solutions". There must be a better way...

What worked was to pull out the XML documents - Groovy code:
public static final String REQUEST_XML = "<Request>";
public static final String REQUEST_END_XML = "</Request>";
/**
* @param xmlMessage
* @return 1-N EncodedRequests for those I contain
*/
private void extractRequests( String xmlMessage ) {
int start = xmlMessage.indexOf(REQUEST_XML);
int end = xmlMessage.indexOf(REQUEST_END_XML);
end += REQUEST_END_XML.length();
while( start >= 0 ) { //each <Request>
requests.add(new EncodedRequest(xmlMessage.substring(start,end)));
start = xmlMessage.indexOf(REQUEST_XML, end);
end = xmlMessage.indexOf(REQUEST_END_XML, start);
end += REQUEST_END_XML.length();
}
}
and then decode the base64 portion:
public String getDecodedContents() {
if( decodedContents == null ) {
byte[] decoded = Base64.decodeBase64(getEncodedContents().getBytes());
String newString = new String(decoded);
decodedContents = newString;
decodedContents = decodedContents.replace('\r','\t');
}
return decodedContents;
}

I've hit this issue before (well ... something similar). Have a look at my FilterInputStream for a file filter that you should be able to modify to your needs.
Essentially it implements a push-back buffer that chucks away anything you don't want.
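The linked class isn't reproduced here, so as a stand-in here is a much simpler sketch of the same idea: a FilterInputStream that just drops every zero byte and passes everything else through. It does not use a push-back buffer, and NulStrippingInputStream and strip are names made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical filter: removes all 0x00 bytes from the wrapped stream.
class NulStrippingInputStream extends FilterInputStream {
    NulStrippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read(); // skip zero bytes; -1 (end of stream) ends the loop
        } while (b == 0);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) {
                    buf[off + kept++] = b; // compact the non-zero bytes in place
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was zeros, so read the next chunk.
        }
    }

    // Convenience wrapper for the example: filters a byte array in memory.
    static byte[] strip(byte[] raw) {
        try (InputStream in = new NulStrippingInputStream(new ByteArrayInputStream(raw));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would wrap the FileInputStream in it before handing the data to whatever extracts the <Request> documents. Note that this only removes the zeros; the rest of the RFH header is still in the stream and has to be skipped separately.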
