java xml parsing for ISO-8859-9


I'm trying to parse a string into an XML document using ISO-8859-9. My code is:
private Document stringToXML(String input)
        throws ParserConfigurationException, SAXException, IOException
{
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Encodes the string as ISO-8859-9 bytes before handing it to the parser
    return builder.parse(new ByteArrayInputStream(input.getBytes("ISO-8859-9")));
}
If the input includes just UTF-8 characters, the code runs correctly, but if the input includes any special character like 'ğ', it throws "com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:"
How can I solve this problem?

Parse a StringReader via an InputSource.
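
A minimal sketch of that suggestion, assuming the XML is already held in a Java String. The class name StringToXml is made up for illustration; the method name is taken from the question. Because the parser reads characters rather than bytes, no byte encoding is involved at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StringToXml {
    // Parses XML held in a String by reading it as a character stream.
    static Document stringToXML(String input) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // InputSource wraps the Reader; the parser never sees raw bytes here.
        return builder.parse(new InputSource(new StringReader(input)));
    }

    public static void main(String[] args) throws Exception {
        // \u011f is the 'ğ' character from the question.
        Document doc = stringToXML("<name>da\u011f</name>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```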

If the input contains UTF-8 characters, then it is NOT an ISO-8859-9 stream. Parse it as UTF-8 or convert it to ISO-8859-9 before trying to parse. You only ever get one character set per document; trying to mix them makes the whole thing meaningless.

Related

Invalid byte 2 of 2-byte UTF-8 sequence: XML saved as String variable

I get the following error due to Latin text in my XML:
Invalid byte 2 of 2-byte UTF-8 sequence
My XML is written to a String variable (I don't import a file).
I tried to set the encoding to "UTF-8", but I might have done it wrong.
Can you help, please?
My code:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
InputStream inputStream = new ByteArrayInputStream(GET_XML.getBytes());
Document doc = dBuilder.parse(inputStream);
doc.getDocumentElement().normalize();

You are seeing this error because you are feeding the parser XML that contains ISO-8859-1 (aka Latin-1) characters without a proper XML declaration:
<?xml version='1.0' encoding='ISO-8859-1' standalone='no' ?>
You have two options: either source the XML with the above declaration, or force UTF-8 during the byte conversion:
new ByteArrayInputStream(GET_XML.getBytes(StandardCharsets.UTF_8));
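
Here is a rough, self-contained sketch of the second option. The class and method names (Utf8BytesParse, parseUtf8) are invented for the example. It assumes the string either has no XML declaration or declares UTF-8; if the string itself declares a different encoding, the bytes and the declaration would disagree and the parse could still fail.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Utf8BytesParse {
    // Encodes the string explicitly as UTF-8 instead of the platform default,
    // which matches what the parser assumes when no encoding is declared.
    static Document parseUtf8(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document doc = builder.parse(in);
            doc.getDocumentElement().normalize();
            return doc;
        }
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 is 'é', a Latin-1 character that needs two bytes in UTF-8.
        Document doc = parseUtf8("<city>Montr\u00e9al</city>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```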

Setting the encoding on an inputstream

I'm processing xml in Java and I have the following code:
dbf.setValidating(false);
dbf.setIgnoringComments(false);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setNamespaceAware(true);
DocumentBuilder db = null;
db = dbf.newDocumentBuilder();
db.setEntityResolver(new NullResolver());
_logger.error("Before processing the input stream");
processXml(db.parse(is));
Where (is) is an InputStream.
This is resulting in the error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
Invalid byte 2 of 2-byte UTF-8
This sounds like an error caused by the wrong encoding. I would like to set the encoding on the InputStream, but I am not sure how. I found ways to set the encoding on an InputSource or an InputStreamReader, but then db.parse does not take a reader/InputSource.
What is the best way to fix this?
Thanks!

DocumentBuilder.parse can take an InputSource. See the javadocs.
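So you should try wrapping your InputStream in an InputStreamReader (where you can specify the character set) and then create an InputSource based on that.
It's a bit convoluted, but these things happen in Java.
Something along the lines of db.parse(new InputSource(new InputStreamReader(is, charsetName))), where charsetName is whatever encoding your data actually uses.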
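
As a fuller sketch of that wrapping (the class name EncodedStreamParse and the choice of ISO-8859-1 in main are assumptions for the example, not something the question states): once the parser is handed a character stream, the Reader decides how bytes become characters.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EncodedStreamParse {
    // Decodes the byte stream with an explicit charset, then parses the characters.
    static Document parse(InputStream is, Charset charset) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        try (Reader reader = new InputStreamReader(is, charset)) {
            return db.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a Latin-1 encoded stream; 0xE9 alone is not valid UTF-8.
        byte[] latin1 = "<a>caf\u00e9</a>".getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The charset has to be the one the bytes were really written in; guessing wrong will not throw, it will just produce wrong characters.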

Parse a single Line of XML into a HashMap

I'm building an android app that communicates to a web server and am struggling with the following scenario:
Given ONE line of XML in a String eg:
"<test one="1" two="2" />"
I would like to extract the values into a HashMap so that:
map.get("one") = "1"
map.get("two") = "2"
I can already do this with a full XML document using the SAX parser, but it complains when I try to give it just the above string, throwing a MalformedURLException: Protocol not found
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
Document doc = null;
builder = factory.newDocumentBuilder();
doc = builder.parse("<test one=\"1\" two=\"2\" />"); //here
I realize some regex could do this, but I'd really rather do it properly.
The same behaviour can be found at http://metacpan.org/pod/XML::Simple#XMLin which is what the web server uses.
Can anyone help? Thanks :D

DocumentBuilder.parse(String) treats the string as a URL. Try this instead:
Document doc = builder.parse(new InputSource(new StringReader(text)));
(where text contains the XML, of course).
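
To get from there to the HashMap the question asks for, something like the following should work. It is only a sketch; the class and method names (AttributesToMap, attributesOf) are made up, and it reads the attributes of the root element only.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AttributesToMap {
    // Parses a one-element XML string and copies the root's attributes into a map.
    static Map<String, String> attributesOf(String text) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(text)));
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            map.put(attr.getNodeName(), attr.getNodeValue());
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> map = attributesOf("<test one=\"1\" two=\"2\" />");
        System.out.println(map.get("one"));
        System.out.println(map.get("two"));
    }
}
```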

XML parsing issue with '&' in element text

I have the following code:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(inputXml)));
And the parse step is throwing:
SAXParseException: The entity name must immediately follow
the '&' in the entity reference
due to the following '&' in my inputXml:
<Line1>Day & Night</Line1>
I'm not in control of the inbound XML. How can I safely/correctly parse this?

Quite simply, the input "XML" is not valid XML. The entity should be encoded, i.e.:
<Line1>Day &amp; Night</Line1>
Basically, there's no "proper" way to fix this other than telling the XML supplier that they're giving you garbage and getting them to fix it. If you're in some horrible situation where you've just got to deal with it, then the approach you take will likely depend on what range of values you're expected to receive.
If there are no entities in the document at all, a regex replace of & with &amp; before processing would do the trick. But if they're sending some entities correctly, you'd need to exclude these from the matching. And on the rare chance that they actually wanted to send the entity code (i.e. sent &amp; but meant &amp;amp;) you're going to be completely out of luck.
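
One way to do the "exclude correct entities" variant is a negative lookahead. This is a sketch under assumptions, not a general fix: the class and method names are invented, it treats anything shaped like &name; or &#123; or &#x1F; as an already-valid reference, and it cannot handle the last case above where a well-formed entity was sent but a literal was meant.

```java
public class AmpersandEscaper {
    // Matches '&' only when it is NOT followed by something that looks like
    // an entity or character reference ending in ';'.
    private static final String BARE_AMPERSAND =
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)";

    static String escapeBareAmpersands(String xml) {
        return xml.replaceAll(BARE_AMPERSAND, "&amp;");
    }

    public static void main(String[] args) {
        // The bare '&' is escaped; the existing &lt; is left alone.
        System.out.println(escapeBareAmpersands("<Line1>Day & Night &lt; Dawn</Line1>"));
    }
}
```

Note that this blindly rewrites the whole text, so an & inside a CDATA section or comment would be changed too.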
But hey - it's the supplier's fault anyway, and if your attempt to fix up invalid input isn't exactly what they wanted, there's a simple thing they can do to address that. :-)

Your input XML isn't valid XML; unfortunately you can't realistically use an XML parser to parse this.
You'll need to pre-process the text before passing it to an XML parser. You can do a string replace, replacing '& ' with '&amp; ', but this isn't going to catch every occurrence of & in the input. You may be able to come up with something that does.

I used the Tidy framework before XML parsing:
final StringWriter errorMessages = new StringWriter();
final String res = new TidyChecker().doCheck(html, errorMessages);
...
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(addRoot(html))));
...
And all was OK.

Is inputXML a string? Then use this:
inputXML = inputXML.replaceAll("&(\\s+)", "&amp;$1"); // escape an '&' followed by whitespace, keeping the whitespace

Invalid character '&#x0' encountered

I am getting following exception while parsing the xml.
Fatal error at line -1 Invalid character '&#x0' encountered. No stack trace
I have Xml data in string format and I am parsing it using DOM parser.
I am parsing data which is a response from Java server to a Blackberry client.
I also tried parsing with SAX parser,but problem is not resolved.
Please help.

You have a null character in your character stream, i.e. char(0) which is not valid in an XML-document. If this is not present in the original string, then it is most likely a character decoding issue.
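
To check whether that is what is happening, a small helper can report the first character that XML 1.0 does not allow. This is just a sketch with made-up names (XmlCharCheck, firstInvalidXmlChar); the ranges are the Char production from the XML 1.0 specification.

```java
public class XmlCharCheck {
    // Returns the index of the first code point not allowed in an XML 1.0
    // document, or -1 if every character is allowed.
    static int firstInvalidXmlChar(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean allowed = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!allowed) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // A trailing NUL, as might be left over from a badly sized buffer.
        String xmlData = "<a/>\u0000";
        System.out.println(firstInvalidXmlChar(xmlData));
    }
}
```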

I found the solution: I just trimmed the string with trim(), and it worked perfectly fine for me.

Your code currently calls getBytes() using the platform default encoding - that's very rarely a good idea. Find out what the encoding of the data really is, and use that. (It's likely to be UTF-8.)
If the Blackberry includes DocumentBuilder.parse(InputSource), that would be preferable:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
StringReader reader = new StringReader(xmlData);
try {
    // Parse from the character stream, so no byte encoding is involved
    Document doc = docBuilder.parse(new InputSource(reader));
    doc.getDocumentElement().normalize();
} finally {
    reader.close();
}
If that doesn't work, have a very close look at your string, e.g. like this:
for (int i=0; i < xmlData.length(); i++) {
// Use whatever logging you have on the Blackberry
System.out.println((int) xmlData.charAt(i));
}
It's possible that the problem is reading the response from the server - if you're reading it badly, you could have Unicode nulls (\u0000) in your string, which may not appear obviously in log/debug output, but would cause the error you've shown.
EDIT: I've just seen that you're getting the base64 data in the first place - so why convert it to a string and then back to bytes? Just decode the base64 to a byte array and then use that as the basis of your ByteArrayInputStream. Then you never have to deal with a text encoding in the first place.
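
Roughly like this, as a sketch only: it uses java.util.Base64, which needs Java 8 or later and is probably not available on a BlackBerry, so treat the decoder as a stand-in for whatever base64 routine your platform provides. The class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Base64XmlParse {
    // Decodes base64 straight to bytes and lets the parser detect the encoding,
    // so the XML never passes through a String on the way in.
    static Document parseBase64(String base64) throws Exception {
        byte[] xmlBytes = Base64.getDecoder().decode(base64);
        DocumentBuilder docBuilder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            Document doc = docBuilder.parse(in);
            doc.getDocumentElement().normalize();
            return doc;
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a sample payload the way a server might send it.
        String payload = Base64.getEncoder()
                .encodeToString("<msg>hello</msg>".getBytes(StandardCharsets.UTF_8));
        Document doc = parseBase64(payload);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```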

InputStream xml = new ByteArrayInputStream(xmlData.getBytes());
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(xml);
doc.getDocumentElement().normalize();
xml.close();
Above is the code I am using for parsing.
