Setting the encoding on an inputstream - java

I'm processing xml in Java and I have the following code:
dbf.setValidating(false);
dbf.setIgnoringComments(false);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setNamespaceAware(true);
DocumentBuilder db = null;
db = dbf.newDocumentBuilder();
db.setEntityResolver(new NullResolver());
_logger.error("Before processing the input stream");
processXml(db.parse(is));
Where (is) is an InputStream.
This is resulting in the error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
Invalid byte 2 of 2-byte UTF-8
Which sounds like an error resulting from getting the wrong encoding. I would like to set the encoding on the InputStream but I am not sure how. I found ways to set the encoding on an InputSource or an InputStreamReader but then the db.parse does not take a reader/InputSource.
What is the best way to fix this?
Thanks!

DocumentBuilder.parse can take an InputSource. See the javadocs.
So you should try wrapping your InputStream in an InputStreamReader (where you can specify the character set) and then create an InputSource based on that.
It's a bit convoluted, but these things happen in Java.
Something along the lines of
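A minimal sketch of that wrapping, assuming the real encoding of the stream is known to the caller. The class name EncodingAwareParse and the sample document are made up for illustration; the factory setup mirrors only part of the question's code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question's code.
final class EncodingAwareParse {

    // Decode the bytes with an explicit charset, then hand the parser
    // characters (via InputSource) instead of raw bytes.
    static Document parse(InputStream is, Charset charset) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();

        Reader reader = new InputStreamReader(is, charset);
        return db.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Sample input: ISO-8859-1 bytes without an XML declaration.
        byte[] bytes = "<name>caf\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The charset passed in has to match how the bytes were actually written; guessing wrong replaces the exception with silently garbled text.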

Related

How to improve memory usage when converting BufferedReader to ByteArrayInputStream?

I am running into some out of memory exceptions when reading in very very large XML strings and converting them into a Document object.
The way I am doing this is I am opening a URL stream to the XML file, wrapping that in an InputStreamReader, then wrapping that in a BufferedReader.
Then I read from the BufferedReader and append to a StringBuffer:
StringBuffer doc = new StringBuffer();
BufferedReader in = new BufferedReader(new InputStreamReader(downloadURL.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    doc.append(inputLine);
}
Now this is the part I am having an issue with. I am using toString on the StringBuffer to be able to get the bytes to create a byte array which is then used to create a ByteArrayInputStream. I believe that this step is causing me to have the same data in memory twice, is that right?
Here is what I am doing:
byte xmlBytes[] = doc.toString().getBytes();
ByteArrayInputStream is = new ByteArrayInputStream(xmlBytes);
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
Builder xmlBuilder = new Builder(xmlReader,false);
Document d = xmlBuilder.build(is);
Is there a way I can avoid creating duplicate memory (if I am doing so in the first place) or is there a way to convert the BufferedReader straight into a ByteArrayInputStream?
Thanks
Here is how you can consume an InputStream to create a Document using a DOM parser:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document document = builder.parse(inputStream);
This creates fewer intermediate copies. However, if the XML document is very large, instead of parsing it completely in memory, the best solution is to use a StAX parser.
With a StAX parser, you don't load the entire parsed document in memory. Instead, you handle each element found sequentially (and the element is thrown away immediately).
Here is a good explanation: Java: Parsing XML files: DOM, SAX or StAX?
There are also SAX parsers, but it's much easier to use StAX. Discussion here: When should I choose SAX over StAX?
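As a rough illustration of the StAX approach, the sketch below walks a stream element by element and only keeps a counter. The class name StaxCount and the idea of counting elements are assumptions chosen for the example; real handling code would go where the counter is incremented.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; counts elements without building a Document.
final class StaxCount {

    static int countElements(InputStream in, String localName) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        int count = 0;
        try {
            // Pull one event at a time; nothing from earlier events is retained.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }
}
```

With the question's setup, the stream returned by downloadURL.openStream() could be passed in directly, so the document is never held in a StringBuffer or byte array.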
If your XML (or JSON) file is large, it is not a good idea to load the whole content into memory because, as you mentioned, parsing consumes a huge amount of memory.
This issue becomes more serious with more users (that is, more than one thread). Just imagine what happens if your application needs to serve two, ten or more parallel requests...
The best way is to process a huge file as a stream; once you have read the payload you need, you can close the stream without reading it to the end. This is a faster and more memory-friendly solution.
Apache Commons IO can help you to do the job:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
Another way to handle this issue is to split your XML file into parts; you can then process the smaller parts without any issue.

java xml parsing for ISO-8859-9

I'm trying to parse a string to xml for ISO-8859-9. My code is :
private Document stringToXML(String input)
{
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
builder = factory.newDocumentBuilder();
return builder.parse(new ByteArrayInputStream(input.getBytes("ISO-8859-9")));
}
If the input includes only UTF-8 characters, the code runs correctly, but if the input includes any special character like 'ğ', it throws "com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:"
How can I solve this problem?
Parse a StringReader via an InputSource.
If the input contains UTF-8 characters, then it is NOT an ISO-8859-9 stream. Parse it as UTF-8 or convert it to ISO-8859-9 before trying to parse. You only ever get one character set per document, trying to mix makes the whole thing meaningless.
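A minimal sketch of the StringReader route, reusing the question's method name stringToXML; the wrapping class StringToXml is made up for the example. Because the parser is given characters rather than bytes, no byte encoding has to be chosen at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical wrapper class around the question's stringToXML method.
final class StringToXml {

    static Document stringToXML(String input) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The String is already decoded text, so skip getBytes() entirely.
        return builder.parse(new InputSource(new StringReader(input)));
    }
}
```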

SAX Parser doesn't recognize windows-1255 encoding

I'm working on a rss parser in android
(upgrading a parser I found on the internet).
From what I know, the SAX parser recognizes the encoding automatically from the xml tag, but when I try to parse a feed that declares windows-1255 encoding, it fails to parse it and throws an exception.
I tried few things:
final InputSource source = new InputSource(feed);
Reader isr = new InputStreamReader(feed);
source.setCharacterStream(isr);
I even tried telling it the specific encoding:
source.setEncoding("Windows-1255");
Tried to look at the locator:
@Override
public void setDocumentLocator(Locator locator) {
}
And it recognizes the encoding as UTF-16.
Please help me solve this annoying problem!
Sorry for the mess with the code snippets; the code button refuses to work for some reason.
Chances are the platform itself doesn't know about the "windows-1255" encoding. After all, it's a Windows-based encoding - I wouldn't want to rely on it being available on any other platforms, particularly mobile ones where things are generally cut down to the "must-have" options.
You need to set the encoding on the InputStreamReader.
Reader isr = new InputStreamReader(feed, "windows-1255");
final InputSource source = new InputSource(isr);
From the javadoc, the logic for reading from an InputSource goes something like this:
Is there a character stream? If there is, use that. (This is what happens if you use a Reader like InputStreamReader.)
Otherwise:
No character stream? Use the byte stream. (InputStream)
Is there an encoding set for the InputSource? Use that.
There was no encoding set? Try parsing the encoding from the xml file.
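One way to act on that precedence, sketched under the assumption that the feed's encoding name is known up front: decode with a Reader when the platform supports the charset, and otherwise fall back to the byte stream so the parser can try on its own. FeedSource and toInputSource are hypothetical names; Charset.isSupported is the standard check for whether a charset is available.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import org.xml.sax.InputSource;

// Hypothetical helper for building the InputSource handed to the parser.
final class FeedSource {

    static InputSource toInputSource(InputStream feed, String encodingName) {
        if (Charset.isSupported(encodingName)) {
            // Character stream: the parser uses it as is and does no decoding.
            return new InputSource(
                    new InputStreamReader(feed, Charset.forName(encodingName)));
        }
        // Byte stream only: the parser falls back to the XML declaration.
        return new InputSource(feed);
    }
}
```

Checking isSupported first also answers the question of whether the platform knows "windows-1255" at all, instead of finding out through a parse failure.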

Stop Jsoup from encoding

I'm trying to parse a URL with Jsoup which contains the following text: Ætterni.
After parsing the document, the same string looks like this: &AElig;tterni.
How do I prevent this from happening? I want the document 1:1, exactly like it was.
Code:
doc = Jsoup.connect(url).get();
String docEncoding=doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink),docEncoding);
writer.write(doc.html());
writer.close();
Use
doc.outputSettings().escapeMode(EscapeMode.xhtml);
for avoiding entities conversion.
You don't seem to be using any of Jsoup's features. I'd just stream the HTML plain using java.net.URL. This way you have a 1:1 copy of the response.
InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.
You should not use Reader/Writer for this, as that may corrupt characters from sources in an unknown encoding, because the platform default encoding would be used instead.
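For completeness, a minimal sketch of "the usual Java IO way" mentioned in the comment above: a plain byte-buffer loop that never decodes the data. The class name RawCopy and the 8192-byte buffer size are arbitrary choices for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; copies bytes unchanged, so no charset is involved.
final class RawCopy {

    static long copy(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() returns -1 once the stream is exhausted.
        while ((n = input.read(buffer)) != -1) {
            output.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```

Both streams should still be closed by the caller, for example with try-with-resources around the URL and file streams.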

Invalid character '&#x0' encountered

I am getting the following exception while parsing the XML.
Fatal error at line -1 Invalid character '&#x0' encountered. No stack trace
I have Xml data in string format and I am parsing it using DOM parser.
I am parsing data which is a response from Java server to a Blackberry client.
I also tried parsing with a SAX parser, but the problem is not resolved.
Please help.
You have a null character in your character stream, i.e. char(0) which is not valid in an XML-document. If this is not present in the original string, then it is most likely a character decoding issue.
I got the solution,
I just trimmed it with trim()
and it worked perfectly fine with me.
Your code currently calls getBytes() using the platform default encoding - that's very rarely a good idea. Find out what the encoding of the data really is, and use that. (It's likely to be UTF-8.)
If the Blackberry includes DocumentBuilder.parse(InputSource), that would be preferable:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
StringReader reader = new StringReader(xmlData);
try {
    // Wrap the reader in an org.xml.sax.InputSource so the parser receives characters, not bytes
    Document doc = docBuilder.parse(new InputSource(reader));
    doc.getDocumentElement().normalize();
} finally {
    reader.close();
}
If that doesn't work, have a very close look at your string, e.g. like this:
for (int i = 0; i < xmlData.length(); i++) {
    // Use whatever logging you have on the Blackberry
    System.out.println((int) xmlData.charAt(i));
}
It's possible that the problem is reading the response from the server - if you're reading it badly, you could have Unicode nulls (\u0000) in your string, which may not appear obviously in log/debug output, but would cause the error you've shown.
EDIT: I've just seen that you're getting the base64 data in the first place - so why convert it to a string and then back to bytes? Just decode the base64 to a byte array and then use that as the basis of your ByteArrayInputStream. Then you never have to deal with a text encoding in the first place.
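A rough sketch of that idea in standard Java SE. It assumes java.util.Base64 (Java 8 or later), which a BlackBerry client would have to replace with whatever Base64 decoder its platform provides; the names Base64Xml and parseBase64 are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper; assumes java.util.Base64 is available.
final class Base64Xml {

    static Document parseBase64(String base64Data) throws Exception {
        // Decode straight to bytes; no String/charset round trip in between.
        byte[] xmlBytes = Base64.getDecoder().decode(base64Data);
        InputStream xml = new ByteArrayInputStream(xmlBytes);
        DocumentBuilder docBuilder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = docBuilder.parse(xml);
        doc.getDocumentElement().normalize();
        return doc;
    }
}
```

Given raw bytes, the parser works out the encoding from the XML declaration or byte order mark itself, which is the reason for skipping the intermediate String.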
InputStream xml = new ByteArrayInputStream(xmlData.getBytes());
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(xml);
doc.getDocumentElement().normalize();
xml.close();
Above is the code I am using for parsing.
