How to let the SAX parser determine the encoding from the XML declaration? - java

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8, this is fine. However, some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared, SAX still defaults to UTF-8.
Only if I add:
is.setEncoding("ISO-8859-1");
will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.
Thanks in advance,
Allan

Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.
If you want to set a specific encoding, use a Reader with a specified encoding or the setEncoding method.
Why? Because encoding autodetection algorithms require raw bytes, not data that has already been converted to characters.
The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to the question misleading and I provided the alternative one, based on Jörn Horstmann's comment and my later experience.
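A minimal sketch of the byte-stream approach described above, using only the JDK's built-in SAX parser. The class and method names (ByteStreamAutodetect, parseText) are hypothetical and chosen for illustration; the point is that the InputSource wraps an InputStream and setEncoding is never called.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question.
final class ByteStreamAutodetect {

    /** Parses raw XML bytes through a byte stream and returns all character data. */
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // Byte stream and no setEncoding(): the parser inspects the
            // XML declaration (and any byte order mark) on its own.
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }
}
```

Feeding it a document that declares encoding="ISO-8859-1" and is actually encoded that way should yield correctly decoded text without the caller naming the encoding anywhere.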

I found the answer myself.
The SAX parser uses InputSource internally and from the InputSource docs:
The SAX parser will use the
InputSource object to determine how to
read XML input. If there is a
character stream available, the parser
will read that stream directly,
disregarding any text encoding
declaration found in that stream. If
there is no character stream, but
there is a byte stream, the parser
will use that byte stream, using the
encoding specified in the InputSource
or else (if no encoding is specified)
autodetecting the character encoding
using an algorithm such as the one in
the XML specification. If neither a
character stream nor a byte stream is
available, the parser will attempt to
open a URI connection to the resource
identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See the solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);

Related

UTF-8 in clobval query and sax parser

I am using the below oracle query to retrieve the data from Oracle database. My column type is XMLTYPE:
"select a.xmlrecord.getClobVal() xmlrecord " +
"from " + tablename + " a"
The reason I am using getClobVal() is that getStringVal() has a limitation: it cannot retrieve more than 4000 characters in Oracle.
Currently I am extracting the data from the database and sending it directly to the SAX parser. Below is the piece of code I'm using:
while (orset.next()){
Reader reader = new BufferedReader(orset.getCharacterStream("xmlrecord")); // to retrieve getClob
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
sp.parse(is, handler);
}
The problem is that we are unable to retrieve UTF-8 characters even though I am setting the encoding to UTF-8 in my code.
Kindly assist.
Your reader is a character stream, not a byte stream. Encodings are ignored for character streams and only affect byte streams, so if you want the encoding to take effect, hand the parser a byte stream instead of a character stream.
I am quoting two sources below,
Class InputSource
The SAX parser will use the InputSource object to determine how to
read XML input. If there is a character stream available, the parser
will read that stream directly, disregarding any text encoding
declaration found in that stream. If there is no character stream, but
there is a byte stream, the parser will use that byte stream, using
the encoding specified in the InputSource or else (if no encoding is
specified) autodetecting the character encoding using an algorithm
such as the one in the XML specification. If neither a character
stream nor a byte stream is available, the parser will attempt to open
a URI connection to the resource identified by the system identifier.
setEncoding
This method has no effect when the application provides a character
stream.
UTF-8 works fine with the character stream from the result set.
The code above returned the UTF-8 characters correctly; the problem was that the Windows machine did not support the UTF-8 character set.
Finally, we installed a package for Arabic characters (UTF-8) on the Windows PC and the issue was resolved.

Parsing InputStream from URL StAX

I am trying to parse an RSS feed and have a problem with encoding.
If the encoding is UTF-8, the result is correct, but there are problems with other encodings, especially windows-1251.
The code is below:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
InputStream in = new URL(channel.getUrl()).openStream();
XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
I don't want to save the content to a local file after reading it. Can anybody help?
It is very difficult to guess the encoding just by analyzing the bytes from the input stream. Therefore, the platform's default encoding is normally used when you do not specify one.
However, an XMLInputFactory can create an XMLEventReader that uses a specific encoding. Just call the method XMLInputFactory.createXMLEventReader(InputStream stream, String encoding).
That means you must know the encoding beforehand. Maybe there is a contract for the interface you are serving.
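As a rough sketch of that call, assuming the encoding is already known to the caller, the following helper collects the text content of a document through an XMLEventReader created with an explicit encoding. The class and method names (StaxExplicitEncoding, readText) are hypothetical.

```java
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of the original answer.
final class StaxExplicitEncoding {

    /** Reads all character data from the stream, decoding it with the given encoding. */
    static String readText(InputStream in, String encoding) {
        StringBuilder text = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The two-argument overload tells the reader which encoding to use.
            XMLEventReader events = factory.createXMLEventReader(in, encoding);
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                }
            }
            events.close();
        } catch (Exception e) {
            throw new IllegalStateException("StAX parsing failed", e);
        }
        return text.toString();
    }
}
```

For the feed in the question, a call might look like readText(new URL(channel.getUrl()).openStream(), "windows-1251"), provided the runtime ships that charset.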

SAX Parser doesn't recognize windows-1255 encoding

I'm working on an RSS parser for Android
(upgrading a parser I found on the internet).
From what I know, the SAX parser recognizes the encoding automatically from the XML declaration, but when I try to parse a feed that declares the windows-1255 encoding, it fails to parse it and throws an exception.
I tried a few things:
final InputSource source = new InputSource(feed);
Reader isr = new InputStreamReader(feed);
source.setCharacterStream(isr);
I even tried telling it the specific encoding:
source.setEncoding("Windows-1255");
Tried to look at the locator:
@Override
public void setDocumentLocator(Locator locator) {
}
And it recognizes the encoding as UTF-16.
Please help me solve this annoying problem!
Sorry for the mess with the code snippets; the code button refuses to work for some reason.
Chances are the platform itself doesn't know about the "windows-1255" encoding. After all, it's a Windows-based encoding - I wouldn't want to rely on it being available on any other platforms, particularly mobile ones where things are generally cut down to the "must-have" options.
You need to set the encoding on the InputStreamReader.
Reader isr = new InputStreamReader(feed, "windows-1255");
final InputSource source = new InputSource(isr);
From the Javadoc, the logic for reading from an InputSource goes something like this:
Is there a character stream? If there is, use that (this is what happens if you use a Reader such as InputStreamReader).
Otherwise:
No character stream? Use the byte stream (InputStream).
Is there an encoding set for the InputSource? Use that.
Was no encoding set? Try to detect the encoding from the XML file itself.
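The character-stream branch of that logic can be observed with a small experiment. This is only a sketch with hypothetical names (CharacterStreamDemo, parseViaReader): the bytes are decoded by an InputStreamReader with a caller-chosen charset, and the setEncoding call is included purely to show that it does not change the outcome once a Reader is supplied.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original answer.
final class CharacterStreamDemo {

    /** Decodes the bytes with the given charset first, then parses the resulting characters. */
    static String parseViaReader(byte[] xmlBytes, Charset readerCharset) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), readerCharset);
            InputSource source = new InputSource(reader);
            // Documented as having no effect when a character stream is provided.
            source.setEncoding("ISO-8859-1");
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }
}
```

With ISO-8859-1 bytes, passing StandardCharsets.ISO_8859_1 gives the expected text, while passing StandardCharsets.UTF_8 gives mangled characters even though both the XML declaration and setEncoding name ISO-8859-1. Whatever charset the Reader uses is what the parser sees.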

SAXReader does not re-escape characters

I'm reading a XML file with dom4j. The file looks like this:
...
<Field>
hello, world...</Field>
...
I read the file with SAXReader into a Document. When I use getText() on the node I obtain the following String:
\r\n hello, world...
I do some processing and then write another file using asXml(). But the characters are not escaped as in the original file, which results in an error in the external system that uses the file.
How can I escape the special characters so that they are written as character references again when writing the file?
You cannot easily. Those aren't 'escapes', they are 'character entities'. They are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the species that are defined in a DTD.
It depends on what you're getting and what you want (see my previous comment.)
The SAX reader is doing nothing wrong - your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters, you will need to insert a \ (backslash) character followed by the "r" or "n" characters (or both).
If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:
myString = myString.Replace("\r\n", "\\r\\n");
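Since the question is about Java, a rough Java equivalent of that C# line might look like the following. The class and method names (NewlineEscaper, escapeNewlines) are hypothetical; String.replace is used because it treats both arguments literally, with no regular expressions involved.

```java
// Hypothetical helper, not part of the original answer.
final class NewlineEscaper {

    /** Replaces each CR LF pair with the four literal characters \r\n. */
    static String escapeNewlines(String text) {
        return text.replace("\r\n", "\\r\\n");
    }
}
```

Like the C# version, this only handles the CR LF pair; lone \r or \n characters are left untouched.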
XML entities are abstracted away in DOM. Content is exposed as a String without the need to bother about the encoding -- which in most cases is what you want.
But SAX has some support for controlling how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity, and pass it as a parameter to the SAXReader. But I fear it may not work:
The Parser will call this method
before opening any external entity
except the top-level document entity
(including the external DTD subset,
external entities referenced within
the DTD, and external entities
referenced within the document
element)
Otherwise you could try to configure a LexicalHandler for SAX in a way to be notified when an entity is encountered. Javadoc for LexicalHandler#startEntity says:
Report the beginning of some internal
and external XML entities.
You will not be able to change the resolving, but that may still help.
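As a sketch of the LexicalHandler idea, using the plain JDK SAX API rather than dom4j, the following registers a handler through the standard lexical-handler property and records the names passed to startEntity. The class and method names (EntityLogger, entityNames) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class, not part of the original answer.
final class EntityLogger {

    /** Returns the entity names reported through LexicalHandler#startEntity. */
    static List<String> entityNames(String xml) {
        List<String> names = new ArrayList<>();
        // DefaultHandler2 implements LexicalHandler with empty methods.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                names.add(name);
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Standard SAX2 property for registering a LexicalHandler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return names;
    }
}
```

Note that this reports general entities declared in a DTD. As far as I know, numeric character references are expanded by the parser without any startEntity callback, so this approach is unlikely to help with preserving them.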
EDIT
You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See reading an XML file and writing an XML file. Don't use asXml() and dump the file yourself.
FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.flush();
You can pre-process the input stream to replace & with, e.g., [$AMPERSAND_CHARACTER$], then do the work with dom4j, and post-process the output stream to substitute it back.
Example (using streamflyer):
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;
// Pre-process
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");
// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...
// Post-process
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");
// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();
You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.

Validating a HUGE XML file

I'm trying to find a way to validate a large XML file against an XSD. I saw the question ...best way to validate an XML... but the answers all pointed to using the Xerces library for validation. The only problem is that when I use that library to validate a 180 MB file, I get an OutOfMemoryException.
Are there any other tools, libraries, or strategies for validating a larger-than-normal XML file?
EDIT: The SAX solution worked for Java validation, but the other two suggestions for the libxml tool were very helpful as well for validation outside of Java.
Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new SimpleErrorHandler());
reader.parse(new InputSource(new FileInputStream("document.xml"))); // byte stream, so the parser can detect the encoding itself
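Since the question asks about validating against an XSD, here is an alternative sketch using the JDK's javax.xml.validation API, which takes the document as a StreamSource instead of requiring a DOM tree. The class and method names (XsdStreamValidator, isValid) are hypothetical, and whether memory use stays low for a 180 MB file depends on the implementation and the schema, so treat this as something to try rather than a guaranteed fix.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original answer.
final class XsdStreamValidator {

    /** Returns true if the document validates against the given XSD. */
    static boolean isValid(StreamSource schemaSource, StreamSource document) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(schemaSource);
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(document);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For files on disk, both arguments can be built with new StreamSource(new File(...)), so the document is read from the file as validation proceeds.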
Use libxml, which performs validation and has a streaming mode.
Personally I like to use XMLStarlet which has a command line interface, and works on streams. It is a set of tools built on Libxml2.
SAX and libXML will help, as already mentioned. You could also try increasing the maximum heap size for the JVM using the -Xmx option. E.g. to set the maximum heap size to 512MB: java -Xmx512m com.foo.MyClass
