SAXReader does not re-escape characters - java

I'm reading an XML file with dom4j. The file looks like this:
...
<Field>&#13;&#10; hello, world...</Field>
...
I read the file with SAXReader into a Document. When I call getText() on the node, I obtain the following String:
\r\n hello, world...
I do some processing and then write another file using asXml(). But the characters are not escaped as they were in the original file, which causes an error in the external system that uses the file.
How can I escape the special characters and have &#13;&#10; when writing the file?

You cannot easily. Those aren't 'escapes', they are 'character entities'. They are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the species that are defined in a DTD.

It depends on what you're getting and what you want (see my previous comment.)
The SAX reader is doing nothing wrong - your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters, you will need to insert a \ (backslash) character followed by the "r" or "n" characters (or both.)
If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:
myString = myString.Replace("\r\n", "\\r\\n");
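Since the question is about Java, here is a rough Java equivalent of the C# line above. It is only a sketch: the class and method names are made up for illustration, and unlike the C# snippet it also converts a lone CR or a lone LF, not just CRLF pairs.

```java
final class LineBreakEscaper {
    private LineBreakEscaper() {}

    /** Replaces each CR and LF character with the two-character text "\r" or "\n". */
    static String escapeLineBreaks(String text) {
        // String.replace works on literal text, so no regex escaping is needed.
        return text.replace("\r", "\\r").replace("\n", "\\n");
    }
}
```

You would call it on the value returned by getText() before handing the string to whatever expects the backslash form.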

XML entities are abstracted away in DOM. Content is exposed as a String without the need to bother about the encoding -- which in most cases is what you want.
But SAX has some support for how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity, and pass it as a parameter to the SAXReader. But I fear it may not work:
The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element)
Otherwise you could try to configure a LexicalHandler for SAX in a way to be notified when an entity is encountered. Javadoc for LexicalHandler#startEntity says:
Report the beginning of some internal and external XML entities.
You will not be able to change the resolving, but that may still help.
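To make the LexicalHandler idea concrete, here is a minimal sketch using only the JDK's SAX parser rather than dom4j. The class name and the test document are invented for illustration. It registers the handler through the standard SAX2 property and records every name passed to startEntity. As far as I know, only named entities (such as ones declared in a DTD) are reported this way, and numeric character references like &#13; are not, so this may not help with the original problem.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class EntityLogger {
    private EntityLogger() {}

    /** Parses the XML text and returns the names passed to LexicalHandler#startEntity. */
    static List<String> reportedEntities(String xml) {
        List<String> names = new ArrayList<>();
        // DefaultHandler2 provides no-op implementations of ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                names.add(name);
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Standard SAX2 property id for registering a LexicalHandler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }
}
```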
EDIT
You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See reading an XML file and writing an XML file. Don't use asXml() and dump the file yourself.
FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.flush();

You can pre-process the input stream to replace & with a placeholder, e.g. [$AMPERSAND_CHARACTER$], then do the processing with dom4j, and post-process the output stream to substitute the & back.
Example (using streamflyer):
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;
// Pre-process
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");
// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...
// Post-process
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");
// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();
You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.
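If you would rather not pull in an extra library, the same placeholder trick can be sketched with plain String.replace and the JDK's own DOM parser and Transformer. This is only an illustration under some assumptions: the class name is invented, the whole document is held in memory as a String (so it is not a streaming solution), and the placeholder must never occur in the real data.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class AmpersandRoundTrip {
    // Placeholder text; assumed never to appear in the actual document.
    private static final String PLACEHOLDER = "[$AMPERSAND_CHARACTER$]";

    private AmpersandRoundTrip() {}

    static String roundTrip(String xml) {
        try {
            // Pre-process: hide every '&' so the parser has no references to decode.
            String masked = xml.replace("&", PLACEHOLDER);
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(masked)));

            // ... modify the document here ...

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));

            // Post-process: put the original '&' characters back.
            return out.toString().replace(PLACEHOLDER, "&");
        } catch (Exception e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }
}
```

Keep in mind that while the document is in memory, the text nodes contain the placeholder rather than the real characters, so any processing that looks at those values has to account for that.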

Related

How to modify a huge XML file by StAX?

I have a huge XML (~2GB) and I need to add new Elements and modify the old ones. For example, I have:
<books>
<book>....</book>
...
<book>....</book>
</books>
And want to get:
<books>
<book>
<index></index>
....
</book>
...
<book>
<index></index>
....
</book>
</books>
I used the following code:
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inFactory.createXMLEventReader(new FileInputStream(file));
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(new FileWriter(file, true));
while (eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
if (event.getEventType() == XMLEvent.START_ELEMENT) {
if (event.asStartElement().getName().toString().equalsIgnoreCase("book")) {
writer.writeStartElement("index");
writer.writeEndElement();
}
}
}
writer.close();
But the result was the following:
<books>
<book>....</book>
....
<book>....</book>
</books><index></index>
Any ideas?
Try this
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLEventReader eventReader = inFactory.createXMLEventReader(new FileInputStream("1.xml"));
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLEventWriter writer = factory.createXMLEventWriter(new FileWriter(file));
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
while (eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
writer.add(event);
if (event.getEventType() == XMLEvent.START_ELEMENT) {
if (event.asStartElement().getName().toString().equalsIgnoreCase("book")) {
writer.add(eventFactory.createStartElement("", null, "index"));
writer.add(eventFactory.createEndElement("", null, "index"));
}
}
}
writer.close();
Notes
new FileWriter(file, true) appends to the end of the file, which is almost certainly not what you want
equalsIgnoreCase("book") is a bad idea because XML is case-sensitive
Well it is pretty clear why it behaves the way it does. What you are actually doing is opening the existing file in output append mode and writing elements at the end. That clearly contradicts what you are trying to do.
(Aside: I'm surprised that it works as well as it does, given that the input side is likely to see the elements that the output side is adding to the end of the file. And indeed the exceptions that Evgeniy Dorofeev's example gives are the sort of thing I'd expect. The problem is that if you attempt to read and write a text file at the same time, and either the reader or writer uses any form of buffering, explicit or implicit, the reader is liable to see partial states.)
To fix this you have to start by reading from one file and writing to a different file. Appending won't work. Then you have to arrange that the elements, attributes, content etc that are read from the input file are copied to the output file. Finally, you need to add the extra elements at the appropriate points.
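The three steps above (separate input and output, copy everything, add the extra elements) could look roughly like this. It is a sketch, not a drop-in solution: the class and method names are invented, it takes a Reader and a Writer so you can point them at two different files, and it compares the element name case-sensitively.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class IndexInserter {
    private IndexInserter() {}

    /** Copies every event from in to out, adding an empty index element inside each book. */
    static void addIndex(Reader in, Writer out) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            XMLEventFactory events = XMLEventFactory.newInstance();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                writer.add(event); // copy the original event first
                // Case-sensitive comparison, since XML names are case-sensitive.
                if (event.isStartElement()
                        && "book".equals(event.asStartElement().getName().getLocalPart())) {
                    writer.add(events.createStartElement("", "", "index"));
                    writer.add(events.createEndElement("", "", "index"));
                }
            }
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not rewrite XML", e);
        }
    }
}
```

For a 2GB file you would pass a buffered reader on the source file and a buffered writer on a new file, then replace the original once the copy has finished. Only one event is held in memory at a time.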
And is there any possibility to open the XML file in mode like RandomAccessFile, but write in it by StAX methods?
No. That is theoretically impossible. In order to be able to navigate around an XML file's structure in a "random" way, you'd first need to parse the whole thing and build an index of where all the elements are. Even when you've done that, the XML is still stored as characters in a file, and random access does not allow you to insert and remove characters in the middle of a file.
Maybe your best bet would be combining XSL and a SAX style parser; e.g. something along the lines of this IBM article: http://ibm.com/developerworks/xml/library/x-tiptrax
Maybe this StAX Read-and-Write Example in JavaEE tutorial helps: http://docs.oracle.com/javaee/5/tutorial/doc/bnbfl.html#bnbgq
You can download the tutorial examples here: https://java.net/projects/javaeetutorial/downloads

Stop Jsoup from encoding

I'm trying to parse a URL with Jsoup which contains the following text: Ætterni.
After parsing the document the same string looks like this: &AElig;tterni.
How do I prevent this from happening? I want the document 1:1 exactly like it was.
Code:
doc = Jsoup.connect(url).get();
String docEncoding=doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink),docEncoding);
writer.write(doc.html());
writer.close();
Use
doc.outputSettings().escapeMode(EscapeMode.xhtml);
for avoiding entities conversion.
You do not seem to be using Jsoup's features in any way. I'd just stream the HTML plain using java.net.URL. This way you have a 1:1 copy of the response.
InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.
You should not use Reader/Writer for this, as they may corrupt characters when the source encoding is unknown, because the platform default encoding would be used instead.
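For completeness, "the usual Java IO way" of copying bytes could be as short as the sketch below. The class name is invented, and InputStream.transferTo needs Java 9 or later. On older versions you would loop over a byte[] buffer instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class RawCopy {
    private RawCopy() {}

    /** Copies all bytes unchanged, closes both streams, and returns the number of bytes copied. */
    static long copy(InputStream input, OutputStream output) {
        try (InputStream in = input; OutputStream out = output) {
            return in.transferTo(out); // Java 9+; no character decoding takes place
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the variables from the question, the call would be something like RawCopy.copy(new URL(url).openStream(), new FileOutputStream(localLink)).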

How to let the SAX parser determine the encoding from the xml declaration?

I'm trying to parse xml files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8 this is fine. However some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared SAX still defaults to UTF-8.
Only if I add:
is.setEncoding("ISO-8859-1");
Will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the xml declaration without me specifically setting it? I need this because I don't know beforehand what the encoding of the file will be.
Thanks in advance,
Allan
Use InputStream as argument to InputSource when you want Sax to autodetect the encoding.
If you want to set a specific encoding, use Reader with a specified encoding or setEncoding method.
Why? Because autodetection encoding algorithms require raw data, not converted to characters.
The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to the question misleading and I provided the alternative one, based on Jörn Horstmann's comment and my later experience.
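Here is a small self-contained check of the byte-stream approach, using only the JDK. The class name and the sample document are made up for the demonstration. The bytes are encoded as ISO-8859-1 and declare that encoding, and no encoding is set on the InputSource.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EncodingDemo {
    private EncodingDemo() {}

    /** Parses raw bytes through a byte-stream InputSource and returns all character data. */
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // No setEncoding() and no Reader: the parser inspects the bytes itself.
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return text.toString();
    }
}
```

If the parser ignored the declaration and assumed UTF-8, the lone 0xE9 byte in an ISO-8859-1 document would not decode correctly, so getting the expected text back is a reasonable sign that autodetection worked.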
I found the answer myself.
The SAX parser uses InputSource internally and from the InputSource docs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);

How to create XML file?

I have some data which my program discovers after observing a few things about files.
For instance, I know the file name, the time the file was last changed, whether the file is binary or ASCII text, the file content (assuming it is properties) and some other stuff.
I would like to store this data in XML format.
How would you go about doing it?
Please provide example.
If you want something quick and relatively painless, use XStream, which lets you serialise Java Objects to and from XML. The tutorial contains some quick examples.
Use StAX; it's so much easier than SAX or DOM to write an XML file (DOM is probably the easiest to read an XML file but requires you to have the whole thing in memory), and is built into Java SE 6.
A good demo is found here on p.2:
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);
writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeAttribute("id", "g1");
writer.writeCharacters("Hello StAX");
writer.writeEndDocument();
writer.flush();
writer.close();
out.close();
The standard choice is the W3C DOM libraries.
final Document docToSave = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
final Element fileInfo = docToSave.createElement("fileInfo");
docToSave.appendChild(fileInfo);
final Element fileName = docToSave.createElement("fileName");
fileName.setTextContent("filename.bin"); // setNodeValue() has no effect on an Element
fileInfo.appendChild(fileName);
return docToSave;
XML is almost never the easiest thing to do.
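The DOM snippet only builds the document in memory; it still has to be written out. One way to do that with the JDK alone is an identity Transformer, sketched below. The class and method names are invented, and the element names simply follow the snippet above.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class FileInfoXml {
    private FileInfoXml() {}

    /** Builds a small fileInfo document and serializes it to a string. */
    static String toXml(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element fileInfo = doc.createElement("fileInfo");
            doc.appendChild(fileInfo);
            Element fileName = doc.createElement("fileName");
            fileName.setTextContent(name); // text goes into a child text node
            fileInfo.appendChild(fileName);

            // A Transformer created without a stylesheet copies the DOM tree unchanged.
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }
}
```

To write straight to disk, pass new StreamResult(new File("data.xml")) instead of the StringWriter.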
You can use SAX or DOM to do that; see this link: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044810.html
I think that is what you want.

How to preserve newlines in CDATA when generating XML?

I want to write some text that contains whitespace characters such as newline and tab into an xml file so I use
Element element = xmldoc.createElement("TestElement");
element.appendChild(xmldoc.createCDATASection(somestring));
but when I read this back in using
Node vs = xmldoc.getElementsByTagName("TestElement").item(0);
String x = vs.getFirstChild().getNodeValue();
I get a string that has no newlines anymore.
When I look directly at the XML on disk, the newlines seem to be preserved, so the problem occurs when reading the XML file back in.
How can I preserve the newlines?
Thanks!
I don't know how you parse and write your document, but here's an enhanced code example based on yours:
// creating the document in-memory
Document xmldoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element element = xmldoc.createElement("TestElement");
xmldoc.appendChild(element);
element.appendChild(xmldoc.createCDATASection("first line\nsecond line\n"));
// serializing the xml to a string
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl =
(DOMImplementationLS)registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
String str = writer.writeToString(xmldoc);
// printing the xml for verification of whitespace in cdata
System.out.println("--- XML ---");
System.out.println(str);
// de-serializing the xml from the string
final Charset charset = Charset.forName("utf-16");
final ByteArrayInputStream input = new ByteArrayInputStream(str.getBytes(charset));
Document xmldoc2 = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input);
Node vs = xmldoc2.getElementsByTagName("TestElement").item(0);
final Node child = vs.getFirstChild();
String x = child.getNodeValue();
// print the value, yay!
System.out.println("--- Node Text ---");
System.out.println(x);
The serialization using LSSerializer is the W3C way to do it (see here). The output is as expected, with line separators:
--- XML ---
<?xml version="1.0" encoding="UTF-16"?>
<TestElement><![CDATA[first line
second line
]]></TestElement>
--- Node Text ---
first line
second line
You need to check the type of each node using node.getNodeType(). If the type is CDATA_SECTION_NODE, you need to concat the CDATA guards to node.getNodeValue.
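A minimal sketch of the node-type check, using the JDK's default DOM parser. The class name is invented. It walks the children of the root element and returns the value of the first CDATA section it finds.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class CdataReader {
    private CdataReader() {}

    /** Returns the content of the first CDATA section directly under the root, or null. */
    static String firstCdata(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node child = doc.getDocumentElement().getFirstChild();
            for (; child != null; child = child.getNextSibling()) {
                // Plain text and CDATA are different node types in DOM.
                if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                    return child.getNodeValue();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

This relies on the factory's default setting, where CDATA sections are not coalesced into neighbouring text nodes. If setCoalescing(true) has been called on the factory, the check above will not find anything.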
You don't necessarily have to use CDATA to preserve white space characters.
The XML specification specifies how to encode these characters.
So for example, if you have an element whose value contains a newline, you should encode it with &#xA;
Carriage return: &#xD;
And so forth
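To see the difference this makes, the sketch below parses the same content twice with the JDK's DOM parser: once with character references and once with a literal CR LF pair. The class name is invented. Under the XML end-of-line rules a literal CR LF is normalized to a single LF on input, while the character references come through untouched.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class WhitespaceRefs {
    private WhitespaceRefs() {}

    /** Parses the XML text and returns the text content of the root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

For example, textOf("<e>a&#xD;&#xA;b</e>") keeps both the carriage return and the newline, whereas textOf("<e>a\r\nb</e>") comes back with only the newline.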
EDIT: cut all the irrelevant stuff
I'm curious to know what DOM implementation you're using, because it doesn't mirror the default behaviour of the one in a couple of JVMs I've tried (they ship with a Xerces impl). I'm also interested in what newline characters your document has.
I'm not sure whether it is a given that CDATA should preserve whitespace. I suspect that there are many factors involved. Don't DTDs/schemas affect how whitespace is processed?
You could try using the xml:space="preserve" attribute.
xml:space='preserve' is not it. That is only for "all whitespace" nodes. That is, if you want the whitespace nodes in
<this xml:space='preserve'> <has/>
<whitespace/>
</this>
But see that those whitespace nodes are ONLY whitespace.
I have been struggling to get Xerces to generate events allowing isolation of CDATA content as well. I have no solution as yet.
