I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as —, > and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text — invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
They key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.
I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some text — invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file
Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as —
XML has only five predefined entities. The —, is not among them. It works only when used in plain HTML or in legacy JSP. So, SAX will not help. It can be done using StaX which has high level iterator based API. (Collected from this link)
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
Streaming API for XML, called StaX, is an API for reading and writing XML Documents.
StaX is a Pull-Parsing model. Application can take the control over parsing the XML documents by pulling (taking) the events from the parser.
The core StaX API falls into two categories and they are listed below. They are
Cursor based API: It is low-level API. cursor-based API allows the application to process XML as a stream of tokens aka events
Iterator based API: The higher-level iterator-based API allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
STaX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set into an XmlInputFactory, which is then in turn used to construct an XmlEventReader or XmlStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to notreplace them.
You may try it. Hope it will solve your issue. For your case,
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text — invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to #skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.
There are 5 methods of XMLStreamWriter for document.
xmlsw.writeStartDocument(); - initialises an empty document to which
elements can be added
xmlsw.writeStartElement(String s) -creates a new element named s
xmlsw.writeAttribute(String name, String value)- adds the attribute
name with the corresponding value to the last element produced by a
call to writeStartElement. It is possible to add attributes as long
as no call to writeElementStart,writeCharacters or writeEndElement
has been done.
xmlsw.writeEndElement - close the last started element
xmlsw.writeCharacters(String s) - creates a new text node with
content s as content of the last started element.
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but"+
sym +" found ");
}
}
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf
Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.
Just to throw in a different approach to a solution:
You might envelope your input stream with a stream inplementation that replaces the entities by something legal.
While this is a hack for sure, it should be a quick and easy solution (or better say: workaround).
Not as elegant and clean as a xml framework internal solution, though.
I made yesterday something similar i need to add value from unziped XML in stream to database.
//import I'm not sure if all are necessary :)
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
//I didnt checked this code now because i'm in work for sure its work maybe
you will need to do little changes
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);
// lib which i use common-lang3.jar
//metod to parse
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}
Try this using org.apache.commons package :
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);
Related
I have a requirement to create a sort of 'skeleton' xml based on an XSD schema.
The documents defined by these schemas have no namespace. They are authored by other developers, not in an automated way.
There is no mixed content allowed. That is, elements can contain elements only, or text only.
The rules for this sample xml are:
elements that can contain only text content should not be created in the sample xml
all other optional and mandatory elements should be included in the sample xml
elements should be created only once even if they can occur multiple times
any other nodes such as attributes, comments, processing instruction, etc. should be ommited - the sample xml would be an 'element tree'
Are there APIs or tools in Java that can generate such sample xml? I'm looking for pointers where to get started.
This needs to be done programmatically in a reliable way, as the sample xml is used by other XSLT transformations.
Hope below code will serve your purpose
package com.example.demo;
import java.io.File;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import jlibs.xml.sax.XMLDocument;
import jlibs.xml.xsd.XSInstance;
import jlibs.xml.xsd.XSParser;
public interface xsdtoxml {
public static void main(String[] pArgs) {
try {
String filename = "out.xsd";
// instance.
final Document doc = loadXsdDocument(filename);
//Find the docs root element and use it to find the targetNamespace
final Element rootElem = doc.getDocumentElement();
String targetNamespace = null;
if (rootElem != null && rootElem.getNodeName().equals("xs:schema"))
{
targetNamespace = rootElem.getAttribute("targetNamespace");
}
//Parse the file into an XSModel object
org.apache.xerces.xs.XSModel xsModel = new XSParser().parse(filename);
//Define defaults for the XML generation
XSInstance instance = new XSInstance();
instance.minimumElementsGenerated = 1;
instance.maximumElementsGenerated = 1;
instance.generateDefaultAttributes = true;
instance.generateOptionalAttributes = true;
instance.maximumRecursionDepth = 0;
instance.generateAllChoices = true;
instance.showContentModel = true;
instance.generateOptionalElements = true;
//Build the sample xml doc
//Replace first param to XMLDoc with a file input stream to write to file
QName rootElement = new QName(targetNamespace, "out");
XMLDocument sampleXml = new XMLDocument(new StreamResult(System.out), true, 4, null);
instance.generate(xsModel, rootElement, sampleXml);
} catch (TransformerConfigurationException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static Document loadXsdDocument(String inputName) {
final String filename = inputName;
final DocumentBuilderFactory factory = DocumentBuilderFactory
.newInstance();
factory.setValidating(false);
factory.setIgnoringElementContentWhitespace(true);
factory.setIgnoringComments(true);
Document doc = null;
try {
final DocumentBuilder builder = factory.newDocumentBuilder();
final File inputFile = new File(filename);
doc = builder.parse(inputFile);
} catch (final Exception e) {
e.printStackTrace();
// throw new ContentLoadException(msg);
}
return doc;
}
}
xsd to xml :
1 : you can use eclipse (right click and select Generate)
2 : Sun/Oracle Multi-Schema Validator
3 : xmlgen
see:
How to generate sample XML documents from their DTD or XSD?
for subtle requirements, you should program it yourself
I have issue with using utf-8 with JDOM parser. My code won't write encoded in utf-8.
Here's code:
import java.io.FileWriter;
import java.io.IOException;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.Document;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;
public class Jdom2 {
public static void main(String[] args) throws JDOMException, IOException {
String textValue = null;
try {
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build("path/to/file.xml");
Element rootNode = doc.getRootElement();
Element application = rootNode.getChild("application");
Element de_DE = application.getChild("de_DE");
textValue = de_DE.getText();
System.out.println(textValue);
application.getChild("de_DE").setText(textValue);
Format format = Format.getPrettyFormat();
format.setEncoding("UTF-8");
XMLOutputter xmlOutput = new XMLOutputter(format);
xmlOutput.output(doc, new FileWriter("path/to/file.xml"));
} catch (IOException io) {
io.printStackTrace();
} catch (JDOMException e) {
e.printStackTrace();
}
}
}
So line System.out.println(textValue); prints text "Mit freundlichen Grüßen", which is exactly what is read from xml file, but when it writing down, I get "Mit freundlichen Gr����"
Here's my xml file
<?xml version="1.0" encoding="UTF-8"?>
<applications>
<application id="default">
<de_DE>Mit freundlichen Grüßen</de_DE>
</application>
</applications>
How will I achieve writing "Mit freundlichen Grüßen" instead of "Mit freundlichen Gr����" ?
Thanks in advance!
This is probably the issue:
xmlOutput.output(doc, new FileWriter("path/to/file.xml"));
FileWriter always uses the platform default encoding. If that's not UTF-8, you've got problems.
I suggest you use a FileOutputStream instead, and let JDOM do the right thing. (Also, I strongly suspect you should keep a reference to the stream so you can close it in a finally block...)
I would totally go for #Jon Skeet answer, but if the code is in the 3rd party library that you cannot modify there are two other options:
Run your JVM with additional environment variable specifying file.encoding:
java -Dfile.encoding=UTF-8 … Jdom2
You can try to modify the Charset.defaultCharset which stores encoding used by FileWritter. Keep in mind that it's a very hacky solution that might break at some point and I would avoid it at all:
Field field = null;
for (Field f : Charset.class.getDeclaredFields()) {
if (f.getName().equals("defaultCharset")) {
field = f;
}
}
field.setAccessible(true);
Field modifiersField = Field.class.getDeclaredField("modifiers");
modifiersField.setAccessible(true);
modifiersField.setInt(field, field.getModifiers() & ~Modifier.FINAL);
field.set(null, StandardCharsets.UTF_8);
One last remark. Both solution might introduce side effects in other parts of your application because you are changing default encoding that might be used by some libraries.
I've an xml file that I would avoid having to load all in memory.
As everyone know, for such a file I better have to use a SAX parser (which will go along the file and call for events if something relevant is found.)
My current problem is that I would like to process the file "by chunk" which means:
Parse the file and find a relevant tag (node)
Load this tag entirely in memory (like we would do it in DOM)
Do the process of this entity (that local chunk)
When I'm done with the chunk, release it and continue to 1. (until "end of file")
In a perfect world I'm searching some something like this:
// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to callback the right method once a node is found
p.setHandler(new Handler(){
// 4. The method callback by the parser when a relevant node is found
void aNodeIsFound(saxNode aNode)
{
// 5. Inflate the current node i.e. load it (and all its content) in memory
DomNode d = aNode.expand();
// 6. Do something with the inflated node (method to be defined somewhere)
doThingWithNode(d);
}
});
// 7. Start the parser
p.start();
I'm currently stuck on how to expand a "sax node" (understand me…) efficiently.
Is there any Java framework or library relevant to this kind of task?
UPDATE
You could also just use the javax.xml.xpath APIs:
package forum7998733;
import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
public class XPathDemo {
public static void main(String[] args) throws Exception {
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
InputSource xml = new InputSource(new FileReader("BigFile.xml"));
Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);
System.out.println(result);
}
}
Below is a sample of how it could be done with StAX.
input.xml
Below is some sample XML:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
Demo
In this example a StAX XMLStreamReader is used to find the node that will be converted to a DOM. In this example we convert each statement fragment to a DOM, but your navigation algorithm could be more advanced.
package forum7998733;
import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
xsr.nextTag(); // Advance to statements element
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
DOMResult domResult = new DOMResult();
t.transform(new StAXSource(xsr), domResult);
DOMSource domSource = new DOMSource(domResult.getNode());
StreamResult streamResult = new StreamResult(System.out);
t.transform(domSource, streamResult);
}
}
}
Output
<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
...stuff...
</statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
...stuff...
</statement>
It could be done with SAX... But I think the newer StAX (Streaming API for XML) will serve your purpose better. You could create an XMLEventReader and use that to parse your file, detecting which nodes adhere to one of your criteria. For simple path-based selection (not really XPath, but some simple / delimited path) you'd need to maintain a path to your current node by adding entries to a String on new elements or cutting of entries on an end tag. A boolean flag can suffice to maintain whether you're currently in "relevant mode" or not.
As you obtain XMLEvents from your reader, you could copy the relevant ones over to an XMLEventWriter that you've created on some suitable placeholder, like a StringWriter or ByteArrayOutputStream. Once you've completed the copying for some XML extract that forms a "subdocument" of what you wish to build a DOM for, simply supply your placeholder to a DocumentBuilder in a suitable form.
The limitation here is that you're not harnessing all the power of the XPath language. If you wish to take stuff like node position into account, you'd have to foresee that in your own path. Perhaps someone knows of a good way of integrating a true XPath implementation into this.
StAX is really nice in that it gives you control over the parsing, rather than using some callback interface through a handler like SAX.
There's yet another alternative: using XSLT. An XSLT stylesheet is the ideal way to filter out only relevant stuff. You could transform your input once to obtain the required fragments and process those. Or run multiple stylesheets over the same input to get the desired extract each time. An even nicer (and more efficient) solution, however, would be the use of extension functions and/or extension elements.
Extension functions can be implemented in a way that's independent from the XSLT processor being used. They're fairly straightforward to use in Java and I know for a fact that you can use them to pass complete XML extracts to a method, because I've done so already. Might take some experimentation, but it's a powerful mechanism. A DOM extract (or node) is probably one of the accepted parameter types for such a method. That'd leave the document building up to the XSLT processor which is even easier.
Extension elements are also very useful, but I think they need to be used in an implementation-specific manner. If you're okay with tying yourself to a specific JAXP setup like Xerces + Xalan, they might be the answer.
When going for XSLT, you'll have all the advantages of a full XPath 1.0 implementation, plus the peace of mind that comes from knowing XSLT is in really good shape in Java. It limits the building of the input tree to those nodes that are needed at any time and is blazing fast because the processors tend to compile stylesheets into Java bytecode rather than interpreting them. It is possible that using compilation instead of interpretation loses the possibility of using extension elements, though. Not certain about that. Extension functions are still possible.
Whatever way you choose, there's so much out there for XML processing in Java that you'll find plenty of help in implementing this, should you have no luck in finding a ready-made solution. That'd be the most obvious thing, of course... No need to reinvent the wheel when someone did the hard work.
Good luck!
EDIT: because I'm actually not feeling depressed for once, here's a demo using the StAX solution I whipped up. It's certainly not the cleanest code, but it'll give you the basic idea:
package staxdom;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class DOMExtractor {
private final Set<String> paths;
private final XMLInputFactory inputFactory;
private final XMLOutputFactory outputFactory;
private final DocumentBuilderFactory docBuilderFactory;
private final Stack<QName> activeStack = new Stack<QName>();
private boolean active = false;
private String currentPath = "";
public DOMExtractor(final Set<String> paths) {
this.paths = Collections.unmodifiableSet(new HashSet<String>(paths));
inputFactory = XMLInputFactory.newFactory();
outputFactory = XMLOutputFactory.newFactory();
docBuilderFactory = DocumentBuilderFactory.newInstance();
}
public void parse(final InputStream input) throws XMLStreamException, ParserConfigurationException, SAXException, IOException {
final XMLEventReader reader = inputFactory.createXMLEventReader(input);
XMLEventWriter writer = null;
StringWriter buffer = null;
final DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();
XMLEvent currentEvent = reader.nextEvent();
do {
if(active)
writer.add(currentEvent);
if(currentEvent.isEndElement()) {
if(active) {
activeStack.pop();
if(activeStack.isEmpty()) {
writer.flush();
writer.close();
final Document doc;
final StringReader docReader = new StringReader(buffer.toString());
try {
doc = builder.parse(new InputSource(docReader));
} finally {
docReader.close();
}
//TODO: use doc
//Next bit is only for demo...
outputDoc(doc);
active = false;
writer = null;
buffer = null;
}
}
int index;
if((index = currentPath.lastIndexOf('/')) >= 0)
currentPath = currentPath.substring(0, index);
} else if(currentEvent.isStartElement()) {
final StartElement start = (StartElement)currentEvent;
final QName qName = start.getName();
final String local = qName.getLocalPart();
currentPath += "/" + local;
if(!active && paths.contains(currentPath)) {
active = true;
buffer = new StringWriter();
writer = outputFactory.createXMLEventWriter(buffer);
writer.add(currentEvent);
}
if(active)
activeStack.push(qName);
}
currentEvent = reader.nextEvent();
} while(!currentEvent.isEndDocument());
}
private void outputDoc(final Document doc) {
try {
final Transformer t = TransformerFactory.newInstance().newTransformer();
t.transform(new DOMSource(doc), new StreamResult(System.out));
System.out.println("");
System.out.println("");
} catch(TransformerException ex) {
ex.printStackTrace();
}
}
public static void main(String[] args) {
final Set<String> paths = new HashSet<String>();
paths.add("/root/one");
paths.add("/root/three/embedded");
final DOMExtractor me = new DOMExtractor(paths);
InputStream stream = null;
try {
stream = DOMExtractor.class.getResourceAsStream("sample.xml");
me.parse(stream);
} catch(final Exception e) {
e.printStackTrace();
} finally {
if(stream != null)
try {
stream.close();
} catch(IOException ex) {
ex.printStackTrace();
}
}
}
}
And the sample.xml file (should be in the same package):
<?xml version="1.0" encoding="UTF-8"?>
<root>
<one>
<two>this is text</two>
look, I can even handle mixed!
</one>
... not sure what to do with this, though
<two>
<willbeignored/>
</two>
<three>
<embedded>
<and><here><we><go>
Creative Commons Legal Code
Attribution 3.0 Unported
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
DAMAGES RESULTING FROM ITS USE.
License
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
CONDITIONS.
</go></we></here></and>
</embedded>
</three>
</root>
EDIT 2: Just noticed in Blaise Doughan's answer that there's a StAXSource. That'll be even more efficient. Use that if you're going with StAX. Will eliminate the need to keep some buffer. StAX allows you to "peek" at the next event, so you can check if it's a start element with the right path without consuming it before passing it into the transformer .
ok thanks to your pieces of code, I finally end up with my solution:
Usage is quite intuitive:
try
{
/* CREATE THE PARSER */
XMLParser parser = new XMLParser();
/* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
/* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
XMLHandler handler = new XMLHandler()
{
public void nodeFound(StringBuilder node, XMLStackFilter withFilter)
{
// DO SOMETHING WITH THE FOUND XML NODE
System.out.println("Node found");
System.out.println(node.toString());
}
};
/* ATTACH THE FILTER WITH THE HANDLER */
parser.addFilterWithHandler(filter, handler);
/* SET THE FILE TO PARSE */
parser.setFilePath("/path/to/bigfile.xml");
/* RUN THE PARSER */
parser.parse();
}
catch (Exception ex)
{
ex.printStackTrace();
}
Note:
I made a XMLNodeFoundNotifier and a XMLStackFilter interface to easily integrate or build your own handler / filter.
Normally you should be able to parse very large files with this class. Only the returned nodes are actually loaded into memory.
You can enable attributes support in uncommenting the right part in the code, I disabled it for simplicity reasons.
You can use as many filters per handler as you need and conversely
All the of the code is here:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;
/* IMPLEMENT THIS TO YOUR CLASS IN ORDER TO TO BE NOTIFIED WHEN A NODE IS FOUND*/
interface XMLNodeFoundNotifier {
abstract void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}
/* A SMALL HANDER USEFULL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}
/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODES STACK (PATH)*/
interface XMLStackFilter {
abstract boolean isRelevant(Stack fullPath);
}
/* A VERY USEFULL FILTER USING REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {
Pattern relevantExpression;
XMLRegexFilter(String filterRules) {
relevantExpression = Pattern.compile(filterRules);
}
/* HERE WE ARE ARE ASK TO TELL IF THE CURRENT STACK (LIST OF NODES) IS RELEVANT
* OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
#Override
public boolean isRelevant(Stack fullPath) {
/* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
* ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
* FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
String stackPath = XMLParser.StackToString(fullPath);
Matcher m = relevantExpression.matcher(stackPath);
return m.matches();
}
}
/* THE MAIN PARSER'S CLASS */
public class XMLParser {
HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
HashMap<Integer, Integer> feedingStreams;
Stack<HashMap> currentStack;
String filePath;
XMLParser() {
currentStack = new <HashMap>Stack();
filterHandler = new <XMLStackFilter, XMLNodeFoundNotifier> HashMap();
feedingStreams = new <Integer, Integer>HashMap();
}
public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
filterHandler.put(f, h);
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
/* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT PER DEFAULT
* I DID NOT ADDED THE ATTRIBUTES INTO THE PATH. UNCOMENT THE LINKS ABOVE TO
* DO SO
*/
public static String StackToString(Stack<HashMap> s) {
int k = s.size();
if (k == 0) {
return null;
}
StringBuilder out = new StringBuilder();
out.append(s.get(0).get("tag"));
for (int x = 1; x < k; ++x) {
HashMap node = s.get(x);
out.append('/').append(node.get("tag"));
/*
// UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH
ArrayList <String[]>attributes = (ArrayList)node.get("attr");
if (attributes.size()>0)
{
out.append("[");
for (int i = 0 ; i<attributes.size(); i++)
{
String[]keyValuePair = attributes.get(i);
if (i>0) out.append(",");
out.append(keyValuePair[0]);
out.append("=\"");
out.append(keyValuePair[1]);
out.append("\"");
}
out.append("]");
}*/
}
return out.toString();
}
/*
* ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE GET THE DELIMITERS OF THE FILE
* WE THEN RETRIEVE THE DATA FROM IT.
*/
private StringBuilder getChunk(int from, int to) throws Exception {
int length = to - from;
FileReader f = new FileReader(filePath);
BufferedReader br = new BufferedReader(f);
br.skip(from);
char[] readb = new char[length];
br.read(readb, 0, length);
StringBuilder b = new StringBuilder();
b.append(readb);
return b;
}
/* TRANSFORMS AN XSR NODE TO A HASHMAP NODE'S REPRESENTATION */
public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
HashMap h = new HashMap();
ArrayList attributes = new ArrayList();
for (int i = 0; i < xsr.getAttributeCount(); i++) {
String[] s = new String[2];
s[0] = xsr.getAttributeName(i).toString();
s[1] = xsr.getAttributeValue(i);
attributes.add(s);
}
h.put("tag", xsr.getName());
h.put("attr", attributes);
return h;
}
public void parse() throws Exception {
FileReader f = new FileReader(filePath);
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(f);
Location previousLoc = xsr.getLocation();
while (xsr.hasNext()) {
switch (xsr.next()) {
case XMLStreamConstants.START_ELEMENT:
currentStack.add(XSRNode2HashMap(xsr));
for (XMLStackFilter filter : filterHandler.keySet()) {
if (filter.isRelevant(currentStack)) {
feedingStreams.put(currentStack.hashCode(), new Integer(previousLoc.getCharacterOffset()));
}
}
previousLoc = xsr.getLocation();
break;
case XMLStreamConstants.END_ELEMENT:
Integer stream = null;
if ((stream = feedingStreams.get(currentStack.hashCode())) != null) {
// FIND ALL THE FILTERS RELATED TO THIS FeedingStreem AND CALL THEIR HANDLER.
for (XMLStackFilter filter : filterHandler.keySet()) {
if (filter.isRelevant(currentStack)) {
XMLNodeFoundNotifier h = filterHandler.get(filter);
StringBuilder aChunk = getChunk(stream.intValue(), xsr.getLocation().getCharacterOffset());
h.nodeFound(aChunk, filter);
}
}
feedingStreams.remove(currentStack.hashCode());
}
previousLoc = xsr.getLocation();
currentStack.pop();
break;
default:
break;
}
}
}
}
A little while since i did SAX, but what you want to do is process each of the tags until you find the end tag for the group you want to process, then run your process, clear it out and look for the next start tag.
There is a similar post using w3c dom using insertBefore() in a . I was wondering how do this using dom4j. I want to insert <div id="dynamicdiv"/> as the first element of body. <html><head/><body>[<div id="dynamicdiv">] <many tags></body></html>
Thanks Mat Banik I managed to get your alternative approach to work, I am posting this for others to get benefited. You can get the advantage of both dom4j and w3c dom. Moreover there is only one tree constructed internally which makes the print(document.asXML()) to incorporate the manipulations done by w3c.
Listing
package playground;
import java.io.StringReader;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.dom.DOMDocument;
import org.dom4j.dom.DOMDocumentFactory;
import org.dom4j.io.SAXReader;
public class Dom4jInsertBefore {
public static void main(String[] args) throws DocumentException {
String newNode = "<node>value</node>"; // Convert this to XML
String text = "<root><given></given></root>";
// Document document = DocumentHelper.parseText(text); //type casting
// exception will come while converting to DOMDocument
// use DOMDocumentFactory
// Document newNodeDocument = DocumentHelper.parseText(newNode);
DOMDocumentFactory factory = new DOMDocumentFactory();
SAXReader reader2 = new SAXReader();
reader2.setDocumentFactory(factory);
org.dom4j.Document document = reader2.read(new StringReader(text));
Document newNodeDocument = reader2.read(new StringReader(newNode));
Element givenNode = document.getRootElement().element("given");
givenNode.add(newNodeDocument.getRootElement());
org.dom4j.dom.DOMDocument w3cDoc = (DOMDocument) document;
org.w3c.dom.Element e = w3cDoc.createElement("div");
e.setAttribute("id", "someattr");
w3cDoc.getDocumentElement().getFirstChild().insertBefore(e,
w3cDoc.getDocumentElement().getElementsByTagName("node").item(0));
// w3cDoc.getDocumentElement().getFirstChild().appendChild(e); this works
System.out.println(document.asXML());
}
}
Output
<?xml version="1.0" encoding="UTF-8"?>
<root><given><div id="someattr"/><node>value</node></given></root>`
dom4j supports DOM api as well. Look here for the same method.
I wrote this on the fly without testing but I think this should get you started:
import org.jdom.Attribute;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;
import org.jdom.Document;
import org.jdom.Element;
Document document = new Document();
Element html = new Element("html");
Element body = new Element("body");
Element head = new Element("head");
Element div = new Element("div");
Attribute id = new Attribute("id", "dynamicdiv");
Element moreElements = new Element("moreElements");
document.setRootElement(html);
div.addContent("");
div.setAttribute(id)
moreElements.addContent("");
body.addContent(div);
body.addContent(moreElements);
html.addContent(head);
html.addContent(body);
Or alternatively you could use this method:
Node org.dom4j.dom.DOMElement.insertBefore(Node newChild, Node refChild)
I already have written a DOM parser for a large XML document format that contains a number of items that can be used to automatically generate Java code. This is limited to small expressions that are then merged into a dynamically generated Java source file.
So far - so good. Everything works.
BUT - I wish to be able to embed the line number of the XML node where the Java code was included from (so that if the configuration contains uncompilable code, each method will have a pointer to the source XML document and the line number for ease of debugging). I don't require the line number at parse-time and I don't need to validate the XML Source Document and throw an error at a particular line number. I need to be able to access the line number for each node and attribute in my DOM or per SAX event.
Any suggestions on how I might be able to achieve this?
P.S.
Also, I read the StAX has a method to obtain line number whilst parsing, but ideally I would like to achieve the same result with regular SAX/DOM processing in Java 4/5 rather than become a Java 6+ application or take on extra .jar files.
I know this thread is a little old (sorry), but it has taken me so long to crack this nut I had to share the solution with someone...
You only seem to be able to obtain the line numbers with SAX which doesn't build a DOM. The DOM parser does not give the line numbers, and neither does it let you near the SAX parser it is using. My solution is to do an empty XSLT transformation using a SAX source and a DOM result, but even then someone has done their best to hide this. See the code below.
I add the location information to each element as an attribute with my own namespace, so I can find elements using XPath and report where the data came from.
Hope this helps:
// The file to parse.
String systemId = "myxml.xml";
/*
* Create transformer SAX source that adds current element position to
* the element as attributes.
*/
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
LocationFilter locationFilter = new LocationFilter(xmlReader);
InputSource inputSource = new InputSource(new FileReader(systemId));
// Do this so that XPath function document() can take relative URI.
inputSource.setSystemId(systemId);
SAXSource saxSource = new SAXSource(locationFilter, inputSource);
/*
* Perform an empty transformation from SAX source to DOM result.
*/
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMResult domResult = new DOMResult();
transformer.transform(saxSource, domResult);
Node root = domResult.getNode();
...
class LocationFilter extends XMLFilterImpl {
LocationFilter(XMLReader xmlReader) {
super(xmlReader);
}
private Locator locator = null;
#Override
public void setDocumentLocator(Locator locator) {
super.setDocumentLocator(locator);
this.locator = locator;
}
#Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
// Add extra attribute to elements to hold location
String location = locator.getSystemId() + ':' + locator.getLineNumber() + ':' + locator.getColumnNumber();
Attributes2Impl attrs = new Attributes2Impl(attributes);
attrs.addAttribute("http://myNamespace", "location", "myns:location", "CDATA", location);
super.startElement(uri, localName, qName, attrs);
}
}
I ran into this issue recently and I thought I'd share a ready made utility class for handling it. Works with Java 11, whereas some of Reg Whitton's code uses some now deprecated classes.
Mostly based on this article with a few tweaks. Notably, storing the line number as a the node's user data rather than setting it as an attribute.
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class XmlDom {
public static Document readXML(InputStream is, final String lineNumAttribName) throws IOException, SAXException {
final Document doc;
SAXParser parser;
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
parser = factory.newSAXParser();
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
doc = docBuilder.newDocument();
} catch(ParserConfigurationException e){
throw new RuntimeException("Can't create SAX parser / DOM builder.", e);
}
final Deque<Element> elementStack = new ArrayDeque<>();
final StringBuilder textBuffer = new StringBuilder();
DefaultHandler handler = new DefaultHandler() {
private Locator locator;
#Override
public void setDocumentLocator(Locator locator) {
this.locator = locator; //Save the locator, so that it can be used later for line tracking when traversing nodes.
}
#Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
addTextIfNeeded();
Element el = doc.createElement(qName);
for(int i = 0;i < attributes.getLength(); i++)
el.setAttribute(attributes.getQName(i), attributes.getValue(i));
el.setUserData(lineNumAttribName, String.valueOf(locator.getLineNumber()), null);
elementStack.push(el);
}
#Override
public void endElement(String uri, String localName, String qName){
addTextIfNeeded();
Element closedEl = elementStack.pop();
if (elementStack.isEmpty()) { // Is this the root element?
doc.appendChild(closedEl);
} else {
Element parentEl = elementStack.peek();
parentEl.appendChild(closedEl);
}
}
#Override
public void characters (char ch[], int start, int length) throws SAXException {
textBuffer.append(ch, start, length);
}
// Outputs text accumulated under the current node
private void addTextIfNeeded() {
if (textBuffer.length() > 0) {
Element el = elementStack.peek();
Node textNode = doc.createTextNode(textBuffer.toString());
el.appendChild(textNode);
textBuffer.delete(0, textBuffer.length());
}
}
};
parser.parse(is, handler);
return doc;
}
}
Access the line number with
node.getUserData(lineNumAttribName);