I have a requirement to create a sort of 'skeleton' xml based on an XSD schema.
The documents defined by these schemas have no namespace. They are authored by other developers, not in an automated way.
There is no mixed content allowed. That is, elements can contain elements only, or text only.
The rules for this sample xml are:
elements that can contain only text content should not be created in the sample xml
all other optional and mandatory elements should be included in the sample xml
elements should be created only once even if they can occur multiple times
any other nodes such as attributes, comments, processing instruction, etc. should be ommited - the sample xml would be an 'element tree'
Are there APIs or tools in Java that can generate such sample xml? I'm looking for pointers where to get started.
This needs to be done programmatically in a reliable way, as the sample xml is used by other XSLT transformations.
Hope below code will serve your purpose
package com.example.demo;
import java.io.File;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import jlibs.xml.sax.XMLDocument;
import jlibs.xml.xsd.XSInstance;
import jlibs.xml.xsd.XSParser;
public interface xsdtoxml {
public static void main(String[] pArgs) {
try {
String filename = "out.xsd";
// instance.
final Document doc = loadXsdDocument(filename);
//Find the docs root element and use it to find the targetNamespace
final Element rootElem = doc.getDocumentElement();
String targetNamespace = null;
if (rootElem != null && rootElem.getNodeName().equals("xs:schema"))
{
targetNamespace = rootElem.getAttribute("targetNamespace");
}
//Parse the file into an XSModel object
org.apache.xerces.xs.XSModel xsModel = new XSParser().parse(filename);
//Define defaults for the XML generation
XSInstance instance = new XSInstance();
instance.minimumElementsGenerated = 1;
instance.maximumElementsGenerated = 1;
instance.generateDefaultAttributes = true;
instance.generateOptionalAttributes = true;
instance.maximumRecursionDepth = 0;
instance.generateAllChoices = true;
instance.showContentModel = true;
instance.generateOptionalElements = true;
//Build the sample xml doc
//Replace first param to XMLDoc with a file input stream to write to file
QName rootElement = new QName(targetNamespace, "out");
XMLDocument sampleXml = new XMLDocument(new StreamResult(System.out), true, 4, null);
instance.generate(xsModel, rootElement, sampleXml);
} catch (TransformerConfigurationException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static Document loadXsdDocument(String inputName) {
final String filename = inputName;
final DocumentBuilderFactory factory = DocumentBuilderFactory
.newInstance();
factory.setValidating(false);
factory.setIgnoringElementContentWhitespace(true);
factory.setIgnoringComments(true);
Document doc = null;
try {
final DocumentBuilder builder = factory.newDocumentBuilder();
final File inputFile = new File(filename);
doc = builder.parse(inputFile);
} catch (final Exception e) {
e.printStackTrace();
// throw new ContentLoadException(msg);
}
return doc;
}
}
xsd to xml :
1 : you can use eclipse (right click and select Generate)
2 : Sun/Oracle Multi-Schema Validator
3 : xmlgen
see:
How to generate sample XML documents from their DTD or XSD?
for subtle requirements, you should program it yourself
Related
I have a situation where I'd like to start using an XML Schema to validate documents that, until now, have never had a schema definition. As such, the existing documents I'd like to validate do not have any xmlns declaration in them.
I have no problem successfully validating a document which does include the xmlns declaration, but I'd also like to be able to validate those documents without such a declaration. I was hoping for something like this:
DocumentBuilderFactory dbf = ...;
dbf.setSchema(... my schema for namespace "foo:bar"...);
dbf.setValidating(false);
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setDefaultNamespace("foo:bar");
Document doc = db.parse(input);
There is no such method DocumentBuilder.setDefaultNamespace and so the schema validation is not performed when loading documents of this type.
Is there any way to force the namespace for a document if one is not set? Or does that require essentially parsing the XML without regard to schema, checking for an existing namespace, adjusting it, then re-validating the document with the schema?
I'm currently expecting the parser to perform validation during parsing, but I have no problem parsing first and then validating afterward.
UPDATE 2021-01-13
Here is a concrete example of what I'm trying to do, as a JUnit test case.
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.junit.Assert;
import org.junit.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
public class XMLSchemaTest
{
private static final String XMLNS = "http://www.example.com/schema";
private static final String schemaDocument = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" targetNamespace=\"" + XMLNS + "\" xmlns:e=\"" + XMLNS + "\" elementFormDefault=\"qualified\"><xs:element name=\"example\" type=\"e:exampleType\" /><xs:complexType name=\"exampleType\"><xs:sequence><xs:element name=\"test\" type=\"e:testType\" /></xs:sequence></xs:complexType><xs:complexType name=\"testType\" /></xs:schema>";
private static Document parse(String document) throws SAXException, ParserConfigurationException, IOException {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
SchemaFactory sf = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
Source[] sources = new Source[] {
new StreamSource(new StringReader(schemaDocument))
};
Schema schema = sf.newSchema(sources);
dbf.setSchema(schema);
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setErrorHandler(new MyErrorHandler());
return db.parse(new InputSource(new StringReader(document)));
}
#Test
public void testConformingDocumentWithSchema() throws Exception {
String testDocument = "<example xmlns=\"" + XMLNS + "\"><test/></example>";
Document doc = parse(testDocument);
//Assert.assertEquals("Wrong document XML namespace", XMLNS, doc.getNamespaceURI());
Element root = doc.getDocumentElement();
Assert.assertEquals("Wrong root element XML namespace", XMLNS, root.getNamespaceURI());
Assert.assertEquals("Wrong element name", "example", root.getLocalName());
Assert.assertEquals("Wrong element name", "example", root.getTagName());
}
#Test
public void testConformingDocumentWithoutSchema() throws Exception {
String testDocument = "<example><test/></example>";
Document doc = parse(testDocument);
//Assert.assertEquals("Wrong document XML namespace", XMLNS, doc.getNamespaceURI());
Element root = doc.getDocumentElement();
Assert.assertEquals("Wrong root element XML namespace", XMLNS, root.getNamespaceURI());
Assert.assertEquals("Wrong element name", "example", root.getLocalName());
Assert.assertEquals("Wrong element name", "example", root.getTagName());
}
#Test
public void testNononformingDocumentWithSchema() throws Exception {
String testDocument = "<example xmlns=\"" + XMLNS + "\"><random/></example>";
try {
parse(testDocument);
Assert.fail("Document should not have parsed properly");
} catch (Exception e) {
System.out.println(e);
// Expected
}
}
#Test
public void testNononformingDocumentWithoutSchema() throws Exception {
String testDocument = "<example><random/></example>";
try {
parse(testDocument);
Assert.fail("Document should not have parsed properly");
} catch (Exception e) {
System.out.println(e);
// Expected
}
}
public static class MyErrorHandler implements ErrorHandler {
#Override
public void warning(SAXParseException exception) throws SAXException {
System.err.println("WARNING: " + exception);
}
#Override
public void error(SAXParseException exception) throws SAXException {
throw exception;
}
#Override
public void fatalError(SAXParseException exception) throws SAXException {
System.err.println("FATAL: " + exception);
}
}
}
All of the tests pass except for testConformingDocumentWithoutSchema. I think this is kind of expected, as the document declares no namespace.
I'm asking how the test can e changed (but not the document itself!) so that I can validate the document against a schema that was not actually declared by the document.
I pounded on this for a while, and I was able to come up with a hack that works. It may be possible to do this more elegantly (which was my original question), and it also may be possible to do this with less code, but this was what I was able to come up with.
If you look at the JUnit test case in the question, changing the "parse" method to the following (and adding XMLNS as the second argument to all calls to parse) will allow all tests to complete:
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
...
private static Document parse(String document, String namespace) throws SAXException, ParserConfigurationException, IOException {
SchemaFactory sf = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
Source[] sources = new Source[] {
new StreamSource(new StringReader(schemaDocument))
};
Schema schema = sf.newSchema(sources);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setSchema(schema);
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
ErrorHandler errorHandler = new MyErrorHandler();
db.setErrorHandler(errorHandler);
try {
return db.parse(new InputSource(new StringReader(document)));
} catch (SAXParseException spe) {
// Just in case this was a problem with a missing namespace
// System.out.println("Possibly recovering from SPE " + spe);
// New DocumentBuilder without the schema
dbf.setSchema(null);
db = dbf.newDocumentBuilder();
db.setErrorHandler(errorHandler);
Document doc = db.parse(new InputSource(new StringReader(document)));
if(null != doc.getDocumentElement().getNamespaceURI()) {
// Namespace URI was set; this is a fatal error
throw spe;
}
// Override the namespace on the Document + root element
doc.getDocumentElement().setAttribute("xmlns", namespace);
// Serialize the document -> String to start over again
DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
LSSerializer lsSerializer = domImplementation.createLSSerializer();
LSOutput lsOutput = domImplementation.createLSOutput();
lsOutput.setEncoding("UTF-8");
StringWriter out = new StringWriter();
lsOutput.setCharacterStream(out);
lsSerializer.write(doc, lsOutput);
String converted = out.toString();
// Re-enable the schema
dbf.setSchema(schema);
db = dbf.newDocumentBuilder();
db.setErrorHandler(errorHandler);
return db.parse(new InputSource(new StringReader(converted)));
}
}
This works by catching SAXParseException and, because SAXParseException doesn't give up any of its details, assuming that the problem might be due to a missing XML namespace declaration. I then re-parse the document without the schema validation, add a namespace declaration to the in-memory Document, then serialize the Document to String and re-parse the document with the schema validation re-enabled.
I tried to do this just by setting the XML namespace and then using Schema.newValidator().validate(new DOMSource(doc)), but this failed validation every time for me. Running through the serializer got around that problem.
I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as —, > and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text — invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
They key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.
I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some text — invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file
Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as —
XML has only five predefined entities. The —, is not among them. It works only when used in plain HTML or in legacy JSP. So, SAX will not help. It can be done using StaX which has high level iterator based API. (Collected from this link)
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
Streaming API for XML, called StaX, is an API for reading and writing XML Documents.
StaX is a Pull-Parsing model. Application can take the control over parsing the XML documents by pulling (taking) the events from the parser.
The core StaX API falls into two categories and they are listed below. They are
Cursor based API: It is low-level API. cursor-based API allows the application to process XML as a stream of tokens aka events
Iterator based API: The higher-level iterator-based API allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
STaX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set into an XmlInputFactory, which is then in turn used to construct an XmlEventReader or XmlStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to notreplace them.
You may try it. Hope it will solve your issue. For your case,
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text — invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to #skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.
There are 5 methods of XMLStreamWriter for document.
xmlsw.writeStartDocument(); - initialises an empty document to which
elements can be added
xmlsw.writeStartElement(String s) -creates a new element named s
xmlsw.writeAttribute(String name, String value)- adds the attribute
name with the corresponding value to the last element produced by a
call to writeStartElement. It is possible to add attributes as long
as no call to writeElementStart,writeCharacters or writeEndElement
has been done.
xmlsw.writeEndElement - close the last started element
xmlsw.writeCharacters(String s) - creates a new text node with
content s as content of the last started element.
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but"+
sym +" found ");
}
}
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf
Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.
Just to throw in a different approach to a solution:
You might envelope your input stream with a stream inplementation that replaces the entities by something legal.
While this is a hack for sure, it should be a quick and easy solution (or better say: workaround).
Not as elegant and clean as a xml framework internal solution, though.
I made yesterday something similar i need to add value from unziped XML in stream to database.
//import I'm not sure if all are necessary :)
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
//I didnt checked this code now because i'm in work for sure its work maybe
you will need to do little changes
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);
// lib which i use common-lang3.jar
//metod to parse
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}
Try this using org.apache.commons package :
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);
I've an xml file that I would avoid having to load all in memory.
As everyone know, for such a file I better have to use a SAX parser (which will go along the file and call for events if something relevant is found.)
My current problem is that I would like to process the file "by chunk" which means:
Parse the file and find a relevant tag (node)
Load this tag entirely in memory (like we would do it in DOM)
Do the process of this entity (that local chunk)
When I'm done with the chunk, release it and continue to 1. (until "end of file")
In a perfect world I'm searching some something like this:
// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to callback the right method once a node is found
p.setHandler(new Handler(){
// 4. The method callback by the parser when a relevant node is found
void aNodeIsFound(saxNode aNode)
{
// 5. Inflate the current node i.e. load it (and all its content) in memory
DomNode d = aNode.expand();
// 6. Do something with the inflated node (method to be defined somewhere)
doThingWithNode(d);
}
});
// 7. Start the parser
p.start();
I'm currently stuck on how to expand a "sax node" (understand me…) efficiently.
Is there any Java framework or library relevant to this kind of task?
UPDATE
You could also just use the javax.xml.xpath APIs:
package forum7998733;
import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
public class XPathDemo {
public static void main(String[] args) throws Exception {
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
InputSource xml = new InputSource(new FileReader("BigFile.xml"));
Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);
System.out.println(result);
}
}
Below is a sample of how it could be done with StAX.
input.xml
Below is some sample XML:
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
Demo
In this example a StAX XMLStreamReader is used to find the node that will be converted to a DOM. In this example we convert each statement fragment to a DOM, but your navigation algorithm could be more advanced.
package forum7998733;
import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
xsr.nextTag(); // Advance to statements element
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
DOMResult domResult = new DOMResult();
t.transform(new StAXSource(xsr), domResult);
DOMSource domSource = new DOMSource(domResult.getNode());
StreamResult streamResult = new StreamResult(System.out);
t.transform(domSource, streamResult);
}
}
}
Output
<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
...stuff...
</statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
...stuff...
</statement>
It could be done with SAX... But I think the newer StAX (Streaming API for XML) will serve your purpose better. You could create an XMLEventReader and use that to parse your file, detecting which nodes adhere to one of your criteria. For simple path-based selection (not really XPath, but some simple / delimited path) you'd need to maintain a path to your current node by adding entries to a String on new elements or cutting of entries on an end tag. A boolean flag can suffice to maintain whether you're currently in "relevant mode" or not.
As you obtain XMLEvents from your reader, you could copy the relevant ones over to an XMLEventWriter that you've created on some suitable placeholder, like a StringWriter or ByteArrayOutputStream. Once you've completed the copying for some XML extract that forms a "subdocument" of what you wish to build a DOM for, simply supply your placeholder to a DocumentBuilder in a suitable form.
The limitation here is that you're not harnessing all the power of the XPath language. If you wish to take stuff like node position into account, you'd have to foresee that in your own path. Perhaps someone knows of a good way of integrating a true XPath implementation into this.
StAX is really nice in that it gives you control over the parsing, rather than using some callback interface through a handler like SAX.
There's yet another alternative: using XSLT. An XSLT stylesheet is the ideal way to filter out only relevant stuff. You could transform your input once to obtain the required fragments and process those. Or run multiple stylesheets over the same input to get the desired extract each time. An even nicer (and more efficient) solution, however, would be the use of extension functions and/or extension elements.
Extension functions can be implemented in a way that's independent from the XSLT processor being used. They're fairly straightforward to use in Java and I know for a fact that you can use them to pass complete XML extracts to a method, because I've done so already. Might take some experimentation, but it's a powerful mechanism. A DOM extract (or node) is probably one of the accepted parameter types for such a method. That'd leave the document building up to the XSLT processor which is even easier.
Extension elements are also very useful, but I think they need to be used in an implementation-specific manner. If you're okay with tying yourself to a specific JAXP setup like Xerces + Xalan, they might be the answer.
When going for XSLT, you'll have all the advantages of a full XPath 1.0 implementation, plus the peace of mind that comes from knowing XSLT is in really good shape in Java. It limits the building of the input tree to those nodes that are needed at any time and is blazing fast because the processors tend to compile stylesheets into Java bytecode rather than interpreting them. It is possible that using compilation instead of interpretation loses the possibility of using extension elements, though. Not certain about that. Extension functions are still possible.
Whatever way you choose, there's so much out there for XML processing in Java that you'll find plenty of help in implementing this, should you have no luck in finding a ready-made solution. That'd be the most obvious thing, of course... No need to reinvent the wheel when someone did the hard work.
Good luck!
EDIT: because I'm actually not feeling depressed for once, here's a demo using the StAX solution I whipped up. It's certainly not the cleanest code, but it'll give you the basic idea:
package staxdom;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class DOMExtractor {
private final Set<String> paths;
private final XMLInputFactory inputFactory;
private final XMLOutputFactory outputFactory;
private final DocumentBuilderFactory docBuilderFactory;
private final Stack<QName> activeStack = new Stack<QName>();
private boolean active = false;
private String currentPath = "";
public DOMExtractor(final Set<String> paths) {
this.paths = Collections.unmodifiableSet(new HashSet<String>(paths));
inputFactory = XMLInputFactory.newFactory();
outputFactory = XMLOutputFactory.newFactory();
docBuilderFactory = DocumentBuilderFactory.newInstance();
}
public void parse(final InputStream input) throws XMLStreamException, ParserConfigurationException, SAXException, IOException {
final XMLEventReader reader = inputFactory.createXMLEventReader(input);
XMLEventWriter writer = null;
StringWriter buffer = null;
final DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();
XMLEvent currentEvent = reader.nextEvent();
do {
if(active)
writer.add(currentEvent);
if(currentEvent.isEndElement()) {
if(active) {
activeStack.pop();
if(activeStack.isEmpty()) {
writer.flush();
writer.close();
final Document doc;
final StringReader docReader = new StringReader(buffer.toString());
try {
doc = builder.parse(new InputSource(docReader));
} finally {
docReader.close();
}
//TODO: use doc
//Next bit is only for demo...
outputDoc(doc);
active = false;
writer = null;
buffer = null;
}
}
int index;
if((index = currentPath.lastIndexOf('/')) >= 0)
currentPath = currentPath.substring(0, index);
} else if(currentEvent.isStartElement()) {
final StartElement start = (StartElement)currentEvent;
final QName qName = start.getName();
final String local = qName.getLocalPart();
currentPath += "/" + local;
if(!active && paths.contains(currentPath)) {
active = true;
buffer = new StringWriter();
writer = outputFactory.createXMLEventWriter(buffer);
writer.add(currentEvent);
}
if(active)
activeStack.push(qName);
}
currentEvent = reader.nextEvent();
} while(!currentEvent.isEndDocument());
}
private void outputDoc(final Document doc) {
try {
final Transformer t = TransformerFactory.newInstance().newTransformer();
t.transform(new DOMSource(doc), new StreamResult(System.out));
System.out.println("");
System.out.println("");
} catch(TransformerException ex) {
ex.printStackTrace();
}
}
public static void main(String[] args) {
final Set<String> paths = new HashSet<String>();
paths.add("/root/one");
paths.add("/root/three/embedded");
final DOMExtractor me = new DOMExtractor(paths);
InputStream stream = null;
try {
stream = DOMExtractor.class.getResourceAsStream("sample.xml");
me.parse(stream);
} catch(final Exception e) {
e.printStackTrace();
} finally {
if(stream != null)
try {
stream.close();
} catch(IOException ex) {
ex.printStackTrace();
}
}
}
}
And the sample.xml file (should be in the same package):
<?xml version="1.0" encoding="UTF-8"?>
<root>
<one>
<two>this is text</two>
look, I can even handle mixed!
</one>
... not sure what to do with this, though
<two>
<willbeignored/>
</two>
<three>
<embedded>
<and><here><we><go>
Creative Commons Legal Code
Attribution 3.0 Unported
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
DAMAGES RESULTING FROM ITS USE.
License
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
CONDITIONS.
</go></we></here></and>
</embedded>
</three>
</root>
EDIT 2: Just noticed in Blaise Doughan's answer that there's a StAXSource. That'll be even more efficient. Use that if you're going with StAX. Will eliminate the need to keep some buffer. StAX allows you to "peek" at the next event, so you can check if it's a start element with the right path without consuming it before passing it into the transformer .
ok thanks to your pieces of code, I finally end up with my solution:
Usage is quite intuitive:
try
{
/* CREATE THE PARSER */
XMLParser parser = new XMLParser();
/* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
/* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
XMLHandler handler = new XMLHandler()
{
public void nodeFound(StringBuilder node, XMLStackFilter withFilter)
{
// DO SOMETHING WITH THE FOUND XML NODE
System.out.println("Node found");
System.out.println(node.toString());
}
};
/* ATTACH THE FILTER WITH THE HANDLER */
parser.addFilterWithHandler(filter, handler);
/* SET THE FILE TO PARSE */
parser.setFilePath("/path/to/bigfile.xml");
/* RUN THE PARSER */
parser.parse();
}
catch (Exception ex)
{
ex.printStackTrace();
}
Note:
I made a XMLNodeFoundNotifier and a XMLStackFilter interface to easily integrate or build your own handler / filter.
Normally you should be able to parse very large files with this class. Only the returned nodes are actually loaded into memory.
You can enable attributes support in uncommenting the right part in the code, I disabled it for simplicity reasons.
You can use as many filters per handler as you need and conversely
All the of the code is here:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;
/* IMPLEMENT THIS TO YOUR CLASS IN ORDER TO TO BE NOTIFIED WHEN A NODE IS FOUND*/
interface XMLNodeFoundNotifier {
abstract void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}
/* A SMALL HANDER USEFULL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}
/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODES STACK (PATH)*/
interface XMLStackFilter {
abstract boolean isRelevant(Stack fullPath);
}
/* A VERY USEFULL FILTER USING REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {
Pattern relevantExpression;
XMLRegexFilter(String filterRules) {
relevantExpression = Pattern.compile(filterRules);
}
/* HERE WE ARE ARE ASK TO TELL IF THE CURRENT STACK (LIST OF NODES) IS RELEVANT
* OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
#Override
public boolean isRelevant(Stack fullPath) {
/* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
* ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
* FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
String stackPath = XMLParser.StackToString(fullPath);
Matcher m = relevantExpression.matcher(stackPath);
return m.matches();
}
}
/* THE MAIN PARSER'S CLASS */
public class XMLParser {
HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
HashMap<Integer, Integer> feedingStreams;
Stack<HashMap> currentStack;
String filePath;
XMLParser() {
currentStack = new <HashMap>Stack();
filterHandler = new <XMLStackFilter, XMLNodeFoundNotifier> HashMap();
feedingStreams = new <Integer, Integer>HashMap();
}
public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
filterHandler.put(f, h);
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
/* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT PER DEFAULT
* I DID NOT ADDED THE ATTRIBUTES INTO THE PATH. UNCOMENT THE LINKS ABOVE TO
* DO SO
*/
public static String StackToString(Stack<HashMap> s) {
int k = s.size();
if (k == 0) {
return null;
}
StringBuilder out = new StringBuilder();
out.append(s.get(0).get("tag"));
for (int x = 1; x < k; ++x) {
HashMap node = s.get(x);
out.append('/').append(node.get("tag"));
/*
// UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH
ArrayList <String[]>attributes = (ArrayList)node.get("attr");
if (attributes.size()>0)
{
out.append("[");
for (int i = 0 ; i<attributes.size(); i++)
{
String[]keyValuePair = attributes.get(i);
if (i>0) out.append(",");
out.append(keyValuePair[0]);
out.append("=\"");
out.append(keyValuePair[1]);
out.append("\"");
}
out.append("]");
}*/
}
return out.toString();
}
/*
* ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE GET THE DELIMITERS OF THE FILE
* WE THEN RETRIEVE THE DATA FROM IT.
*/
private StringBuilder getChunk(int from, int to) throws Exception {
int length = to - from;
FileReader f = new FileReader(filePath);
BufferedReader br = new BufferedReader(f);
br.skip(from);
char[] readb = new char[length];
br.read(readb, 0, length);
StringBuilder b = new StringBuilder();
b.append(readb);
return b;
}
/* TRANSFORMS AN XSR NODE TO A HASHMAP NODE'S REPRESENTATION */
public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
HashMap h = new HashMap();
ArrayList attributes = new ArrayList();
for (int i = 0; i < xsr.getAttributeCount(); i++) {
String[] s = new String[2];
s[0] = xsr.getAttributeName(i).toString();
s[1] = xsr.getAttributeValue(i);
attributes.add(s);
}
h.put("tag", xsr.getName());
h.put("attr", attributes);
return h;
}
public void parse() throws Exception {
FileReader f = new FileReader(filePath);
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(f);
Location previousLoc = xsr.getLocation();
while (xsr.hasNext()) {
switch (xsr.next()) {
case XMLStreamConstants.START_ELEMENT:
currentStack.add(XSRNode2HashMap(xsr));
for (XMLStackFilter filter : filterHandler.keySet()) {
if (filter.isRelevant(currentStack)) {
feedingStreams.put(currentStack.hashCode(), new Integer(previousLoc.getCharacterOffset()));
}
}
previousLoc = xsr.getLocation();
break;
case XMLStreamConstants.END_ELEMENT:
Integer stream = null;
if ((stream = feedingStreams.get(currentStack.hashCode())) != null) {
// FIND ALL THE FILTERS RELATED TO THIS FeedingStreem AND CALL THEIR HANDLER.
for (XMLStackFilter filter : filterHandler.keySet()) {
if (filter.isRelevant(currentStack)) {
XMLNodeFoundNotifier h = filterHandler.get(filter);
StringBuilder aChunk = getChunk(stream.intValue(), xsr.getLocation().getCharacterOffset());
h.nodeFound(aChunk, filter);
}
}
feedingStreams.remove(currentStack.hashCode());
}
previousLoc = xsr.getLocation();
currentStack.pop();
break;
default:
break;
}
}
}
}
A little while since i did SAX, but what you want to do is process each of the tags until you find the end tag for the group you want to process, then run your process, clear it out and look for the next start tag.
In my XML file I have some entities such as ’
So I have created a DTD tag for my XML document to define these entities. Below is the Java code used to read the XML file.
SAXBuilder builder = new SAXBuilder();
URL url = new URL("http://127.0.0.1:8080/sample/subject.xml");
InputStream stream = url.openStream();
org.jdom.Document document = builder.build(stream);
Element root = document.getRootElement();
Element name = root.getChild("name");
result = name.getText();
System.err.println(result);
How can I change the Java code to retrieve a DTD over HTTP to allow the parsing of my XML document to be error free?
Simplified example of the xml document.
<main>
<name>hello ‘ world ’ foo & bar </name>
</main>
One way to do this would be to read the document and then validate it with the transformer:
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
public class ValidateWithExternalDTD {
private static final String URL = "http://127.0.0.1:8080/sample/subject.xml";
private static final String DTD = "http://127.0.0.1/YourDTD.dtd";
public static void main(String args[]) {
try {
DocumentBuilderFactory factory= DocumentBuilderFactory.newInstance();
factory.setValidating(true);
DocumentBuilder builder = factory.newDocumentBuilder();
// Set the error handler
builder.setErrorHandler(new org.xml.sax.ErrorHandler() {
public void fatalError(SAXParseException spex)
throws SAXException {
// output error and exit
spex.printStackTrace();
System.exit(0);
}
public void error(SAXParseException spex)
throws SAXParseException {
// output error and continue
spex.printStackTrace();
}
public void warning(SAXParseException spex)
throws SAXParseException {
// output warning and continue
spex.printStackTrace();
}
});
// Read the document
URL url = new URL(ValidateWithExternalDTD.URL);
Document xmlDocument = builder.parse(url.openStream());
DOMSource source = new DOMSource(xmlDocument);
// Use the tranformer to validate the document
StreamResult result = new StreamResult(System.out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, ValidateWithExternalDTD.DTD);
transformer.transform(source, result);
// Process your document if everything is OK
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
Another way would be to replace the XML title with the XML title plus the DTD reference
Replace this:
<?xml version = "1.0"?>
with this:
<?xml version = "1.0"?><!DOCTYPE ...>
Of course you would replace the first occurance only and not try to go through the whole xml document
You have to instantiate the SAXBuilder by passing true(validate) to its constructor:
SAXBuilder builder = new SAXBuilder(true);
or call:
builder.setValidation(true)
I am looking for example Java code that can construct an XML document that uses namespaces. I cannot seem to find anything using my normal favourite tool so was hoping someone may be able to help me out.
There are a number of ways of doing this. Just a couple of examples:
Using XOM
import nu.xom.Document;
import nu.xom.Element;
public class XomTest {
public static void main(String[] args) {
XomTest xomTest = new XomTest();
xomTest.testXmlDocumentWithNamespaces();
}
private void testXmlDocumentWithNamespaces() {
Element root = new Element("my:example", "urn:example.namespace");
Document document = new Document(root);
Element element = new Element("element", "http://another.namespace");
root.appendChild(element);
System.out.print(document.toXML());
}
}
Using Java Implementation of W3C DOM
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
public class DomTest {
private static DocumentBuilderFactory dbf = DocumentBuilderFactory
.newInstance();
public static void main(String[] args) throws Exception {
DomTest domTest = new DomTest();
domTest.testXmlDocumentWithNamespaces();
}
public void testXmlDocumentWithNamespaces() throws Exception {
DocumentBuilder db = dbf.newDocumentBuilder();
DOMImplementation domImpl = db.getDOMImplementation();
Document document = buildExampleDocumentWithNamespaces(domImpl);
serialize(domImpl, document);
}
private Document buildExampleDocumentWithNamespaces(
DOMImplementation domImpl) {
Document document = domImpl.createDocument("urn:example.namespace",
"my:example", null);
Element element = document.createElementNS("http://another.namespace",
"element");
document.getDocumentElement().appendChild(element);
return document;
}
private void serialize(DOMImplementation domImpl, Document document) {
DOMImplementationLS ls = (DOMImplementationLS) domImpl;
LSSerializer lss = ls.createLSSerializer();
LSOutput lso = ls.createLSOutput();
lso.setByteStream(System.out);
lss.write(document, lso);
}
}
I am not sure, what you trying to do, but I use jdom for most of my xml-issues and it supports namespaces (of course).
The code:
Document doc = new Document();
Namespace sNS = Namespace.getNamespace("someNS", "someNamespace");
Element element = new Element("SomeElement", sNS);
element.setAttribute("someKey", "someValue", Namespace.getNamespace("someONS", "someOtherNamespace"));
Element element2 = new Element("SomeElement", Namespace.getNamespace("someNS", "someNamespace"));
element2.setAttribute("someKey", "someValue", sNS);
element.addContent(element2);
doc.addContent(element);
produces the following xml:
<?xml version="1.0" encoding="UTF-8"?>
<someNS:SomeElement xmlns:someNS="someNamespace" xmlns:someONS="someOtherNamespace" someONS:someKey="someValue">
<someNS:SomeElement someNS:someKey="someValue" />
</someNS:SomeElement>
Which should contain everything you need. Hope that helps.