Validating a HUGE XML file

I'm trying to find a way to validate a large XML file against an XSD. I saw the question ...best way to validate an XML... but the answers all pointed to using the Xerces library for validation. The only problem is that when I use that library to validate a 180 MB file, I get an OutOfMemoryError.
Are there any other tools, libraries, or strategies for validating a larger-than-normal XML file?
EDIT: The SAX solution worked for Java validation, but the other two suggestions for the libxml tool were very helpful as well for validation outside of Java.

Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.
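// Compile the XSD once; "schema.xsd" is a placeholder path.
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
SAXParserFactory factory = SAXParserFactory.newInstance();
// setValidating(true) would request DTD validation; XSD validation is enabled via setSchema.
factory.setValidating(false);
factory.setNamespaceAware(true);
factory.setSchema(schemaFactory.newSchema(new File("schema.xsd")));
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
// SimpleErrorHandler is the poster's own ErrorHandler implementation (not shown).
reader.setErrorHandler(new SimpleErrorHandler());
// A byte stream lets the parser detect the encoding; a FileReader would force the platform default.
reader.parse(new InputSource(new FileInputStream("document.xml")));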
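
As an alternative to driving a SAXParser directly, the javax.xml.validation API in the JDK can validate a file read as a stream. The following is a minimal sketch, not a definitive implementation: the class name StreamingXsdValidator and its validate method are made up for illustration, and it assumes the schema and document are ordinary files on disk. Validation errors are collected in a list instead of stopping at the first one.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; the name and method are illustrative only.
class StreamingXsdValidator {

    /** Validates xmlFile against xsdFile and returns the error messages found (empty if valid). */
    static List<String> validate(File xsdFile, File xmlFile) throws SAXException, IOException {
        SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(xsdFile);
        Validator validator = schema.newValidator();

        List<String> problems = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* warnings are ignored in this sketch */ }
            @Override public void error(SAXParseException e) { problems.add(describe(e)); }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });

        // The document is handed over as a StreamSource rather than as a DOM tree.
        validator.validate(new StreamSource(xmlFile));
        return problems;
    }

    private static String describe(SAXParseException e) {
        return "line " + e.getLineNumber() + ": " + e.getMessage();
    }
}
```

Calling StreamingXsdValidator.validate(new File("schema.xsd"), new File("document.xml")) returns an empty list for a valid document; both file names are placeholders. A document that is not well-formed still aborts with a SAXException, because fatal errors are rethrown.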

Use libxml, which performs validation and has a streaming mode.

Personally I like to use XMLStarlet which has a command line interface, and works on streams. It is a set of tools built on Libxml2.

SAX and libxml will help, as already mentioned. You could also try increasing the maximum heap size for the JVM using the -Xmx option. For example, to set the maximum heap size to 512 MB: java -Xmx512m com.foo.MyClass
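
To check that the -Xmx setting actually reached the JVM, the running process can report its own heap ceiling. This is a small sketch; the class name HeapInfo is hypothetical.

```java
// Hypothetical helper; reports the heap ceiling the running JVM actually uses.
class HeapInfo {

    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as java -Xmx512m HeapInfo should print a value close to 512; the exact figure can be slightly lower depending on the garbage collector in use.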

Related

Saxon: Can't open XML with schema in .NET, works fine in Java

I am trying to create a Saxon XPathCompiler. I have the same code in Java & .NET, each calling the appropriate Saxon library. The code is:
protected void ctor(InputStream xmlData, InputStream schemaFile, boolean preserveWhiteSpace) throws SAXException, SchemaException, SaxonApiException {
this.rootNode = makeDataSourceNode(null);
XMLReader reader = XMLReaderFactory.createXMLReader();
InputSource xmlSource = new InputSource(xmlData);
SAXSource saxSource = new SAXSource(reader, xmlSource);
Source schemaSource = new StreamSource(schemaFile);
Configuration config = createEnterpriseConfiguration();
config.addSchemaSource(schemaSource);
// ...
In the case of .NET, the InputStreams are instances of a class that wraps a .NET Stream and exposes it as a Java InputStream. In Java the above code works fine. But in .NET, the last line, config.addSchemaSource(schemaSource), throws:
$exception {"Content is not allowed in prolog."} org.xml.sax.SAXParseException
In both Java & .NET it works fine if there is no schema.
The files it is using are http://www.thielen.com/test/SouthWind.xml & http://www.thielen.com/test/SouthWind.xsd
It does not appear to be any of the issues in this question. And if that were the issue, shouldn't both Java and .NET have the same problem?
I'm thinking maybe it's the wrapper around the .NET Stream to make it a Java InputStream, but we use that class everywhere without any other issues.

The "Content is not allowed in prolog" exception is absolutely infuriating. If only it told you which bytes it is complaining about! One diagnostic technique is to display the initial bytes delivered by the InputStream by making a few calls to
System.err.println(schemaFile.read());
My first guess as to the cause would be something to do with byte order marks, but rather than speculate, I would focus on diagnostics to see what the parser is seeing in that InputStream that it doesn't like.
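
The diagnostic described above can be sketched as a small helper that dumps the leading bytes in hex and then pushes them back, so the stream can still be handed to the parser afterwards. The class StreamSniffer and its methods are hypothetical names, not part of Saxon or the JDK, and the BOM table covers only the common cases.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical diagnostic helper, not part of Saxon or the JDK.
class StreamSniffer {

    /** Reads up to n leading bytes, formats them as hex, and pushes them back for the real consumer. */
    static String peekHex(PushbackInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int r = in.read(buf, total, n - total);
            if (r < 0) break; // stream ended early
            total += r;
        }
        in.unread(buf, 0, total);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < total; i++) {
            if (i > 0) hex.append(' ');
            hex.append(String.format("%02X", buf[i] & 0xFF));
        }
        return hex.toString();
    }

    /** Names the byte order mark at the start of the hex dump, or "none". */
    static String bomName(String hex) {
        if (hex.startsWith("EF BB BF")) return "UTF-8 BOM";
        if (hex.startsWith("FE FF")) return "UTF-16BE BOM";
        if (hex.startsWith("FF FE")) return "UTF-16LE or UTF-32LE BOM";
        return "none";
    }
}
```

The stream has to be wrapped with a pushback buffer at least as large as the number of bytes peeked, for example new PushbackInputStream(schemaFile, 8) before calling peekHex(in, 4). A well-formed schema without a BOM would normally start with 3C 3F 78 6D ("<?xm") or 3C followed by the root element name.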

Optimization on VTD-XML parse?

I have to run a performance test on the VTD-XML library, covering not just simple parsing but also an additional transformation during parsing.
I have a 30 MB input XML file that I transform with custom logic into another XML document.
So I want to remove everything on my side that slows down the whole process (due to poor use of the VTD library).
I searched for optimization tips but could not find any.
I noticed the following:
0. Which is better to use for selection, selectXPath or selectElement?
1. Parsing without namespace awareness is much faster.
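File file = new File(fileName);
VTDGen vtdGen = new VTDGen();
// Load the file contents; an empty array of the right size would give the parser nothing to parse.
vtdGen.setDoc_BR(Files.readAllBytes(file.toPath()));
vtdGen.parse(false); // false = namespace awareness off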
2. Read from a byte array or pass the file to VTDGen?
final VTDGen vg = new VTDGen();
vg.parseFile("books.xml", false);
or
// open a file and read the content into a byte array
File f = new File("books.xml");
FileInputStream fis = new FileInputStream(f);
byte[] b = new byte[(int) f.length()];
fis.read(b);
VTDGen vg = new VTDGen();
vg.setDoc(b);
vg.parse(true);
The second approach is 0.01 times faster (which could be due to anything).
What is the difference? With parseFile the file is limited to 2 GB with namespace awareness on and 1 GB without, but what about the byte-array approach?
3. Reuse buffers
You can ask VTDGen to reuse VTD buffers for the next parsing task.
Otherwise, by default, VTDGen will allocate new buffer for each
parsing run.
Can you give an example of that?
4. Adjust LC level to 5
By default, it is 3. But you can set it to 5. When your XML are deeply
nested, setting LC level to 5 results in better XPath performance. But
it increases memory usage and parsing time very slightly.
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);
But I get a runtime exception. It only works with 3.
5. Indexing
Use VTD+XML indexing- Instead of parsing XML files at the time of
processing request, you can pre-index your XML into VTD+XML format and
dump them on disk. When the processing request commences, simply load
VTD+xml in memory and voila, parsing is no longer needed!!
VTDGen vg = new VTDGen();
if (vg.parseFile(inputName, true)) {
    vg.writeIndex(new FileOutputStream(outputName));
}
Does anyone know how to use it? What happens if the file changes, and how do I trigger re-indexing? And if there is a 10 KB change in a 3 GB file, does parsing take time for the whole file or just for the changed lines?
6. Overwrite feature
The overwrite feature aka. data templating- Because VTD-XML retains
XML in memory as is, you can actually create a template XML file
(pre-indexed in vtd+xml) whose value fields are left blank and let
your app fill in the blank, thus creating XML data that never need to
be parsed.

I think you should look at the examples bundled with the vtd-xml release and build up expertise gradually. Fortunately, vtd-xml is, in my view, one of the easiest XML APIs by a large margin, so the learning curve won't be as steep as with SAX or StAX.
My answers to your numbered list above:
0. selectXPath is for XPath evaluation. selectElement is similar to getElementsByTagName().
1. Turning on namespace awareness has little or no effect on parsing performance. Can you reference the source of your 100x slowdown claim?
2. You can read from bytes or read from files directly. Here is a link to a blog post:
https://ximpleware.wordpress.com/2016/06/02/parsefile-vs-parse-a-quick-comparison/
3. Buffer reuse is a somewhat advanced feature; let's get to that at a later time.
4. If you get the latest version (2.13), you will not get a runtime exception with that method call.
To parse an XML document larger than 2 GB, you need to switch to the extended edition of vtd-xml, which is a separate API bundled with standard vtd-xml.
There are examples bundled with the vtd-xml distribution that you might want to look at first. Here is an article on this subject:
http://www.codeproject.com/Articles/24663/Index-XML-Documents-with-VTD-XML
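
One note on the byte-array approach from the question: a single InputStream.read(byte[]) call is not guaranteed to fill the whole array, so the buffer handed to the parser may be only partly loaded. A minimal sketch of loading the complete file using only the JDK follows; the class name XmlBytes is hypothetical, and the VTD-XML calls appear only as comments so that the sketch compiles without the library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; only the JDK is used so the sketch compiles without VTD-XML.
class XmlBytes {

    /** Reads the entire file into memory; fails for files too large for a single array (about 2 GB). */
    static byte[] load(String fileName) throws IOException {
        Path path = Paths.get(fileName);
        return Files.readAllBytes(path);
    }
}
// With VTD-XML on the classpath, the result would be passed on as in the question:
//   vg.setDoc(XmlBytes.load("books.xml"));
//   vg.parse(true);
```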

Java SAX parser, How do I prevent character references entirely? (DoS attack)

The XML files in incoming requests need to be validated. One requirement is that character references are prevented entirely because of possible DoS attacks. If I configure the SAXParserFactory as below:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
then the parser still resolves 100.000 entity expansions.
The parser has encountered more than "100.000" entity expansions in this document; this is the limit imposed by the application.
The prevention of external references was done via an EntityResolver which works fine. But how do I prevent the character references?

Character references cannot cause a denial of service attack, so there is no reason to prevent them.

An instance of org.apache.xerces.util.SecurityManager can limit the number of entity expansions. Here's an example.
SAXParser saxParser = spf.newSAXParser();
org.apache.xerces.util.SecurityManager mgr = new org.apache.xerces.util.SecurityManager();
mgr.setEntityExpansionLimit(-1);
saxParser.setProperty("http://apache.org/xml/properties/security-manager", mgr);
With this, the parsing process terminates if the XML file contains at least one entity reference. Now there's no more need for an EntityResolver.
The jar file which contains the SecurityManager can be downloaded here.
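
A related option that needs no extra jar is to reject DOCTYPE declarations altogether. Custom entities can only be declared in a DTD, so a document without a DOCTYPE can contain only character references and the five predefined entities. The sketch below assumes the parser behind SAXParserFactory recognizes the Xerces feature http://apache.org/xml/features/disallow-doctype-decl (Xerces does, and so does the parser bundled with the JDK, as far as I know); the class name NoDoctypeParser is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration.
class NoDoctypeParser {

    /** Returns true if the document parses; false if the parser rejects it (e.g. it contains a DOCTYPE). */
    static boolean accepts(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-style feature: any DOCTYPE declaration becomes a fatal error.
        spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        SAXParser parser = spf.newSAXParser();
        try {
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

With this configuration, a document such as <a>&#65;&amp;</a> is still accepted, while anything that starts with <!DOCTYPE ...> is rejected before any entity can be declared or expanded.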

Howto let the SAX parser determine the encoding from the xml declaration?

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8 this is fine. However some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared, SAX still defaults to UTF-8.
Only if I add:
is.setEncoding("ISO-8859-1");
will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the XML declaration without setting it explicitly? I need this because I don't know beforehand what the encoding of the file will be.
Thanks in advance,
Allan

Use an InputStream as the argument to InputSource when you want SAX to autodetect the encoding.
If you want to set a specific encoding, use a Reader with a specified encoding or the setEncoding method.
Why? Because encoding autodetection algorithms require the raw bytes, not data already converted to characters.
The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to the question misleading and I provided the alternative one, based on Jörn Horstmann's comment and my later experience.
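
A small, self-contained check of the byte-stream behaviour described above: the document is encoded as ISO-8859-1 and declares that encoding, and the parser receives only the raw bytes, with no setEncoding call and no Reader. The class name EncodingDetectionDemo is hypothetical; this is a sketch against the JDK's default SAX parser.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class.
class EncodingDetectionDemo {

    /** Parses raw bytes via a byte-stream InputSource and returns all character data. */
    static String textOf(byte[] xmlBytes) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // No setEncoding and no Reader: the parser reads the XML declaration itself.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        parser.parse(source, handler);
        return text.toString();
    }
}
```

Feeding it the ISO-8859-1 bytes of a document whose declaration says encoding="ISO-8859-1" should return the accented characters intact. Wrapping the same bytes in an InputStreamReader that decodes them as UTF-8 would garble those characters before the parser ever sees the declaration.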

I found the answer myself.
The SAX parser uses InputSource internally and from the InputSource docs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick-up the correct encoding. See solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);

How to create XML file?

I have some data which my program discovers after observing a few things about files.
For instance, I know the file name, the time the file was last changed, whether the file is binary or ASCII text, the file content (assuming it is properties), and some other details.
I would like to store this data in XML format.
How would you go about doing it?
Please provide example.

If you want something quick and relatively painless, use XStream, which lets you serialise Java Objects to and from XML. The tutorial contains some quick examples.

Use StAX; it's so much easier than SAX or DOM to write an XML file (DOM is probably the easiest to read an XML file but requires you to have the whole thing in memory), and is built into Java SE 6.
A good demo is found here on p.2:
OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter writer = factory.createXMLStreamWriter(out);
writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeAttribute("id", "g1");
writer.writeCharacters("Hello StAX");
writer.writeEndDocument();
writer.flush();
writer.close();
out.close();

The W3C DOM API is the standard option.
final Document docToSave = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
final Element fileInfo = docToSave.createElement("fileInfo");
docToSave.appendChild(fileInfo);
final Element fileName = docToSave.createElement("fileName");
// setNodeValue has no effect on an Element; setTextContent sets its text.
fileName.setTextContent("filename.bin");
fileInfo.appendChild(fileName);
return docToSave;
XML is almost never the easiest thing to do.
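
For completeness, a DOM document still has to be serialized before it becomes an XML file. One way to do that with the JDK alone is an identity Transformer. The sketch below builds a document from the kind of data the question mentions and serializes it to a string; the class name FileInfoXml and the element names are made up for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class; element names are made up for the example.
class FileInfoXml {

    /** Builds a small fileInfo document and returns it serialized as a string. */
    static String build(String name, long lastModified, boolean binary) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element fileInfo = doc.createElement("fileInfo");
        doc.appendChild(fileInfo);
        appendChild(doc, fileInfo, "fileName", name);
        appendChild(doc, fileInfo, "lastModified", Long.toString(lastModified));
        appendChild(doc, fileInfo, "binary", Boolean.toString(binary));

        // A Transformer created without a stylesheet copies the source to the result unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // Swap in new StreamResult(new File("data.xml")) to write to disk instead.
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static void appendChild(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.setTextContent(text);
        parent.appendChild(child);
    }
}
```

A call such as FileInfoXml.build("filename.bin", file.lastModified(), true) yields a fileInfo element with one child per property, where file is whatever java.io.File the program has been inspecting.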

You can use SAX or DOM to do that; see this link: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044810.html
I think that is what you want.
