Parsing multiple XML files using multithreading with SAXParser - java

I have a timer that checks the file system for new XML files and parses them. The files can get large (5 GB), so I am using a SAX parser. To increase throughput, I wrote a multithreaded program with an ExecutorService.
XML files can belong to different sources. A thread is created for each source, and that thread parses the XML files belonging to the source. A new SAXParserFactory is created in every thread, and a new SAXParser for every XML file.
The problem is that the parsers seem to interfere with each other. When I check the parse results, I notice that some of the XML files haven't been parsed completely. The parser quits halfway and doesn't throw any exception. I don't have the problem when the XML files are parsed in a single thread.
Now I am not sure whether SAXParserFactory and SAXParser really create new instances.
Do you guys have any idea what might cause this?
SAXParser parser = factory.newSAXParser();
AccountSaxHandler saxHandler = new AccountSaxHandler();
parser.parse(new File(localFilePath), saxHandler);
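No answer is recorded for this question, so the following is only a sketch of how two common suspects could be ruled out: handler state shared between threads, and exceptions that are captured in a Future and never inspected (a task submitted with submit() does not report its exception until get() is called). All names here (ConcurrentSaxSketch, CountingHandler, parseAll, writeTemp) are made up for illustration, and a trivial element counter stands in for AccountSaxHandler.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ConcurrentSaxSketch {

    /** Hypothetical handler: all state lives in the instance, nothing is static. */
    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses one file with a factory, parser and handler created just for this call. */
    static int countElements(File xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(xml, handler);
        return handler.elements;
    }

    /** Submits one task per file and collects every result through Future.get(). */
    static List<Integer> parseAll(List<File> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (File file : files) {
                Callable<Integer> task = () -> countElements(file);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                // get() rethrows a failure inside the task as ExecutionException,
                // so a parser that stops halfway can no longer fail silently.
                counts.add(future.get());
            }
            return counts;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Demo helper: writes an XML string to a temporary file. */
    static File writeTemp(String xml) {
        try {
            File file = File.createTempFile("sax-sketch", ".xml");
            file.deleteOnExit();
            Files.write(file.toPath(), xml.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the counts are complete with this structure, the next things to check in the real code would be static or shared fields in the handler and any place where a Future is dropped without calling get().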

Related

Java SAX parser, How do I prevent character references entirely? (DoS attack)

The XML files of incoming requests need to be validated. One requirement is that character references are prevented entirely because of possible DoS attacks. If I configure the SAXParserFactory like below:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
then the parser still resolves 100,000 entity expansions.
The parser has encountered more than "100.000" entity expansions in this document; this is the limit imposed by the application.
The prevention of external references was done via an EntityResolver which works fine. But how do I prevent the character references?
Character references cannot cause a denial of service attack, so there is no reason to prevent them.
An instance of org.apache.xerces.util.SecurityManager can limit the number of entity expansions. Here's an example.
SAXParser saxParser = spf.newSAXParser();
org.apache.xerces.util.SecurityManager mgr = new org.apache.xerces.util.SecurityManager();
mgr.setEntityExpansionLimit(-1);
saxParser.setProperty("http://apache.org/xml/properties/security-manager", mgr);
With this, the parsing process terminates if the XML file contains at least one entity reference. Now there's no more need for an EntityResolver.
The jar file which contains the SecurityManager can be downloaded here.
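An alternative that needs no extra jar, sketched here under the assumption that the parser is Xerces or the JDK's bundled Xerces-derived parser: the feature http://apache.org/xml/features/disallow-doctype-decl rejects any document that contains a DOCTYPE, so no entities can be declared at all, while character references such as &#65; keep working. The class and method names (NoDoctypeSketch, newHardenedParser, accepts) are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NoDoctypeSketch {

    /** Builds a parser that refuses every document carrying a DOCTYPE declaration. */
    static SAXParser newHardenedParser() throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces feature; other parser implementations may not recognise it.
        spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return spf.newSAXParser();
    }

    /** Returns true if the document parses, false if the parser rejects it. */
    static boolean accepts(String xml) {
        SAXParser parser;
        try {
            parser = newHardenedParser();
        } catch (Exception e) {
            throw new IllegalStateException("parser does not support the feature", e);
        }
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // DOCTYPE present or document not well-formed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this configuration a document with only character references is accepted, and one that declares an entity in an internal DTD subset is rejected before any expansion happens.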

validating a schema file in local location with saxparser

I was looking at http://docs.oracle.com/javaee/1.4/tutorial/doc/JAXPSAX9.html.
You can associate the XML file with a schema in two ways: in the app or in the XML document. In the app you call
saxParser.setProperty(JAXP_SCHEMA_SOURCE,
new File(schemaSource));
In the XML you add this:
<documentRoot
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation='YourSchemaDefinition.xsd'
>
The problem is that both locations for the .xsd file are URL strings. The .xsd file I have is a local copy. Is there a way to specify its location, maybe as an input stream?
You can set the schema directly on the SAX Parser factory.
SAXParserFactory factory = SAXParserFactory.newInstance();
SchemaFactory schemafactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema sc = schemafactory.newSchema(new File("path to xsd file"));
factory.setSchema(sc);
SAXParser parser = factory.newSAXParser();
parser.parse(file, handler);
The XSD location in the XML file can also be relative to the XML file, so if your XSD sits next to the XML file locally, your current XML file should work.
I assume you're using Java. If the schema is on the classpath, you can probably use this post to get it: URL to load resources from the classpath in Java
Having the schemaLocation in the instance document can be hard to handle if you receive the XML file from a third party. The schemaLocation may already be defined in the XML and may point to a wrong schema (or to nothing at all). If you want to add it programmatically, you have to modify the data before validating it, which can be risky. For validation, IMO, it is better to trust your local copy.
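Since the question asks about an input stream: SchemaFactory.newSchema also accepts a Source, so a StreamSource wrapping any InputStream should do. The sketch below assumes a self-contained schema with no relative imports; the names SchemaFromStreamSketch, loadSchema and isValid are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SchemaFromStreamSketch {

    /** Compiles a schema from any InputStream (file, classpath resource, byte array). */
    static Schema loadSchema(InputStream xsd) throws SAXException {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return sf.newSchema(new StreamSource(xsd));
    }

    /** Returns false on any SAXException, i.e. an invalid document or an invalid schema. */
    static boolean isValid(String xsd, String xml) {
        try {
            Schema schema = loadSchema(
                    new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The Schema returned by loadSchema can be passed to factory.setSchema(...) exactly like one built from a File. For a schema on the classpath, the stream would typically come from getResourceAsStream with whatever resource path the project uses.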

How to let the SAX parser determine the encoding from the xml declaration?

I'm trying to parse XML files from different sources (over which I have little control). Most of them are encoded in UTF-8 and don't cause any problems using the following snippet:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
InputSource is = new InputSource(getInputStream());
parser.parse(is, handler);
Since SAX defaults to UTF-8 this is fine. However some of the documents declare:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even though ISO-8859-1 is declared SAX still defaults to UTF-8.
Only if I add:
is.setEncoding("ISO-8859-1");
Will SAX use the correct encoding.
How can I let SAX automatically detect the correct encoding from the xml declaration without me specifically setting it? I need this because I don't know before hand what the encoding of the file will be.
Thanks in advance,
Allan
Use InputStream as argument to InputSource when you want Sax to autodetect the encoding.
If you want to set a specific encoding, use Reader with a specified encoding or setEncoding method.
Why? Because autodetection encoding algorithms require raw data, not converted to characters.
The question in the subject is: How to let the SAX parser determine the encoding from the xml declaration? I found Allan's answer to the question misleading and I provided the alternative one, based on Jörn Horstmann's comment and my later experience.
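A small sketch of the byte-stream approach: the document is handed over as raw bytes and no encoding is set on the InputSource, so the parser reads the encoding from the XML declaration itself. The names EncodingDetectSketch and textOf are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EncodingDetectSketch {

    /** Parses raw bytes and returns all character data found in the document. */
    static String textOf(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Byte stream, no setEncoding(): the parser autodetects the encoding.
            parser.parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }
}
```

Wrapping the same bytes in an InputStreamReader first would decode them with the reader's charset before the parser ever sees the declaration, which is why the Reader variant only works when that charset happens to match the file.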
I found the answer myself.
The SAX parser uses InputSource internally and from the InputSource docs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So basically you need to pass a character stream to the parser for it to pick up the correct encoding. See solution below:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
FeedHandler handler = new FeedHandler();
Reader isr = new InputStreamReader(getInputStream());
InputSource is = new InputSource();
is.setCharacterStream(isr);
parser.parse(is, handler);

XML Validation: Am I Doing It Right?

I was just wondering if someone could give my XML validation code a once over to see if I'm doing it right. Here's the portion of code that is giving me the trouble...
SAXParserFactory factory = SAXParserFactory.newInstance();
SchemaFactory schemaFactory = SchemaFactory
.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
// *** CODE FAILS ON THE BELOW LINE **/
factory.setSchema(schemaFactory
.newSchema(new Source[] { new StreamSource(schemaStream) }));
SAXParser parser = factory.newSAXParser();
SAXReader reader = new SAXReader(parser.getXMLReader());
reader.setValidation(false);
reader.setErrorHandler(new ResultProducingErrorHandler());
reader.read(content);
Whenever I run the above code, I get an error along the lines of:
src-resolve: Cannot resolve the name 'ns:myStructure' to a(n) 'type definition' component.
The elements mentioned in the error messages are all ones that are imported into the schema via calls to <xs:import />. The schema seems to validate OK via the W3C XML Schema Validator.
Do I have to include each of these schemas individually, or is Java smart enough to go off and fetch these extra schemas too? I tried adding them to the array passed to the newSchema call, but that didn't make any difference.
I don't think I can give out the link to the schema, so I'm really just looking for a yes or no regarding if my code looks at least acceptable.
Ensure that the xs:import statements point to paths that are reachable from the current directory of your application. The current directory may not be what you think it is.
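One way to take the current directory out of the picture, sketched under the assumption that the schemas are local files: a StreamSource built from a bare stream has no base location, but giving it a system ID lets relative schemaLocation values in xs:import be resolved against the main schema's own location. The names ImportResolutionSketch, loadSchema and demo, and the two tiny schemas, are made up for illustration.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

class ImportResolutionSketch {

    // Hypothetical imported schema that defines the type the main schema refers to.
    static final String TYPES_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" targetNamespace=\"urn:types\">"
            + "<xs:simpleType name=\"myStructure\"><xs:restriction base=\"xs:string\"/></xs:simpleType>"
            + "</xs:schema>";

    // Hypothetical main schema with a relative xs:import.
    static final String MAIN_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" xmlns:ns=\"urn:types\">"
            + "<xs:import namespace=\"urn:types\" schemaLocation=\"types.xsd\"/>"
            + "<xs:element name=\"root\" type=\"ns:myStructure\"/>"
            + "</xs:schema>";

    /** Loads a schema from a stream but tells the factory where that stream came from. */
    static Schema loadSchema(Path mainXsd) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try (InputStream in = Files.newInputStream(mainXsd)) {
            StreamSource source = new StreamSource(in);
            // Base location for resolving relative schemaLocation attributes.
            source.setSystemId(mainXsd.toUri().toString());
            return sf.newSchema(source);
        }
    }

    /** Writes both schemas to a temp directory, loads the main one, validates a document. */
    static void demo() {
        try {
            Path dir = Files.createTempDirectory("xsd-sketch");
            Files.write(dir.resolve("types.xsd"), TYPES_XSD.getBytes(StandardCharsets.UTF_8));
            Path main = dir.resolve("main.xsd");
            Files.write(main, MAIN_XSD.getBytes(StandardCharsets.UTF_8));
            Schema schema = loadSchema(main);
            schema.newValidator().validate(new StreamSource(new StringReader("<root>hello</root>")));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Passing a File or URL to newSchema has the same effect, because the location is then known to the factory. Whether this is the cause of the src-resolve error in the question cannot be confirmed without the schema.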

Validating a HUGE XML file

I'm trying to find a way to validate a large XML file against an XSD. I saw the question ...best way to validate an XML... but the answers all pointed to using the Xerces library for validation. The only problem is that when I use that library to validate a 180 MB file, I get an OutOfMemoryError.
Are there any other tools, libraries, or strategies for validating a larger-than-normal XML file?
EDIT: The SAX solution worked for Java validation, but the other two suggestions for the libxml tool were very helpful as well for validation outside of Java.
Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new SimpleErrorHandler());
reader.parse(new InputSource(new FileInputStream("document.xml")));
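Note that setValidating(true) turns on DTD validation. For validation against an XSD, the usual JAXP route is to leave it off and attach a compiled Schema to the factory instead; the document is still processed as a SAX stream. The following is a sketch of that variant with made-up names (StreamingXsdValidationSketch, schemaOf, validate); the error list stands in for SimpleErrorHandler.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingXsdValidationSketch {

    /** Demo helper: compiles a schema from a string. */
    static Schema schemaOf(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    /** Streams the document through a schema-aware SAX parser and collects the problems. */
    static List<String> validate(Schema schema, InputStream xml) {
        List<String> problems = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validation errors are recoverable, so parsing continues after each one.
                problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema); // XSD validation; setValidating stays false
            factory.newSAXParser().parse(xml, handler);
        } catch (SAXException e) {
            problems.add("fatal: " + e.getMessage()); // e.g. document not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new RuntimeException(e);
        }
        return problems;
    }
}
```

For a large file the stream would be a FileInputStream, ideally buffered; an empty list means the document is valid against the schema.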
Use libxml, which performs validation and has a streaming mode.
Personally I like to use XMLStarlet which has a command line interface, and works on streams. It is a set of tools built on Libxml2.
SAX and libXML will help, as already mentioned. You could also try increasing the maximum heap size for the JVM using the -Xmx option. E.g. to set the maximum heap size to 512MB: java -Xmx512m com.foo.MyClass
