My use case is:
XML File as input
Needs to be transformed with XSLT (using the Java 8 built-in XSLT processor)
Process the result with an XMLStreamReader (using the Java 8 built-in StAX implementation)
I want to do this in "streaming mode" (so not writing the output of the XSL transformation to a file and then parsing it with the XMLStreamReader).
Is this possible? If so, how? I can only find SAX-based examples.
Probably not possible. Most XSLT engines write the result "tree" in push mode, so you can skip creating a physical tree by accepting events as they occur; but if you want to get the result in pull mode, you're going to need to run the transformation in one thread and read the results in another, using something like a BlockingQueue to communicate the events from one thread to the other.

Saxon has internal mechanisms to handle push-pull conflicts using multiple threads in this way, but only at item level, not at event level, and that isn't much use to you because the whole result tree is one item. So the basic answer is: the only way I can see to do what you want is to write a multi-threaded SAX-to-StAX converter, which is a tricky bit of Java coding.
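To make the shape of that converter concrete, here is a minimal sketch of the producer side only: a SAX ContentHandler that pushes simplified event records onto a BlockingQueue from the transformation thread. The SaxEvent record type is my own simplification; a real converter would have to cover every SAX callback and wrap the consuming side in a conforming XMLStreamReader implementation, which is where the tricky coding lives.

import java.util.concurrent.BlockingQueue;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Simplified event record; a real converter must also carry namespaces,
// attributes, processing instructions, errors, and so on.
final class SaxEvent {
    enum Kind { START_ELEMENT, END_ELEMENT, CHARACTERS, END_DOCUMENT }
    final Kind kind;
    final String name;  // element name, null for character events
    final String text;  // character data, null otherwise
    SaxEvent(Kind kind, String name, String text) {
        this.kind = kind; this.name = name; this.text = text;
    }
}

// Runs on the transformation thread: each SAX callback becomes a queued event.
final class QueueingHandler extends DefaultHandler {
    private final BlockingQueue<SaxEvent> queue;
    QueueingHandler(BlockingQueue<SaxEvent> queue) { this.queue = queue; }

    private void put(SaxEvent e) {
        try { queue.put(e); }  // blocks when the StAX-side consumer lags behind
        catch (InterruptedException ex) { throw new RuntimeException(ex); }
    }
    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        put(new SaxEvent(SaxEvent.Kind.START_ELEMENT, qName, null));
    }
    @Override public void endElement(String uri, String local, String qName) {
        put(new SaxEvent(SaxEvent.Kind.END_ELEMENT, qName, null));
    }
    @Override public void characters(char[] ch, int start, int len) {
        put(new SaxEvent(SaxEvent.Kind.CHARACTERS, null, new String(ch, start, len)));
    }
    @Override public void endDocument() {
        put(new SaxEvent(SaxEvent.Kind.END_DOCUMENT, null, null));
    }
}

The transformation thread would then run transformer.transform(new StreamSource(input), new SAXResult(new QueueingHandler(queue))) while the StAX-facing thread drains the queue.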
When you run a transform, the XSLT processor needs to build its result within the transform call.
It is easy to see that the engine can write the result to a stream, build up a DOM (or any other in-memory structure), or call a SAX handler.
But how should the engine directly accept and invoke an XMLStreamReader, which has a different programming model than a callback-oriented SAX handler?
You could, however, create a DOM tree as the result of the transformation. Given the DOM, it is possible to create a DOMXmlStreamReader which iterates over the DOM and emits StAX tokens, but I don't know if such an implementation exists.
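Something close to that exists in the StAX API itself, though with a caveat: XMLInputFactory.createXMLStreamReader(Source) is an optional feature of the spec, so DOMSource support depends on the implementation. A minimal sketch (file names are illustrative):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;

void transformThenPull() throws Exception {
    // Run the whole transformation first; the result tree is held in memory.
    DOMResult result = new DOMResult();
    TransformerFactory.newInstance()
            .newTransformer(new StreamSource("style.xsl"))
            .transform(new StreamSource("input.xml"), result);

    // Optional StAX feature: not every implementation accepts a DOMSource.
    XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new DOMSource(result.getNode()));
    while (reader.hasNext()) {
        reader.next();  // pull StAX events out of the buffered DOM
    }
    reader.close();
}

Note this buffers the whole result tree, so it answers the API question but not the streaming one.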
Related
Is it possible to create a "TransformerOutputStream", which extends the standard java.io.OutputStream, wraps a provided output stream and applies an XSL transformation? I can't find any combination of APIs which allows me to do this.
The key point is that, once created, the TransformerOutputStream may be passed to other APIs which accept a standard java.io.OutputStream.
Minimal usage would be something like:
java.io.InputStream in = getXmlInput();
java.io.OutputStream out = getTargetOutput();
javax.xml.transform.Templates templates = createReusableTemplates(); // could also use S9API
TransformerOutputStream tos = new TransformerOutputStream(out, templates); // extends OutputStream
com.google.common.io.ByteStreams.copy(in, tos);
// possibly flush/close tos if required by implementation
That's a JAXP example, but as I'm currently using Saxon, an S9API solution would be fine too.
The main avenue I've pursued is along the lines of:
a class which extends java.io.OutputStream and implements org.xml.sax.ContentHandler
an XSL transformer based on an org.xml.sax.ContentHandler
But I can't find implementations of either of these, which seems to suggest that either no one else has ever tried to do this, or there is some problem which makes it impractical, or my search skills just are not that good.
I can understand that with some templates an XML transformer may require access to the entire document, so a SAX content handler may provide no advantage; but there must also be simple transformations which could be applied to the stream as it passes through? This kind of interface would leave that decision up to the transformer implementation.
I have written and am currently using a class which provides this interface, but it just collects the output data in an internal buffer, then uses a standard JAXP StreamSource to read that on flush or close, so it ends up buffering the entire document.
You could make your TransformerOutputStream extend ByteArrayOutputStream, and its close() method could take the underlying byte[] array, wrap it in a ByteArrayInputStream, and invoke a transformation with the input taken from this InputStream.
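A minimal sketch of that buffering variant, reusing the Templates object from the question; the whole document sits in memory until close():

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Collects all bytes, then runs the transformation when the stream is closed.
class BufferingTransformerOutputStream extends ByteArrayOutputStream {
    private final OutputStream target;
    private final Templates templates;

    BufferingTransformerOutputStream(OutputStream target, Templates templates) {
        this.target = target;
        this.templates = templates;
    }

    @Override public void close() throws IOException {
        try {
            templates.newTransformer().transform(
                    new StreamSource(new ByteArrayInputStream(toByteArray())),
                    new StreamResult(target));
        } catch (TransformerException e) {
            throw new IOException(e);
        }
    }
}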
But it seems you also want to avoid putting the entire contents of the stream in memory. So let's assume that the transformation you want to apply is an XSLT 3.0 streamable transformation. Unfortunately, although Saxon as a streaming XSLT transformer operates largely in push mode (by "push" I mean that the data supplier invokes the data consumer, whereas "pull" means that the data consumer invokes the data supplier), the first stage, of reading and parsing the input, is always in pull mode -- I don't know of an XML parser to which you can push lexical XML input.
This means there's a push-pull conflict here. There are two solutions to a push-pull conflict. One is to buffer the data in memory (which is the ByteArrayOutputStream approach mentioned earlier). The other is to use two threads, with one writing to a shared buffer and the other reading from it. This can be achieved using a PipedOutputStream in the writing thread (https://docs.oracle.com/javase/8/docs/api/index.html?java/io/PipedOutputStream.html) and a PipedInputStream in the reading thread.
Caveat: I haven't actually tried this, but I see no reason why it shouldn't work.
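In code, the two-thread idea might look like the sketch below (same caveat: untested, and real code must propagate worker-thread failures back to the writer):

import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import javax.xml.transform.Templates;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Returns an OutputStream the caller writes lexical XML into; a worker thread
// transforms it and writes the result to `target` as data arrives.
static OutputStream transformingStream(Templates templates, OutputStream target)
        throws IOException {
    PipedOutputStream callerSide = new PipedOutputStream();
    PipedInputStream transformerSide = new PipedInputStream(callerSide, 64 * 1024);
    Thread worker = new Thread(() -> {
        try {
            templates.newTransformer().transform(
                    new StreamSource(transformerSide), new StreamResult(target));
        } catch (Exception e) {
            e.printStackTrace();  // placeholder: surface this to the writing thread
        }
    });
    worker.start();
    return callerSide;  // closing it lets the transformation run to completion
}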
Note that the topic of streaming in XSLT 3.0 is fairly complex; you will need to learn about it before you can make much progress here. I would start with Abel Braaksma's talk from XML London 2014: https://xmllondon.com/2014/presentations/braaksma
I'm trying to process the Wikipedia dump found here, specifically the file enwiki-latest-pages-articles-multistream.xml.bz2. This is about 46GB uncompressed. I am currently using the StAX parser in Java (Xerces) and am able to extract 15K page elements per second. However, the bottleneck seems to be the parser, and I have toyed around with aalto-xml but it hasn't helped.
Since I'm parsing each page element in the Storm spout, it is a bottleneck. However, I thought I could simply emit the text between ... tags and have several bolts process each of those page elements in parallel. This would reduce the amount of work the Storm spout has to perform. However, I am not sure of the specific approach to take here. If I use a parser to extract the content between tags, it would still parse every single element from the beginning of the tag until the end. Is there a way to eliminate this overhead in a standard SAX/StAX parser?
I tried something similar in an attempt to parallelise.
Anyway, since I use the Wikipedia data for many tasks, it was just simpler to generate a one-article-per-line dump, from which I can then run many experiments in parallel.
It takes only a few minutes to run, then I have a dump which I can feed to Spark (in your case Storm) very easily.
If you want to use our tools check:
https://github.com/idio/wiki2vec
There is no way to do random access on an XML document, but many Java XML parsers have somewhat more efficient skipping of unused content: Aalto and Woodstox, for example, defer decoding of String values (and construction of String objects), so that if tokens are skipped, no allocations are needed.
One thing to make sure of with StAX is NOT to use the Event API unless there is a specific need to buffer contents -- it does not offer much functionality over the basic Streaming API (XMLStreamReader), but it does add significant allocation overhead, since every XMLEvent is constructed whether it is needed or not. The Streaming API, on the other hand, only indicates the type of event/token and lets the caller decide whether the content (attributes, textual value) is needed, so it can avoid most object allocations.
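For illustration, a minimal cursor-API loop over a Wikipedia-style dump (element and file names are illustrative); with deferred-decoding parsers like Aalto or Woodstox, everything the loop skips stays undecoded:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

void scanTitles() throws Exception {
    XMLStreamReader sr = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream("dump.xml"));
    while (sr.hasNext()) {
        // next() reports only the token type; no Strings are built for
        // content we never ask about.
        if (sr.next() == XMLStreamConstants.START_ELEMENT
                && "title".equals(sr.getLocalName())) {
            String title = sr.getElementText();  // the one allocation we request
            System.out.println(title);
        }
    }
    sr.close();
}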
We have a new requirement:
Some BIG XML files keep coming into our system, and we need to process them immediately and quickly using Java. Each file is huge, but the required information for our processing is inside an element which is very small.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing? If we try to load the entire file, we get an out-of-memory error immediately due to its size. What is an efficient way in Java to get the ..data..data..data.. element without loading the whole file or reading it line by line? Is there any SAX parser that I can use to get this done?
Thank you
SAX parsers are event-based and much faster for this job because they do what you need: they don't build the whole XML document in memory, and you can stop parsing as soon as you have what you need. There is a SAXParser available in the Java distribution.
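A common trick, sketched below with an illustrative element name, is to abort the SAX parse with an exception as soon as the small element has been captured, so the rest of the huge file is never read:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SmallElementHandler extends DefaultHandler {
    final StringBuilder data = new StringBuilder();
    private boolean inside;

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("data".equals(qName)) inside = true;   // element name is illustrative
    }
    @Override public void characters(char[] ch, int start, int len) {
        if (inside) data.append(ch, start, len);
    }
    @Override public void endElement(String uri, String local, String qName) throws SAXException {
        if ("data".equals(qName)) throw new SAXException("done");  // abort early
    }
}

// Usage: SAXParserFactory.newInstance().newSAXParser().parse(input, handler);
// catch the deliberate "done" SAXException and read handler.data.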
I had to parse huge files in a previous project (1G-2G) and didn't want to deal with SAX. I find SAX too low-level in some instances and prefer keeping a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read only part of a file, you will still need to read the file sequentially up to the part of interest in order to identify it, and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation; see the javadocs for javax.xml.parsers.SAXParser and javax.xml.xpath.XPath.
StAX is another option based on streaming the data, like SAX, but it benefits from a more friendly approach (IMO) to processing the data by "pulling" what you want rather than having it "pushed" to you.
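For instance, the JDK XPath API can be pointed at an InputSource directly, which is about as simple as it gets; be aware, though, that the default implementation builds the document in memory first, so for truly huge files SAX or StAX remains the safer choice (expression and file name below are illustrative):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

void extractWithXPath() throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    // evaluate(String, InputSource) parses the document internally.
    String data = xpath.evaluate("//data", new InputSource("big.xml"));
    System.out.println(data);
}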
(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are potentially very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, i.e. I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure whether it's possible to parse XML in the way I'm describing: by walking over as much of the document as is required to gather the second step's information, and then rewinding the document and passing over it again to split it into jobs, ideally releasing the parts of the document that are no longer in use after they have been passed.
StAX is the right way. I would recommend looking at Woodstox.
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event based parser like SAX, but that you have more control over when to stop reading, which elements to pull, ...
The usability of this solution will depend a lot on what your extension classes are actually doing, whether you have control over their implementation, etc.
The main point is that if the document is very large, you probably want to use an event-based parser and not a tree-based one, so you will not use a lot of memory.
Implementations of StAX can be found from Sun (SJSXP), Codehaus, or a few other providers.
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the part the extension class needs is very far into the file, this might become extremely memory-intensive, though.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
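A sketch of the basic mark()/reset() idea, assuming the decryption is already wrapped as an InputStream filter (as the question suggests) and using a hypothetical extension interface:

import java.io.BufferedInputStream;
import java.io.InputStream;

interface Extension {  // hypothetical stand-in for the third-party API
    void readPortion(InputStream in) throws Exception;
}

void runPhases(InputStream decrypted, Extension extension) throws Exception {
    BufferedInputStream in = new BufferedInputStream(decrypted, 64 * 1024);
    in.mark(Integer.MAX_VALUE);  // read limit: how much may be replayed after reset()
    extension.readPortion(in);   // phase two reads an unpredictable amount
    in.reset();                  // rewind for the splitting phase
    // phase three can now re-read `in` from the beginning
}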
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
// Parse straight off the decrypting stream; SAX never buffers the whole document.
SAXParserFactory.newInstance().newSAXParser().parse(
    new DecryptingInputStream(),
    new MyHandler()
);
You might be interested in XOM:
XOM is fairly unique in that it is a dual streaming/tree-based API. Individual nodes in the tree can be processed while the document is still being built. This enables XOM programs to operate almost as fast as the underlying parser can supply data. You don't need to wait for the document to be completely parsed before you can start working with it.

XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size.
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large XML document, building only specific nodes, processing them, and discarding them. It is very similar to a SAX approach, but has a lot more parsing capability built in, so a streaming parse can be achieved pretty easily.
If you want to work at a higher level, look at NUX. This provides a high-level streaming XPath API that reads only the amount of data into memory needed to evaluate the XPath.
I'm currently trying to read in an XML file, make some minor changes (alter the value of some attributes), and write it back out again.
I had intended to use a StAX parser (javax.xml.stream.XMLStreamReader) to read in each event, see if it was one I wanted to change, and then pass it straight on to the StAX writer (javax.xml.stream.XMLStreamWriter) if no changes were required.
Unfortunately, that doesn't look to be so simple: the writer has no way to take an event type and a parser object, only methods like writeAttribute and writeStartElement. Obviously I could write a big switch statement with a case for every possible type of event which can occur in an XML document, and just write it back out again, but it seems like a lot of trouble for something which should be simple.
Is there something I'm missing that makes it easy to write out a very similar XML document to the one you read in with StAX?
After a bit of mucking around, the answer seems to be to use the Event reader/writer versions rather than the Stream versions.
(i.e. javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter)
See also http://www.devx.com/tips/Tip/37795, which is what finally got me moving.
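For anyone landing here later, a minimal sketch of that event-based copy loop, rewriting one attribute on the way through (element, attribute, and file names are illustrative):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

void rewriteAttributes() throws Exception {
    XMLEventReader reader = XMLInputFactory.newInstance()
            .createXMLEventReader(new FileInputStream("in.xml"));
    XMLEventWriter writer = XMLOutputFactory.newInstance()
            .createXMLEventWriter(new FileOutputStream("out.xml"));
    XMLEventFactory ef = XMLEventFactory.newInstance();

    while (reader.hasNext()) {
        XMLEvent event = reader.nextEvent();
        if (event.isStartElement()) {
            StartElement se = event.asStartElement();
            List<Attribute> attrs = new ArrayList<>();
            Iterator<?> it = se.getAttributes();
            while (it.hasNext()) {
                Attribute a = (Attribute) it.next();
                // Rewrite the attribute we care about; keep the rest as-is.
                attrs.add("version".equals(a.getName().getLocalPart())
                        ? ef.createAttribute(a.getName(), "2.0") : a);
            }
            event = ef.createStartElement(se.getName(), attrs.iterator(), se.getNamespaces());
        }
        writer.add(event);  // untouched events pass straight through
    }
    writer.close();
    reader.close();
}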
StAX works pretty well and is very fast. I used it in a project to parse XML files which are up to 20MB. I don't have a thorough analysis, but it was definitely faster than SAX.
As for your question: the difference between streaming and event handling is, AFAIK, control. With the streaming API you can walk through the document step by step and pull the contents you want, whereas with the event-based API you can only handle what you are interested in as it is pushed to you.
I know this is a rather old question, but if anyone else is looking for something like this, there is another alternative: the Woodstox Stax2 extension API has the method:
XMLStreamWriter2.copyEventFromReader(XMLStreamReader2 r, boolean preserveEventData)
which copies the currently pointed-to event from the stream reader using the stream writer. This is not only simple but also very efficient. I have used it for similar modifications with success.
(How to get XMLStreamWriter2 and friends? All Woodstox-provided instances implement these extended versions -- plus there are wrappers in case someone wants to use "basic" Stax variants as well.)
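A minimal usage sketch, assuming Woodstox is on the classpath so the default JAXP factories return Stax2-capable instances that can simply be cast:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.XMLStreamWriter2;

void copyWithStax2() throws Exception {
    XMLStreamReader2 sr = (XMLStreamReader2) XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream("in.xml"));
    XMLStreamWriter2 sw = (XMLStreamWriter2) XMLOutputFactory.newInstance()
            .createXMLStreamWriter(new FileOutputStream("out.xml"));

    while (sr.hasNext()) {
        sr.next();
        // Modify events of interest here; everything else is copied verbatim.
        // (Document boundary events may need special-casing depending on the
        // Woodstox version.)
        sw.copyEventFromReader(sr, false);
    }
    sw.close();
    sr.close();
}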