XMLStreamReader and a real stream - java

Update: As far as I can tell, there is no ready-made XML parser in the Java ecosystem that combines NIO and XML parsing. The closest I found is Aalto, and it's incomplete: http://wiki.fasterxml.com/AaltoHome
I have the following code:
InputStream input = ...;
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = xmlInputFactory.createXMLStreamReader(input, "UTF-8");
The question is: why does the method #createXMLStreamReader() expect to have an entire XML document in the input stream? Why is it called a "stream reader" if it can't process a portion of XML data? For example, if I feed:
<root>
<child>
to it, it tells me I'm missing the closing tags, even before I begin iterating the stream reader itself. I suspect that I just don't know how to use an XMLStreamReader properly. I should be able to supply it with data in pieces, right? I need this because I'm processing an XML stream coming in from a network socket, and I don't want to load the whole source text into memory.
Thank you for your help,
Yuri.

You can get what you want - a partial parse - but you must not close the stream when you reach the end of the currently available data. Keep the stream open, and the parser will simply block when it gets to the end of the stream. When you have more data, add it to the stream, and the parser will continue.
This arrangement requires two threads: one thread running the parser, and another fetching data. To bridge the two threads, you use a pipe - a PipedInputStream and PipedOutputStream pair that pushes data from the fetching thread into the input stream used by the parser. (The parser reads its data from the PipedInputStream.)
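A minimal sketch of that arrangement, assuming networkIn is your socket's InputStream (the names here are illustrative, not from the original post):
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

void parseFromNetwork(InputStream networkIn) throws IOException, XMLStreamException {
    PipedOutputStream pipeOut = new PipedOutputStream();
    PipedInputStream pipeIn = new PipedInputStream(pipeOut);

    // Thread 1: feed network data into the pipe as it arrives.
    new Thread(() -> {
        try {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = networkIn.read(chunk)) != -1) {
                pipeOut.write(chunk, 0, n);
            }
            pipeOut.close(); // end of data: the parser sees EOF
        } catch (IOException e) {
            // handle/log
        }
    }).start();

    // Thread 2 (this thread): the parser blocks on pipeIn until data arrives.
    XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(pipeIn, "UTF-8");
    while (reader.hasNext()) {
        reader.next(); // handle events here as data trickles in
    }
}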

If you absolutely need NIO with content "push", there are developers interested in completing the API for Aalto. The parser itself is a complete Stax implementation, and it also offers an alternative "push input" mode (feeding input to the parser instead of having it read from an InputStream). So you might want to check out the mailing lists if you are interested. Not everyone reads StackOverflow questions. :-)

The stream must contain the content for an entire XML document, just not all in memory at the same time (this is what streams do). You might be able to keep the stream and the reader open to continue feeding in content; however, it would have to be part of a well-formed XML document.
Suggestion: You might want to read a bit more about how sockets and streams work before going much farther.
Hope this helps.

Which Java version are you using? With JDK 1.6.0_19, I get the behaviour you seem to be expecting. Iterating over your example XML fragment gives me three events:
START_ELEMENT (root)
CHARACTERS (the whitespace between <root> and <child>)
START_ELEMENT (child)
The fourth invocation of next() throws an XMLStreamException: ParseError at [row,col]:[2,12]
Message: XML document structures must start and end within the same entity.
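For reference, a sketch of the iteration that produces those events, using the streamReader from the question:
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;

try {
    while (streamReader.hasNext()) {
        int event = streamReader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
            System.out.println("START_ELEMENT (" + streamReader.getLocalName() + ")");
        } else if (event == XMLStreamConstants.CHARACTERS) {
            System.out.println("CHARACTERS");
        }
    }
} catch (XMLStreamException e) {
    System.out.println(e.getMessage()); // thrown on the fourth next()
}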

With an XMLEventReader from the StAX API, it works for me without any issues.
final XMLEventReader xmlEventReader = XMLInputFactory
        .newInstance().createXMLEventReader(new FileInputStream(file));
file is obviously your input.
while (xmlEventReader.hasNext()) {
    XMLEvent xmlEvent = xmlEventReader.nextEvent();
    logger.debug("LOG XML EVENT " + xmlEvent.toString());
    if (xmlEvent.isStartElement()) {
        // continue implementation
    }
}

Look at this link to understand more about how streaming parsers work and how they keep your memory footprint smaller. For incoming XML, you would first need to assemble the incoming data into a well-formed XML document before handing it to the streaming parser.
http://www.devx.com/xml/Article/34037/1954

Related

Run Java XML parser with number of Erlang processes

I have a project in a concurrent and distributed programming course.
In this course we use Erlang.
I need to use a database from an XML file that already has a parser written in Java (this is the link for the XML and the parser: https://dblp.org/faq/1474681.html).
The XML file is 2.5GB, so I understand that the first step is to use a number of processes, which I will create in Erlang, so that each process parses a chunk of the XML.
The thing is, this is the first time I'm doing something like this (combining Erlang and Java, and parsing a really big XML file), so I'm not sure how to approach the problem - do I divide the XML into chunks before I start to parse it? Do I somehow set a start and an end for each process that parses the XML?
Just to clarify - the course is about Erlang and using processes in Erlang, so I must use it (I'm aware that there are Java multi-threading solutions).
I will really appreciate any ideas or help!
Thanks!
You can do this in Erlang without using Java. You do not need to read the file completely before processing. You should use an XML parser which supports a streaming API. I recommend fast_xml, which is very fast (it uses C functions to parse the XML).
After initializing the stream parser state, in a loop (a recursive function) you read the file chunk by chunk (for example, 1024 bytes per chunk) and give each chunk to the parser. If the parser finds new XML elements, it will send them to your callback process in the form of Erlang messages. In your callback process you can spawn more processes to work on each XML element.

Is it possible to create an XSL Transformer Output Stream?

Is it possible to create a "TransformerOutputStream", which extends the standard java.io.OutputStream, wraps a provided output stream and applies an XSL transformation? I can't find any combination of APIs which allows me to do this.
The key point is that, once created, the TransformerOutputStream may be passed to other APIs which accept a standard java.io.OutputStream.
Minimal usage would be something like:
java.io.InputStream in = getXmlInput();
java.io.OutputStream out = getTargetOutput();
javax.xml.transform.Templates templates = createReusableTemplates(); // could also use S9API
TransformerOutputStream tos = new TransformerOutputStream(out, templates); // extends OutputStream
com.google.common.io.ByteStreams.copy(in, tos);
// possibly flush/close tos if required by implementation
That's a JAXP example, but as I'm currently using Saxon an S9API solution would be fine too.
The main avenue I've pursued is along the lines of:
a class which extends java.io.OutputStream and implements org.xml.sax.ContentHandler
an XSL transformer based on an org.xml.sax.ContentHandler
But I can't find implementations of either of these, which seems to suggest that either no one else has ever tried to do this, there is some problem which makes it impractical, or my search skills just are not that good.
I can understand that with some templates an XML transformer may require access to the entire document, and so a SAX content handler may provide no advantage, but there must also be simple transformations which could be applied to the stream as it passes through? This kind of interface would leave that decision up to the transformer implementation.
I have written and am currently using a class which provides this interface, but it just collects the output data in an internal buffer, then uses a standard JAXP StreamSource to read that buffer on flush or close, so it ends up buffering the entire document.
You could make your TransformerOutputStream extend ByteArrayOutputStream, and its close() method could take the underlying byte[] array, wrap it in a ByteArrayInputStream, and invoke a transformation with the input taken from this InputStream.
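A minimal sketch of that buffering variant (the class and field names are made up, not an existing API):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class BufferingTransformerOutputStream extends ByteArrayOutputStream {
    private final OutputStream target;
    private final Templates templates;

    public BufferingTransformerOutputStream(OutputStream target, Templates templates) {
        this.target = target;
        this.templates = templates;
    }

    @Override
    public void close() throws IOException {
        // The whole document is buffered by now; transform it in one go.
        try {
            templates.newTransformer().transform(
                    new StreamSource(new ByteArrayInputStream(toByteArray())),
                    new StreamResult(target));
        } catch (TransformerException e) {
            throw new IOException(e);
        }
    }
}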
But it seems you also want to avoid putting the entire contents of the stream in memory. So let's assume that the transformation you want to apply is an XSLT 3.0 streamable transformation. Unfortunately, although Saxon as a streaming XSLT transformer operates largely in push mode (by "push" I mean that the data supplier invokes the data consumer, whereas "pull" means that the data consumer invokes the data supplier), the first stage, of reading and parsing the input, is always in pull mode -- I don't know of an XML parser to which you can push lexical XML input.
This means there's a push-pull conflict here. There are two solutions to a push-pull conflict. One is to buffer the data in memory (which is the ByteArrayOutputStream approach mentioned earlier). The other is to use two threads, with one writing to a shared buffer and the other reading from it. This can be achieved using a PipedOutputStream in the writing thread (https://docs.oracle.com/javase/8/docs/api/index.html?java/io/PipedOutputStream.html) and a PipedInputStream in the reading thread.
Caveat: I haven't actually tried this, but I see no reason why it shouldn't work.
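Here is a sketch of what that two-thread arrangement might look like, under the same caveat (the class is hypothetical, not an existing API):
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformerOutputStream extends OutputStream {
    private final PipedOutputStream pipeOut = new PipedOutputStream();
    private final Thread worker;
    private volatile TransformerException failure;

    public TransformerOutputStream(OutputStream target, Templates templates) throws IOException {
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 64 * 1024);
        worker = new Thread(() -> {
            try {
                // Runs the (ideally streamable) transformation in a second thread.
                templates.newTransformer()
                        .transform(new StreamSource(pipeIn), new StreamResult(target));
            } catch (TransformerException e) {
                failure = e; // surfaced in close()
            }
        });
        worker.start();
    }

    @Override
    public void write(int b) throws IOException {
        pipeOut.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        pipeOut.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        pipeOut.close(); // signals end of input to the parser
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (failure != null) {
            throw new IOException("Transformation failed", failure);
        }
    }
}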
Note that the topic of streaming in XSLT 3.0 is fairly complex; you will need to learn about it before you can make much progress here. I would start with Abel Braaksma's talk from XML London 2014: https://xmllondon.com/2014/presentations/braaksma

Editing a large xml file 'on the fly'

I've got an XML file stored in a database blob which a user will download via a Spring/Hibernate web application. After it's retrieved via Hibernate as a byte[], but before it's sent to the output stream, I need to edit some parts of the XML (a single node with two child nodes and an attribute).
My concern is that if the files are large (some are 40MB+), I don't really want to do this by holding the whole file in memory, editing it, and then passing it to the user via the output stream. Is there a way to edit it 'on the fly'?
byte[] b = blobRepository.get(blobID).getFile();
// What can I do here?
ServletOutputStream out = response.getOutputStream();
out.write(b);
You can use a SAX stream.
Parse the file using the SAX framework, and as your Handler receives the SAX events, pass the unchanged items back out to a SAX Handler that constructs XML output.
When you get to the "part to be changed", then your intermediary class would read in the unwanted events, and write out the wanted events.
This has the advantage of not holding the entire file in memory as an intermediate representation (say DOM); however, if the transformation is complex, you might have to cache a number of items (sections of the document) in order to have them available for rearranged output. A complex enough transformation (one that could do anything) eventually turns into the overhead of DOM, but if you know you're ignoring a large portion of your document, you can save a lot of memory.
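A minimal sketch of that idea using an XMLFilter; the element name, attribute name, and new value below are invented for illustration:
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Passes all SAX events through unchanged, except that it rewrites one
// attribute on the element to be changed.
public class AttributeRewriteFilter extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("child".equals(qName)) { // hypothetical "part to be changed"
            AttributesImpl edited = new AttributesImpl(atts);
            int i = edited.getIndex("status"); // hypothetical attribute
            if (i >= 0) {
                edited.setValue(i, "processed");
            }
            super.startElement(uri, localName, qName, edited);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }
}

// Usage: an identity transform streams the filtered events to the response.
// AttributeRewriteFilter filter = new AttributeRewriteFilter();
// filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
// TransformerFactory.newInstance().newTransformer().transform(
//         new SAXSource(filter, new InputSource(blobInputStream)),
//         new StreamResult(response.getOutputStream()));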
You can try the following:
Enable binary data streaming in Hibernate (set hibernate.jdbc.use_streams_for_binary to true)
Receive the XML file as a binary stream with ent.getBlob().getBinaryStream()
Process the input stream with an XSLT processor that supports streaming (e.g. Saxon), redirecting the output directly to the servlet OutputStream: javax.xml.transform.Transformer.transform(SAXSource, new StreamResult(response.getOutputStream())) - see the sketch below
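A rough sketch of that pipeline, assuming Saxon is on the classpath (the stylesheet name and variable names are hypothetical; true XSLT 3.0 streaming additionally requires Saxon-EE and a streamable stylesheet):
import java.io.File;
import java.sql.Blob;
import javax.servlet.http.HttpServletResponse;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

void streamTransformed(Blob blob, HttpServletResponse response) throws Exception {
    Transformer transformer = TransformerFactory
            .newInstance("net.sf.saxon.TransformerFactoryImpl", null) // Saxon
            .newTransformer(new StreamSource(new File("edit.xsl")));  // hypothetical stylesheet
    transformer.transform(new StreamSource(blob.getBinaryStream()),
            new StreamResult(response.getOutputStream()));
}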

Parsing a stream of continuous XML documents

I have a socket connection to an external system which accepts commands and sends results in XML. Every command and result is a standalone XML document.
Which Java parser (or combination) should I use to:
parse the stream continuously without closing the connection (I know it's stupid, but I tried DOMParser in the past and it throws an exception when another document root is encountered on the stream, which is perfectly understandable). I need something like: continuously read the stream, and when a document has been fully received, process it. I don't know how big the documents are, so I need to leave it to the parser to figure out where each document ends.
deserialize every incoming document into bean instances (similar to what XStream does)
serialize command objects to the output stream from annotated class instances (similar to what XStream does). I don't want to use two separate libraries for sending and receiving.
Well... XStream.createObjectInputStream seems to be what you need. I'm not sure whether the stream provided must enclose all objects in a root node, but in any case you could arrange an input stream that adds some virtual content to accommodate XStream's needs. I'll expand this answer later...
http://x-stream.github.io/objectstream.html has some samples...
Root node
Indeed, the reader needs a root node. So you need an input stream that reads <object-stream>, plus the real byte content, plus </object-stream> at the end (if you care about reaching the end). Depending on what you need (input streams, readers) the implementation can be slightly different, but it can be done.
Sample
You can use SequenceInputStream to concatenate virtual content with the original input stream:
InputStream realOne = ..;
// beware of the encoding!
InputStream root = new ByteArrayInputStream("<object-stream>".getBytes("UTF-8"));
InputStream all = new SequenceInputStream(root, realOne);
xstream.createObjectInputStream(new InputStreamReader(all, "UTF-8")); // voilà
If you use readers... well. There must be something equivalent :)
Your best bet is probably a SAX parser. With it, you can implement a ContentHandler and, in its endDocument method, do the processing and prepare for the next document. Have a look at this page for an explanation and examples: http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
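A minimal sketch of such a handler; the processing hook is hypothetical:
import org.xml.sax.helpers.DefaultHandler;

// Collects character data and hands the finished document off in endDocument,
// then resets its state so the next document can be parsed.
public class CommandResultHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        process(text.toString()); // hypothetical processing hook
        text.setLength(0);        // prepare for the next document
    }

    private void process(String content) {
        // build your bean / result object here
    }
}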
I'd say you read one full, complete response, then parse it. Then read the next. I see no need to parse the stream continuously.

Processing received data from socket

I am developing a socket application, and my application needs to receive XML files over the socket. The sizes of the XML files received vary from 1k to 100k. I am now thinking of storing the data that I receive in a temporary file first, then passing it to the XML parser. I am not sure if that is a proper way to do it.
Another question: if I do want to do as mentioned above, should I pass a File object or a file path to the XML parser?
Thanks in advance,
Regards
Just send it straight to the parser. That's what browsers do. Adding a temp file costs you time and space with no actual benefit.
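For example, with SAX you can hand the socket stream straight to the parser (a sketch; handler is a DefaultHandler you provide):
import java.net.Socket;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

void parseFromSocket(Socket socket, DefaultHandler handler) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(socket.getInputStream(), handler); // no temp file needed
}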
Do you think it would work to put a BufferedReader (over an InputStreamReader) around whatever input stream you have? It wouldn't put the data into a temporary file, but it would let you hang onto it. You can give the BufferedReader whatever buffer size you need.
Did you write your XML parser? If you didn't, what will it accept as a parameter? If you did write it, are you asking about efficiency? That is to say, which object - the path or the file - should your parser ask for to be most efficient?
You do not have to store the data from the socket in any file. Just read the whole DataInputStream into a byte array, and you can then do whatever you need; e.g., if needed, create a String with the XML input to feed the parser. (I am assuming TCP sockets.)
If there is preceding data, skip it, so that you feed only the actual XML data to the parser.
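A minimal sketch of that approach, assuming one XML document per connection (names are illustrative):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

// Buffers the socket bytes in memory (fine for 1k-100k payloads); wrap the
// result in a ByteArrayInputStream to feed the XML parser.
static byte[] readAll(Socket socket) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    InputStream in = socket.getInputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
    }
    return buf.toByteArray();
}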
