Parsing a stream of continous XML documents

Parsing a stream of continous XML documents - java

I have a socket connection for an external system which accept commands and sends results in XML. Every command and result is a standalone XML document.
Which Java parser (/combination) should i use to:
parse the stream continuously without closing the connection (i know it's stupid, but i tried DOMParser in the past and it throws an exception when an another document root encountered on the stream which is perfectly understandable). I need something like: continously read the stream and when a document is fully received, do it's processing. I don't know how big the document is, so i need to leave to the parser to figure out the end of the document.
deserialize every incoming document into bean instances (similary like XStream does)
serialize command object to the output stream from annotated class instances (similarly like XStream does). I don't want to use two separate libraries for sending and receiving.

Well... XStream.createObjectInputStream seems to be what you need. I'm not sure if the stream provided must enclose all objects into a root node, but anyway you could arrange an inputstreams that add some virtual content to accomodate to XStream needs. I'll expand this answer later...
http://x-stream.github.io/objectstream.html has some samples...
Root node
Indeed the reader needs a root node. So you need an inputstream that reads <object-stream> plus the real byte content, plus a </object-stream> at the end (if you mind about that end). Depending on what you need (inputstream, readers) the implementation can be slighly different but it can be done.
Sample
You can use SequenceInputStream to concatenate virtual content to the original inputstream:
InputStream realOne = ..
// beware of the encoding!
InputStream root = new ByteArrayInputStream("<object-stream>".toBytes("UTF-8"));
InputStream all = new SequenceInputStream(root, realOne);
xstream.createObjectInputStream(withRoot); // voi lá
If you use readers... well. There must be something equivalent :)

Your best bet is probably SAX parser. With it, you can implement ContentHandler document and in there, in endDocument method, do the processing and prepare for the next document. Have a look at this page: http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html - for explanation and examples.

I'd say you read one full complete response, then parse it. Then read the other. I see no need to continuously read responses.

Related

Is it possible to create an XSL Transformer Output Stream?

Is it possible to create a "TransformerOutputStream", which extends the standard java.io.OutputStream, wraps a provided output stream and applies an XSL transformation? I can't find any combination of APIs which allows me to do this.
The key point is that, once created, the TransformerOutputStream may be passed to other APIs which accept a standard java.io.OutputStream.
Minimal usage would be something like:
java.io.InputStream in = getXmlInput();
java.io.OutputStream out = getTargetOutput();
javax.xml.transform.Templates templates = createReusableTemplates(); // could also use S9API
TransformerOutputStream tos = new TransformerOutputStream(out, templates); // extends OutputStream
com.google.common.io.ByteStreams.copy(in, tos);
// possibly flush/close tos if required by implementation
That's a JAXP example, but as I'm currently using Saxon an S9API solution would be fine too.
The main avenue I've persued is along the lines of:
a class which extends java.io.OutputStream and implements org.xml.sax.ContentHandler
an XSL transformer based on an org.xml.sax.ContentHandler
But I can't find implementations of either of these, which seems to suggest that either no one else has ever tried to do this, there is some problem which makes it impractical, or my search skills just are not that good.
I can understand that with some templates an XML transformer may require access to the entire document and so a SAX content handler may provide no advantage, but there must also be simple transformations which could be applied to the stream as it passed through? This kind of interface would leave that decision up to the transformer implementation.
I have a written and am currently using a class which provides this interface, but it just collects the output data in an internal buffer then uses a standard JAXP StreamSource to read that on flush or close, so ends up buffering the entire document.

You could make your TransformerOutputStream extend ByteArrayOutputStream, and its close() method could take the underlying byte[] array, wrap it in a ByteArrayInputStream, and invoke a transformation with the input taken from this InputStream.
But it seems you also want to avoid putting the entire contents of the stream in memory. So let's assume that the transformation you want to apply is an XSLT 3.0 streamable transformation. Unfortunately, although Saxon as a streaming XSLT transformer operates largely in push mode (by "push" I mean that the data supplier invokes the data consumer, whereas "pull" means that the data consumer invokes the data supplier), the first stage, of reading and parsing the input, is always in pull mode -- I don't know of an XML parser to which you can push lexical XML input.
This means there's a push-pull conflict here. There are two solutions to a push-pull conflict. One is to buffer the data in memory (which is the ByteArrayOutputStream approach mentioned earlier). The other is to use two threads, with one writing to a shared buffer and the other reading from it. This can be achieved using a PipedOutputStream in the writing thread (https://docs.oracle.com/javase/8/docs/api/index.html?java/io/PipedOutputStream.html) and a PipedInputStream in the reading thread.
Caveat: I haven't actually tried this, but I see no reason why it shouldn't work.
Note that the topic of streaming in XSLT 3.0 is fairly complex; you will need to learn about it before you can make much progress here. I would start with Abel Braaksma's talk from XML London 2014: https://xmllondon.com/2014/presentations/braaksma

How to limit scope for XPath

I need to parse relatively big XML files on Android.
Some node internal structure contains HTML tags, for some other nodes I need to pull content from different depth levels. Therefore, instead of using XmlPullParser I plan to:
using XPath, find the proper node
using 'getElementsByTagName' find appropriate sub-node(s)
extract information and save it in my custom data objects.
The problem I have is performance. The way how I open file is following:
File file = new File(_path);
FileInputStream is = new FileInputStream(file);
XPath xPath = XPathFactory.newInstance().newXPath();
NamespaceContext context = new NamespaceContextMap("def", __URL__);
xPath.setNamespaceContext(context);
Object objs = xPath.evaluate("/def:ROOT_ELEMENT/*,
new InputSource(is), XPathConstants.NODESET);
Even though I need to get few strings that are in the very beginning of the XML file, it looks like XPath parses WHOLE xml file and put it in DOM structure.
In some cases I need access to full object and it is ok to have operation running few seconds for few megabyte file.
In other cases - I only need to get few nodes and don't want users to wait for my program to perform a redundant parsing.
Q1: What is the way to get some parts of XML file without parsing it in full?
Q2: Is there any way to restrict XPath from scanning/parsing WHOLE XML file? For instance: scan till 2nd level of depth?
Thank you.
P.S. In one particular case, XML file represents FB2 file format and if you have any specific tips that could resolve my problem for fb2-files parsing, please fill free to add additional comments.

I don't know too much about the XML toolset available for android, except to know that it's painfully limited!
Probably the best way to tackle this requirement is to write a streaming SAX filter that looks for the parts of the document you are interested in, and builds a DOM containing only those parts, which you can then query using XPath. I'm a bit reluctant to advise that, because it won't be easy if you haven't done such things before, but it seems the right approach.

Processing received data from socket

I am developing a socket application and my application needs to receive xml file over socket. The size of xml files received vary from 1k to 100k. I am now thinking of storing data that I received into a temporary file first, then pass it to the xml parser. I am not sure if it is a proper way to do it.
Another question is if I wanna do as mentioned above, should I pass file object or file path to xml parser?
Thanks in advance,
Regards

Just send it straight to the parser. That's what browsers do. Adding a temp file costs you time and space with no actual benefit.

Do you think it would work to put a BufferedReader around whatever input stream you have? It wouldn't put it into a temporary file, but it would let you hang onto that data. You can set whatever size BufferedReader you need.
Did you write your XML parser? If you didn't, what will it accept as a parameter? If you did write it, are you asking about efficiency. That is to say which object, the path or file, should your parser ask for to be most efficient?

You do not have to store the data from socket to any file. Just read whole the DataInputStream into a byte array and you can then do whatever you need. E.g. if needed create a String with the xml input to feed the parser. (I am assuming tcp sockets).
If there are preceding data you skip them so as to feed the actual xml data to the parser.

XMLStreamReader and a real stream

Update There is no ready XML parser in Java community which can do NIO and XML parsing. This is the closest I found, and it's incomplete: http://wiki.fasterxml.com/AaltoHome
I have the following code:
InputStream input = ...;
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = xmlInputFactory.createXMLStreamReader(input, "UTF-8");
Question is, why does the method #createXMLStreamReader() expects to have an entire XML document in the input stream? Why is it called a "stream reader", if it can't seem to process a portion of XML data? For example, if I feed:
<root>
<child>
to it, it would tell me I'm missing the closing tags. Even before I begin iterating the stream reader itself. I suspect that I just don't know how to use a XMLStreamReader properly. I should be able to supply it with data by pieces, right? I need it because I'm processing a XML stream coming in from network socket, and don't want to load the whole source text into memory.
Thank you for help,
Yuri.

You can get what you want - a partial parse, but you must not close the stream when you reach the end of the current available data. Keep the stream open, and the parser will simply block when it gets to the end of the stream. When you have more data, then add it to the stream, and the parser will continue.
This arrangement requires two threads - one thread running the parser, and another fetching data. To bridge the two threads, you use a pipe - a PipeInputStream and PipeOutputStream pair that push data from the reader thread into the input stream used by the parser. (The parser is reading data from the PipeInputStream.)

If you absolutely need NIO with content "push", there are developers interested in completing API for Aalto. Parser itself is complete Stax implementation as well as alternative "push input" (feeding input instead of using InputStream). So you might instead want to check out mailing lists if you are interested. Not everyone reads StackOverflow questions. :-)

The stream must contain the content for an entire XML document, just not all in memory at the same time (this is what streams do). You might be able to keep the stream and the reader open to continue feeding in content; however, it would have to be part of a well-formed XML document.
Suggestion: You might want to read a bit more about how sockets and streams work before going much farther.
Hope this helps.

Which Java version are you using? With JDK 1.6.0_19, I get the behaviour you seem to be expecting. Iterating over your example XML fragment gives me three events:
START_ELEMENT (root)
CHARACTERS (whitespace between and )
START_ELEMENT (child)
The fourth invokation of next() throws an XMLStreamException: ParseError at [row,col]:[2,12]
Message: XML document structures must start and end within the same entity.

With the XMLEventReader using stax parser it works for me without any issues.
final XMLEventReader xmlEventReader= XMLInputFactory
.newInstance().createXMLEventReader(new FileInputStream(file));
file is obviously your input.
while(xmlEventReader.hasNext()){
XMLEvent xmlEvent = xmlEventReader.nextEvent();
logger.debug("LOG XML EVENT "+xmlEvent.toString());
if (xmlEvent.isStartElement()){
//continue implementation

Look at this link to understand more about how streaming parsers work and how does it keep you r memory foot print smaller. For incoming XML, you would need to first serialize the incoming XML and create a well formed XML, then giving it to streaming parser.
http://www.devx.com/xml/Article/34037/1954

Castor and sockets

I'm new to Castor and data binding in general. I'm working on an application that, in part, needs to take data off of a socket and unmarshall the data to make POJOs. Now, I've got the socket stuff down, and I've even generated and compiled java files thanks to Ant and Castor.
Here's the problem: the data stream that I'll receive could be one of about 9 different objects. That is, I receive a stream of text (XML) that represents an object with stuff that I'll operate on; again, depending on the object type. If it were just one object, it'd be easy: call the unmarshall commands on it and go on my merry way. But, since it could be one of many kinds of objects, who do I know what to unmarshall? I read up on mapping, but either I didn't get it, or it seems like a static mapping, not a dynamic mapping.
Any help out there?

You are right, Castor expects a static mapping. But you can work with that. You can write some code that will modify the incoming xml so that, on your side, Castor can use one schema, and on your clients' side they don't have to change their schemas.
Change the schema that Castor expects to get to something with a common root-element, with under that your nine different alternatives for your different objects (I think you can restrict it so the schema will allow only one of the nine, if that doesn't work out you could just make all the sub-elements optional).
Then you can write code that modifies the incoming xml to wrap your incoming xml with that common root-element, then feeds the wrapped xml into a stream that gets read by the Castor unmarshaller.
There are at least 3 different ways to implement the xml-wrapping part: SAX, XSLT, and XML libraries (like JDOM, DOM4J, and XOM--I prefer XOM but any of them will work).
The SAX way is probably best if you're already familiar with SAX or if one of the other ways has worked but come up short on performance. If I had to implement that then I would create an XMLFilter that takes in xml and writes xml out, stacking that on top of another piece that writes xml to an OutputStream, and writing a wrapper method around the unmarshalling stuff to feed the incoming stream to the xmlreader, copy the OutputStream to another InputStream (an easy way is to use commons-io), and feed the new InputStream to the Castor unmarshaller.
With XSLT there is no fooling with SAX, although XSLT has a reputation for pain sometimes, it seems to me like this might be a relatively straightforward transformation, but I haven't taken a stab at it either. It is a long time since I used XSLT for anything. I am not sure about performance either, though I wouldn't write it off out of hand.
Using XOM or JDOM or DOM4J to wrap the XML is also possible, and the learning curve is a lot lower than for SAX or XSLT. The downside is the whole XML document tends to get sucked into memory at once so if you deal with big enough documents you could run out of memory.

I have a similar thing in Jibx where all of the incoming message objects implement a base interface which has a field denoting the message type.
The text/xml is serialized into the base interface and I then used the command pattern to call the respective business logic depending upon the message type which is defined in the base interface.
Not sure if this is possible using castor but take a look at Jibx as the performance is fantastic.
http://jibx.sourceforge.net/

I appreciate your insights. You both have given me some good information to go on and new knowledge that I didn't have. In the end, I got the process to work via a hack. I grab the text stream, parse out the root tag of the message, and then switch on it to determine the right object to create. I'm unmarshalling all of my objects independently and everyone is happy on our end.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.