Editing a large xml file 'on the fly' - java

I've got an XML file stored in a database blob which a user will download via a Spring/Hibernate web application. After it's retrieved via Hibernate as a byte[], but before it's sent to the output stream, I need to edit some parts of the XML (a single node with two child nodes and an attribute).
My concern is that if the files are large (some are 40 MB+) I don't really want to do this by holding the whole file in memory, editing it, and then passing it to the user via the output stream. Is there a way to edit it 'on the fly'?
byte[] b = blobRepository.get(blobID).getFile();
// What can I do here?
ServletOutputStream out = response.getOutputStream();
out.write(b);

You can use a SAX stream.
Parse the file using the SAX framework, and as your Handler receives the SAX events, pass the unchanged items back out to a SAX Handler that constructs XML output.
When you get to the part to be changed, your intermediary class would consume the unwanted events and write out the wanted events in their place.
This has the advantage of not holding the entire file in memory as an intermediate representation (say DOM); however, if the transformation is complex, you might have to cache a number of items (sections of the document) in order to have them available for rearranged output. A complex enough transformation (one that could do anything) eventually turns into the overhead of DOM, but if you know you're ignoring a large portion of your document, you can save a lot of memory.
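One common way to wire this up is an org.xml.sax.helpers.XMLFilterImpl feeding an identity Transformer, so events stream from the parser through the filter straight to the response. A minimal sketch, where the element name "targetNode", the attribute "status" and its new value are assumptions for illustration:
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Passes SAX events straight through, rewriting one attribute on one element. */
public class EditingFilter extends XMLFilterImpl {

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("targetNode".equals(qName)) {                  // assumed element name
            AttributesImpl edited = new AttributesImpl(atts);
            int i = edited.getIndex("status");             // assumed attribute name
            if (i >= 0) {
                edited.setValue(i, "updated");             // assumed replacement value
            }
            super.startElement(uri, localName, qName, edited);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    /** Streams the XML from 'in' to 'out', applying the filter on the way through. */
    public static void copyEdited(InputStream in, OutputStream out) throws Exception {
        EditingFilter filter = new EditingFilter();
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new SAXSource(filter, new InputSource(in)),
                           new StreamResult(out));
    }
}
To avoid the byte[] entirely you would feed this the blob's binary stream rather than the fully materialized array, which is what the next answer describes.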

You can try the following:
Enable binary data streaming in Hibernate (set hibernate.jdbc.use_streams_for_binary to true)
Receive the XML file as a binary stream with ent.getBlob().getBinaryStream()
Process the input stream with an XSLT processor that supports streaming (e.g. Saxon), redirecting the output directly to the servlet OutputStream: javax.xml.transform.Transformer.transform(SAXSource, new StreamResult(response.getOutputStream()))
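Roughly, that pipeline could look like the sketch below. The stylesheet stream and the content type are assumptions, and note that true streaming needs a streaming-capable processor (e.g. Saxon-EE with a streamable stylesheet); the default JAXP transformer may still buffer the document internally.
import java.io.InputStream;
import javax.servlet.http.HttpServletResponse;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

public class BlobXmlStreamer {

    /** Transforms the blob's binary stream and writes the result straight to the client. */
    public static void stream(InputStream blobStream, InputStream stylesheet,
                              HttpServletResponse response) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));   // stylesheet location is an assumption
        response.setContentType("text/xml");
        transformer.transform(new SAXSource(new InputSource(blobStream)),
                              new StreamResult(response.getOutputStream()));
    }
}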

Related

Parsing a stream of continuous XML documents

I have a socket connection to an external system which accepts commands and sends results in XML. Every command and result is a standalone XML document.
Which Java parser (or combination) should I use to:
parse the stream continuously without closing the connection (I know it's stupid, but I tried DOMParser in the past and it throws an exception when another document root is encountered on the stream, which is perfectly understandable). I need something like: continuously read the stream and, when a document is fully received, process it. I don't know how big the document is, so I need to leave it to the parser to figure out where the document ends.
deserialize every incoming document into bean instances (similarly to what XStream does)
serialize command objects to the output stream from annotated class instances (similarly to what XStream does). I don't want to use two separate libraries for sending and receiving.
Well... XStream.createObjectInputStream seems to be what you need. I'm not sure if the stream provided must enclose all objects in a root node, but you could arrange an input stream that adds some virtual content to accommodate XStream's needs. I'll expand this answer later...
http://x-stream.github.io/objectstream.html has some samples...
Root node
Indeed the reader needs a root node. So you need an input stream that reads <object-stream>, plus the real byte content, plus a </object-stream> at the end (if you care about the end). Depending on what you need (input streams, readers) the implementation can be slightly different, but it can be done.
Sample
You can use SequenceInputStream to concatenate virtual content to the original inputstream:
InputStream realOne = ...; // the real XML content
// beware of the encoding!
InputStream root = new ByteArrayInputStream("<object-stream>".getBytes("UTF-8"));
InputStream all = new SequenceInputStream(root, realOne);
xstream.createObjectInputStream(all); // voilà
If you use readers... well. There must be something equivalent :)
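For completeness, a rough sketch of draining such a stream; Command is an assumed bean class, handle() a hypothetical processing method, and EOFException marks the end of input:
import java.io.EOFException;
import java.io.ObjectInputStream;

// 'all' is the SequenceInputStream built above
ObjectInputStream in = xstream.createObjectInputStream(all);
try {
    while (true) {
        Command command = (Command) in.readObject();  // Command is an assumed bean class
        handle(command);                              // hypothetical processing method
    }
} catch (EOFException endOfStream) {
    // no more documents on the stream
} finally {
    in.close();
}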
Your best bet is probably a SAX parser. With it, you can implement a ContentHandler and, in its endDocument method, do the processing and prepare for the next document. Have a look at this page: http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html - for explanation and examples.
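A bare-bones handler along those lines might look like this; the element name "status" and the process() hook are assumptions:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the fields of one result document and hands it off when the document ends. */
public class ResultHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private String status;                              // assumed field of the result

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("status".equals(qName)) {                   // assumed element name
            status = text.toString().trim();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        process(status);                                // hypothetical processing hook
        status = null;                                  // reset for the next document
    }

    private void process(String status) {
        // handle one complete result here
    }
}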
I'd say you read one complete response, then parse it, then read the next. I see no need to continuously read responses.

Read binary file as byte[] and send from servlet as char[]

I have a servlet which reads a BINARY file and sends it to a client.
byte[] binaryData = FileUtils.readFileToByteArray(path);
response.getWriter().print(new String(binaryData));
It works for NON-BINARY files. When I have a BINARY file, the received file is larger than the original, or its content doesn't match. How can I read and send binary data?
Thanks.
Not via the Writer. Writers are for text data, not binary data. Your current code is trying to interpret arbitrary binary data as text, using the system default encoding. That's a really bad idea.
You want an output stream - so use response.getOutputStream(), and write the binary data to that:
response.getOutputStream().write(FileUtils.readFileToByteArray(path));
Do not use Writer, it will add encoding of your characters and there will not always be a 1:1 mapping (as you have experienced). Instead use the OutputStream directly.
And avoid reading the full content if you don't need it available at once. Serving many parallel requests will quickly consume memory. FileUtils has methods for this.
FileUtils.copyFile(path, response.getOutputStream());
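If it helps, a slightly fuller sketch that sets the response headers before streaming; the content type here is an assumption about the file being served:
import java.io.File;
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.io.FileUtils;

void serveFile(File path, HttpServletResponse response) throws IOException {
    response.setContentType("application/octet-stream");  // assumed type for arbitrary binary data
    response.setContentLength((int) path.length());
    FileUtils.copyFile(path, response.getOutputStream()); // streams the file, no byte[] copy in your code
}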

Java: Question about data representation

I need to parse 70 MB of data with Java and I currently have an XML document (one level, no children), where each document has multiple fields.
I was wondering if I should replace it with a simpler text file in which each row is a doc and the fields are comma-separated.
Is this going to significantly improve performance? And what if I had, for instance, 4 GB of data instead?
thanks
It would probably be more efficient to use the text file than the XML file if you ever get to a point where you can't fit the whole data set into memory at once. At that point, being able to parse the text file line by line would be better than the XML approach (which I believe loads the whole file into memory).
According to Robin Green, XML is only parsed as a whole file at once if you use DOM - SAX parsing streams it.
There are other ways to persist data like this:
Database
Can this data be represented in a database? Java has easy support for most database systems, and you just have to install the right libraries to do so.
Java Properties
An alternative is the Java Properties system. This lets you put all your data in a file and load it back later; Java parses the file when loading it.
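For the Properties route, a small sketch; the file name and keys are assumptions, and note this really only suits modest amounts of flat key/value data:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

Properties props = new Properties();

// store the data
props.setProperty("doc.1.title", "First document");   // assumed keys and values
props.setProperty("doc.1.author", "Someone");
try (FileOutputStream out = new FileOutputStream("data.properties")) {
    props.store(out, "sample data");
}

// load it back; Java parses the file for you
Properties loaded = new Properties();
try (FileInputStream in = new FileInputStream("data.properties")) {
    loaded.load(in);
}
String title = loaded.getProperty("doc.1.title");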

Processing received data from socket

I am developing a socket application and my application needs to receive XML files over a socket. The size of the XML files received varies from 1k to 100k. I am now thinking of storing the data I receive in a temporary file first, then passing it to the XML parser. I am not sure if that is the proper way to do it.
Another question: if I do as mentioned above, should I pass a File object or a file path to the XML parser?
Thanks in advance,
Regards
Just send it straight to the parser. That's what browsers do. Adding a temp file costs you time and space with no actual benefit.
Do you think it would work to put a BufferedReader around whatever input stream you have? It wouldn't put it into a temporary file, but it would let you hang onto that data. You can set whatever size BufferedReader you need.
Did you write your XML parser? If you didn't, what will it accept as a parameter? If you did write it, are you asking about efficiency? That is to say, which object - the path or the file - should your parser ask for to be most efficient?
You do not have to store the data from the socket in any file. Just read the whole DataInputStream into a byte array and you can then do whatever you need. E.g. if needed, create a String with the XML input to feed the parser. (I am assuming TCP sockets.)
If there is preceding data, skip it so that only the actual XML data is fed to the parser.
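As a sketch of feeding the socket stream straight to the parser; the host, port and MyDocumentHandler (a DefaultHandler subclass) are assumptions:
import java.io.InputStream;
import java.net.Socket;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Parse the XML directly off the TCP stream; no temporary file needed.
try (Socket socket = new Socket("example.com", 9000)) {    // assumed host and port
    InputStream in = socket.getInputStream();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(in, new MyDocumentHandler());              // assumed DefaultHandler subclass
}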

Processing XML file with Huge data

I am working on an application which has the requirements below -
Download a ZIP file from a server.
Uncompress the ZIP file, get the content (which is in XML format) from this file into a String.
Pass this content into another method for parsing and further processing.
Now, my concern here is that the XML file may be huge, say 100 MB, while my JVM has only 512 MB of memory, so how can I get this content in chunks, pass it on for parsing, and then insert the data into PL/SQL tables?
Since there can be multiple requests running at the same time, and considering the 512 MB of memory, what would be the best possible way to process this?
How can I get the data in chunks and pass it as a stream for XML parsing?
Java's XMLReader is a SAX2 parser. Where a DOM parser reads the whole of the XML file in and creates a (often large) data structure (usually a tree) to represent its contents, a SAX parser lets you register a handler that will be called when pieces of the XML document are recognized. In that call-back code, you can save only enough data to do what you need -- e.g. you might save all the fields that will end up as a single row in the database, insert that row and then discard the data. With this type of design, your program's memory consumption depends less on the file size than on the complexity and size of a single logical data item (in your case, the data that will become one row in the database).
Even if you did use a DOM-style parser, things might not be quite as bad as you expect. XML is pretty verbose, so (depending on how it's structured and such) a 100 MB file will often represent only 10-20 MB of data, and as little as 5 MB of data wouldn't be particularly rare or unbelievable.
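A minimal sketch of that call-back pattern; the element names, table and columns are assumptions:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Turns each <record> element into one database row, keeping only one record in memory. */
public class RecordInsertHandler extends DefaultHandler {

    private final Connection connection;
    private final StringBuilder text = new StringBuilder();
    private String name;
    private String value;

    public RecordInsertHandler(Connection connection) {
        this.connection = connection;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("name".equals(qName)) {                     // assumed child element
            name = text.toString();
        } else if ("value".equals(qName)) {             // assumed child element
            value = text.toString();
        } else if ("record".equals(qName)) {            // assumed row element
            insertRow();
            name = null;                                // discard the data; memory use stays flat
            value = null;
        }
    }

    private void insertRow() throws SAXException {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO records (name, value) VALUES (?, ?)")) {  // assumed table
            ps.setString(1, name);
            ps.setString(2, value);
            ps.executeUpdate();
        } catch (SQLException e) {
            throw new SAXException(e);
        }
    }
}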
Any SAX parser should work since it won't load the entire XML file into memory like a DOM parser.