When am I doubling my memory usage? - java

I have a servlet that users post an XML file to.
I read that file using:
String xml = request.getParameter("...");
Now say the XML document is 10 KB. Since I created the variable xml, I am now using 10 KB of memory for that variable, correct?
Now I need to parse that XML (using Xerces), and I convert it to an input stream when passing it to the SAX parser's parse method (http://docs.oracle.com/javase/1.5.0/docs/api/javax/xml/parsers/SAXParser.html).
So if I convert a string to a stream, is that doubling my memory usage?
Need some clarifications on this.
If I connect my process with visualvm or jconsole, while stepping through the code, can I see if I am using additional memory as I step through the code in my debugger?
I want to make sure I am not doing this inefficiently, as this endpoint will be hit hard.

10,000 bytes of text generally turn into about 20 KB in memory, since Java strings store each character as two bytes (UTF-16).
When you process text you generally need 2-10x more memory, as you will be doing something with that information, such as building a data structure.
This means you could need around 200 KB. However, given that on a PC this represents about a cent's worth of memory, I wouldn't normally worry about it. If you have a severely resource-limited device, I would consider moving the processing to another device, such as a server.

I think you might be optimizing your code before actually seeing it run. The JVM is very good and fast at recovering unused memory.
But to answer your question: String xml = request.getParameter("..."); doesn't double the memory, it just allocates an extra 4 or 8 bytes (depending on whether the JVM is using compressed pointers) for the reference.
Parsing the XML is different. The SAX parser is very memory efficient, so it won't use much memory: I think around 20 bytes per handler, plus any instance variables that you have... and obviously any extra objects that you might create in the handler.
So the code you have looks like it's as memory efficient as it can get (depending on what you have in your handlers, of course).
Unless you're embedding that code in a device or running it 100k times a second, I would suggest not optimizing anything until you're sure you need to. The JVM has some very advanced logic for optimizing code, and the garbage collector is very fast at recovering short-lived objects.

If users can post massive files to your servlet, then it is best not to use the getParameter() methods and to handle the stream directly instead - for example with the Apache Commons FileUpload library for multipart uploads.
That way you can run the SAX parser on the InputStream (so the whole text does not need to be loaded into memory before processing), as you would have to do with the String-based solution.
This approach scales well and requires only a tiny amount of memory per request compared to the String xml = getParameter(...) solution.
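A minimal sketch of the streaming approach, assuming the client posts the raw XML as the request body (with multipart form uploads you would obtain the stream from the FileUpload library instead):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlUploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // The parser pulls bytes from the request as it needs them,
            // so the full document is never held in memory as one String.
            parser.parse(new InputSource(req.getInputStream()), new DefaultHandler() {
                // override startElement/characters/endElement as needed
            });
        } catch (Exception e) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Invalid XML");
        }
    }
}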

Your code will look like this:
saxParser.parse(new InputSource(new StringReader(xml)), handler);
You first need to create a StringReader around xml. This won't double your memory usage; the StringReader class merely wraps the xml variable and returns its contents character by character when requested.
InputSource is even thinner - it simply wraps the provided Reader or InputStream. So in short: no, your String won't be copied, and your implementation is pretty good.
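For completeness, a small self-contained sketch of that wrapping chain (the anonymous handler here is just a placeholder that prints element names):

import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseFromString {
    public static void main(String[] args) throws Exception {
        String xml = "<root><item>hello</item></root>";
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                System.out.println("start: " + qName);
            }
        };
        // StringReader and InputSource only wrap the existing String;
        // the characters are not copied into a second buffer.
        saxParser.parse(new InputSource(new StringReader(xml)), handler);
    }
}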

No, you won't get 2 copies of the string, doubling your memory. Other things might double that memory, but the string itself won't be duplicated.
Yes, you should connect VisualVM or JConsole to see what happens to memory and thread processing.

Related

Extract part of XML file [duplicate]

I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
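A minimal StAX sketch using only the JDK's javax.xml.stream classes (the file name and the element printing are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("huge.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Events are pulled one at a time, so only the current
                // element is held in memory, not the whole document.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }
}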
Use a SAX based parser that presents you with the contents of the document in a stream of events.
StAX API is easier to deal with compared to SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on; you have to pay attention to some things, like when the characters() method is called. But the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current path in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
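A rough sketch of such a path-tracking handler (the path of interest and what is done with the text are made up for illustration):

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0); // characters() may arrive in pieces, so collect them per element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String currentPath = String.join("/", path);
        if (currentPath.endsWith("record/value")) { // hypothetical path of interest
            // hand the chunk off here: write to a DB, another file, etc.
            System.out.println(currentPath + " = " + text.toString().trim());
        }
        path.removeLast();
    }
}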
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text given its UTF-16 encoding.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark-and-sweep garbage collection results in the pages being accessed in random order and objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential writes of 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.

Reading files from memory instead of disk

I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use an in-memory database to store the intermediate files (XML files). This will give you the speed of RAM and a database together.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as in memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
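A tiny sketch of the H2 in-memory approach via plain JDBC, assuming the H2 jar is on the classpath (table and column names are invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InMemoryXmlCache {
    public static void main(String[] args) throws Exception {
        // DB_CLOSE_DELAY=-1 keeps the in-memory database alive for the JVM's lifetime
        try (Connection conn =
                 DriverManager.getConnection("jdbc:h2:mem:xmlcache;DB_CLOSE_DELAY=-1");
             Statement ddl = conn.createStatement()) {
            ddl.execute("CREATE TABLE xml_files (name VARCHAR(255) PRIMARY KEY, content CLOB)");

            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO xml_files VALUES (?, ?)")) {
                insert.setString(1, "config1.xml");
                insert.setString(2, "<config>...</config>");
                insert.executeUpdate();
            }

            try (PreparedStatement query =
                     conn.prepareStatement("SELECT content FROM xml_files WHERE name = ?")) {
                query.setString(1, "config1.xml");
                try (ResultSet rs = query.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString(1)); // read back from RAM, not disk
                    }
                }
            }
        }
    }
}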
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing to a random access file.
I would also suggest using a memory-mapped file (MappedByteBuffer), to read the file contents directly without copying them into the Java heap.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024*50);
And then you can read the buffer as usual.
Have you considered creating an object structure for these files and serializing them? Java object serialization and deserialization is much faster than parsing XML - this assumes that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or maybe even ByteBuffer; these can hold the bytes in memory.
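A minimal sketch of that idea: read each file's bytes from disk once and hand out fresh ByteArrayInputStreams over the cached bytes (the cache class, its name, and its eviction policy are up to you):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class XmlByteCache {
    private final Map<String, byte[]> cache = new HashMap<>();

    public InputStream open(String fileName) throws IOException {
        byte[] bytes = cache.get(fileName);
        if (bytes == null) {
            bytes = Files.readAllBytes(Paths.get(fileName)); // read from disk only once
            cache.put(fileName, bytes);
        }
        // every caller gets a fresh stream over the same in-memory bytes
        return new ByteArrayInputStream(bytes);
    }
}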
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably just to use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, that 30-40 seconds to load the XML files is way out of line - you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (for example with Java 8 parallel streams).
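A hedged sketch combining those two suggestions, an XMLStreamReader per file driven from a parallel stream (the directory name and the element-counting work are only illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ParallelXmlLoad {
    public static void main(String[] args) throws IOException {
        try (Stream<Path> files = Files.list(Paths.get("xml-dir"))) {
            long totalElements = files.parallel()
                    .filter(p -> p.toString().endsWith(".xml"))
                    .mapToLong(ParallelXmlLoad::countElements)
                    .sum();
            System.out.println("elements across all files: " + totalElements);
        }
    }

    private static long countElements(Path file) {
        long count = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // a new factory per call keeps this safe under parallel execution
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new RuntimeException("Failed to parse " + file, e);
        }
        return count;
    }
}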
Looks like your main issue is the large number of files, and RAM is not an issue. Can you confirm?
Is it possible to do a preprocessing step where you append all these files with some kind of separator and create one big file? This way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% of the size of the original, or better. You can decompress a file when it needs to be visible to users and then store it compressed again for further reading.
Here is a library I found that might help:
zip4j
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (it would actually be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it from disk. Then you have to read it at least once, so it would make no difference whether you read it first from disk into memory and then process it from memory.
If you read a file more than once, you could read all the files into memory (various options: a Buffer, byte arrays, a custom FileSystem, ...).
In case processing takes longer than reading (which does not seem to be the case here), you could pre-fetch the files from disk using a separate thread and process the data from memory in another thread.
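A rough sketch of that pre-fetch idea, with a bounded queue between a reader thread and the processing thread (the file list and the process step are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchingReader {
    private static final byte[] POISON_PILL = new byte[0];

    public static void main(String[] args) throws Exception {
        List<Path> files = Arrays.asList(Paths.get("a.xml"), Paths.get("b.xml"));
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(8);

        Thread reader = new Thread(() -> {
            try {
                for (Path file : files) {
                    queue.put(Files.readAllBytes(file)); // blocks if the queue is full
                }
                queue.put(POISON_PILL); // signal "no more files"
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        byte[] data;
        while ((data = queue.take()) != POISON_PILL) {
            process(data); // runs concurrently with the next disk read
        }
        reader.join();
    }

    private static void process(byte[] data) {
        System.out.println("processing " + data.length + " bytes");
    }
}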

Optimising Java's NIO for small files

We have a file I/O bottleneck. We have a directory which contains lots of JPEG files, and we want to read them in in real time as a movie. Obviously this is not an ideal format, but this is a prototype object tracking system and there is no possibility to change the format as they are used elsewhere in the code.
From each file we build a frame object which basically means having a buffered image and an explicit bytebuffer containing all of the information from the image.
What is the best strategy for this? The data is on an SSD which in theory has read/write rates around 400 MB/s, but in practice is reading no more than 20 files per second (3-4 MB/s) using the naive implementation:
bufferedImg = ImageIO.read(imageFile);[1]
byte[] data = ((DataBufferByte)bufferedImg.getRaster().getDataBuffer()).getData();[2]
imgBuf = ByteBuffer.wrap(data);
However, Java produces lots of possibilities for improving this.
(1) Channels, especially FileChannels.
(2) Gathering/Scattering.
(3) Direct Buffering
(4) Memory Mapped Buffers
(5) MultiThreading - use a bunch of callables to access many files simultaneously.
(6) Wrapping the files in a single large file.
(7) Other things I haven't thought of yet.
I would just like to know if anyone has extensively tested the different options, and knows what is optimal? I assume that (3) is a must, but I would still like to optimise the reading of a single file as far as possible, and am unsure of the best strategy.
Bonus Question: In the code snippet above, when does the JVM actually 'hit the disk' and read in the contents of the file? Is it at [1], or is that just a file handle which 'points' to the object? It would make sense to evaluate lazily, but I don't know how the ImageIO class is implemented.
ImageIO.read(imageFile)
As it returns a BufferedImage, I assume it hits the disk at that point and does not just return a file handle.
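A rough sketch of option (5) from the list above, decoding many JPEGs concurrently with a small thread pool (the directory name and pool size are arbitrary). Note that ImageIO.read does the disk read and decode at the moment it is called; it is not lazy:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.imageio.ImageIO;

public class ParallelJpegLoader {
    public static void main(String[] args) throws Exception {
        File[] jpegs = new File("frames").listFiles((dir, name) -> name.endsWith(".jpg"));
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<Future<BufferedImage>> futures = new ArrayList<>();
        for (File f : jpegs) {
            // each task reads and decodes one file; several run concurrently
            futures.add(pool.submit((Callable<BufferedImage>) () -> ImageIO.read(f)));
        }
        for (Future<BufferedImage> future : futures) {
            BufferedImage img = future.get(); // blocks until that frame has been decoded
            // build the frame object (BufferedImage + ByteBuffer) here
        }
        pool.shutdown();
    }
}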

My JSON files are too big to fit into memory, what can I do?

In my program, I am reading a series of text files from the disk. With each text file, I process out some data and store the results as JSON on the disk. In this design, each file has its own JSON file. In addition to this, I also store some of the data in a separate JSON file, which stores relevant data from multiple files. My problem is that the shared JSON grows larger and larger with every file parsed, and eventually uses too much memory. I am on a 32-bit machine and have 4 GB of RAM, and cannot increase the memory size of the Java VM anymore.
Another constraint to consider is that I often refer back to the old JSON. For instance, say I pull out ObjX from FileY. In pseudocode, the following happens (using Jackson for JSON serialization/deserialization):
// In the main method.
FileYJSON = parse(FileY);
ObjX = FileYJSON.get(some_key);
sharedJSON.add(ObjX);
// In the sharedJSON object
List objList;
function add(obj) {
    if (!objList.contains(obj)) {
        objList.add(obj);
    }
}
The only thing I can think to do is use streaming JSON, but the problem is that I frequently need to access the JSON that came before, so I don't know that streaming will work. Also, my data types are not only strings, which prevents me from using Jackson's streaming capabilities (I believe). Does anyone know of a good solution?
If you're getting to the point where your data structures are so large that you're running out of memory, you'll have to start using something else. I would recommend that you use a database, which will significantly speed up data retrieval and storage. It will also make the limit of your data structure the size of your hard drive, instead of the size of your RAM.
Try this page for an introduction to Java and Databases.
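As a hedged sketch of that idea, the ever-growing sharedJSON list could become a disk-backed table keyed by the object's identity, so the "already added?" check is a primary-key constraint instead of a linear scan. This assumes an embedded database such as H2 on the classpath; the table, column names and MERGE syntax below are H2-flavoured and only illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SharedObjectStore implements AutoCloseable {
    private final Connection conn;

    public SharedObjectStore(String jdbcUrl) throws Exception {
        conn = DriverManager.getConnection(jdbcUrl); // e.g. "jdbc:h2:./sharedjson"
        try (Statement ddl = conn.createStatement()) {
            ddl.execute("CREATE TABLE IF NOT EXISTS shared_obj "
                    + "(obj_key VARCHAR(255) PRIMARY KEY, json CLOB)");
        }
    }

    /** Inserts or replaces the row for this key, so duplicates never accumulate. */
    public void addIfAbsent(String key, String json) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "MERGE INTO shared_obj (obj_key, json) KEY (obj_key) VALUES (?, ?)")) {
            ps.setString(1, key);
            ps.setString(2, json);
            ps.executeUpdate();
        }
    }

    @Override
    public void close() throws Exception {
        conn.close();
    }
}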
I can't believe that you really need nearly 4GB RAM only for text files and JSON.
I see three possible solutions.
Switch to plain text if possible; it is not as memory hungry.
Just open and close the files as you need them. You can organize the files according to a specific naming convention, like the first two/three/... digits of their hashes, and open them as you need them.
If you have that much data, you could maybe switch to a database. That would save a lot of resources.
I would prefer option 3 if it's possible for you.
You could put the data behind an API and read the response body from it.

What is the fastest way to output a large amount of data?

I have a JAX-RS web service that calls a DB2 z/OS database and returns about 240 MB of data in a result set. I am then creating an OutputStream to send this data to the client by looping through the result set and adding a few XML tags for my output.
I am confused about whether to use PrintWriter, BufferedWriter or OutputStreamWriter. I am looking for the fastest way to deliver the data. I also don't want the JVM to hold onto this data any longer than it needs to, so I don't use up its memory.
Any help is appreciated.
You should use
BufferedWriter
Call .flush() frequently
Enable gzip for best compression
Start thinking about a different way of doing this. Can your data be paginated? Do you need all the data in one request?
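A hedged sketch of those suggestions using JAX-RS StreamingOutput with a BufferedWriter and periodic flush(), so rows are written to the client as they are read and never accumulated in memory (the resource path, query and column names are placeholders; gzip would typically be enabled in the container or via a filter):

import java.io.BufferedWriter;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.StreamingOutput;

@Path("/export")
public class ExportResource {

    @GET
    @Produces("application/xml")
    public Response export() {
        StreamingOutput body = (OutputStream out) -> {
            try (Connection conn = openDb2Connection(); // your existing data source
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM big_table");
                 BufferedWriter writer = new BufferedWriter(
                         new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                writer.write("<rows>");
                int count = 0;
                while (rs.next()) {
                    writer.write("<row id=\"" + rs.getString("id") + "\">"
                            + rs.getString("name") + "</row>");
                    if (++count % 1000 == 0) {
                        writer.flush(); // push data out instead of buffering everything
                    }
                }
                writer.write("</rows>");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        return Response.ok(body).build();
    }

    private Connection openDb2Connection() {
        throw new UnsupportedOperationException("wire up your DataSource here");
    }
}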
If you are sending large binary data, you probably don't want to use XML. When XML is used, binary data is usually represented using base64, which becomes larger than the original binary and uses quite a lot of CPU for the conversion to base64.
If I were you, I'd send the binary separately from the XML. If you are using a web service, an MTOM attachment could help. Otherwise you could send a reference to the binary data in the XML, and let the app download the binary data separately.
As for the fastest way to send binary data, if you are using WebLogic, just writing to the response's output stream would be fine. That output stream is almost certainly buffered, and whatever you do probably won't change the performance much anyway.
Turning on gzip could also help, depending on what you are sending (if you are sending JPEGs or other already-compressed data it won't help a lot, but if you are sending raw text it can help a lot).
One solution (which might not work for you) is to spawn a job/thread that creates a file and then notifies the user when the file is ready to download. That way you're not tied to the bandwidth of the client connection (and you can even compress the file properly before the client downloads it).
Some business intelligence and data-crunching applications do this, especially if the process takes some time to generate the data.
The maximum output speed will be limited by network bandwidth, and I am sure any Java OutputStream will be so much faster that you won't notice the difference.
The choice depends on the data to send: if it is text (lines), PrintWriter is easy; if it is a byte array, take an OutputStream.
To avoid holding too much data in the buffers, you should call flush() every x KB or so.
You should never use PrintWriter to output data over a network. First of all, it creates platform-dependent line breaks. Second, it silently catches all I/O exceptions, which makes it hard for you to deal with those exceptions.
And if you're sending 240 MB as XML, then you're definitely doing something wrong. Before you start worrying about which stream class to use, try to reduce the amount of data.
EDIT:
The advice about PrintWriter (and PrintStream) came from a book by Elliotte Rusty Harold. I can't remember which one, but it was a few years ago. I think that ServletResponse.getWriter() was added to the API after that book was written - so it looks like Sun didn't follow Rusty's advice. I still think it was good advice - for the reasons stated above, and because it can tempt implementation authors to violate the API contract in order to get predictable behavior.
