Optimising Java's NIO for small files

We have a file I/O bottleneck. We have a directory containing lots of JPEG files, and we want to read them in, in real time, as a movie. Obviously this is not an ideal format, but this is a prototype object-tracking system and there is no possibility of changing the format, as the files are used elsewhere in the code.
From each file we build a frame object, which basically means holding a BufferedImage and an explicit ByteBuffer containing all of the information from the image.
What is the best strategy for this? The data is on an SSD which in theory has read/write rates around 400 MB/s, but in practice is reading no more than 20 files per second (3-4 MB/s) using the naive implementation:
bufferedImg = ImageIO.read(imageFile);[1]
byte[] data = ((DataBufferByte)bufferedImg.getRaster().getDataBuffer()).getData();[2]
imgBuf = ByteBuffer.wrap(data);
However, Java produces lots of possibilities for improving this.
(1) Channels, especially FileChannels.
(2) Gathering/Scattering.
(3) Direct Buffering
(4) Memory Mapped Buffers
(5) MultiThreading - use a bunch of callables to access many files simultaneously.
(6) Wrapping the files in a single large file.
(7) Other things I haven't thought of yet.
I would just like to know if anyone has extensively tested the different options, and knows what is optimal? I assume that (3) is a must, but I would still like to optimise the reading of a single file as far as possible, and am unsure of the best strategy.
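For concreteness, here is a rough sketch of what I imagine combining (1) and (3) would look like for a single file - ImageIO still has to decode the JPEG, so this only changes how the raw bytes get into memory:
// Rough sketch: read the JPEG bytes through a FileChannel into a direct buffer,
// then hand them to ImageIO for decoding.
try (FileChannel ch = FileChannel.open(imageFile.toPath(), StandardOpenOption.READ)) {
    ByteBuffer direct = ByteBuffer.allocateDirect((int) ch.size());
    while (direct.hasRemaining() && ch.read(direct) >= 0) {
        // keep filling until the whole file is in the buffer
    }
    direct.flip();
    byte[] raw = new byte[direct.remaining()];
    direct.get(raw);                                   // copy back to the heap for ImageIO
    bufferedImg = ImageIO.read(new ByteArrayInputStream(raw));
}
The copy back into a heap array is needed because ImageIO wants an InputStream, which is part of why I am unsure whether (3) actually buys anything here.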
Bonus Question: In the code snippet above, when does the JVM actually 'hit the disk' and read in the contents of the file? Is it at [1], or is that just a file handle which 'points' to the object? It would make sense to evaluate lazily, but I don't know how the implementation of the ImageIO class works.

ImageIO.read(imageFile)
As it returns a fully decoded BufferedImage, I assume it hits the disk (and decodes the image) right there at [1], rather than just returning a file handle.

Related

In Java, can I remove specific bytes from a file?

So far I managed to do something with byte streams: read the original file and write to a new file while omitting the desired bytes (and then finish by deleting/renaming the files so that there's only one left).
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file. The reason is that this has to be performed when there is low memory and the file is too big, so cloning the file before trimming it may not be the best option.
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file.
There isn't a SAFE way to do it.
The unsafe way to do it involves (for example) mapping the file using a MappedByteBuffer, and shuffling the bytes around.
But the problem is that if something goes wrong while you are doing this, you are liable to end up with a corrupted file.
Therefore, if the user asks to perform this operation when the device's memory is too full to hold a second copy of the file, the best thing is to tell the user to "delete some files first".
The reason is that this has to be performed when there is low memory and the file is too big, so cloning the file before trimming it may not be the best option.
If you are primarily worried about "memory" on the storage device, see above.
If you are worried about RAM, then @RealSkeptic's observation is correct. You shouldn't need to hold the entire file in RAM at the same time. You can read, modify, and write it a buffer at a time.
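For illustration, a rough sketch of that buffer-at-a-time, in-place approach (method name and buffer size are mine, and it carries the same corruption-on-crash caveat as the MappedByteBuffer approach above):
// Remove [offset, offset + length) from the file in place, without a second file.
// Unsafe in the sense described above: a crash midway leaves a corrupted file.
static void removeBytes(File f, long offset, long length) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
        byte[] buf = new byte[64 * 1024];
        long readPos = offset + length;   // source: just past the removed region
        long writePos = offset;           // destination: start of the removed region
        while (readPos < raf.length()) {
            raf.seek(readPos);
            int n = raf.read(buf);
            if (n < 0) break;
            raf.seek(writePos);
            raf.write(buf, 0, n);
            readPos += n;
            writePos += n;
        }
        raf.setLength(raf.length() - length);   // drop the now-duplicated tail
    }
}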
You can't remove bytes from the middle of a file without rewriting everything that comes after them, but you can replace bytes in place if that is enough for your use case.

Best way to merge binary files in Java

I'm developing a basic download manager that can download a file over HTTP using multiple connections. At the end of the download, I have several temp files, each containing a part of the file.
I now want to merge them into a single file.
It's not hard to do: simply create an output stream and input streams and pipe the inputs into the output in the right order.
But I was wondering: is there a way to do it more efficiently? I mean, from my understanding, what will happen here is that the JVM will read the inputs byte by byte and write the output byte by byte.
So basically I have :
- read byte from disk
- store byte in memory
- some CPU instructions will probably run and the byte will probably be copied into the CPU's cache
- write byte to the disk
I was wondering if there was a way to keep the whole process on the disk? I don't know if I'm making myself clear, but basically I want to tell the disk "hey disk, take these files of yours and make one with them".
In short, I want to reduce CPU and memory usage to the lowest possible.
In theory it may be possible to do this operation at the file-system level: you could append the block list from one inode to another without moving the data. This is not very practical, though; most likely you would have to bypass your operating system and access the disk directly.
The next best thing may be using FileChannel.transferTo or transferFrom methods:
This method is potentially much more efficient than a simple loop that reads from this channel and writes to the target channel. Many operating systems can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
You should also test reading and writing large blocks of bytes using streams or RandomAccessFile - it may still be faster than using channels. Here's a good article about testing sequential IO performance in Java.
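A minimal sketch of the transferTo approach, assuming the parts are already collected, in order, in a List<Path> (the names here are illustrative):
// Merge the part files into target using FileChannel.transferTo.
static void merge(List<Path> parts, Path target) throws IOException {
    try (FileChannel out = FileChannel.open(target,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        for (Path part : parts) {
            try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
                long pos = 0, size = in.size();
                while (pos < size) {
                    // transferTo may move fewer bytes than requested, so loop
                    pos += in.transferTo(pos, size - pos, out);
                }
            }
        }
    }
}
This keeps the copy in the kernel/page cache on most platforms, which is about as close as Java gets to telling the disk to do the work itself.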

reading files from memory instead of disk

I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use an in-memory database to store the intermediate files (the XML files). This gives you the speed of RAM together with the convenience of a database.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as in memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing at arbitrary positions in the file.
Also, I would suggest using a memory-mapped file (MappedByteBuffer) to read the file directly from the disk instead of loading it into memory:
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// map READ_ONLY to match the "r" open mode; READ_WRITE on a read-only channel throws NonWritableChannelException
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
And then you can read the buffer as usual.
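For example, one way to drain the mapped region into a reusable array (buffer size is arbitrary):
byte[] chunk = new byte[8192];
while (buf.hasRemaining()) {
    int n = Math.min(chunk.length, buf.remaining());
    buf.get(chunk, 0, n);   // copies the next n mapped bytes into the array
    // ... process chunk[0..n) ...
}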
Have you considered creating an object structure for these files and serializing them? Java object serialization and deserialization is much faster than parsing XML. This assumes, of course, that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or even ByteBuffer; these can hold the bytes in memory.
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, that 30-40 seconds to load the XML files is way out of line - you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (such as by using Java 8 parallel streams).
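As a rough sketch of that combination (the directory name is a placeholder), parsing all the files with StAX on a parallel stream might look like this:
// Collect the XML paths, then pull-parse them in parallel without building a DOM.
List<Path> files;
try (Stream<Path> listing = Files.list(Paths.get("xml-dir"))) {   // "xml-dir" is a placeholder
    files = listing.filter(p -> p.toString().endsWith(".xml")).collect(Collectors.toList());
}
files.parallelStream().forEach(p -> {
    try (InputStream in = Files.newInputStream(p)) {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                // react to each element here; nothing is kept in memory beyond the cursor
            }
        }
        reader.close();
    } catch (Exception e) {
        throw new RuntimeException(p.toString(), e);
    }
});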
Looks like your main issue is the large number of files, and that RAM is not a constraint. Can you confirm?
Is it possible that you do a preprocessing step where you append all these files using some kind of separator and create a big file? This way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% the size of the original or better. You can uncompress it when it is visible to users and then store it compressed again for further reading.
Here is a library I found that might help:
zip4j
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (it would actually be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it from disk. You have to read it at least once either way, so it would make no difference whether you first read it from disk into memory and then process it from memory, or process it as you read.
If you read a file more than once, you could read all the files into memory (various options: a Buffer, byte arrays, a custom FileSystem, ...).
In case processing takes longer than reading (which seems not to be the case here), you could pre-fetch the files from disk using one thread and process the data from memory using another thread.
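A minimal sketch of that pre-fetching idea, using a bounded queue so the reader thread never gets too far ahead (the directory name and queue size are placeholders):
// One thread reads files ahead; the processing thread consumes them from the queue.
BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);     // bounded, so the reader can't run away
ExecutorService readerThread = Executors.newSingleThreadExecutor();
readerThread.submit(() -> {
    try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("xml-dir"), "*.xml")) {
        for (Path p : dir) {
            queue.put(Files.readAllBytes(p));                   // blocks while the consumer catches up
        }
    }
    return null;
});
// meanwhile, the processing thread repeatedly calls queue.take() and parses each byte[]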

When am I doubling my memory usage?

I have a servlet that users post a XML file to.
I read that file using:
String xml = request.getParameter("...");
Now say that XML document is 10 KB; since I created the variable xml, I am now using 10 KB of memory for that variable, correct?
Now I need to parse that xml (using xerces), and I converted it to a input stream when passing to it to the saxparsers parse method (http://docs.oracle.com/javase/1.5.0/docs/api/javax/xml/parsers/SAXParser.html).
So if I convert a string to a stream, is that doubling my memory usage?
Need some clarifications on this.
If I connect my process with visualvm or jconsole, while stepping through the code, can I see if I am using additional memory as I step through the code in my debugger?
I want to make sure I am not doing this inefficiently, as this endpoint will be hit hard.
10,000 bytes of text generally turns into about 20 KB in memory, since Java stores each character as two bytes.
When you process text you generally need 2-10x more memory as you will be doing something with that information, such as creating a data structure.
This means you could need 200 KB. However, given that on a PC this represents about a cent's worth of memory, I wouldn't normally worry about it. If you have a severely resource-limited device, I would consider moving the processing to another device, like a server.
I think you might be optimizing your code before actually seeing it running. The JVM is very good and fast to recover unused memory.
But to answer your question: String xml = request.getParameter("..."); doesn't double the memory; it just allocates an extra 4 or 8 bytes (depending on whether the JVM is using compressed pointers) for the reference.
Parsing the XML is different: the SAX parser is very memory efficient, so it won't use much memory - I think around 20 bytes per handler plus any instance variables you have, and obviously any extra objects you might generate in the handler.
So the code you have looks like it's as memory efficient as it can get (depending of what you have in your handlers, of course).
Unless you're working on embedding that code in a device or running it 100k times a second, I would suggest you not to optimize anything unless you're sure you need to optimize it. The JVM has some crazy advanced logic to optimize code and the garbage collector is very fast to recover short lived objects.
If users can post massive files back to your servlet, then it is best not to use the getParameter() methods but to handle the stream directly - for example with the Apache Commons FileUpload library.
That way you can use the SAX Parser on the InputStream (and the whole text does not need to be loaded into memory before processing) - as you would have to do with the String based solution.
This approach scales well and requires only a tiny amount of memory per request compared to the String xml = getParameter(...) solution.
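A rough sketch of that streaming approach, assuming the XML arrives as a multipart upload and you already have a SAX DefaultHandler (the handler variable is an assumption here):
// Parse an uploaded XML document without ever materialising it as a String.
ServletFileUpload upload = new ServletFileUpload();            // streaming mode: no factory needed
FileItemIterator items = upload.getItemIterator(request);
SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
while (items.hasNext()) {
    FileItemStream item = items.next();
    if (!item.isFormField()) {
        try (InputStream in = item.openStream()) {
            saxParser.parse(new InputSource(in), handler);      // handler is your DefaultHandler
        }
    }
}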
You would code it like this:
saxParser.parse(new InputSource(new StringReader(xml)), handler); // handler is your DefaultHandler
You first need to create a StringReader around xml. This won't double your memory usage; the StringReader class merely wraps the xml variable and returns it character by character when requested.
InputSource is even thinner - it simply wraps the provided Reader or InputStream. So in short: no, your String won't be copied; your implementation is pretty good.
No, you won't get 2 copies of the string, doubling your memory. Other things might double that memory, but the string itself won't be duplicated.
Yes, you can connect VisualVM or JConsole to see what happens to memory and threads as the code runs.

Java: Where can I find advanced file manipulation source/libraries?

I'm writing arbitrary byte arrays (mock virus signatures of 32 bytes) into arbitrary files, and I need code to overwrite a specific file given an offset into the file. My specific question is: is there source code/libraries that I can use to perform this particular task?
I've had this problem with Python file manipulation as well. I'm looking for a set of functions that can kill a line, cut/copy/paste, etc. My assumptions are that these are extremely common tasks, and I couldn't find it in the Java API nor my google searches.
Sorry for not RTFM well; I haven't come across any information, and I've been looking for a while now.
Maybe you are looking for something like the RandomAccessFile class in the standard Java JDK. It supports reads and writes at some offset, as well as byte arrays.
Java's RandomAccessFile is exactly what you want.
It includes methods like seek(long) that allow you to move wherever you need in the file. It also allows for reading and writing at the same time.
As far as I know, Java has primarily lower-level functions for manipulating files directly. Here is the best I've come up with:
The actions you describe are standard in the Swing world, and for text they come down to manipulating a Document object. These act on data in memory. The class java.nio.channels.FileChannel has similar methods that act directly on a file. Neither finds the ends of lines automatically, but other classes in java.io and java.nio do.
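For example, a positional write through a FileChannel needs no explicit seek (the path, offset, and array here are placeholders):
// Overwrite 32 bytes at a fixed offset using a positional FileChannel write.
byte[] signature = new byte[32];                       // the 32 bytes to splice in (placeholder)
try (FileChannel ch = FileChannel.open(Paths.get("target.bin"), StandardOpenOption.WRITE)) {
    ch.write(ByteBuffer.wrap(signature), 4096);        // write at offset 4096 without moving the channel position
}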
Apache Commons has a sandbox library called Flatfile which looks like it does what you want. The problem is that no code has been released yet. You may, however, want to talk to people working on it to get some more ideas. I didn't do a general check on libraries.
Have you looked into File/FileReader/FileWriter/BufferedReader? You can get the contents of the files and manipulate it as you like, you can search the data in the files, you can overwrite files, create new, append to an existing....
I am not sure this is exactly what you are asking for but I use these APIs all the time for logging, RTF editors, text file creation for email, and many other things.
As far as cut/copy/paste goes, I have not come across the ability to do that directly; however, you can output the contents of the file, "copy" the part you want, and "paste" it into a new file or append it to an existing one.
While writing a byte array to a file is a common task, writing a 32-byte array to a given file at a given offset, just once, is just not something you are going to find ready-made in java.io :)
To get started, would the below method and comments look reasonable to you? I bet someone here, maybe even myself, could whip it out quick like.
public static void writeFauxVirusSignature(File file, byte[] bytes, long offset) throws IOException {
    // open file for reading and writing
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
        raf.seek(offset);   // move to offset
        raf.write(bytes);   // write bytes
    }                       // close file (try-with-resources)
}
Questions:
How big could the potential target files be?
Do you need performance?
I ask because clean, easy-to-read code would use the Apache Commons libraries, but large file writes in a performance-sensitive environment will necessitate using the java.nio libraries.
