I was reading the book and it contains the following lines:
A MemoryMappedBuffer directly reflects the disk file with which it
is associated. If the file is structurally modified while the mapping
is in effect, strange behavior can result (exact behaviors are, of
course, operating system- and filesystem-dependent). A
MemoryMappedBuffer has a fixed size, but the file it's mapped to is
elastic. Specifically, if a file's size changes while the mapping is
in effect, some or all of the buffer may become inaccessible,
undefined data could be returned, or unchecked exceptions could be
thrown.
So my questions are:
Can't I append text to a file which I have already mapped? If yes, then how?
Can somebody please guide me on the real use cases of memory-mapped files? It would be great if you could also mention which specific problem you have solved with them.
Please bear with me if the questions are pretty naive. Thanks.
Memory-mapped files are much faster than the regular ByteBuffer approach, but the whole mapped region is allocated up front: for example, if you map 4 MB, the operating system will create a 4 MB file on the filesystem, map it into memory, and let you write to the file simply by writing to that memory. This is handy when you know exactly how much data you want to write, because if you write less than the mapped size the rest of the region is left filled with zeros. Also, Windows will lock the file (it can't be deleted until the JVM exits); this is not the case on Linux.
Below is an example of appending to a file with a memory-mapped buffer; for the mapping position, just use the current size of the file you are writing to:
int BUFFER_SIZE = 4 * 1024 * 1024; // size of the mapped window (4 MB)
Path path = Paths.get("C:\\temp.txt");
try (FileChannel dataFileChannel = FileChannel.open(path,
        StandardOpenOption.READ, StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
    long fileSize = dataFileChannel.size(); // append by mapping just past the current end of the file
    MappedByteBuffer writeBuffer = dataFileChannel.map(FileChannel.MapMode.READ_WRITE, fileSize, BUFFER_SIZE);
    writeBuffer.put(arrayOfBytes);          // MappedByteBuffer has put(), not write()
    writeBuffer.force();                    // ask the OS to flush the dirty pages to disk
}
So far I have managed to do something with byte streams: read the original file and write a new file while omitting the bytes to be removed (and then finish by deleting/renaming the files so that only one is left).
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file. The reason is that this has to be performed when memory is low and the file is too big, so cloning the file before trimming it may not be the best option.
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file.
There isn't a SAFE way to do it.
The unsafe way to do it involves (for example) mapping the file using a MappedByteBuffer, and shuffling the bytes around.
But the problem is that if something goes wrong while you are doing this, you are liable to end up with a corrupted file.
Therefore, if the user asks to perform this operation when the device's memory is too full to hold a second copy of the file, the best thing is to tell the user to "delete some files first".
The reason is that this has to be performed when memory is low and the file is too big, so cloning the file before trimming it may not be the best option.
If you are primarily worried about "memory" on the storage device, see above.
If you are worried about RAM, then @RealSkeptic's observation is correct. You shouldn't need to hold the entire file in RAM at the same time. You can read, modify, and write it one buffer at a time.
You can't remove bytes from the middle of a file without rewriting everything that comes after them, but you don't need to hold the whole file in memory to do that. You can also simply replace bytes in place, if that helps.
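A minimal sketch of the in-place, buffer-at-a-time approach (the path, offset, and count are placeholders, exception handling is minimal, and the usual java.nio imports are assumed): it shifts everything after the removed range toward the start of the file and then truncates the tail. As noted above, if the process dies part-way through, the file is left corrupted.
// Hypothetical helper for illustration: removes `count` bytes starting at `from` by shifting the tail left.
static void removeBytesInPlace(Path file, long from, long count) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);      // work one 64 KB chunk at a time
        long readPos = from + count, writePos = from;
        while (readPos < ch.size()) {
            chunk.clear();
            int n = ch.read(chunk, readPos);
            if (n <= 0) break;
            chunk.flip();
            while (chunk.hasRemaining()) {
                writePos += ch.write(chunk, writePos);          // copy the chunk to its new, earlier position
            }
            readPos += n;
        }
        ch.truncate(writePos);                                  // drop the now-duplicated tail
    }
}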
In my Java program I need to create files and write into them something that I get from an InputStream's read() method. How can I determine the size of the file before creating it?
Normally, you don't need to know how big the file will be, but if you really do:
The only way you could do that would be to fully read the content from the InputStream into memory first, and then see how much you have.
You have several options for how to read it all into memory, one of which might be to write it to a ByteArrayOutputStream. (And then, of course, write that out to the file when you're ready.)
But again, the great thing about streams is that you don't have to read things all into memory; if you can avoid needing to know the size in advance, that would be best.
Also note that the space the file will occupy on disk won't be exactly the same as the file size; most file systems work in chunks (4k, 8k, 16k, 32k) and so a file that's (say) 12k on a file system using 8k chunks will actually occupy 16k of space.
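If you do need the size up front, here is a minimal sketch of the buffer-into-memory approach (the stream variable and output path are placeholders; exception handling and imports are omitted):
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
int n;
while ((n = inputStream.read(chunk)) != -1) {
    bos.write(chunk, 0, n);              // accumulate the whole stream in memory
}
int fileSize = bos.size();               // now you know exactly how big the file will be
Files.write(Paths.get("out.bin"), bos.toByteArray());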
It depends on the encoding used, but you can write it to an in-memory stream (for example a ByteArrayOutputStream) and get the length.
I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use an in-memory database to store the intermediate files (the XML files). This gives you the speed of RAM and the convenience of a database together.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as in memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
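As a minimal, hedged sketch of the H2 approach (the table and column names are made up for illustration, exception handling is omitted, and the H2 driver plus the usual java.sql/java.nio imports are assumed):
// Load each XML document into an in-memory H2 table once; later reads come from RAM.
try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:xmlcache;DB_CLOSE_DELAY=-1")) {
    conn.createStatement().execute("CREATE TABLE xml_files (name VARCHAR PRIMARY KEY, content CLOB)");
    try (PreparedStatement ps = conn.prepareStatement("INSERT INTO xml_files VALUES (?, ?)")) {
        ps.setString(1, "config.xml");
        ps.setString(2, new String(Files.readAllBytes(Paths.get("config.xml")), StandardCharsets.UTF_8));
        ps.executeUpdate();
    }
}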
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing to a random access file.
I would also suggest using a memory-mapped file, so the content is paged in directly from disk on demand instead of being loaded into the heap.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()); // "r" mode only permits a read-only mapping
And then you can read the buffer as usual.
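For example, if the file is text, the mapped bytes can be decoded straight from the buffer (the charset here is an assumption; java.nio.charset.StandardCharsets is used):
String text = StandardCharsets.UTF_8.decode(buf).toString(); // decode the mapped region into a String
System.out.println(text);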
Have you considered creating an object structure for these files and serializing them? Java object serialization and deserialization is much faster than parsing XML. This assumes, of course, that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or maybe even ByteBuffer; these can hold the bytes in memory.
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, that 30-40 second time to load the XML files is way out of line; you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (for example by using Java 8 parallel streams).
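A minimal sketch of the streaming approach (the file name and element handling are placeholders; javax.xml.stream and java.nio imports are assumed, and exceptions are left unhandled):
XMLInputFactory factory = XMLInputFactory.newInstance();
try (InputStream in = Files.newInputStream(Paths.get("data.xml"))) {
    XMLStreamReader reader = factory.createXMLStreamReader(in);
    while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
            String element = reader.getLocalName();   // pull out only what you need; no DOM is built
        }
    }
    reader.close();
}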
It looks like your main issue is the large number of files, and RAM is not a constraint. Can you confirm?
Could you do a preprocessing step in which you concatenate all these files, using some kind of separator, into one big file? That way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% of the original size, or even less. You can uncompress it when it is visible to users and then store it compressed again for further reading.
Here is a library I found that might help:
zip4j
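If you would rather avoid an extra dependency, the JDK's built-in java.util.zip gives a similar effect; a minimal sketch (file names are placeholders, exceptions are left unhandled, and transferTo/readAllBytes require Java 9+):
// Compress the XML once, then decompress it on demand when it needs to be shown or parsed.
try (InputStream in = Files.newInputStream(Paths.get("data.xml"));
     OutputStream out = new GZIPOutputStream(Files.newOutputStream(Paths.get("data.xml.gz")))) {
    in.transferTo(out);
}
try (InputStream in = new GZIPInputStream(Files.newInputStream(Paths.get("data.xml.gz")))) {
    byte[] xmlBytes = in.readAllBytes();   // the uncompressed XML, ready to parse
}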
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (it would actually be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it. You have to read the data at least once either way, so it would make no difference whether you first read it from disk into memory and then processed it from memory, or processed it as you read it.
If you read a file more than once, you could read all the files into memory (various options: a Buffer, byte arrays, a custom FileSystem, ...).
In case processing takes longer than reading (which seems not to be the case here), you could pre-fetch the files from disk using a separate thread and process the data from memory using another thread.
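A minimal sketch of that pre-fetching idea (the directory name is a placeholder, exceptions and shutdown handling are omitted, and the java.util.concurrent/java.nio imports are assumed):
BlockingQueue<byte[]> prefetched = new LinkedBlockingQueue<>(16);  // bounded, so the reader can't run far ahead
ExecutorService reader = Executors.newSingleThreadExecutor();
reader.submit(() -> {
    for (Path p : Files.newDirectoryStream(Paths.get("xml-dir"))) {
        prefetched.put(Files.readAllBytes(p));                     // read ahead while another thread processes
    }
    return null;
});
// meanwhile, the processing thread takes one file's bytes at a time off the queue:
byte[] nextFile = prefetched.take();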
How do I load a file into main memory?
I read the files using:
BufferedReader buf = new BufferedReader(new FileReader("file.txt"));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
What is the advantage of loading the file directly into memory?
How do we do that in Java?
I found some examples of Scanner or RandomAccessFile methods. Do they load the files into memory? Should I use them? Which of the two should I use?
Thanks in advance!!!
BufferedReader buf = new BufferedReader(new FileReader("file.txt"));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
Not exactly. It is reading the file in chunks whose size is the default buffer size (8k bytes I think).
The advantage is that you don't need a huge heap to read a huge file. This is a significant issue since the maximum heap size can only be specified at JVM startup (with Hotspot Java).
You also don't consume the system's physical / virtual memory resources to represent the huge heap.
What is the advantage of loading the file directly into memory?
It reduces the number of system calls, and may read the file faster. How much faster depends on a number of factors. And you have the problem of dealing with really large files.
How do we do that in Java?
Find out how large the file is.
Allocate a byte (or character) array big enough.
Use the relevant read(byte[], int, int) or read(char[], int, int) method to read the entire file.
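A minimal sketch of those three steps (the file name is a placeholder and exception handling is left out; on Java 7+ Files.readAllBytes does the same thing in one call):
File f = new File("data.bin");
int size = (int) f.length();                       // step 1: find out how large the file is
byte[] data = new byte[size];                      // step 2: allocate an array big enough
try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
    in.readFully(data, 0, size);                   // step 3: read the entire file (readFully loops over partial reads)
}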
You can also use a memory-mapped file ... but that requires using the Buffer APIs which can be a bit tricky to use.
I found some examples on Scanner or RandomAccessFile methods. Do they load the files into memory?
No, and no.
Should I use them? Which of the two should I use?
Do they provide the functionality that you require? Do you need to read / parse text-based data? Do you need to do random access on binary data?
Under normal circumstances, you should choose your I/O APIs based primarily on the functionality that you require, and secondarily on performance considerations. Using a BufferedInputStream or BufferedReader is usually enough to get acceptable* performance if you intend to parse the file as you read it. (But if you actually need to hold the entire file in memory in its original form, then a BufferedXxx wrapper class actually makes reading a bit slower.)
* - Note that acceptable performance is not the same as optimal performance, but your client / project manager probably would not want you to waste time writing code that performs optimally ... if this is not a stated requirement.
If you're reading in the file and then parsing it, walking from beginning to end once to extract your data, then not referencing the file again, a buffered reader is about as "optimal" as you'll get. You can "tune" the performance somewhat by adjusting the buffer size: a larger buffer will read larger chunks from the file. (Make the buffer a power of 2, e.g. 262144.) Reading an entire large file (larger than, say, 1 MB) into memory will generally cost you performance in paging and heap management.
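For example (the file name and buffer size here are just illustrative):
// 256 KB buffer instead of the default 8 KB, so each disk read pulls in a larger chunk
try (BufferedReader reader = new BufferedReader(new FileReader("big.log"), 262144)) {
    String line;
    while ((line = reader.readLine()) != null) {
        // parse the line, extract what you need, then move on
    }
}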
I would like to read an arbitrary number of lines. The files are normal ASCII text files for the moment (they may be UTF-8/multibyte character files later).
So what I want is a method that reads only specific lines of a file (for example lines 101-200), and while doing so it should not block anything (i.e. the same file can be read by another thread for lines 201-210, and it should not wait for the first reading operation).
In the case that there are no lines left to read, it should gracefully return whatever it could read. The output of the method could be a List.
The solution I have thought of so far is to read the entire file first to find the number of lines as well as the byte positions of each newline character, and then use RandomAccessFile to read bytes and convert them to lines. I have to convert the bytes to Strings (but that can be done after the reading is done). I would avoid the end-of-file exception for reading beyond the file by proper bookkeeping. The solution is a bit inefficient as it goes through the file twice, but the file size can be really big and we want to keep very little in memory.
If there is a library for such a thing that would work, but a simpler native Java solution would be great.
As always I appreciate your clarification questions and I will edit this question as it goes.
Why not use Scanner and just loop through hasNextLine() until you get to the count you want, and then grab as many lines as you wish? If it runs out, it will fail gracefully. That way you're only reading the file once (unless Scanner reads it fully... I've never looked under the hood, but it doesn't sound like you care, so... there you go :)
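A minimal sketch of that idea (the file name and line range are placeholders; the checked FileNotFoundException is left unhandled):
List<String> lines = new ArrayList<>();
try (Scanner scanner = new Scanner(new File("big.txt"))) {
    for (int i = 1; i <= 100 && scanner.hasNextLine(); i++) {
        scanner.nextLine();                    // skip lines 1-100
    }
    for (int i = 101; i <= 200 && scanner.hasNextLine(); i++) {
        lines.add(scanner.nextLine());         // collect lines 101-200 (or fewer, if the file ends)
    }
}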
If you want to minimise memory consumption, I would use a memory mapped file. This uses almost no heap. The amount of the file kept in memory is handled by the OS so you don't need to tune the behaviour yourself.
FileChannel fc = new FileInputStream(fileName).getChannel();
final MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
If you have a file of 2 GB or more, you need multiple mappings. In the simplest case you can scan the data and remember all the indexes. The indexes themselves could take lots of space, so you might only remember every Nth one, e.g. every tenth.
e.g. a 2 GB file with 40 byte lines could have 50 million lines requiring 400 MB of memory.
Another way around having a large index is to create another memory mapped file.
FileChannel fc = new RandomAccessFile(fileName, "rw").getChannel(); // RandomAccessFile needs a mode; "rw" allows a read-write mapping
final MappedByteBuffer map2 = fc.map(FileChannel.MapMode.READ_WRITE, 0, fc.size()/10);
The problem being, you don't know how big the file needs to be before you start. Fortunately if you make it larger than needed, it doesn't consume memory or disk space, so the simplest thing to do is make it very large and truncate it when you know the size it needs to be.
This could also be used to avoid re-indexing the file each time you load it (only re-index when it has changed). If the file is only appended to, you could index from the end of the file each time.
Note: Using this approach can use a lot of virtual memory. For a 64-bit JVM this is no problem, as your limit is likely to be 256 TB. For a 32-bit application, your limit is likely to be 1.5 - 3.5 GB depending on your OS.
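A minimal sketch of the indexing idea for a file under 2 GB, building on the `map` buffer from the earlier snippet (the line numbers are placeholders, the charset is an assumption, and for very large inputs the in-heap list would be replaced by the second mapped index file described above):
// Scan the mapped file once, remembering the byte offset at which each line starts.
List<Long> lineStarts = new ArrayList<>();
lineStarts.add(0L);
for (int i = 0; i < map.limit(); i++) {
    if (map.get(i) == '\n') {
        lineStarts.add((long) i + 1);
    }
}
// Later, read lines 101-200 by slicing between two recorded offsets.
// Absolute get(int) calls don't move the buffer's position, so other threads can read other ranges concurrently.
int from = lineStarts.get(100).intValue();
int to = lineStarts.get(200).intValue();
byte[] region = new byte[to - from];
for (int i = 0; i < region.length; i++) {
    region[i] = map.get(from + i);
}
String lines101to200 = new String(region, StandardCharsets.US_ASCII);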