Merge-sorting binary files in Java

I have several sorted binary files that store information in a variable-length format (meaning one of the segments contains the length of the variable-length segment).
I need to merge them into one sorted file. I can do so with BufferedInputStream successfully. Nevertheless, it takes a very long time on a mechanical disk. On a machine with an SSD it's much faster, as expected.
What bothers me is that even on the SSD the CPU utilization is very low, which makes me suspect there's a way to improve the speed. I assume this happens because most of the time the CPU waits on the disk. I tried increasing the buffers to hundreds of MBs, to no avail.
I have also tried a memory-mapped buffer and FileChannel, but that didn't improve the runtime.
Any ideas?
Edit: Using a MappedByteBuffer failed because the merged file is over 2 GB, which is the size limit of a single mapping. But even before merging the smaller files into multi-GB files, I didn't notice an improvement in speed or CPU utilization.
Thanks
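For reference, the merge itself is essentially a k-way merge over buffered streams. Below is a minimal sketch using a PriorityQueue, assuming a hypothetical record layout of [int payload length][long sort key][payload]; the Source class would need to be adapted to the actual format:

```java
import java.io.*;
import java.util.*;

public class KWayMerge {

    // One sorted input file. Assumes each record is
    // [int payloadLength][long sortKey][payload bytes] -- adapt to your real format.
    static class Source implements Closeable {
        final DataInputStream in;
        int length;
        long key;
        byte[] payload;

        Source(File file, int bufferBytes) throws IOException {
            in = new DataInputStream(new BufferedInputStream(new FileInputStream(file), bufferBytes));
        }

        // Reads the next record; returns false at end of file.
        boolean advance() throws IOException {
            try {
                length = in.readInt();
            } catch (EOFException end) {
                return false;
            }
            key = in.readLong();
            payload = new byte[length];
            in.readFully(payload);
            return true;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    public static void merge(List<File> inputs, File output, int bufferBytes) throws IOException {
        PriorityQueue<Source> heap = new PriorityQueue<>(Comparator.comparingLong((Source s) -> s.key));
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(output), bufferBytes))) {
            for (File file : inputs) {
                Source source = new Source(file, bufferBytes);
                if (source.advance()) heap.add(source); else source.close();
            }
            while (!heap.isEmpty()) {
                Source smallest = heap.poll();          // source with the smallest current key
                out.writeInt(smallest.length);
                out.writeLong(smallest.key);
                out.write(smallest.payload);
                if (smallest.advance()) heap.add(smallest); else smallest.close();
            }
        }
    }
}
```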

Perhaps you can compress the files better, or is that not an option? If the bottleneck is I/O, then reducing the amount of data read and written is a good angle of attack.
http://www.oracle.com/technetwork/articles/java/compress-1565076.html
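In Java this only requires wrapping the streams with java.util.zip. A minimal sketch, with made-up file names:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCopy {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[1 << 16];

        // Compress an existing file (the file names here are only examples).
        try (InputStream in = new BufferedInputStream(new FileInputStream("merged.bin"));
             OutputStream out = new GZIPOutputStream(new FileOutputStream("merged.bin.gz"), 1 << 16)) {
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
        }

        // Reading it back: wrap the decompressing stream the same way as a BufferedInputStream.
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new FileInputStream("merged.bin.gz"), 1 << 16))) {
            // parse the variable-length records here, exactly as before
        }
    }
}
```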

Related

Risk of repeated write/delete to SSD?

I have a program in Java that creates a log file about 1 KB in size. If I run a test that deletes the old log, creates a new log, and then saves it, repeated a million times, and the file grows over time (up to a few MB), will I risk damaging my SSD? Is there a size limit for the log file that would avoid this risk, or can anyone help me understand the mechanics of the risk?
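For reference, the test described above boils down to something like this sketch (the path and growth rate are made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class LogChurnTest {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get("test.log");                  // hypothetical path
        for (int i = 0; i < 1_000_000; i++) {
            Files.deleteIfExists(log);                     // delete the old log
            // The new log slowly grows from ~1 KB to a few MB over the run.
            byte[] content = new byte[1024 + (i / 1000) * 4096];
            Arrays.fill(content, (byte) 'x');
            Files.write(log, content);                     // create and save the new log
        }
    }
}
```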
When the same file is constantly opened and closed while its size gradually increases, there are two protection mechanisms, at the file-system and SSD levels, that prevent early disk failure.
First, on every file delete the file system issues a Trim (aka Discard, aka logical erase) command to the SSD. The Trim address range covers the entire extent of the deleted file. Trim greatly helps the SSD reclaim free space for new data. Using Trim in combination with writes to the same data range is the best operational mode for an SSD in terms of preserving its endurance. Just make sure that your OS has Trim enabled (it usually is by default); all modern SSDs support it as well. An important note: Trim is a logical erase and does not trigger an immediate physical-media erase. The physical erase happens later, indirectly, as part of the SSD's internal garbage collection.
Second, when accessing the same file, the file system will most likely issue writes to the SSD at the same addresses; only the amount of writes grows as the file grows. This pattern is known as hot-range access, and it is a nasty pattern for an SSD in terms of endurance. The SSD has to allocate free resources (physical pages) on every file write, but the lifetime of the data is very short because it is deleted almost immediately. Overall, the amount of unique data on the SSD's physical media is very low, but the amount of allocated and processed resources (physical pages) is huge. Modern SSDs have protection against hot-range access: they use physical media units in a round-robin manner, which evens out the wear.
I advise monitoring the SSD's SMART health data (the lifetime-left attribute), for example with https://www.smartmontools.org/ or the software provided by your SSD vendor. It will show you how your access pattern is affecting endurance.
As with any file, if the disk doesn't have enough space to write to a file, the OS (or Java) won't allow the file to be written until space is cleared. The only way you can "screw up" a disk in this manner is if you mess around with addresses at the kernel level.

Why is walkFileTree faster the second time?

I am using Files.walkFileTree() from java.nio. I am walking over a really big tree, so the first time I run the app (by "first time" I mean each time I turn on my computer) it takes quite a while, but the second time it is really fast.
Why? Is some cache at work? Can I take advantage of this in a permanent way?
When you read data from the filesystem, that information is cached, making subsequent accesses much faster, in some cases 100x faster or more. The data is cached in memory because memory is much faster than disk.
The simplest solution is to access/load this directory structure before you need it, and you will get cached performance; e.g. you can do this on startup, as in the sketch below.
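A minimal sketch of such a warm-up pass, assuming a hypothetical root directory:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class CacheWarmer {
    /** Walks the tree once, touching every entry so the metadata ends up in the OS cache. */
    public static void warmUp(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Reading the attributes is enough; the file contents aren't needed.
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        warmUp(Paths.get("/data/big-tree"));   // hypothetical path
    }
}
```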
Another solution is to get a faster SSD. Accessing file structures performs a lot of disk operations to gather all the pieces of information. An HDD can do up to 120 IOPS, a cheap SSD can do 40,000 IOPS, and a fast SSD can do 250,000 IOPS. This can dramatically reduce the time it takes to load this information.
However, since you cannot control what is in memory, except by accessing it repeatedly, it may be pushed out of the disk cache later.

Java: are there situations where disk is as fast as memory?

I'm writing some code to access an inverted index.
I have two interchangeable classes which perform the reads on the index. One reads the index from disk, buffering part of it. The other loads the index completely into memory, as a byte[][] (the index is around 7 GB), and reads from this multidimensional array.
One would expect better performance with all the data in memory, but my measurements show that working with the index on disk is as fast as having it in memory.
(The time spent loading the index into memory isn't counted in the measurements.)
Why is this happening? Any ideas?
Further information: I've run the code with HPROF enabled. Whether working "on disk" or "in memory", the most-used code is NOT the code directly related to the reads. Also, to my (limited) understanding, the GC profiler doesn't show any GC-related issues.
UPDATE #1: I've instrumented my code to monitor I/O times. It seems that most of the seeks in memory take 0-2000 ns, while most of the seeks on disk take 1000-3000 ns. The second figure seems a bit too low to me. Is it due to disk caching by Linux? Is there a way to exclude disk caching for benchmarking purposes?
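For reference, the per-read instrumentation amounts to something like this sketch (names are made up):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadTimer {
    /** Times a single positioned read and returns the elapsed nanoseconds. */
    static long timedRead(RandomAccessFile file, long position, byte[] buffer) throws IOException {
        long start = System.nanoTime();
        file.seek(position);
        file.readFully(buffer);
        return System.nanoTime() - start;
    }
}
```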
UPDATE #2: I've graphed the response time for every request to the index. The lines for memory and for disk match almost exactly. I've done some other tests using the O_DIRECT flag to open the file (thanks to JNA!), and in that case the disk version of the code is (obviously) slower than the memory version. So I'm concluding that the "problem" was the aggressive Linux disk caching, which is pretty amazing.
UPDATE #3: http://www.nicecode.eu/java-streams-for-direct-io/
Three possibilities off the top of my head:
The operating system is already keeping all of the index file in memory via its file system cache. (I'd still expect an overhead, mind you.)
The index isn't the bottleneck of the code you're testing.
Your benchmarking methodology isn't quite right. (It can be very hard to do benchmarking well.)
The middle option seems the most likely to me.
No, disk can never be as fast as RAM (RAM is actually on the order of 100,000 times faster than a magnetic disk). Most likely the OS is keeping your file in memory for you.

Performance characteristics of memory mapped file

Background:
I have a Java application which does intensive IO on quite large memory-mapped files (> 500 MB). The program reads data, writes data, and sometimes does both.
All read/write functions have similar computation complexity.
All read/write functions have similar computation complexity.
I benchmarked the IO layer of the program and noticed strange performance characteristics of memory-mapped files:
It performs 90k reads per second (read 1 KB every iteration at a random position)
It performs 38k writes per second (write 1 KB every iteration sequentially)
It performs 43k writes per second (write 4 bytes every iteration at a random position)
It performs only 9k combined read/write operations per second (read 12 bytes then write 1 KB every iteration, at a random position)
The program runs on 64-bit JDK 1.7 on Linux 3.4.
The machine is an ordinary Intel PC with an 8-thread CPU and 4 GB of physical memory. Only 1 GB was assigned to the JVM heap when running the benchmark.
If more details are needed, here is the benchmark code: https://github.com/HouzuoGuo/Aurinko2/blob/master/src/test/scala/storage/Benchmark.scala
And here is the implementation of the above read, write, read/write functions: https://github.com/HouzuoGuo/Aurinko2/blob/master/src/main/scala/aurinko2/storage/Collection.scala
So my questions are:
Given fixed file size and memory size, what factors affect memory mapped file random read performance?
Given fixed file size and memory size, what factors affect memory mapped file random write performance?
How do I explain the benchmark result for the combined read/write operation? (I was expecting it to perform over 20k iterations per second.)
Thank you.
Memory-mapped file performance depends on disk performance, file-system type, the free memory available for the file-system cache, and the read/write block size. The page size on Linux is 4 KB, so you should expect the best performance with 4 KB reads and writes. An access at a random position causes a page fault if the page is not mapped, which pulls in a new page read. Usually, you want a memory-mapped file when you want to view the file as one memory array (or a ByteBuffer in Java).
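A minimal sketch of page-aligned random reads through a mapping, assuming a hypothetical file small enough to fit in a single mapping (< 2 GB):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ThreadLocalRandom;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        final int PAGE = 4096;                               // Linux page size
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");   // hypothetical file
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            byte[] block = new byte[PAGE];
            long pages = channel.size() / PAGE;
            // Page-aligned reads of page size touch exactly one page per access;
            // unaligned or larger reads may fault in two or more pages.
            for (int i = 0; i < 1000; i++) {
                int offset = (int) (ThreadLocalRandom.current().nextLong(pages) * PAGE);
                map.position(offset);
                map.get(block);
            }
        }
    }
}
```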

BerkeleyDB write performance problems

I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.
I'm trying the C BerkeleyDB (5.1.25) library from Java, and I'm seeing serious performance problems.
I get solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents the performance drops like a rock, then it recovers for a while, then drops again, etc. This happens more and more frequently, up to the point where most of the time I can't get more than 60 docs/s with a few isolated peaks of 12K docs/s after 10 million docs. My db type of choice is HASH but I also tried BTREE and it is the same.
I tried using a pool of 10 db's and hashing the docs among them to smooth out the performance drops; this increased the write throughput to 50K docs/s but didn't help with the performance drops: all 10 db's slowed to a crawl at the same time.
I presume that the files are being reorganized, and I tried to find a config parameter that affects when this reorganization takes place, so each of the pooled db's would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, reserving space using the setHashNumElements config option so it wouldn't spend time growing the file, but every tweak made it much worse.
I'm about to give berkeleydb up and try much more complex solutions like cassandra, but I want to make sure I'm not doing something wrong in berkeleydb before writing it off.
Anybody here with experience achieving sustained write performance with berkeleydb?
Edit 1:
I tried several things already:
Throttling the writes down to 500/s (less than the average I got after writing 30 million docs in 15 hours, which indicates the hardware is capable of writing 550 docs/s). It didn't work: once a certain number of docs has been written, performance drops regardless.
Writing incoming items to a queue. This has two problems: A) it defeats the purpose of freeing up RAM; B) the queue eventually blocks because the periods during which BerkeleyDB freezes get longer and more frequent.
In other words, even if I throttle the incoming data to stay below the hardware's capability and use RAM to hold items while BerkeleyDB takes some time to adapt to the growth, as this time gets increasingly longer, performance approaches zero.
This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...
Edit 2:
After giving it some more thought and with Peter's input, I now understand that as the file grows larger, a batch of writes will get spread farther apart and the likelihood of them falling into the same disk cylinder drops, until it eventually reaches the seeks/second limitation of the disk.
But BerkeleyDB's periodic file reorganizations are killing performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer periods of time while it shuffles stuff around. Using faster disks or spreading the database files among different disks does not help. I need to find a way around those throughput holes.
What I have seen with high rates of disk writes is that the system cache fills up (giving lightning performance up to that point), but once it fills up, the application, and even the whole system, can slow dramatically or even stop.
Your underlying physical disk should sustain at least 100 writes per second. Anything more than that is an illusion supported by clever caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.
I suggest you consider a disk controller with a battery-backed cache; its memory would need to be about the size of your data.
Another option is to use SSD drives if the updates are bursty (they can do 10K+ writes per second, as they have no moving parts). Combined with caching, this should give you more than you need, but SSDs have a limited number of writes.
BerkeleyDB does not perform file reorganizations, unless you're manually invoking the compaction utility. There are several causes of the slowdown:
Writes go to keys in random-access fashion, which causes a much higher disk I/O load.
Writes are durable by default, which forces a lot of extra disk flushes.
A transactional environment is being used, in which case checkpoints cause a slowdown when flushing changes to disk.
When you say "documents", do you mean to say that you're using BDB for storing records larger than a few kbytes? BDB overflow pages have more overhead, and so you should consider using a larger page size.
This is an old question and the problem is probably gone, but I have recently had similar problems (insert speed dropping dramatically after a few hundred thousand records), and they were solved by giving more cache to the database (DB->set_cachesize). With 2 GB of cache the insert speed was very good and more or less constant up to 10 million records (I didn't test further).
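With the com.sleepycat.db bindings to the C library, that would look roughly like the sketch below; the environment path and sizes are made up, and the method names should be checked against your BDB version:

```java
import java.io.File;
import java.io.FileNotFoundException;
import com.sleepycat.db.*;

public class BdbCacheExample {
    public static void main(String[] args) throws DatabaseException, FileNotFoundException {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setInitializeCache(true);                   // DB_INIT_MPOOL
        envConfig.setCacheSize(2L * 1024 * 1024 * 1024);      // 2 GB cache, as in the answer above
        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setType(DatabaseType.HASH);                  // or DatabaseType.BTREE
        Database db = env.openDatabase(null, "docs.db", null, dbConfig);

        // ... put/get documents here ...

        db.close();
        env.close();
    }
}
```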
We have used BerkeleyDB (BDB) at work and have seen similar performance trends. BerkeleyDB uses a B-tree to store its key/value pairs. As the number of entries keeps increasing, the depth of the tree increases. BerkeleyDB's caching works by loading trees into RAM so that a tree traversal does not incur file I/O (reads from disk).
I need a disk-based key-value store that can sustain high write and read performance for large data sets.
Chronicle Map is a modern solution for this task. It's much faster than BerkeleyDB on both reads and writes, and is much more scalable in terms of concurrent access from multiple threads/processes.
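A minimal sketch, assuming Chronicle Map 3.x and made-up key/value types and sizes:

```java
import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

public class ChronicleMapExample {
    public static void main(String[] args) throws IOException {
        // Off-heap, persisted key-value store; the file survives JVM restarts.
        try (ChronicleMap<Long, byte[]> docs = ChronicleMap
                .of(Long.class, byte[].class)
                .entries(10_000_000)             // expected number of documents
                .averageValueSize(1024)          // expected average document size in bytes
                .createPersistedTo(new File("docs.dat"))) {
            docs.put(42L, new byte[]{1, 2, 3});
            byte[] doc = docs.get(42L);
        }
    }
}
```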
