I have a Java program that creates a log file about 1 KB in size. If I run a test that deletes the old log, creates a new log, and saves it, repeated a million times, and the file size grows over time (up to a few MB), will I risk damaging my SSD? Is there a size limit for the log file that would avoid this risk, or can anyone help me understand the mechanics of the risk?
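In code, the pattern being described looks roughly like this (class, file name, and iteration count are placeholders for the real test):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LogChurn {
    // Delete the old log, write a fresh one, repeat.
    static void churn(Path log, int iterations, int entrySize) throws IOException {
        byte[] entry = new byte[entrySize];
        for (int i = 0; i < iterations; i++) {
            Files.deleteIfExists(log); // the filesystem may issue TRIM here
            Files.write(log, entry);   // new log created and saved
        }
    }

    public static void main(String[] args) throws IOException {
        // The question describes ~1 KB logs rewritten a million times;
        // a smaller count is used here for a quick trial.
        churn(Paths.get("app.log"), 1000, 1024);
    }
}
```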
In the case of constantly opening and closing the same file with a gradual file size increase, there are two protection mechanisms, at the file system and SSD levels, that will prevent early disk failure.
First, on every file delete, the file system will issue a Trim (aka Discard, aka Logical Erase) command to the SSD. The Trim address range covers the entire size of the deleted file. Trim greatly helps the SSD reclaim free space for new data. Using Trim in combination with writes when accessing the same data range is the best operational mode for an SSD in terms of preserving its endurance. Just make sure that your OS has Trim enabled (usually it is by default). All modern SSDs should support it as well. Important note: Trim is a logical erase; it does not trigger an immediate physical media erase. The physical erase happens later, indirectly, as part of the SSD's internal garbage collection.
Second, when accessing the same file, the file system will most likely issue writes to the SSD at the same addresses; only the amount of data written will grow as the file grows. This pattern is known as Hot Range access, and it is a nasty pattern for an SSD in terms of endurance. The SSD has to allocate free resources (physical pages) on every file write, but the lifetime of the data is very short because it is deleted almost immediately. Overall, the amount of unique data on the physical media is very low, but the amount of allocated and processed resources (physical pages) is huge. Modern SSDs protect against Hot Range access by using physical media units in a round-robin manner, which evens out the wear.
I advise monitoring the SSD's SMART health data (the lifetime-left parameter), for example with https://www.smartmontools.org/ or software provided by your SSD vendor. It will help you see how your access pattern is affecting endurance.
Like with any file, if the disk doesn't contain enough space to write to a file, the OS (or Java) won't allow the file to be written until space is cleared. The only way you can "screw up" a disk in this manner is if you mess around with addresses at the kernel level.
Related
I have just encountered an error in my open-source library code, which allocates a large buffer for making modifications to a large FLAC file. The error only occurs on an old PC with 3 GB of memory running 32-bit Java 1.8.0_74 (25.74-b02).
Originally I just allocated a buffer:
ByteBuffer audioData = ByteBuffer.allocateDirect((int)(fc.size() - fc.position()));
But for some time I have had it as:
MappedByteBuffer mappedFile = fc.map(MapMode.READ_WRITE, 0, totalTargetSize);
My (mis)understanding was that mapped buffers use less memory than a direct buffer because the whole mapped buffer doesn't have to be in memory at the same time, only the part being used. But this answer says that using mapped byte buffers is a bad idea, so I'm not quite clear on how it works:
Java Large File Upload throws java.io.IOException: Map failed
The full code can be seen at here
Although a mapped buffer may use less physical memory at any one point in time, it still requires an available (logical) address space equal to the total (logical) size of the buffer. To make things worse, it might (and probably does) require that address space to be contiguous. For whatever reason, that old computer appears unable to provide sufficient additional logical address space. Two likely explanations are (1) a limited logical address space plus hefty buffer memory requirements, and (2) some internal limitation that the OS is imposing on the amount of memory that can be mapped as a file for I/O.
Regarding the first possibility, consider the fact that in a virtual memory system every process executes in its own logical address space (and so has access to the full 2^32 bytes worth of addressing). So if--at the point in time in which you try to instantiate the MappedByteBuffer--the current size of the JVM process plus the total (logical) size of the MappedByteBuffer is greater than 2^32 bytes (~ 4 gigabytes), then you would run into an OutOfMemoryError (or whatever error/exception that class chooses to throw in its stead, e.g. IOException: Map failed).
Regarding the second possibility, probably the easiest way to evaluate this is to profile your program / the JVM as you attempt to instantiate the MappedByteBuffer. If the JVM process' allocated memory + the required totalTargetSize are well below the 2^32 byte ceiling, but you still get a "map failed" error, then it is likely that some internal OS limit on the size of memory-mapped files is the root cause.
So what does this mean as far as possible solutions go?
Just don't use that old PC. (preferable, but probably not feasible)
Make sure everything else in your JVM has as low a memory footprint as possible for the lifespan of the MappedByteBuffer. (plausible, but maybe irrelevant and definitely impractical)
Break that file up into smaller chunks, then operate on only one chunk at a time. (might depend on the nature of the file)
Use a different / smaller buffer. ...and just put up with the decreased performance. (this is the most realistic solution, even if it's the most frustrating)
Also, what exactly is the totalTargetSize for your problem case?
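A rough sketch of the "smaller chunks with a reusable buffer" approach from the list above (chunk size and names are arbitrary choices, not from the original code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedProcessor {
    static final int CHUNK = 16 * 1024 * 1024; // 16 MB: small relative to a 32-bit address space

    static void process(String file) throws IOException {
        try (FileChannel fc = FileChannel.open(Paths.get(file),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(CHUNK); // one reusable buffer
            long pos = 0, size = fc.size();
            while (pos < size) {
                buf.clear();
                int read = fc.read(buf, pos); // positional read of one chunk
                buf.flip();
                // ... modify the chunk's contents here ...
                fc.write(buf, pos);           // write the chunk back in place
                pos += read;
            }
        }
    }
}
```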
EDIT:
After doing some digging, it seems clear that the IOException is due to running out of address space in a 32-bit environment. This can happen even when the file itself is under 2^32 bytes either due to the lack of sufficient contiguous address space, or due to other sufficiently large address space requirements in the JVM at the same time combined with the large MappedByteBuffer request (see comments). To be clear, an IOE can still be thrown rather than an OOM even if the original cause is ENOMEM. Moreover, there appear to be issues with older [insert Microsoft OS here] 32-bit environments in particular (example, example).
So it looks like you have three main choices.
Use "the 64-bit JRE or...another operating system" altogether.
Use a smaller buffer of a different type and operate on the file in chunks. (and take the performance hit due to not using a mapped buffer)
Continue to use the MappedByteBuffer for performance reasons, but also operate on the file in smaller chunks in order to work around the address space limitations.
The reason I put using MappedByteBuffer in smaller chunks third is because of the well-established and unresolved problems with unmapping a MappedByteBuffer (example), which is something you would necessarily have to do between processing each chunk in order to avoid hitting the 32-bit ceiling due to the combined address space footprint of the accumulated mappings. (NOTE: this only applies if the problem is the 32-bit address space ceiling and not some internal OS restriction; if the latter, then ignore this paragraph.) You could attempt this strategy (delete all references, then run the GC), but you would essentially be at the mercy of how the GC and your underlying OS interact regarding memory-mapped files. And other potential workarounds that attempt to manipulate the underlying memory-mapped file more-or-less directly (example) are exceedingly dangerous and specifically condemned by Oracle (see last paragraph). Finally, considering that GC behavior is unreliable anyway, and moreover that the official documentation explicitly states that "many of the details of memory-mapped files [are] unspecified", I would not recommend using MappedByteBuffer like this regardless of any workarounds you may read about.
So unless you're willing to take the risk, I'd suggest either following Oracle's explicit advice (point 1), or processing the file as a sequence of smaller chunks using a different buffer type (point 2).
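For completeness, mapping the file one chunk at a time might look like the sketch below (names are illustrative; as discussed above, you have no supported way to unmap each chunk, so in practice you are at the GC's mercy for when each mapping is released):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedMapping {
    static void process(String file, long chunkSize) throws IOException {
        try (FileChannel fc = FileChannel.open(Paths.get(file),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long size = fc.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                MappedByteBuffer chunk = fc.map(MapMode.READ_WRITE, pos, len);
                // ... operate on this chunk ...
                chunk.force(); // flush this chunk's changes to disk
                // There is no supported way to unmap explicitly; the mapping
                // lingers until the buffer object is garbage-collected.
            }
        }
    }
}
```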
When you allocate a buffer, you basically get a chunk of virtual memory from your operating system (and this virtual memory is finite; the theoretical upper bound is your RAM plus whatever swap is configured, minus whatever was grabbed first by other programs and the OS).
A memory map just adds the space occupied by your on-disk file to your virtual memory (OK, there is some overhead, but not that much), so you can get more of it.
Neither of these has to be present in RAM constantly; parts of either can be swapped out to disk at any given time.
I have a Java process and I start it (as suggested here: parameters for FR) with the options:
-XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=2m,filename=myflightrecord.jfr -XX:FlightRecorderOptions=maxsize=100k,maxage=1m
in order to have Flight Recorder information.
I would expect maxage=1m to give me only one minute of recording, and maxsize=100k to keep the file from growing larger than 100 KB, but neither of them works as expected.
Another problem I encounter is that I want the file to be stored periodically, let's say every minute. But the file "myflightrecord.jfr" stays empty until the duration is reached (2 minutes in the example).
Is there any way to make the Flight recorder flush before the end of the duration?
PS: The version of Java I am using is JDK 1.8.0_45.
This is for JDK 7 and JDK 8 (Hotspot), and JDK 5 and 6 (JRockit).
First, maxsize and maxage only work if you have a disk-based recording, since these parameters control how much data to keep on disk.
If you have an in-memory recording (defaultrecording=true,disk=false), the size of the memory buffers depends on the number of threads that are in use, how much memory each thread is allowed to use, number of global buffers etc.
Flight Recorder was designed for large servers with GBs of memory and TBs of disk, so I don't think the JVM will be able to respect the number you provided; a single event could in principle be larger than 100 KB, though typically events are about 50-150 bytes.
Second, the names of the parameters (maxsize and maxage) are misleading. They are not a maximum size/age, but the threshold at which the JVM will remove a log file when the files are rotated, which typically happens every 12 MB. To minimize overhead, the JVM doesn't stop all threads immediately when the threshold is met, which means data spills over, so in reality it is 12-15 MB. If the system is highly saturated, it could be a lot more, think 30-40 MB.
So setting maxsize to 100k will not work; you will always get at least 12 MB.
If you set maxage to 1 minute you will get data for at least one minute, perhaps more if it can fit within the size of about one log file, 12-15 MB.
If you have an in-memory recording, the data is copied from the memory buffers to disk when the recording ends. That's why your file is empty. If you want Flight Recorder to write continuously to disk, you should set disk=true.
The maxage and maxsize options only apply when you have a continuous recording (= have not set duration). I also think that they are only guidelines, not exact limits.
If you want to get the data flushed to disk for a continuous recording, you can set disk=true, and if you want to specify where the data should end up, you can set repository=path
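Putting that together, a continuously flushed, disk-backed recording might be started with something like the options below (the repository path is a placeholder; I believe these option names match the old JDK 7/8 Hotspot JFR syntax, but check the java launcher documentation for your exact version):

```
-XX:+UnlockCommercialFeatures -XX:+FlightRecorder
-XX:FlightRecorderOptions=defaultrecording=true,disk=true,repository=/tmp/jfr,maxage=1m
```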
(I believe the data will only be flushed to disk when the in-memory buffers are full; I'm not sure if it's when the thread-local buffers are full or when the global buffers are full. See slide 13 in this slide deck for a picture describing this: http://www.slideshare.net/marcushirt/java-mission-control-java-flight-recorder-deep-dive)
See the -XX:FlightRecorderOptions section of https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html for more info. You can check threadbuffersize and globalbuffersize as well.
I know the valid combinations of flags have varied a bit, so the documentation might not be entirely up to date.
Kire Haglin can correct me where I've misunderstood things.
I'm writing some code to access an inverted index.
I have two interchangeable classes which perform the reads on the index. One reads the index from disk, buffering part of it. The other loads the index completely into memory, as a byte[][] (the index size is around 7 GB), and reads from this multidimensional array.
One would expect better performance with the whole data set in memory. But my measurements show that working with the index on disk is as fast as having it in memory.
(The time spent to load the index in memory isn't counted in the performances)
Why is this happening? Any ideas?
Further information: I've run the code with HPROF enabled. Whether working "on disk" or "in memory", the most-used code is NOT the code directly related to the reads. Also, to my (limited) understanding, the GC profiler doesn't show any GC-related issue.
UPDATE #1: I've instrumented my code to monitor I/O times. It seems that most of the seeks in memory take 0-2000 ns, while most of the seeks on disk take 1000-3000 ns. The second figure seems a bit too low to me. Is it due to disk caching by Linux? Is there a way to exclude disk caching for benchmarking purposes?
UPDATE #2: I've graphed the response time for every request to the index. The lines for memory and disk match almost exactly. I've done some other tests using the O_DIRECT flag to open the file (thanks to JNA!), and in that case the disk version of the code is (obviously) slower than the memory one. So I'm concluding that the "problem" was caused by aggressive Linux disk caching, which is pretty amazing.
UPDATE #3: http://www.nicecode.eu/java-streams-for-direct-io/
Three possibilities off the top of my head:
The operating system is already keeping all of the index file in memory via its file system cache. (I'd still expect an overhead, mind you.)
The index isn't the bottleneck of the code you're testing.
Your benchmarking methodology isn't quite right. (It can be very hard to do benchmarking well.)
The middle option seems the most likely to me.
No, disk can never be as fast as RAM (RAM is actually in the order of 100,000 times faster for magnetic discs). Most likely the OS is mapping your file in memory for you.
I have a linux box with 32GB of ram and a set of 4 SSD in a raid 0 config that maxes out at about 1GB of throughput (random 4k reads) and I am trying to determine the best way of accessing files on them randomly and conccurently using java. The two main ways I have seen so far are via random access file and mapped direct byte buffers.
Here's where it gets tricky, though. I have my own memory cache for objects, so any call to the objects stored in a file should go through to disk and not paged memory (I have disabled the swap space on my Linux box to prevent this). Although mapped direct memory buffers are supposedly the fastest, they rely on swapping, which is not good because: A) I am using all the free memory for the object cache, and using MappedByteBuffers instead would incur a massive serialization overhead, which is exactly what the object cache is there to prevent (my program is already CPU-limited); B) with MappedByteBuffers the OS handles the details of when data is written to disk, and I need to control this myself, i.e. when I write(byte[]) it should go straight out to disk instantly, to prevent data corruption in case of power failure, since I am not using ACID transactions.
On the other hand, I need massive concurrency, i.e. I need to read and write to multiple locations in the same file at the same time (using offset/range locks to prevent data corruption). I'm not sure how I can do this without MappedByteBuffers; I could always just queue the reads/writes, but I'm not sure how that would affect my throughput.
Finally, I cannot have a situation where I am creating new byte[] objects for reads or writes, because I perform almost 100,000 read/write operations per second; allocating and garbage-collecting all those objects would kill my program, which is time-sensitive and already CPU-limited. Reusing byte[] objects is fine, though.
Please do not suggest any DB software, as I have tried most of them and they add too much complexity and CPU overhead.
Anybody had this kind of dilemma?
Whilst mapped direct memory buffers are supposedly the fastest they rely on swapping
No, not if you have enough RAM. The mapping associates pages in memory with pages on disk. Unless the OS decides that it needs to recover RAM, the pages won't be swapped out. And if you are running short of RAM, all that disabling swap does is cause a fatal error rather than a performance degradation.
I am using all the free memory for the object cache
Unless your objects are extremely long-lived, this is a bad idea because the garbage collector will have to do a lot of work when it runs. You'll often find that a smaller cache results in higher overall throughput.
with mappedbytebuffers the OS handles the details of when data is written to disk, I need to control this myself, ie. when I write(byte[]) it goes straight out to disk instantly
Actually, it doesn't, unless you've mounted your filesystem with the sync option. And then you still run the risk of data loss from a failed drive (especially in RAID 0).
I'm not sure how I can do this without mappedbytebuffers
A RandomAccessFile will do this. However, you'll be paying for at least a kernel context switch on every write (and if you have the filesystem mounted for synchronous writes, each of those writes will involve a disk round-trip).
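A sketch of that approach (class and method names here are illustrative, not a definitive design): a FileChannel obtained from a RandomAccessFile opened in "rwd" mode supports concurrent positional reads/writes, lets the caller reuse buffers, and asks the OS to write file content synchronously on each update.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PositionalStore {
    private final FileChannel channel;

    PositionalStore(String path) throws IOException {
        // "rwd": every update to the file's content is written synchronously
        channel = new RandomAccessFile(path, "rwd").getChannel();
    }

    // Positional write: no shared file pointer, so concurrent calls at
    // different offsets are safe, and the caller reuses its own buffer.
    void write(ByteBuffer reusable, long offset) throws IOException {
        while (reusable.hasRemaining()) {
            offset += channel.write(reusable, offset);
        }
    }

    void read(ByteBuffer reusable, long offset) throws IOException {
        int n;
        while (reusable.hasRemaining() && (n = channel.read(reusable, offset)) >= 0) {
            offset += n; // advance past the bytes already read
        }
    }
}
```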
I am not using ACID transactions
Then I guess the data isn't really that valuable. So stop worrying about the possibility that someone will trip over a power cord.
Your objections to mapped byte buffers don't hold up. Your mapped files will be distinct from your object cache, and though they take address space they don't consume RAM. You can also sync your mapped byte buffers whenever you want (at the cost of some performance). Moreover, random access files end up using the same apparatus under the covers, so you can't save any performance there.
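Syncing a mapped buffer on demand looks like this (a minimal sketch; the file name and size are placeholders):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedSync {
    public static void main(String[] args) throws IOException {
        try (FileChannel fc = FileChannel.open(Paths.get("data.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = fc.map(MapMode.READ_WRITE, 0, 4096);
            map.put(0, (byte) 42);  // mutate through the mapping
            map.force();            // flush the dirty pages to the device now
        }
    }
}
```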
If mapped byte buffers aren't getting you the performance you need, you might have to bypass the filesystem and write directly to raw partitions (which is what DBMSs do). To do that, you probably need to write C++ code for your data handling and access it through JNI.
I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.
I'm trying the C BerkeleyDB (5.1.25) library from java and I'm seeing serious performance problems.
I get solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents the performance drops like a rock, then it recovers for a while, then drops again, etc. This happens more and more frequently, up to the point where most of the time I can't get more than 60 docs/s with a few isolated peaks of 12K docs/s after 10 million docs. My db type of choice is HASH but I also tried BTREE and it is the same.
I tried using a pool of 10 db's and hashing the docs among them to smooth out the performance drops; this increased the write throughput to 50K docs/s but didn't help with the performance drops: all 10 db's slowed to a crawl at the same time.
I presume that the files are being reorganized, and I tried to find a config parameter that affects when this reorganization takes place, so each of the pooled db's would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, reserving space using the setHashNumElements config option so it wouldn't spend time growing the file, but every tweak made it much worse.
I'm about to give berkeleydb up and try much more complex solutions like cassandra, but I want to make sure I'm not doing something wrong in berkeleydb before writing it off.
Anybody here with experience achieving sustained write performance with berkeleydb?
Edit 1:
I tried several things already:
Throttling the writes down to 500/s (less than the average I got after writing 30 million docs in 15 hours, which indicates the hardware is capable of writing 550 docs/s). Didn't work: once a certain number of docs has been written, performance drops regardless.
Write incoming items to a queue. This has two problems: A) It defeats the purpose of freeing up ram. B) The queue eventually blocks because the periods during which BerkeleyDB freezes get longer and more frequent.
In other words, even if I throttle the incoming data to stay below the hardware capability and use ram to hold items while BerkeleyDB takes some time to adapt to the growth, as this time gets increasingly longer, performance approaches 0.
This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...
Edit 2:
After giving it some more thought and with Peter's input, I now understand that as the file grows larger, a batch of writes will get spread farther apart and the likelihood of them falling into the same disk cylinder drops, until it eventually reaches the seeks/second limitation of the disk.
But BerkeleyDB's periodic file reorganizations are killing performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer periods of time while it shuffles stuff around. Using faster disks or spreading the database files among different disks does not help. I need to find a way around those throughput holes.
What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning-fast performance up to that point), but once it fills up, the application, and even the whole system, can slow dramatically, even stop.
Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion supported by caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.
I suggest you consider a disk controller cache. Its battery-backed memory would need to be about the size of your data.
Another option is to use SSD drives if the updates are bursty (they can do 10K+ writes per second as they have no moving parts). With caching, this should give you more than you need, but SSDs have a limited number of writes.
BerkeleyDB does not perform file reorganizations, unless you're manually invoking the compaction utility. There are several causes of the slowdown:
Writes to keys in random access fashion, which causes much higher disk I/O load.
Writes are durable by default, which forces a lot of extra disk flushes.
A transactional environment is being used, in which case checkpoints cause a slowdown when flushing changes to disk.
When you say "documents", do you mean that you're using BDB to store records larger than a few kilobytes? BDB overflow pages have more overhead, so you should consider using a larger page size.
This is an old question and the problem is probably gone, but I have recently had similar problems (speed of insert dropping dramatically after few hundred thousand records) and they were solved by giving more cache to the database (DB->set_cachesize). With 2GB of cache the insert speed was very good and more or less constant up to 10 million records (I didn't test further).
We have used BerkeleyDB (BDB) at work and have seen similar performance trends. BerkeleyDB uses a B-tree to store its key/value pairs. As the number of entries keeps increasing, the depth of the tree increases. BerkeleyDB's caching works by loading trees into RAM so that a tree traversal does not incur file I/O (reading from disk).
I need a disk-based key-value store that can sustain high write and read performance for large data sets.
Chronicle Map is a modern solution for this task. It's much faster than BerkeleyDB on both reads and writes, and is much more scalable in terms of concurrent access from multiple threads/processes.