Keeping memory mapped files from growing too large

I want to use memory-mapped I/O for communication between two of my applications (primarily to avoid the problem of sockets tending to leak to other computers on the network). However, one issue I am concerned about is storage space: as I continue writing commands to the file, that file is only going to get larger. Granted, most of the commands are short and it would take a few days of constant runtime for it to become a problem, but I would like to avoid it all the same. Is there a good way for me to periodically clear the file of "old" messages that my recipient application has already read, thus reclaiming disk space?

Related

What is the right way to create/write large files in Java that are generated by a user?

I have looked at examples of best practices for file write/create operations but have not seen one that takes my requirements into consideration. I have to create a class which reads the contents of one file, does some data transformation, writes the transformed contents to a different file and then sends that file to a web service. Both files can ultimately be quite large (up to 20 MB), and it is unpredictable when these files will be created because they are generated by the user. There could be 2 minutes between occurrences of this process, or there could be several in the same second. The system is not so busy that there could be hundreds of these operations in the same second, but there could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?

The first question you should ask is whether you need to write the file to the disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care whether the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk would be if you would communicate between your processes via disk files (i.e. your producer writes a file to disk, sends some notification to your consumer and afterwards your consumer reads the file from disk - for example based on a file name it receives from the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
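As a minimal sketch of the buffered approach described above, assuming the files are line-oriented text (the question does not say), the following reads, transforms and writes one line at a time so that only a small chunk is held in memory. The class name BufferedTransform and the lineTransform parameter are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.function.UnaryOperator;

// Hypothetical helper: streams text through a per-line transformation.
final class BufferedTransform {

    // Reads 'in' line by line, applies 'lineTransform' to each line and
    // writes the result to 'out'. Returns the number of lines processed.
    // Only one line (plus the buffers) is held in memory at a time.
    static long transform(Reader in, Writer out, UnaryOperator<String> lineTransform) {
        try {
            BufferedReader reader = new BufferedReader(in);
            BufferedWriter writer = new BufferedWriter(out);
            long lines = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(lineTransform.apply(line));
                writer.write('\n');
                lines++;
            }
            writer.flush(); // push buffered output to the underlying writer
            return lines;
        } catch (IOException e) {
            // Rethrown unchecked so the method is easy to call from a Runnable.
            throw new UncheckedIOException(e);
        }
    }
}
```

For real files, the reader and writer could come from Files.newBufferedReader and Files.newBufferedWriter inside a try-with-resources block, so that both files are closed even if the transformation fails.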
Regarding adding multiple threads, you should test whether that improves your application's performance or not. If your application is already I/O intensive, adding more threads adds even more contention on your I/O streams, which can degrade performance.

Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but given the additional set-up it may be overkill if your needs are just a couple of files.

Given your description, I think you should perform the operations for each file on one thread; i.e. one thread will download the file, process it and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
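A minimal sketch of that arrangement, under some assumptions: the handle method is a hypothetical stand-in for the real download / process / upload work, the tasks are submitted as Callable lambdas rather than plain Runnables so that results can be collected, and the system property name pipeline.threads is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical pipeline: one task per file, run on a bounded thread pool.
final class FilePipeline {

    // Stand-in for download / process / upload of a single file.
    // All three steps run on the same thread, one after the other.
    static String handle(String name) {
        String downloaded = "data:" + name;          // pretend download
        String processed = downloaded.toUpperCase(); // pretend transformation
        return processed;                            // pretend upload result
    }

    // Submits one task per name and waits for all of them to finish.
    static List<String> runAll(List<String> names, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : names) {
                futures.add(pool.submit(() -> handle(name)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // stop accepting new tasks; running ones finish
        }
    }

    public static void main(String[] args) {
        // Pool size exposed as a config property (property name is an assumption).
        int poolSize = Integer.getInteger("pipeline.threads", 4);
        System.out.println(runAll(java.util.Arrays.asList("a.txt", "b.txt"), poolSize));
    }
}
```

In a long-running application the pool would normally be created once and shared, rather than created and shut down per batch as in this sketch.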
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But on a modern operating system (e.g. Linux) on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time).
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)

Storing 1 MB byte array as session attribute

I am running a Java web app.
A user uploads a file (max 1 MB) and I would like to store that file until the user completes an entire process (which consists of multiple requests).
Is it ok to store the file as a byte array in the session until the user completes the entire process? Or is this expensive in terms of resources used?
The reason I am doing this is because I ultimately store the file on an external server (eg aws s3) but I only want to send it to that server if the whole process is completed.
Another option would be to just write the file to a temporary file on my server. However, this means I would need to remove the file in case the user exits the website. But it seems excessive for me to add code to the SessionDestroyed method in my SessionListener which removes the file if it’s just for this one particular case (ie: sessions are created throughout my entire application where I don’t need to check for temp files).
Thanks.

Maybe Yes, maybe No
Certainly it is reasonable to store such data in memory in a session if that fits your deployment constraints.
Remember that each user has their own session. So if all of your users have such a file in their session, then you must multiply to calculate the approximate impact on memory usage.
If you exceed the amount of memory available at runtime, there will be consequences. Your Servlet container may serialize less-used sessions to storage, which is a problem if you’ve not programmed all of your objects to support serialization. The JVM and OS may use a swap file to move contents out of real memory as part of the virtual memory system. That swapping may impact or even cripple performance.
You must consider your runtime deployment constraints, which you did not disclose. Are you running on a Raspberry Pi or inexpensive little cloud server with little memory available? Or will you run on an enterprise-class server with half a terabyte of RAM? Do you have 3 users, 300, or 30,000? You need to crunch the numbers and determine your needs, and maybe do some runtime profiling to see actual usage.
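To make the "crunch the numbers" step concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case, where every concurrent session holds a full 1 MB upload, and it ignores per-object and per-session overhead. The class and method names are hypothetical.

```java
// Hypothetical helper for a worst-case estimate of session memory.
final class SessionMemoryEstimate {

    // Worst case: every concurrent session holds 'bytesPerSession' bytes.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long worstCaseBytes(long concurrentSessions, long bytesPerSession) {
        return Math.multiplyExact(concurrentSessions, bytesPerSession);
    }

    public static void main(String[] args) {
        long oneMiB = 1024L * 1024L;
        // The user counts mentioned above: 3, 300 and 30,000.
        for (long users : new long[] {3, 300, 30_000}) {
            long bytes = worstCaseBytes(users, oneMiB);
            System.out.println(users + " users -> " + (bytes / oneMiB) + " MiB");
        }
    }
}
```

With these assumptions, 3 users need about 3 MiB, 300 users about 300 MiB, and 30,000 users about 30,000 MiB (roughly 29 GiB), which is the difference between a non-issue and a sizing decision.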
For example… I write web apps using the Vaadin Framework, a sophisticated package for creating desktop-style apps within a web browser. Being Servlet-based, Vaadin maintains a complete representation of each user’s entire work data on the server-side in the Servlet session. Multiplied by the number of users, and depending on the complexity of the app, this may require much memory. So I need to account for this and run my server on sufficient hardware with 64-bit Java tuned to run with a large amount of memory. Or take other approaches such as load-balancing across multiple servers with sticky sessions.
Fortunately, RAM is quite cheap nowadays. And 64-bit hardware with large physical support for RAM modules, 64-bit operating systems, and 64-bit JVM implementations ( Azul, others ) are all readily available.

out-of-memory error -- why not paging?

Out-of-memory errors occur frequently in Java programs. My question is simple: when the memory limit is exceeded, why does Java kill the program directly rather than swap it out to disk? I think memory paging/swapping is frequently used in modern operating systems, and programming languages like C++ definitely support swapping. Thanks.

@Pimgd is sorta on track, but @Kayaman is right. Java doesn't handle memory besides requesting it from the system. C++ doesn't support swapping; it requests memory from the OS and the OS does the swapping. If you request enough memory for your application with -Xmx, it might start swapping because the OS thinks it can.
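To see the limit the JVM itself enforces, a small sketch like the following prints the maximum heap the JVM will try to use (which -Xmx controls) next to what is currently in use. The class name HeapLimits is hypothetical; the Runtime methods are standard.

```java
// Hypothetical helper that reports the JVM's own heap limit and usage.
final class HeapLimits {

    // Maximum heap the JVM will attempt to use; this is what -Xmx sets.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Heap currently occupied: memory obtained from the OS minus the free part.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        System.out.println("max heap:  " + (maxHeapBytes() / mib) + " MiB");
        System.out.println("used heap: " + (usedHeapBytes() / mib) + " MiB");
    }
}
```

Running it once with the default settings and once with a different -Xmx value shows that the OutOfMemoryError boundary is a JVM setting; whether the memory below that boundary is resident or swapped out is up to the OS.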

Because Java is cross-platform. There might not be a disk.
Other reasons could be that such a thing would affect performance and the developers didn't want that to happen (because Java already carries a performance overhead?).

A few words about paging. Virtual memory using paging - storing 4K (or similar) chunks of any program that runs on a system - is something an operating system can or cannot do. The promise of an address space only limited by the capacity of a machine word used to store an address sounds great, but there's a severe downside, which is called thrashing. This happens when the number of page (re)loads exceeds a certain frequency, which in turn is due to too many processes requesting too much memory in combination with non-locality of memory accesses of those processes. (A process has good locality if it can execute long stretches of code while accessing only a small percentage of its pages.)
Paging also requires (fast) secondary storage.
The ability to limit your program's memory resources (as in Java) is not only a burden; it must also be seen as a blessing when some overall plan for resource usage needs to be devised for a, say, server system.

top reasons why an app server crashes

What are the most likely causes for application server failure?
For example: "out of disk space" is more likely than "2 of the drives in a RAID 4 setup die simultaneously".
My particular environment is Java, so Java-specific answers are welcome, but not required.
EDIT: Just to clarify, I'm looking for downtime-related crashes (out of memory is a good example), not just one-time issues (like a temporary network glitch).

If you are trying to keep an application server up, start monitoring it. Nagios, Big Sister, and other Network Monitoring tools can be very useful.
Watch memory availability / usage, disk availability / usage, cpu availability / usage, etc.
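As a rough illustration of the kind of figures worth watching, here is a sketch that takes a one-off snapshot from inside a Java process using only standard library calls. It is not a substitute for a monitoring tool such as Nagios; the class name ResourceSnapshot and the threshold method are hypothetical.

```java
import java.io.File;
import java.lang.management.ManagementFactory;

// Hypothetical one-off snapshot of memory, disk and CPU figures.
final class ResourceSnapshot {
    final long usedHeapBytes;
    final long maxHeapBytes;
    final long usableDiskBytes;
    final long totalDiskBytes;
    final int processors;
    final double loadAverage; // negative when the platform does not report it

    private ResourceSnapshot(File volume) {
        Runtime rt = Runtime.getRuntime();
        usedHeapBytes = rt.totalMemory() - rt.freeMemory();
        maxHeapBytes = rt.maxMemory();
        usableDiskBytes = volume.getUsableSpace();
        totalDiskBytes = volume.getTotalSpace();
        processors = rt.availableProcessors();
        loadAverage = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }

    // Takes a snapshot for the volume that contains the given path.
    static ResourceSnapshot take(File volume) {
        return new ResourceSnapshot(volume);
    }

    // True when the used fraction of the volume is at or above 'threshold' (0.0 to 1.0).
    boolean diskAlmostFull(double threshold) {
        if (totalDiskBytes <= 0) {
            return false; // size unknown, so no verdict
        }
        double usedFraction = (double) (totalDiskBytes - usableDiskBytes) / totalDiskBytes;
        return usedFraction >= threshold;
    }

    public static void main(String[] args) {
        ResourceSnapshot s = take(new File("."));
        System.out.println("heap used/max: " + s.usedHeapBytes + " / " + s.maxHeapBytes);
        System.out.println("disk usable/total: " + s.usableDiskBytes + " / " + s.totalDiskBytes);
        System.out.println("processors: " + s.processors + ", load average: " + s.loadAverage);
        System.out.println("disk over 90% full: " + s.diskAlmostFull(0.9));
    }
}
```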
The most common reason why a server goes down is rarely the same reason twice. Someone "fixes" the last-most-common-reason, and a new-most-common-reason is born.

Edwin is right - you need monitoring to understand what the problem is. Or better - understand what the problem is AND prevent it from causing downtime.
You should not only track resource consumption but also demand. The difference between the two shows you if you have sized your server correctly.
There are a ton of open source tools like Nagios, CollectD, etc. that can give you server-specific data - that's only monitoring though, not prevention. Librato Silverline (disclosure: I work there) allows you to monitor individual processes and then throttle the resources they use by placing them in application containers for which you define resource policies.
If your server is 8 cores or less you can use it for free.

"Out of Memory" exception due to memory leaks.

All sorts of things can cause a server to crash, ranging from busted hardware (e.g. disk failures) to faulty code (memory leak resulting in an out of memory exception, network failure that got rethrown as a runtime exception and was never caught, in servers that aren't Java servers a SEGFAULT, etc.)

At first, it is usually because of memory leaks, disk space problems, or endless loops eating up CPU.
Once you monitor those issues and set up correct logging and warning mechanisms, they turn meta on you... and exploding error handling becomes a possible reason for a full lockup: an error (or more likely: two in an unhappy combination) occurs but when the handler is trying to write to the logfiles or send a warning (by mail or something) it gets another error which it is trying to handle by writing to the logfile or sending a warning or... and this continues until one of the resources gives out: it may lead to skyrocketing server load, memory problems, filling disk space, locking up network traffic which means it won't be accessible for a remote user to correct the problem, etc.
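One possible mitigation for that feedback loop, offered only as a sketch and not as something the answer above prescribes, is a re-entrancy guard: if reporting an error itself fails or triggers another error on the same thread, the second error is counted and dropped instead of being reported again. The class name GuardedErrorHandler and the reporter callback are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical error handler that refuses to report errors about reporting.
final class GuardedErrorHandler {
    private final Consumer<Throwable> reporter; // e.g. writes a log line or sends a mail
    private final ThreadLocal<Boolean> reporting = ThreadLocal.withInitial(() -> false);
    private final AtomicInteger dropped = new AtomicInteger();

    GuardedErrorHandler(Consumer<Throwable> reporter) {
        this.reporter = reporter;
    }

    void handle(Throwable error) {
        if (reporting.get()) {
            // Already inside the reporter on this thread: break the loop here.
            dropped.incrementAndGet();
            return;
        }
        reporting.set(true);
        try {
            reporter.accept(error);
        } catch (RuntimeException reporterFailure) {
            // The reporter itself failed; count it, but do not try to report it.
            dropped.incrementAndGet();
        } finally {
            reporting.set(false);
        }
    }

    // Number of errors that were dropped rather than reported.
    int droppedCount() {
        return dropped.get();
    }
}
```

The dropped counter gives a cheap signal that something is wrong with the reporting path without using the logfile or mail channel that may be the thing that is broken.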

Solaris: virtual slices/disks for use with ZFS

This is a little related to my previous question, "Solaris: Mounting a file system on an application's handlers", except that this question is for a different purpose and is simpler: there is no open/close/lock, just a fixed-length block of bytes with read/write operations.
Is there any way I can create a virtual slice, kind of like a RAM disk or an SVM slice, but with the reads and writes going through my app?
I am planning to use ZFS to take multiple of these virtual slices/disks and make them into one larger one for distributed backup storage with snapshots. I really like the compression and stacking that ZFS offers. If necessary I can guarantee that there is only one instance of ZFS accessing these virtual disks at a time (to prevent cache conflicts and such). If the one instance goes down, we can make sure it won't start back up and then we can start another instance of that ZFS.
I am planning to have those disks in chunks of about 4GB or so. Then I can move each chunk around and decide where to store it (mirrored multiple times, of course) and then have ZFS access the chunks and put them together into larger chunks for actual use. ZFS would also permit adding more of these small chunks if necessary to increase the size of the larger chunk.
I am aware there would be extra latency / network traffic if we used my own app in Java, but this is just for backup storage. The production storage is entirely different configuration that does not relate.
Edit: We have a system that uses all the space available and basically when there is not enough space it will remove old snapshots and increase the gaps between old snapshots. The purpose of my proposal is to allow the unused space from production equipment to be put to use at no extra cost. At different times different units of our production equipment will have free space. Also the system I am describing should eliminate any single point of failure when attempting to access data. I am hoping to not have to buy two large units and keep them synchronized. I would prefer just to have two access points and then we can mix large/small units in any way we want and move data around seamlessly.
This is a cross-post because this is more software related than sysadmin related. The original question is here: https://serverfault.com/questions/212072. It may be a good idea for the original to be closed.

One way would be to write a Solaris device driver, precisely a block device one emulating a real disk but that will communicate back to your application instead.
Start with reading the Device Driver Tutorial, then have a look at OpenSolaris source code for real drivers code.
Alternatively, you might investigate modifying Solaris iSCSI target to be the interface with your application. Again, looking at OpenSolaris COMSTAR will be a good start.

It seems that any fixed-length file on any file system will do as a block device for use with ZFS. I'm not sure how reboots work, but I am sure we can write some boot-up commands to work that out.
Edit: The fixed length file would be on a network file system such as NFS.
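Since the application in question is written in Java, creating such a fixed-length backing file could be sketched as follows. The class name BackingFile is hypothetical, and the sketch makes no claim about how ZFS treats the result: depending on the underlying file system the file may be sparse, and whether that is acceptable for this use (especially over NFS) would need to be checked.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;

// Hypothetical helper that creates a file of an exact length.
final class BackingFile {

    // Creates the file if needed and sets its length to exactly 'sizeBytes'.
    // The file system decides whether the space is really allocated or left sparse.
    static void create(Path path, long sizeBytes) {
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "rw")) {
            file.setLength(sizeBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the roughly 4GB chunks mentioned in the question, the size argument would be 4L * 1024 * 1024 * 1024.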
