I have two Java programs that are scheduled to run one after the other. When they run in sequence without an overlap, there are no issues. But sometimes one of them gets prolonged a little while more due to the volume it is processing and the second one starts before the first one ends. Now, when this happens at a point when the second one is half way through completion it crashes giving the Max no of files open exception. The first one finishes successfully though. When run separately, with the same volume, there are no issues with either of them. Both processes are completely independent of each other - no common resources, invoked from different scripts and are finally 2 different processes running on the same OS on 2 different JVM's - I use a HP-UNIX system and have tried to trace the open handles using the TUSC utility but there aren't any that could cause such a problem. Xmx is 2Gigs for both and I doubt that would be reached - but is there an explanation to this that I am not seeing? Can the parallel runs be an issue or is it just a coincidence?
The solution to your problem could be either to up your file descriptor limit (see link below) or to make sure that your code is properly closing resources (file input streams, sockets) so you are not leaking open file descriptors. I'll leave it up to you which approach you take.
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02261152/c02261152.pdf
Edit
If your program really is generating that many files at a time, I would certainly look into upping the open file descriptor limit. Then, you might also consider putting in a throttle/regulator into the code. Create X number of files and then back off for a few seconds, let the system reap the resources back and then continue on again. This might be helpful.
Also, this link sounds very similar to your issue:
Apache FOP 1.0 Multithreading - Too many open files err24
In a similar scenario and with resource constraints, we have used the below architecture.
There will be a monitor thread which will be started first which will have a count variable. There will be a configurable limit till which new processes will be created. For every new process, the count variable will be incremented. Once the process is completed, the count variable will be decremented. New processes will be created only if the count is lesser than the configured limit.
Above approach helped us to gain better control and able to scale up wherever it was possible.
Related
I have looked at examples that tell best practices for file write/create operations but have not seen an example that takes into consideration my requirements. I have to create a class which reads the contents of 1 file, does some data transformation, and then write the transformed contents to a different file then sends the file to a web service. Both files ultimately can be quite large like up to 20 MB and also it is unpredictable when these files will be created because they are generated by the user. Therefore it could be like 2 minutes between the time when this process occurs or it could be several all in the same second. The system is not like crazy in the sense that it could be like hundreds of these operations in the same second but it could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?
The first question you should ask is weather you need to write the file to the disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care weather the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk would be if you would communicate between your processes via disk files (i.e. your producer writes a file to disk, sends some notification to your consumer and afterwards your consumer reads the file from disk - for example based on a file name it receives from the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
Regarding adding multiple threads, you should test weather that improves your application performance or not. If your application is already I/O intensive, adding multiple threads will result in adding even more contention on your I/O streams, which would result in a performance degradation.
Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but may be overkill if you're needs are just a couple of files given the additional set-up.
Given the description you have given, I think you should perform the operations for each file on one thread; i.e. on thread will download the file, process it and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But in on a modern (e.g. Linux) operating system on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time.)
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)
I am working on an application which has to read and process ~29K files (~500GB) everyday. The files will be in zipped format and available on a ftp.
What I have done: I plan to download and the files from ftp, unzip it and process using multi-threading, which has reduced the processing time significantly (when number of active threads are fixed to a smaller number). I've written some code and tested it for ~3.5K files(~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that, each files have to be read line by line and each line has to be written to a new file with some modification(some information removed and some new information be added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow, to do it in memory.
Maybe there is a load of garbage collection. See if you can re-use stuff
Maybe the network is the bottleneck.. etc.
You can, for example, use visualvm.
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some Ideas:
You can parallelize the process which is necessary to process the read information. There you could provide multiple read lines to one thread (out of a pool), which processes these sequentially
Use java.nio instead of java.io see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of the profiler, simply write log messages and measure the
duration in multiple parts of your application
Optimize the Hardware (use SSD drives, expiriment with block size, filesystem, etc.)
If you are interested in parallel computing then please try Apache spark it is meant to do exactly what you are looking for.
I am stress-testing a domino java agent which modifies potentially many documents in potentially many databases. I am load-testing the agent with huge databases and my agents are being shut down by the agent manager as they last longer than the specified input in the server document 'Max LotusScript/Java execution time:'.
I am aware that I can write a program document to let the agent run without any timing but don't want to do this since you lose the handle to the agent.
I am aware that I need to program the agent so that I can save the 'task' document (which contains all the instructions for the agent) in an 'unfinished' state so that I can start from where I stopped.
Writing LotusScript agents, there was a possibility of writing cleanup code in the 'Terminate' event of the agent, and I am missing this option for my java agent.
At the moment my best idea is to have a 'timeout' field in my configuration, which would be filled by a value smaller than the server cut-off time. This would imply, however, that I would be asking at very regular intervals the question 'Do I still have time to start the next action?' which I assume is going to kill the performance.
What's your experience with best practise for this case?
Apart from DOTS and a Java Application approach, here are two other alternatives.
Option 1: This is where you want to use a program document and still have some visibility to interact with your agent.
Add checks in your code to check either a file on disk or a document field. If the file is there, or field set then tell your application to start cleaning up.
There would be more overhead on checking a document then checking a file on disk.
Option 2: You can use a java.util.Timer object.
Have this set to execute for ServerMaximumTimeout - X minute/s. In the timer code throw a TimeoutException. Have your main code catch this Exception and do the clean up.
Then in your finally block clean up the timer object if it hasn't died yet.
More details on this in another question.
I am writing code to transfer files to hadoop hdfs parallel. So I have many threads calling filesystem.copyFromLocalFile.
I think the cost of opening a filesystem is not small, so I just have one filesystem opened in my project. So I though there might be a a problem when so many threads calling it at the same time. But so far, it works fine with no problem.
Could anyone please give me some information about this copy method?
Thank you very much& have a great weekend.
I see the following design points to consider:
a) Where will be bottleneck of the process? I think in 2-3 parallel copy operations local disk or 1GB Ethernet will became a bottleneck. You can do it in form of multithreaded application or you can run a few processes. In any case I do not think you need a high level of parallelism.
b) Error handling. Failure of the one thread should not stop the whole process, and, in the same time file should not be lost. What I am usually doing in such cases is to agree that in a worst case file can be copied twice. If it is Ok - system can work in simple "copy then delete" scenario.
c) If you copy from the one of the cluster nodes - HDFS will became unbalanced, since one replica will be stored on the host from where you copy. You will need to do the balance constantly.
Can you tell me what more information you want about copyFromLocalFile()?
I'm not sure but I guess in your case, threads share the same resource among themselves. Since, you have only one instance of FileSystem, each thead will probably share this object in a time sharing basis among themselves.
We've developed a Java standalone program. We've configured in our Linux (RedHat ES 4) cron
schedule to execute this Java standalone every 10 minutes. Each standalone
may sometime take more than 1 hour to complete, or sometime it may complete
even within 5 minutes.
My problem/solution I'm looking for is, the number of Java standalones executing
at any time should not exceed, for example, 5 process. So, for example,
before even a Java standalone/process starts, if there are already 5 processes running,
then this process should not be started; otherwise this would indirectly start
creating OutOfMemoryError problems. How do I control this? I would also like to make this 5 process limit configurable.
Other Information:
I've also configured -Xms and -Xmx heap size settings.
Is there any tool/mechanism by which we can control this?
I also heard about Java Service Wrapper. What is this all about?
You can create 5 empty files (with names "1.lock",...,"5.lock") and make the app to lock one of them to execute (or exit if all files are already locked).
First, I am assuming you are using the words "thread" and "process" interchangably. Two ideas:
Have the cron job be a script that will check the currently running processes and count them. If less than threshold spawn new process, otherwise exit, here threshold can be defined in your script.
Have the main method in your executing java file check some external resource (a file, database table, etc) for a count of running processes, if it is below threshold increment and start process, otherwise exit (this is assuming the simple main method will not be enough to cause your OOME problem). You may also need to use an appropriate locking mechanism on the external resource (though if your job is every 10 minutes, this may be overkill), here you could defin threshold in a .properties, or some other configuration file for your program.
Java Service Wrapper helps you set up a java program as a Windows service or a *nix daemon. It doesn't really deal with the concurrency issue you are looking at--the closest thing is a config setting that disallows concurrent instances if its a Windows service.