Using a concurrent hashmap to reduce memory usage with threadpool? - java

I'm working with a program that runs lengthy SQL queries and stores the processed results in a HashMap. Currently, to get around the slow execution time of each of the 20-200 queries, I am using a fixed thread pool and a custom callable to do the searching. As a result, each callable is creating a local copy of the data which it then returns to the main program to be included in the report.
I've noticed that 100 query reports, which used to run without issue, now cause me to run out of memory. My speculation is that because these callables are creating their own copy of the data, I'm doubling memory usage when I join them into another large HashMap. I realize I could try to coax the garbage collector to run by attempting to reduce the scope of the callable's table, but that level of restructuring is not really what I want to do if it's possible to avoid.
Could I improve memory usage by replacing the callables with runnables that instead of storing the data, write it to a concurrent HashMap? Or does it sound like I have some other problem here?

Don't create copies of the data; just pass references around, ensuring thread safety where needed. If you still run out of memory without the data copying, consider increasing the maximum heap available to the application.
The drawback of not copying the data is that thread safety becomes harder to achieve, though.

Do you really need all 100-200 reports at the same time?
Maybe it's worth limiting the first level of caching to just 50 reports and introducing a second level based on a WeakHashMap?
When the first level exceeds its size, the least recently used entries get pushed down to the second level, which holds on to them only as long as available memory allows (that is the point of the WeakHashMap).
Then, to look up a report, you first query the first level; if the value is not there, you query the second level; and if it is not there either, the report was reclaimed by the GC when memory ran low and you have to query the DB again for it.
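For what it's worth, a hypothetical sketch of that two-level idea (the class name, sizes and key/value types are mine; note that a WeakHashMap holds its keys weakly, so entries are only reclaimable while nothing else references the key, and SoftReference values would track "reclaim only when memory is low" even more closely):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

public class TwoLevelReportCache<K, V> {
    private final Map<K, V> secondLevel = new WeakHashMap<>(); // GC-reclaimable overflow
    private final int firstLevelSize;
    private final LinkedHashMap<K, V> firstLevel;

    public TwoLevelReportCache(int firstLevelSize) {
        this.firstLevelSize = firstLevelSize;
        // access-order LinkedHashMap gives us LRU eviction for the first level
        this.firstLevel = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > TwoLevelReportCache.this.firstLevelSize) {
                    secondLevel.put(eldest.getKey(), eldest.getValue()); // push LRU entry down
                    return true;
                }
                return false;
            }
        };
    }

    public synchronized void put(K key, V report) {
        firstLevel.put(key, report);
    }

    /** Returns the cached report, or null if the GC reclaimed it and the DB must be queried again. */
    public synchronized V get(K key) {
        V report = firstLevel.get(key);
        if (report == null) {
            report = secondLevel.remove(key);
            if (report != null) {
                firstLevel.put(key, report); // promote back into the first level
            }
        }
        return report;
    }
}
```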

Do the results of the queries depend on other query results? If not, then whenever a thread discovers results, just have it write them into a ConcurrentHashMap, as you are implying. Do you really need to ask whether creating several unnecessary copies of the data is causing your program to run out of memory? That should be almost obvious.
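A minimal sketch of that approach; the class and method names, the pool size of 8, and the String key/value types are placeholders for the real query and result types:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReportBuilder {

    // Placeholder for the real, slow SQL query; returns one processed key/value pair.
    static Map.Entry<String, String> runQuery(int queryId) {
        return Map.entry("query-" + queryId, "processed result");
    }

    public static Map<String, String> buildReport(int queryCount) throws InterruptedException {
        // One shared map; each worker writes its rows directly instead of returning a private copy
        // that has to be merged into (and briefly duplicated in) a second large HashMap.
        ConcurrentHashMap<String, String> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (int i = 0; i < queryCount; i++) {
            final int queryId = i;
            pool.execute(() -> {
                Map.Entry<String, String> row = runQuery(queryId);
                results.put(row.getKey(), row.getValue());
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return results;
    }
}
```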

Related

Java: How to make sure that a shared data write affects multiple readers

I am trying to code a processor-intensive task, so I would like to use multithreading and share the calculation between the available processor cores.
Let's say I have thousands of iterations, and every iteration has two phases:
Several worker threads scan through hundreds of thousands of options, reading data from a shared array (or some other data structure) without modifying it.
One thread collects the results from all the worker threads (while they are waiting) and modifies the shared array.
The phases run in sequence, so there is no overlap (no concurrent writing and reading of the data). My problem is: how can I be sure that the data (cache) is up to date for the worker threads before the next Phase 1 starts?
I am assuming that when people speak about cache or caching in this context, they mean the processor cache (correct me if I'm wrong).
As I understand it, volatile only helps for non-reference types, and there is no point in using synchronized, because the worker threads would block each other while reading (there can be thousands of reads when processing an option).
What else can I use in this case?
Right now I have a few ideas, but I have no idea how costly they are (most probably quite costly):
create new worker threads for every iteration
in a synchronized block, make a copy of the array (which can be up to 195 kB) for each thread before a new iteration begins
I read about ReentrantReadWriteLock, but I can't work out how it relates to caching. Can acquiring a read lock force the reader's cache to be updated?
The thing I was searching for was mentioned in the "Java Tutorial on Concurrency"; I just had to look deeper. In this case it was the AtomicIntegerArray class. Unfortunately it is not efficient enough for my needs, but I ran some tests that may be worth sharing.
I approximated the cost of the different memory access methods by running each of them many times, averaging the elapsed times, and breaking everything down to the average cost of one read or write.
I used an integer array of 50000 elements and repeated every test method 100 times, then averaged the results. The read tests perform 50000 random(ish) reads. The results show the approximate time of one read/write access. This still cannot be called an exact measurement, but I believe it gives a good sense of the relative cost of the different access methods. On different processors or with different sizes the results may be completely different, depending on cache sizes and clock speeds.
So the results are:
1. Fill time with set is: 15.922673 ns
2. Fill time with lazySet is: 4.5303152 ns
3. Atomic read time is: 9.146553 ns
4. Synchronized read time is: 57.858261399999996 ns
5. Single-threaded fill time is: 0.2879112 ns
6. Single-threaded read time is: 0.3152002 ns
7. Immutable copy time is: 0.2920892 ns
8. Immutable read time is: 0.650578 ns
Points 1 and 2 show the write results on an AtomicIntegerArray, using sequential writes. I had read in an article about the good efficiency of the lazySet() method, so I wanted to test it. It usually outperforms the set() method by about 4 times, although different array sizes show different results.
Points 3 and 4 show the difference between "atomic" access and synchronized access (a synchronized getter) to one item of the array, using random(ish) reads from four threads running simultaneously. This clearly indicates the benefit of the "atomic" access.
Since the first four values looked shockingly high, I really wanted to measure the access times without multithreading, which gave me the results of points 5 and 6. I tried to copy and modify the methods from the previous tests to keep the code as close as possible, although of course there may be optimizations I cannot influence.
Then, just out of curiosity, I came up with points 7 and 8, which imitate immutable access. Here one thread creates the array (by sequential writes) and passes its reference to another thread, which performs the random(ish) read accesses on it.
The results vary heavily if the parameters are changed, such as the size of the array or the number of methods running.
The conclusion:
If an algorithm is extremely memory-intensive (lots of reads from the same small array, interrupted by short calculations, which is my case), multithreading can slow the calculation down instead of speeding it up. But if it performs a great many reads compared to the size of the array, it may be helpful to use an immutable copy of the array and multiple threads.
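To make the access patterns behind points 1-4 concrete, here is a rough reconstruction (my own sketch; the actual timing harness, thread counts and random index generation are omitted):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class AccessPatterns {
    static final int SIZE = 50_000;
    static final AtomicIntegerArray atomicArray = new AtomicIntegerArray(SIZE);
    static final int[] plainArray = new int[SIZE];

    // Points 1 and 2: sequential fills with set() versus lazySet().
    static void fillWithSet() {
        for (int i = 0; i < SIZE; i++) atomicArray.set(i, i);
    }

    static void fillWithLazySet() {
        for (int i = 0; i < SIZE; i++) atomicArray.lazySet(i, i); // weaker ordering, often cheaper
    }

    // Point 3: "atomic" read of one element, called from several reader threads.
    static int atomicRead(int index) {
        return atomicArray.get(index);
    }

    // Point 4: synchronized getter over a plain array, called from several reader threads.
    static synchronized int synchronizedRead(int index) {
        return plainArray[index];
    }
}
```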

Available memory based thread scheduler

I am looking to make one of my applications more efficient from a resource utilization perspective and would like your input.
This application connects to a number of databases. For every db, it runs a query, brings a bunch of records into memory and performs some operations on them.
I am using Executors.newFixedThreadPool(n) to spawn multiple threads, each handling the task for one db. However, depending on the number of records fetched for the dbs being processed at a given point in time, the memory footprint fluctuates.
In an ideal scenario, I would want to reduce my thread pool size (not supported in the current setup) whenever the available memory drops below a threshold. The scheduler could essentially defer picking up the next task until sufficient memory is available.
My question is whether such intelligent scheduling logic is already available somewhere that I can use, or whether I need to build it from scratch.
Thanks.
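For what it's worth, a crude sketch of the deferral idea described above could look something like this (the threshold check, sleep interval and class name are my own assumptions, not an existing library):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical wrapper: defers task submission while estimated free heap is below a threshold.
public class MemoryAwareSubmitter {
    private final ExecutorService pool;
    private final long minFreeBytes;

    public MemoryAwareSubmitter(int threads, long minFreeBytes) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.minFreeBytes = minFreeBytes;
    }

    private static long estimatedFreeHeap() {
        Runtime rt = Runtime.getRuntime();
        // free space within the current heap plus the headroom the heap can still grow by
        return rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
    }

    public <T> Future<T> submitWhenMemoryAllows(Callable<T> task) throws InterruptedException {
        while (estimatedFreeHeap() < minFreeBytes) {
            Thread.sleep(500); // crude back-off; a real solution would need something smarter
        }
        return pool.submit(task);
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```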

Using MapDB efficiently (confused about commits)

I'm using MapDB in a project that deals with billions of Objects that need to be mapped/queued. I don't need any kind of persistence after the program finishes (the MapDB databases are all temporary). I want the program to run as fast as possible, but I'm confused about MapDB's commit() function (which I assume is relevant to performance), even after reading the docs. My questions:
What exactly does commit do? My working understanding is that it serializes Objects from the heap to disk, thus freeing heap space. Is this accurate?
What happens to the references to Objects that were just committed? Do they get cleaned up by GC, or do they somehow 'reference' an Object on disk (with MapDB making this transparent?)
Ultimately I want to know how to use MapDB as efficiently as I can, but I can't do that without knowing what commit() is for. I'd appreciate any other advice that you might have for using MapDB efficiently.
The commit operation is an operation on transactions, just as you would find in a database system. MapDB implements transactions, so commit effectively means 'make the changes I've made to this DB permanent and visible to other users of it'. The complementary operation is rollback, which discards all of the changes you've made within the current transaction. Commit doesn't (directly) affect what is in memory and what is not. You might want to look at compact() instead if you're trying to reclaim heap space.
For your second question, if you're holding a strong reference to an object then you continue holding that strong reference. MapDB isn't going to delete it for you. You should think of MapDB as a normal Java Map, most of the time. When you call get, MapDB hides whether it's in memory or on disk from you and just returns you a usable reference to the retrieved object. That retrieved object will hang around in memory until it becomes garbage, just like anything else.
It is a good idea not to commit after every single change you make to a map, but instead to do it on some sort of schedule, for example:
every N changes
every M seconds
after logical checkpoints in your code.
Doing too many commits will make your application very slow.
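A minimal sketch of that batching advice, assuming the MapDB 3.x API; the batch size is arbitrary:

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class BatchedCommits {
    public static void main(String[] args) {
        // Temporary, non-persistent store; transactionEnable() makes commit()/rollback() meaningful.
        DB db = DBMaker.tempFileDB()
                .transactionEnable()
                .make();

        HTreeMap<Long, String> map = db.hashMap("queue", Serializer.LONG, Serializer.STRING)
                .createOrOpen();

        final int COMMIT_EVERY = 100_000; // hypothetical batch size; tune for your workload
        for (long i = 0; i < 1_000_000; i++) {
            map.put(i, "record-" + i);
            if ((i + 1) % COMMIT_EVERY == 0) {
                db.commit(); // commit on a schedule, not after every single put
            }
        }
        db.commit();
        db.close();
    }
}
```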

Is there any way to free memory during large data processing?

I have a database where I store invoices. I have to run complex operations for any given month, using a series of algorithms over the information from all of the invoices. Retrieving and processing the necessary data for these operations takes a large amount of memory, since there may be lots of invoices. The problem gets worse when the interval requested by the user for these calculations stretches to several years. The result is that I'm getting a PermGen exception, because it seems the garbage collector is not doing its job between the monthly calculations.
I've always understood that using System.gc() to hint the GC to do its job is not good practice. So my question is: are there any other ways to free memory aside from that? Can you force the JVM to swap to disk in order to store partial calculations temporarily?
Also, I've tried calling System.gc() at the end of each monthly calculation; the result was high CPU usage (due to the garbage collector running) and noticeably lower memory use. This could do the job, but I don't think it would be a proper solution.
Don't ever use System.gc(). It always takes a long time to run and often doesn't do anything.
The best thing to do is rewrite your code to minimize memory usage as much as possible. You haven't explained exactly how your code works, but here are two ideas:
Try to reuse the data structures you generate for each month. So if, say, you have a list of invoices, reuse that list for the next month.
If you need all of the data, consider writing the processed results to temporary files as you go, then reloading them when you're ready.
Remember that System.gc() does not actually run the garbage collector; it merely asks it to run. The JVM may or may not honour the request. All we can do is make unnecessary data structures available for garbage collection. You can do that by:
assigning null to the reference to a data structure once it has been used, so that no live thread can reach it any more (in short, making it eligible for GC);
reusing the same structures instead of creating new ones.
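A minimal sketch of both suggestions, with a hypothetical Invoice type and loader standing in for the real entities and query code:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class MonthlyProcessor {

    // Hypothetical invoice type and loader; they stand in for the real entities and DB query.
    static class Invoice {
        BigDecimal amount = BigDecimal.ZERO;
    }

    static List<Invoice> loadInvoicesFor(int year, int month) {
        return new ArrayList<>(); // placeholder for the real query
    }

    public static BigDecimal process(int year, int fromMonth, int toMonth) {
        BigDecimal total = BigDecimal.ZERO;
        List<Invoice> invoices = new ArrayList<>();
        for (int month = fromMonth; month <= toMonth; month++) {
            invoices.clear();                          // reuse one list instead of allocating a new one
            invoices.addAll(loadInvoicesFor(year, month));
            for (Invoice invoice : invoices) {
                total = total.add(invoice.amount);     // keep only the small aggregate result
            }
            // nothing else holds a reference to the raw invoices after this point, so they are
            // eligible for garbage collection before the next month's data is loaded; if the list
            // lived in a field, assigning null to that field would achieve the same effect
        }
        return total;
    }
}
```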

What does "costly" mean in terms of software operations?

What is meant by 'the operation is costly' or 'the resource is costly' in terms of software? In some documents I come across statements like 'opening a file every time is a costly operation'. There are more examples like this (a database connection is a costly operation, a thread pool is a cheaper one, etc.). On what basis is it decided whether a task or operation is costly or cheap? What constraints are considered when making that call? Is it based on time as well?
Note: I already searched the net for this but didn't find a good explanation. If you find one, kindly share it with me and I can close this.
Expensive or costly operations are those which use a lot of resources, such as CPU, disk drive(s) or memory.
For example, creating an integer variable in code is not a costly or expensive operation.
By contrast, creating a connection to a remote server that hosts a relational database, querying several tables and returning a large result set before iterating over it while remaining connected to the data source would be (relatively) expensive or costly, as opposed to my first example with the integer.
In order to build scalable, fast applications you generally want to minimize the frequency of these costly/expensive actions, applying techniques such as optimisation, caching and parallelism where they are essential to the operation of the software.
To get a degree of accuracy and some actual numbers on what is 'expensive' and what is 'cheap' in your application, you would employ some sort of profiling or analysis tool. For JavaScript there is YSlow, for .NET applications dotTrace; I'd be certain that, whatever the platform, a similar solution exists. It's then down to someone to interpret the output, which is probably the most important part!
Running time, memory use or bandwidth consumption are the most typical interpretations of "cost". Also consider that it may apply to cost in development time.
I'll try to explain through some examples:
If you need to edit two fields in each row of a database table and you do it one field at a time, it will take close to twice as long as updating both in the same statement.
That extra time is not only a waste of your time; it also means a connection held open longer than needed and memory occupied longer than needed, and at the end of the day your efficiency goes down the drain.
When you start scaling, a very small amount of wasted time grows into a very big waste of company resources.
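As a concrete, hypothetical version of the two-fields example, assuming an invoices table with status and note columns and using plain JDBC:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TwoFieldUpdate {

    // Costlier version: two statements and two round trips per row.
    static void updateSeparately(Connection conn, long id, String status, String note) throws SQLException {
        try (PreparedStatement ps1 = conn.prepareStatement("UPDATE invoices SET status = ? WHERE id = ?");
             PreparedStatement ps2 = conn.prepareStatement("UPDATE invoices SET note = ? WHERE id = ?")) {
            ps1.setString(1, status);
            ps1.setLong(2, id);
            ps1.executeUpdate();
            ps2.setString(1, note);
            ps2.setLong(2, id);
            ps2.executeUpdate();
        }
    }

    // Cheaper version: both fields in a single statement and a single round trip.
    static void updateTogether(Connection conn, long id, String status, String note) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE invoices SET status = ?, note = ? WHERE id = ?")) {
            ps.setString(1, status);
            ps.setString(2, note);
            ps.setLong(3, id);
            ps.executeUpdate();
        }
    }
}
```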
It is almost certainly talking about the time penalty of performing that kind of input/output. Lots of memory shuffling (copying of objects created from classes with lots of members) is another time waster (passing references around helps eliminate a lot of this).
Usually 'costly' means, in a very simplified way, that the operation will take much longer than an operation on memory.
For instance, accessing a file in your file system and reading each line takes much longer than simply iterating over a list of the same size in memory.
The same can be said about database operations: they take much longer than in-memory operations, so some caution is needed not to overuse them.
This is, I repeat, a very simplistic explanation. Exactly what 'costly' means depends on your particular context, the number of operations you're performing, and the overall architecture of the system.
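As a throwaway illustration of the file-versus-memory point (data.txt is a hypothetical input file, and the numbers will vary wildly with OS caching and file size):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CostComparison {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("data.txt"); // hypothetical input file

        long t0 = System.nanoTime();
        int fileLines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            while (reader.readLine() != null) fileLines++;   // disk access and decoding on every line
        }
        long fileTime = System.nanoTime() - t0;

        List<String> inMemory = Files.readAllLines(file);    // load once, then iterate in memory
        long t1 = System.nanoTime();
        int listLines = 0;
        for (String line : inMemory) listLines++;
        long listTime = System.nanoTime() - t1;

        System.out.printf("file: %d lines in %d us, list: %d lines in %d us%n",
                fileLines, fileTime / 1_000, listLines, listTime / 1_000);
    }
}
```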
