Java MySQL JDBC Memory Leak

Ok, so I have this program with many (~300) threads, each of which communicates with a central database. I create a global connection to the DB, and then each thread goes about its business creating statements and executing them.
Somewhere along the way, I have a massive memory leak. After analyzing the heap dump, I see that the com.mysql.jdbc.JDBC4Connection object is 70 MB, because it has 800,000 items in "openStatements" (a hash map). Somewhere it's not properly closing the statements that I create, but I cannot for the life of me figure out where (every single time I open one, I close it as well). Any ideas why this might be occurring?

I had exactly the same problem. I needed to keep one connection active for three threads, and at the same time every thread had to execute a lot of statements (on the order of 100k). I was very careful and closed every statement and every result set using a try...finally pattern, so that even if the code failed in some way, the statement and the result set were always closed. After running the code for 8 hours I was surprised to find that the memory needed went from the initial 35 MB to 500 MB. I generated a dump of the memory and analyzed it with MAT (Memory Analyzer) from Eclipse. It turned out that one com.mysql.jdbc.JDBC4Connection object was taking 445 MB of memory, keeping alive some openStatements objects which in turn kept alive around 135k hashmap entries, probably from all the result sets. So it seems that even if you close all your statements and result sets, if you do not close the connection, it keeps references to them and the garbage collector can't free the resources.
My solution: after a long search I found this statement from the guys at MySQL:
"A quick test is to add "dontTrackOpenResources=true" to your JDBC URL. If the memory leak
goes away, some code path in your application isn't closing statements and result sets."
Here is the link: http://bugs.mysql.com/bug.php?id=5022. So I tried that, and guess what? After 8 hours I was at around 40 MB of required memory for the same database operations.
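For example, the flag just goes on the connection URL - host, port, database name and credentials below are placeholders (classes are java.sql.Connection / DriverManager):

// Quick diagnostic only: the driver stops tracking open statements/result sets,
// so this hides the leak rather than fixing code that forgets to close them.
String url = "jdbc:mysql://localhost:3306/mydb?dontTrackOpenResources=true";
Connection conn = DriverManager.getConnection(url, "user", "password");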
Maybe a connection pool would be advisable, but if that's not an option, this is the next best thing I came across.

Unless MySQL says otherwise, JDBC Connections are NOT thread safe. You CANNOT share them across threads unless you use a connection pool. In addition, as pointed out, you should use try/finally to guarantee that all statements, result sets, and connections are closed.
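A sketch of that pattern, assuming each thread borrows its own connection from a pooled javax.sql.DataSource set up elsewhere; the query is a placeholder (classes are java.sql.Connection / Statement / ResultSet / SQLException):

// Each worker borrows its own connection from the pool rather than sharing one
// Connection across threads; everything is closed in the finally block.
Connection conn = null;
Statement stmt = null;
ResultSet rs = null;
try {
    conn = dataSource.getConnection();      // pooled DataSource, configured elsewhere
    stmt = conn.createStatement();
    rs = stmt.executeQuery("SELECT ...");   // placeholder query
    while (rs.next()) {
        // process the row
    }
} finally {
    if (rs != null)   try { rs.close(); }   catch (SQLException ignored) {}
    if (stmt != null) try { stmt.close(); } catch (SQLException ignored) {}
    if (conn != null) try { conn.close(); } catch (SQLException ignored) {} // returns it to the pool
}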

Once upon a time, whenever my code saw "server went away," it opened a new DB connection. If the error happened in the right (wrong!) place, I was left with some non-free()d orphan memory hanging around. Could something like this account for what you are seeing? How are you handling errors?

Without seeing your code (which I'm sure is massive), you should really consider a more formal connection pooling mechanism, such as the Apache Commons Pool framework, Spring's JDBC framework, or others. IMHO, this is a much simpler approach, since someone else has already figured out how to manage these kinds of situations effectively.
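For instance, a sketch with Spring's JdbcTemplate, which acquires and closes the connection, statement, and result set on every call; the DAO, table, and column names below are made up:

import javax.sql.DataSource;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

// JdbcTemplate opens, uses and closes the connection/statement/result set internally.
public class UserDao {
    private final JdbcTemplate jdbc;

    public UserDao(DataSource pooledDataSource) {   // e.g. a Commons DBCP pool
        this.jdbc = new JdbcTemplate(pooledDataSource);
    }

    public List<String> findNames() {
        // placeholder query; no explicit close() calls needed anywhere
        return jdbc.query("SELECT name FROM users",
                (rs, rowNum) -> rs.getString("name"));
    }
}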

Related

Memory/Heap Status after closing ResultSet in JDBC

A ResultSet fetches records from the database.
After using the ResultSet object, we finally close it.
The question is: once rs.close() is called, will it free the fetched records from memory?
Or
will the garbage collector only be invoked to delete the ResultSet's data when the JVM is facing a shortage of space?
And if the JVM invokes GC when it faces a shortage of memory, is it good practice to call the garbage collector manually in a Java program to free up space?
Result Sets are often implemented by using a database cursor. Calling resultSet.close() will release that cursor, so it will immediately free resources in the database.
The data read by a Result Set is often received in blocks of records. Calling resultSet.close() might "release" the last block, making it eligible for GC, but that would happen anyway once the resultSet itself goes out of scope and becomes eligible for GC, and that likely happens right after calling close(), so it really doesn't matter if calling close() releases Java memory early.
Java memory is only freed by a GC run. You don't control when that happens (calling System.gc() is only a hint, you don't have control).
You're considering the wrong things. What you should focus on is:
Making sure resources¹ are always closed as soon as possible to free up database and system resources.
This is best done using try-with-resources.
Making sure you don't keep too much data, e.g. don't create objects for every row retrieved if you can process the data as you get it.
This is usually where memory leaks occur, not inside the JDBC driver.
¹) E.g. ResultSet, Statement, Connection, InputStream, OutputStream, Reader, Writer, etc.
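A minimal try-with-resources sketch of the advice above (dataSource, the query, and process() are placeholders):

// Resources declared in the try(...) header are closed automatically,
// in reverse order, even if an exception is thrown.
try (Connection conn = dataSource.getConnection();
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT ...")) {   // placeholder query
    while (rs.next()) {
        process(rs);   // handle each row as you read it, don't buffer everything
    }
}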
ResultSet.close() will immediately release all resources, except Blob, Clob and NClob objects. Release means resources will be freed when Garbage Collector decides so. Usually we don't have to worry about it.
However, some memory used by JDBC may remain used.
Suppose that the driver has some sort of cache built in, and that cache is connection-scoped. To release that memory, you'd have to close JDBC Connection.
E.g. the MySQL JDBC driver has a default fetch size of 0, meaning it loads the entire result set into memory and keeps it there for the lifetime of the statement. What's the scope of that in-memory buffer? ;)
Anyway, if you suspect memory issues, have a look at your JDBC driver specifics.
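For the MySQL fetch-size point above: Connector/J specifically can be asked to stream rows instead of buffering the whole result set. This is only a sketch, conn comes from elsewhere, and the query is a placeholder:

// Connector/J-specific: a forward-only, read-only statement with fetch size
// Integer.MIN_VALUE streams rows one at a time instead of buffering them all.
try (Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                           ResultSet.CONCUR_READ_ONLY)) {
    stmt.setFetchSize(Integer.MIN_VALUE);
    try (ResultSet rs = stmt.executeQuery("SELECT ...")) {  // placeholder query
        while (rs.next()) {
            // process each row as it arrives
        }
    }
}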
Rule of thumb: explicit GC is never a good idea. But for a quick check to determine whether ResultSet.close()/Connection.close() release any resources, give it a try: inspect used/free memory, close(), gc(), inspect memory again. Without an explicit GC you will hardly see any changes.
Explicit GC is a burden on the JVM, as it forces collections regardless of actual memory pressure. Choosing a GC configuration appropriate to the application's requirements is usually sufficient to handle the scenarios above.
ResultSet.close() will mark the resources for garbage collection, i.e. it frees up the references so the memory blocks become unreachable. Also, for JDBC, the connection needs to be closed so that memory holding the connection's cache can be marked for GC as well.

Java - programmatically reduce application load when it runs out of memory

No, really, that's what I'm trying to do. Server is holding onto 1600 users - back end long-running process, not web server - but sometimes the users generate more activity than usual, so it needs to cut its load down, specifically when it runs out of "resources," which pretty much means heap memory. This is a big design question - how to design this?
This would likely involve preventing OOM rather than recovering from it. Ideally something like
if(nearlyOutOfMemory()) throw new MyRecoverableOOMException();
would happen.
But I don't really know what that nearlyOutOfMemory() function would look like.
Split the server into shards, each holding fewer users but residing on different physical machines.
If you have lots of caches, try to use soft references, which get cleared out when the VM runs out of heap.
In any case, profile, profile, profile first to see where CPU time is consumed and memory is allocated and held onto.
I have actually asked a similar question about handling OOM, and it turns out that there aren't too many options to recover from it. Basically you can:
1) invoke external shell script (-XX:OnOutOfMemoryError="cmd args;cmd args") which would trigger some action. The problem is that if OOM has happened in some thread which doesn't have a decent recovery strategy, you're doomed.
2) Define a threshold for Old gen which technically isn't OOM but a few steps ahead, say 80% and act if the threshold has been reached. More details here.
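A sketch of option 2 using the standard java.lang.management API; the 80% threshold and the pool-name matching are assumptions, since the old-gen pool name varies by collector ("PS Old Gen", "G1 Old Gen", "Tenured Gen", ...):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class OldGenWatcher {
    // Returns the tenured/old-gen pool, armed with an 80% usage threshold (example value).
    static MemoryPoolMXBean armOldGenThreshold() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()
                    && pool.getName().matches(".*(Old|Tenured).*")) {
                long max = pool.getUsage().getMax();
                if (max > 0) {
                    pool.setUsageThreshold((long) (max * 0.8));
                }
                return pool;
            }
        }
        return null;
    }

    // Call this before admitting new work; shed load when the threshold has tripped.
    static boolean nearlyOutOfMemory(MemoryPoolMXBean oldGen) {
        return oldGen != null && oldGen.isUsageThresholdExceeded();
    }
}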
You could use Runtime.getRuntime() and the following methods:
freeMemory()
totalMemory()
maxMemory()
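A possible nearlyOutOfMemory() built on those three methods; the 90% threshold is purely an illustrative assumption:

// Heap currently in use vs. the -Xmx ceiling; returns true above 90% usage.
static boolean nearlyOutOfMemory() {
    Runtime rt = Runtime.getRuntime();
    long used = rt.totalMemory() - rt.freeMemory();
    long max  = rt.maxMemory();
    return used > max * 0.9;
}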
But I agree with the other posters: using SoftReference, WeakReference or a WeakHashMap will probably save you the trouble of manually recovering from that condition.
A throttling, resource-regulating servlet filter may be of use too. I have come across the DoSFilter from Jetty/Eclipse, for example.

Using a concurrent hashmap to reduce memory usage with threadpool?

I'm working with a program that runs lengthy SQL queries and stores the processed results in a HashMap. Currently, to get around the slow execution time of each of the 20-200 queries, I am using a fixed thread pool and a custom callable to do the searching. As a result, each callable is creating a local copy of the data which it then returns to the main program to be included in the report.
I've noticed that 100 query reports, which used to run without issue, now cause me to run out of memory. My speculation is that because these callables are creating their own copy of the data, I'm doubling memory usage when I join them into another large HashMap. I realize I could try to coax the garbage collector to run by attempting to reduce the scope of the callable's table, but that level of restructuring is not really what I want to do if it's possible to avoid.
Could I improve memory usage by replacing the callables with runnables that instead of storing the data, write it to a concurrent HashMap? Or does it sound like I have some other problem here?
Don't create copies of the data; just pass references around, ensuring thread safety where needed. If you still hit OOM without copying the data, consider increasing the maximum heap available to the application.
The drawback of not copying the data is that thread safety is harder to achieve, though.
Do you really need all 100-200 reports at the same time?
Maybe it's worth limiting the 1st level of caching to just 50 reports and introducing a 2nd level based on a WeakHashMap (a sketch follows below)?
When the 1st level exceeds its size, the least recently used entry is pushed down to the 2nd level, which depends on the amount of available memory (thanks to the WeakHashMap).
Then, to look up a report, you first query the 1st level; if the value is not there, you query the 2nd level; and if it's not there either, the report was reclaimed by the GC when memory ran low and you have to query the DB again for it.
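A rough sketch of such a two-level cache. The class and constant names are made up, and WeakHashMap entries can disappear whenever the GC needs memory, provided the key isn't strongly referenced elsewhere:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

class TwoLevelReportCache<K, V> {
    private static final int FIRST_LEVEL_SIZE = 50; // limit suggested above

    // 2nd level: entries may be reclaimed by the GC under memory pressure.
    private final Map<K, V> secondLevel =
            Collections.synchronizedMap(new WeakHashMap<K, V>());

    // 1st level: bounded LRU; evicted entries are demoted to the 2nd level.
    private final Map<K, V> firstLevel =
            Collections.synchronizedMap(new LinkedHashMap<K, V>(FIRST_LEVEL_SIZE, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    if (size() > FIRST_LEVEL_SIZE) {
                        secondLevel.put(eldest.getKey(), eldest.getValue()); // demote LRU entry
                        return true;
                    }
                    return false;
                }
            });

    V get(K key) {
        V v = firstLevel.get(key);
        if (v == null) {
            v = secondLevel.get(key);              // may be null if the GC reclaimed it
            if (v != null) firstLevel.put(key, v); // promote back to the 1st level
        }
        return v;                                  // null => re-query the database
    }

    void put(K key, V value) {
        firstLevel.put(key, value);
    }
}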
Do the results of the queries depend on other query results? If not, have each thread write its results into a ConcurrentHashMap as it discovers them, like you are implying. And do you really need to ask whether creating several unnecessary copies of data is causing your program to run out of memory? That should be almost obvious.
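If the queries really are independent, the runnable-plus-shared-map version could look roughly like this; the query plumbing and types are placeholders:

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedResultCollector {
    public static void main(String[] args) throws InterruptedException {
        // One shared map; each task writes its rows here instead of
        // returning a private copy through a Future.
        Map<String, ResultData> results = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        for (String query : queriesToRun()) {
            pool.submit(() -> results.put(query, runQuery(query)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // build the report from 'results' here
    }

    // Placeholders standing in for the real query plumbing.
    static Iterable<String> queriesToRun() { return Collections.emptyList(); }
    static ResultData runQuery(String query) { return new ResultData(); }
    static class ResultData { }
}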

How to (dynamically) determine the optimal number of DB connections?

How would you go about dynamically configuring the maximum number of connections in a DB connection pool?
I've all but given up on using a "hard-coded" (configuration file, but still) number of connections. Some of the time, more connections provide better performance. At other times, fewer connections do a better job.
What measurement would you use to determine whether you've opened too many connections and are actually hurting performance? Please keep in mind that I can't just "stop the world" to run a performance test - I need something I can derive from my own query responses (of which I have no specific measurement - some are slow, some are fast, and I can't know in advance which is which).
(please note I'm using Java JDBC with underlying DataDirect drivers)
Is this approach used somewhere (and was it successful)? If not, how would you go about solving the "what is the optimal number of connections" when you have to support both Oracle and MS SQL, both for several versions and the queries vary wildly in nature (indexed lookup / non-indexed lookup / bulk data fetching / condition matching (indexed and non indexed, with and without wildcards))?
[I know this is similar to optimal-number-of-connections-in-connection-pool question, but I'm asking about dynamic configuration while he's asking about static one]
If you queue users to wait for a free database connection, they are waiting on something unknown or artificially imposed.
If you let them through to the database, you'll at least find out what resource is being fought over (if any). For example, if it is disk I/O, you may move your files around to spread activity against more or different disks or buy some SSD or more memory for cache. But at least you know what your bottleneck is and can take steps to address it.
If there is some query or service hogging resource, you should look into resource manager to segregate/throttle those sessions.
You probably also want to close off unused sessions (so you may have a peak of 500 sessions at lunch, but drop that to 50 overnight when a few bigger batch jobs are running).
You need a free-flowing connection pool which auto-adjusts according to the load. It should have:
1) Min size: 0
2) Max size: as per your DB configuration
3) Increment by 1 if available connections are out of stock
4) Abandon a connection if it has been idle for X (configured) seconds
5) The connection pool should release the abandoned connections.
With these settings the connection pool should manage the number of connections dynamically, based on the load (see the sketch below).
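As a sketch with HikariCP - the property names are HikariCP's, the URL is a placeholder, and the max size of 50 is illustrative and should match your DB configuration:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    static HikariDataSource createPool() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://localhost:3306/mydb");  // placeholder URL
        cfg.setMinimumIdle(0);           // 1) min size: 0
        cfg.setMaximumPoolSize(50);      // 2) max size: illustrative value
        cfg.setIdleTimeout(30_000);      // 4) retire connections idle for X ms (30s here)
        // HikariCP grows the pool on demand (3) and closes idle connections
        // itself (5), so nothing extra is needed for those points.
        return new HikariDataSource(cfg);
    }
}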
Closing due to lack of interest. We ended up using a high maximum value and it didn't seem to bother the DB much.

Connection Pool Strategy: Good, Bad or Ugly?

I'm in charge of developing and maintaining a group of Web Applications that are centered around similar data. The architecture I decided on at the time was that each application would have their own database and web-root application. Each application maintains a connection pool to its own database and a central database for shared data (logins, etc.)
A co-worker has been positing that this strategy will not scale, because maintaining so many different connection pools is not scalable, and that we should refactor the database so that all of the different applications use a single central database; any modifications that may be unique to a system would need to be reflected in that one database, and everything would then use a single pool powered by Tomcat. He has posited that there is a lot of "meta data" that goes back and forth across the network to maintain a connection pool.
My understanding is that with proper tuning to use only as many connections as necessary across the different pools (low volume apps getting less connections, high volume apps getting more, etc.) that the number of pools doesn't matter compared to the number of connections or more formally that the difference in overhead required to maintain 3 pools of 10 connections is negligible compared to 1 pool of 30 connections.
The reasoning behind initially breaking the systems into a one-app-one-database design was that there are likely going to be differences between the apps and that each system could make modifications on the schema as needed. Similarly, it eliminated the possibility of system data bleeding through to other apps.
Unfortunately there is no strong leadership in the company to make a hard decision. Although my co-worker backs up his worries only with vagueness, I want to make sure I understand the ramifications of multiple small databases/connections versus one large database/connection pool.
Your original design is based on sound principles. If it helps your case, this strategy is known as horizontal partitioning or sharding. It provides:
1) Greater scalability - because each shard can live on separate hardware if need be.
2) Greater availability - because the failure of a single shard doesn't impact the other shards
3) Greater performance - because the tables being searched have fewer rows and therefore smaller indexes which yields faster searches.
Your colleague's suggestion moves you to a single point of failure setup.
As for your question about 3 connection pools of size 10 vs 1 connection pool of size 30, the best way to settle that debate is with a benchmark. Configure your app each way, then do some stress testing with ab (Apache Benchmark) and see which way performs better. I suspect there won't be a significant difference but do the benchmark to prove it.
If you have a single database, and two connection pools, with 5 connections each, you have 10 connections to the database. If you have 5 connection pools with 2 connections each, you have 10 connections to the database. In the end, you have 10 connections to the database. The database has no idea that your pool exists, no awareness.
Any meta data exchanged between the pool and the DB is going to happen on each connection. When the connection is started, when the connection is torn down, etc. So, if you have 10 connections, this traffic will happen 10 times (at a minimum, assuming they all stay healthy for the life of the pool). This will happen whether you have 1 pool or 10 pools.
As for "1 DB per app", if you're not talking to an separate instance of the database for each DB, then it basically doesn't matter.
If you have a DB server hosting 5 databases, and you have connections to each database (say, 2 connection per), this will consume more overhead and memory than the same DB hosting a single database. But that overhead is marginal at best, and utterly insignificant on modern machines with GB sized data buffers. Beyond a certain point, all the database cares about is mapping and copying pages of data from disk to RAM and back again.
If you had a large table duplicated across all of the DBs, then that could potentially be wasteful.
Finally, when I use the word "database", I mean the logical entity the server uses to group tables. For example, Oracle really likes to have one "database" per server, broken up into "schemas". Postgres has several DBs, each of which can have schemas. But in any case, all of the modern servers have logical boundaries of data that they can use. I'm just using the word "database" here.
So, as long as you're hitting a single instance of the DB server for all of your apps, the connection pools et al don't really matter in the big picture as the server will share all of the memory and resources across the clients as necessary.
Excellent question. I don't know which way is better, but have you considered designing the code in such a way that you can switch from one strategy to the other with the least amount of pain possible? Maybe some lightweight database proxy objects could be used to mask this design decision from higher-level code. Just in case.
Database- and overhead-wise, 1 pool with 30 connections and 3 pools with 10 connections are largely the same assuming the load is the same in both cases.
Application-wise, the difference between having all data go through a single point (e.g. service layer) vs having per-application access point may be quite drastic; both in terms of performance and ease of implementation / maintenance (consider having to use distributed cache, for example).
Well, excellent question, but it's not easy to choose between the several-small-databases approach (A) and one big database (B):
1) It depends on the database itself. Oracle, for example, behaves differently from Sybase ASE regarding the log (and therefore the lock) strategy. It might be better to use several small databases to keep the lock contention rate low if there are a lot of parallel writes and the DB uses a pessimistic locking strategy (Sybase).
2) If the table spaces of the small databases aren't spread over several disks, it might be better to use one big database so the (buffer/cache) memory serves only one. I think this is rarely the case.
3) (A) scales better, for a reason other than raw performance: you can move a hot-spot database to different (newer/faster) hardware when needed without touching the other databases. In my former company this approach was always cheaper than variant (B) (no new licenses).
I personally prefer (A) for reason 3.
Design, architecture, plans and great ideas fall short when there is no common sense or simple math behind them. Some more practice and/or experience helps... Here is the simple math of why 10 pools with 5 connections each is not the same as 1 pool with 50 connections:
Each pool is configured with min & max open connections. In practice it will usually use (99% of the time) about 50% of the min number (2-3 in the case of a min of 5); if it is using more than that, the pool is mis-configured, since it is opening and closing connections all the time (which is expensive). So 10 pools with 5 min connections each = 50 open connections, meaning 50 TCP connections and 50 JDBC connections on top of them (have you ever debugged a JDBC connection? You'd be surprised how much metadata flows both ways).
If we have 1 pool (serving the same infrastructure as above) we can set the min to 30, simply because a single pool can balance the extra demand more efficiently. That means 20 fewer JDBC connections. I don't know about you, but for me that is a lot.
The devil is in the details - the 2-3 connections that you leave in each pool to make sure it doesn't open/close connections all the time...
I don't even want to go into the overhead of managing 10 pools (I do not want to maintain 10 pools, each one ever so slightly different from the others - do you?).
Now that you've got me started on this: if it were me, I would "wrap" the DB (the data source) with a single app (service layer, anyone?) that provides the different services (REST/SOAP/WS/JSON - pick your poison), and my applications wouldn't even know about JDBC, TCP, etc. Oh wait, Google has it - GAE...
