how to (dynamically) determine optimal db number of connections? - java

How would you go about dynamically configuring the maximum number of connections in a DB connection pool?
I've all but given up on using a "hard coded" (configuration file, but still) number of connections. Sometimes more connections provide better performance; at other times, fewer connections do a better job.
What measurement would you use to determine if you've opened too many connections and are actually hurting performance by it? Please keep in mind I can't just "stop the world" to run a performance test - I need something I can determine from my own query responses (for which I have no specific baseline - some are slow, some are fast, and I can't know in advance which is which).
(please note I'm using Java JDBC with underlying DataDirect drivers)
Is this approach used somewhere (and was it successful)? If not, how would you go about solving the "what is the optimal number of connections" when you have to support both Oracle and MS SQL, both for several versions and the queries vary wildly in nature (indexed lookup / non-indexed lookup / bulk data fetching / condition matching (indexed and non indexed, with and without wildcards))?
[I know this is similar to optimal-number-of-connections-in-connection-pool question, but I'm asking about dynamic configuration while he's asking about static one]

If you queue users to wait for a free database connection, they are waiting on something unknown or artificially imposed.
If you let them through to the database, you'll at least find out what resource is being fought over (if any). For example, if it is disk I/O, you may move your files around to spread activity against more or different disks or buy some SSD or more memory for cache. But at least you know what your bottleneck is and can take steps to address it.
If there is some query or service hogging resources, you should look into a resource manager to segregate/throttle those sessions.
You probably also want to close off unused sessions (so you may have a peak of 500 sessions at lunch, but drop that to 50 overnight when a few bigger batch jobs are running).

You need a free-flowing connection pool which auto-adjusts according to the load. So it should have:
1) Min size: 0
2) Max size: as per your DB configuration
3) Increment by 1 when no pooled connection is available
4) Abandon a connection if it is idle for X (configured time) seconds
5) The connection pool should release the abandoned connections.
With these settings the connection pool should manage the number of connections dynamically, based on the load.
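As a rough illustration of the five settings above, here is a minimal sketch of such a pool. The class name ElasticPool and its methods are hypothetical and not taken from any real pooling library; the caller passes in timestamps, and the resource is generic so the sketch runs without a database.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical elastic pool: min size 0, grows by one on demand, evicts idle entries. */
final class ElasticPool<T> {
    private static final class Idle<T> {
        final T resource;
        final long idleSinceMillis;
        Idle(T resource, long idleSinceMillis) {
            this.resource = resource;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    private final Deque<Idle<T>> idle = new ArrayDeque<>(); // newest first, oldest last
    private final Supplier<T> factory;  // assumption: opens one new connection per call
    private final int maxSize;          // setting 2: upper bound from the DB configuration
    private final long maxIdleMillis;   // setting 4: the "X seconds" idle limit
    private int total;                  // borrowed + idle; starts at 0 (setting 1)

    ElasticPool(Supplier<T> factory, int maxSize, long maxIdleMillis) {
        this.factory = factory;
        this.maxSize = maxSize;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Reuses an idle resource, or creates exactly one new one (setting 3); null if at maxSize. */
    synchronized T borrow() {
        Idle<T> entry = idle.pollFirst();
        if (entry != null) {
            return entry.resource;
        }
        if (total >= maxSize) {
            return null; // the caller decides whether to wait or fail
        }
        T created = factory.get();
        total++;
        return created;
    }

    /** Returns a resource to the pool; nowMillis is assumed to be non-decreasing across calls. */
    synchronized void giveBack(T resource, long nowMillis) {
        idle.addFirst(new Idle<>(resource, nowMillis));
    }

    /** Releases resources idle for at least maxIdleMillis (setting 5); returns how many. */
    synchronized int evictIdle(long nowMillis) {
        int released = 0;
        while (!idle.isEmpty() && nowMillis - idle.peekLast().idleSinceMillis >= maxIdleMillis) {
            idle.pollLast(); // a real pool would also close the connection here
            total--;
            released++;
        }
        return released;
    }

    synchronized int size() {
        return total;
    }
}
```

In a real application evictIdle would run on a timer, and borrow would probably block for a while rather than return null.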

Closing due to lack of interest. We ended up using a high maximum value and it didn't seem to bother the DB much.

Related

Available memory based thread scheduler

I am looking to make one of my applications more efficient from resource utilization perspective and would like to get your inputs to help me with the same.
This application connects to a number of databases. On every db, it runs a query, brings a bunch of records to memory and performs some operations.
I am using Executors.newFixedThreadPool(n) to spawn multiple threads, each handling the task for one db. However, depending on the number of records fetched for the dbs being processed at a given point in time, the memory footprint fluctuates.
In an ideal scenario, I would have wanted to reduce my thread pool size (not supported in the current setup) in case the available memory gets lower than a threshold. The scheduler could essentially defer picking up the next task until we have sufficient memory available.
My question is whether such intelligent scheduling logic is already available somewhere that I can use or need to build it from scratch?
Thanks.
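One way to picture the deferral described above is a small gate in front of the executor. This is only a sketch: MemoryGate, its threshold and its polling interval are hypothetical names, and polling is just one possible strategy. The headroom figure is computed from the standard Runtime methods.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.LongSupplier;

/** Hypothetical wrapper that defers task submission while free memory is below a threshold. */
final class MemoryGate {
    private final LongSupplier freeBytes; // injected so the gate can be tested without a real heap
    private final long minFreeBytes;
    private final long pollMillis;

    MemoryGate(LongSupplier freeBytes, long minFreeBytes, long pollMillis) {
        this.freeBytes = freeBytes;
        this.minFreeBytes = minFreeBytes;
        this.pollMillis = pollMillis;
    }

    /** Approximate headroom of the running JVM: max heap minus currently used heap. */
    static long jvmFreeBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
    }

    boolean hasHeadroom() {
        return freeBytes.getAsLong() >= minFreeBytes;
    }

    /** Blocks the submitting thread until there is headroom, then hands the task to the pool. */
    Future<?> submitWhenMemoryAllows(ExecutorService pool, Runnable task) throws InterruptedException {
        while (!hasHeadroom()) {
            Thread.sleep(pollMillis);
        }
        return pool.submit(task);
    }
}
```

A caller would build it as new MemoryGate(MemoryGate::jvmFreeBytes, threshold, 100) and submit every per-db task through it. The fixed pool keeps its size, but new tasks are held back while memory is tight.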

how to choose maximum connection pool size?

<property name="hibernateProperties">
<props>
<prop key="hibernate.c3p0.max_size">?</prop>
</props>
</property>
is it just a random number guess ? or are there any studies that suggest to use a particular range for a specific use case?
Well, assuming that you are talking about a production system, then "random number guess" is definitely not the answer. Actually, "random number guess" is never the answer for any sort of configuration in a production environment. Same goes for "just accepting the product's defaults".
Connection pool properties (such as max_size) depend on a few factors:
Anticipated use. For example, if you are expecting a typical usage pattern of 10 concurrent users, with an occasional burst of 20 users, there is little point in setting the maximum size to 50 even if you think that your machine can handle it. While it seems harmless at first sight, you have to remember that database connections are an expensive resource. Sometimes you might actually want to cap usage to match your expectation, if only to put your own assumptions to the test, get to know the real, production-like usage pattern of the system, and prevent your application from being a resource hog that affects other applications.
Available resources. If you know (and that is easy to verify) that your database can only accept 30 connections at a time, then setting a number larger than 30 is senseless, regardless of the application's usage pattern.
Application design. Is your application going to mainly use short-lived connections? Long-lived connections? Are you setting up a timeout for JDBC calls, so your JDBC calls are time-limited to begin with? For example, you would configure a connection pool differently when you know that you're setting a timeout of 30 seconds per operation, compared to a pool where you set a timeout of 2 minutes.
Specific product considerations. I am not sure about c3p0, but certain containers that provide connection pooling mechanisms carry their own factors into the equation. If you are using the connection pooling functionality provided by a container, you should read that container's documentation to see whether the container's vendor has some insight with regards to configuring the connection pooling mechanism they provide you with.
... Just don't guesstimate a number.
... And don't just assume product's defaults.
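To make the first three factors concrete, here is a small worked sketch. It is not from the answer above: the estimate uses Little's law (average concurrency = arrival rate x average time a connection is held), which is swapped in here as one common way to turn "anticipated use" and "application design" into a number. PoolSizing and reservedForOthers are hypothetical names.

```java
/** Hypothetical helper for reasoning about a pool's max_size. */
final class PoolSizing {
    /** Little's law: average connections in use = requests per second * seconds each one holds a connection. */
    static int estimateConcurrent(double requestsPerSecond, double avgHoldSeconds) {
        return (int) Math.ceil(requestsPerSecond * avgHoldSeconds);
    }

    /** Caps the expected burst at what the database can actually grant this application. */
    static int maxPoolSize(int expectedBurst, int dbConnectionLimit, int reservedForOthers) {
        int available = Math.max(0, dbConnectionLimit - reservedForOthers);
        return Math.min(expectedBurst, available);
    }

    public static void main(String[] args) {
        // Burst of 20 users against a database that accepts 30 connections,
        // 5 of which are assumed to be reserved for other applications.
        System.out.println(maxPoolSize(20, 30, 5));
    }
}
```

With the numbers used in the answer (a burst of 20 users, a database limit of 30) the cap stays at 20. A burst estimate of 50 would be cut down to whatever the database has left for this application.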

memcached and performance

I might be asking very basic question, but could not find a clear answer by googling, so putting it here.
Memcached caches information in a separate process. Thus getting the cached information requires inter-process communication (which in Java generally means serialization). That means that, to fetch a cached object, we generally need to get a serialized object and transport it over the network.
Both serialization and network communication are costly operations. If memcached needs to use both of these (generally speaking; there might be cases when network communication is not required), then how is Memcached fast? Isn't replication a better solution?
Or is this a tradeoff of distribution/platform independence/scalability vs performance?
You are right that looking something up in a shared cache (like memcached) is slower than looking it up in a local cache (which is what I think you mean by "replication").
However, the advantage of a shared cache is that it is shared, which means each user of the cache has access to more cache than if the memory was used for a local cache.
Consider an application with a 50 GB database, with ten app servers, each dedicating 1 GB of memory to caching. If you used local caches, then each machine would have 1 GB of cache, equal to 2% of the total database size. If you used a shared cache, then you have 10 GB of cache, equal to 20% of the total database size. Cache hits would be somewhat faster with the local caches, but the cache hit rate would be much higher with the shared cache. Since cache misses are astronomically more expensive than either kind of cache hit, slightly slower hits are a price worth paying to reduce the number of misses.
Now, the exact tradeoff does depend on the exact ratio of the costs of a local hit, a shared hit, and a miss, and also on the distribution of accesses over the database. For example, if all the accesses were to a set of 'hot' records that were under 1 GB in size, then the local caches would give a 100% hit rate, and would be just as good as a shared cache. Less extreme distributions could still tilt the balance.
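The 2% versus 20% comparison can be turned into a rough expected-cost calculation. The sketch below assumes uniformly distributed accesses (so hit rate equals cache size over database size), and the three latencies are made-up placeholders rather than measurements; CacheTradeoff is a hypothetical name.

```java
/** Hypothetical back-of-the-envelope model of local vs shared cache cost. */
final class CacheTradeoff {
    /** Hit rate if every record is equally likely to be requested (an assumption). */
    static double uniformHitRate(double cacheGb, double databaseGb) {
        return Math.min(1.0, cacheGb / databaseGb);
    }

    /** Expected cost of one lookup, given a hit rate and the cost of a hit and of a miss. */
    static double expectedCostMs(double hitRate, double hitCostMs, double missCostMs) {
        return hitRate * hitCostMs + (1.0 - hitRate) * missCostMs;
    }

    public static void main(String[] args) {
        // Placeholder latencies, chosen only to illustrate the shape of the tradeoff.
        double localHitMs = 0.01, sharedHitMs = 0.5, missMs = 10.0;
        double local = expectedCostMs(uniformHitRate(1, 50), localHitMs, missMs);
        double shared = expectedCostMs(uniformHitRate(10, 50), sharedHitMs, missMs);
        System.out.printf("local: %.2f ms, shared: %.2f ms%n", local, shared);
    }
}
```

With these placeholder numbers the local caches average about 9.8 ms per lookup and the shared cache about 8.1 ms, even though a shared hit is fifty times slower than a local one. A 'hot' working set under 1 GB would push the local hit rate to 100% and reverse the result.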
In practice, the optimum configuration will usually (IMHO!) be to have a small but very fast local cache for the hottest data, then a larger and slower cache for the long tail. You will probably recognise that as the shape of other cache hierarchies: consider the way that processors have small, fast L1 caches for each core, then slower L2/L3 caches shared between all the cores on a single die, then perhaps yet slower off-chip caches shared by all the dies in a system (do any current processors actually use off-chip caches?).
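A minimal sketch of that two-level arrangement follows. TwoTierCache is a hypothetical class: the shared tier and the database are passed in as plain functions rather than a real memcached client, and the local tier is a small LRU map built on LinkedHashMap in access order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical two-level cache: small local LRU in front of a larger shared cache. */
final class TwoTierCache<K, V> {
    private final Map<K, V> local;
    private final Function<K, V> sharedLookup; // stand-in for a shared-cache get(); returns null on a miss
    private final Function<K, V> loader;       // stand-in for the database query

    TwoTierCache(int localCapacity, Function<K, V> sharedLookup, Function<K, V> loader) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.local = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > localCapacity;
            }
        };
        this.sharedLookup = sharedLookup;
        this.loader = loader;
    }

    /** Tries the local tier, then the shared tier, then the database. */
    synchronized V get(K key) {
        V value = local.get(key);
        if (value != null) {
            return value;
        }
        value = sharedLookup.apply(key);
        if (value == null) {
            value = loader.apply(key); // a real setup would also populate the shared tier here
        }
        if (value != null) {
            local.put(key, value);
        }
        return value;
    }
}
```

The sketch leaves out expiry and invalidation, which matter a great deal once the local copies can go stale.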
You are neglecting the cost of disk I/O in your consideration, which is generally going to be the slowest part of any process, and is the main driver IMO for utilizing in-memory caching like memcached.
Memory caches use RAM over the network. Replication uses both RAM and persistent disk storage to fetch data. Their purposes are very different.
If you're only thinking of using Memcached to store easily obtainable data such as 1-1 mappings for table records, you're going to have a bad time.
On the other hand, if your data is the entire result set of a complex SQL query that may even overflow the SQL memory pool (and need to be temporarily written to disk to be fetched), you're going to see a big speed-up.
The previous example mentions needing to write data to disk for a read operation. Yes, it happens if the result set is too big for memory (imagine a CROSS JOIN), which means that you both read from and write to that drive (thrashing comes to mind).
In a highly optimized application written in C, for example, you may have a total processing time of 1 microsecond and may need to wait for networking and/or serialization/deserialization (marshaling/unmarshaling) for much longer than the app execution time itself. That's when you'll begin to feel the limitations of memory caching over the network.

Managing multiple outgoing TCP connections

My program needs to send data to multiple (about 50) "client" stations. Important bits of data must be sent over TCP to ensure arrival. The connections are mostly fixed and are not expected to change during a single period of activity of the program.
What do you think would be the best architecture for this? I've heard that creating a new thread per connection is generally not recommended, but is this recommendation valid when connections are not expected to change? Scalability would be nice to have but is not much of a concern as the number of client stations is not expected to grow.
The program is written in Java if it matters.
Thanks,
Alex
If scalability, throughput and memory usage are not a concern, then using 50 threads is an adequate solution. It has the advantage of being simple, and simplicity is a good thing.
If you want to be able to scale, or you are concerned about memory usage (N threads implies N thread stacks) then you need to consider an architecture using NIO selectors. However, the best architecture probably depends on things like:
the amount of work that needs to be performed for each client station,
whether the work is evenly spread (on average),
whether the work involves other I/O, access to shared data structures, etc and
how close the aggregate work is to saturating a single processor.
50 threads is fine, go for it. It hardly matters. Anything over 200 threads, start to worry.
I'd use a thread pool anyway. Depending on your thread pool configuration it will create as many threads as you need, but this solution is more scalable. It will be OK not only for 50 but also for 5000 clients.
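A sketch of the thread pool variant, under stated assumptions: StationBroadcaster is a hypothetical class, and the actual TCP write is abstracted behind a BiConsumer so the example runs without sockets. A real program would keep one long-lived pool and one open socket per station instead of creating a pool per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

/** Hypothetical helper that sends one payload to many stations using a bounded thread pool. */
final class StationBroadcaster {
    /** Returns the number of stations for which the sender completed without throwing. */
    static int broadcast(List<String> stations, byte[] payload, int threads,
                         BiConsumer<String, byte[]> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String station : stations) {
                pending.add(pool.submit(() -> sender.accept(station, payload)));
            }
            int delivered = 0;
            for (Future<?> result : pending) {
                try {
                    result.get(); // waits for this station's send to finish
                    delivered++;
                } catch (ExecutionException e) {
                    // The send failed for this station; a real program would log or retry.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return delivered;
        } finally {
            pool.shutdown();
        }
    }
}
```

With threads set to the number of stations this behaves like thread-per-connection. A smaller value bounds the number of threads, at the price of some sends waiting their turn.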
Why don't you limit the number of threads by using something like a connection pool?

Connection Pool Strategy: Good, Bad or Ugly?

I'm in charge of developing and maintaining a group of Web Applications that are centered around similar data. The architecture I decided on at the time was that each application would have their own database and web-root application. Each application maintains a connection pool to its own database and a central database for shared data (logins, etc.)
A co-worker has been positing that this strategy will not scale because of the number of separate connection pools. He says we should refactor so that all of the applications use a single central database, with any modifications unique to one system reflected in that one database, and a single pool managed by Tomcat. He has posited that a lot of "meta data" goes back and forth across the network to maintain a connection pool.
My understanding is that, with proper tuning to use only as many connections as necessary across the different pools (low-volume apps getting fewer connections, high-volume apps getting more, etc.), the number of pools doesn't matter compared to the number of connections. More formally, the difference in overhead between maintaining 3 pools of 10 connections and 1 pool of 30 connections is negligible.
The reasoning behind initially breaking the systems into a one-app-one-database design was that there are likely going to be differences between the apps and that each system could make modifications on the schema as needed. Similarly, it eliminated the possibility of system data bleeding through to other apps.
Unfortunately there is not strong leadership in the company to make a hard decision. Although my co-worker is backing up his worries only with vagueness, I want to make sure I understand the ramifications of multiple small databases/connections versus one large database/connection pool.
Your original design is based on sound principles. If it helps your case, this strategy is known as horizontal partitioning or sharding. It provides:
1) Greater scalability - because each shard can live on separate hardware if need be.
2) Greater availability - because the failure of a single shard doesn't impact the other shards
3) Greater performance - because the tables being searched have fewer rows and therefore smaller indexes which yields faster searches.
Your colleague's suggestion moves you to a single point of failure setup.
As for your question about 3 connection pools of size 10 vs 1 connection pool of size 30, the best way to settle that debate is with a benchmark. Configure your app each way, then do some stress testing with ab (Apache Benchmark) and see which way performs better. I suspect there won't be a significant difference but do the benchmark to prove it.
If you have a single database, and two connection pools, with 5 connections each, you have 10 connections to the database. If you have 5 connection pools with 2 connections each, you have 10 connections to the database. In the end, you have 10 connections to the database. The database has no idea that your pool exists, no awareness.
Any meta data exchanged between the pool and the DB is going to happen on each connection. When the connection is started, when the connection is torn down, etc. So, if you have 10 connections, this traffic will happen 10 times (at a minimum, assuming they all stay healthy for the life of the pool). This will happen whether you have 1 pool or 10 pools.
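The counting argument is simple enough to write down. The sketch below uses a hypothetical ConnectionCount helper and only restates the arithmetic from this answer and from the question.

```java
import java.util.Arrays;

/** Hypothetical helper: what the database server sees is the sum over all pools. */
final class ConnectionCount {
    static int total(int... poolSizes) {
        return Arrays.stream(poolSizes).sum();
    }

    public static void main(String[] args) {
        System.out.println(total(5, 5));          // two pools of 5
        System.out.println(total(2, 2, 2, 2, 2)); // five pools of 2
        System.out.println(total(10, 10, 10));    // three pools of 10
        System.out.println(total(30));            // one pool of 30
    }
}
```

The first two layouts both come to 10 connections and the last two both come to 30, so per-connection traffic is the same however the connections are grouped into pools.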
As for "1 DB per app", if you're not talking to a separate instance of the database server for each DB, then it basically doesn't matter.
If you have a DB server hosting 5 databases, and you have connections to each database (say, 2 connection per), this will consume more overhead and memory than the same DB hosting a single database. But that overhead is marginal at best, and utterly insignificant on modern machines with GB sized data buffers. Beyond a certain point, all the database cares about is mapping and copying pages of data from disk to RAM and back again.
If you had a large redundant table duplicated across all of the DBs, then that could potentially be wasteful.
Finally, when I use the word "database", I mean the logical entity the server uses to coalesce tables. For example, Oracle really likes to have one "database" per server, broken up into "schemas". Postgres has several DBs, each of which can have schemas. But in any case, all of the modern servers have logical boundaries of data that they can use. I'm just using the word "database" here.
So, as long as you're hitting a single instance of the DB server for all of your apps, the connection pools et al don't really matter in the big picture as the server will share all of the memory and resources across the clients as necessary.
Excellent question. I don't know which way is better, but have you considered designing the code in such a way that you can switch from one strategy to the other with the least amount of pain possible? Maybe some lightweight database proxy objects could be used to mask this design decision from higher-level code. Just in case.
Database- and overhead-wise, 1 pool with 30 connections and 3 pools with 10 connections are largely the same assuming the load is the same in both cases.
Application-wise, the difference between having all data go through a single point (e.g. service layer) vs having per-application access point may be quite drastic; both in terms of performance and ease of implementation / maintenance (consider having to use distributed cache, for example).
Well, excellent question, but it's not easy to decide between the several-databases approach (A) and the one big database (B):
1) It depends on the database itself. Oracle, for example, behaves differently from Sybase ASE regarding the LOG (and therefore the LOCK) strategy. It might be better to use several small databases to keep lock contention low if there are a lot of parallel writes and the DB uses a pessimistic locking strategy (Sybase).
2) If the table spaces of the small databases aren't spread over several disks, it might be better to use one big database so that the (buffer/cache) memory serves only one. I think this is rarely the case.
3) (A) scales better for a reason other than performance: you can move a hot-spot database to different (newer/faster) hardware when needed without touching the other databases. In my former company this approach was always cheaper than variant (B) (no new licenses).
I personally prefer (A) for reason 3.
Design, architecture, plans and great ideas fall short when there is no common sense or simple math behind them. Some more practice and/or experience helps ... Here is the simple math of why 10 pools with 5 connections each are not the same as 1 pool with 50 connections:
Each pool is configured with min and max open connections. The fact is that it will usually (99% of the time) use 50% of the min number (2-3 in the case of a min of 5); if it is using more than that, the pool is misconfigured, since it is opening and closing connections all the time (expensive) ... So we have 10 pools with 5 min connections each = 50 open connections ... that means 50 TCP connections, with 50 JDBC connections on top of them ... (have you ever debugged a JDBC connection? You will be surprised how much meta data flows both ways ...)
If we have 1 pool (serving the same infrastructure as above) we can set the min to 30, simply because it will be able to balance the extras more efficiently ... this means 20 fewer JDBC connections. I don't know about you, but for me this is a lot ...
The devil is in the details - the 2-3 connections that you leave in each pool to make sure it doesn't open/close all the time ...
I don't even want to go into the overhead of managing 10 pools ... (I do not want to maintain 10 pools, each one ever so slightly different from the others, do you?)
Now that you've got me started on this: if it were me, I would "wrap" the DB (the data source) with a single app (service layer, anyone?) that would provide different services (REST/SOAP/WS/JSON - pick your poison), and my applications wouldn't even know about JDBC, TCP, etc. Oh, wait, Google has it - GAE ...
