I have created a custom web API for submitting data to a MySQL server for my Java application. Within the application I need to add/update 200 rows, and I already have it making these requests one at a time.
This can be pretty time-consuming, so can I create threads for all of these different connections?
Should I limit the number of maximum connections made at a time? Maybe like 10 at a time?
Will this cause any issues with MySQL possibly adding rows at exactly the same time? (No two rows would ever need to be changed at the same point in time.)
Inserting records over multiple connections will potentially speed up the 200 inserts, but you can only know for sure by testing and measuring after introducing multiple threads. I would also suggest trying a JDBC batch and sending all 200 inserts to the database in one go (if that is possible in your implementation), as that might provide a performance boost by saving round trips to the database.
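For illustration, here is a minimal sketch of that kind of batching with a plain JDBC PreparedStatement. The player table, its columns and the row type are made up for the example; with MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL usually helps batches further.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BatchInsertExample {

        // Queues all rows locally and sends them to MySQL in one batch,
        // committed as a single transaction. Table/columns are hypothetical.
        static void insertPlayers(Connection conn, List<String[]> rows) throws SQLException {
            String sql = "INSERT INTO player (name, score) VALUES (?, ?)";
            conn.setAutoCommit(false);                       // one commit for the whole batch
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setInt(2, Integer.parseInt(row[1]));
                    ps.addBatch();                           // queue locally, no round trip yet
                }
                ps.executeBatch();                           // send all inserts in one go
                conn.commit();
            }
        }
    }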
To create a connection pool, look at HikariCP, which is a JDBC connection pool implementation. It will allow you to specify the min/max concurrent connections along with other settings. Your worker threads can then request connections from the pool and perform the inserts.
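As a rough sketch of that setup (the JDBC URL, credentials and player table are placeholders), a fixed-size HikariCP pool plus a worker thread pool could look like this:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PooledInsertExample {
        public static void main(String[] args) throws Exception {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
            config.setUsername("user");
            config.setPassword("secret");
            config.setMaximumPoolSize(10);   // cap concurrent DB connections, e.g. 10 at a time

            try (HikariDataSource ds = new HikariDataSource(config)) {
                ExecutorService workers = Executors.newFixedThreadPool(10);
                for (int i = 0; i < 200; i++) {
                    final int id = i;
                    workers.submit(() -> {
                        try (Connection conn = ds.getConnection();   // borrow from the pool
                             PreparedStatement ps = conn.prepareStatement(
                                     "INSERT INTO player (id, name) VALUES (?, ?)")) {
                            ps.setInt(1, id);
                            ps.setString(2, "player-" + id);
                            ps.executeUpdate();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                }
                workers.shutdown();
                workers.awaitTermination(1, TimeUnit.MINUTES);
            }
        }
    }

Whether 10 threads is the right cap depends on the database and hardware; measure with a few different pool sizes.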
Inserting multiple records concurrently could run into issues at the MySQL level if it acquires a table lock for each insert. In that case you would not get a speed improvement from multiple threads and might need some tuning at the database level to work around it. Here's a good article that should help: High rate insertion with MySQL
I am sure this has been asked before, but I had a hard time finding the proper search terms and was unable to find any information, so I apologize if it is a duplicate.
Consider the following scenario:
A game server is backed by an SQL database for player storage and logging. Every time a player logs in, data is retrieved and written. Also, every few seconds (20 seconds or so) the logs are written to the database, including changed data about the players.
I am wondering how to handle these connections. Keeping the connection open forever is a bad idea because the MySQL server closes it after "inactivity".
Opening a connection each time works, but I am wondering whether it is the best approach or whether there is another possibility.
That is what connection pools are good for. Try HikariCP; it's extremely fast. You can use it on top of plain JDBC, as well as with JPA or O/R mappers. It will keep a set of connections open (pooling) and manage their reuse if you have a lot of concurrent connections.
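A minimal sketch of what that looks like for a game server (table name, columns and connection details are placeholders); each operation borrows a connection from the pool instead of opening a fresh one:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PlayerStore {
        private final HikariDataSource ds;

        public PlayerStore(String jdbcUrl, String user, String password) {
            HikariConfig cfg = new HikariConfig();
            cfg.setJdbcUrl(jdbcUrl);
            cfg.setUsername(user);
            cfg.setPassword(password);
            cfg.setMinimumIdle(2);        // keep a couple of connections warm
            cfg.setMaximumPoolSize(10);   // upper bound on concurrent connections
            ds = new HikariDataSource(cfg);
        }

        // Borrow a connection per operation; the pool takes care of replacing
        // connections the MySQL server has closed for inactivity.
        public int loadScore(long playerId) throws SQLException {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT score FROM player WHERE id = ?")) {
                ps.setLong(1, playerId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getInt(1) : 0;
                }
            }
        }
    }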
If you have to store logs in the database, there are several logging frameworks that already have functions to do so. For example, Logback has a DBAppender that works on top of connection pools:
"..., sending 500 logging requests to the aforementioned MySQL database takes around 0.5 seconds, for an average of 1 millisecond per request, that is a tenfold improvement in performance." (source)
I am developing an online mobile game. I have several server machines running numerous instances of a Java socket server application.
Player data has to be stored somewhere (their profiles, items etc). I want to use the H2 database for this purpose.
Now, here's the tricky part: I want all the player data to be stored in the same H2 database. That is, all my server applications will access the data by remotely connecting to one particular machine over TCP, out of convenience.
The thing is, we are expecting a very large amount of clients on launch. For each client, a connection to the H2 database is created. The obvious concern here is whether one single H2 database process can handle so many connections concurrently.
From the website:
There is no limit on the number of database open concurrently per server, or on the number of open connections.
Given the above fact, in theory, if our server machine has enough resources (memory, space, CPUs, etc), then yes, the H2 database should be able to handle as many concurrent connections as our resources allow.
But there is something unclear to me:
Does the H2 process create a thread for each remote connection? I ask this because I once read that in Windows (our VPS' OS), a thread is stored as a short type, and hence the maximum number of threads an application can spawn is roughly 32,000 (I don't know the math they used to get that number). In that case, the H2 process does have a limit on concurrent connections - which is troubling because I do indeed expect more than 32,000 clients connected.
Of course, it would seem wise to discard the idea of having one single H2 database for all my clients. But I'd like to know if the above statement is correct: can H2 handle more than 32,000 remote database connections?
Let's take this in parts:
"Does the H2 process create a thread for each remote connection?"
An application should normally use one connection per thread. An H2 database synchronizes access to the same connection, but other databases may not do this.
"can H2 handle more than 32,000 remote database connections?"
If you want to access the same database at the same time from different processes or computers, you need to use the client/server mode. The JdbcConnectionPool class has the default maximum number of connections set to 10, but it provides a setter to change it if you want. In theory, you could set it to Integer.MAX_VALUE, but I don't think that is wise. Why? For starters, because of the synchronization point made in the previous section. Another point to consider: if your application opens and closes connections a lot (for example, for each request), you should use a connection pool, because opening a DB connection is very slow.
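For reference, a small sketch of using H2's JdbcConnectionPool in client/server mode (the TCP URL and credentials are placeholders):

    import org.h2.jdbcx.JdbcConnectionPool;

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class H2PoolExample {
        public static void main(String[] args) throws SQLException {
            // Client/server mode URL; host and database name are placeholders.
            JdbcConnectionPool pool = JdbcConnectionPool.create(
                    "jdbc:h2:tcp://localhost/~/gamedata", "sa", "");
            pool.setMaxConnections(50);   // default is 10; raise it cautiously

            try (Connection conn = pool.getConnection();
                 Statement st = conn.createStatement()) {
                st.execute("SELECT 1");
            }
            pool.dispose();               // shut the pool down when the server stops
        }
    }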
"Of course, it would seem wise to discard the idea of having one single H2 database for all my clients"
It might be, but you have to keep in mind that the number of open databases is limited by the available memory. If you are running on a powerful server, it might be a good option to consider. Then again, it might not.
I have a Java app with more than 100 servers. Currently each server opens connections to 7 database schemas in a relational database (logging, this, that, the other). All schemas connect to the same DB cluster, but it's all effectively one database instance.
The server-managed connection pools open a handful of connections (1-5) on each database schema per instance, then double that on a redundant pool. So each server opens a minimum of 30 database connections and can grow to a maximum of several hundred per server, and again, there are more than 100 servers.
All in all, the minimum number of database connections used is 3000, and this can grow to ludicrous numbers.
Obviously this is just plain wrong. The database cluster can only efficiently handle X concurrent requests, and any number of requests > X introduces unnecessary contention and slows the whole lot down. (X is unknown, but it is way smaller than the 3000 minimum concurrent connections.)
I want to lower the total connections used by implementing the following strategy:
Connect to one schema only (Application-X), have 6 connections per pool maximum.
Write a layer above the pool that will switch to the schema I want. The getConnection(forSchema) function will take a parameter for the target schema (e.g. logging), will get a connection that could last have been pointing to any schema, and will issue a schema-switch SQL statement (set search_path to 'target_schema').
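Roughly, the layer I have in mind would look something like this (just a sketch; the underlying pool and schema names are illustrative, and schema names would come from a fixed set, not user input):

    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Sketch of the proposed layer: one underlying pool for Application-X,
    // with the schema switched on checkout via "set search_path".
    public class SchemaSwitchingPool {
        private final DataSource pool;   // e.g. a HikariCP or DBCP pool

        public SchemaSwitchingPool(DataSource pool) {
            this.pool = pool;
        }

        public Connection getConnection(String forSchema) throws SQLException {
            Connection conn = pool.getConnection();
            try (Statement st = conn.createStatement()) {
                // forSchema must come from a known whitelist, never user input.
                st.execute("set search_path to '" + forSchema + "'");
            } catch (SQLException e) {
                conn.close();
                throw e;
            }
            return conn;
        }
    }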
Please do not comment on whether this approach is right or wrong. Because 'it depends' needs to be considered, such comments will not add value.
My question is whether there is a DB pool implementation out there that already does this - allows me to have one set of connections and automatically places me at the right schema, or better yet - tracks whether a pooled connection is available for your target schema before making a decision to go ahead and switch the schema (saves a DB round trip).
I would also like to hear from anyone else who has a similar issue (real-world experience) if you solved it in a different way.
Having learned the hard way myself, the best way to stabilize the number of database connections between a web application and a database is to put a reverse proxy in front of the web application.
Here's why it works:
A slow part of a web request can be returning the data to the client. If there's a lot of data or the user is on a slow connection, the connection can remain open to the web server while the data dribbles out to the client slowly. Meanwhile, the application server continues to hold a database connection open to the backend server. While the database connection may only be needed for a fraction of the transaction, it's tied up until the client disconnects from the application server.
Now, consider what happens when a reverse proxy is added in front of the app server. When the app server has a response prepared, it can quickly reply to the reverse proxy in front of it and free up the database connection behind it. The reverse proxy can then handle slowly dribbling out responses to users, without keeping a related database connection tied up.
Before I made this architectural change, there were a number of traffic spikes that resulted in death spirals: the database handle usage would spike to exhaustion, and things would go downhill from there.
After the change, the number of the database handles required was both far less and far more stable.
If a reverse proxy is not already part of your architecture, I recommend it as a first step to control the number of database connections you require.
Fundamentally, this question is about: can the same DB connection be used across multiple processes (as different map-reduce jobs really are different, independent processes)?
I know this is a somewhat trivial question, but it would be great if somebody could answer this as well: what happens if the maximum number of connections to the DB (which is preconfigured on the server hosting the DB) has been exhausted and a new process tries to get a new connection? Does it wait for some time, and if so, is there a way to set a timeout for this wait period? I am talking about a PostgreSQL DB in this particular case, and the language used for talking to the DB is Java.
To give you some context for the problem: I have multiple map-reduce jobs (about 40 reducers) running in parallel, each wanting to update a PostgreSQL DB. How do I efficiently manage these DB reads/writes from these processes? Note: the DB is hosted on a separate machine, independent of where the map-reduce jobs are running.
Connection pooling is one option, but it can be very inefficient at times, especially for several reads/writes per second.
Can the same DB connection be used across multiple processes
No, not in any sane or reliable way. You could use a broker process, but then you'd be one step away from inventing a connection pool anyway.
What happens if the maximum number of connections to the DB (which is preconfigured on the server hosting the DB) has been exhausted and a new process tries to get a new connection?
The connection attempt fails with SQLSTATE 53300 too_many_connections. If it waited, the server could exhaust other limits and begin to have issues servicing existing clients.
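If you really wanted a new process to wait briefly instead of failing immediately, you could do that client-side with a bounded retry on that SQLSTATE. A rough sketch (an external pooler, as described below, is still the better fix):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ConnectRetry {
        // Hypothetical client-side workaround: retry for up to retryMillis when the
        // server reports SQLSTATE 53300 (too_many_connections). PostgreSQL itself
        // does not queue the connection attempt.
        static Connection connectWithRetry(String url, String user, String pw,
                                           long retryMillis)
                throws SQLException, InterruptedException {
            long deadline = System.currentTimeMillis() + retryMillis;
            while (true) {
                try {
                    return DriverManager.getConnection(url, user, pw);
                } catch (SQLException e) {
                    if (!"53300".equals(e.getSQLState())
                            || System.currentTimeMillis() >= deadline) {
                        throw e;   // not a connection-limit error, or out of time
                    }
                    Thread.sleep(500);   // back off briefly before trying again
                }
            }
        }
    }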
For a problem like this you'd usually use tools like C3P0 or DBCP that do in-JVM pooling, but this won't work when you have multiple JVMs.
What you need to do is to use an external connection pool like PgBouncer or PgPool-II to maintain a set of lightweight connections from your workers. The pooler then has a smaller number of real server connections and shares those between the lightweight connections from clients.
Connection pooling is typically more efficient than not pooling, because it allows you to optimise the number of active PostgreSQL worker processes to the hardware and workload, providing admission control for work.
An alternative is to have a writer process with one or more threads (one connection per thread) that takes finished work from the reduce workers and writes to the DB, so the reduce workers can get on to their next unit of work. You'd need to have a way to tell the reduce workers to wait if the writer got too far behind. There are several Java queueing system implementations that would be suitable for this, or you could use JMS.
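A bare-bones sketch of such a writer, using a bounded BlockingQueue so the reduce workers block when the writer falls behind (table and column names are made up for the example):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class DbWriter implements Runnable {
        // Bounded queue: reducers block on put() if the writer gets too far behind.
        private final BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(10_000);
        private final String url, user, password;

        public DbWriter(String url, String user, String password) {
            this.url = url;
            this.user = user;
            this.password = password;
        }

        // Called by reduce workers; blocks when the queue is full.
        public void submit(String[] row) throws InterruptedException {
            queue.put(row);
        }

        @Override
        public void run() {
            try (Connection conn = DriverManager.getConnection(url, user, password);
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO results (k, v) VALUES (?, ?)")) {   // hypothetical table
                while (!Thread.currentThread().isInterrupted()) {
                    String[] row = queue.take();
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.executeUpdate();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }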
See IPC Suggestion for lots of small data
It's also worth optimizing how you write to PostgreSQL as much as possible, using:
Prepared statements
A commit_delay
synchronous_commit = 'off' if you can afford to lose a few transactions if the server crashes
Batching work into bigger transactions
COPY or multi-valued INSERTs to insert blocks of data (see the sketch after this list)
Decent hardware with a useful disk subsystem, not some Amazon EC2 instance with awful I/O or a RAID 5 box with 5400rpm disks
A proper RAID controller with battery backed write-back cache to reduce the cost of fsync(). Most important if you can't do big batches of work or use a commit delay; has less impact if your fsync rate is lower because of batching and group commit.
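To illustrate the batching and multi-valued INSERT points above, here is a rough sketch that inserts a chunk of rows in one statement and one transaction (table and columns are invented for the example):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class FastInsert {
        // One multi-valued INSERT and one commit per chunk of rows
        // (table and columns are hypothetical).
        static void insertChunk(Connection conn, List<String[]> rows) throws SQLException {
            if (rows.isEmpty()) {
                return;
            }
            StringBuilder sql = new StringBuilder("INSERT INTO results (k, v) VALUES ");
            for (int i = 0; i < rows.size(); i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                int p = 1;
                for (String[] row : rows) {
                    ps.setString(p++, row[0]);
                    ps.setString(p++, row[1]);
                }
                ps.executeUpdate();
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }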
See:
http://www.postgresql.org/docs/current/interactive/populate.html
http://www.depesz.com/index.php/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/
I've been approached by a programmer who has a background in Oracle Forms and is moving into the Java realm. He asked me a question that I didn't have a good answer for. Instead of answering with "I always do it that way" or "that's how I was taught", I figured I'd do some research.
Question: with Java's multi-threading capabilities, why don't you set up a JDBC connection to the database for each user, each on its own thread, instead of setting up a connection pool and applying security to control which users can access the pool?
A connection pool scales better. If you have a dedicated connection per user, you will need 50 connections for 50 users. With a pool, you can probably do with something like 10 - 20 connections to handle the 50 users (depending on the use case). Now take a look at a bigger group and think about handling 500, 5000 or 50000 users and you will see that the 1 connection per user model does not scale.
With a connection pool (and a thread pool), each request will still be handled by one thread and by one database connection, but they will be taken from a pool instead of a dedicated one per user.
Because the overhead of maintaining a lot of threads is not efficient. What happens when there are more users than your application server or database can handle? What happens when there's a lot of contention on a database table?
By using a connection pool, you don't have to "connect" to the database from scratch each time, so you get a big performance boost from re-using database connections from the pool.
Something like Apache's DBCP is a good place to start if you're not using an application server.
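As a starting point, a minimal DBCP2 BasicDataSource setup might look like this (URL, credentials, pool sizes and the users table are placeholders):

    import org.apache.commons.dbcp2.BasicDataSource;

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class DbcpExample {
        private static final BasicDataSource DS = new BasicDataSource();

        static {
            DS.setUrl("jdbc:mysql://localhost:3306/app");   // placeholder URL
            DS.setUsername("app_user");
            DS.setPassword("secret");
            DS.setInitialSize(5);     // connections opened up front
            DS.setMaxTotal(20);       // roughly enough for ~50 concurrent users
        }

        // Each request borrows a connection and returns it to the pool on close().
        static String loadUserName(long userId) throws SQLException {
            try (Connection c = DS.getConnection();
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT name FROM users WHERE id = ?")) {
                ps.setLong(1, userId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }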