I am writing an ETL project in Java. I will connect to the source database, get the data only once, do some transformations, and load the data into a target database.
The point is that I am not connecting to the source or the target database repeatedly. I just connect once (using JDBC), get the data I need, and close the connection.
Should I still use connection pooling?
Thank you for your views!
Connection pooling is used to get around the fact that many database drivers take a long time to create a connection. If you only need a connection briefly and then discard it, the overhead can be substantial (both in time and CPU) when you need many connections. It is simply faster to reuse a connection than to create a new one.
If you do not have that need, there is no reason to set up a connection pool. If you happen to have one already, then just use it.
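To make that concrete, a one-off extract can just open and close a single plain JDBC connection; a minimal sketch (the URL, credentials, table and query below are made up for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OneShotExtract {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; replace with your own.
        String url = "jdbc:mysql://source-host:3306/sourcedb";

        // try-with-resources closes the connection, statement and result set
        // even if an exception is thrown, so no pool is needed for a single pass.
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM source_table")) {
            while (rs.next()) {
                // transform each row here, then load it (or collect rows for a batch load)
                System.out.println(rs.getLong("id") + " -> " + rs.getString("payload"));
            }
        }
    }
}
```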
My guess is that in some circumstances, using several threads and concurrent connections could improve the overall throughput of your software, for example by using all the CPUs of your RDBMS server or of the ETL client. It could also take advantage of the fact that several tables may sit physically on different hardware and thus be accessed in parallel.
The real impact will depend on the machines you use and the architecture of the database.
Be careful: ETL jobs typically have ordering constraints, and doing several things at the same time must not violate these constraints.
Edit: An example of this. You can configure Oracle to execute each request using several cores or not (depending on configuration and licence, if I understand correctly). So if one request is allowed to use only one core, opening several connections at the same time allows several requests to run concurrently and makes better use of the server's CPU resources.
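If you do experiment with concurrent extraction along those lines, the important detail is one connection per thread. A rough sketch of the idea (the driver URL, credentials and table names are placeholders, not anything from the question):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelExtract {
    public static void main(String[] args) throws Exception {
        // Hypothetical source tables; a fixed internal list, not user input.
        List<String> tables = List.of("customers", "orders", "items");
        ExecutorService pool = Executors.newFixedThreadPool(tables.size());

        for (String table : tables) {
            pool.submit(() -> {
                // Each task opens and closes its own connection; JDBC connections
                // must not be shared between threads.
                try (Connection con = DriverManager.getConnection(
                             "jdbc:oracle:thin:@//source-host:1521/SRC", "user", "password");
                     Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery("SELECT * FROM " + table)) {
                    while (rs.next()) {
                        // transform/load the rows of this table
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}
```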
I'm about to improve the efficiency of a cache-heavy system, which has the following properties/architecture:
The system has 2 components, a single instance backend and multiple frontend instances, spread across remote data centers.
The backend generates data and writes it to a relational database that is replicated to multiple data centers.
The frontends handle client requests (typical web traffic) by reading data from the database and serving it. Data is stored in a local cache for an hour before it expires and has to be retrieved again.
(The cache’s eviction policy is LRU based).
I want to mention that there are two issues with the implementation above:
It turns out that many of the database accesses are redundant because the underlying data didn't actually change.
On the other hand, a change isn't reflected until the cache TTL elapses, causing staleness issues.
Can you advise on a solution that fixes both of these problems?
Should the solution change if the data were stored in a NoSQL DB like Cassandra rather than a classic relational database?
Unfortunately, there is no silver bullet here. There are two obvious variants:
Keep a long TTL or cache forever, but invalidate the cached data when it is updated. This can get quite complex and error-prone.
Simply lower the TTL to get faster updates. The low-TTL approach is IMHO the KISS approach. We go as low as 27 seconds. A cache with such a low TTL does not get many hits during normal operation, but it helps a lot when a flash crowd hits your application.
If your database is powerful enough and has acceptable latency, approach 2 is the simplest one.
If your database does not have acceptable latency, or your application does multiple sequential reads from the database per web request, then you can use a cache that provides refresh-ahead or background refresh. This means the cache refreshes the entries automatically, and there is no additional latency except for the first read. However, this approach comes with the downside of increasing the database load.
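To illustrate the refresh-ahead idea, here is a small sketch using the Caffeine caching library (the loader, key and durations are made-up examples); entries are reloaded in the background while the currently cached value keeps being served:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public class RefreshAheadExample {
    // Hypothetical loader that reads the value for a key from the database.
    static String loadFromDatabase(String key) {
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        LoadingCache<String, String> cache = Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(Duration.ofHours(1))    // hard TTL, as in the current setup
                .refreshAfterWrite(Duration.ofMinutes(5)) // trigger a background reload well before expiry
                .build(RefreshAheadExample::loadFromDatabase);

        // The first read blocks on the loader; later reads return the cached value,
        // and reads after the refresh threshold trigger an asynchronous reload
        // without adding latency to the request.
        System.out.println(cache.get("user:42"));
    }
}
```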
Cassandra may not support the same access strategies as a classic database. Changing to Cassandra will affect your caching as well, e.g. if you also cache query results. However, the high-level concept stays the same. Your data access layer may change to an asynchronous or reactive pattern, since Cassandra has support for that.
If you want to do invalidation (solution 1) with Cassandra, you can get information from the database about which data has been updated, see CASSANDRA-8844. You may get similar information from "classical" SQL databases, but that is a vendor-specific feature.
It is getting burdensome for my team to prototype tables in MySQL that back our Java business applications, so I'm campaigning to use SQLite to prototype new tables and test them in our applications.
These tables are usually lightweight parameters, holding 12 to 1000 records at most. When the Java applications use them, they are likely to do so in a read-only capacity, and typically the data is ingested into memory and never read again.
Would I have a problem putting these prototype SQLite tables out on a network, as long as they are accessed read-only and in small volumes? Or should I copy them locally to everyone's machine? I know SQLite does not encourage concurrent access over a network, but I'd be surprised if more than one user hit it at the same time, given the number of users and the way our applications are architected.
If you are using a three-layer architecture, only the application server should have access to the database server. Therefore, you should have control over the connections (i.e. you can create a very small connection pool).
Embedded databases are not suited for lots (hundreds) of concurrent connections. Nevertheless, taking into account the amount of data and the fact that you will focus only on read-only queries, I doubt that would be a problem.
A major problem I foresee is with SQL dialects. Embedded databases usually stick close to the ANSI SQL standard, but MySQL and others let you use their own SQL dialects, which are incompatible. It's usually good practice to have a unit test that runs all the SQL queries against an embedded database to verify that they are ANSI-compliant. This way, you have a guarantee that you can use your application (automatically or manually) with the embedded database.
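As a sketch of such a test, assuming the xerial sqlite-jdbc driver and JUnit 5 (the schema and query are invented for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

class ParameterTableQueryTest {

    @Test
    void queriesRunAgainstEmbeddedSqlite() throws Exception {
        // In-memory SQLite database: nothing touches disk, each test starts clean.
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = con.createStatement()) {

            // Hypothetical parameter table mirroring the prototype schema.
            st.executeUpdate("CREATE TABLE rate_params (code TEXT PRIMARY KEY, rate REAL)");
            st.executeUpdate("INSERT INTO rate_params VALUES ('STD', 0.05)");

            // The query the application would issue; it fails here if it relies on a vendor dialect.
            try (ResultSet rs = st.executeQuery("SELECT rate FROM rate_params WHERE code = 'STD'")) {
                assertTrue(rs.next());
            }
        }
    }
}
```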
My Java application needs to gather information from different MySQL tables on startup. Without the database information, the application cannot be used. Hence, it takes up to a few seconds to start up (I reduce this time with a cache when possible).
Would it be bad practice to perform each of these SQL queries in a separate thread, allowing computers with multiple CPU cores to start the application even faster (no guarantees, I know)?
Are there any cons that I need to know about before implementing this "system"?
It's something you're going to have to try.
If you're bringing back relatively few rows from each table, it would probably take longer to establish the database connections in each of the threads (or a JDBC connection pool) than to establish one connection and run the queries sequentially.
Fortunately it's not a lot of code, so you should be able to try it out pretty quickly.
No, certainly not. For example, a Java web server like Tomcat does this all the time when multiple users access your web application concurrently.
Just make sure you properly manage your data integrity using transactions.
Executing the requests in parallel instead of serially may be a good idea.
Take care to use a DataSource, and each request must use its own connection to the database (never share a connection between different threads simultaneously).
Be sure that your connection pool and thread pool sizes are well matched.
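Putting those points together, a rough sketch of the idea; the DataSource is assumed to be any pooled implementation (C3P0, HikariCP, ...), and the queries, pool size and result handling are placeholders:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.sql.DataSource;

public class StartupLoader {

    // Runs each startup query on its own pooled connection.
    static List<List<String>> loadAll(DataSource dataSource, List<String> queries) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4); // keep this <= connection pool size
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String sql : queries) {
                futures.add(executor.submit(() -> {
                    // Each task borrows its own connection; a connection is never shared between threads.
                    try (Connection con = dataSource.getConnection();
                         Statement st = con.createStatement();
                         ResultSet rs = st.executeQuery(sql)) {
                        List<String> rows = new ArrayList<>();
                        while (rs.next()) {
                            rows.add(rs.getString(1)); // placeholder: read the first column only
                        }
                        return rows;
                    }
                }));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get()); // re-throws any query failure
            }
            return results;
        } finally {
            executor.shutdown();
        }
    }
}
```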
Database sessions are relatively expensive objects. Parallelizing to a few threads is no problem, but don't create 1000 threads if you have 1000 tables.
Furthermore, multithreading comes with complexity and potentially huge maintenance costs (for example, unreproducible issues resulting from race conditions). So do your measurements, and if you find out that the speed-up is just a few percent, put everything back.
There are other ways to avoid the latency you see. For example, you can send multiple queries in a single command batch, thus reducing the number of roundtrips between your code and the database.
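As a sketch of that batching idea with MySQL Connector/J (this relies on the allowMultiQueries=true URL parameter; the tables are made up, and note that this setting also widens the impact of SQL injection, so use it only with trusted, hard-coded SQL):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MultiQueryStartup {
    public static void main(String[] args) throws Exception {
        // allowMultiQueries lets Connector/J send several statements in one command.
        String url = "jdbc:mysql://db-host:3306/appdb?allowMultiQueries=true";

        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement()) {

            // One roundtrip, several result sets.
            boolean hasResultSet = st.execute(
                    "SELECT * FROM settings; SELECT * FROM feature_flags; SELECT * FROM rates");

            while (hasResultSet) {
                try (ResultSet rs = st.getResultSet()) {
                    while (rs.next()) {
                        // consume this result set
                    }
                }
                hasResultSet = st.getMoreResults(); // advance to the next result set, if any
            }
        }
    }
}
```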
I'm implementing several Java SE applications on a single server. Is it possible to set up a single connection pool (e.g. C3P0) and share it among these applications? I just want an easy way to manage the total number of DB connections.
Are there any drawbacks to using such a centralized connection pool?
Thank you,
Wilson
You can simply use the same data source, defined once in the server, for all applications to share the same DB connection pool easily.
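For reference, assuming the applications actually run inside the same server or JVM (a pool cannot be shared across separate processes), a minimal C3P0 data source configured once and handed to every component might look roughly like this; the URL, driver, credentials and pool sizes are placeholders:

```java
import com.mchange.v2.c3p0.ComboPooledDataSource;
import javax.sql.DataSource;

public class SharedDataSource {

    // One pool for the whole JVM; every component asks this class for the DataSource
    // instead of opening its own connections.
    private static final ComboPooledDataSource POOL = new ComboPooledDataSource();

    static {
        try {
            POOL.setDriverClass("com.mysql.cj.jdbc.Driver"); // hypothetical driver
            POOL.setJdbcUrl("jdbc:mysql://db-host:3306/appdb");
            POOL.setUser("user");
            POOL.setPassword("password");
            POOL.setMinPoolSize(2);
            POOL.setMaxPoolSize(10); // the total connection budget managed centrally
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static DataSource get() {
        return POOL;
    }
}
```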
One obvious drawback is that the performance of an independent application may degrade due to load from a totally unrelated application, which would be hard to diagnose.
I have to write an architecture case study, but there are some things that I don't know, so I'd like some pointers on the following:
The website must handle 5k simultaneous users.
The backend is composed by a commercial software, some webservices, some message queues, and a database.
I want to recommend using Spring for the backend, to deal with the different elements and to expose some REST services.
I also want to recommend Wicket for the frontend (not the point here).
What I don't know is: should I install the frontend and the backend on the same Tomcat server or on two different ones? I am tempted to put two servers on the front, with a load balancer (no need for session replication in this case). But if I have two front servers, must I have two back servers? I don't want to create some kind of bottleneck.
Based on what I read on this blog, a really huge load is handled by a single Tomcat for the first website mentioned. But I cannot find any other info on this, so I can't tell whether it is plausible.
If you can enlighten me so I can go on with my case study, that would be really helpful.
Thanks :)
There are two main reasons for having multiple servers in each tier: high availability and performance. If you're not doing this for HA reasons, then the unfortunate answer is "it depends".
Having two front end servers doesn't force you to have two backend servers. Is the backend going to be under a sufficiently high load that it will require two servers? It will depend a lot on what it is doing, and would be best revealed by load testing and/or profiling. For a site handling 5000 simultaneous users, though, my guess would be yes...
It totally depends on your application. How heavy are your sessions? (Wicket is known for putting a lot in the session.) How heavy are your backend processes?
It might be a better idea to come up with something that can scale: a load balancer with the possibility of adding new servers as needed.
Measurement is the best thing you can do. Create JMeter scripts and find out where your app breaks. Build a plan from there.
To expand on my comment: think through the typical process by which a client makes a request to your server:
it initiates a connection, which has an overhead for both client and server;
it makes one or more requests via that connection, holding on to resources on the server for the duration of the connection;
it closes the connection, generally releasing application resources, but typically still hogging a port number on your server for some seconds after the connection is closed.
So in designing your architecture, you need to think about things such as:
how many connections can you actually hold open simultaneously on your server? If you're using Tomcat or another standard server with one thread per connection, you may have issues running 5,000 simultaneous threads (a NIO-based architecture, on the other hand, can handle thousands of connections without needing one thread per connection); if you're in a shared environment, you may simply not be able to have that many open connections;
if clients don't hold their connections open for the duration of a "session", what is the right balance between number of requests and/or time per connection, bearing in mind the overhead of making and closing a connection (initialisation of the encrypted session if relevant, network overhead in creating the connection, port "hogged" for a while after the connection is closed)?
Then more generally, I'd say consider:
in whatever architecture you go for, how easily can you re-architect/replace specific components if they prove to be bottlenecks?
for each "black box" component/framework that you use, what actual problem does it solve for you, and what are its limitations? (Don't just use Tomcat because your boss's mate's best man told them about it down the pub...)
I would also agree with what other people have said: at some point you need to stop being purely theoretical. Design something sensible, then run a test bed to see how it actually copes with your expected volumes of data. (You might not have the whole app built, but you can start making predictions along the lines of "we're going to have X clients sending Y requests every Z minutes, and p% of those requests will take n milliseconds and write r rows to the database"...)