Spring JDBC vs JDBC - java

I have been trying to use the Spring 3.0 SimpleJdbcTemplate and it takes 5 minutes to insert 1500 records, whereas it takes me a few seconds to insert them using straight JDBC. Not sure what I am doing wrong.

If you are building a batch job, consider using Spring Batch's JdbcBatchItemWriter with a proper chunk size setting; it will load these 1500 records in less than a second.
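A rough sketch of how such a writer might be wired up, assuming a hypothetical Person bean and person table (the SQL, names and chunk size are illustrative, not taken from the question; in a real job the writer would be declared as a bean and referenced from a chunk-oriented step):

JdbcBatchItemWriter<Person> writer = new JdbcBatchItemWriter<Person>();
writer.setDataSource(dataSource);
// named parameters are filled from the Person getters (firstName, lastName are placeholders)
writer.setSql("INSERT INTO person (first_name, last_name) VALUES (:firstName, :lastName)");
writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<Person>());
// the chunk-oriented step hands the writer a whole chunk at a time (e.g. commit interval 500),
// so each chunk becomes a single JDBC batch instead of 1500 separate statements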

Some things worth checking:
The overhead might be in the transaction managed by Spring at the application level. Look at what kind of transaction manager you are using (look for a bean named transactionManager). If you are using JTA, that's probably where your problem is. Since it's fast with plain JDBC, the bottleneck doesn't seem to be the DB.
Depending on how your app uses that transaction, it might be holding everything in memory until it finishes all 1500 inserts and commits. Do you see a large difference in memory usage (the Spring one should be a lot higher)?
What kind of DB connection pool are you using in each case?
Quick way to profile your app:
Get the pid - "jps -l"
Memory: jmap -histo PID (check if there's some form of memory leak)
Check what's going on under the hood: jstack PID (look for slow or recursive method calls)

How about using
jdbcTemplate.batchUpdate(new String[]{sql});
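For per-row parameters, JdbcTemplate also has a batchUpdate variant that takes a BatchPreparedStatementSetter, so all 1500 inserts go out as one JDBC batch. A minimal sketch, assuming a hypothetical persons list and person table:

final List<Person> persons = loadPersons();   // loadPersons() is a placeholder for your own data source
jdbcTemplate.batchUpdate(
        "INSERT INTO person (first_name, last_name) VALUES (?, ?)",
        new BatchPreparedStatementSetter() {
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setString(1, persons.get(i).getFirstName());
                ps.setString(2, persons.get(i).getLastName());
            }
            public int getBatchSize() {
                return persons.size();
            }
        });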

Related

What happens on the DB side when I use multi-threading for update operations?

Context of my question:
I use a proprietary database (the target database) and I cannot reveal its name (you may not know it even if I did).
I usually need to update records in it using Java (the number of records varies from 20,000 to 40,000).
Each update transaction takes one or two seconds on this DB, so the total execution time would be hours. No batch execution functions are available in this database's API. Because of this, I am thinking of using Java's multi-threading: instead of processing all the records in a single thread, I want to create a thread for every 100 records. We know that Java can make these threads run in parallel.
But I want to know how the DB processes these threads sharing the same connection. I could find out by running a trial program and comparing time intervals, but I feel that could be misleading to some extent. I know that you don't have much information about the database; you can answer this question assuming the DB is MS SQL or MySQL.
Please suggest any other Java feature I could use to make this program run faster, if not multi-threading.
It is not recommended to use a single connection with multiple threads; you can read about the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure the threads start and stop successfully within a transaction. If one of them fails, you have to make sure to roll back the changes. So, first get the count, build cursor ranges, and for each range start a thread that executes the updates on that range. One thing to watch for is not to close the connection after executing the partitions individually, but to close it when the whole transaction is complete and the DB is committed.
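To illustrate the range-partitioning idea, here is a minimal sketch. Because sharing one connection across threads is discouraged above, the sketch gives each worker its own connection from the pool; countRows and updateRange are hypothetical helpers, and the pool size, range size and SQL are placeholders:

void updateInParallel(final DataSource dataSource) throws Exception {
    int total = countRows(dataSource);                 // e.g. SELECT COUNT(*) FROM my_table
    int rangeSize = 100;
    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<Future<?>> results = new ArrayList<Future<?>>();
    for (int start = 0; start < total; start += rangeSize) {
        final int from = start;
        final int to = Math.min(start + rangeSize, total);
        results.add(pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
                Connection con = dataSource.getConnection();
                try {
                    con.setAutoCommit(false);
                    updateRange(con, from, to);        // run the UPDATEs for this range
                    con.commit();
                } catch (Exception e) {
                    con.rollback();                    // undo this range, let the caller decide on a retry
                    throw e;
                } finally {
                    con.close();
                }
                return null;
            }
        }));
    }
    for (Future<?> result : results) {
        result.get();                                  // propagates the first failure, if any
    }
    pool.shutdown();
}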
If you have the option of using the Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.

What is the best way to write a data loader?

I am using Spring 2.5 and the Hibernate that goes with it. I'm running against an Oracle 11g database.
I have created my DAOs, which extend HibernateTemplate. Now I want to write a loader that inserts 5 million rows into my person table. I have written this in a simple-minded fashion: read a row from a CSV file, turn it into a Person, save it into the table, and keep doing this until the CSV file is empty.
The problem is that I run out of heap space at about 450,000 rows. So I doubled the heap from 1024m to 2048m, and now I run out of memory after about 900,000 rows.
Hmmmmm....
So I've read some things about turning off the query cache for Hibernate, but I'm not using an L2 cache, so I don't think this is the issue.
I've read some things about JDBC2 batching, but I don't think that applies to Hibernate.
So, I'm wondering if maybe there's a fundamental thing about Hibernate that I'm missing.
To be honest, I wouldn't use Hibernate for that. ORMs are not designed to load millions of rows into a DB. Not saying that you can't, but it's a bit like digging a swimming pool with an electric drill; you'd use an excavator for that, not a drill.
In your case, I'd load the CSV directly into the DB with the loader utility that ships with the database. If you don't want to do that, then yes, batch inserts will be far more efficient. I don't think Hibernate lets you do that easily, though. If I were you, I'd just use plain JDBC, or at most Spring JDBC.
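A minimal plain-JDBC sketch of the batch-insert approach, assuming a two-column CSV and that it runs inside a method declaring throws Exception (file name, table and columns are placeholders; try/finally cleanup is omitted for brevity):

Connection con = dataSource.getConnection();
con.setAutoCommit(false);
PreparedStatement ps = con.prepareStatement(
        "INSERT INTO person (first_name, last_name) VALUES (?, ?)");
BufferedReader in = new BufferedReader(new FileReader("persons.csv"));
String line;
int count = 0;
while ((line = in.readLine()) != null) {
    String[] cols = line.split(",");
    ps.setString(1, cols[0]);
    ps.setString(2, cols[1]);
    ps.addBatch();
    if (++count % 1000 == 0) {
        ps.executeBatch();   // one round trip per 1000 rows
        con.commit();        // keeps the transaction and undo log small
    }
}
ps.executeBatch();           // flush the final partial batch
con.commit();
in.close();
ps.close();
con.close();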
If you have complicated business logic in the entities and absolutely have to use Hibernate, you could flush every N records as Richard suggests. However, I'd consider that a pretty bad hack.
In my experience with EclipseLink, holding a single transaction open while inserting/updating many records results in the symptoms you've experienced.
You are working with an EntityManager of some sort (JPA or Hibernate specific; it's still managing entities). It's trying to keep the working set in memory for the life of the transaction.
A general solution was to commit and then restart the transaction after every N inserts; a typical N for me was 1000.
As a footnote, with some version of EclipseLink (undefined, it's been a few years), a session flush/clear didn't solve the problem.
It sounds like you are running out of space due to your first-level cache (the Hibernate session). You can flush the Hibernate session periodically to keep memory usage down, and break the work into chunks by committing every few thousand rows, which also keeps the database's transaction log from getting too big.
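A minimal sketch of that flush/clear-and-commit-in-chunks pattern with a plain Hibernate Session (readPersonsFromCsv() is a hypothetical reader; setting hibernate.jdbc.batch_size as well lets Hibernate batch the underlying JDBC inserts):

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
int count = 0;
for (Person p : readPersonsFromCsv()) {
    session.save(p);
    if (++count % 1000 == 0) {
        session.flush();                 // push the pending inserts to JDBC
        session.clear();                 // detach them so the session stays small
        tx.commit();                     // keep each transaction short
        tx = session.beginTransaction();
    }
}
tx.commit();
session.close();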
But using Hibernate for a load task like that will be slow, because row-by-row JDBC is slow. If you have a good idea of what the environment will be like, you have a cap on the amount of data, and you have a big enough processing window, then you can manage. But if you want it to work at multiple different client sites and you want to minimize the time spent diagnosing problems when some client site's load job fails, then you should go with the database's bulk-copy tool.
With the bulk-copy approach, the database suspends all constraint checking, index building and transaction logging; instead it concentrates on slurping the data in as fast as possible. Because JDBC doesn't get anything like this level of cooperation from the database, it can't compete. At a previous job we replaced a JDBC loader task that took over 8 hours to run with a SQL*Loader task that took 20 minutes.
You do sacrifice database independence, but every database has a bulk-copy tool (because DBAs rely on them), so the process is very similar for each database; only the executable you invoke and the way the file formatting is specified should change. And this way you make the best use of your processing window.

Running Neo4j purely in memory without any persistence

I don't want to persist any data but still want to use Neo4j for its graph traversal and algorithm capabilities. With an embedded database, I've configured cache_type = strong, and after all the writes I mark the transaction as failed. But my write speeds (node and relationship creation) are slow, and this is becoming a big bottleneck in my process.
So the question is: can Neo4j be run without any persistence aspects at all, just as a pure in-memory API? I tried others like JGraphT, but those don't have traversal mechanisms like the ones Neo4j provides.
As far as I know, Neo4j data storage and Lucene indexes are always written to files. On Linux, at least, you could set up a ramfs file system to hold the files in memory.
See also:
Loading all Neo4J db to RAM
How many changes do you group into each transaction? You should try to group up to thousands of changes in each transaction, since committing a transaction forces the logical log to disk.
However, in your case you could instead begin your transactions with:
db.tx().unforced().begin();
Instead of:
db.beginTx();
This makes the transaction not wait for the logical log to be forced to disk, which makes small transactions much faster, but with a power outage you could potentially lose the last couple of seconds of data.
The tx() method sits on GraphDatabaseAPI, which for example EmbeddedGraphDatabase implements.
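For reference, a minimal sketch of grouping many writes under one transaction with the embedded API of that era (the property name and loop count are arbitrary):

Transaction tx = db.beginTx();
try {
    for (int i = 0; i < 10000; i++) {
        Node node = db.createNode();
        node.setProperty("id", i);
    }
    tx.success();      // mark the whole batch as successful
} finally {
    tx.finish();       // commits once, so the logical log is forced to disk only once
}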
You can also try a virtual drive. It would make Neo4j persist to the drive, but it would all happen in memory.
https://thelinuxexperiment.com/create-a-virtual-hard-drive-volume-within-a-file-in-linux/

Accelerate application startup on Jetty

I have a small web application configured with Guice, Jersey and EclipseLink, and I run it on Jetty (8.0.0.M1) during development. There are about 10 (small) JPA-managed classes (entities and embeddables), and about 20 classes in total.
The initial startup takes 15 seconds, plus 5 seconds for the first request. It seems like JPA is doing work on the first request, since I have the table generation strategy set to "create" and I see some JPA output from Maven on the first request.
A reload takes about 10 seconds, and the first request after reloading takes about 3 to 4 seconds.
You may think that the startup time is not so bad, but I'm wondering if I could speed up startup so development feels more fluent, like with Django. Any ideas for startup tuning?
I'm afraid that if you are not prepared to remove the table creation strategy, you will have to tolerate such loading times. In essence, every time you start up your application, it will drop/create/verify the tables and issue the DDL statements needed to make the schema match the entities in your packages.
Assuming that you're done defining your entities and you are working on business-logic code, you can create the database once and just reuse your initial setup.
I imagine you are using Jetty for rapid application development (RAD) and you want to see and test any change as quickly as possible. If there is no actual "persistence" requirement on your RAD environment's database, you could try moving to an in-memory DB engine. DB engines like HSQL allow you to spin up new tables (and other structures) very rapidly compared to production-quality DB engines. This would require that you use an ORM, because HSQL's SQL dialect is quite different from most other databases', but it sounds like you are already using JPA, so this shouldn't be difficult.
The only alternative I see is using a database whose schema has already been created appropriately and not dropping it every time.
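To make either suggestion concrete, here is a hedged sketch of passing the relevant settings programmatically (property names assume EclipseLink with JPA 2.0; the persistence-unit name, driver class and URL are placeholders):

Map<String, String> props = new HashMap<String, String>();
// Development: an in-memory HSQLDB so dropping and re-creating tables is cheap
props.put("javax.persistence.jdbc.driver", "org.hsqldb.jdbcDriver");
props.put("javax.persistence.jdbc.url", "jdbc:hsqldb:mem:devdb");
props.put("eclipselink.ddl-generation", "drop-and-create-tables");
// Once the schema is stable: keep the existing database and skip DDL entirely
// props.put("eclipselink.ddl-generation", "none");
EntityManagerFactory emf = Persistence.createEntityManagerFactory("my-unit", props);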

Hibernate Monitoring Solution

I would like to monitor Hibernate activity.
I found the Zentracker monitoring solution on the internet, which can monitor a lot of Hibernate activity.
But is it compatible with the latest version of Hibernate, 3.5.x?
If it's not, do you have a solution to monitor query execution time, SessionFactory sessions opened, persistent objects created, etc.?
Thank you in advance for your help.
Best regards,
Florent
P.S.: I'm French, sorry for my English.
I found the Zentracker monitoring solution on the internet, which can monitor a lot of Hibernate activity. But is it compatible with the latest version of Hibernate, 3.5.x?
Why don't you get the sources and recompile the project with a more recent version of Hibernate Core? Well, I did because I was curious, and it doesn't compile; there are a few API changes that require some modifications, but nothing overcomplicated. And since the project doesn't seem to be very active, your best option would be to make those changes yourself.
If it's not, do you have a solution to monitor query execution time, SessionFactory sessions opened, persistent objects created, etc.?
Well, as I said, you can make it compatible...
I personally gather Statistics via JMX and use a custom tool. From the documentation:
20.6.2. Metrics
Hibernate provides a number of metrics, from basic information to more specialized information that is only relevant in certain scenarios. All available counters are described in the Statistics interface API, in three categories:
Metrics related to the general Session usage, such as number of open sessions, retrieved JDBC connections, etc.
Metrics related to the entities, collections, queries, and caches as a whole (aka global metrics).
Detailed metrics related to a particular entity, collection, query or cache region.
For example, you can check the cache hit, miss, and put ratio of entities, collections and queries, and the average time a query needs. Be aware that the number of milliseconds is subject to approximation in Java. Hibernate is tied to the JVM precision and on some platforms this might only be accurate to 10 seconds.
Have a look at Performance Monitoring using Hibernate for more inspiration.
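If JMX is overkill, the same counters can be read programmatically; a small sketch against the Hibernate 3.x Statistics API (enable statistics only while profiling, since they add a little overhead):

Statistics stats = sessionFactory.getStatistics();
stats.setStatisticsEnabled(true);
// ... exercise the code you want to measure ...
System.out.println("open sessions:      " + stats.getSessionOpenCount());
System.out.println("queries executed:   " + stats.getQueryExecutionCount());
System.out.println("slowest query (ms): " + stats.getQueryExecutionMaxTime());
System.out.println("entities inserted:  " + stats.getEntityInsertCount());
stats.logSummary();   // dumps all counters to the log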
See also
Hibernate Profiler: a commercial tool
Related questions
Tool for monitoring Hibernate cache usage
References
Hibernate Core Reference Guide
20.6. Monitoring performance
