I am using Spring 2.5 and the Hibernate that goes with it. I'm running against an Oracle 11g database.
I have created my DAOs, which extend HibernateTemplate. Now I want to write a loader that inserts 5 million rows into my person table. I've written this in a simple-minded fashion: read a row from a CSV file, turn it into a person, save it into the table, and keep going until the CSV file is empty.
The problem is that I run out of heap space at about 450,000 rows. So I doubled the heap from 1024m to 2048m, and now I run out of memory after about 900,000 rows.
Hmmmmm....
So I've read some things about turning off the query cache for Hibernate, but I'm not using an L2 cache, so I don't think this is the issue.
I've read some things about JDBC2 batching, but I don't think that applies to Hibernate.
So, I'm wondering if maybe there's a fundamental thing about Hibernate that I'm missing.
To be honest, I wouldn't be using Hibernate for that. ORMs are not designed to load millions of rows into databases. Not saying you can't, but it's a bit like digging a swimming pool with an electric drill; you'd use an excavator for that, not a drill.
In your case, I'd load the CSV directly into the DB with the bulk loader utility that comes with the database. If you don't want to do that, then yes, batch inserts will be way more efficient. I don't think Hibernate lets you do that easily, though. If I were you I'd just use plain JDBC, or at most Spring JDBC.
If you have complicated business logic in the entities and absolutely have to use Hibernate, you could flush every N records as Richard suggests. However, I'd consider that a pretty bad hack.
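For reference, a minimal sketch of the plain-JDBC batch route (the PERSON columns, the csvRows input, and the batch size of 1000 are all placeholders, not anything from your setup):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class PersonBatchLoader {
    // Sketch only: table/column names and the csvRows source are placeholders.
    public void load(Connection conn, List<String[]> csvRows) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO PERSON (NAME, EMAIL) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : csvRows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % 1000 == 0) {   // send a batch and commit every 1000 rows
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch();               // flush the final partial batch
            conn.commit();
        }
    }
}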
In my experience with EclipseLink, holding a single transaction open while inserting/updating many records results in the symptoms you've experienced.
You are working with an EntityManager of some sort (JPA or Hibernate-specific; either way it's managing entities), and it tries to keep the working set in memory for the life of the transaction.
A general solution was to commit and then restart the transaction after every N inserts; a typical N for me was 1000.
As a footnote, with some versions of EclipseLink (I don't recall which; it's been a few years), a session flush/clear didn't solve the problem.
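For what it's worth, the commit-and-restart pattern in plain JPA looks roughly like this (a sketch; the Person entity and the batch size are placeholders):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.EntityTransaction;

public class PersonImporter {
    private static final int BATCH_SIZE = 1000;

    // Sketch: Person and the rows list stand in for your own entity and input source.
    public void importAll(EntityManager em, List<Person> rows) {
        EntityTransaction tx = em.getTransaction();
        tx.begin();
        int count = 0;
        for (Person p : rows) {
            em.persist(p);
            if (++count % BATCH_SIZE == 0) {
                tx.commit();      // commit and restart the transaction every N inserts
                em.clear();       // drop managed entities so they can be garbage collected
                tx.begin();
            }
        }
        tx.commit();
    }
}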
It sounds like you are running out of space due to your first-level cache (the Hibernate session). You can flush the Hibernate session periodically to keep memory usage down, and break up the work into chunks by committing every few thousand rows, keeping the database's transaction log from getting too big.
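That flush-and-clear pattern looks roughly like this (a sketch assuming you work with the Hibernate Session directly rather than through HibernateTemplate; the Person type and the row iterator are placeholders):

import java.util.Iterator;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class PersonLoader {
    // Sketch: the Iterator<Person> stands in for however you read and convert CSV rows.
    public void load(SessionFactory sessionFactory, Iterator<Person> people) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        int count = 0;
        while (people.hasNext()) {
            session.save(people.next());
            if (++count % 1000 == 0) {
                session.flush();   // push the pending inserts to the database
                session.clear();   // evict them from the first-level cache
            }
        }
        tx.commit();
        session.close();
    }
}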
But using Hibernate for a load task like that will be slow, because JDBC is slow. If you know the environment well, have a cap on the amount of data, and have a big enough processing window, then you can manage. But if you want this to work at multiple different client sites and want to minimize the time spent figuring out why some client site's load job isn't working, then you should go with the database's bulk-copy tool.
With the bulk-copy approach the database suspends all constraint checking, index building, and transaction logging, and instead concentrates on slurping the data in as fast as possible. JDBC doesn't get anything like this level of cooperation from the database, so it can't compete. At a previous job we replaced a JDBC loader task that took over 8 hours to run with a SQL*Loader task that took 20 minutes.
You do sacrifice database independence, but every database has a bulk-copy tool (because DBAs rely on them), so the process is very similar for each database; only the executable you invoke and the way the file format is specified should change. And this way you make the best use of your processing window.
I'm investigating the possibility of using neo4j to handle some of the queries of our java web application that simply take too long to run on MSSQL as they require so many joins on large tables, even with indexes implemented.
I am, however, concerned about how long the ETL might take to complete, which would ultimately affect how outdated the information is when queried.
Can someone advise on either a production strategy or toolkit / library that can assist in reading a production sql-server database (using deltas if possible to optimise) and updating a running instance of a neo4j database? I imagine that there will have to be some kind of mapping configuration but the idea is to have this run in an automated manner, updating the neo4j database with one or more sql-server table or view contents.
The direct way to connect a MS SQL database to a Neo4j database would be using the apoc.load.jdbc procedure.
For an initial load you can use Neo4j ETL (https://neo4j.com/blog/rdbms-neo4j-etl-tool/).
There is, however, no way around the fact that some planning and work will be involved if you want to keep the two databases in sync continuously (especially if the logic involved goes beyond a few simple queries). You might want to offload a delta every so often (monthly, daily, hourly, ...) into CSV files and load those with LOAD CSV, with the Cypher determining what needs to be added, removed, changed or connected.
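For illustration only, a periodic delta load driven from Java via the Bolt driver might look roughly like this (the URI, credentials, file name, label and properties are all made up, and the CSV must sit in Neo4j's import directory):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class DeltaLoader {
    public static void main(String[] args) {
        // Sketch: connection details and the Cypher statement are placeholders.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {
            session.run(
                "LOAD CSV WITH HEADERS FROM 'file:///person_delta.csv' AS row " +
                "MERGE (p:Person {id: toInteger(row.id)}) " +
                "SET p.name = row.name");
        }
    }
}

A scheduler (cron, Quartz, etc.) could run such a job at whatever interval your freshness requirements dictate.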
Sadly enough there's no such thing as a free lunch.
Hope this helps,
Tom
I am not a DB expert, so maybe somebody can help me out with the following.
I use JPA in combination with MySQL on the Play! Framework.
Because I ran into problems with DB locks in a data collection routine in the past, I made that routine read-only. Now, when I need to make changes from this routine, I do them concurrently.
This worked fine so far, but now I've run into a problem: when a new user connects, a record is created and the ID is passed back. But I need the model, not just the ID.
When I search for that ID in my test environment (in-memory DB) this works fine, but with MySQL the EntityManager returns nothing. It seems that with JPA and MySQL, only changes committed before the transaction started (even for a read-only transaction) are visible to searches within it.
Is this correct, and is there a way around it? I could rewrite the routine so it is no longer read-only and see whether I can optimise it enough that other threads (or servers) don't run into problems obtaining a DB lock. That is possibly the best approach in the long term, but I'm looking for a quicker solution for the moment.
The Transaction Isolation Level was set too high. Changing it solved the problem.
In the Play! Framework (1.2.5) it can be done like this:
%prod-test.db.isolation=READ_COMMITTED
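For reference, READ_COMMITTED corresponds to the standard JDBC isolation level, which on a raw connection would be set like this:

// plain-JDBC equivalent of READ_COMMITTED (conn is an open java.sql.Connection)
conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);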
I don't want to persist any data but still want to use Neo4j for its graph traversal and algorithm capabilities. In an embedded database, I've configured cache_type = strong, and after all the writes I mark the transaction as failed so nothing is committed. But my write speeds (node and relationship creation) are slow, and this is becoming a big bottleneck in my process.
So, the question is, can Neo4j be run without any persistence aspects to it at all and just as a pure API? I tried others like JGraphT but those don't have traversal mechanisms like the ones Neo4j provides.
As far as I know, Neo4j data storage and Lucene indexes are always written to files. On Linux, at least, you could set up a ramfs file system to hold the files in memory.
See also:
Loading all Neo4J db to RAM
How many changes do you group in each transaction? You should try to group up to thousands of changes in each transaction since committing a transaction forces the logical log to disk.
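For illustration, a sketch of that grouping with the 1.x-era embedded API (tx.finish() closes the transaction in those versions; the node count and batch size here are arbitrary):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class BulkCreate {
    // Sketch: groups node creation into transactions of 10,000 changes each.
    public void createNodes(GraphDatabaseService db, int nodeCount) {
        int batchSize = 10000;
        Transaction tx = db.beginTx();
        try {
            for (int i = 1; i <= nodeCount; i++) {
                db.createNode();
                if (i % batchSize == 0) {   // commit this group and start a new transaction
                    tx.success();
                    tx.finish();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.finish();
        }
    }
}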
However, in your case you could instead begin your transactions with:
db.tx().unforced().begin();
Instead of:
db.beginTx();
This makes the transaction not wait for the logical log to be forced to disk, which makes small transactions much faster, but after a power outage you could potentially lose the last couple of seconds of data.
The tx() method sits on GraphDatabaseAPI, which for example EmbeddedGraphDatabase implements.
You can try a virtual drive. Neo4j would still persist to the drive, but it would all happen in memory:
https://thelinuxexperiment.com/create-a-virtual-hard-drive-volume-within-a-file-in-linux/
I have been trying to use Spring 3.0's SimpleJdbcTemplate and it takes 5 minutes to insert 1500 records, whereas it takes me a few seconds to insert them using straight JDBC. Not sure what I am doing wrong.
If you are building a batch job, consider using Spring Batch's JdbcBatchItemWriter with proper chunk size settings; that will load these 1500 records in well under a minute.
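For illustration, configuring such a writer in code might look roughly like this (a sketch; the SQL, DataSource wiring, and Person bean are assumptions, and the chunk size itself is configured on the step that drives the writer):

import javax.sql.DataSource;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;

public class PersonWriterFactory {
    // Sketch: builds a JdbcBatchItemWriter that turns each Person into one batched INSERT.
    public JdbcBatchItemWriter<Person> createWriter(DataSource dataSource) throws Exception {
        JdbcBatchItemWriter<Person> writer = new JdbcBatchItemWriter<Person>();
        writer.setDataSource(dataSource);
        writer.setSql("INSERT INTO person (id, name) VALUES (:id, :name)");
        writer.setItemSqlParameterSourceProvider(
                new BeanPropertyItemSqlParameterSourceProvider<Person>());
        writer.afterPropertiesSet();   // sanity-checks the configuration
        return writer;
    }
}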
Some things worth checking:
The overhead might be in the transaction managed by Spring at the application level. Look at what kind of transaction manager you are using (look for a bean named transactionManager). If you are using JTA, that's probably where your problem is. Since it's fast with plain JDBC, the bottleneck doesn't seem to be the DB.
Depending on how your app is using that transaction, it might be holding everything in memory before it finishes all 1500 requests and commits. Do you see a large difference in memory usage (the Spring one should be a lot higher)?
What kind of DB connection pool are you using in either of the cases?
Quick way to profile your app:
Get the pid - "jps -l"
Memory: jmap -histo PID (check if there's some form of memory leak)
Check what's going on under the hood: jstack PID (look for slow or recursive method calls)
How about using
jdbcTemplate.batchUpdate(new String[]{sql});
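Or, to actually batch the parameterised inserts, something along these lines (a sketch; the table, columns, and Person accessors are placeholders):

import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class PersonBatchDao {
    // Sketch: sends all the inserts to the driver as one JDBC batch.
    public void insertAll(JdbcTemplate jdbcTemplate, final List<Person> people) {
        jdbcTemplate.batchUpdate(
            "INSERT INTO person (id, name) VALUES (?, ?)",
            new BatchPreparedStatementSetter() {
                public void setValues(PreparedStatement ps, int i) throws SQLException {
                    ps.setLong(1, people.get(i).getId());
                    ps.setString(2, people.get(i).getName());
                }
                public int getBatchSize() {
                    return people.size();
                }
            });
    }
}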
I have an Eclipse RCP application with an instance of an EMF model populated in memory. What is the best way to store that model for external systems to access? Access may occur during and after run time.
Reads and writes of the model are pretty balanced and can occur several times a second.
I think a database populated using Hibernate + Teneo + EMF would work nicely, but I want to know what other options are out there.
I'm using CDO (Connected Data Objects) in conjunction with EMF to do something similar. If you use the examples in the Eclipse wiki, it doesn't take too long to get it running. A couple of caveats:
For data that changes often, you probably will want to use nonAudit mode for your persistence. Otherwise, you'll save a new version of your EObject with every commit, retaining the old ones as well.
You can choose to commit every time your data changes, or you can commit at less frequent intervals, depending on how often you need to publish your updates (see the sketch after these caveats).
You also have fairly flexible locking options if you need them.
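A bare-bones sketch of that commit cycle, assuming your model is CDO-enabled and a CDOSession has already been opened via Net4j (the resource path and model root are placeholders):

import org.eclipse.emf.cdo.eresource.CDOResource;
import org.eclipse.emf.cdo.session.CDOSession;
import org.eclipse.emf.cdo.transaction.CDOTransaction;
import org.eclipse.emf.ecore.EObject;

public class ModelPublisher {
    // Sketch: opens a transaction, attaches the model to a resource, and commits.
    public void publish(CDOSession session, EObject modelRoot) throws Exception {
        CDOTransaction transaction = session.openTransaction();
        try {
            CDOResource resource = transaction.getOrCreateResource("/model/shared");
            resource.getContents().clear();       // replace the previous snapshot
            resource.getContents().add(modelRoot);
            transaction.commit();                 // make the state visible to other clients
        } finally {
            transaction.close();
        }
    }
}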
My application uses Derby for persistence, though it will be migrated to SQL Server before long.
There's a 1 hour webinar on Eclipse Live (http://live.eclipse.org/node/635) that introduces CDO and gives some good examples of its usage.
I'd go with Teneo to do the heavy lifting unless performance is a real problem (which it won't be unless your models are vast). Even if it is slow you can tune it using JPA annotations.