I'm investigating the possibility of using Neo4j to handle some of the queries of our Java web application that simply take too long to run on MSSQL, as they require so many joins on large tables, even with indexes in place.
I am, however, concerned about the time that it might take to complete the ETL, ultimately impacting how outdated the information may be when queried.
Can someone advise on either a production strategy or a toolkit/library that can assist in reading a production SQL Server database (using deltas if possible, to optimise) and updating a running instance of a Neo4j database? I imagine there will have to be some kind of mapping configuration, but the idea is to have this run in an automated manner, updating the Neo4j database with the contents of one or more SQL Server tables or views.
The direct way to connect an MS SQL database to a Neo4j database would be using the apoc.load.jdbc procedure.
For an initial load you can use Neo4j ETL (https://neo4j.com/blog/rdbms-neo4j-etl-tool/).
There is, however, no way around the fact that some planning and work will be involved if you want to keep two databases in sync continuously (especially if the logic involved goes beyond a few simple queries). You might want to offload a delta every so often (monthly, daily, hourly, ...) into CSV files and load those with LOAD CSV, with the Cypher determining what needs to be added, removed, changed or connected.
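To make the delta step concrete, here is a minimal sketch of how the "what changed since last run" computation could look, assuming you can snapshot each table as primary key → row hash (the keys, hashes and CSV layout are illustrative, not from any particular tool):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeltaCsv {

    // Compare two snapshots (primary key -> row content hash) and emit one
    // CSV line per change, tagged ADD, CHANGE or REMOVE. The resulting file
    // could then drive a LOAD CSV statement per operation type.
    static List<String> delta(Map<String, String> before, Map<String, String> after) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : after.entrySet()) {
            String old = before.get(e.getKey());
            if (old == null) {
                lines.add(e.getKey() + ",ADD");
            } else if (!old.equals(e.getValue())) {
                lines.add(e.getKey() + ",CHANGE");
            }
        }
        for (String key : before.keySet()) {
            if (!after.containsKey(key)) {
                lines.add(key + ",REMOVE");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> before = new LinkedHashMap<>();
        before.put("1", "hashA");
        before.put("2", "hashB");
        Map<String, String> after = new LinkedHashMap<>();
        after.put("1", "hashA2"); // row 1 changed
        after.put("3", "hashC");  // row 3 added, row 2 removed
        System.out.println(delta(before, after));
    }
}
```

In a real job the hashes would come from a checksum over the row's columns on the SQL Server side; the point is only that the three operation types fall out of a straightforward two-way comparison.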
Sadly enough there's no such thing as a free lunch.
Hope this helps,
Tom
I have a working code that basically copies records from one database to another one using JPA. It works fine but it takes a while, so I wonder if there's any faster way to do this.
I thought of threads, but I run into race conditions, and synchronizing those pieces of code ends up taking as long as the one-by-one process.
Any ideas?
Update
Here's the scenario:
Application (Core) has a database.
Plugins have default data (same structure as Core, but with different data)
When the plugin is enabled, it checks the Core database and, if the data is not found, copies its default data into the Core database.
Most databases provide native tools to support this. Unless you need to write additional custom logic to transform the data in some way, I would recommend looking at the export/import tools provided by your database vendor.
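If you do end up staying in Java rather than using the vendor tools, the usual first win over row-by-row JPA is batching. The sketch below shows only the batching flow, kept generic on purpose: in a real copy job the `Consumer` would wrap `PreparedStatement.addBatch()` plus one `executeBatch()` per chunk (all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BatchCopier {

    // Feed records to a sink in fixed-size batches instead of one at a time.
    // Returns the number of batches handed to the sink.
    static <T> int copyInBatches(Iterator<T> source, int batchSize, Consumer<List<T>> sink) {
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // one executeBatch() in a real job
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {                      // don't forget the tail
            sink.accept(new ArrayList<>(batch));
            batches++;
        }
        return batches;
    }
}
```

One database round-trip per batch instead of per row is also why this tends to beat a threaded one-by-one copy without any of the synchronization headaches.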
I am using Spring 2.5 and the Hibernate that goes with it. I'm running against an Oracle 11g database.
I have created my DAOs which extend HibernateTemplate. Now I want to write a loader that inserts 5 million rows in my person table. I have written this in a simple minded fashion like read a row from a CSV file, turn it into a person, save into the table. Keep doing this until CSV file is empty.
The problem is that I run out of heap space at about 450000 rows. So I double the size of memory from 1024m to 2048m and now I run out of memory after about 900000 rows.
Hmmmmm....
So I've read some things about turning off the query cache for Hibernate, but I'm not using a L2 cache, so I don't think this is the issue.
I've read some things about JDBC2 batching, but I don't think that applies to Hibernate.
So, I'm wondering if maybe there's a fundamental thing about Hibernate that I'm missing.
To be honest I wouldn't be using Hibernate for that. ORMs are not designed to load millions of rows into DBs. Not saying that you can't, but it's a bit like digging a swimming pool with an electric drill; you'd use an excavator for that, not a drill.
In your case, I'd load the CSV directly into the DB with a loader application that comes with the database. If you don't want to do that, yes, batch inserts will be way more efficient. I don't think Hibernate lets you do that easily, though. If I were you I'd just use plain JDBC, or at most Spring JDBC.
If you have complicated business logic in the entities and absolutely have to use Hibernate, you could flush every N records as Richard suggests. However, I'd consider that a pretty bad hack.
In my experience with EclipseLink, holding a single transaction open while inserting/updating many records results in the symptoms you've experienced.
You are working with an EntityManager (of some sort, JPA or Hibernate specific - it's still managing entities). It's trying to keep the working set in memory for the life of the transaction.
A general solution was to commit & then restart the transaction after every N inserts; a typical N for me was 1000.
As a footnote, with some version (undefined, it's been a few years) of EclipseLink, a session flush/clear didn't solve the problem.
It sounds like you are running out of space due to your first-level cache (the Hibernate session). You can flush the Hibernate session periodically to keep memory usage down, and break up the work into chunks by committing every few thousand rows, keeping the database's transaction log from getting too big.
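The flush-every-N pattern is simple enough to sketch. Since the Hibernate session itself isn't shown here, the `Session` below is a stand-in interface whose `flushAndClear()` would be `session.flush(); session.clear();` in real code (followed by a commit per chunk if you also want to cap the transaction log):

```java
public class ChunkedLoader {

    // Stand-in for org.hibernate.Session; only the two calls the pattern needs.
    interface Session {
        void persist(Object entity);
        void flushAndClear();   // session.flush(); session.clear(); in real code
    }

    // Persist all rows, flushing and clearing the first-level cache every
    // flushEvery rows so memory use stays flat. Returns the flush count.
    static int load(Iterable<?> rows, Session session, int flushEvery) {
        int i = 0, flushes = 0;
        for (Object row : rows) {
            session.persist(row);
            if (++i % flushEvery == 0) {
                session.flushAndClear();   // push the chunk out, drop it from the cache
                flushes++;
            }
        }
        if (i % flushEvery != 0) {         // tail chunk
            session.flushAndClear();
            flushes++;
        }
        return flushes;
    }
}
```

The key point is the clear: flushing alone writes the SQL but leaves every entity attached, which is exactly what eats the heap at ~450k rows.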
But using Hibernate for a load task like that will be slow, because JDBC is slow. If you have a good idea what the environment will be like, you have a cap on the amount of data, and you have a big enough window for processing, then you can manage, but in a situation where you want it to work in multiple different client sites and you want to minimize the time spent on figuring out problems due to some client site's load job not working, then you should go with the database's bulk-copy tool.
The bulk-copy approach means the database suspends all constraint checking, index-building and transaction logging; instead it concentrates on slurping the data in as fast as possible. Because JDBC doesn't get anything like this level of cooperation from the database, it can't compete. At a previous job we replaced a JDBC loader task that took over 8 hours to run with a SQL*Loader task that took 20 minutes.
You do sacrifice database independence, but all databases have a bulk-copy tool (because DBAs rely on them) so you will have a very similar process for each database, only the exe you invoke and the way the file formatting is specified should change. And this way you make the best use of your processing window.
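For the Oracle case, "only the exe and the file formatting change" amounts to assembling a control file and a command line. A hedged sketch of that assembly (table, file and credential names are made up; the syntax follows SQL*Loader conventions, and nothing is actually executed here):

```java
import java.util.Arrays;
import java.util.List;

public class SqlLoaderJob {

    // Build a SQL*Loader control file describing a comma-separated CSV.
    static String controlFile(String csvPath, String table, String... columns) {
        return "LOAD DATA\n"
             + "INFILE '" + csvPath + "'\n"
             + "INTO TABLE " + table + "\n"
             + "FIELDS TERMINATED BY ','\n"
             + "(" + String.join(", ", columns) + ")\n";
    }

    // Build the command line; direct=true requests a direct path load,
    // which is where the big speedup over JDBC comes from.
    static List<String> command(String credentials, String controlPath) {
        return Arrays.asList("sqlldr",
                "userid=" + credentials,
                "control=" + controlPath,
                "direct=true");
    }
}
```

The command list could then be handed to a `ProcessBuilder` from the loader application, keeping the per-database differences confined to these two strings.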
We have to import a lot of data into a production database from Web services, flat files and other external sources.
We are using spring batch to do this.
One of the major problems is that some of the data is related, but won't be imported at the same time.
The other major problem is that there is a lot of data, so I can't really make a huge transaction and roll back if a problem occurs.
How could I do that?
Your best bet is to load the data into a "holding" table that is not used by your running application. Then look at using SELECT INTO to copy the data into the application tables when the application is least busy.
The advantages of this approach are:
Loading the data into the holding table has no locking implications on your application tables
Depending on your database configuration SELECT INTO can be done with minimal (or no) writing to the transaction log making it very efficient
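The two phases reduce to a short sequence of statements. A sketch of the plan (table names are made up; each string would be run through a JDBC `Statement` or a Spring Batch writer in a real job, and the copy step is written as INSERT ... SELECT so it targets an existing table, since SELECT INTO on SQL Server creates the target table):

```java
import java.util.Arrays;
import java.util.List;

public class StagedLoad {

    // Phase 1: bulk the raw rows into a holding table with no contention on
    // the app tables. Phase 2, off-peak: copy across in one set-based
    // statement, then reset the holding table for the next run.
    static List<String> plan() {
        return Arrays.asList(
            "INSERT INTO person_holding (id, name) VALUES (?, ?)",
            "INSERT INTO person (id, name) SELECT id, name FROM person_holding",
            "TRUNCATE TABLE person_holding");
    }
}
```

Validation and any custom transforms can also happen against the holding table between the two phases, which keeps bad rows out of the application tables entirely.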
Assuming that the database is not in service while you are doing this: back up the database, turn off all constraint checks, import your data, then turn the constraints back on. If all else fails, at least you have the backup to fall back on.
I need to run a method in Java every time a specific table of an Oracle DB is updated (any sorts of updates, including record additions, deletions and modifications).
What is the most efficient way to poll a table for changes from Java that has good performance and does not put too much pressure on the DB?
Unfortunately I have many constraints:
I can't create additional tables, triggers, stored procedures etc., because I have no control over the DB administration / design.
I'd rather avoid Oracle Change Notification, as proposed in that post, as it seems to involve C/JNI.
Just counting the records is not good enough as I might miss modifications and simultaneous additions/deletions.
A delay of up to 30/60s between the actual change and the notification is acceptable.
The tables I want to monitor generally have 100k+ records (some 1m+) so I don't think pulling the whole tables is an option (from a DB load / performance perspective).
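Given those constraints, the polling loop itself can be sketched with a `ScheduledExecutorService`. The change-detection query is deliberately left abstract as a `Supplier<String>` fingerprint, because what you can cheaply compute depends on the schema (a max version column, an `ORA_ROWSCN`-based value, a checksum, ...); none of that is shown here:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class TablePoller {

    // Poll a cheap change indicator every periodMillis and fire the callback
    // whenever it moves. The Supplier stands in for the actual JDBC query.
    static ScheduledFuture<?> start(ScheduledExecutorService pool,
                                    Supplier<String> fingerprint,
                                    Runnable onChange,
                                    long periodMillis) {
        final String[] last = { fingerprint.get() };   // baseline before polling
        return pool.scheduleWithFixedDelay(() -> {
            String now = fingerprint.get();
            if (!now.equals(last[0])) {
                last[0] = now;
                onChange.run();
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

With a 30-60s period, each poll is a single small query; whether one fingerprint query can reliably capture modifications as well as simultaneous adds/deletes on your tables is the part that needs per-schema thought.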
Starting from Oracle 11g you can use Oracle Change Notification with the plain JDBC driver: Link
I've got an Oracle database that has two schemas in it which are identical. One is essentially the "on" schema, and the other is the "off" schema. We update data in the off schema and then switch the schemas behind an alias which our production servers use. Not a great solution, but it's what I've been given to work with.
My problem is that there is a separate application that will now be streaming data to the database (also handed to me) which is currently only updating the alias, which means it is only updating the "on" schema at any given time. That means that when the schemas get switched, all the data from this separate application vanishes from production (the schema it is in is now the "off" schema).
This application is using Hibernate 3.3.2 to update the database. There's Spring 3.0.6 in the mix as well, but not for the database updates. Finally, we're running on Java 1.6.
Can anyone point me in a direction to updating both "on" and "off" schemas simultaneously that does not involve rewriting the whole DAO layer using Spring JDBC to load two separate connection pools? I have not been able to find anything about getting hibernate to do this. Thanks in advance!
You shouldn't be updating two separate databases this way, especially from the application's point of view. All it should know/care about is whether or not the data is there, not having to mess with two separate databases.
Frankly, this sounds like you may need to purchase an ETL tool. Even if you can't get it to update the 'on' schema from the 'off' one (fast enough to be practical), you will likely be able to use it to keep the two in sync (mirror changes from 'on' to 'off').
HA-JDBC is a replicating JDBC Driver we investigated for a short while. It will automatically replicate all inserts and updates, and distribute all selects. There are other database specific master-slave solutions as well.
On the other hand, I wouldn't recommend doing this for 4-8 hour procedures. Better to lock the database first, update one copy, then backup-restore it over the other, and unlock again.
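To illustrate the replication idea at a small scale: a JDK dynamic proxy can fan every call on an interface out to several delegates, which is the core of what a replicating driver does for inserts and updates. This is a toy sketch on a made-up DAO interface, not `java.sql.Connection`, and it has none of the failure handling that makes HA-JDBC actually usable:

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class FanOut {

    interface Dao {                 // made-up interface for the demo
        void save(String row);
    }

    // Return a proxy that forwards every method call to all delegates in
    // order. The last delegate's return value wins (fine for void writes).
    @SuppressWarnings("unchecked")
    static <T> T replicate(Class<T> type, T... delegates) {
        return (T) Proxy.newProxyInstance(type.getClassLoader(),
                new Class<?>[] { type },
                (proxy, method, args) -> {
                    Object result = null;
                    for (T d : delegates) {
                        result = method.invoke(d, args);
                    }
                    return result;
                });
    }

    public static void main(String[] args) {
        List<String> on = new ArrayList<>();
        List<String> off = new ArrayList<>();
        Dao both = replicate(Dao.class, (Dao) on::add, (Dao) off::add);
        both.save("person-42");     // lands in both "schemas"
        System.out.println(on + " " + off);
    }
}
```

The hard problems a real solution must answer, and this sketch doesn't, are what happens when one target fails mid-write and how the two copies get reconciled afterwards.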