I don't want to persist any data but still want to use Neo4j for its graph traversal and algorithm capabilities. In an embedded database, I've configured cache_type = strong, and after all the writes I set the transaction to failure. But my write speeds (node and relationship creation speeds) are slow, and this is becoming a big bottleneck in my process.
So, the question is, can Neo4j be run without any persistence aspects to it at all and just as a pure API? I tried others like JGraphT but those don't have traversal mechanisms like the ones Neo4j provides.
As far as I know, Neo4j data storage and Lucene indexes are always written to files. On Linux, at least, you could set up a ramfs file system to hold the files in memory.
See also:
Loading all Neo4J db to RAM
How many changes do you group in each transaction? You should try to group up to thousands of changes in each transaction since committing a transaction forces the logical log to disk.
However, in your case you could instead begin your transactions with:
db.tx().unforced().begin();
Instead of:
db.beginTx();
That variant does not wait for the logical log to be forced to disk, which makes small transactions much faster, but a power outage could potentially cost you the last couple of seconds of data.
The tx() method sits on GraphDatabaseAPI, which for example EmbeddedGraphDatabase implements.
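To illustrate the batching suggestion (independent of the unforced variant), here is a minimal sketch against the standard embedded API; the property name and batch size are placeholders, and older versions close a transaction with tx.finish() where newer ones use tx.close():

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// Group a few thousand writes per transaction instead of one transaction per write
void loadNodes(GraphDatabaseService db, Iterable<String> names) {
    int batchSize = 5000; // tune to your heap and workload
    int count = 0;
    Transaction tx = db.beginTx();
    try {
        for (String name : names) {
            Node node = db.createNode();
            node.setProperty("name", name);
            if (++count % batchSize == 0) {
                tx.success();
                tx.finish();       // commit this batch (tx.close() on newer versions)
                tx = db.beginTx(); // start the next batch
            }
        }
        tx.success();
    } finally {
        tx.finish();
    }
}

Each commit forces the logical log to disk, so amortizing that cost over thousands of writes is where the speedup comes from.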
You can try a virtual drive. It would make Neo4j persist to the drive, but it would all happen in memory.
https://thelinuxexperiment.com/create-a-virtual-hard-drive-volume-within-a-file-in-linux/
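As a rough sketch, assuming a tmpfs/ramfs mount already exists at the placeholder path /mnt/ramdisk (the factory call shown is the string-based embedded API; newer Neo4j versions take a File or Path instead):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Point the embedded store at a directory backed by tmpfs/ramfs,
// so all "disk" I/O actually stays in RAM
GraphDatabaseService db = new GraphDatabaseFactory()
        .newEmbeddedDatabase("/mnt/ramdisk/neo4j-db");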
I am using Hazelcast (3.8.1) as a cache service in my application. After going through the Hazelcast documentation, I have a few doubts related to it:
1. If we use write-behind persistence, writes go asynchronously to a local queue, from which we eventually persist them to a DB. My question is: if all the nodes go down, will there be data loss in this scenario?
Note: I understand that one copy of the queue is also maintained on a backup node. But my scenario is when all the nodes go down; can we lose data then?
2. Does Hazelcast maintain offline persistence when it goes down and load it when it is started [for all the nodes]?
Appreciate responses.
The answer to 1 is obvious, and is applicable to any in-memory system with asynchronous writes. If all nodes in your cluster go down, then yes, there's potential for data loss as your system is only eventually consistent.
For question 2: Hazelcast is an in-memory cache and therein lie its primary benefits. Writing to or loading from persistent storage should be secondary because it conflicts with some of the main attributes of a caching system (speed, I guess...).
With that said, it allows you to load from and write to persistent storage, either synchronously (write-through) or asynchronously (write-behind).
If your main reason for using Hazelcast is replication and partitioning (of persistent, consistent data), then you'd be better off using a NoSQL database such as MongoDB. This depends a lot on your usage patterns, because it may still make sense if you expect far more reads than writes.
If, on the other hand, your main reason for using it is speed, then what you need is to better manage fault-tolerance, which has more to do with your cluster topology (maybe you should have cross-datacenter replication) than with persistence. It's atypical to be concerned with "all nodes dying" in your DC unless you have strong consistency or transaction requirements.
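To make the write-through vs. write-behind distinction concrete, here is a minimal configuration sketch; PersonMapStore and the map name "persons" are hypothetical, and a writeDelaySeconds greater than zero is what turns write-through into write-behind:

import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// PersonMapStore is a hypothetical MapStore implementation that talks to your DB
MapStoreConfig mapStoreConfig = new MapStoreConfig()
        .setEnabled(true)
        .setImplementation(new PersonMapStore())
        .setWriteDelaySeconds(5); // 0 = write-through (sync), > 0 = write-behind (async)

Config config = new Config();
config.getMapConfig("persons").setMapStoreConfig(mapStoreConfig);
HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

Entries still sitting in the write-behind queue are exactly what you lose if every node, backups included, dies at once.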
Yes, you would lose the data in memory if it is not persisted to the database yet.
OTOH, Hazelcast has Hot Restart for persistence to disk in the Enterprise version. This helps in case of a planned shutdown of the whole cluster or a sudden cluster-wide crash, e.g., a power outage.
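If you are on the Enterprise edition, the Hot Restart setup looks roughly like the sketch below; treat the exact config classes, the map name and the base directory as assumptions to verify against your Hazelcast version:

import java.io.File;
import com.hazelcast.config.Config;
import com.hazelcast.config.HotRestartConfig;

Config config = new Config();
// Enable cluster-wide Hot Restart persistence and pick a directory for the store
config.getHotRestartPersistenceConfig()
      .setEnabled(true)
      .setBaseDir(new File("/var/lib/hazelcast/hot-restart")); // placeholder path
// Opt an individual map into Hot Restart
config.getMapConfig("persons")
      .setHotRestartConfig(new HotRestartConfig().setEnabled(true));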
I'm investigating the possibility of using Neo4j to handle some of the queries of our Java web application that simply take too long to run on MSSQL, as they require so many joins on large tables, even with indexes implemented.
I am, however, concerned that the time it might take to complete the ETL will ultimately affect how outdated the information may be when queried.
Can someone advise on either a production strategy or a toolkit/library that can assist in reading a production SQL Server database (using deltas if possible, to optimise) and updating a running instance of a Neo4j database? I imagine that there will have to be some kind of mapping configuration, but the idea is to have this run in an automated manner, updating the Neo4j database with the contents of one or more SQL Server tables or views.
The direct way to connect a MS SQL database to a Neo4j database would be using the apoc.load.jdbc procedure.
For an initial load you can use Neo4j ETL (https://neo4j.com/blog/rdbms-neo4j-etl-tool/).
There is, however, no way around the fact that some planning and work will be involved if you want to keep two databases continuously in sync (and the logic involved goes beyond a few simple queries). You might want to offload a delta every so often (monthly, daily, hourly, ...) into CSV files and load those (with Cypher determining what needs to be added, removed, changed or connected) with LOAD CSV.
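As a rough illustration of the apoc.load.jdbc route from a Java application (the Bolt URL, credentials, JDBC URL, table name and property names are all placeholders):

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

// Pull rows from SQL Server via APOC and MERGE them into the graph
try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
        AuthTokens.basic("neo4j", "secret"));
     Session session = driver.session()) {
    session.run(
        "CALL apoc.load.jdbc('jdbc:sqlserver://dbhost;databaseName=crm', 'Person') " +
        "YIELD row " +
        "MERGE (p:Person {id: row.PersonId}) " +
        "SET p.name = row.Name");
}

Note that the procedure executes on the Neo4j server, so the SQL Server JDBC driver jar has to be available there.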
Sadly enough there's no such thing as a free lunch.
Hope this helps,
Tom
We are developing a distributed web application (three Tomcat instances behind a load balancer).
Currently we are looking for a cache solution. This solution should be cluster-safe, of course.
We are using Spring and JPA (MySQL).
We thought about the following solution:
Create a cache server that runs a simple cache, and delegate all DB operations from each Tomcat to it (the DAO layer in the web app will communicate with that server instead of accessing the DB itself). This is appealing since the cache configuration on the cache server can be minimal.
What we are wondering about right now is:
If a complex query is passed to the cache server (i.e. a select with multiple joins and where clauses), how exactly can the standard cache form (a map) handle this? Does it mean we have to implement a lookup for each complex query ourselves and translate it into a map search instead of a DB query?
P.S. There is a possibility that this architecture is flawed at its base and that is why a weird question like this was raised; if that's the case, please suggest an alternative.
Best,
MySQL already comes with a query cache; see http://dev.mysql.com/doc/refman/5.1/en/query-cache-operation.html
If I understand correctly, you are trying to implement a method cache, using the arguments of your DAO methods as the key and the resulting object/list as the value.
This should work, but your concern about complex queries is valid: you will end up with a lot of entries in your cache. For a complex query you would hit the cache only if the same query is executed with exactly the same arguments as the cached one. You will have to figure out whether it is useful to cache those complex queries, i.e. whether there is a realistic chance they will be hit; it really depends on the application's business logic.
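As an illustration of that key-per-arguments behaviour, a minimal sketch with Spring's cache abstraction (assuming a Spring version that has it, 3.1+; OrderDao, Order, findOrders and the cache name are all hypothetical):

import java.util.Collections;
import java.util.Date;
import java.util.List;
import org.springframework.cache.annotation.Cacheable;

public class OrderDao {

    // The cache key is built from the method arguments, so this complex query
    // is only served from the cache when called with exactly the same arguments.
    @Cacheable(value = "orderQueries", key = "{#customerId, #status, #from, #to}")
    public List<Order> findOrders(long customerId, String status, Date from, Date to) {
        // ... run the multi-join SQL against MySQL here ...
        return Collections.emptyList(); // placeholder
    }
}

Any argument that differs produces a different key and therefore a cache miss, which is why rarely repeated complex queries gain little from this approach.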
Another option would be to implement a cache with multiple levels: a second-level cache and a query cache, using Ehcache and BigMemory. You might find this useful:
http://ehcache.org/documentation/integrations/hibernate
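For reference, enabling the Hibernate second-level and query caches usually comes down to a few properties along these lines; the region factory class name varies by Hibernate/Ehcache version, so treat it as an assumption:

import java.util.Properties;

// Hibernate/JPA properties to switch on the second-level cache and the query cache
Properties props = new Properties();
props.setProperty("hibernate.cache.use_second_level_cache", "true");
props.setProperty("hibernate.cache.use_query_cache", "true");
props.setProperty("hibernate.cache.region.factory_class",
        "org.hibernate.cache.ehcache.EhCacheRegionFactory"); // version-dependent class name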
I am using Spring 2.5 and the Hibernate that goes with it. I'm running against an Oracle 11g database.
I have created my DAOs, which extend HibernateTemplate. Now I want to write a loader that inserts 5 million rows into my person table. I have written this in a simple-minded fashion: read a row from a CSV file, turn it into a person, save it into the table, and keep doing this until the CSV file is empty.
The problem is that I run out of heap space at about 450,000 rows. So I doubled the memory from 1024m to 2048m, and now I run out of memory after about 900,000 rows.
Hmmmmm....
So I've read some things about turning off the query cache for Hibernate, but I'm not using an L2 cache, so I don't think this is the issue.
I've read some things about JDBC2 batching, but I don't think that applies to Hibernate.
So, I'm wondering if maybe there's a fundamental thing about Hibernate that I'm missing.
To be honest, I wouldn't be using Hibernate for that. ORMs are not designed to load millions of rows into DBs. Not saying that you can't, but it's a bit like digging a swimming pool with an electric drill; you'd use an excavator for that, not a drill.
In your case, I'd load the CSV directly into the DB with a loader application that comes with the database. If you don't want to do that, yes, batch inserts will be way more efficient. I don't think Hibernate lets you do that easily though. If I were you I'd just use plain JDBC, or at most Spring JDBC.
If you have complicated business logic in the entities and absolutely have to use Hibernate, you could flush every N records as Richard suggests. However, I'd consider that a pretty bad hack.
In my experience with EclipseLink, holding a single transaction open while inserting/updating many records results in the symptoms you've experienced.
You are working with an EntityManager (of some sort, JPA- or Hibernate-specific; it's still managing entities). It's trying to keep the working set in memory for the life of the transaction.
A general solution was to commit and then restart the transaction after every N inserts; a typical N for me was 1000.
As a footnote, with some version (undefined, it's been a few years) of EclipseLink, a session flush/clear didn't solve the problem.
It sounds like you are running out of space due to your first-level cache (the Hibernate session). You can flush the Hibernate session periodically to keep memory usage down, and break up the work into chunks by committing every few thousand rows, keeping the database's transaction log from getting too big.
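A minimal sketch of that chunked approach with a plain Hibernate Session (sessionFactory, your Person entity and readPeopleFromCsv() are assumed to exist; setting hibernate.jdbc.batch_size in your configuration helps as well):

import org.hibernate.Session;
import org.hibernate.Transaction;

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
int i = 0;
for (Person p : readPeopleFromCsv()) {  // hypothetical CSV-reading helper
    session.save(p);
    if (++i % 1000 == 0) {
        tx.commit();      // flushes this chunk and keeps the transaction log small
        session.clear();  // evict the chunk from the first-level cache
        tx = session.beginTransaction();
    }
}
tx.commit();
session.close();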
But using Hibernate for a load task like that will be slow, because JDBC is slow. If you have a good idea what the environment will be like, you have a cap on the amount of data, and you have a big enough window for processing, then you can manage. But in a situation where you want it to work at multiple different client sites and you want to minimize the time spent figuring out problems when some client site's load job doesn't work, you should go with the database's bulk-copy tool.
The bulk-copy approach means the database suspends all constraint checking, index-building and transaction logging; instead, it concentrates on slurping the data in as fast as possible. Because JDBC doesn't get anything like this level of cooperation from the database, it can't compete. At a previous job we replaced a JDBC loader task that took over 8 hours to run with a SQL*Loader task that took 20 minutes.
You do sacrifice database independence, but all databases have a bulk-copy tool (because DBAs rely on them), so you will have a very similar process for each database; only the executable you invoke and the way the file formatting is specified should change. And this way you make the best use of your processing window.
I have an Eclipse RCP application with an instance of an EMF model populated in memory. What is the best way to store that model for external systems to access? Access may occur during and after run time.
Reads and writes of the model are pretty balanced and can occur several times a second.
I think a database populated using Hibernate + Teneo + EMF would work nicely, but I want to know what other options are out there.
I'm using CDO (Connected Data Objects) in conjunction with EMF to do something similar. If you use the examples in the Eclipse wiki, it doesn't take too long to get it running. A couple of caveats:
For data that changes often, you probably will want to use nonAudit mode for your persistence. Otherwise, you'll save a new version of your EObject with every commit, retaining the old ones as well.
You can choose to commit every time your data changes, or you can choose to commit at less frequent intervals, depending on how frequently you need to publish your updates.
You also have fairly flexible locking options if you choose to do so.
My application uses Derby for persistence, though it will be migrated to SQL Server before long.
There's a 1 hour webinar on Eclipse Live (http://live.eclipse.org/node/635) that introduces CDO and gives some good examples of its usage.
I'd go with Teneo to do the heavy lifting unless performance is a real problem (which it won't be unless your models are vast). Even if it is slow you can tune it using JPA annotations.