Java heap space errors when inserting large amounts of data into neo4j - java

I am currently evaluating neo4j with respect to inserting large amounts of nodes/relationships into the graph. This is not about the initial import, which could be done with batch inserts, but about inserts that are processed frequently at runtime in a Java application that uses neo4j in embedded mode (currently version 1.8.1, as shipped with spring-data-neo4j 2.2.2.RELEASE).
These inserts usually follow a star schema: a single node (the root node of the imported dataset) has up to 1000000 (one million!) connected child nodes. The child nodes normally have relationships to further nodes as well, but those relationships are not covered by this test so far. The overall goal is to import that amount of data in at most five minutes!
To simulate this kind of insert I wrote a small JUnit test that uses the Neo4jTemplate for creating the nodes and relationships. Each inserted leaf has a key associated with it for later processing:
@Test
@Transactional
@Rollback
public void generateUngroupedNode()
{
    long numberOfLeafs = 1000000;
    Assert.assertTrue(this.template.transactionIsRunning());
    Node root = this.template.createNode(map(NAME, UNGROUPED));
    String groupingKey = null;
    for (long index = 0; index < numberOfLeafs; index++)
    {
        // Just a sample division of leafs to possible groups
        // Creates keys to be grouped by to groups containing 2 leafs each
        if (index % 2 == 0)
        {
            groupingKey = UUID.randomUUID().toString();
        }
        Node leaf = this.template.createNode(map(GROUPING_KEY, groupingKey, NAME, LEAF));
        this.template.createRelationshipBetween(root, leaf, Relationships.LEAF.name(),
                map());
    }
}
For this test I use the gcr cache to avoid Garbage Collector issues:
cache_type=gcr
node_cache_array_fraction=7
relationship_cache_array_fraction=5
node_cache_size=400M
relationship_cache_size=200M
Additionally I set my MAVEN_OPTS to:
export MAVEN_OPTS="-Xmx4096m -Xms2046m -XX:PermSize=256m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"
Nevertheless, when running that test I always get a Java heap space error:
java.lang.OutOfMemoryError: Java heap space
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
at java.lang.Class.getMethod0(Class.java:2670)
at java.lang.Class.getMethod(Class.java:1603)
at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:896)
at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:862)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:859)
at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:423)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.springframework.transaction.support.TransactionTemplate.<init>(TransactionTemplate.java:67)
at org.springframework.data.neo4j.support.Neo4jTemplate.exec(Neo4jTemplate.java:403)
at org.springframework.data.neo4j.support.Neo4jTemplate.createRelationshipBetween(Neo4jTemplate.java:367)
I did some tests with smaller amounts of data, which resulted in the following outcomes. 1 node connected to:
50000 leafs: 3035ms
100000 leafs: 4290ms
200000 leafs: 10268ms
400000 leafs: 20913ms
800000 leafs: Java heap space
Here is a screenshot of the system monitor during those operations (image not included here).
To get a better impression of what exactly is running and stored in the heap, I ran JProfiler during the last test (800000 leafs). Screenshots of heap usage and CPU usage are likewise not included here.
The big question for me is: Is neo4j not designed for this kind of huge amount of data? Or is there some other way to achieve these kinds of inserts (and later operations)? On the official neo4j website and in various screencasts I found the information that neo4j is able to run with billions of nodes and relationships (e.g. http://docs.neo4j.org/chunked/stable/capabilities-capacity.html). I didn't find any functionality like the flush() and clear() methods that are available e.g. in JPA to keep the heap clean manually.
It would be great to be able to use neo4j with these amounts of data. Already with 200000 leafs stored in the graph I noticed a performance improvement of a factor of 10 and more compared to an embedded classic RDBMS. I don't want to give up the nice way of modeling and querying data that neo4j provides.

By just using the Neo4j core API it takes between 18 and 26 seconds to create the children, without any optimizations on my MacBook Air:
Output: import of 1000000 children took 26 seconds.
import java.io.File;
import java.io.IOException;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

public class CreateManyRelationships {

    public static final int COUNT = 1000 * 1000;
    public static final DynamicRelationshipType CHILD = DynamicRelationshipType.withName("CHILD");
    public static final File DIRECTORY = new File("target/test.db");

    public static void main(String[] args) throws IOException {
        FileUtils.deleteRecursively(DIRECTORY);
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase(DIRECTORY.getAbsolutePath());
        long time = System.currentTimeMillis();
        Transaction tx = gdb.beginTx();
        Node root = gdb.createNode();
        for (int i = 1; i <= COUNT; i++) {
            Node child = gdb.createNode();
            root.createRelationshipTo(child, CHILD);
            // commit in chunks of 50k to keep the transaction state small
            if (i % 50000 == 0) {
                tx.success(); tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success(); tx.finish();
        time = System.currentTimeMillis() - time;
        System.out.println("import of " + COUNT + " children took " + time / 1000 + " seconds.");
        gdb.shutdown();
    }
}
And the Spring Data Neo4j docs state that it is not made for this type of task.

If you are connecting 800K child nodes to one node, you are effectively creating a dense node, a.k.a. a key-value-like structure. Neo4j is currently not optimized to handle these structures effectively, as all connected relationships are loaded into memory upon traversal of a node. This will be addressed by Neo4j 2.1 with configurable optimizations, so that only parts of the relationships are loaded when touching these structures.
For the time being, I would recommend either putting these structures into indexes instead and doing a lookup for the connected nodes, or balancing the dense structure along one value (e.g. building a subtree with, say, 100 subcategories along one of the properties on the relationships, e.g. time; see http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html for instance).
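As an illustration of the second suggestion (not the exact cookbook recipe), here is a minimal sketch using the core API, assuming an embedded GraphDatabaseService gdb as in the answer above; the bucket count, the relationship type names and the chunked commits are assumptions for the example:
// Sketch only: spread the million leafs over ~100 intermediate "bucket" nodes
// so that no single node ends up holding all of the relationships.
DynamicRelationshipType BUCKET = DynamicRelationshipType.withName("BUCKET");
DynamicRelationshipType LEAF = DynamicRelationshipType.withName("LEAF");
int bucketCount = 100; // assumption; in practice derive the bucket from a property such as time

Transaction tx = gdb.beginTx();
Node root = gdb.createNode();
Node[] buckets = new Node[bucketCount];
for (int b = 0; b < bucketCount; b++) {
    buckets[b] = gdb.createNode();
    root.createRelationshipTo(buckets[b], BUCKET);
}
for (int i = 0; i < 1000000; i++) {
    Node leaf = gdb.createNode();
    buckets[i % bucketCount].createRelationshipTo(leaf, LEAF); // bucket chosen here by index only
    if (i % 50000 == 0) { // commit in chunks, as in the core API example above
        tx.success(); tx.finish();
        tx = gdb.beginTx();
    }
}
tx.success(); tx.finish();
A lookup then only has to traverse root → bucket → leaf, and no single node carries a million relationships.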
Would that help?

Related

Dataflow CoGroupByKey is very slow for more than 10000 elements per key

I have two PCollection<KV<String, TableRow>>s: one has ~7 million rows and the other has ~1 million rows.
What I want to do is apply a left outer join between these two PCollections and, in case of a successful join, put all the data of the right TableRow into the left TableRow and return the results.
I have tried using CoGroupByKey in the Apache Beam SDK 2.10.0 for Java, but I am getting many hot keys, so fetching results after CoGroupByKey is getting slower, with the warning 'More 10000 elements per key, need to reiterate'. I have also tried shuffle mode service, but it did not help.
PCollection<TableRow> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
        new DoFn<KV<K, CoGbkResult>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                KV<K, CoGbkResult> e = c.element();
                // Get all collection 1 values
                Iterable<TableRow> pt1Vals = e.getValue().getAll(t1);
                Iterable<TableRow> pt2Vals = e.getValue().getAll(t2);
                for (TableRow tr : pt1Vals)
                {
                    TableRow out = tr.clone();
                    if (pt2Vals.iterator().hasNext())
                    {
                        for (TableRow tr1 : pt2Vals)
                        {
                            out.putAll(tr1);
                            c.output(out);
                        }
                    }
                    else
                    {
                        c.output(out);
                    }
                }
            }
        }));
What is the right way to perform this type of join in Dataflow?
I have done some research and found some information that may help you.
The data transferred by Dataflow between PCollections (serializable objects) may not live on a single machine. Furthermore, a transformation like GroupByKey/CoGroupByKey requires all the data for a key to be collected in one place before the result is populated; I don't know if you have it in a different structure.
Besides that, you can redistribute your keys, use fewer workers and add more memory, or try using Combine.perKey.
You can also try this workaround, or read this article for more information.
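One common way to read "redistribute your keys" is to salt the hot keys before grouping and strip the salt afterwards. A minimal, hypothetical sketch of just the salting step (numShards, the element types and the input PCollection name are assumptions, not from the question):
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// ...

final int numShards = 50; // assumed tuning parameter

PCollection<KV<String, TableRow>> salted = input.apply("SaltKeys",
    ParDo.of(new DoFn<KV<String, TableRow>, KV<String, TableRow>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            KV<String, TableRow> kv = c.element();
            int shard = ThreadLocalRandom.current().nextInt(numShards);
            // the "#<shard>" suffix has to be stripped again after the grouping step;
            // in real code prefer a named static DoFn so it serializes cleanly
            c.output(KV.of(kv.getKey() + "#" + shard, kv.getValue()));
        }
    }));
Keep in mind that salting only helps a join if the smaller side is replicated to every salted shard (or handled via a side input), so it adds bookkeeping; it does, however, break up the single hot key that makes the CoGroupByKey result slow to iterate.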

Cassandra, Java and MANY Async request : is this good?

I'm developing a Java application with Cassandra, with this table:
id      | registration | name
1       | 1            | xxx
1       | 2            | xxx
1       | 3            | xxx
2       | 1            | xxx
2       | 2            | xxx
...     | ...          | ...
...     | ...          | ...
100,000 | 34           | xxx
My table has a very large number of rows (more than 50,000,000). I have a list myListIds of String ids to iterate over. I could use:
SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
// imagine more than 10,000,000 numbers in 'IN'
But this is a bad pattern. So instead I'm using async requests this way:
List<ResultSetFuture> futures = new ArrayList<>();
Map<String, ResultSetFuture> mapFutures = new HashMap<>();
// mapFutures : key = id & value = data from Cassandra
for (String id : myListIds)
{
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
    mapFutures.put(id, resultSetFuture);
}
Then I will process my data with the getUninterruptibly() method.
Here is my problem: I'm doing maybe more than 10,000,000 Cassandra requests (one request for each 'id'), and I'm putting all these results inside a Map.
Can this cause a heap memory error? What's the best way to deal with that?
Thank you
Note: your question is "is this a good design pattern".
If you are having to perform 10,000,000 Cassandra data requests then you have structured your data incorrectly. Ultimately you should design your database from the ground up so that you only ever have to perform 1-2 fetches.
Now, granted, if you have 5000 Cassandra nodes this might not be a huge problem (it probably still is), but it still reeks of bad database design. I think the solution is to take a look at your schema.
I see the following problems with your code:
An overloaded Cassandra cluster: it won't be able to process so many async requests, and your requests will fail with NoHostAvailableException.
An overloaded Cassandra driver: your client app will fail with IO exceptions, because the system will not be able to process so many async requests (see details about connection tuning at https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/).
And yes, memory issues are possible. It depends on the data size.
A possible solution is to limit the number of in-flight async requests and process the data in chunks (e.g. see this answer), as sketched below.
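A minimal sketch of that idea with the DataStax Java driver, reusing the session, statement and myListIds from the question; the limit of 256 concurrent requests is an arbitrary assumption:
import java.util.concurrent.Semaphore;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;

// ...

final Semaphore inFlight = new Semaphore(256); // at most 256 requests on the wire at once

for (String id : myListIds) {
    inFlight.acquireUninterruptibly(); // blocks while 256 requests are still pending
    ResultSetFuture future = session.executeAsync(statement.bind(id));
    Futures.addCallback(future, new FutureCallback<ResultSet>() {
        public void onSuccess(ResultSet rs) {
            // process the rows here instead of keeping millions of results in a Map
            inFlight.release();
        }
        public void onFailure(Throwable t) {
            inFlight.release(); // also release on failure, and log/retry as needed
        }
    }, MoreExecutors.directExecutor());
}
This keeps memory bounded by the number of in-flight requests instead of by the total number of ids.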

DFS performance of Spark GraphX vs simple Java DFS implementation

Considering a graph with 14,000 vertices and 14,000 edges, I wonder why GraphX takes much more time than a plain Java graph implementation to get all the paths from a vertex to the leaves?
The java implementation: A few seconds
The Graphx implementation: Several minutes
Is Spark GraphX really suitable for this kind of processing?
My system:
i5-7500 @ 3.40GHz,
8GB RAM
The Pregel algorithm:
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)
val sssp = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)
This happened to me as well when implementing some algorithms on GraphX. I believe that GraphX is well adapted for a distributed environment where you have big graphs split across multiple machines.
But since you say that you use one node, have you checked the number of workers used? The number of executors? The amount of memory used by each executor? These configuration parameters definitely play an important role in increasing or decreasing the performance of your GraphX application.
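For reference, these are the kinds of knobs meant above; a rough sketch with the Java API (the values are placeholders, not recommendations, and some of them only matter on a real cluster manager):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Placeholder values; spark.driver.memory normally has to be set before the JVM
// starts (e.g. via spark-submit) and is listed here only for completeness.
SparkConf conf = new SparkConf()
        .setAppName("graphx-dfs")
        .setMaster("local[4]")                 // or a cluster URL
        .set("spark.executor.instances", "2")  // only meaningful on a cluster manager such as YARN
        .set("spark.executor.cores", "2")
        .set("spark.executor.memory", "2g")
        .set("spark.driver.memory", "2g");

JavaSparkContext sc = new JavaSparkContext(conf);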

Check for substring efficiently for large data sets

I have:
a database table with 400 000 000 rows (Cassandra 3)
a list of circa 10 000 keywords
both data sets are expected to grow in time
I need to:
check if a specified column contains a keyword
count how many rows contain the keyword in that column
Which approach should I choose?
Approach 1 (Secondary index):
Create a secondary SASI index on the table
Find matches for a given keyword "on the fly" at any time (a sketch of the index DDL follows below)
However, I am afraid of:
capacity problems - secondary indices can consume extra space, and for such a large table it could be too much
performance - I am not sure whether finding a keyword among hundreds of millions of rows can be achieved in a reasonable time
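For reference, the kind of SASI index meant in approach 1 could be created roughly like the following; the keyspace, table and column names are placeholders, and CONTAINS mode is what allows LIKE '%keyword%' searches:
// Hypothetical keyspace/table/column names; executed once via the Java driver.
session.execute(
    "CREATE CUSTOM INDEX IF NOT EXISTS title_sasi_idx ON mykeyspace.rows (title) " +
    "USING 'org.apache.cassandra.index.sasi.SASIIndex' " +
    "WITH OPTIONS = {'mode': 'CONTAINS'}");

// A per-keyword count can then be expressed as a LIKE query against the index.
ResultSet rs = session.execute(
    "SELECT count(*) FROM mykeyspace.rows WHERE title LIKE '%" + keyword + "%'");
long matches = rs.one().getLong(0);
This avoids the full-table scan, but counting over a very large match set is still expensive, which is exactly the performance concern listed above.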
Approach 2 (Java job - brute force):
Java job that continuously iterates over data
Matches are saved into cache
Cache is updated during the next iteration
// Paginate through data...
String page = null;
do {
    PagingState state = page == null ? null : PagingState.fromString(page);
    PagedResult<DataRow> res = getDataPaged(query, status, PAGE_SIZE, state);
    // Iterate through the current page ...
    for (DataRow row : res.getResult()) {
        // Skip empty titles
        if (row.getTitle().length() == 0) {
            continue;
        }
        // Find match in title
        for (String k : keywords) {
            if (k.length() > row.getTitle().length()) {
                continue;
            }
            if (row.getTitle().toLowerCase().contains(k.toLowerCase())) {
                // TODO: SAVE match
                break;
            }
        }
    }
    status = res.getResult();
    page = res.getPage();
    // TODO: Wait here to reduce DB load
} while (page != null);
Problems
It could be very slow to iterate through the whole table. If I waited for one second per every 1000 rows, this cycle would finish in about 4.6 days.
This would require extra space for the cache; moreover, frequent deletions from the cache would produce tombstones in Cassandra.
A better way would be to use a search engine like Solr or Elasticsearch. Full-text search is their speciality. You could easily dump your data from Cassandra to Elasticsearch and implement your Java job on top of Elasticsearch.
EDIT:
With Cassandra you can request your query result as JSON, and Elasticsearch speaks only JSON, so you will be able to transfer your data very easily.
Elasticsearch
Solr
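For instance, the per-keyword count could then become a simple query; a rough sketch against the Elasticsearch 7.x high-level REST client (index name, field name and client setup are assumptions):
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Hypothetical index "rows" with a text field "title"; client creation omitted.
long countRowsContaining(RestHighLevelClient client, String keyword) throws Exception {
    SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.matchQuery("title", keyword))
            .trackTotalHits(true) // exact counts above 10000 hits
            .size(0);             // we only need the count, not the documents
    SearchRequest request = new SearchRequest("rows").source(source);
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    return response.getHits().getTotalHits().value;
}
Note that matchQuery works on analyzed tokens rather than arbitrary substrings; true '%contains%' semantics would need a wildcard query or an n-gram analyzer.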

How could a distributed queue-like-thing be implemented on top of an RDBMS or NoSQL datastore or other messaging system (e.g., rabbitmq)?

From the wouldn't-it-be-cool-if category of questions ...
By "queue-like-thing" I mean supports the following operations:
append(entry:Entry) - add entry to tail of queue
take(): Entry - remove entry from head of queue and return it
promote(entry_id) - move the entry one position closer to the head; the entry that currently occupies that position is moved to the old position
demote(entry_id) - the opposite of promote(entry_id)
Optional operations would be something like:
promote(entry_id, amount) - like promote(entry_id) except you specify the number of positions
demote(entry_id, amount) - opposite of promote(entry_id, amount)
Of course, if we allow amount to be positive or negative, we can consolidate the promote/demote methods into a single move(entry_id, amount) method (a compact interface sketch of the whole contract follows below).
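Expressed as a Java interface, just to pin down the contract (the return types and the generic parameter are assumptions, everything else comes from the list above):
// Contract implied by the operations above; move() subsumes promote/demote with a signed amount.
interface DistributedQueue<E> {
    long append(E entry);                 // add entry to the tail, return its id (return type assumed)
    E take();                             // remove and return the entry at the head
    void promote(long entryId);           // move one position toward the head
    void demote(long entryId);            // move one position toward the tail
    void move(long entryId, int amount);  // positive = promote, negative = demote
}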
It would be ideal if the following operations could be performed on the queue in a distributed fashion (multiple clients interacting with the queue):
queue = ...
queue.append( a )
queue.append( b )
queue.append( c )
print queue
"a b c"
queue.promote( b.id )
print queue
"b a c"
queue.demote( a.id )
print queue
"b c a"
x = queue.take()
print x
"b"
print queue
"c a"
Are there any data stores that are particularly apt for this use case? The queue should always be in a consistent state even if multiple users are modifying the queue simultaneously.
If it weren't for the promote/demote/move requirement, there wouldn't be much of a problem.
Edit:
Bonus points if there are Java and/or Python libraries to accomplish the task outlined above.
Solution should scale extremely well.
Redis supports lists and ordered sets: http://redis.io/topics/data-types#lists
It also supports transactions and publish/subscribe messaging. So, yes, I would say this can be easily done on redis.
Update: In fact, about 80% of it has been done many times: http://www.google.co.uk/search?q=python+redis+queue
Several of those hits could be upgraded to add what you want. You would have to use transactions to implement the promote/demote operations.
It might be possible to use lua on the server side to create that functionality, rather than having it in client code. Alternatively, you could create a thin wrapper around redis on the server, that implements just the operations you want.
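To make the transaction idea concrete, here is a rough Java sketch with Jedis that stores the queue in a sorted set whose score is the position; the key name, the id scheme and the locking strategy are assumptions, not a production implementation:
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

// Sketch: entries live in one sorted set, lower score = closer to the head.
public class RedisQueueSketch {
    private static final String KEY = "queue";
    private final Jedis jedis = new Jedis("localhost");

    public void append(String entryId) {
        // positions come from a shared counter so that concurrent clients do not collide
        long position = jedis.incr(KEY + ":counter");
        jedis.zadd(KEY, position, entryId);
    }

    public String take() {
        // not atomic as written; production code would use ZPOPMIN or WATCH/MULTI here too
        Set<String> head = jedis.zrange(KEY, 0, 0); // member with the lowest score
        if (head.isEmpty()) return null;
        String entryId = head.iterator().next();
        jedis.zrem(KEY, entryId);
        return entryId;
    }

    public void promote(String entryId) {
        jedis.watch(KEY); // optimistic locking for the read-modify-write
        Double score = jedis.zscore(KEY, entryId);
        if (score == null) { jedis.unwatch(); return; }
        // the entry immediately ahead of us: largest score strictly below ours
        Set<String> ahead = jedis.zrevrangeByScore(KEY, "(" + score, "-inf", 0, 1);
        if (ahead.isEmpty()) { jedis.unwatch(); return; }
        String other = ahead.iterator().next();
        double otherScore = jedis.zscore(KEY, other);
        Transaction t = jedis.multi(); // swap the two scores atomically
        t.zadd(KEY, otherScore, entryId);
        t.zadd(KEY, score, other);
        t.exec(); // a null/empty result means a concurrent change; retry in that case
    }
}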
Python: "Batteries Included"
Rather than looking to a data store like RabbitMQ, Redis, or an RDBMS, I think python and a couple libraries have more than enough to solve this problem. Some may complain that this do-it-yourself approach is re-inventing the wheel but I prefer running a hundred lines of python code over managing another data store.
Implementing a Priority Queue
The operations that you define: append, take, promote, and demote, describe a priority queue. Unfortunately python doesn't have a built-in priority queue data type. But it does have a heap library called heapq and priority queues are often implemented as heaps. Here's my implementation of a priority queue meeting your requirements:
import heapq

class PQueue:
    """
    Implements a priority queue with append, take, promote, and demote
    operations.
    """
    def __init__(self):
        """
        Initialize empty priority queue.
        self.toll is max(priority) and max(rowid) in the queue
        self.heap is the heap maintained for take command
        self.rows is a mapping from rowid to items
        self.pris is a mapping from priority to items
        """
        self.toll = 0
        self.heap = list()
        self.rows = dict()
        self.pris = dict()

    def append(self, value):
        """
        Append value to our priority queue.
        The new value is added with lowest priority as an item. Items are
        threeple lists consisting of [priority, rowid, value]. The rowid
        is used by the promote/demote commands.
        Returns the new rowid corresponding to the new item.
        """
        self.toll += 1
        item = [self.toll, self.toll, value]
        self.heap.append(item)
        self.rows[self.toll] = item
        self.pris[self.toll] = item
        return self.toll

    def take(self):
        """
        Take the highest priority item out of the queue.
        Returns the value of the item.
        """
        item = heapq.heappop(self.heap)
        del self.pris[item[0]]
        del self.rows[item[1]]
        return item[2]

    def promote(self, rowid):
        """
        Promote an item in the queue.
        The promoted item swaps position with the next highest item.
        Returns the number of affected rows.
        """
        if rowid not in self.rows: return 0
        item = self.rows[rowid]
        item_pri, item_row, item_val = item
        next = item_pri - 1
        if next in self.pris:
            iota = self.pris[next]
            iota_pri, iota_row, iota_val = iota
            iota[1], iota[2] = item_row, item_val
            item[1], item[2] = iota_row, iota_val
            self.rows[item_row] = iota
            self.rows[iota_row] = item
            return 2
        return 0
The demote command is nearly identical to the promote command so I'll omit it for brevity. Note that this depends only on python's lists, dicts, and heapq library.
Serving our Priority Queue
Now with the PQueue data type, we'd like to allow distributed interactions with an instance. A great library for this is gevent. Though gevent is relatively new and still beta, it's wonderfully fast and well tested. With gevent, we can set up a socket server listening on localhost:4040 pretty easily. Here's my server code:
pqueue = PQueue()

def pqueue_server(sock, addr):
    text = sock.recv(1024)
    cmds = text.split(' ')
    if cmds[0] == 'append':
        result = pqueue.append(cmds[1])
    elif cmds[0] == 'take':
        result = pqueue.take()
    elif cmds[0] == 'promote':
        result = pqueue.promote(int(cmds[1]))
    elif cmds[0] == 'demote':
        result = pqueue.demote(int(cmds[1]))
    else:
        result = ''
    sock.sendall(str(result))
    print 'Request:', text, '; Response:', str(result)

if args.listen:
    server = StreamServer(('127.0.0.1', 4040), pqueue_server)
    print 'Starting pqueue server on port 4040...'
    server.serve_forever()
Before that runs in production, you'll of course want to do some better error/buffer handling. But it'll work just fine for rapid prototyping. Notice that this doesn't require any locking around the pqueue object. Gevent doesn't actually run code in parallel, it just gives that impression. The drawback is that more cores won't help, but the benefit is lock-free code.
Don't get me wrong, the gevent StreamServer will process multiple requests at the same time. But it switches between answering requests through cooperative multitasking. This means you have to yield the coroutine's time slice. While gevent's socket I/O functions are designed to yield, our pqueue implementation is not. Fortunately, the pqueue completes its tasks really quickly.
A Client Too
While prototyping, I found it useful to have a client as well. It took some googling to write a client so I'll share that code too:
if args.client:
    while True:
        msg = raw_input('> ')
        sock = gsocket.socket(gsocket.AF_INET, gsocket.SOCK_STREAM)
        sock.connect(('127.0.0.1', 4040))
        sock.sendall(msg)
        text = sock.recv(1024)
        sock.close()
        print text
To use the new data store, first start the server and then start the client. At the client prompt you ought to be able to do:
> append one
1
> append two
2
> append three
3
> promote 2
2
> promote 2
0
> take
two
Scaling Extremely Well
Given your thinking about a data store, it seems you're really concerned with throughput and durability. But "scale extremely well" doesn't quantify your needs. So I decided to benchmark the above with a test function. Here's the test function:
def test():
    import time
    import urllib2
    import subprocess
    import random
    random = random.Random(0)
    from progressbar import ProgressBar, Percentage, Bar, ETA
    widgets = [Percentage(), Bar(), ETA()]

    def make_name():
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
        return ''.join(random.choice(alphabet)
                       for rpt in xrange(random.randrange(3, 20)))

    def make_request(cmds):
        sock = gsocket.socket(gsocket.AF_INET, gsocket.SOCK_STREAM)
        sock.connect(('127.0.0.1', 4040))
        sock.sendall(cmds)
        text = sock.recv(1024)
        sock.close()

    print 'Starting server and waiting 3 seconds.'
    subprocess.call('start cmd.exe /c python.exe queue_thing_gevent.py -l',
                    shell=True)
    time.sleep(3)

    tests = []

    def wrap_test(name, limit=10000):
        def wrap(func):
            def wrapped():
                progress = ProgressBar(widgets=widgets)
                for rpt in progress(xrange(limit)):
                    func()
                secs = progress.seconds_elapsed
                print '{0} {1} records in {2:.3f} s at {3:.3f} r/s'.format(
                    name, limit, secs, limit / secs)
            tests.append(wrapped)
            return wrapped
        return wrap

    def direct_append():
        name = make_name()
        pqueue.append(name)

    count = 1000000

    @wrap_test('Loaded', count)
    def direct_append_test(): direct_append()

    def append():
        name = make_name()
        make_request('append ' + name)

    @wrap_test('Appended')
    def append_test(): append()

    ...

    print 'Running speed tests.'
    for tst in tests: tst()
Benchmark Results
I ran 6 tests against the server running on my laptop. I think the results scale extremely well. Here's the output:
Starting server and waiting 3 seconds.
Running speed tests.
100%|############################################################|Time: 0:00:21
Loaded 1000000 records in 21.770 s at 45934.773 r/s
100%|############################################################|Time: 0:00:06
Appended 10000 records in 6.825 s at 1465.201 r/s
100%|############################################################|Time: 0:00:06
Promoted 10000 records in 6.270 s at 1594.896 r/s
100%|############################################################|Time: 0:00:05
Demoted 10000 records in 5.686 s at 1758.706 r/s
100%|############################################################|Time: 0:00:05
Took 10000 records in 5.950 s at 1680.672 r/s
100%|############################################################|Time: 0:00:07
Mixed load processed 10000 records in 7.410 s at 1349.528 r/s
Final Frontier: Durability
Finally, durability is the only problem I didn't completely prototype. But I don't think it's that hard either. In our priority queue, the heap (list) of items has all the information we need to persist the data type to disk. Since, with gevent, we can also spawn functions in a multi-processing way, I imagined using a function like this:
def save_heap(heap, toll):
    name = 'heap-{0}.txt'.format(toll)
    with open(name, 'w') as temp:
        for val in heap:
            temp.write(str(val))
            gevent.sleep(0)
and adding a save function to our priority queue:
def save(self):
    heap_copy = tuple(self.heap)
    toll = self.toll
    gevent.spawn(save_heap, heap_copy, toll)
You could now copy the Redis model of forking and writing the data store to disk every few minutes. If you need even greater durability then couple the above with a system that logs commands to disk. Together, those are the AOF and RDB persistence methods that Redis uses.
Websphere MQ can do almost all of this.
The promote/demote is almost possible, by removing the message from the queue and putting it back on with a higher/lower priority, or, by using the "CORRELID" as a sequence number.
What's wrong with RabbitMQ? It sounds exactly like what you need.
We extensively use Redis as well in our production environment, but it doesn't have some of the functionality queues usually have, like marking a task as complete, or re-sending the task if it isn't completed within some TTL. It does, on the other hand, have other features a queue doesn't have, like being a generic storage, and it is REALLY fast.
Use Redisson; it implements the familiar List, Queue, BlockingQueue, and Deque Java interfaces in a distributed fashion backed by Redis. Example with a Deque:
Redisson redisson = Redisson.create();
RDeque<SomeObject> queue = redisson.getDeque("anyDeque");
queue.addFirst(new SomeObject());
queue.addLast(new SomeObject());
SomeObject obj = queue.removeFirst();
SomeObject someObj = queue.removeLast();
redisson.shutdown();
Other samples:
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#77-list
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#78-queue
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#710-blocking-queue
If you for some reason decide to use an SQL database as a backend, I would not use MySQL, as it requires polling (and I would not use it for lots of other reasons), but PostgreSQL supports LISTEN/NOTIFY for signalling other clients so that they do not have to poll for changes. However, it signals all listening clients at once, so you would still need a mechanism for choosing a winning listener.
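For illustration, LISTEN/NOTIFY from Java via the PostgreSQL JDBC driver looks roughly like this; the channel name, connection details and polling interval are placeholders, and the notification itself carries no queue entry, it just tells listeners to go look:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class PgQueueListener {

    // Listener side: subscribe to a channel and react to notifications.
    static void listenForQueueEvents() throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:postgresql:queue_db", "user", "pass");
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("LISTEN queue_events");
        }
        PGConnection pgConn = conn.unwrap(PGConnection.class);
        while (true) {
            // pgjdbc does not push notifications; getNotifications() drains whatever has arrived
            // (a dummy query such as "SELECT 1" may be needed to pull pending messages off the socket).
            PGNotification[] notifications = pgConn.getNotifications();
            if (notifications != null) {
                for (PGNotification n : notifications) {
                    // Something changed: now claim an entry, e.g. with
                    // SELECT ... FOR UPDATE SKIP LOCKED (or similar), so only one listener wins.
                    System.out.println("notified on " + n.getName() + ": " + n.getParameter());
                }
            }
            Thread.sleep(500); // placeholder polling interval for the local notification buffer
        }
    }

    // Producer side (another connection), after inserting or reordering a queue row:
    //   stmt.execute("NOTIFY queue_events, 'changed'");
}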
As a side note, I am not sure whether a promote/demote mechanism would be useful; it would be better to schedule the jobs appropriately while inserting...
