Neo4j Bulk Data - Create Relationship [OutOfMemory Exception] - java

I am using a Neo4j procedure to create relationships on bulk data.
Initially, I inserted all of that data using LOAD CSV:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///XXXX.csv" AS row
....
The data size is quite large [10M rows], but it loaded successfully.
My problem is that I want to create many-to-many relationships between all of these nodes,
but I get an [OutOfMemoryException] while executing this query:
MATCH (n1:x {REMARKS: "LATEST"}) MATCH (n2:x {REMARKS: "LATEST"}) WHERE n1.DIST_ID = n2.ENROLLER_ID CREATE (n1)-[:ENROLLER]->(n2);
I have already created indexes and constraints as well.
Any ideas? Please help.

The problem is that your query is executed in a single transaction, which leads to the [OutOfMemoryException]. This is a problem because, at the moment, periodic commits are only available for LOAD CSV. So you can, for example, re-read the CSV after the initial load:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///XXXX.csv" AS row
MATCH (n1:x {REMARKS: "LATEST", DIST_ID: row.DIST_ID})
WITH n1
MATCH (n2:x {REMARKS: "LATEST"}) WHERE n1.DIST_ID = n2.ENROLLER_ID
CREATE (n1)-[:ENROLLER]->(n2);
Or try the trick with periodic committing from the APOC library:
call apoc.periodic.commit("
MATCH (n2:x {REMARKS:'Latest'}) WHERE exists(n2.ENROLLER_ID)
WITH n2 LIMIT {perCommit}
OPTIONAL MATCH (n1:x {REMARKS:'Latest'}) WHERE n1.DIST_ID = n2.ENROLLER_ID
WITH n2, collect(n1) as n1s
FOREACH(n1 in n1s|
CREATE (n1)-[:ENROLLER]->(n2)
)
REMOVE n2.ENROLLER_ID
RETURN count(n2)",
{perCommit: 1000}
)
P.S. The ENROLLER_ID property is used as a flag to select nodes for processing. Of course, you can use another flag that is set during processing.
Or, a cleaner variant with apoc.periodic.iterate:
CALL apoc.periodic.iterate("
MATCH (n1:x {REMARKS:'Latest'})
MATCH (n2:x {REMARKS:'Latest'}) WHERE n1.DIST_ID = n2.ENROLLER_ID
RETURN n1,n2
","
WITH {n1} as n1, {n2} as n2
MERGE (n1)-[:ENROLLER]->(n2)
", {batchSize:10000, parallel:true}
)
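Since the question is tagged java, here is a minimal sketch of how such a batched call could be issued from Java using the official Bolt driver (the 1.x org.neo4j.driver.v1 API for the Neo4j 3.x series). The connection URI and credentials are placeholders, and the inner Cypher mirrors the apoc.periodic.iterate call above.
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

public class EnrollerLinker {
    public static void main(String[] args) {
        // Placeholder URI and credentials - adjust for your environment.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // The batching happens server-side inside apoc.periodic.iterate, so the
            // client sends only one statement. Depending on your APOC version, the
            // inner statement may need the "WITH {n1} AS n1, {n2} AS n2" form shown
            // above instead of referring to n1/n2 directly.
            String cypher =
                "CALL apoc.periodic.iterate(\"" +
                "MATCH (n1:x {REMARKS:'LATEST'}) " +
                "MATCH (n2:x {REMARKS:'LATEST'}) WHERE n1.DIST_ID = n2.ENROLLER_ID " +
                "RETURN n1, n2" +
                "\", \"" +
                "MERGE (n1)-[:ENROLLER]->(n2)" +
                "\", {batchSize:10000, parallel:false})";
            session.run(cypher).consume();
        }
    }
}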

Related

jOOQ batch insertion inconsistency

While working with batch insertion in jOOQ (v3.14.4) I noticed some inconsistency when looking into PostgreSQL (v12.6) logs.
When doing context.batch(<query>).bind(<1st record>).bind(<2nd record>)...bind(<nth record>).execute() the logs show that the records are actually inserted one by one instead of all in one go.
While doing context.insert(<fields>).values(<1st record>).values(<2nd record>)...values(<nth record>) actually inserts everything in one go judging by the postgres logs.
Is it a bug in jOOQ itself, or was I using the batch(...) functionality incorrectly?
Here are two code snippets that are supposed to do the same thing, but in reality the first one inserts records one by one while the second one actually does the batch insertion.
public void batchInsertEdges(List<EdgesRecord> edges) {
    Query batchQuery = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES,
            Edges.EDGES.METADATA)
            .values((Long) null, (Long) null, (CallSiteRecord[]) null, (JSONB) null)
            .onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
            .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
            .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class));
    var batchBind = context.batch(batchQuery);
    for (var edge : edges) {
        batchBind = batchBind.bind(edge.getSourceId(), edge.getTargetId(),
                edge.getCallSites(), edge.getMetadata());
    }
    batchBind.execute();
}
public void batchInsertEdges(List<EdgesRecord> edges) {
    var insert = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES, Edges.EDGES.METADATA);
    for (var edge : edges) {
        insert = insert.values(edge.getSourceId(), edge.getTargetId(), edge.getCallSites(), edge.getMetadata());
    }
    insert.onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
            .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
            .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class))
            .execute();
}
I would appreciate some help figuring out why the first code snippet does not work as intended while the second one does. Thank you!
There's a difference between "batch processing" (as in JDBC batch) and "bulk processing" (as in what many RDBMS call "bulk updates").
This page of the manual about data import explains the difference.
Bulk size: The number of rows that are sent to the server in one SQL statement.
Batch size: The number of statements that are sent to the server in one JDBC statement batch.
These are fundamentally different things. Both help improve performance. Bulk data processing does so by helping the RDBMS optimise resource allocation algorithms as it knows it is about to insert 10 records. Batch data processing does so by reducing the number of round trips between client and server. Whether either approach has a big impact on any given RDBMS is obviously vendor specific.
In other words, both of your approaches work as intended.
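For illustration only, here is roughly what the two modes look like in plain JDBC terms; the edges table and its columns are simplified assumptions mirroring the question, and jOOQ generates conceptually similar SQL under the hood.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class InsertModes {

    // "Bulk": one INSERT statement carrying many rows, analogous to the
    // second jOOQ snippet (insert.values(...).values(...)).
    static void bulkInsert(Connection conn, List<long[]> rows) throws SQLException {
        StringBuilder sql = new StringBuilder("INSERT INTO edges (source_id, target_id) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            int idx = 1;
            for (long[] row : rows) {
                ps.setLong(idx++, row[0]);
                ps.setLong(idx++, row[1]);
            }
            ps.executeUpdate();
        }
    }

    // "Batch": one parameterized statement executed once per row, but all
    // executions are sent to the server in a single JDBC batch round trip,
    // analogous to the first jOOQ snippet (context.batch(query).bind(...)).
    static void batchInsert(Connection conn, List<long[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO edges (source_id, target_id) VALUES (?, ?)")) {
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}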

Extract subgraph in neo4j using cypher query

I'm using Neo4j 3.1 with Java 8, and I want to extract a connected subgraph to store as a test database.
Is it possible to do this, and how?
How can I do it with the RETURN clause, which returns the output? Do I have to create new nodes and relationships, or can I just export the subgraph and put it into a new database?
Also, how can I extract a connected subgraph, given that my graph as a whole is disconnected?
Thank you
There are two parts to this: getting the connected subgraph, and then finding a means to export it.
APOC Procedures seems like it can cover both of these. The approach in this answer using the path expander should get you all the nodes in the connected subgraph (if the relationship type doesn't matter, leave off the relationshipFilter parameter).
The next step is to get all relationships between all of those nodes. APOC's apoc.algo.cover() function in the graph algorithms section should accomplish this.
Something like this (assuming this is after the subgraph query, and subgraphNode is in scope for the column of distinct subgraph nodes):
...
WITH COLLECT(subgraphNode) as subgraph, COLLECT(id(subgraphNode)) as ids
CALL apoc.algo.cover(ids) YIELD rel
WITH subgraph, COLLECT(rel) as rels
...
Now that you have the collections of both the nodes and relationships in the subgraph, you can export them.
APOC Procedures offers several means of exporting, from CSV to CypherScript. You should be able to find an option that works for you.
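As a rough, untested sketch of how the pieces could be chained together from Java (the question mentions Java 8), assuming APOC's apoc.path.subgraphNodes and apoc.export.cypher.data procedures are available; the start node, label, file name, and credentials are placeholders, the exact procedure signatures can vary between APOC releases, and file export requires apoc.export.file.enabled=true in the server config.
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

public class SubgraphExport {
    public static void main(String[] args) {
        // Collect the connected subgraph around an assumed start node, gather the
        // relationships between those nodes with apoc.algo.cover, then hand both
        // collections to the Cypher-script exporter. All names are illustrative.
        String cypher =
            "MATCH (start:Product {id: 'product123'}) " +
            "CALL apoc.path.subgraphNodes(start, {}) YIELD node " +
            "WITH collect(node) AS subgraph, collect(id(node)) AS ids " +
            "CALL apoc.algo.cover(ids) YIELD rel " +
            "WITH subgraph, collect(rel) AS rels " +
            "CALL apoc.export.cypher.data(subgraph, rels, 'subgraph.cypher', {}) " +
            "YIELD file RETURN file";
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            session.run(cypher).consume();
        }
    }
}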
You can also use the neo4j-shell to dump the result of a query to a file and then use that same file to re-import it into another Neo4j database:
ikwattro@graphaware-team ~/d/_/310> ./bin/neo4j-shell -c 'dump MATCH (n:Product)-[r*2]->(x) RETURN n, r, x;' > result.cypher
Check the file:
ikwattro@graphaware-team ~/d/_/310> cat result.cypher
begin
commit
begin
create (_1:`Product` {`id`:"product123"})
create (_2:`ProductInformation` {`id`:"product123EXCEL"})
create (_3:`ProductInformationElement` {`id`:"product123EXCELtitle", `key`:"title", `value`:"Original Title"})
create (_5:`ProductInformationElement` {`id`:"product123EXCELproduct_type", `key`:"product_type", `value`:"casual_bag"})
create (_1)-[:`PRODUCT_INFORMATION`]->(_2)
create (_2)-[:`INFORMATION_ELEMENT`]->(_3)
create (_2)-[:`INFORMATION_ELEMENT`]->(_5)
;
commit
Use this file to feed another Neo4j instance:
ikwattro@graphaware-team ~/d/_/310> ./bin/neo4j-shell -file result.cypher
Transaction started
Transaction committed
Transaction started
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 4
Relationships created: 3
Properties set: 8
Labels added: 4
52 ms
Transaction committed

How to get a myBatis (iBatis) result as an Iterable (in the case of very large result sets)

When using myBatis, I need to fetch a very large result set from the DB and process it sequentially (such as a CSV export).
I am afraid that if the return type is List, all of the returned data will be held in memory and cause an OutOfMemoryException.
So I want to get the result as a ResultSet or an Iterable<MyObject> using myBatis.
Are there any solutions?
Starting from myBatis 3.4.1 you can return a Cursor, which is Iterable and can be used like this (on the condition that the result is ordered; see the Cursor API javadoc for details):
MyEntityMapper.java
@Select({
    "SELECT *",
    "FROM my_entity",
    "ORDER BY id"
})
Cursor<MyEntity> getEntities();
MapperClient.java
MyEntityMapper mapper = session.getMapper(MyEntityMapper.class);
try (Cursor<MyEntity> entities = mapper.getEntities()) {
    for (MyEntity entity : entities) {
        // process one entity
    }
}
You should use fetchSize (refer here). Based on the heap size and the data size per row, you can choose how many rows are fetched from the database at a time. Alternatively, since you are basically exporting data to CSV, you can use Spring Batch, which has a MyBatis paging item reader. The drawback of that item reader is that a new request is fired for each page, which increases the load on your database. If you are not worried about the load, you can go ahead with the paging item reader; otherwise there is another, simpler reader called JdbcCursorItemReader.
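A small sketch of the fetchSize suggestion, reusing the Cursor-based mapper idea from the previous answer; the table name, entity type, and fetch size are placeholders, and driver-specific conditions apply.
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.cursor.Cursor;

public interface MyEntityStreamingMapper {
    // fetchSize hints the JDBC driver to stream rows in chunks rather than
    // materialising the whole result set. Some drivers need extra conditions
    // for this to take effect (e.g. PostgreSQL only streams with auto-commit
    // off; MySQL expects Integer.MIN_VALUE for row-by-row streaming).
    @Select("SELECT * FROM my_entity ORDER BY id")
    @Options(fetchSize = 1000)
    Cursor<MyEntity> streamEntities();
}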

Spring Batch: how to write valid data to one table and invalid data to another table

I have a CSV file like:
day,cost
20140101, 20
2014-01-5, 20
20140101, ab
There are some invalid rows, and I want to load the valid data into table_normal and the invalid data into table_unnormal.
The final data should be:
For table_normal:
day,cost
20140101, 20
For table_unnormal:
day,cost, reason
2014-01-5, 20, 'invalid day'
20140101, ab,'invalid cost'
I know how to determine the reason in the processor, but how can the job write to different tables?
I could suggest 3 ways to do this, none of which is very direct and easy.
a) Write your own custom JDBC ItemWriter - you can filter the items however you want and write some records to table_normal and some to table_unnormal (a sketch follows below).
b) Use a CompositeItemWriter - both delegate writers will get the "full record list" from the processor, and you can then filter out the records needed in each writer. Very similar to (a).
c) If you can do two passes over the input, you can write your job in two steps:
Step 1: Read records --> process only bad records --> write to table_unnormal
Step 2: Read records --> process only good records --> write to table_normal
There isn't a good built-in feature to handle this scenario in Spring Batch directly (at least none I am aware of).
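Here is a minimal sketch of option (a), assuming the Spring Batch 4.x ItemWriter signature and a hypothetical CostRecord type whose processor has already set a validity flag and a rejection reason; the two delegates would typically be JdbcBatchItemWriters configured for table_normal and table_unnormal.
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemWriter;

public class RoutingItemWriter implements ItemWriter<CostRecord> {

    private final ItemWriter<CostRecord> normalWriter;    // writes to table_normal
    private final ItemWriter<CostRecord> unnormalWriter;  // writes to table_unnormal

    public RoutingItemWriter(ItemWriter<CostRecord> normalWriter,
                             ItemWriter<CostRecord> unnormalWriter) {
        this.normalWriter = normalWriter;
        this.unnormalWriter = unnormalWriter;
    }

    @Override
    public void write(List<? extends CostRecord> items) throws Exception {
        List<CostRecord> valid = new ArrayList<>();
        List<CostRecord> invalid = new ArrayList<>();
        for (CostRecord item : items) {
            if (item.isValid()) {
                valid.add(item);
            } else {
                invalid.add(item);   // the reason was already populated by the processor
            }
        }
        if (!valid.isEmpty()) {
            normalWriter.write(valid);
        }
        if (!invalid.isEmpty()) {
            unnormalWriter.write(invalid);
        }
    }
}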

Produce HFiles for multiple tables to bulk load in a single MapReduce job

I am using MapReduce and HFileOutputFormat to produce HFiles and bulk load them directly into an HBase table.
Now, while reading the input files, I want to produce HFiles for two tables and bulk load both outputs in a single MapReduce job.
I searched the web and saw some links about MultiHFileOutputFormat, but couldn't find a real solution.
Do you think that it is possible?
My way is:
Use HFileOutputFormat as usual; when the job is completed, doBulkLoad writes into table1.
Keep a List of puts in the mapper, and a global MAX_PUTS value.
When puts.size() > MAX_PUTS, do:
String tableName = conf.get("hbase.table.name.dic", table2);
HTable table = new HTable(conf, tableName);
table.setAutoFlushTo(false);
table.setWriteBufferSize(1024*1024*64);
table.put(puts);
table.close();
puts.clear();
Notice: you must have a cleanup function to write the remaining puts (see the sketch below).
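For clarity, here is a hedged sketch of the mapper described above, using the pre-2.0 HTable client API that appears in the snippet; the input key/value types, the row-key and Put construction, and the "table2" default are assumptions.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TwoTableMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final int MAX_PUTS = 10000;
    private final List<Put> puts = new ArrayList<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... parse the line and emit the Put for table1 through the normal
        // HFileOutputFormat path, e.g. context.write(rowKeyForTable1, putForTable1);

        // ... build the Put for table2 from the same line and buffer it:
        // puts.add(putForTable2);
        if (puts.size() > MAX_PUTS) {
            flushToTable2(context.getConfiguration());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Flush whatever is left over, as the answer notes.
        if (!puts.isEmpty()) {
            flushToTable2(context.getConfiguration());
        }
    }

    private void flushToTable2(Configuration conf) throws IOException {
        String tableName = conf.get("hbase.table.name.dic", "table2");
        HTable table = new HTable(conf, tableName);
        table.setAutoFlushTo(false);
        table.setWriteBufferSize(1024 * 1024 * 64);
        table.put(puts);
        table.close();
        puts.clear();
    }
}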
