Such a simple thing, but... I am following https://neo4j.com/docs/developer-manual/current/cypher/clauses/create/#create-create-a-relationship-between-two-nodes, but on Neo4j the following Cypher query takes more than 120 seconds:
MATCH (from:PubmedDocumentNode), (to:PubmedAuthorNode)
WHERE from.PMID = 26408320
AND to.Author_Name = "Bando|Mika|M|"
CREATE (from)-[:AUTHOR {labels:[],label:[3.0],Id:0}]->(to)
There are also indexes:
Indexes
ON :PubmedAuthorNode(Author_Name) ONLINE
ON :PubmedDocumentNode(PMID) ONLINE
No constraints
So... why?
EDIT: The same query without the CREATE part runs in no time.
While working with batch insertion in jOOQ (v3.14.4) I noticed some inconsistency when looking into PostgreSQL (v12.6) logs.
When doing context.batch(<query>).bind(<1st record>).bind(<2nd record>)...bind(<nth record>).execute() the logs show that the records are actually inserted one by one instead of all in one go.
While doing context.insert(<fields>).values(<1st record>).values(<2nd record>)...values(<nth record>) actually inserts everything in one go judging by the postgres logs.
Is it a bug in jOOQ itself, or was I using the batch(...) functionality incorrectly?
Here are two code snippets that are supposed to do the same thing, but in reality the first one inserts records one by one while the second one actually does the batch insertion.
public void batchInsertEdges(List<EdgesRecord> edges) {
    Query batchQuery = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES,
            Edges.EDGES.METADATA)
        .values((Long) null, (Long) null, (CallSiteRecord[]) null, (JSONB) null)
        .onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
        .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
        .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class));
    var batchBind = context.batch(batchQuery);
    for (var edge : edges) {
        batchBind = batchBind.bind(edge.getSourceId(), edge.getTargetId(),
                edge.getCallSites(), edge.getMetadata());
    }
    batchBind.execute();
}
public void batchInsertEdges(List<EdgesRecord> edges) {
    var insert = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES, Edges.EDGES.METADATA);
    for (var edge : edges) {
        insert = insert.values(edge.getSourceId(), edge.getTargetId(), edge.getCallSites(), edge.getMetadata());
    }
    insert.onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
        .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
        .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class))
        .execute();
}
I would appreciate some help figuring out why the first code snippet does not work as intended while the second one does. Thank you!
There's a difference between "batch processing" (as in JDBC batch) and "bulk processing" (as in what many RDBMS call "bulk updates").
This page of the manual about data import explains the difference.
Bulk size: The number of rows that are sent to the server in one SQL statement.
Batch size: The number of statements that are sent to the server in one JDBC statement batch.
These are fundamentally different things. Both help improve performance. Bulk data processing does so by helping the RDBMS optimise resource allocation algorithms as it knows it is about to insert 10 records. Batch data processing does so by reducing the number of round trips between client and server. Whether either approach has a big impact on any given RDBMS is obviously vendor specific.
In other words, both of your approaches work as intended.
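To make the distinction concrete, here is a minimal plain-JDBC sketch of both shapes (the edges table and the Edge record are illustrative stand-ins, not the question's actual schema):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Illustrative stand-ins, not part of the question's schema
record Edge(long sourceId, long targetId) {}

class BatchVsBulk {

    // Batch: ONE statement text, N executions, all sent in one JDBC round trip.
    // The server still runs N separate single-row INSERTs.
    static void batchInsert(Connection conn, List<Edge> edges) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO edges (source_id, target_id) VALUES (?, ?)")) {
            for (Edge e : edges) {
                ps.setLong(1, e.sourceId());
                ps.setLong(2, e.targetId());
                ps.addBatch();      // queue this set of bind values
            }
            ps.executeBatch();      // flush the whole batch in one round trip
        }
    }

    // Bulk: ONE statement that carries all N rows in its VALUES clause.
    static void bulkInsert(Connection conn, List<Edge> edges) throws SQLException {
        if (edges.isEmpty()) return;
        String sql = "INSERT INTO edges (source_id, target_id) VALUES "
                + "(?, ?), ".repeat(edges.size() - 1) + "(?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int i = 1;
            for (Edge e : edges) {
                ps.setLong(i++, e.sourceId());
                ps.setLong(i++, e.targetId());
            }
            ps.executeUpdate();     // a single multi-row INSERT
        }
    }
}
In jOOQ terms, context.batch(query).bind(...) produces the first shape and insert.values(...).values(...) the second, which matches what you saw in the PostgreSQL logs.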
I have two Datasets (tables) with 35+ million rows each.
I am trying to join (or group by) these Datasets on some id. (In general it will be one-to-one.)
But this operation takes a lot of time: 25+ hours.
Filters alone work fine: ~20 minutes.
Env: emr-5.3.1
Hadoop distribution:Amazon
Applications:Ganglia 3.7.2, Spark 2.1.0, Zeppelin 0.6.2
Instance type: m3.xlarge
Code (groupBy):
Dataset<Row> dataset = ...
...
.groupBy("id")
.agg(functions.min("date"))
.withColumnRenamed("min(date)", "minDate")
Code (join):
...
.join(dataset2, dataset.col("id").equalTo(dataset2.col("id")))
Also I found this message in EMR logs:
HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate.
There might be a possibility of the data getting skewed. We faced this. Check your joining column; this happens mostly if your joining column has NULLs.
Check the stored data pattern with:
select joining_col, count(*)
from <tablename>
group by joining_col
order by count(*) desc
This will give you an idea of whether the data in your joining column is evenly distributed.
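In the Spark Java API from the question, the same check might look like this (a sketch, assuming the join key is the id column used above):
// Count rows per join key and look at the heaviest keys first;
// one huge bucket (often null) is the classic sign of skew
dataset.groupBy("id")
       .count()
       .orderBy(functions.col("count").desc())
       .show(20);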
I'm searching for a specific list of keywords on Twitter.
The following code snippet works:
FilterQuery fq = new FilterQuery();
fq.track("keyword1", "keyword2", "keyword3", "keyword4");
twitterStream.filter(fq);
But I need the since and until functions to search within a specific time interval. To do this I have to define my query like:
Query query = new Query("(keyword1) OR (keyword2) OR (keyword3) OR (keyword4)");
query.setSince("2011-01-01");
query.setUntil("2016-02-10");
How can I change my FilterQuery to a Query instance to get access to the setSince and setUntil methods?
Moreover, I have to do the same for the twitterStream.filter(fq) statement,
because it does not accept a Query as an input parameter (i.e. twitterStream.filter(query) is not possible).
Thank you very much for your interest.
After doing some research on this issue, I found that with FilterQuery I can only use keywords for listening to new tweets. And by using Query with a date interval I can only get tweets from the last 7-10 days. For anything older, Twitter offers access to Gnip, which sells historical tweets, e.g. 1 million tweets over a 40-day period starting at $1,250.
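For completeness, a date-bounded search goes through the Twitter4J REST search API rather than the streaming API. A minimal sketch (twitter is an assumed twitter4j.Twitter instance, and nextQuery() pages through the results):
Query query = new Query("(keyword1) OR (keyword2) OR (keyword3) OR (keyword4)");
query.setSince("2011-01-01");    // Twitter4J expects YYYY-MM-DD
query.setUntil("2016-02-10");
QueryResult result;
do {
    result = twitter.search(query);
    for (Status status : result.getTweets()) {
        System.out.println("@" + status.getUser().getScreenName() + ": " + status.getText());
    }
} while ((query = result.nextQuery()) != null);   // follow pagination until exhausted
Note that the 7-10 day window mentioned above still applies: for older ranges the public search API simply returns nothing.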
Using Neo4j 2.3.0 Community Edition with Oracle JDK 8 and Windows 7
I am new to Neo4j and just trying out how it works with Java. In the Neo4j Browser I created 3 nodes with the following statement:
CREATE (c:Customer {name:'King'})-[:CREATES]->(:Order {status:'created'}),
(c)-[:CREATES]->(:Order {status:'created'})
Executed from the Neo4j Browser, the following query returns in 200 ms:
MATCH (c:Customer)-[:CREATES]->(o:Order)
WHERE c.name = 'King'
RETURN o.status
Executing this in Eclipse takes about 2500 ms, sometimes up to 3000 ms:
String query = "MATCH (c:Customer)-[:CREATES]->(o:Order) "
+ "WHERE c.name = 'King' "
+ "RETURN o.status";
Result result = db.execute(query);
This is incredibly slow! What am I doing wrong?
In addition, I ran the following snippet in Eclipse and it only took about 50 ms:
Node king = db.findNode(NodeType.Customer, "name", "King");
Iterable<Relationship> kingRels = king.getRelationships(RelType.CREATES);
for (Relationship rel : kingRels) {
    System.out.println(rel.getEndNode().getProperty("status"));
}
So there are actually two things I am surprised by:
Running a Cypher query in the Neo4j Browser seems to be way slower than doing a comparable thing with the Neo4j Core Java API in Eclipse.
Running a Cypher query "embedded" in Java code is incredibly slow compared to the Neo4j Browser solution as well as compared to the plain Java solution.
I am pretty sure that this cannot be true. So what am I doing wrong?
How do you measure it? If you measure the full runtime, then your time includes JVM startup, database startup, class-loading, and loading of store files from disk.
Remember that in the browser all of that is already running and warmed up.
If you really want to measure your query, run it a number of times to warm up, and then measure only the query execution and result loading.
Also consider using indexes or constraints where indicated, and parameters, e.g. for your customer name.
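A minimal sketch of that measurement approach against the embedded API from the question (db is the GraphDatabaseService; the {name} parameter syntax matches Neo4j 2.3, and 10 warm-up runs is an arbitrary choice):
Map<String, Object> params = Collections.singletonMap("name", "King");
String query = "MATCH (c:Customer)-[:CREATES]->(o:Order) "
             + "WHERE c.name = {name} "
             + "RETURN o.status";

// Warm-up: run the query a few times so caches and compiled plans are populated
for (int i = 0; i < 10; i++) {
    db.execute(query, params).close();
}

// Measure only query execution and result consumption
long start = System.nanoTime();
try (Result result = db.execute(query, params)) {
    while (result.hasNext()) {
        result.next().get("o.status");
    }
}
System.out.println("Query took " + (System.nanoTime() - start) / 1_000_000 + " ms");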
I am performing a call to a function which is part of a DB package. This package is deployed in two locations: one local and one remote (across the Atlantic).
I am retrieving the data via the Spring JDBC template.
There is one function which returns approximately 1000 rows (not all that much) and this is taking about 1.5 seconds when getting the data locally but it's taking in the region of 12 seconds when getting the data remotely.
In all sample code, names have been changed and code has been simplified a little.
Please see an example of the current Java code:
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(getDataSource())
        .withSchemaName(MY_SCHEMA_NAME)
        .withCatalogName("REFCURSOR_PKG")
        .withFunctionName("GET_DATA")
        .returningResultSet("RESULT_SET", new DataEntryMapper());
SqlParameterSource params = new MapSqlParameterSource()
        .addValue("the_name", name)
        .addValue("the_rev", rev);
Map resultSet = simpleJdbcCall.execute(params);
ArrayList list = (ArrayList) resultSet.get("RESULT_SET");
The RowMapper class looks something like this:
class DataEntryMapper implements RowMapper<DataEntry> {
    public DataEntry mapRow(ResultSet resultSet, int rowNum) throws SQLException {
        return new DataEntry(resultSet.getString("name"),
                Integer.parseInt(resultSet.getString("rev")));
    }
}
SQL package spec snippet:
TYPE REF_CURSOR IS REF CURSOR;
SQL function:
FUNCTION GET_DATA(the_name VARCHAR2, the_rev VARCHAR2) RETURN REF_CURSOR AS
  RESULT_SET REF_CURSOR;
BEGIN
  OPEN RESULT_SET FOR
    select *
    from table_name tn
    where tn.name = the_name
    and tn.rev = the_rev;
  RETURN RESULT_SET;
EXCEPTION WHEN OTHERS THEN
  RAISE;
END GET_DATA;
I have tried using regular boilerplate JDBC as well (create connection, prepare statement, execute statement, retrieve data from the ResultSet, etc.), and I found that the vast majority of the time was spent looping over the ResultSet and extracting the data out of it into some POJOs. In the case of the Spring code above, most of the time was spent during the execute() method, but this is probably because it creates the objects using the RowMapper at that time.
So, the common ground between the two approaches is performing calls such as:
rs.getString("name")
and I'm guessing that this is where the problem lies, but I could be wrong.
As I said, locally the delay is fine but remotely it's taking way too long. Is this because it's going to the DB on every rs.get... ? Is there a better way to do this?
Thanks in advance.
rs.getString("name")
ResultSet.get*(String columnName) can be replaced with ResultSet.get*(int columnNumber), which is slightly faster, but I doubt that is the main problem here.
Is this because it's going to the DB on every rs.get... ?
While it really depends on the driver, I suspect it won't. For a cached result set it might go to the server when you scroll through the cursor, but it would still fetch a bunch of rows in every round trip.
Two more suggestions I have are:
Use a network sniffing utility to see the data being transferred.
Check your driver for any option to prefetch rows and the like.
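For instance, the standard JDBC fetch size controls how many rows come back per network round trip, and Oracle's default of 10 rows hurts badly on a high-latency link. A sketch of raising it, in plain JDBC and through Spring (conn is an assumed java.sql.Connection, and 500 is an arbitrary illustrative value):
// Plain JDBC: ask the driver to fetch 500 rows per network round trip
PreparedStatement stmt = conn.prepareStatement(
        "select * from table_name where name = ? and rev = ?");
stmt.setFetchSize(500);

// Spring: set the same hint on the JdbcTemplate backing the SimpleJdbcCall
JdbcTemplate template = new JdbcTemplate(getDataSource());
template.setFetchSize(500);
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(template);

// Note: for REF CURSOR results some Oracle driver versions ignore the statement
// fetch size; the connection property "defaultRowPrefetch" may be needed instead.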
Add this line:
.withoutProcedureColumnMetaDataAccess()
to the SimpleJdbcCall declaration:
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(getDataSource())
        .withSchemaName(MY_SCHEMA_NAME)
        .withCatalogName("REFCURSOR_PKG")
        .withFunctionName("GET_DATA")
        .withoutProcedureColumnMetaDataAccess() // avoid fetching parameter metadata from the database
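One caveat, as an assumption on my part from how this API behaves: with the metadata lookup turned off, Spring can no longer discover the function's parameters itself, so they need to be declared explicitly, roughly like this (parameter names follow the package code above; OracleTypes.CURSOR is the Oracle driver's ref-cursor type):
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(getDataSource())
        .withSchemaName(MY_SCHEMA_NAME)
        .withCatalogName("REFCURSOR_PKG")
        .withFunctionName("GET_DATA")
        .withoutProcedureColumnMetaDataAccess()
        .declareParameters(
                // declare the out cursor first, then the in parameters
                new SqlOutParameter("RESULT_SET", OracleTypes.CURSOR, new DataEntryMapper()),
                new SqlParameter("the_name", Types.VARCHAR),
                new SqlParameter("the_rev", Types.VARCHAR));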