Is CSV the only options to speed up my bulk relationships creation?
I read many articles in internet, and they all are telling about CSV. CSV will definitely give me a performance boost (could you suppose how big?), but I'm not sure I can store data in CSV format. Any other options? How much I will get from using Neo4J 3 BOLT protocol?
My program
I'm using Neo4j 2.1.7. I try to create about 50000 relationships at once. I execute queries in batch of size 10000, and it takes about 120-140 seconds to insert all 50000.
My query looks like:
MATCH (n),(m)
WHERE id(n)=5948 and id(m)=8114
CREATE (n)-[r:MY_REL {
ID:"4611686018427387904",
TYPE: "MY_REL_1"
PROPERTY_1:"some_data_1",
PROPERTY_2:"some_data_2",
.........................
PROPERTY_14:"some_data_14"
}]->(m)
RETURN id(n),id(m),r
As it is written in the documentation:
Cypher supports querying with parameters. This means developers don’t
have to resort to string building to create a query. In addition to
that, it also makes caching of execution plans much easier for Cypher.
So, you need pack your data as parameters and pass with cypher query:
UNWIND {rows} as row
MATCH (n),(m)
WHERE id(n)=row.nid and id(m)=row.mid
CREATE (n)-[r:MY_REL {
ID:row.relId,
TYPE:row.relType,
PROPERTY_1:row.someData_1,
PROPERTY_2:row.someData_2,
.........................
PROPERTY_14:row.someData_14
}]->(m)
RETURN id(n),id(m),r
Related
I am looking for a way how to process a large amount of data that are loaded from the database in a reasonable time.
The problem I am facing is that I have to read all the data from the database (currently around 30M of rows) and then process them in Java. The processing itself is not the problem but fetching the data from the database is. The fetching generally takes from 1-2 minutes. However, I need it to be much faster than that. I am loading the data from db straight to DTO using following query:
select id, id_post, id_comment, col_a, col_b from post_comment
Where id is primary key, id_post and id_comment are foreign keys to respective tables and col_a and col_b are columns of small int data types. The columns with foreign keys have indexes.
The tools I am using for the job currently are Java, Spring Boot, Hibernate and PostgreSQL.
So far the only options that came to my mind were
Ditch hibernate for this query and try to use plain jdbc connection hoping that it will be faster.
Completely rewrite the processing algorithm from Java to SQL procedure.
Did I miss something or these are my only options? I am open to any ideas.
Note that I only need to read the data, not change them in any way.
EDIT: The explain analyze of the used query
"Seq Scan on post_comment (cost=0.00..397818.16 rows=21809216 width=28) (actual time=0.044..6287.066 rows=21812469 loops=1), Planning Time: 0.124 ms, Execution Time: 8237.090 ms"
Do you need to process all rows at once, or can you process them one at a time?
If you can process them one at a time, you should try using a scrollable result set.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
}
This will still remember every object in the entity manager, and so will get progressively slower and slower. To avoid that issue, you might detach the object from the entity manager after you're done. This can only be done if the objects are not modified. If they are modified, the changes will NOT be persisted.
org.hibernate.Query query = ...;
query.setReadOnly(true);
ScrollableResults sr = query.scroll(ScrollMode.FORWARD_ONLY);
while(sr.next())
{
MyClass myObject = (MyClass)sr.get()[0];
... process row for myObject ...
entityManager.detach(myObject);
}
If I was in your shoes I would definitely bypass hibernate and go directly to JDBC for this query. Hibernate is not made for dealing with large result sets, and it represents an additional overhead for benefits that are not applicable to cases like this one.
When you use JDBC, do not forget to set autocommit to false and set some large fetch size (of the order of thousands) or else postgres will first fetch all 21 million rows into memory before starting to yield them to you. (See https://stackoverflow.com/a/10959288/773113)
Since you asked for ideas, I have seen this problem being resolved in below options depending on how it fits in your environment:
1) First try with JDBC and Java, simple code and you can do a test run on your database and data to see if this improvement is enough. You will here need to compromise on the other benefits of Hibernate.
2) In point 1, use Multi-threading with multiple connections pulling data to one queue and then you can use that queue to process further or print as you need. you may consider Kafka also.
3) If data is going to further keep on increasing you can consider Spark as the latest technology which can make it all in memory and will be much more faster.
These are some of the options, please like if these ideas help you anywhere.
Why do you 30M keep in memory ??
it's better to rewrite it to pure sql and use pagination based on id
you will be sent 5 as the id of the last comment and you will issue
select id, id_post, id_comment, col_a, col_b from post_comment where id > 5 limit 20
if you need to update the entire table then you need to put the task in the cron but also there to process it in parts
the memory of the road and downloading 30M is very expensive - you need to process parts 0-20 20-n n+20
I have SpringBoot project which will pull a large amount of data from one database, do some kind of transformation on it, and then insert it into a table in a PostgreSQL database. This process will continue for a few billion records so performance is key.
I've been researching trying to find the best way to do this, such as using an ORM or a JDBCTemplate for example. One thing I keep seeing constantly regarding bulk inserts into PostgreSQL is the COPY command. https://www.postgresql.org/docs/current/populate.html
I'm confused because using COPY requires the data to be written into a file, and while I've seen people saying to use it I've yet to come across a case where someone mentions how to get the data into the file. Isn't writing to a file slow? If writing to a file is slow, then the performance gains that COPY does bring, does this make it be like there was no gain at all?
These kind of data migration and conversion is better to handle in Stored procedures. Assuming that the source data is already loaded to postgres ( if not use postgres db utility to load the raw data to some flat table). Then write series of stored procs to transform the data and insert into the destination table.
I have done some complex data migration and i used this approach. If you have to do lot of complex data conversion, write some python script ( which is usually faster than spring boot/data setup), insert the parially converted data, then do some stored procs to do the final conversion.
It is better to keep the business logic to convert/massage data close to the datasource ( in stored procs) instead of pulling data to app server and reinserting them.
Hope it helps.
I access a neo4J database via Java and I want to create 1,3 million nodes. Therefore I create 1,3 million "CREATE" statements. As I figured out, the query is way too long. I only can execute ~100 CREATE statements per query - otherwise the query fails:
Client client;
WebResource cypher;
String request;
ClientResponse cypherResponse;
String query = "";
int nrQueries = 0;
for(HashMap<String, String> entity : entities){
nrQueries++;
query += " CREATE [...] ";
if(nrQueries%100==0){
client = Client.create();
cypher = client.resource(SERVER_ROOT_URI + "cypher");
request = "{\"query\":\""+query+"\"}";
cypherResponse = cypher.accept(MediaType.APPLICATION_JSON).post(ClientResponse.class, request);
cypherResponse.close();
query = "";
}
}
Well, as I want to execute 1,3 million queries and I only can combine 100 into one request, I still have 13,000 requests, which take a long time.
Is there a way to do it faster?
You have two other options you should be considering: the import tool and the LOAD CSV option.
The right question here is "how to put data into neo4j fast" rather than "how to execute a lot of CREATE statements quickly". Both of these options will be way faster than doing individual CREATE statements, so I wouldn't mess with individual CREATEs anymore.
Michael Hunger wrote a great blog post describing multiple facets of importing data into neo4j, you should check out if you want to understand more why those are good options, not just that they are good options.
The LOAD CSV option is going to do exactly what the name suggests. You'll basically use the cypher query language to load data directly from files, and it goes substantially faster because you commit the records in "batches" (the documentation describes this). So you're still using transactions to get your data in, you're just doing it faster, in batches, and while being able to create complex relationships along the way.
The import tool is similar, except it's for very high performance creates of large volumes of data. The magic here (and why it's so fast) is that it skips the transaction layer. This is both a good thing and a bad thing, depending on your perspective (Michael Hunger's blog post I believe explains the tradeoffs).
Without knowing your data, it's hard to make a specific recommendation - but as a generality, I'd say start with LOAD CSV as a default, and move to the import tool if and only if the volume of data is really big, or your insert performance requirements are really intense. This reflects a slight bias on my part that transactions are a good thing, and that staying at the cypher layer (rather than using a separate command line tool) is also a good thing, but YMMV.
What is the best way to implement the following scenario?
I need to call/query a data base table containing millions of records from a java application. Then for each records in the table, my application should call a third party API and get a status field as response. Then my application should again update each row in the table with the information (status) from the API.
Note - I am trying to figure out a method to do this in the best possible way. I understand that querying all the records together is not the best way forward.
Do not try to eat the elephant in one bite. Chunk it. Heard of pagination? Use it. See here: MySQL pagination without double-querying?
you can use oracle feature such as SQL loader, Data pumping Called via JDBC or script..
Databases are not designed to update millions of records via Java API repeatedly. This can take many minutes. If this is not enough, you may need to use a dataset embedded in Java (either caching or replacing your database)
I am building an application at work and need some advice. I have a somewhat unique problem in which I need to gather data housed in a MS SQL Server, and transplant it to a mySQL Server every 15 mins.
I have done this previously in C# with a DataGrid, but now am trying to build a Java version that I can run on an Ubuntu Server, but I can not find a similar model for Java.
Just to give a little background
When I pull the data from the MS SQL Server, it always has 9 columns, but could have anywhere from 0 - 1000 rows.
Before inserting into the mySQL Server blindly, I do manipulate some of the data.
I convert a time column to CST based on a STATE column
I strip some characters to prevent SQL injection
I tried using the ResultSet, but I am having issues with the "forward only result sets" rules.
What would be the best data structure to hold that information, manipulate it, and then parse it to insert later into mySQL?
This sounds like a job for PreparedStatements!
Defined here: http://download.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html
Quick example: http://download.oracle.com/javase/tutorial/jdbc/basics/prepared.html
PreparedStatements allows you to batch up sets of data before pushing them into the target database. They also allow you use the PreparedStatement.setString method which handles escaping characters for you.
For the time conversion thing, I would retrieve the STATE value from the row and then retrieve the time value. Before calling PreparedStatement.setDate, convert the time to CST if necessary.
I dont think that you would need all the overhead that an ORM tool requires.
You could consider using an ORM technology like Hibernate. This might seem a little heavyweight at first, but it means you can maintain the various table mappings for various databases with ease as well as having the power of Java's RegEx lib for any manipulation requirements.
So you'd have a Java class that represents the source table (with its Hibernate mapping) and another Java class that represents the target table and lastly a conversion utility class that does any manipulation of that data. Hibernate takes care of the CRUD SQL for you, so no need to worry about Database specific SQL (as long as you get the mapping correct).
It also lessens the SQL injection problem