I access a Neo4j database via Java and I want to create 1.3 million nodes. To do that I generate 1.3 million CREATE statements. As I found out, the resulting query is way too long; I can only execute ~100 CREATE statements per query, otherwise the query fails:
// Jersey 1.x client posting to the legacy cypher endpoint
Client client;
WebResource cypher;
String request;
ClientResponse cypherResponse;
String query = "";
int nrQueries = 0;

for (HashMap<String, String> entity : entities) {
    nrQueries++;
    query += " CREATE [...] ";

    // send a request every 100 CREATE statements
    if (nrQueries % 100 == 0) {
        client = Client.create();
        cypher = client.resource(SERVER_ROOT_URI + "cypher");
        request = "{\"query\":\"" + query + "\"}";
        cypherResponse = cypher.accept(MediaType.APPLICATION_JSON).post(ClientResponse.class, request);
        cypherResponse.close();
        query = "";
    }
}
Well, since I want to create 1.3 million nodes and can only combine 100 CREATE statements into one request, I still end up with 13,000 requests, which take a long time.
Is there a way to do it faster?
You have two other options you should be considering: the import tool and the LOAD CSV option.
The right question here is "how to put data into neo4j fast" rather than "how to execute a lot of CREATE statements quickly". Both of these options will be way faster than doing individual CREATE statements, so I wouldn't mess with individual CREATEs anymore.
Michael Hunger wrote a great blog post describing multiple facets of importing data into Neo4j, which you should check out if you want to understand why these are good options, not just that they are.
The LOAD CSV option is going to do exactly what the name suggests. You'll basically use the cypher query language to load data directly from files, and it goes substantially faster because you commit the records in "batches" (the documentation describes this). So you're still using transactions to get your data in, you're just doing it faster, in batches, and while being able to create complex relationships along the way.
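As a minimal sketch of what such a statement could look like (the file path, label, and property names here are made up, and the CSV file must be readable by the Neo4j server), expressed as the Java string you would submit:
String loadCsv =
    "USING PERIODIC COMMIT 1000 "
  + "LOAD CSV WITH HEADERS FROM 'file:///data/entities.csv' AS row "
  + "CREATE (:Entity { name: row.name, value: row.value })";
The USING PERIODIC COMMIT hint is what gives you the batched commits; if your server setup doesn't accept it over the REST endpoint, you can run the same statement from the Neo4j shell or browser.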
The import tool is similar, except it's for very high performance creates of large volumes of data. The magic here (and why it's so fast) is that it skips the transaction layer. This is both a good thing and a bad thing, depending on your perspective (Michael Hunger's blog post I believe explains the tradeoffs).
Without knowing your data, it's hard to make a specific recommendation - but as a generality, I'd say start with LOAD CSV as a default, and move to the import tool if and only if the volume of data is really big, or your insert performance requirements are really intense. This reflects a slight bias on my part that transactions are a good thing, and that staying at the cypher layer (rather than using a separate command line tool) is also a good thing, but YMMV.
I have a requirement in my application: to identify expensive elasticsearch queries in the application.
I only know there's the Query DSL for Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).
I need to identify each Elasticsearch query in the reverse proxy for Elasticsearch (the reverse proxy is developed in Java, just to throttle requests to ES and collect some user statistics). If a query is expensive, only certain users should be allowed to run it, at a specific rate limit.
What is difficult for me is how to identify the expensive queries. I know Elasticsearch has a switch that can disable/enable expensive queries by setting a parameter. I read the Elasticsearch source code, but I cannot find how Elasticsearch identifies the different kinds of expensive queries.
If you know:
Is there any Elasticsearch API (from the Elasticsearch client SDK) that can identify expensive queries? Then I could invoke that API directly in my application.
If not, what is an effective way to identify expensive queries by analyzing the query body? With some AST (Abstract Syntax Tree) resolver, or by searching for specific keywords in the query body?
I'd really appreciate some help on this!
There isn't a good 'native' way to do it in Elasticsearch, but you do have some options that might help.
Setting timeout or terminate_after
This option looks at your requirement from a different perspective.
From Elasticsearch docs: search-your-data
You could keep a record of how long each query performed by the user took, by looking at the took field returned in the result.
{
  "took": 5,
  "timed_out": false,
  ...
}
This way you have a record of how many queries a user performed in a time window that were 'expensive' (took more than X).
For that user, you can start adding the timeout or terminate_after params that will try to limit the query execution. This won't prevent the user from submitting an expensive query, but it will try to cancel long-running queries after 'timeout' has expired, returning a partial or empty result back to the user.
GET /my-index-000001/_search
{
  "timeout": "2s",
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}
This will limit the effect of that user's expensive queries on the cluster.
A side note: this Stack Overflow answer states that certain queries can still bypass the timeout/terminate_after flags, such as script queries.
terminate_after limits the number of documents searched on each of the shards. It can be used as an alternative, or as another backstop if timeout is too high or gets ignored for some reason.
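Since the reverse proxy is written in Java, one practical option is to rewrite the body of search requests coming from rate-limited users before forwarding them. A minimal sketch, assuming Jackson is on the classpath and the proxy sees the raw search body (the limit values here are arbitrary):
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Force a timeout and terminate_after onto queries from rate-limited users
// before forwarding the request to Elasticsearch.
public class QueryLimiter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String applyLimits(String requestBody) throws Exception {
        ObjectNode root = (ObjectNode) MAPPER.readTree(requestBody);
        root.put("timeout", "2s");          // cancel long-running queries after 2 seconds
        root.put("terminate_after", 10000); // stop after 10,000 documents per shard
        return MAPPER.writeValueAsString(root);
    }
}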
Long term analytics
This option probably requires a lot more work, but you could save statistics on the queries performed and the amount of time they took.
You would probably use the JSON representation of the query DSL in this case, save it in an Elasticsearch index along with the time that query took, and keep aggregates of the average time similar queries take.
You could possibly use the rollup feature to pre-aggregate all the averages and check a query against this index to see whether it is a "possibly expensive query".
The problem here is which part of the query to save and which queries are "similar" enough to be considered for this aggregation.
Searching for keywords in the query
You stated this as an option as well. The DSL query in the end translates to a REST call with a JSON body, so using a JsonNode you could look for specific sub-elements that you 'think' will make the query expensive, and even limit things like the number of buckets, etc.
Using ObjectMapper you could write the query out as a string and just look for keywords; this would be the easiest solution.
There are specific features that we know require a lot of resources from Elasticsearch and can potentially take a long time to finish, so these could be limited through this approach as a "first defense".
Examples:
Highlighting
Scripts
search_analyzers
etc...
So although this option is the most naive, it could be a quick win while you work on a long-term solution that requires analytics; a rough sketch of such a keyword check follows.
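Building on the JsonNode/ObjectMapper idea, here is a rough sketch of such a keyword check (the set of 'expensive' keys is an assumption you would tune for your own workload):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Walk the parsed query body and flag requests that use features known to be costly.
public class ExpensiveQueryDetector {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final Set<String> EXPENSIVE_KEYS =
            Set.of("script", "script_score", "wildcard", "regexp", "fuzzy", "highlight");

    public static boolean isExpensive(String requestBody) throws Exception {
        return containsExpensiveKey(MAPPER.readTree(requestBody));
    }

    private static boolean containsExpensiveKey(JsonNode node) {
        if (node.isObject()) {
            Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
            while (fields.hasNext()) {
                Map.Entry<String, JsonNode> field = fields.next();
                if (EXPENSIVE_KEYS.contains(field.getKey())
                        || containsExpensiveKey(field.getValue())) {
                    return true;
                }
            }
        } else if (node.isArray()) {
            for (JsonNode child : node) {
                if (containsExpensiveKey(child)) {
                    return true;
                }
            }
        }
        return false;
    }
}
In the proxy you would call isExpensive(body) before forwarding the request and apply the rate limit only when it returns true.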
In addition to the answer by Dima, which has some good pointers, here is a list of the usual suspects for expensive/slow queries: https://blog.bigdataboutique.com/2022/10/expensive-queries-in-elasticsearch-and-opensearch-a83194
In general we'd split the discussion into three:
Is it the query itself that is slow? See the list above for the usual suspects. Some of them, by the way, can be disabled by setting search.allow_expensive_queries to false in the cluster settings.
Or is it an aggregations request?
Or maybe it's the cluster that is overwhelmed, which makes queries slow, and not the actual queries.
The only way to figure this out is to look at cluster metrics over time, and correlate with the slow queries. You can also collect all your queries and analyze them for suspected culprits, and correlate with their latency. Usually that highlights a few things that can be improved (e.g. better use of caches, etc).
Writing APIs, I used to validate all input parameters on the Java (or PHP, whatever) side, but now we have moved our DBs to PostgreSQL, which gives us great JSON features, like building JSON from table rows and a lot more (I haven't found anything we can't do with the PostgreSQL JSON functions so far). So I thought: what if I move all parameter validation to Postgres (also considering that I can return JSON straight from the database)?
In Java I made it like this:
// params comes from @RequestBody cast to JSONObject
if (!params.has("signature"))
    return errGenerator.genErrorResponse("e01"); // this also needs database access to get the error description
In Postgres I will do that like this (tested, works as expected):
CREATE OR REPLACE FUNCTION test.testFunc(_object JSON)
  RETURNS TABLE(result JSON) AS
$$
BEGIN
    IF (_object -> 'signature') IS NULL THEN -- the required param is missing
        RETURN QUERY (SELECT row_to_json(errors)
                      FROM errors
                      WHERE errcode = 'e01');
    ELSE -- everything is okay
        RETURN QUERY (SELECT row_to_json(other_table)
                      FROM other_table);
    END IF;
END;
$$
LANGUAGE plpgsql;
And so on...
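For completeness, the Java side of this approach then shrinks to roughly the following (a sketch; connection handling and variable names are simplified):
// Pass the raw request JSON to the function and relay whatever JSON comes back,
// whether that is the error row or the real result rows.
public String callTestFunc(Connection connection, String requestJson) throws SQLException {
    StringBuilder response = new StringBuilder();
    try (PreparedStatement ps = connection.prepareStatement(
            "SELECT result FROM test.testFunc(?::json)")) {
        ps.setString(1, requestJson);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                response.append(rs.getString(1));
            }
        }
    }
    return response.toString();
}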
The one problem I see so far is that if we move to MS SQL or Sybase we will need to rewrite all the procedures. But with NoSQL becoming more and more common, that seems unlikely, and if we moved to a NoSQL DB we would have to recode all the APIs anyway.
You have to consider basically two items:
The closer you put your checks to the data storage, the safer it is. If you have the database perform all the checks, they'll be performed no matter how you interface with it, whether through your application or through some third-party tool you might be using (even if only for maintenance). In that sense, checking on the database side improves security (as in "data consistency"). In that respect, it makes complete sense to have the database perform the checks.
The closer you put your checks to the user, the faster you can respond to their input. If you have a web application that needs fast response times, you probably want to have the checks on the client side.
And take one more important consideration into account:
You might also have to consider your team knowledge: what the developers are more comfortable with. If you know your Java library much better than you know your database functions... it might make sense to perform all the checks Java-side.
You can have a third way: do both checks in series, first application (client) side, then database (server) side. Unless you have some sophisticated automation, this involves extra work to make sure that all the checks performed are consistent. That is, there shouldn't be any data blocked on the client side that would be allowed to pass when checked by the database. At least the most basic checks are performed in the first stages, and all of them (even the redundant ones) are performed in the database.
If you can afford the time to move the data through several application layers, I'd go with safety. However, the choice to be made is case-specific.
So I found a few takeaways... The main one is that I can keep my error messages cached in my application, which lets me avoid a database request when the input parameters don't pass validation, and only go to the database to get the result data.
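A rough sketch of that caching idea (the class and field names are made up; the map would be filled once from the errors table at startup):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.json.JSONObject;

// Keep error descriptions cached in the application so a failed parameter check
// can be answered without a database round-trip.
public class ErrorCatalog {

    // e.g. "e01" -> {"errcode":"e01","description":"signature is missing"}
    private final Map<String, String> errorJsonByCode = new ConcurrentHashMap<>();

    public ErrorCatalog(Map<String, String> preloadedErrors) {
        errorJsonByCode.putAll(preloadedErrors); // loaded once from the errors table
    }

    public String validate(JSONObject params) {
        if (!params.has("signature")) {
            return errorJsonByCode.get("e01"); // no database call needed
        }
        return null; // null means "passed basic checks, go to the database"
    }
}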
Is CSV the only option to speed up my bulk relationship creation?
I have read many articles on the internet, and they all talk about CSV. CSV will definitely give me a performance boost (can you estimate how big?), but I'm not sure I can store my data in CSV format. Any other options? How much would I gain from using the Neo4j 3 Bolt protocol?
My program
I'm using Neo4j 2.1.7. I am trying to create about 50,000 relationships at once. I execute queries in batches of 10,000, and it takes about 120-140 seconds to insert all 50,000.
My query looks like:
MATCH (n),(m)
WHERE id(n)=5948 and id(m)=8114
CREATE (n)-[r:MY_REL {
ID:"4611686018427387904",
TYPE: "MY_REL_1"
PROPERTY_1:"some_data_1",
PROPERTY_2:"some_data_2",
.........................
PROPERTY_14:"some_data_14"
}]->(m)
RETURN id(n),id(m),r
As it is written in the documentation:
Cypher supports querying with parameters. This means developers don't have to resort to string building to create a query. In addition to that, it also makes caching of execution plans much easier for Cypher.
So you need to pack your data as parameters and pass them along with the Cypher query:
UNWIND {rows} as row
MATCH (n),(m)
WHERE id(n)=row.nid and id(m)=row.mid
CREATE (n)-[r:MY_REL {
ID:row.relId,
TYPE:row.relType,
PROPERTY_1:row.someData_1,
PROPERTY_2:row.someData_2,
.........................
PROPERTY_14:row.someData_14
}]->(m)
RETURN id(n),id(m),r
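A minimal sketch of sending one such batch over the REST API, using a Jersey client and Jackson (the endpoint path assumes the server root URI points at /db/data/, and the row maps carry the fields referenced in the query above):
import com.fasterxml.jackson.databind.ObjectMapper;
import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

import javax.ws.rs.core.MediaType;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Send one batch of rows as the {rows} parameter of the query above,
// using the transactional endpoint so the whole batch is one transaction.
public class RelationshipBatchInserter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void insertBatch(String serverRootUri, String cypher,
                                   List<Map<String, Object>> rows) throws Exception {
        Map<String, Object> statement = new HashMap<>();
        statement.put("statement", cypher);
        statement.put("parameters", Collections.singletonMap("rows", rows));

        String payload = MAPPER.writeValueAsString(
                Collections.singletonMap("statements", Collections.singletonList(statement)));

        Client client = Client.create();
        WebResource endpoint = client.resource(serverRootUri + "transaction/commit");
        ClientResponse response = endpoint
                .type(MediaType.APPLICATION_JSON)
                .accept(MediaType.APPLICATION_JSON)
                .post(ClientResponse.class, payload);
        response.close();
    }
}
Batching a few thousand rows per request keeps the number of HTTP round trips small.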
I have a CSV which is... 34 million lines long. Yes, no joking.
This is a CSV file produced by a parser tracer which is then imported into the corresponding debugging program.
And the problem is in the latter.
Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
             .peek(ignored -> status.incrementProcessedNodes())
             .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}
csvToNode is simply a mapper which will turn a String (a line of a CSV) into a NodesRecord for insertion.
Now, the line:
.peek(ignored -> status.incrementProcessedNodes())
well... The method name tells pretty much everything; it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
But jOOQ now has this (taken from its documentation), which can load directly from a CSV:
create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();
(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query by a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.
So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
There are a number of ways to optimise this.
Custom load partitioning
One way to optimise query execution at your side is to collect sets of values into:
Bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
Batch statements (as in JDBC batch)
Commit segments (commit after N statements)
... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can heavily increase load speed.
This is the only way you can currently "listen" to loading progress.
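For illustration, a minimal sketch of that kind of manual partitioning, reusing the fields from the snippet in the question (the batch size is arbitrary, incrementProcessedNodes(int) is an assumed variant of the counter that takes a delta, and DSLContext.batchInsert is assumed to be available in your jOOQ version):
private void insertNodesInBatches(final DSLContext jooq)
    throws IOException
{
    final int batchSize = 1000;
    final List<NodesRecord> buffer = new ArrayList<>(batchSize);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode).forEach(record -> {
            buffer.add(record);
            if (buffer.size() == batchSize) {
                jooq.batchInsert(buffer).execute();            // one JDBC batch per 1000 rows
                status.incrementProcessedNodes(buffer.size()); // progress stays observable
                buffer.clear();
            }
        });

        if (!buffer.isEmpty()) {                               // flush the trailing partial batch
            jooq.batchInsert(buffer).execute();
            status.incrementProcessedNodes(buffer.size());
        }
    }
}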
Load partitioning using jOOQ 3.6+
(this hasn't been released yet, but it will be, soon)
jOOQ 3.6 natively implements the above three partitioning measures.
Using vendor-specific CSV loading mechanisms
jOOQ will always need to pass through JDBC and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. the ones you've mentioned:
H2: http://www.h2database.com/html/tutorial.html#csv
PostgreSQL: http://www.postgresql.org/docs/current/static/sql-copy.html
This will be more low-level, but certainly faster than anything else.
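For the PostgreSQL case, a minimal sketch of what that could look like through the JDBC driver's CopyManager (table name, file path, and connection details are placeholders):
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// COPY bypasses individual INSERT statements entirely and is usually the fastest
// way to bulk-load a CSV into PostgreSQL.
public class PostgresCsvLoader {

    public static long load() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/tracedb", "user", "password");
             Reader csv = Files.newBufferedReader(Paths.get("nodes.csv"), StandardCharsets.UTF_8)) {

            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            // returns the number of rows copied once the whole file has been processed
            return copyManager.copyIn("COPY nodes FROM STDIN WITH (FORMAT csv)", csv);
        }
    }
}
Note that the row count only comes back at the end, so the "per second" progress information is lost with this route as well.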
General remarks
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
That's a very interesting idea. Will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many factors other than your DB vendor, e.g.:
Constraints on the table
Indexes on the table
Logging turned on/off
etc.
jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:
Set the commit rate (e.g. commit every 1000 rows) to avoid long UNDO / REDO logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods.
In jOOQ 3.6+, you can also:
Set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution. This can be done via the bulkXXX() methods.
Set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details). This can be done via the batchXXX() methods. A sketch combining these options follows below.
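Putting those options together, a sketch of what the Loader call could look like in jOOQ 3.6+ (the sizes are arbitrary, the table and fields reuse the javadoc example from the question, the method names follow the commitXXX()/batchXXX()/bulkXXX() families mentioned above, and the two-argument loadCSV overload is the encoding-aware one from issue #4141):
create.loadInto(AUTHOR)
      .bulkAfter(10)      // combine 10 rows in a single statement
      .batchAfter(10)     // combine 10 statements in a single JDBC batch
      .commitAfter(1000)  // commit every 1000 rows to keep UNDO / REDO logs short
      .loadCSV(inputstream, StandardCharsets.UTF_8)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();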
Very open question:
I need to write a Java client that reads millions of records (let's say account information) from an Oracle database, dumps them into XML, and sends them through web services to a vendor.
What is the most optimized way to do this, starting from fetching the millions of records? I went the JPA/Hibernate route and got OutOfMemory errors fetching 2 million records.
Is JDBC a better approach? Fetch each row and build the XML as I go? Any other alternatives?
I am not an expert in Java so any guidance is appreciated.
We faced a similar problem some time back, and our record count was in excess of 2M. This is how we approached it.
Using any OR mapping tool is simply ruled out due to large overheads, like the creation of large numbers of POJOs, which basically isn't required if the data is just to be dumped to XML.
Plain JDBC is the way to go. The main advantage is that it returns a ResultSet object which does not actually contain all the results at once, so loading the entire data set into memory is not a problem; the data is loaded as we iterate over the ResultSet.
Next comes the creation of the XML file. We create an XML file and open it in append mode.
Now, in the loop where we iterate over the ResultSet, we create XML fragments and append them to the XML file. This goes on until the entire ResultSet has been iterated.
In the end, what we have is an XML file with all the records.
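A minimal sketch of that flow (JDBC URL, credentials, table and column names are placeholders): stream rows through a ResultSet and append one small XML fragment per row, so only one row is held in memory at a time.
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AccountExporter {

    public static void export() throws SQLException, IOException {
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT account_id, account_name FROM accounts");
             BufferedWriter out = Files.newBufferedWriter(
                     Paths.get("accounts.xml"), StandardCharsets.UTF_8)) {

            ps.setFetchSize(500); // fetch rows from Oracle in chunks, not all at once
            out.write("<accounts>");

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.write("<account id=\"" + rs.getLong("account_id") + "\">"
                            + rs.getString("account_name") // real code must XML-escape this value
                            + "</account>");
                }
            }

            out.write("</accounts>");
        }
    }
}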
Now, for sharing this file, we created a web service which returns the URL to this XML file (archived/zipped) if the file is available.
The client can download this file any time after that.
Note that this is not a synchronous system, meaning the file does not become available right after the client makes the call. Since creating the XML takes a lot of time, HTTP would normally time out, hence this approach.
Just an approach you can take a clue from. Hope this helps.
Use ResultSet#setFetchSize() to optimize the number of records fetched at a time from the database.
See What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
In JDBC, the ResultSet#setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.
Read here about Oracle ResultSet Fetch Size
For this size of data, you can probably get away with starting Java with more memory. Check out using -Xmx and -Xms when you start Java.
If your data is truly too big to fit in memory, but not big enough to warrant investment in different technology, think about operating in chunks. Does this have to be done at once? Can you slice up the data into 10 chunks and do each chunk independently? If it has to be done in one shot, can you stream data from the database, and then stream it into the file, forgetting about things you are done with (to keep memory use in the JVM low)?
Read the records in chunks, as explained by previous answers.
Use StAX (http://stax.codehaus.org/) to stream the record chunks to your XML file, as opposed to assembling all records into one large document in memory.
As far as the Hibernate side is concerned, fetch using a SELECT query (instead of a FROM query) to prevent filling up the caches; alternatively, use a StatelessSession. Also be sure to use scroll() instead of list(). Configuring hibernate.jdbc.fetch_size to something like 200 is also recommended.
On the response side, XML is quite a bad choice because parsing is difficult. If this is already settled, then make sure you use a streaming XML serializer. For example, the XPP3 library contains one.
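A minimal sketch of the StAX side (element names and the id/name values are placeholders): records are written to the stream as they are fetched, so the whole document never sits in memory.
import java.io.OutputStream;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class StreamingXmlWriter implements AutoCloseable {

    private final XMLStreamWriter writer;

    public StreamingXmlWriter(OutputStream out) throws XMLStreamException {
        writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("accounts");
    }

    public void writeAccount(long id, String name) throws XMLStreamException {
        writer.writeStartElement("account");
        writer.writeAttribute("id", String.valueOf(id));
        writer.writeCharacters(name); // StAX escapes the text for us
        writer.writeEndElement();
    }

    public void flushChunk() throws XMLStreamException {
        writer.flush(); // push the current chunk out instead of buffering everything
    }

    @Override
    public void close() throws XMLStreamException {
        writer.writeEndElement(); // </accounts>
        writer.writeEndDocument();
        writer.close();
    }
}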
While a reasonable Java approach would probably involve a StAX construction of your XML in conjunction with paginated result sets (straightforward JDBC or JPA), keep in mind that you may need to lock your database against updates all the while, which may or may not be acceptable in your case.
We took a different, database-centric approach using stored procedures and triggers on INSERT and UPDATE to generate the XML node corresponding to each row/[block of] data. This constantly ensures that 250GB+ of raw data and its XML representation (~10 GB) are up-to-date and reduces (no pun intended) the export to a mere concatenation matter.
You can still use Hibernate to fetch millions of records; you just cannot do it in one round, because millions is a big number and of course you will get an OutOfMemory exception. You can divide the work into pages and dump to XML each time, so that the records won't be kept in RAM and your program won't need a huge amount of memory.
I have these two methods from a previous project that I used very frequently. Unfortunately, I did not like using HQL much, so I don't have the code for that.
Here, INT_PAGE_SIZE is the number of rows you would like to fetch in each round, and getPageCount gets the total number of rounds needed to fetch all of the records.
Then paging fetches the records page by page, from 1 to getPageCount.
public int getPageCount(Criteria criteria) {
    ProjectionList pl = Projections.projectionList();
    pl.add(Projections.rowCount());
    criteria.setProjection(pl);
    int rowCount = (Integer) criteria.list().get(0);
    criteria.setProjection(null);
    if (rowCount % INT_PAGE_SIZE == 0) {
        return rowCount / INT_PAGE_SIZE;
    }
    return rowCount / INT_PAGE_SIZE + 1;
}

public Criteria paging(Criteria criteria, int page) {
    if (page != -1) {
        criteria.setFirstResult((page - 1) * INT_PAGE_SIZE);
        criteria.setMaxResults(INT_PAGE_SIZE);
    }
    return criteria;
}
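A usage sketch (the Account entity, the Hibernate session, and writeChunkToXml() are placeholders): loop over the pages so that only INT_PAGE_SIZE records are in memory per round.
Criteria countCriteria = session.createCriteria(Account.class);
int pageCount = getPageCount(countCriteria);

for (int page = 1; page <= pageCount; page++) {
    Criteria pageCriteria = paging(session.createCriteria(Account.class), page);
    List<Account> chunk = pageCriteria.list();
    writeChunkToXml(chunk); // dump this page to the XML output
    session.clear();        // detach the chunk so it can be garbage collected
}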