I got empty result error when accessing neo4j database with the following setup.
Neo4j runs in docker.
Uploader process runs continuously with 0.5 second sleep between each
upload
Reader process runs continuously as well
Reader:
Driver driver = GraphDatabase.driver("bolt://db_address:7687", AuthTokens.basic("user", "password"));
while (true) {
try (Session session = driver.session(AccessMode.READ)) {
for (int i = 1; i <= 100; i++) {
session.run("Match (n:Number) where n.value=$value return ID(n)", parameters("value", i)).single().get(0).asInt();
}
}
Thread.sleep(500);
}
Uploader:
Driver driver = GraphDatabase.driver("bolt://db_address:7687", AuthTokens.basic("user", "password"));
while (true) {
try (Session session = driver.session(AccessMode.WRITE)) {
try (Transaction tx = session.beginTransaction()) {
tx.run("MATCH (n) DELETE n");
for (int i = 1; i <= 100; i++) {
tx.run("CREATE (n:Number {value: $value}) return ID(n)", parameters("value", i)).single().get(0).asInt();
}
tx.success();
}
}
Thread.sleep(500);
}
After few cycles I get error in the reader process:
Cannot retrieve a single record, because this result is empty.
At start the database contains the requested data.
Based on the description of "write transaction" and the code above the empty result seems strange.
Did I miss something with the transaction handling with neo4j?
You need to call tx.success() to commit the transaction.
PS: not sure why the database is cleared on every upload run
Related
I want to insert a List of 170.000 entities into my local installed MySQL 8.0 database using Hibernate 4.2.
Currently I'm doing this via the Session#save method. But inserting those many entities lasts so long. So is there a possibility to do this faster?
for (Agagf x : list) {
create(x);
}
// ------------------------
public static void create(Object obj) throws DatabaseException {
Session hsession = null;
try {
hsession = SqlDataHibernateUtil.getSessionFactory().openSession();
Transaction htransaction = hsession.beginTransaction();
hsession.save(obj);
htransaction.commit();
} catch (HibernateException ex) {
throw new DatabaseException(ex);
} finally {
if (hsession != null)
hsession.close();
}
}
There is this article on Hibernate page: http://docs.jboss.org/hibernate/orm/4.2/manual/en-US/html/ch15.html
According to them, you would need something like this:
create(list);
// ------------------------
public static void create(List<Object> objList) throws DatabaseException {
Session hsession = null;
try {
hsession = SqlDataHibernateUtil.getSessionFactory().openSession();
Transaction htransaction = hsession.beginTransaction();
int count = 0;
for(Agagf x: objList) {
hsession.save(obj);
if ( ++count % 20 == 0 ) { //20, same as the JDBC batch size
//flush a batch of inserts and release memory:
hsession.flush();
hsession.clear();
count = 0;
}
}
} catch (HibernateException ex) {
throw new DatabaseException(ex);
} finally {
htransaction.commit();
if (hsession != null) {
hsession.close();
}
}
}
Also, the configuration to enable batch processing:
if you are undertaking batch processing you will need to enable the use of JDBC
batching. This is absolutely essential if you want to achieve optimal
performance. Set the JDBC batch size to a reasonable number (10-50, for example):
hibernate.jdbc.batch_size 20
Edit: In your case, play with the batch size to better fit your volume. Just remember to make the same size in both configuration and if statement for flush.
What Happened
All the data from last month was corrupted due to a bug in the system. So we have to delete and re-input these records manually. Basically, I want to delete all the rows inserted during a certain period of time. However, I found it difficult to scan and delete millions of rows in HBase.
Possible Solutions
I found two way to bulk delete:
The first one is to set a TTL, so that all the outdated record would be deleted automatically by the system. But I want to keep the records inserted before last month, so this solution does not work for me.
The second option is to write a client using the Java API:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
Table table = null;
Connection connection = null;
try {
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
connection = HBaseOperator.getHbaseConnection();
table = connection.getTable(TableName.valueOf(tableName));
ResultScanner rs = table.getScanner(scan);
List<Delete> list = getDeleteList(rs);
if (list.size() > 0) {
table.delete(list);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (null != table) {
try {
table.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if (connection != null) {
try {
connection.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
private static List<Delete> getDeleteList(ResultScanner rs) {
List<Delete> list = new ArrayList<>();
try {
for (Result r : rs) {
Delete d = new Delete(r.getRow());
list.add(d);
}
} finally {
rs.close();
}
return list;
}
But in this approach, all the records are stored in ResultScanner rs, so the heap size would be huge. And if the program crushes, it has to start from the beginning.
So, is there a better way to achieve the goal?
Don't know how many 'millions' you are dealing with in your table, but the simples thing is to not try to put them all into a List at once but to do it in more manageable steps by using the .next(n) function. Something like this:
for (Result row : rs.next(numRows))
{
Delete del = new Delete(row.getRow());
...
}
This way, you can control how many rows get returned from the server via a single RPC through the numRows parameter. Make sure it's large enough so as not to make too many round-trips to the server, but at the same time not too large to kill your heap. You can also use the BufferedMutator to operate on multiple Deletes at once.
Hope this helps.
I would suggest two improvements:
Use BufferedMutator to batch your deletes, it does exactly what you need – keeps internal buffer of mutations and flushes it to HBase when buffer fills up, so you do not have to worry about keeping your own list, sizing and flushing it.
Improve your scan:
Use KeyOnlyFilter – since you do not need the values, no need to retrieve them
use scan.setCacheBlocks(false) - since you do a full-table scan, caching all blocks on the region server does not make much sense
tune scan.setCaching(N) and scan.setBatch(N) – the N will depend on the size of your keys, you should keep a balance between caching more and memory it will require; but since you only transfer keys, the N could be quite large, I suppose.
Here's an updated version of your code:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
try (Connection connection = HBaseOperator.getHbaseConnection();
final Table table = connection.getTable(TableName.valueOf(tableName));
final BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf(tableName))) {
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
scan.setFilter(new KeyOnlyFilter());
scan.setCaching(1000);
scan.setBatch(1000);
scan.setCacheBlocks(false);
try (ResultScanner rs = table.getScanner(scan)) {
for (Result result : rs) {
mutator.mutate(new Delete(result.getRow()));
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
Note the use of "try with resource" – if you omit that, make sure to .close() mutator, rs, table, and connection.
I have this java agent that processes a huge amount of documents that it could run overnight. The problem is that I need the agent to retry if the network got suddenly disconnected briefly. The retry could have a maximum number.
int numberOfRetries = 0;
try {
while(nextdoc != null) {
// process documents
numberOfRetries = 0;
}
} catch (NotesException e) {
numberOfRetries++;
if (numberOfRetries > 4) {
// go back and reprocess current document
} else {
// message reached max number of retries. did not successfully finished
}
}
Also, of course I do not want to actually retry the whole process. Basically I need to continue on the document it was processing and move on to the next loop
You should do a retry loop around each piece of code that gets a document. Since the Notes classes generally require a getFirst and getNext paradigm, that means you need two separate retry loops. E.g.,
numberOfRetries = 0;
maxRetries = 4;
// get first document, with retries
needToRetry = false;
while (needToRetry)
{
try
{
while (needToRetry)
{
nextDoc = myView.getFirstDocument();
needToRetry=false;
}
}
catch (NotesException e)
{
numberOfRetries++;
if (numberOfRetries < maxRetries) {
// you might want to sleep here to wait for the network to recover
// you could use numberOfRetries as a factor to sleep longer on
// each failure
needToRetry = true;
} else {
// write "Max retries have been exceeded getting first document" to log
nextDoc = null; // we won't go into the processing loop
}
}
}
// process all documents
while(nextdoc != null)
{
// process nextDoc
// insert your code here
// now get next document, with retries
while (needToRetry)
{
try
{
nextDoc = myView.getNextDocument();
needToRetry=false;
}
catch (NotesException e)
{
numberOfRetries++;
if (numberOfRetries < maxRetries) {
// you might want to sleep here to wait for the network to recover
// you could use numberOfRetries as a factor to sleep longer on
// each failure
needToRetry = true;
} else {
// write "Max retries have been exceeded getting first document" to log
nextDoc = false; // we'lll be exiting the processing loop without finishing all docs
}
}
}
}
Note that I'm treating maxRetries as the max total retries across all documents in the data set, not the max for each document.
Also note that it's probably cleaner to break this up a little. E.g.
numberOfRetries = 0;
maxRetries = 4;
nextDoc = getFirstDocWithRetries(view); // this contains while loop and try-catch
while (nextDoc != null)
{
processOneDoc(nextDoc);
nextDoc = getNextDocWithRetries(view,nextDoc); // and so does this
}
I would not recommend what you are doing at all.
The NotesException can fire for a number of reasons, and there is no guarantee you will be returning to a safe state.
Also the fact the agent needs to run for such a long time means you need to change the server "Maximum execution timeout" to allow it to run correctly. Setting that to a very high value makes the server more prone to performance/deadlock issues.
A better solution would be to batch workload and have the agent run for a set time on a batch. Update as you go so that when the agent comes back it knows to work on the next batch.
I want to use the elasticsearch bulk api using java and wondering how I can set the batch size.
Currently I am using it as:
BulkRequestBuilder bulkRequest = getClient().prepareBulk();
while(hasMore) {
bulkRequest.add(getClient().prepareIndex(indexName, indexType, artist.getDocId()).setSource(json));
hasMore = checkHasMore();
}
BulkResponse bResp = bulkRequest.execute().actionGet();
//To check failures
log.info("Has failures? {}", bResp.hasFailures());
Any idea how I can set the bulk/batch size?
It mainly depends on the size of your documents, available resources on the client and the type of client (transport client or node client).
The node client is aware of the shards over the cluster and sends the documents directly to the nodes that hold the shards where they are supposed to be indexed. On the other hand the transport client is a normal client that sends its requests to a list of nodes in a round-robin fashion. The bulk request would be sent to one node then, which would become your gateway when indexing.
Since you're using the Java API, I would suggest you to have a look at the BulkProcessor, which makes it much easier and flexibile to index in bulk. You can either define a maximum number of actions, a maximum size and a maximum time interval since the last bulk execution. It's going to execute the bulk automatically for you when needed. You can also set a maximum number of concurrent bulk requests.
After you created the BulkProcessor like this:
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
#Override
public void beforeBulk(long executionId, BulkRequest request) {
logger.info("Going to execute new bulk composed of {} actions", request.numberOfActions());
}
#Override
public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
logger.info("Executed bulk composed of {} actions", request.numberOfActions());
}
#Override
public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
logger.warn("Error executing bulk", failure);
}
}).setBulkActions(bulkSize).setConcurrentRequests(maxConcurrentBulk).build();
You just have to add your requests to it:
bulkProcessor.add(indexRequest);
and close it at the end to flush any eventual requests that might have not been executed yet:
bulkProcessor.close();
To finally answer your question: the nice thing about the BulkProcessor is also that it has sensible defaults: 5 MB of size, 1000 actions, 1 concurrent request, no flush interval (which might be useful to set).
you need to count your bulk request builder when it hits your batch size limit then index them and flush older bulk builds .
here is example of code
Settings settings = ImmutableSettings.settingsBuilder()
.put("cluster.name", "MyClusterName").build();
TransportClient client = new TransportClient(settings);
String hostname = "myhost ip";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(hostname, port));
BulkRequestBuilder bulkBuilder = client.prepareBulk();
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("my_file_path"))));
long bulkBuilderLength = 0;
String readLine = "";
String index = "my_index_name";
String type = "my_type_name";
String id = "";
while((readLine = br.readLine()) != null){
id = somefunction(readLine);
String json = new ObjectMapper().writeValueAsString(readLine);
bulkBuilder.add(client.prepareIndex(index, type, id).setSource(json));
bulkBuilderLength++;
if(bulkBuilderLength % 1000== 0){
logger.info("##### " + bulkBuilderLength + " data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
}
br.close();
if(bulkBuilder.numberOfActions() > 0){
logger.info("##### " + bulkBuilderLength + " data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
hope this helps you
thanks
In my application I'm using CSVReader & hibernate to import large amount of entities (like 1 500 000 or more) into database from a csv file. The code looks like this:
Session session = headerdao.getSessionFactory().openSession();
Transaction tx = session.beginTransaction();
int count = 0;
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
try {
if (nextLine.length == 23
&& Integer.parseInt(nextLine[0]) > lastIdInDB) {
JournalHeader current = parseJournalHeader(nextLine);
current.setChain(chain);
session.save(current);
count++;
if (count % 100 == 0) {
session.flush();
tx.commit();
session.clear();
tx.begin();
}
if (count % 10000 == 0) {
LOG.info(count);
}
}
} catch (NumberFormatException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}
}
tx.commit();
session.close();
With large enough files (somewhere around 700 000 lines) I get out of memory exception (heap space).
It seems that the problem is somehow hibernate related, because if I comment just the line session.save(current); it runs fine. If it's uncommented, the task manager shows continuously increasing memory usage of javaw and then at some point the parsing gets real slow and it crashes.
parseJournalHeader() does nothing special, it just parses an entity based on the String[] that the csv reader gives.
Session actually persists objects in cache. You are doing correct things to deal with first-level cache. But there's more things which prevent garbage collection from happening.
Try to use StatelessSession instead.