How to scan and delete millions of rows in HBase - java

What Happened
All the data from last month was corrupted due to a bug in the system, so we have to delete and re-insert these records manually. Basically, I want to delete all the rows inserted during a certain period of time. However, I found it difficult to scan and delete millions of rows in HBase.
Possible Solutions
I found two ways to bulk delete:
The first one is to set a TTL, so that all the outdated records would be deleted automatically by the system. But I want to keep the records inserted before last month, so this solution does not work for me.
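For reference, setting a TTL would look roughly like this with the HBase 2.x Admin API ("cf" and the 30-day window are just placeholders); every cell older than the TTL is removed at compaction, which is exactly why it cannot spare the older records I need to keep:
try (Admin admin = connection.getAdmin()) {
    ColumnFamilyDescriptor cfd = ColumnFamilyDescriptorBuilder
            .newBuilder(Bytes.toBytes("cf"))      // placeholder column family
            .setTimeToLive(30 * 24 * 3600)        // TTL in seconds
            .build();
    admin.modifyColumnFamily(TableName.valueOf(tableName), cfd);
}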
The second option is to write a client using the Java API:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
    Table table = null;
    Connection connection = null;
    try {
        Scan scan = new Scan();
        scan.setTimeRange(minTime, maxTime);
        connection = HBaseOperator.getHbaseConnection();
        table = connection.getTable(TableName.valueOf(tableName));
        ResultScanner rs = table.getScanner(scan);
        List<Delete> list = getDeleteList(rs);
        if (list.size() > 0) {
            table.delete(list);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (null != table) {
            try {
                table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (connection != null) {
            try {
                connection.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private static List<Delete> getDeleteList(ResultScanner rs) {
    List<Delete> list = new ArrayList<>();
    try {
        for (Result r : rs) {
            Delete d = new Delete(r.getRow());
            list.add(d);
        }
    } finally {
        rs.close();
    }
    return list;
}
But in this approach, all the Deletes built from the ResultScanner are accumulated in one List, so the heap usage would be huge. And if the program crashes, it has to start from the beginning.
So, is there a better way to achieve the goal?

I don't know how many 'millions' you are dealing with in your table, but the simplest thing is not to try to put them all into a List at once, but to process them in more manageable steps by using the .next(n) method. Something like this:
for (Result row : rs.next(numRows))
{
    Delete del = new Delete(row.getRow());
    ...
}
This way, you can control how many rows get returned from the server via a single RPC through the numRows parameter. Make sure it's large enough so as not to make too many round-trips to the server, but at the same time not so large that it kills your heap. You can also use the BufferedMutator to operate on multiple Deletes at once.
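A rough sketch of how the two ideas fit together (deleteInBatches is just a hypothetical helper name; it reuses the connection, tableName, and time-range scan from your code, and numRows is a tuning knob):
public static void deleteInBatches(Connection connection, String tableName,
                                   long minTime, long maxTime, int numRows) throws IOException {
    Scan scan = new Scan();
    scan.setTimeRange(minTime, maxTime);
    try (Table table = connection.getTable(TableName.valueOf(tableName));
         ResultScanner rs = table.getScanner(scan);
         BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf(tableName))) {
        Result[] batch;
        while ((batch = rs.next(numRows)).length > 0) {   // fetch up to numRows rows at a time
            for (Result row : batch) {
                mutator.mutate(new Delete(row.getRow())); // buffered and flushed automatically
            }
        }
    } // closing the mutator flushes any remaining deletes
}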
Hope this helps.

I would suggest two improvements:
Use BufferedMutator to batch your deletes – it does exactly what you need: it keeps an internal buffer of mutations and flushes it to HBase when the buffer fills up, so you do not have to worry about keeping, sizing, and flushing your own list.
Improve your scan:
Use KeyOnlyFilter – since you do not need the values, there is no need to retrieve them.
Use scan.setCacheBlocks(false) – since you are doing a full-table scan, caching all the blocks on the region server does not make much sense.
Tune scan.setCaching(N) and scan.setBatch(N) – N will depend on the size of your keys; you should balance caching more rows against the memory that requires, but since you only transfer keys, N could be quite large, I suppose.
Here's an updated version of your code:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
    try (Connection connection = HBaseOperator.getHbaseConnection();
         final Table table = connection.getTable(TableName.valueOf(tableName));
         final BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf(tableName))) {
        Scan scan = new Scan();
        scan.setTimeRange(minTime, maxTime);
        scan.setFilter(new KeyOnlyFilter());
        scan.setCaching(1000);
        scan.setBatch(1000);
        scan.setCacheBlocks(false);
        try (ResultScanner rs = table.getScanner(scan)) {
            for (Result result : rs) {
                mutator.mutate(new Delete(result.getRow()));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Note the use of try-with-resources – if you omit that, make sure to .close() the mutator, rs, table, and connection.
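If the default write buffer turns out to be a poor fit for your delete volume, you can also size it explicitly via BufferedMutatorParams (a sketch; the 8 MB value is only an example):
BufferedMutator mutator = connection.getBufferedMutator(
        new BufferedMutatorParams(TableName.valueOf(tableName))
                .writeBufferSize(8 * 1024 * 1024)); // flush roughly every 8 MB of queued deletes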

Related

Zebra RFID API Read Access Operation Code returns null

I'm trying to develop a small application for a Zebra handheld RFID reader and can't find a way to access the MemoryBank of the tag. My reader configuration is as follows:
private void ConfigureReader() {
    if (reader.isConnected()) {
        TriggerInfo triggerInfo = new TriggerInfo();
        triggerInfo.StartTrigger.setTriggerType(START_TRIGGER_TYPE.START_TRIGGER_TYPE_IMMEDIATE);
        triggerInfo.StopTrigger.setTriggerType(STOP_TRIGGER_TYPE.STOP_TRIGGER_TYPE_IMMEDIATE);
        try {
            // receive events from reader
            if (eventHandler == null) {
                eventHandler = new EventHandler();
            }
            reader.Events.addEventsListener(eventHandler);
            // HH event
            reader.Events.setHandheldEvent(true);
            // tag event with tag data
            reader.Events.setTagReadEvent(true);
            reader.Events.setAttachTagDataWithReadEvent(true);
            // set trigger mode as rfid so scanner beam will not come
            reader.Config.setTriggerMode(ENUM_TRIGGER_MODE.RFID_MODE, true);
            // set start and stop triggers
            reader.Config.setStartTrigger(triggerInfo.StartTrigger);
            reader.Config.setStopTrigger(triggerInfo.StopTrigger);
        } catch (InvalidUsageException e) {
            e.printStackTrace();
        } catch (OperationFailureException e) {
            e.printStackTrace();
        }
    }
}
And the eventReadNotify looks like this:
public void eventReadNotify(RfidReadEvents e) {
    // Recommended to use new method getReadTagsEx for better performance in case of large tag population
    TagData[] myTags = reader.Actions.getReadTags(100);
    if (myTags != null) {
        for (int index = 0; index < myTags.length; index++) {
            Log.d(TAG, "Tag ID " + myTags[index].getTagID());
            ACCESS_OPERATION_CODE aoc = myTags[index].getOpCode();
            ACCESS_OPERATION_STATUS aos = myTags[index].getOpStatus();
            if (aoc == ACCESS_OPERATION_CODE.ACCESS_OPERATION_READ && aos == ACCESS_OPERATION_STATUS.ACCESS_SUCCESS) {
                if (myTags[index].getMemoryBankData().length() > 0) {
                    Log.d(TAG, " Mem Bank Data " + myTags[index].getMemoryBankData());
                }
            }
        }
    }
}
When I'm scanning a tag I get the correct TagID but both myTags[index].getOpCode() and myTags[index].getOpStatus() return null values.
I appreciate every suggestion that might lead to a successful scan.
Thanks.
I managed to find a solution to my problem. To perform any read or write task with Zebra handheld scanners, the following two conditions must be satisfied. See here for reference: How to write to RFID tag using RFIDLibrary by Zebra?
// make sure Inventory is stopped
reader.Actions.Inventory.stop();
// make sure DPO is disabled
reader.Config.setDPOState(DYNAMIC_POWER_OPTIMIZATION.DISABLE);
You have to stop the inventory and make sure to disable DPO in order to get data other than the TagID from a tag. Unfortunately, this isn't mentioned in the documentation for reading RFID tags.
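With the inventory stopped and DPO disabled, a read access operation along the lines of Zebra's API3 sample code then fills in the memory bank data. A rough sketch only – the memory bank, offset, count, access password, and the null AntennaInfo are assumptions you would adapt:
// sketch based on Zebra RFID API3 samples; all parameter values are assumptions
TagAccess tagAccess = new TagAccess();
TagAccess.ReadAccessParams readAccessParams = tagAccess.new ReadAccessParams();
readAccessParams.setAccessPassword(0);
readAccessParams.setMemoryBank(MEMORY_BANK.MEMORY_BANK_USER); // which bank to read is an assumption
readAccessParams.setOffset(0); // word offset
readAccessParams.setCount(2);  // number of words to read
TagData readTag = reader.Actions.TagAccess.readWait(tagId, readAccessParams, null);
Log.d(TAG, "Mem Bank Data " + readTag.getMemoryBankData());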

Best way to get BigQuery temp table created by Job to read large data faster

I am trying to execute a query over a table in BigQuery using its Java client libraries. I create a Job and then get the result of the Job using the job.getQueryResults().iterateAll() method.
This works, but for large results (around 600k rows) it takes about 80-120 seconds. I see that BigQuery fetches the data in batches of 40-45k rows, each of which takes around 5-7 seconds.
I want to get the results faster, and I found on the internet that if we can get the temporary table created by BigQuery from the Job and then read the data in Avro or some other format from that table, it will be really fast, but in the BigQuery API (using version 1.124.7) I don't see a way to do that.
Does anyone know how to do that in Java, or how to get the data faster in the case of a large number of records?
Any help is appreciated.
Code to read the table (takes 20 sec)
Table table = bigQueryHelper.getBigQueryClient().getTable(TableId.of("project", "dataset", "table"));
String format = "CSV";
String gcsUrl = "gs://name/test.csv";
Job job = table.extract(format, gcsUrl);
// Wait for the job to complete
try {
    Job completedJob = job.waitFor(RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
            RetryOption.totalTimeout(Duration.ofMinutes(3)));
    if (completedJob != null && completedJob.getStatus().getError() == null) {
        log.info("job done");
        // Job completed successfully
    } else {
        log.info("job has error");
        // Handle error case
    }
} catch (InterruptedException e) {
    // Handle interrupted wait
}
Code to read the same table using a query (takes 90 sec)
Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
for (FieldValueList row : job.getQueryResults().iterateAll()) {
    System.out.println(row);
}
I tried several approaches and, based on that, found the best way of doing it. I thought I would post it here to help someone in the future.
1: If we use job.getQueryResults().iterateAll() on the job or directly on the table, it takes the same time. If we don't specify a batch size, BigQuery uses a batch size of around 35-45k rows to fetch the data, so for 600k rows (180 MB) it takes 70-100 sec.
2: We can take the temp table details from the created job and use the table's extract-job feature to write the result to GCS; this is faster and takes around 30-35 sec. This approach does not download the data locally, though – for that we again need to use .iterateAll() on the temp table, and that takes the same time as option 1.
Example pseudo code:
try {
    Job job = getBigQueryClient().getJob(JobId.of(jobId));
    long start = System.currentTimeMillis();
    // FieldList list = getFields(job);
    Job completedJob =
            job.waitFor(
                    RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
                    RetryOption.totalTimeout(Duration.ofMinutes(3)));
    if (completedJob != null && completedJob.getStatus().getError() == null) {
        log.info("job done");
        String gcsUrl = "gs://bucketname/test";
        // getting the temp table information of the Job
        TableId destinationTableInfo =
                ((QueryJobConfiguration) job.getConfiguration()).getDestinationTable();
        log.info("Total time taken in getting schema ::{}", (System.currentTimeMillis() - start));
        Table table = bigQueryHelper.getBigQueryClient().getTable(destinationTableInfo);
        // Using extract job to write the data in GCS
        Job newJob1 =
                table.extract(
                        CsvOptions.newBuilder().setFieldDelimiter("\t").build().toString(), gcsUrl);
        System.out.println("DestinationInfo::" + destinationTableInfo);
        Job completedJob1 =
                newJob1.waitFor(
                        RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
                        RetryOption.totalTimeout(Duration.ofMinutes(3)));
        if (completedJob1 != null && completedJob1.getStatus().getError() == null) {
            log.info("job done");
        } else {
            log.info("job has error");
        }
    } else {
        log.info("job has error");
    }
} catch (InterruptedException e) {
    e.printStackTrace();
}
3: This is the best way, and the one I wanted. It downloads/writes the result to a local file much faster – around 20 sec. This is the newer mechanism BigQuery provides (the Storage Read API) and can be checked using the links below:
https://cloud.google.com/bigquery/docs/reference/storage#background
https://cloud.google.com/bigquery/docs/reference/storage/libraries#client-libraries-install-java
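For completeness, a minimal sketch of option 3 with the Storage Read API Java client (class names are from the google-cloud-bigquerystorage library; projectId, datasetName, and tableName are placeholders, and the Avro decoding is only indicated in a comment):
try (BigQueryReadClient client = BigQueryReadClient.create()) {
    String table = String.format("projects/%s/datasets/%s/tables/%s",
            projectId, datasetName, tableName); // could also point at the job's destination table
    ReadSession.Builder sessionBuilder = ReadSession.newBuilder()
            .setTable(table)
            .setDataFormat(DataFormat.AVRO);
    CreateReadSessionRequest request = CreateReadSessionRequest.newBuilder()
            .setParent(String.format("projects/%s", projectId))
            .setReadSession(sessionBuilder)
            .setMaxStreamCount(1)
            .build();
    ReadSession session = client.createReadSession(request);
    ReadRowsRequest readRequest = ReadRowsRequest.newBuilder()
            .setReadStream(session.getStreams(0).getName())
            .build();
    for (ReadRowsResponse response : client.readRowsCallable().call(readRequest)) {
        // decode response.getAvroRows() with an Avro DatumReader built from session.getAvroSchema()
    }
}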

Java unique code generation fails when calling the recursive function

We have to implement logic for unique code generation in Java. The concept is that when we generate a code, the system checks whether the code has already been generated. If it has, the system creates a new code and checks again. But this logic fails in some cases, and we are not able to identify what the issue is.
Here is the code to create the unique code
public Integer createCode() { // method signature restored; the original snippet started mid-method
    Integer code = null;
    try {
        int max = 999999;
        int min = 100000;
        code = (int) Math.round(Math.random() * (max - min + 1) + min);
        PreOrders preObj = null;
        preObj = WebServiceDao.getInstance().preOrderObj(code.toString());
        if (preObj != null) {
            createCode();
        }
    } catch (Exception e) {
        exceptionCaught();
        e.printStackTrace();
        log.error("Exception in method createCode() - " + e.toString());
    }
    return code;
}
The function preOrderObj checks whether the code exists in the database and, if it exists, returns the object. We are using Hibernate to map the database and MySQL on the backend.
Here is the function preOrderObj
public PreOrders preOrderObj(String code) { // method signature restored from the call site above
    PreOrders preOrderObj = null;
    List<PreOrders> preOrderList = null;
    SessionFactory sessionFactory =
            (SessionFactory) ServletActionContext.getServletContext().getAttribute(HibernateListener.KEY_NAME);
    Session Hibernatesession = sessionFactory.openSession();
    try {
        Hibernatesession.beginTransaction();
        preOrderList = Hibernatesession.createCriteria(PreOrders.class)
                .add(Restrictions.eq("code", code)).list(); // removed .add(Restrictions.eq("status", true))
        if (!preOrderList.isEmpty()) {
            preOrderObj = (PreOrders) preOrderList.iterator().next();
        }
        Hibernatesession.getTransaction().commit();
        Hibernatesession.flush();
    } catch (Exception e) {
        Hibernatesession.getTransaction().rollback();
        log.debug("This is my debug message.");
        log.info("This is my info message.");
        log.warn("This is my warn message.");
        log.error("This is my error message.");
        log.fatal("Fatal error " + e.getStackTrace().toString());
    } finally {
        Hibernatesession.close();
    }
    return preOrderObj;
}
Please guide us to identify the issue.
In the createCode method, when the randomly generated code already exists in the database, you call createCode again. However, the return value from the recursive call is not assigned to the code variable, so the colliding code is still returned and causes the error.
To fix the problem, update the method as follows:
...
if (preObj != null) {
    // createCode();
    code = createCode();
}
...
This way, code is updated with the newly generated value.
By the way, using a random number to generate a unique value and testing uniqueness through a query is a bit odd. You may want to try auto-increment if you just need a unique value.
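For example, a minimal sketch of the auto-increment idea with a JPA/Hibernate mapping (the entity and field names here are only assumptions):
// javax.persistence annotations; field names are assumptions
@Entity
public class PreOrders {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY) // maps to MySQL AUTO_INCREMENT
    private Long code;
    // ... other fields, getters and setters
}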

Fast insertion of many entities in Hibernate

I want to insert a List of 170,000 entities into my locally installed MySQL 8.0 database using Hibernate 4.2.
Currently I'm doing this via the Session#save method, but inserting that many entities takes very long. Is there a way to do this faster?
for (Agagf x : list) {
    create(x);
}

// ------------------------

public static void create(Object obj) throws DatabaseException {
    Session hsession = null;
    try {
        hsession = SqlDataHibernateUtil.getSessionFactory().openSession();
        Transaction htransaction = hsession.beginTransaction();
        hsession.save(obj);
        htransaction.commit();
    } catch (HibernateException ex) {
        throw new DatabaseException(ex);
    } finally {
        if (hsession != null)
            hsession.close();
    }
}
There is this chapter on batch processing in the Hibernate documentation: http://docs.jboss.org/hibernate/orm/4.2/manual/en-US/html/ch15.html
According to it, you would need something like this:
create(list);

// ------------------------

public static void create(List<Agagf> objList) throws DatabaseException {
    Session hsession = null;
    Transaction htransaction = null;
    try {
        hsession = SqlDataHibernateUtil.getSessionFactory().openSession();
        htransaction = hsession.beginTransaction();
        int count = 0;
        for (Agagf x : objList) {
            hsession.save(x);
            if (++count % 20 == 0) { // 20, same as the JDBC batch size
                // flush a batch of inserts and release memory:
                hsession.flush();
                hsession.clear();
            }
        }
        htransaction.commit();
    } catch (HibernateException ex) {
        throw new DatabaseException(ex);
    } finally {
        if (hsession != null) {
            hsession.close();
        }
    }
}
Also, the configuration to enable batch processing:
if you are undertaking batch processing you will need to enable the use of JDBC batching. This is absolutely essential if you want to achieve optimal performance. Set the JDBC batch size to a reasonable number (10-50, for example):
hibernate.jdbc.batch_size 20
Edit: In your case, play with the batch size to better fit your volume. Just remember to use the same size in both the configuration and the if statement that triggers the flush.
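If you configure Hibernate programmatically rather than through hibernate.cfg.xml, the same property can be set on the Configuration before building the SessionFactory (a sketch):
Configuration configuration = new Configuration().configure();
configuration.setProperty("hibernate.jdbc.batch_size", "20"); // keep in sync with the flush interval in the loop
SessionFactory sessionFactory = configuration.buildSessionFactory();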

Hibernate causes out of memory exception when saving large number of entities

In my application I'm using CSVReader and Hibernate to import a large number of entities (1,500,000 or more) into the database from a CSV file. The code looks like this:
Session session = headerdao.getSessionFactory().openSession();
Transaction tx = session.beginTransaction();
int count = 0;
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    try {
        if (nextLine.length == 23
                && Integer.parseInt(nextLine[0]) > lastIdInDB) {
            JournalHeader current = parseJournalHeader(nextLine);
            current.setChain(chain);
            session.save(current);
            count++;
            if (count % 100 == 0) {
                session.flush();
                tx.commit();
                session.clear();
                tx.begin();
            }
            if (count % 10000 == 0) {
                LOG.info(count);
            }
        }
    } catch (NumberFormatException e) {
        e.printStackTrace();
    } catch (ParseException e) {
        e.printStackTrace();
    }
}
tx.commit();
session.close();
With large enough files (somewhere around 700,000 lines) I get an out-of-memory exception (heap space).
It seems that the problem is somehow Hibernate related, because if I comment out just the line session.save(current); it runs fine. If it's uncommented, the task manager shows continuously increasing memory usage of javaw, and then at some point the parsing gets really slow and it crashes.
parseJournalHeader() does nothing special; it just parses an entity based on the String[] that the CSV reader provides.
A Session keeps the persistent objects it has seen in its cache. You are doing the right things to deal with the first-level cache, but there are more things that can prevent garbage collection from happening.
Try to use StatelessSession instead.
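For example, a sketch of the same import loop on a StatelessSession, reusing the names from your code. A StatelessSession has no first-level cache, so nothing accumulates between inserts (note that it also skips cascades and interceptors):
StatelessSession session = headerdao.getSessionFactory().openStatelessSession();
Transaction tx = session.beginTransaction();
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    if (nextLine.length == 23 && Integer.parseInt(nextLine[0]) > lastIdInDB) {
        JournalHeader current = parseJournalHeader(nextLine);
        current.setChain(chain);
        session.insert(current); // goes straight to JDBC; nothing is cached in the session
    }
}
tx.commit();
session.close();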
