I'm trying to find the best/optimal way of loading larger amounts of data from a MySQL database in a Spring/Hibernate service.
I pull about 100k records from a 3rd party API (in chunks, usually between 300 and 1000). I then need to pull translations for each record from the database. Since there are 30 languages, there are 30 rows per record, so 1000 records from the API means 30,000 rows from the database.
The records from the API come in the form of POJOs (very small in size). Say I get 1000 records: I split the list into multiple 100-record lists, collect the ids of the records in each chunk, and select all translations for those ids from the database. I only need two values from the table, which I then add to my POJOs before pushing them to the next service.
Basically this:
interface I18nRepository extends CrudRepository<Translation, Long> {
    List<Translation> findAllByRecordIdIn(List<Long> recordIds);
}

List<APIRecord> records = api.findRecords(...);
List<List<APIRecord>> partitioned = Lists.partition(records, 100); // Guava
for (List<APIRecord> chunk : partitioned) {
    List<Long> ids = new ArrayList<>();
    for (APIRecord record : chunk) {
        ids.add(record.getId());
    }
    List<Translation> translations = i18nRepository.findAllByRecordIdIn(ids);
    for (APIRecord record : chunk) {
        for (Translation translation : translations) {
            if (translation.getRecordId().equals(record.getId())) {
                record.addTranslation(translation);
            }
        }
    }
}
As far as spring-boot/hibernate properties go, I only have the defaults set. I would like to make this as efficient, fast, and memory-light as possible. One idea I had was to use a lower-level API instead of Hibernate to bypass object mapping.
In my opinion, you should bypass JPA/Hibernate for bulk operations.
There's no way to make bulk operations efficient in JPA.
Consider using Spring's JpaTemplate and native SQL.
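For illustration, here is a rough sketch of that route using Spring's NamedParameterJdbcTemplate and a native query instead of the repository (the translation table, its record_id/language/value columns, and the Translation setters are assumptions, not details from the question):

// Loads the two needed values for a chunk of record ids in one native query
// and groups them by record id, so the caller can avoid the nested loop.
public Map<Long, List<Translation>> loadTranslations(List<Long> ids, NamedParameterJdbcTemplate jdbc) {
    String sql = "SELECT record_id, language, value FROM translation WHERE record_id IN (:ids)";
    MapSqlParameterSource params = new MapSqlParameterSource("ids", ids);
    Map<Long, List<Translation>> byRecordId = new HashMap<>();
    jdbc.query(sql, params, rs -> {
        Translation t = new Translation();
        t.setRecordId(rs.getLong("record_id"));
        t.setLanguage(rs.getString("language"));
        t.setValue(rs.getString("value"));
        byRecordId.computeIfAbsent(t.getRecordId(), k -> new ArrayList<>()).add(t);
    });
    return byRecordId;
}

With the translations keyed by record id, each chunk can then be enriched with a single map lookup per record instead of scanning the whole translation list for every record.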
I have a large Excel file with 32k rows and Java/Spring code that persists the Excel data to a MySQL database. My code works for about 6k rows, but not for the entire Excel file, due to a JPA limitation. I read that it can be done with JPA pagination, but so far I have only found info on reading already persisted data from the DB and rendering it to a UI. The Excel file contains 32k medicines, and these rows need to be persisted into the DB.
I have this Controller layer with the following method:
public ResponseEntity<ResponseMessage> uploadFile(@RequestParam("file") MultipartFile file,
                                                  @RequestParam(defaultValue = "0") int page,
                                                  @RequestParam(defaultValue = "6000") int size) {
    String message = "";
    if (ExcelHelper.hasExcelFormat(file)) {
        try {
            // the following 6 rows are my pathetic attempt to resolve this with pagination
            List<Medicine> medicines = new ArrayList<>();
            Pageable paging = PageRequest.of(page, size);
            Page<Medicine> pageMedicamente = medicineRepositoryDao.save(paging);
            medicines = pageMedicamente.getContent();
            medicineService.save(file);
            message = "Uploaded the file successfully: " + file.getOriginalFilename();
            return ResponseEntity.status(HttpStatus.OK).body(new ResponseMessage(message));
        } catch (Exception e) {
            message = "Could not upload the file: " + file.getOriginalFilename() + "!";
            return ResponseEntity.status(HttpStatus.EXPECTATION_FAILED).body(new ResponseMessage(message));
        }
    }
    // fallback when the upload is not an Excel file
    message = "Please upload an Excel file!";
    return ResponseEntity.status(HttpStatus.BAD_REQUEST).body(new ResponseMessage(message));
}
And the Repository layer:
@Repository
public interface MedicineRepositoryDao extends JpaRepository<Medicine, Long> {
    Page<Medicine> save(Pageable pageable);
}
And also the Service layer:
    try {
        List<Medicine> medicines = ExcelHelper.excelToMedicine(file.getInputStream());
        medicineRepositoryDao.saveAll(medicines);
    } catch (IOException e) {
        throw new RuntimeException("fail to store excel data: " + e.getMessage());
    }
}
I think you have a couple of things mixed up here.
I don't think Spring has any relevant limitation on the number of rows you may persist here. But JPA does: it keeps a reference to every entity that you save in its first-level cache. For a large number of rows/entities this hogs memory and also makes some operations slower, since entities get looked up or processed one by one.
Pagination is for reading entities, not for saving.
You have several options in this situation.
Don't use JPA. For simply reading data from a file and writing it into a database, JPA hardly offers any benefit. This can almost trivially be performed using just a JdbcTemplate or NamedParameterJdbcTemplate and will be much faster, since the overhead of JPA, which you don't benefit from anyway in this scenario, is skipped. If you want to use an ORM, you might want to take a look at Spring Data JDBC, which is conceptually simpler and doesn't keep references to entities, and therefore should show better characteristics in this scenario. That said, I recommend not using an ORM here, since you don't seem to benefit from having entities; creating them and then having the ORM extract the data from them again is really a waste of time.
Break your import into batches. This means you persist e.g. 1000 rows at a time, write them to the database, and commit the transaction before you continue with the next 1000 rows. For JPA this is pretty much a necessity, for the reasons laid out above. With JDBC (i.e. JdbcTemplate & Co.) this probably isn't necessary for 32k rows, but it might improve performance and can be useful for recoverability if an insert fails. Spring Batch will help you implement that.
While the previous point talks about batching in the sense of breaking your import into chunks, you should also look into batching on the JDBC side, where you send multiple statements, or a single statement with multiple sets of parameters, to the database in one go; this again should improve performance (see the sketch after these points).
Finally, there are often alternatives outside of the Javaverse that might be more suitable for the job. Some databases have tools to load flat files extremely efficiently.
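As a rough illustration of the JdbcTemplate route combined with JDBC-side batching, the import could look something like the sketch below; the medicine table, its columns, and the Medicine getters are assumptions, since the real mapping lives in your ExcelHelper:

public void saveMedicines(List<Medicine> medicines, JdbcTemplate jdbcTemplate) {
    String sql = "INSERT INTO medicine (name, code, price) VALUES (?, ?, ?)";
    int batchSize = 1000;
    for (int from = 0; from < medicines.size(); from += batchSize) {
        List<Medicine> chunk = medicines.subList(from, Math.min(from + batchSize, medicines.size()));
        List<Object[]> batchArgs = new ArrayList<>();
        for (Medicine m : chunk) {
            batchArgs.add(new Object[] { m.getName(), m.getCode(), m.getPrice() });
        }
        // One JDBC batch per chunk: a single round trip carries up to 1000 rows.
        jdbcTemplate.batchUpdate(sql, batchArgs);
    }
}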
I am working on a monitoring tool developed in Spring Boot using Hibernate as ORM.
I need to compare each row (already persisted rows of sent messages) in my table and see whether a MailId (unique) has received feedback (status: OPENED, BOUNCED, DELIVERED...) or not.
I get the feedback by reading CSV files from a network folder. Parsing and reading the CSV files is very fast, but updating my database is very slow. My algorithm is not very efficient, because I loop through a list that can contain hundreds of thousands of objects and look each one up in my table.
This is the method that performs the update in my table by updating the "target" object (a row in the database table):
@Override
public void updateTargetObjectFoo() throws CSVProcessingException, FileNotFoundException {
    // performProcessing reads files from a folder, parses them into Java objects
    // and maps them into a feedBackList of type Foo
    List<Foo> feedBackList = performProcessing(env.getProperty("foo_in"), EXPECTED_HEADER_FIELDS_STATUS, Foo.class, ".LETTERS.STATUS.");
    for (Foo foo : feedBackList) {
        // findByKey does a simple SELECT in MySQL where MailId = foo.getMailId()
        Foo persistedFoo = fooDao.findByKey(foo.getMailId());
        if (persistedFoo != null) {
            persistedFoo.setStatus(foo.getStatus());
            persistedFoo.setDnsCode(foo.getDnsCode());
            persistedFoo.setReturnDate(foo.getReturnDate());
            persistedFoo.setReturnTime(foo.getReturnTime());
            // saveAccount does a MySQL UPDATE on the table
            fooDao.saveAccount(persistedFoo);
        }
    }
}
What if I did this selection/comparison and update on the Java side and then re-updated the whole list in the database?
Would that be faster?
Thanks to all for your help.
Hibernate is not particularly well-suited for batch processing.
You may be better off using Spring's JdbcTemplate to do JDBC batch processing.
However, if you must do this via Hibernate, this may help: https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/chapters/batch/Batching.html
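For example, a JDBC batch version of the loop above might look like the following sketch; the table and column names (foo, status, dns_code, return_date, return_time, mail_id) are assumptions:

public void updateFeedbackInBatch(List<Foo> feedBackList, JdbcTemplate jdbcTemplate) {
    String sql = "UPDATE foo SET status = ?, dns_code = ?, return_date = ?, return_time = ? WHERE mail_id = ?";
    List<Object[]> batchArgs = new ArrayList<>();
    for (Foo foo : feedBackList) {
        batchArgs.add(new Object[] {
                foo.getStatus(), foo.getDnsCode(), foo.getReturnDate(), foo.getReturnTime(), foo.getMailId()
        });
    }
    // One batched round trip instead of one SELECT plus one UPDATE per row;
    // rows whose mail_id does not exist simply update nothing.
    jdbcTemplate.batchUpdate(sql, batchArgs);
}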
I am working on providing an API, and in MongoDB I am storing data in one database per month and one collection per date.
So I have a db db_08_2015, and in it 31 collections, from date_01 to date_31.
To get the total money spent from date 1 to date 10, I have to query each collection, so I need to send 31 requests like the one below.
My question is: how do I issue one request at a time and compute the sum before returning to the client, i.e. something like synchronous requests to Mongo to collect the result?
Something like: I have date_01 = 10, then date_02 = 20, ... and I want to sum it all before returning to the client.
vertx.eventBus().send("mongodb-persistor", json, new Handler<Message<JsonObject>>() {
    @Override
    public void handle(Message<JsonObject> message) {
        logger.info(message.body());
        JsonObject result = new JsonObject(message.body().encodePrettily());
        JsonArray r = result.getArray("results");
        if (r.isArray()) {
            if (r.size() > 0) {
                String out = r.get(0).toString();
                req.response().end(out);
            } else {
                req.response().end("{}");
            }
        } else {
            req.response().end(message.body().encodePrettily());
        }
    }
});
I think in your case you might be better off taking a different approach to modeling your data.
In terms of analytics I would recommend the lambda architecture approach as quoted below:
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute
the batch views.
The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from batch views and real-time views.
With the above in mind, why not have an aggregates collection that holds the aggregated data in the format your queries require, while at the same time keeping a raw copy in the format you described?
That way you have a view over the data in the required query format, and a way to recreate the aggregated data in case your system backfires.
Reference for the quotes: Lambda Architecture
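As a concrete sketch of the aggregates-collection idea, here is one way to maintain and query pre-computed daily totals with the plain MongoDB Java driver (the daily_totals collection, the field names, and the use of the driver instead of the mongodb-persistor module are all assumptions for illustration):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class SpendAggregator {
    private final MongoCollection<Document> dailyTotals; // e.g. db.getCollection("daily_totals")

    public SpendAggregator(MongoCollection<Document> dailyTotals) {
        this.dailyTotals = dailyTotals;
    }

    // Called once per incoming spend event: keeps a running total per day,
    // so a date-range query only touches one document per day.
    public void recordSpend(String day, long amount) {
        dailyTotals.updateOne(
                Filters.eq("_id", day),                // e.g. "2015-08-01"
                Updates.inc("total", amount),
                new UpdateOptions().upsert(true));
    }

    // Sums the pre-aggregated documents for the requested range in one query.
    public long totalBetween(String fromDay, String toDay) {
        long sum = 0;
        for (Document d : dailyTotals.find(Filters.and(Filters.gte("_id", fromDay), Filters.lte("_id", toDay)))) {
            sum += ((Number) d.get("total")).longValue();
        }
        return sum;
    }
}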
I am working on a project in which I need to delete all the columns and their data, except for one column and its data, in Cassandra using the Astyanax client.
I have a dynamic column family like the one below, and we already have a couple of million records in that column family.
create column family USER_TEST
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and gc_grace = 86400
and column_metadata = [ {column_name : 'lmd', validation_class : DateType}];
I have user_id as the row key, and the other columns I have are something like this:
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,lmd
Now I need to delete all the columns and their data except for the a15 column. Meaning, I want to keep the a15 column and its data for every user_id (row key) and delete the rest of the columns and their data.
I already know how to delete data from Cassandra using the Astyanax client for a particular row key:
public void deleteRecord(final String rowKey) {
    try {
        MutationBatch m = AstyanaxConnection.getInstance().getKeyspace().prepareMutationBatch();
        m.withRow(AstyanaxConnection.getInstance().getEmp_cf(), rowKey).delete();
        m.execute();
    } catch (ConnectionException e) {
        // some code
    } catch (Exception e) {
        // some code
    }
}
Now, how do I delete all the columns and their data except for one column, for all the user ids (my row key)?
Any thoughts on how this can be done efficiently using the Astyanax client?
It appears that Astyanax does not currently support the slice delete functionality, which is a fairly recent addition to both the storage engine and the Thrift API. See the Thrift API reference: http://wiki.apache.org/cassandra/API10
There you can see that the delete operation takes a SlicePredicate, which can take either a list of columns or a SliceRange. A SliceRange could specify all columns greater or less than the column you wanted to keep, so that would allow you to do two slice delete operations to delete all but one of the columns in the row.
Unfortunately, Astyanax can only delete an entire row or a defined list of columns and doesn't wrap the full SlicePredicate functionality. So it looks like you have a few options:
1) See about sending a raw thrift slice delete, bypassing Astyanax wrapper, or
2) Do a column read, followed by a row delete, followed by a column write. This is not ideally efficient, but if it isn't done too frequently shouldn't be prohibitive.
or
3) Read the entire row and explicitly delete all of the columns other than the one you want to preserve (a rough sketch of this follows at the end of this answer).
I should note that while the storage engine and thrift API both support slice deletes, this is also not yet explicitly supported by CQL.
I filed this ticket to address that last limitation:
https://issues.apache.org/jira/browse/CASSANDRA-6292
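Here is what option 3 could roughly look like with Astyanax, reusing the keyspace/column-family accessors from your deleteRecord method and assuming getEmp_cf() returns a ColumnFamily<String, String>; treat it as a sketch (no paging tuning, retries or error handling), not a drop-in solution:

public void deleteAllColumnsExcept(final String columnToKeep) throws ConnectionException {
    Keyspace keyspace = AstyanaxConnection.getInstance().getKeyspace();
    ColumnFamily<String, String> cf = AstyanaxConnection.getInstance().getEmp_cf();

    // Page through every row of the column family.
    Rows<String, String> rows = keyspace.prepareQuery(cf)
            .getAllRows()
            .setRowLimit(100) // rows fetched per page
            .execute()
            .getResult();

    for (Row<String, String> row : rows) {
        MutationBatch m = keyspace.prepareMutationBatch();
        ColumnListMutation<String> rowMutation = m.withRow(cf, row.getKey());
        for (Column<String> column : row.getColumns()) {
            if (!columnToKeep.equals(column.getName())) {
                rowMutation.deleteColumn(column.getName()); // tombstone every column except the one to keep
            }
        }
        m.execute();
    }
}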
I have a program that is used to replicate/mirror the main tables (around 20) from Oracle to MSSQL 2005 via a webservice (REST).
The program periodically reads XML data from the webservice and converts it to a list of JPA entities. This list of entities is then stored to MSSQL via JPA.
All JPA entities are provided by the team that created the webservice.
There are two issues that I have noticed and that seem unsolvable after some searching.
1st issue: The performance of inserting/updating via JDBC/JPA is very slow; it takes around 0.1 s per row...
Doing the same via C# -> DataTable -> bulk insert into a new table in the DB -> stored procedure doing a mass insert/update based on joins takes 0.01 s for 4000 records.
(Each table will have around 500-5000 records every 5 minutes.)
Below is a snapshot of the Java code that does the task. Persistence library: EclipseLink JPA 2.0.
private void GetEntityA(OurClient client, EntityManager em, DBWriter dbWriter) {
    // code to log time and others
    List<EntityA> response = client.findEntityA_XML();
    em.setFlushMode(FlushModeType.COMMIT);
    em.getTransaction().begin();
    int count = 0;
    for (EntityA object : response) {
        count++;
        em.merge(object);
        // Batch commit
        if (count % 1000 == 0) {
            try {
                em.getTransaction().commit();
                em.getTransaction().begin();
                commitRecords = count;
            } catch (Exception e) {
                em.getTransaction().rollback();
            }
        }
    }
    try {
        em.getTransaction().commit();
    } catch (Exception e) {
        em.getTransaction().rollback();
    }
    // dbWriter writes a log entry to the DB
}
Is there anything I am doing wrong that causes the slowness? How can I improve the insert/update speed?
2nd issue: There are around 20 tables to replicate, and I have created the same number of methods similar to the one above, basically copying the method 20 times and replacing EntityA with EntityB and so on, you get the idea...
Is there any way to generalize the method so that I can throw in any entity?
The performance of inserting/updating via JDBC/JPA is very slow,
OR mappers are generally slow for bulk inserts, by definition. You want speed? Use another approach.
In general, an ORM will not cater for the bulk insert / stored procedure approach and thus gets slaughtered here. You are using the wrong approach for high-performance inserts.
There are around 20 tables to replicate, and I have created the same number of methods similar to the one above, basically copying the method 20 times and replacing EntityA with EntityB and so on, you get the idea...
Generics. Part of Java for some time now.
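For example, a hedged sketch of the generic variant (the em.clear() after each commit and the per-entity fetch calls at the end are my additions, not something from the question):

private <T> void replicate(List<T> response, EntityManager em) {
    em.setFlushMode(FlushModeType.COMMIT);
    em.getTransaction().begin();
    int count = 0;
    for (T object : response) {
        count++;
        em.merge(object);
        if (count % 1000 == 0) {
            em.getTransaction().commit();
            em.clear(); // detach the merged copies so the persistence context stays small
            em.getTransaction().begin();
        }
    }
    em.getTransaction().commit();
}

// usage: one line per table instead of one near-identical method per table
// replicate(client.findEntityA_XML(), em);
// replicate(client.findEntityB_XML(), em);  // findEntityB_XML is a hypothetical name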
You can execute SQL, stored procedures, or JPQL update-all queries through JPA as well. I'm not sure where these objects are coming from, but if you are just migrating one table to another in the same database, you can do the same thing in Java with JPA that you were doing in C#.
If you want to process the objects in JPA, then see,
http://java-persistence-performance.blogspot.com/2011/06/how-to-improve-jpa-performance-by-1825.html
For #2, change EntityA to Object, and you have a generic method.
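If you stay on JPA, enabling EclipseLink's JDBC batch writing (one of the standard EclipseLink optimizations) usually helps as well; a minimal sketch, assuming a persistence unit named replicationPU (a hypothetical name):

Map<String, String> props = new HashMap<>();
props.put("eclipselink.jdbc.batch-writing", "JDBC");      // group INSERTs/UPDATEs into JDBC batches
props.put("eclipselink.jdbc.batch-writing.size", "1000"); // statements per batch
EntityManagerFactory emf = Persistence.createEntityManagerFactory("replicationPU", props);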