I am working on a monitoring tool developed in Spring Boot using Hibernate as the ORM.
I need to compare each row (already persisted rows of sent messages) in my table and see whether a MailId (unique) has received feedback (status: OPENED, BOUNCED, DELIVERED, ...) or not.
I get the feedback by reading CSV files from a network folder. The CSV parsing and file reading are very fast, but updating my database is very slow. My algorithm is not very efficient, because I loop through a list that can contain hundreds of thousands of objects and look each one up in my table.
This is the method that performs the update in my table by updating the "target" object (a row in the database table):
@Override
public void updateTargetObjectFoo() throws CSVProcessingException, FileNotFoundException {
    // performProcessing reads the files in a folder, parses them into Java objects
    // and maps them into a feedback list of type Foo
    List<Foo> feedBackList = performProcessing(env.getProperty("foo_in"),
            EXPECTED_HEADER_FIELDS_STATUS, Foo.class, ".LETTERS.STATUS.");
    for (Foo foo : feedBackList) {
        // findByKey does a simple SELECT in MySQL where MailId = foo.getMailId()
        Foo persistedFoo = fooDao.findByKey(foo.getMailId());
        if (persistedFoo != null) {
            persistedFoo.setStatus(foo.getStatus());
            persistedFoo.setDnsCode(foo.getDnsCode());
            persistedFoo.setReturnDate(foo.getReturnDate());
            persistedFoo.setReturnTime(foo.getReturnTime());
            // saveAccount does a MySQL UPDATE on the table
            fooDao.saveAccount(persistedFoo);
        }
    }
}
What if I performed this selection/comparison and update on the Java side, and then wrote the whole list back to the database in one go?
Would that be faster?
Thanks to all for your help.
Hibernate is not particularly well-suited for batch processing.
You may be better off using Spring's JdbcTemplate to do JDBC batch processing.
However, if you must do this via Hibernate, this may help: https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/chapters/batch/Batching.html
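To illustrate the JdbcTemplate suggestion, here is a minimal sketch of chunked batch updating. The table and column names (`foo`, `mail_id`, etc.) are assumptions inferred from the question, and the `JdbcTemplate.batchUpdate` call is only shown in a comment; the chunking helper is plain Java.

```java
import java.util.ArrayList;
import java.util.List;

public class FeedbackBatchSketch {

    /** Split a large list into fixed-size chunks so each JDBC batch stays bounded. */
    public static <T> List<List<T>> partition(List<T> list, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < list.size(); i += size) {
            chunks.add(new ArrayList<>(list.subList(i, Math.min(i + size, list.size()))));
        }
        return chunks;
    }

    // Per chunk, one batched round trip replaces one SELECT plus one UPDATE per row
    // (hypothetical table/column names):
    //
    // jdbcTemplate.batchUpdate(
    //     "UPDATE foo SET status = ?, dns_code = ?, return_date = ?, return_time = ?"
    //         + " WHERE mail_id = ?",
    //     chunk.stream()
    //          .map(f -> new Object[] { f.getStatus(), f.getDnsCode(),
    //                                   f.getReturnDate(), f.getReturnTime(), f.getMailId() })
    //          .collect(java.util.stream.Collectors.toList()));
}
```

A nice side effect: rows whose mail_id has no match simply update zero rows, so the per-row existence SELECT disappears entirely.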
Related
I have a large Excel file with 32k rows, and Java/Spring code that persists the Excel data to a MySQL database. My code works for about 6k rows, but not for the entire Excel file, due to a JPA limitation. I read that it can be done with JPA pagination, but so far I have only found info that collects data from a DB (already persisted) and renders it to a UI. The Excel file contains 32k medicines, and these rows need to be persisted into the DB.
I have this Controller layer with the following method:
public ResponseEntity<ResponseMessage> uploadFile(@RequestParam("file") MultipartFile file,
        @RequestParam(defaultValue = "0") int page,
        @RequestParam(defaultValue = "6000") int size) {
    String message = "";
    if (ExcelHelper.hasExcelFormat(file)) {
        try {
            // the following rows are my pathetic attempt to resolve this with pagination
            List<Medicine> medicines = new ArrayList<>();
            Pageable paging = PageRequest.of(page, size);
            Page<Medicine> pageMedicamente = medicineRepositoryDao.save(paging);
            medicines = pageMedicamente.getContent();
            medicineService.save(file);
            message = "Uploaded the file successfully: " + file.getOriginalFilename();
            return ResponseEntity.status(HttpStatus.OK).body(new ResponseMessage(message));
        } catch (Exception e) {
            message = "Could not upload the file: " + file.getOriginalFilename() + "!";
            return ResponseEntity.status(HttpStatus.EXPECTATION_FAILED).body(new ResponseMessage(message));
        }
    }
And the Repository layer:
@Repository
public interface MedicineRepositoryDao extends JpaRepository<Medicine, Long> {
    Page<Medicine> save(Pageable pageable);
}
And also the Service layer:
public void save(MultipartFile file) {
    try {
        List<Medicine> medicines = ExcelHelper.excelToMedicine(file.getInputStream());
        medicineRepositoryDao.saveAll(medicines);
    } catch (IOException e) {
        throw new RuntimeException("fail to store excel data: " + e.getMessage());
    }
}
I think you have a couple of things mixed up here.
I don't think Spring has any relevant limitation on the number of rows you may persist here, but JPA does. JPA keeps a reference to every entity that you save in its first-level cache. For a large number of rows/entities this hogs memory and also makes some operations slower, since entities get looked up or processed one by one.
Pagination is for reading entities, not for saving.
You have a couple of options in this situation.
Don't use JPA. For simply reading data from a file and writing it into a database, JPA hardly offers any benefit. This can be done almost trivially with just a JdbcTemplate or NamedParameterJdbcTemplate, and it will be much faster, since you skip the overhead of JPA, which you don't benefit from in this scenario anyway. If you want to use an ORM, you might take a look at Spring Data JDBC, which is conceptually simpler, doesn't keep references to entities, and should therefore behave better in this scenario. That said, I recommend not using an ORM here at all: since you don't seem to benefit from having entities, creating them and then having the ORM extract the data from them again is really a waste of time.
Break your import into batches. This means you persist, e.g., 1000 rows at a time, write them to the database, and commit the transaction before you continue with the next 1000 rows. For JPA this is pretty much a necessity, for the reasons laid out above. With JDBC (i.e. JdbcTemplate & Co.) this probably isn't necessary for 32k rows, but it might improve performance and can be useful for recoverability if an insert fails. Spring Batch can help you implement that.
While the previous point talks about batching in the sense of breaking your import into chunks, you should also look into batching on the JDBC side, where you send multiple statements, or a single statement with multiple sets of parameters, to the database in one go. This again should improve performance.
Finally, there are often alternatives outside the Java world that are more suitable for the job. Some databases have tools to load flat files extremely efficiently (MySQL's LOAD DATA INFILE, for example).
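To make the JDBC-side batching concrete, here is a minimal sketch: build one multi-row INSERT per chunk. The `medicine` table and its columns are assumptions for illustration; the statement builder is plain Java, and the JdbcTemplate execution is indicated in a comment.

```java
import java.util.Collections;
import java.util.List;

public class MultiRowInsertSketch {

    /** Build "INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ..." for one chunk of rows. */
    public static String multiRowInsert(String table, List<String> columns, int rows) {
        String oneRow = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rows, oneRow));
    }

    // Executed once per chunk of, say, 1000 rows, committing between chunks
    // (hypothetical column names and flattening helper):
    // jdbcTemplate.update(
    //     multiRowInsert("medicine", List.of("name", "dose"), chunk.size()),
    //     flattenedParams(chunk));
}
```

Alternatively, `JdbcTemplate.batchUpdate` with one parameter array per row achieves the same effect with a single-row INSERT statement.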
I'm trying to find the best/optimal way of loading larger amounts of data from a MySQL database in a Spring/Hibernate service.
I pull about 100k records from a 3rd-party API (in chunks, usually between 300 and 1000). I then need to pull translations for each record from the database; since there are 30 languages, there are 30 rows per record, so 1000 records from the API means 30,000 rows from the database.
The records from the API come in the form of POJOs (super small in size). Say I get 1000 records: I split the list into multiple 100-record lists, collect the ids of each record, and select all translations from the database for those records. I only need two values from the table, which I then add to my POJOs, and then I push the POJOs to the next service.
Basically this:
interface I18nRepository extends CrudRepository<Translation, Long> {
    List<Translation> findAllByRecordIdIn(List<Long> ids);
}

List<APIRecord> records = api.findRecords(...);
List<List<APIRecord>> partitioned = Lists.partition(records, 100); // Guava
for (List<APIRecord> chunk : partitioned) {
    List<Long> ids = new ArrayList<>();
    for (APIRecord record : chunk) {
        ids.add(record.getId());
    }
    List<Translation> translations = i18nRepository.findAllByRecordIdIn(ids);
    for (APIRecord record : chunk) {
        for (Translation translation : translations) {
            if (translation.getRecordId().equals(record.getId())) {
                record.addTranslation(translation);
            }
        }
    }
}
As far as Spring Boot/Hibernate properties go, I only have the defaults set. I would like to make this as efficient, fast, and memory-light as possible. One idea I had was to use a lower-level API instead of Hibernate to bypass the object mapping.
In my opinion, you should bypass JPA/Hibernate for bulk operations.
There's no way to make bulk operations truly efficient in JPA.
Consider using Spring's JdbcTemplate and native SQL.
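Whichever data-access route you take, note that the nested comparison loop in the question does records x translations comparisons per chunk. Indexing the fetched translations by record id first reduces the merge to a single pass. A minimal sketch with a generic helper (the commented merge step mirrors the question's hypothetical `APIRecord`/`Translation` types):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

public class TranslationJoinSketch {

    /** Group items under their record id so each lookup during the merge is O(1). */
    public static <T> Map<Long, List<T>> indexByRecordId(List<T> translations,
                                                         ToLongFunction<T> recordIdOf) {
        Map<Long, List<T>> byRecord = new HashMap<>();
        for (T t : translations) {
            byRecord.computeIfAbsent(recordIdOf.applyAsLong(t), k -> new ArrayList<>()).add(t);
        }
        return byRecord;
    }

    // Merge step, replacing the nested loops from the question:
    // Map<Long, List<Translation>> byRecord =
    //         indexByRecordId(translations, Translation::getRecordId);
    // for (APIRecord record : chunk) {
    //     for (Translation t : byRecord.getOrDefault(record.getId(), List.of())) {
    //         record.addTranslation(t);
    //     }
    // }
}
```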
I'm using Spark 2.1 to read data from Cassandra in Java.
I tried the code posted in https://stackoverflow.com/a/39890996/1151472 (with SparkSession) and it worked. However, when I replaced the spark.read() method with the spark.sql() one, the following exception was thrown:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `wiki`.`treated_article`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `wiki`.`treated_article`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I'm using the same Spark configuration for both the read and sql methods.
read() code:
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
sql() code:
spark.sql("SELECT * FROM WIKI.TREATED_ARTICLE");
Spark SQL uses a catalogue to look up database and table references. When you write a table identifier that isn't in the catalogue, it throws errors like the one you posted. The read command doesn't require a catalogue, since you are required to specify all of the relevant information in the invocation.
You can add entries to the catalogue either by
Registering DataSets as Views
First create your DataSet
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "wiki");
                put("table", "treated_article");
            }
        }).load();
Then use one of the catalogue registry functions
void createGlobalTempView(String viewName)
Creates a global temporary view using the given name.
void createOrReplaceTempView(String viewName)
Creates a local temporary view using the given name, replacing any existing view with that name.
void createTempView(String viewName)
Creates a local temporary view using the given name; fails if a view with that name already exists.
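Put together in Java, the registration round trip might look like this sketch (it assumes the same SparkSession `spark` and the keyspace/table from the question; note that after registering a temp view, the query uses the plain view name, not the keyspace-qualified `WIKI.TREATED_ARTICLE`):

```java
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Load the Cassandra table once (as in the working read() example),
// register it as a temp view, then query it through the catalogue.
Dataset<Row> dataset = spark.read()
        .format("org.apache.spark.sql.cassandra")
        .options(Map.of("keyspace", "wiki", "table", "treated_article"))
        .load();

dataset.createOrReplaceTempView("treated_article");

// The identifier now resolves against the catalogue.
Dataset<Row> result = spark.sql("SELECT * FROM treated_article");
```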
OR Using a SQL Create Statement
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test",
cluster "Test Cluster",
pushdown "true"
)
Once added to the catalogue by either of these methods you can reference the table in all sql calls issued by that context.
Example
CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (
table "words",
keyspace "test"
);
SELECT * FROM words;
// Hello 1
// World 2
The DataStax (my employer) Enterprise software automatically registers all Cassandra tables by placing entries in the Hive metastore, which Spark uses as a catalogue. This makes all tables accessible without manual registration.
This method allows SELECT statements to be used without an accompanying CREATE VIEW.
I cannot think of a way to make that work off the top of my head. The problem is that Spark doesn't know which format to try, and the position where this would be specified is taken by the keyspace. The closest documentation for something like this that I can find is in the DataFrames section of the Cassandra connector documentation. You could try to specify a USING clause, but I don't think that will work inside a SELECT. So your best bet beyond that is to create a PR to handle this case, or stick with the read DSL.
I have an application using Hibernate. One of its modules calls native SQL (a stored proc) in a batch process. Roughly, what it does is that every time it writes a file it updates a field in the database. Right now I am not sure how many files will need to be written, as it depends on the number of transactions per day, so it could be anywhere from zero to a million.
If I use this code snippet in a loop, will I have any problems?
@Transactional
public void test() {
    // The for loop represents a list of records that needs to be processed.
    for (int i = 0; i < 1000000; i++) {
        // Process the record and write the information into a file.
        ...
        // Update field(s) in the database using a stored procedure based on the processed information.
        updateField(String.valueOf(i));
    }
}

@Transactional(propagation = Propagation.MANDATORY)
public void updateField(String value) {
    Session session = getSession();
    SQLQuery sqlQuery = session.createSQLQuery("exec spUpdate :value");
    sqlQuery.setParameter("value", value);
    sqlQuery.executeUpdate();
}
Will I need any other configuration for my data source and transaction manager?
Will I need to set hibernate.jdbc.batch_size and hibernate.cache.use_second_level_cache?
Will I need to use session flush and clear for this? The samples in the Hibernate tutorial use POJOs and not native SQL, so I am not sure whether they also apply here.
Please note that another part of the application is already using Hibernate, so as much as possible I would like to stick with Hibernate.
Thank you for your time; I am hoping for a quick response. If possible, a code snippet would be really useful for me.
Application Work Flow
1) Query the database for the transaction information. (Transaction date, type of account, currency, etc.)
2) For each account, process the transaction information. (Discounts, current balance, etc.)
3) Write the transaction information and the processed information to a file.
4) Update a database field based on the processed information.
5) Go back to step 2 while there are still accounts. (Assuming that no exceptions are thrown.)
The code snippet will open and close the session for each iteration, which is definitely not good practice.
Would it be possible to have a job which checks how many new files were added to the folder?
The job could run, say, every 15-25 minutes, check which files were changed or added in the last 15-25 minutes, and update the database in a batch.
Something like that will lower the number of opened/closed session connections. It should be much faster than this.
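A rough sketch of that polling idea (all names here are made up for illustration): keep the previous run's timestamp, select only the files modified since then, and update the database for just that set in one batch. The scheduler itself could be Spring's `@Scheduled`, indicated in a comment.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ChangedFileSketch {

    /** Return the paths whose last-modified timestamp is at or after the cutoff. */
    public static List<String> changedSince(Map<String, Long> lastModifiedByPath,
                                            long cutoffMillis) {
        return lastModifiedByPath.entrySet().stream()
                .filter(e -> e.getValue() >= cutoffMillis)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    // A scheduler, e.g. Spring's @Scheduled(fixedDelay = 15 * 60 * 1000), would call
    // this every 15-25 minutes and then run one batched update for the affected records,
    // inside a single session/transaction.
}
```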
I have a program that is used to replicate/mirror the main tables (around 20) from Oracle to MSSQL 2005 via a webservice (REST).
The program periodically reads XML data from the webservice and converts it to a list of JPA entities. This list of entities is then stored to MSSQL via JPA.
All JPA entities are provided by the team who created the webservice.
There are two issues that I have noticed and that seem unsolvable after some searching.
1st issue: the performance of inserting/updating via JPA is very slow; it takes around 0.1 s per row...
Doing the same via C# -> DataTable -> bulk insert into a new table in the DB -> calling a stored procedure to do a mass insert/update based on joins takes 0.01 s for 4000 records.
(Each table will have around 500-5000 records every 5 minutes.)
Below is a snapshot of the Java code that does the task (persistence library: EclipseLink JPA 2.0):
private void getEntityA(OurClient client, EntityManager em, DBWriter dbWriter) {
    // code to log time and other things
    List<EntityA> response = client.findEntityA_XML();
    em.setFlushMode(FlushModeType.COMMIT);
    em.getTransaction().begin();
    int count = 0;
    for (EntityA object : response) {
        count++;
        em.merge(object);
        // Batch commit
        if (count % 1000 == 0) {
            try {
                em.getTransaction().commit();
                em.getTransaction().begin();
                commitRecords = count;
            } catch (Exception e) {
                em.getTransaction().rollback();
            }
        }
    }
    try {
        em.getTransaction().commit();
    } catch (Exception e) {
        em.getTransaction().rollback();
    }
    // dbWriter writes a log to the DB
}
Am I doing anything wrong that causes the slowness? How can I improve the insert/update speed?
2nd issue: there are around 20 tables to replicate, and I have created the same number of methods similar to the above, basically copying the method 20 times and replacing EntityA with EntityB and so on; you get the idea...
Is there any way to generalize the method so that I can throw in any entity?
The performance of inserting/updating via JPA is very slow
OR mappers are generally slow for bulk inserts, per definition. You want speed? Use another approach.
In general, an ORM will not cater for the bulk insert / stored procedure approach and thus gets slaughtered here. You are using the wrong approach for high-performance inserts.
There are around 20 tables to replicate and I have created the same number of methods similar to the above, basically copying the method 20 times and replacing EntityA with EntityB and so on, you get the idea...
Generics. Part of Java for some time now.
You can execute SQL, stored procedure, or JPQL update-all queries through JPA as well. I'm not sure where these objects are coming from, but if you are just migrating one table to another in the same database, you can do the same thing in Java with JPA that you were doing in C#.
If you want to process the objects in JPA, then see:
http://java-persistence-performance.blogspot.com/2011/06/how-to-improve-jpa-performance-by-1825.html
For #2, change EntityA to Object and you have a generic method.
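The generics point can be sketched without tying it to EclipseLink: make the entity type a type parameter and pass the merge/commit actions in as callbacks, so one method serves all 20 tables. The helper and its callback shape below are hypothetical, not a JPA API; the JPA wiring is indicated in a comment.

```java
import java.util.List;
import java.util.function.Consumer;

public class GenericBatchSketch {

    /** Merge entities of any type in batches, committing every batchSize entities. */
    public static <T> int mergeInBatches(List<T> entities, int batchSize,
                                         Consumer<T> merge, Runnable commit) {
        int count = 0;
        for (T entity : entities) {
            merge.accept(entity);
            if (++count % batchSize == 0) {
                commit.run();  // commit a full batch; also a good point to clear the persistence context
            }
        }
        if (count % batchSize != 0) {
            commit.run();  // commit the final partial batch
        }
        return count;
    }

    // Usage with JPA (sketch), replacing the 20 copied methods:
    // mergeInBatches(response, 1000, em::merge, () -> {
    //     em.getTransaction().commit();
    //     em.clear();  // release first-level cache between batches
    //     em.getTransaction().begin();
    // });
}
```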