Partitioning with Hibernate - java

We have a requirement to delete data in the range of 200K from database everyday. Our application is Java/Java EE based using Oracle DB and Hibernate ORM tool.
We explored various options like
Hibernate batch processing
Stored procedure
Database partitioning
Our DBA suggests database partitioning is the best way to go, so we can easily recreate and drop the partitioned table everyday. Now the issue is we have 2 kinds of data, one which we want to delete everyday and the other which we want to keep it. Suppose this data is stored in table "Trade". Now with partitioning, we have 2 tables "Trade". We have already existing Hibernate based DAO layer to fetch/store trades from/to DB. When we decide to partition the database, how can we control the trades to go in which of the two tables through hibernate. Basically I want , the trades need to be deleted by end of the day, to go in partitioned table and the trades I want to keep, in main table. Please suggest how can this be possible with Hibernate. We may add an additional column to identify the trades to be deleted but how can we ensure these trades should go to partitioned trade table using hibernate.
I would appreciate if someone can suggest any better approach in case we are on wrong path.

When we decide to partition the database, how can we control the trades to go in which of the two tables through hibernate.
That's what Hibernate Shards is for.

You could use hibernate inheritance strategy.
If you know at object creation that it will be deleted by the end of the day, you can create a VolatileTrade that is a subclass of Trade (with no other attribute). Use the 'table per concrete class' strategy (section 9.1.5 of hibernate 3.3 reference documentation) for the mapping.
(I think i would do an abstract superclass Trade, and two concrete subclasses : PersistentTrade and VolatileTrade, so that if you have some other classes that you know will reference only PersistentTrade (or Volatile), you can constrain that in your code. If you had used the Trade superclass as the PersistentTrade, you won't be able to enforce that.)
The volatile trade will go in one table and the 'persitent' trade will go in another table.
Be aware that you won't be able to set a fk constraint on any Trade (persistent and volatile) from other table in the db.
Then you just have to clear the table when you want.
Be careful to define a locking mechanism so that no other thread will try to write data to the table during the drop and the create (if you use that). That won't be an easy task, and doing it rightfully might impact the performance of all operation inserting data in the table (as it will require acquiring the lock).
Wouldn't it be more easy to truncate the table ?

Related

How to bind entity to different tables with SpringJPA/SpringBoot

I want to design a system. There are different customers using this system. I need to create the duplicated tables for every customer. For example, I have a table Order, then all of order records for customerA are in table Order_A, as well as customerB data are in table Order_B. I can distinct different customers from session, but how can I let Spring JPA to reflect the RDS table data to Java object?
I know 2 solutions, but both are not satisfied.
Consider to use Mybatis because it supports load SQL from xml file and parameters inside SQL;
Consider to use org.hibernate.EmptyInterceptor. This is my current implement in my project. For every entity, I must define a subclass of it. It can update the SQL before Hibernate's execution.
However, both are not graceful. I prefer the better solution.

How to optimize one big insert with hibernate

For my website, I'm creating a book database. I have a catalog, with a root node, each node have subnodes, each subnode has documents, each document has versions, and each version is made of several paragraphs.
In order to create this database the fastest possible, I'm first creating the entire tree model, in memory, and then I call session.save(rootNode)
This single save will populate my entire database (at the end when I'm doing a mysqldump on the database it weights 1Go)
The save coasts a lot (more than an hour), and since the database grows with new books and new versions of existing books, it coasts more and more. I would like to optimize this save.
I've tried to increase the batch_size. But it changes nothing since it's a unique save. When I mysqldump a script, and I insert it back into mysql, the operation coast 2 minutes or less.
And when I'm doing a "htop" on the ubuntu machine, I can see the mysql is only using 2 or 3 % CPU. Which means that it's hibernate who's slow.
If someone could give me possible techniques that I could try, or possible leads, it would be great... I already know some of the reasons, why it takes time. If someone wants to discuss it with me, thanks for his help.
Here are some of my problems (I think): For exemple, I have self assigned ids for most of my entities. Because of that, hibernate is checking each time if the line exists before it saves it. I don't need this because, the batch I'm executing, is executed only one, when I create the databse from scratch. The best would be to tell hibernate to ignore the primaryKey rules (like mysqldump does) and reenabeling the key checking once the database has been created. It's just a one shot batch, to initialize my database.
Second problem would be again about the foreign keys. Hibernate inserts lines with null values, then, makes an update in order to make foreign keys work.
About using another technology : I would like to make this batch work with hibernate because after, all my website is working very well with hibernate, and if it's hibernate who creates the databse, I'm sure the naming rules, and every foreign keys will be well created.
Finally, it's a readonly database. (I have a user database, which is using innodb, where I do updates, and insert while my website is running, but the document database is readonly and mYisam)
Here is a exemple of what I'm doing
TreeNode rootNode = new TreeNode();
recursiveLoadSubNodes(rootNode); // This method creates my big tree, in memory only.
hibernateSession.beginTrasaction();
hibernateSession.save(rootNode); // during more than an hour, it saves 1Go of datas : hundreads of sub treeNodes, thousands of documents, tens of thousands paragraphs.
hibernateSession.getTransaction().commit();
It's a little hard to guess what could be the problem here but I could think of 3 things:
Increasing batch_size only might not help because - depending on your model - inserts might be interleaved (i.e. A B A B ...). You can allow Hibernate to reorder inserts and updates so that they can be batched (i.e. A A ... B B ...).Depending on your model this might not work because the inserts might not be batchable. The necessary properties would be hibernate.order_inserts and hibernate.order_updates and a blog post that describes the situation can be found here: https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
If the entities don't already exist (which seems to be the case) then the problem might be the first level cache. This cache will cause Hibernate to get slower and slower because each time it wants to flush changes it will check all entries in the cache by iterating over them and calling equals() (or something similar). As you can see that will take longer with each new entity that's created.To Fix that you could either try to disable the first level cache (I'd have to look up whether that's possible for write operations and how this is done - or you do that :) ) or try to keep the cache small, e.g. by inserting the books yourself and evicting each book from the first level cache after the insert (you could also go deeper and do that on the document or paragraph level).
It might not actually be Hibernate (or at least not alone) but your DB as well. Note that restoring dumps often removes/disables constraint checks and indices along with other optimizations so comparing that with Hibernate isn't that useful. What you'd need to do is create a bunch of insert statements and then just execute those - ideally via a JDBC batch - on an empty database but with all constraints and indices enabled. That would provide a more accurate benchmark.
Assuming that comparison shows that the plain SQL insert isn't that much faster then you could decide to either keep what you have so far or refactor your batch insert to temporarily disable (or remove and re-create) constraints and indices.
Alternatively you could try not to use Hibernate at all or change your model - if that's possible given your requirements which I don't know. That means you could try to generate and execute the SQL queries yourself, use a NoSQL database or NoSQL storage in a SQL database that supports it - like Postgres.
We're doing something similar, i.e. we have Hibernate entities that contain some complex data which is stored in a JSONB column. Hibernate can read and write that column via a custom usertype but it can't filter (Postgres would support that but we didn't manage to enable the necessary syntax in Hibernate).

Audit Using Hibernate Envers

I am using hibernate envers for making history of my data, it's working fine as well. The problem here is, it's creating duplicate data in history table i.e. creating data in history table whether there is any change in audited table or not. I want only changed fields stored in my history table. I am new to hibernate envers. What can I do?
If I understand your question correctly, Envers doesn't work that way, at least not out of the box.
Envers is a commit-snapshot auditing solution where just before commit, it examines audited entity state and determines whether any attributes have been modified or not and records a snapshot of all audited fields of that entity at that point in time. This means that the only time an audit entry isn't created is when no attributes have been modified.
But it also uses the snapshot approach because it fits really well with the Query API.
Consider the inefficiency that would occur if a query to find an entity at a given revision had to read all rows from that revision back to the beginning of time, iterating each row and merging the column state captured to just instantiate a single row result-set.
With the snapshot approach, it boils down to the following query, no loops or iterative work.
SELECT e FROM AuditedEntity e WHERE e.revisionNumber = :revisionNumber
This is far more efficient from a I/O perspective both with the database reading the data pages and the network for streaming a single row result-set rather than multi-row result-set to the client.
I'd say in this case, the saying "space is cheap" really holds true when you compare that against the cost and inefficiencies your application would face doing it any other way.
If this is something you'd like Envers to support, perhaps via some user configured strategy then you're welcomed to log a new feature request in JIRA for hibernate-envers and I can take a look at its feasibility.
I had similar problem.
In my case the error was that audited field had higher precision than the database field. Please see my reply to another thread: https://stackoverflow.com/a/65844949/13381019

LinkedList with Serialization in Java

I'm getting introduced to serialization and ran into some problems when pairing it with LinkedList
Consider i have the following table:
CREATE TABLE JAVA_OBJECTS (
ID BIGINT NOT NULL UNIQUE AUTO_INCREMENT,
OBJ_NAME VARCHAR(50),
OBJ_VALUE BLOB
);
And i'm planning to store 3 object types - so the table may look like so -
ID OBJ_NAME OBJ_VALUE
============================
1 Class1 BLOB
2 Class2 BLOB
3 Class1 BLOB
4 Class3 BLOB
5 Class3 BLOB
And i'll use 3 different LinkedList's to manage these objects..
I've been able to implement LoadFromTable() and StoreIntoTable(Class1 obj1).
My question is - if i change an attribute for a Class2 object in LinkedList<Class2>, how do i effect the change in the DB for this individual item? Also take into account that the order of the elements in LinkedList may change..
Thanks : )
* EDIT
Yes, i understand that i'll have to delete/update a row in my DB table. But how do i keep track of WHICH row to update? I'm only storing the objects in the List, not their respective IDs in the table.
You'll have to store their IDs in the objects you are storing. However, I would suggest not trying to roll your own ORM system, and instead use something like Hibernate.
If you change an attribute in a an object or the order of items. You will have to delete that row and insert the updated list again.
How do i effect the change in the DB for this individual item?
I hope I get you right. The SQL update and delete statements allow you to add a WHERE clause in which you chose the ID of the row to update.
e.g.
UPDATE JAVA_OBJECTS SET OBJ_NAME ="new name" WHERE ID = 2
EDIT:
To prevent problems with your Ids you could wrap you object
class Wrapper {
int dbId;
Object obj;
}
And add them instead of the 'naked' object into your LinkedList
You can use AUTO_INCREMENT attribute for your table and then use the mysql_insert_id() function to retrieve the id assigned to the row added/updated by the last INSERT/UPDATE statement. Along with this maintain a map (eg a HashMap) from the java object to the Id. Using this map you can keep track of which row to delete/update.
Edit: See the answer to this question as well.
I think the real problem here is, that you mix and match different levels of abstraction. By storing serialized Java objects into a relational database as BLOBs you have to consider several drawbacks:
You loose interoperability. Applications written in other languages than Java are not able to read the data back. Even other Java applications have to have the class files of the serialized classes in their classpath.
Changing the class definitions of the stored classes will end up in maintenance nightmares.
You give up the advantages of a relational database. Serialization hides the actual data from the database. So the database is presented only with a black box. You are unable to execute any meaningfull query against the real data. All what you have is the ID and block of bytes.
You have to implement low level data handling by yourself. Actually the database is made to handle your data effectively, but because of serialization you hinder it doing its job. So you are on your own and you are running into that problem right now.
So in most cases you benifit from separation of concerns and using the right tool for a job.
Here are some suggestions:
Separate the internal data handling inside your application from persistent storage. Design your database schema in a way to enable the built-in database features to handle the data efficently. In case of a relational database like MySQL you can choose from different technologies like plain JDBC, object relational mappers like JPA or simple mappers like MyBatis. Separation here means to avoid to contaminate the database with implementation specific concerns.
If you have for example in your Java application a List of Person instances and each Person consists of a name and an age. Then you would represent that list in a relational database as a table consisting of a VARCHAR field for the name and a numeric field for the age and maybe a third field for a unique key. Then the database is able to do what it can do best: managing large amounts of data.
Inside your application you typically separate the persistent layer from the rest of your program containing the code to communicate with the database.
In some use cases a relational database may not be the appropiate tool. Maybe in a single user desktop application with a small set of data it may be the best to simply serialize your Person list into a plain file and read it back at the next start up.
But there exists other alternatives to persist your data. Maybe some kind of object oriented database is the right tool. In particular I have experiences with Fast Objects. As a simplification it is serialization on steroids. There is no need for a layer like JPA or JDBC between your application and your database. You are able to store the class instances directly into the database. But unlike the relational database with its BLOB field, the OODB knows your classes and the actual data and can benefit from that.
Another alternative may be JDBM or Berkeley DB.
So separation of concerns and choosing the right persistence strategy (and using it the right way) is a key concern for the success of your project. But doing it right is hard even for experienced developers.

JPA or Hibernates With Oracle Table Partitions?

I need to use an Entity framework with my application, and I have used table - partitions in Oracle database. With simple JDBC, I am able to select data from a specific partition. But I don't know whether I can do the same with hibernate or Eclipse link (JPA). If someone knows how to do that, please do let me know.
usually the select statement in JDBC - SQL is,
select * from TABLE_NAME partiton(PARTITON_NAME) where FIELD_NAME='PARAMETER_VALUE';
How can I do the same with Hibernates or JPA?
Please share at least a link for learning sources.
Thanks!!!
JPA or any other ORM framework does not support Oracle partition tables natively (atleast in my knowledge).
There are different possible solutions though, depending on the nature of your problem:
Refactor your classes so that data that needs to be treated differently in real-life, belongs in a separate class. Sometimes this is called vertical partitioning (partitions are not obtained across rows, rather across columns).
Use Oracle partition tables underneath and use native SQL queries or stored procedures from JPA. This is just a possibile solution (I haven't attempted this).
Use Hibernate Shards. Although the typical use case for Hibernate Shards is not for a single database, it presents a singular view of distributed databases to an application developer.
Related:
JPA Performance, Don't Ignore the Database
EclipseLink supports partitioning or sharding with different options.
You can find more about this and examples here:
http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Advanced_JPA_Development/Data_Partitioning
Table partitioning is data organization on physical level. In a word, partitioning is a poor man index. Like the later, it is supposed to be entirely transparent to the user. A SQL query is allowed to refer to the entire table, but not partition. Then, it is query optimizer job to decide if it can leverage a certain partition, or index.

Categories