Flexible search in database

Flexible search in database - java

I have a legacy system that allows users to manage some entities called "TRANSACTION" in the (MySQL) DB, and mapped to Transaction class in Java. Transaction objects have about 30 fields, some of them are columns in the DB, some of them are joins to another tables, like CUSTOMER, PRODUCT, COMPANY and stuff like that.
Users have access to a "Search" screen, where they are allowed to search using a TransactionId and a couple of extra fields, but they want more flexibility. Basically, they want to be able to search using any field in TRANSACTION or any linked table.
I don't know how to make the search both flexible and quick. Is there any way?. I don't think that having an index for every combination of columns is a valid solution, but full table scans are also not valid... is there any reasonable design? I'm using Criteria to build the queries, but this is not the problem.
Also, I think mysql is not using the right indexes, since when I make hibernate log the sql command, I can almost always improve the response time by forcing an index... I'm starting to use something like this trick adapted to Criteria to force a specific index use, but I'm not proud of the "if" chain. I'm getting something like
if(queryDto.getFirstName() != null){
//force index "IDX_TX_BY_FIRSTNAME"
}else if(queryDto.getProduct() != null){
//force index "IDX_TX_BY_PRODUCT"
}
and it feels horrible
Sorry if the question is "too open", I think this is a typical problem, but I can't find a good approach

Hibernate is very good for writing while SQL still excels on reading data. JOOQ might be a better alternative in your case, and since you're using MySQL it's free of charge anyway.
JOOQ is like Criteria on steroids, and you can build more complex queries using the exact syntax you'd use for native querying. You have type-safety and all features your current DB has to offer.
As for indexes, you need can't simply use any field combination. It's better to index the most used ones and try using compound indexes that cover as many use cases as possible. Sometimes the query executor will not use an index because it's faster otherwise, so it's not always a good idea to force the index. What works on your test environment might not stand still for the production system.

Related

How to optimize one big insert with hibernate

For my website, I'm creating a book database. I have a catalog, with a root node, each node have subnodes, each subnode has documents, each document has versions, and each version is made of several paragraphs.
In order to create this database the fastest possible, I'm first creating the entire tree model, in memory, and then I call session.save(rootNode)
This single save will populate my entire database (at the end when I'm doing a mysqldump on the database it weights 1Go)
The save coasts a lot (more than an hour), and since the database grows with new books and new versions of existing books, it coasts more and more. I would like to optimize this save.
I've tried to increase the batch_size. But it changes nothing since it's a unique save. When I mysqldump a script, and I insert it back into mysql, the operation coast 2 minutes or less.
And when I'm doing a "htop" on the ubuntu machine, I can see the mysql is only using 2 or 3 % CPU. Which means that it's hibernate who's slow.
If someone could give me possible techniques that I could try, or possible leads, it would be great... I already know some of the reasons, why it takes time. If someone wants to discuss it with me, thanks for his help.
Here are some of my problems (I think): For exemple, I have self assigned ids for most of my entities. Because of that, hibernate is checking each time if the line exists before it saves it. I don't need this because, the batch I'm executing, is executed only one, when I create the databse from scratch. The best would be to tell hibernate to ignore the primaryKey rules (like mysqldump does) and reenabeling the key checking once the database has been created. It's just a one shot batch, to initialize my database.
Second problem would be again about the foreign keys. Hibernate inserts lines with null values, then, makes an update in order to make foreign keys work.
About using another technology : I would like to make this batch work with hibernate because after, all my website is working very well with hibernate, and if it's hibernate who creates the databse, I'm sure the naming rules, and every foreign keys will be well created.
Finally, it's a readonly database. (I have a user database, which is using innodb, where I do updates, and insert while my website is running, but the document database is readonly and mYisam)
Here is a exemple of what I'm doing
TreeNode rootNode = new TreeNode();
recursiveLoadSubNodes(rootNode); // This method creates my big tree, in memory only.
hibernateSession.beginTrasaction();
hibernateSession.save(rootNode); // during more than an hour, it saves 1Go of datas : hundreads of sub treeNodes, thousands of documents, tens of thousands paragraphs.
hibernateSession.getTransaction().commit();

It's a little hard to guess what could be the problem here but I could think of 3 things:
Increasing batch_size only might not help because - depending on your model - inserts might be interleaved (i.e. A B A B ...). You can allow Hibernate to reorder inserts and updates so that they can be batched (i.e. A A ... B B ...).Depending on your model this might not work because the inserts might not be batchable. The necessary properties would be hibernate.order_inserts and hibernate.order_updates and a blog post that describes the situation can be found here: https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
If the entities don't already exist (which seems to be the case) then the problem might be the first level cache. This cache will cause Hibernate to get slower and slower because each time it wants to flush changes it will check all entries in the cache by iterating over them and calling equals() (or something similar). As you can see that will take longer with each new entity that's created.To Fix that you could either try to disable the first level cache (I'd have to look up whether that's possible for write operations and how this is done - or you do that :) ) or try to keep the cache small, e.g. by inserting the books yourself and evicting each book from the first level cache after the insert (you could also go deeper and do that on the document or paragraph level).
It might not actually be Hibernate (or at least not alone) but your DB as well. Note that restoring dumps often removes/disables constraint checks and indices along with other optimizations so comparing that with Hibernate isn't that useful. What you'd need to do is create a bunch of insert statements and then just execute those - ideally via a JDBC batch - on an empty database but with all constraints and indices enabled. That would provide a more accurate benchmark.
Assuming that comparison shows that the plain SQL insert isn't that much faster then you could decide to either keep what you have so far or refactor your batch insert to temporarily disable (or remove and re-create) constraints and indices.
Alternatively you could try not to use Hibernate at all or change your model - if that's possible given your requirements which I don't know. That means you could try to generate and execute the SQL queries yourself, use a NoSQL database or NoSQL storage in a SQL database that supports it - like Postgres.
We're doing something similar, i.e. we have Hibernate entities that contain some complex data which is stored in a JSONB column. Hibernate can read and write that column via a custom usertype but it can't filter (Postgres would support that but we didn't manage to enable the necessary syntax in Hibernate).

Scalable searching algorithm SQL

So I have a list of users stored in a postgres database and i want to search them by username on my (Java) backend and present a truncated list to the user on the front end (like facebook user search). One could of course, in SQL, use
WHERE username = 'john smith';
But I want the search algorithm to be a bit more sophisticated. Beginning with near misses for example
"Michael" ~ "Micheal"
And potentially improving it to use context e.g geographical proximity.
This has been done so many times I feel like I would be reinventing the wheel and doing a bad job of it. Are there libraries that do this? Should this be handled on the backend (so in Java) or in the DB (Postgresql). How do I make this model scalable (i.e use a model in which complexity is easy to add down the road)?

A sophisticated algorithm will not magically appear, you have to implement it. Your last question is whether you should do it in Java or in the database. In overwhelming majority of cases it's better to use the database for queries. Things like "Michael" ~ "Micheal" or spatial queries are standard features in many modern SQL databases. You just have to implement the appropriate SQL query.
Another point is, however, if SQL database is a right tool for "sophisticated queries". You may also consider alternatives like Elasticsearch.

SQL Joins vs Java code?

I have a query like this
Select Folder.name from FROM FolderTable,ValidFolder, ValidFolderGroup, ValidUser,
ValidLocation, ValidDepartment where ValidUser.LocationCode *= ValidLocation.LocationCode
and ValidUser.DepartmentCode *= ValidDepartment.DepartmentCode and Folder.IssueUser =
ValidUser.UserId and ValidFolder.FolderType = Folder.FolderType and
ValidFolderGroup.FolderGroupCode = ValidFolder.FolderGroupCode and
ValidFolderGroup.GroupTypeCode = 13 and (ValidUser.UserId='User' OR
ValidUser.ManagerId='User') and ValidFolderGroup.GroupTypeCode = 13 and
Folder.IssueUser = 'User'
Now here all the table which start with Valid are cache table so these table already contains data .
Suppose if someone using JOOQ or Hibernate which one will be the best option
Use query as written above with all Joins?
Or Use Java code to fulfill the requirement rather than join because as user using Hibernate or JOOQ it already have Java class for the table and Valid table have already all the data ?

Okay, you're probably not going to like this answer, but the best way to do this is not to keep Valid "cached".
The best solution in my opinion would be to use jOOQ (if you prefer DSL) or Hibernate (if you prefer OR mapping) and query the Database every time, and consistently use the DAO pattern.
The jOOQ and Hibernate guys are almost certainly better at SQL than you are. We've used jOOQ and Hibernate in really large enterprise projects, and they both perform exceptionally. Particularly with a good connection pool like BoneCP. If after you've got that setup running, and running well, but still think you may have performance issues, you can always add a cache (like EhCache) afterwards.
Ultimately tho', I'm making a lot of assumptions about your software, namely that
There are more people than you working on it, and
It has to be maintained. If neither of these assumptions are true, then you can safely disregard this answer.

General answer:
Modern databases are incredibly good at optimising your query and choosing the best possible execution plan for you. Given your outer join notation using *=, you're obviously using SQL Server, so that's a pretty good database.
Even if you already have much of the "Valid" data in your application memory, chances are that your database also already has the same data in a buffer cache and thus the database doesn't need to hit the disk again for the various joins in your query.
In fact, depending on the nature of your data, the database might even assess that some of your joins are unneeded (if you have the right meta data, like constraints).
Specific answer:
In your particular case, it looks as though you can indeed strip most of your query yourself and query only the Folder table using search criteria from your application's "Valid" cache. I'm saying that it looks like it, because I don't fully understand the business logic behind those joins and whether they're all modelling 1:1 relationships, or whether removing them will change the semantics of the query.
So, technically, it's possible that you can remove the joins, but if you want to stay on the safe side, just keep things as they are as you migrate to jOOQ or Hibernate.
Alternative 3:
Of course, instead of tampering with this query, you might even be able to remove this query and fetch the Folder.name property already in your previous queries when you load the "Valid" content into memory.

Ever heard of views? Look into them, you'll be amazed.
Apart from that, it's impossible to say what you should do, there's no "best" and you provide way too little information to even make an educated guess about your specific requirements.
But, I'd not hard code things like database IDs in a query that ends up inside any program, far too prone to cause problems in the (near) future.

Result Set to Multi Hash Map

I have a situation here. I have a huge database with >10 columns and millions of rows. I am using a matching algorithm which matches each input records with the values in database.
The database operation is taking lot of time when there are millions of records to match. I am thinking of using a multi-hash map or any resultset alternative so that i can save the whole table in memory and prevent hitting database again....
Can anybody tell me what should i do??

I don't think this is the right way to go. You are trying to do the database's work manually in Java. I'm not saying that you are not capable of doing this, but most databases have been developed for many years and are quite good in doing exactly the thing that you want.
However, databases need to be configured correctly for a given type of query to be executed fast. So my suggestion is that you first check whether you can tweak the database configuration to improve the performance of the query. The most common thing is to add the right indexes to your table. Read How MySQL Uses Indexes or the corresponding part of the manual of your particular database for more information.
The other thing is, if you have so much data storing everything in main memory is probably not faster and might even be infeasible. Not to say that you have to transfer the whole data first.
In any case, try to use a profiler to identify the bottleneck of the program first. Maybe the problem is not even on the database side.

Avoiding N+One selects and Invalid results from eclipselink with batch read

I'm trying to cut down the number of n+1 selects incurred by my application, the application uses EclipseLink as an ORM and in as many places as possible I've tried to add the batch read hint to queries. In a large number of places in the app I don't always know exactly what relationships I'll be traversing (My view displays fields based on user preferences). At that point I'd like to run one query to populate all of those relationships for my objects.
My dream is to call something like ReadAllRelationshipsQuery(Collection,RelationshipName) and populate all of these items so that later calls to:
Collection.get(0).getMyStuff will already be populated and not cause a db query. How can I accomplish this? I'm willing to write any code I need to but I can't find a way that work with the eclipselink framework?
Why don't I just batch read all of the possible fields and let them load lazily? What I've found is that the batch value holders that implement batch reads don't behave well with the eclipselink cache. If a batch read value holder isn't "evaluated" and ends up in the eclipse link cache it can become stale and return incorrect data (This behavior was logged as an eclipselink bug but rejected...)
edit: I found the link to the bug here: https://bugs.eclipse.org/bugs/show_bug.cgi?id=326197
How do I avoid N+1 selects for objects I already have a reference to?

You have three basic ways to load data into objects from a JPA-based solution. These are:
Load dynamically by object traversal (e.g. myObject.getMyCollection().get()).
Load graphs of objects by prefetching dynamically using JPA QL (e.g. FETCH JOINs as described at the Oracle JPA tutorial )
Load by setting the fetch mode ( Is there a way to change the JPA fetch type on a method? )
Each of these has pros and cons.
Loading dynamically by object transversal will generate more (highly targeted queries). These queries are usually small (not large SQL statements, but may load lots of data) and tend to play nicely with a second level cache, but you can get lots and lots of little queries.
Prefetching with JPA QL will give you exactly what you want, but that assumes that you know what you want.
Setting the fetch mode to EAGER will load lots and lots of data for you automatically, but depending on the configuration and usage this may not actually help much (or may make things a lot worse) as you may wind up dragging a LOT of data from the DB into your app that you didn't expect.
Regardless, I highly recommend using p6spy ( http://sourceforge.net/projects/p6spy/ ) in conjunction with any JPA-based application to understand the effects of your tuning.
Unfortunately, JPA makes some things easy and some things hard - mainly, side-effects of your usage. For example, you might fix one problem by setting the fetch mode to eager, and then create another problem where the eager fetch pulls in too much data. EclipseLink does provide tooling to help sort this out ( EclipseLink Performance Tools )
In theory, if you wanted to you could write a generic JavaBean property walker by using something like Apache BeanUtils. Usually just calling a method like size() on a collection is enough to force it to load (although using a collection batch fetch size might complicate things a bit).
One thing to pay particular attention to is the scope of your session and your use of caches (EclipseLink cache).
Something not clear from your post is the scope of a session. Is a session a one shot affair (e.g. like a web page request) or is it a long running thing (e.g. like a classic client/server GUI app)?

It is very difficult to optimize the retrieval of relationships if you do not know what relationships you require.
If you application is requesting what relationships it wants, then you must know at some level which relationships you require, and should be able to optimize these in your query for the objects.
For an overview of relationship optimization techniques see,
http://java-persistence-performance.blogspot.com/2010/08/batch-fetching-optimizing-object-graph.html
For Batch Fetching, there are three types, JOIN, EXISTS, and IN. The problem you outlined of changes to data affecting the original query for cache batched relationships only applies to JOIN and EXISTS, and only when you have a selection criteria based on updateale fields, (if the query you are optimizing is on id, or all instances you are ok). IN batch fetching does not have this issue, so you can use IN batch fetching for all the relationships and not have this issue.
ReadAllRelationshipsQuery(Collection,RelationshipName)
How about,
Query query = em.createQuery("Select o from MyObject o where o.id in :ids");
query.setParameter(ids, ids);
query.setHint("eclipselink.batch", relationship);

If you know all possible relations and the user preferences, why don't you just dynamically build the JPQL string (or Criteria) before executing it?
Like:
String sql = "SELECT u FROM User u"; //use a StringBuilder, this is just for simplity's sake
if(loadAdress)
{
sql += " LEFT OUTER JOIN u.address as a"; //fetch join and left outer join have the same result in many cases, except that with left outer join you could load associations of address as well
}
...
Edit: Since the result would be a cross product, you should then iterate over the entities and remove duplicates.

In the query, use FETCH JOIN to prefetch relationships.
Keep in mind that the resulting rows will be the cross product of all rows selected, which can easily be more work than the N+1 queries.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.