I have a SELECT query with a lot of IF conditions, which I can evaluate either in the query itself (using the DB machine's CPU) or in my Java code (using the app server's CPU).
Is there a preferred approach here (conditions in the DB vs. in the mid-tier)?
UPDATE: My query joins more than 2 tables using LEFT JOIN, so some rows have a corresponding row in the 2nd table and some do not.
I need a default value for those columns when there is no corresponding row in the 2nd table:
SELECT CASE WHEN t2.col1 IS NULL THEN 'default' ELSE t2.col1 END
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id
If it's really something that the DB cannot do any faster than the app server, and which actually reduces the load on the DB server if moved to the app server, then I'd move it to the app server.
The reason: if you reach the limits of your hardware, it's much easier to have multiple app servers than to have a clustered database.
However, the second condition above should be tested thoroughly: many things will not reduce the DB load (and may even increase it) if moved away from the DB.
Update: For the kind of thing you need, I doubt whether the first condition is satisfied - have you tested it? A simple CASE is completely insignificant, unless the condition or the branches contain some very expensive calculations.
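As an aside, the same default can also be expressed with the standard COALESCE function, which is equivalent to the CASE in your update:
SELECT COALESCE(t2.col1, 'default')
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id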
Yes, though I would suggest another approach, one that adds no load to the app server and minimal load to the DBMS. It's a little hard to answer the question since you haven't provided a concrete example but I'll give it a shot.
My preferred solution is to get rid of the if conditions totally if you can. At a bare minimum, you can re-jig your database schema to move the cost of calculation away from the select (which happens a lot) and into the insert/update (which happens less often).
That's the normal case; I have seen databases that are written to more often than they are read, but they're the exception rather than the rule.
By way of example, let's say you store person information and you want to get a list of people whose first name is more than 5 characters long. Don't ask why, I'm the customer, you have to give me what I want :-)
Rather than a monstrous select statement that (possibly) splits apart the name and counts the characters in it, do that in an insert/update trigger when the data enters the table - that's the only time the value can change, after all.
Put that calculation in another (indexed) column and use that in your select. The cost of the calculation is amortised over all the selects, which will be blindingly fast.
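A minimal sketch of that idea, assuming PostgreSQL 11 or later and a hypothetical person table with a first_name column (the fname_len column and the trigger names are made up for illustration):
ALTER TABLE person ADD COLUMN fname_len integer;
CREATE INDEX idx_person_fname_len ON person (fname_len);

CREATE OR REPLACE FUNCTION set_fname_len() RETURNS trigger AS $$
BEGIN
    -- keep the derived column in sync whenever the row is written
    NEW.fname_len := char_length(NEW.first_name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_person_fname_len
BEFORE INSERT OR UPDATE ON person
FOR EACH ROW EXECUTE FUNCTION set_fname_len();

-- The "first name longer than 5 characters" query then becomes a cheap indexed comparison:
SELECT * FROM person WHERE fname_len > 5;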
It will take up more storage space but, if you compare the number of database "how can I make this faster?" questions against the number of "how can I use less space?" questions, you'll find the former greatly outweighs the latter.
And, yes, it does mean you store redundant data but the triggers mitigate the possibility of losing ACID properties. It's okay to bend rules if you know the possible consequences and how best to avoid them.
Based on your update, you should put the workload on to the machine where it causes the least impact. That may be the DBMS, it may be the app server, it may even be on the client side (of the app server) itself since that would distribute the cost across a lot of machines rather than concentrating it at a single point.
You should measure, not guess! Set up realistic performance test systems along with realistic production-quality data, then try the different approaches. That's the only real way to be certain.
Spring Boot Query
@Query(value = "SELECT * " +
        "FROM products p " +
        "JOIN product_generic_name pg ON pg.id = p.product_generic_name_id " +
        "WHERE (p.product_name LIKE %?1% " +
        "  AND p.parent_product_id IS NULL " +
        "  AND p.is_active = true " +
        "  AND (p.is_laboratory IS NULL OR p.is_laboratory = false)) " +
        "OR (pg.product_generic_name LIKE %?1% " +
        "  AND pg.is_active = true)",
        nativeQuery = true)
Page<Products> findByProductNameLikeAndGenericNameLike(String searchText, Pageable pageable);
The product table has over 3 million entries, and the query takes around 4 minutes to complete. How can I optimize the query's performance? I tried indexing the product_name column, but it didn't improve performance much.
There are two bottlenecks:
like %?1% -- The leading wildcard means that it must read and check every row.
OR -- This is rarely optimizable.
If like %?1% is only looking at "words", then using a FULLTEXT index and MATCH will run much faster.
OR can be turned into a UNION. It should probably be UNION DISTINCT, assuming that ?1 could be in both the name and the generic_name.
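A rough sketch of that rewrite, assuming MySQL and that whole-word matching is acceptable (the index names are made up; ?1 is the search text, with no % wildcards needed when using MATCH):
ALTER TABLE products ADD FULLTEXT INDEX ft_product_name (product_name);
ALTER TABLE product_generic_name ADD FULLTEXT INDEX ft_generic_name (product_generic_name);

SELECT p.*
FROM products p
JOIN product_generic_name pg ON pg.id = p.product_generic_name_id
WHERE MATCH(p.product_name) AGAINST (?1)
  AND p.parent_product_id IS NULL
  AND p.is_active = true
  AND (p.is_laboratory IS NULL OR p.is_laboratory = false)
UNION DISTINCT
SELECT p.*
FROM products p
JOIN product_generic_name pg ON pg.id = p.product_generic_name_id
WHERE MATCH(pg.product_generic_name) AGAINST (?1)
  AND pg.is_active = true;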
More memory, more regular indexes, etc. -- These are not likely to help. EXPLAIN and other analysis tools tell you what is going on now, not how to improve the query and/or indexes. Defragmentation (in InnoDB) is mostly a waste of time. There is only a narrow range of CPU speeds, and this has not changed in over 20 years. Extra cores are useless, since MySQL will use only one core for this query. A mere 3M rows means that you probably have more than enough RAM.
Adding an index to product_name won't help, as you are doing a like search on it rather than an exact match. For your query, you should add indexes to the following columns (a sketch follows the list):
is_active
is_laboratory
parent_product_id
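For example (MySQL syntax; whether separate single-column indexes or one composite index works better depends on the data's selectivity):
CREATE INDEX idx_products_is_active ON products (is_active);
CREATE INDEX idx_products_is_laboratory ON products (is_laboratory);
CREATE INDEX idx_products_parent_product_id ON products (parent_product_id);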
However, doing a "free text" search with two wildcards, at the start and end of your search term, is not a great use case for a relational database. Is this the best design for this problem? If you have 3 million products, could you have a "product group" which the user has to select, to reduce the number of rows to be searched? Alternatively, this is a use case that is a good fit for a full-text search engine like ElasticSearch or Solr.
This is a very open-ended question, I would say.
I will try to break it down for you.
There are a couple of things you can do, unless you already have.
Tip 1: Optimize Queries
In many cases database performance issues are caused by inefficient SQL queries. Optimizing your SQL queries is one of the best ways to increase database performance. When you try to do that manually, you’ll encounter several dilemmas around choosing how best to improve query efficiency. These include understanding whether to write a join or a subquery, whether to use EXISTS or IN, and more. When you know the best path forward, you can write queries that improve efficiency and thus database performance as a whole. That means fewer bottlenecks and fewer unhappy end users.
The best way to optimize queries is to use a database performance analysis solution that can guide your optimization efforts by directing you to the most inefficient queries and offering expert advice on how best to improve them.
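For instance, here is the same filter written with IN and with EXISTS (table and column names are borrowed from the question above; which form is faster depends on the optimizer and the data):
-- IN form
SELECT *
FROM products p
WHERE p.product_generic_name_id IN
      (SELECT id FROM product_generic_name WHERE is_active = true);

-- EXISTS form
SELECT *
FROM products p
WHERE EXISTS (SELECT 1
              FROM product_generic_name pg
              WHERE pg.id = p.product_generic_name_id
                AND pg.is_active = true);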
Tip 2: Improve Indexes
In addition to queries, the other essential element of the database is the index. When done right, indexing can increase your database performance and help optimize the duration of your query execution. Indexing creates a data structure that helps keep all your data organized and makes it easier to locate information. Because it’s easier to find data, indexing increases the efficiency of data retrieval and speeds up the entire process, saving both you and the system time and effort.
Tip 3: Defragment Data
Data defragmentation is one of the best approaches to increasing database performance. Over time, with so much data constantly being written to and deleted from your database, your data can become fragmented. That fragmentation can slow down the data retrieval process as it interferes with a query’s ability to quickly locate the information it’s looking for. When you defragment data, you allow for relevant data to be grouped together and you erase index page issues. That means your I/O related operations will run faster.
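As a hedged example, in MySQL with InnoDB a fragmented table (and its indexes) can be rebuilt like this:
OPTIMIZE TABLE products;
-- For InnoDB this is mapped to ALTER TABLE ... FORCE, i.e. a full table rebuild that reclaims fragmented space.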
Tip 4: Increase Memory
The efficiency of your database can suffer significantly when you don’t have enough memory available for the database to work correctly. Even if it seems like you have a lot of memory in total, you might not be meeting the demands of your database. A good way to figure out if you need more memory is to check how many page faults your system has. When the number of faults is high, it means your hosts are either running low on or completely out of available memory. Increasing your memory allocation will help boost efficiency and overall performance.
Tip 5: Strengthen CPU
A better CPU translates directly into a more efficient database. That’s why you should consider upgrading to a higher-class CPU unit if you’re experiencing issues with your database performance. The more powerful your CPU is, the less strain it’ll have when dealing with multiple requests and applications. When assessing your CPU, you should keep track of all the elements of CPU performance, including CPU ready times, which tell you about the times your system tried to use the CPU, but couldn’t because the resources were otherwise occupied.
Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my search correctly.
My question is:
I have a couple of tables which will hold anywhere from 1,000 to 100,000 rows per year. I'm trying to figure out whether and how I should handle archiving the data. I'm not well experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method, because when we want to search for old data our application will need to search in a different database, and it'll be a hassle for me to add more code for this.
Method 2
Archive data older than 2-3 years into a separate database,
and use a status on the rows to improve performance (see Method 3). This is something I'm leaning towards as an 'optimal' solution, where the code is not overly complex and it also keeps my DB relatively clean.
Method 3
Just have a status for each row (e.g. A = active, R = archived) to possibly improve the performance of the query. Just having a "select * from table where status = 'A'" reduces the number of rows to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
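A minimal sketch of that, assuming PostgreSQL and a hypothetical orders table with a created_at date column:
-- Index the existing date column (or a dedicated year column, if you add one):
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- Old vs. recent rows can then be selected with a plain range predicate:
SELECT *
FROM orders
WHERE created_at >= DATE '2015-01-01'
  AND created_at <  DATE '2016-01-01';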
[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seem to be the case here.
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use-case, I would further suggest having views on the data only for the active data. This would only read one partition, so the performance should be pretty much the same as reading a table with only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
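For illustration, here is a sketch of the active-flag variant using declarative partitioning (PostgreSQL 10 or later; the table and column names are made up):
CREATE TABLE lines (
    id         bigint  NOT NULL,
    status     char(1) NOT NULL,   -- 'A' = active, 'R' = archived
    created_at date    NOT NULL,
    payload    text
) PARTITION BY LIST (status);

CREATE TABLE lines_active   PARTITION OF lines FOR VALUES IN ('A');
CREATE TABLE lines_archived PARTITION OF lines FOR VALUES IN ('R');

-- A view over the active data keeps application queries simple and touches only one partition:
CREATE VIEW active_lines AS
SELECT * FROM lines WHERE status = 'A';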
In Java code I am trying to fetch 3,500 rows from the DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement and displaying 8 columns from a single table (no joins). I save the data from the DB in a List and use it as the source for a DataTable. I have also considered the hardware side, such as RAM capacity, storage and network speed; it exceeds the minimum requirements comfortably. Can you help me make this quicker (it shouldn't take more than 3 seconds)?
Have you implemented proper indexing on your tables? I don't like to ask this, since it is a very basic way of optimizing tables for queries and you mention that you have already tried several things. One workaround that works for me, if the purpose of the query is to display the results, is to design the code so that it immediately displays the initial data while still loading more. This implies a separate thread for loading and a separate thread for displaying.
It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client-side (Java) code is likely to make a significant difference (i.e. a 5-fold speed-up) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code rather than the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.
I'm on a project that demands high performance... and I was told to use as few database calls as possible, and to keep more objects in JVM memory. Right.
So... it didn't shock me at first, but now I'm questioning the approach.
How can I know which is best?
On the one hand, I would have:
- static Map<id1, id2>
- static Map<id2, ObjectX>
ObjectX:
- id2
- Map<id1, ObjectY>
ObjectY:
- id1
So basically, this data structure would help me to get an ObjectY from an id1. And I would be able to send back the whole ObjectX as well when needed.
You should know that the structure is filled by a service call (A). Then, updates to ObjectY objects can happen through another service (B). Finally, another service can send back an ObjectX (C). That makes three services using the data.
On the other hand, I could have:
- a db table for ObjectY (T1)
- a db join table associating id1s and id2s (T2)
- a db table for ObjectX (T3)
Service A would make an insert in the tables.
Service B would make an update in table T1.
Service C would join T2 and T1 to get all the ObjectY objects for an ObjectX.
In my opinion, the db version is more flexible... I am unsure about the performance, but I would say the db version shouldn't be slower than the "memory" version. Finally, doesn't the "memory" version carry some risks?
I hope it seems obvious to some of you which version I should choose and why... I'm not hoping for a debate; I'm looking for ways to know which is quicker...
Retrieving an object stored in memory will take on the order of hundreds of nanoseconds (less if it has been accessed recently and so is in a CPU cache). Of course this latency will vary based on your platform, but it is a ballpark figure for comparison. Retrieving the same information from a database - again, it depends on many factors, such as whether the database is on the same machine - will take on the order of milliseconds at least, i.e. tens of thousands of times slower.
Which is quicker - you will need to be more specific, which operations will you be measuring for speed? But the in-memory version will be faster in pretty much all cases. The database version gives different advantages - persistence, access from different machines, transactional commit / rollback - but speed is not one of them, not compared with an in-memory calculation.
Yes, the in-memory version has risks - basically if the machine is powered down (or the process exits for whatever reason...memory corruption, uncaught exception) then the data will be lost (i.e. in-memory solution does not have 'persistence' unlike a database).
What you are doing is building a cache. And it's a hugely popular and proven technique, with many implementations ranging from simple Map usage to full vendor products, support for caching across servers, and all sorts of bells and whistles.
And, done well, you should indeed get all sorts of performance improvements. But the main challenge in caching: how do you know when your cache entry is "stale", i.e. the DB has content that has changed, but your cache doesn't know about it?
You might have an obvious answer here. You might be caching stuff that actually won't change. Cache invalidation is the proper term here - when to refresh it because you know it's stale and you need fresh content.
I think all the trade offs that you rightly recognise are ones you personally need to weigh up, with the extra confidence that you're not "missing something".
One final thought - will you have enough memory to cache everything? Maybe you need to limit it, e.g. to the top 100,000 objects that get requested. Looking at 3rd-party caching tools like EHCache or Guava could be useful:
https://code.google.com/p/guava-libraries/wiki/CachesExplained
Currently working on the deployment of an OFBiz-based ERP, we've come across the following problem: some of the framework's code calls resultSet.last() to find the total number of rows in the result set. Using the Oracle JDBC driver v11 and v10, it tries to cache all of the rows in client memory, crashing the JVM because it doesn't have enough heap space.
After researching, the problem seems to be that the Oracle JDBC driver implements scrollable cursors on the client side, by means of a cache, rather than on the server. Using the DataDirect driver, that issue is solved, but the call to resultSet.last() takes too long to complete, so the application server aborts the transaction.
Is there any way to implement scrollable cursors via JDBC in Oracle without resorting to the DataDirect driver?
And what is the fastest way to know the length of a given ResultSet?
Thanks in advance
Ismael
"what is the fastest way to know the length of a given resultSet"
The ONLY way to really know is to count them all. You want to know how many 'SMITH's are in the phone book. You count them.
If it is a small result set, and quickly arrived at, it is not a problem. E.g. there won't be many Gandalfs in the phone book, and you probably want to get them all anyway.
If it is a large result set, you might be able to do an estimate, though that's not generally something that SQL is well-designed for.
To avoid caching the entire result set on the client, you can try
select id, count(1) over () n from junk;
Then each row will have an extra column (in this case n) with the count of rows in the result set. But it will still take the same amount of time to arrive at the count, so there's still a strong chance of a timeout.
A compromise is get the first hundred (or thousand) rows, and don't worry about the pagination beyond that.
Your proposed "workaround" with count basically doubles the work done by the DB server. It must first walk through everything to count the number of results, and then do the same again to return the results. Much better is the method mentioned by Gary (count(*) over () - analytics). But even here the whole result set must be created before the first output is returned to the client, so it is potentially slow and memory-consuming for large outputs.
The best way, in my opinion, is to select only the page you want on the screen (+1 row to determine that a next one exists), e.g. rows 21 to 41. And have another button (use case) to count them all in the (rare) case someone needs it.
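A sketch of that page-at-a-time approach using the classic ROWNUM wrapper, which also works on older Oracle versions ("junk" is reused from the example above):
-- Rows 21 to 41: the extra 21st row of the page tells you whether a next page exists.
SELECT *
  FROM (SELECT t.*, ROWNUM rn
          FROM (SELECT id FROM junk ORDER BY id) t
         WHERE ROWNUM <= 41)
 WHERE rn >= 21;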