SQL optimization options in Java - java

Let's say I have a basic query like:
SELECT a, b, c FROM x WHERE y=[Z]
In this query, [Z] is a "variable" with different values injected into the query.
Now consider a situation where we want to do the same query with 2 known different values of [Z], say Z1 and Z2. We can make two separate queries:
SELECT a, b, c FROM x WHERE y=Z1
SELECT a, b, c FROM x WHERE y=Z2
Or perhaps we can programmatically craft a different query like:
SELECT a, b, c FROM x WHERE y in (Z1, Z2)
Now we only have one query (1 < 2), but the query construction and result set deconstruction becomes slightly more complicated, since we're no longer doing straightforward simple queries.
Questions:
What is this kind of optimization called? (Is it worth doing?)
How can it be implemented cleanly from a Java application?
Do existing Java ORM technologies help?

What is this kind of optimization called?
I'm not sure if there is a "proper" term for it, but I've heard it called query batching or just plain batching.
(Is it worth doing?)
It depends on:
whether it is worth the effort optimizing the query at all,
the number of elements in the set; i.e. ... IN ( ... ),
the overheads of making a JDBC request versus the costs of query compilation, etc.
But in the right circumstances this is definitely a worthwhile optimization.
How can it be implemented cleanly from a Java application?
It depends on your definition of "clean" :-)
Do existing Java ORM technologies help?
It depends on the specific ORM technology you are talking, but (for example) the Hibernate HQL language supports the constructs that would allow you to do this kind of thing.

An RDBMS can normally return the result of a query with IN in equal or less time than it takes to execute two queries.
If there is no index on column Y, then a full table scan is required. With two queries, two table scans will be performed instead of one.
If there is an index, then the single value in the WHERE clause, or the values in the IN list, are used one at a time to look up the index. When some rows are found for one of the values in the IN list, they are added to the returned result.
So it is better to use the IN predicate from the performance point of view.
When Y represents a column with unique values, then it is easy to decompose the result. Otherwise, there is slightly more work.

I honestly can't say how much of a hit (if any) you will get if you run this two Prepared queries (even using plain JDBC) over combining them with an IN statement.

If you have an array or List of values, you could manually build the prepare statement using JDBC:
// Assuming values is an int[] and conn is a java.sql.Connection
// Also uses Apache Commons StringUtils
StringBuilder query = new StringBuilder("SELECT a, b, c FROM x WHERE y IN (");
query.append(StringUtils.join(Collections.nCopies(values.length, "?"), ',');
query.append(")");
PreparedStatement stmt = conn.prepareStatement(query.toString());
for (int i = 0; i < values.length; i++) {
stmt.setInt(i + 1, values[i]);
}
stmt.execute();
// Get results after this
Note: I haven't actually tested this. In theory, if you used this a lot, you'd generalize this and make it a method.

Note that an "in" (where blah in ( 1, 5, 10 ) ) is the same as writing "where blah = 1 OR blah = 5 OR blah = 10". This is important if you are using, say, Apache Torque which would create lovely prepared statements except in the case of an "in" clause. (That might be fixed by now.)
And the difference in performance that we found between the unprepared in clause and the prepared ORs was huge.
So a number of ORMs handle it, but not all of 'em handle it well. Be sure to examine the queries sent to the database.
And while deconstructing the combined result set from a single query might be more difficult than handling a single result, it's probably a lot easier than trying to combine two result sets from two queries. And probably significantly faster if a lot of duplicates are involved.

Related

How to use SUM inside COALESCE in JOOQ

Given below is a gist of the query, which I'm able to run successfully in MySQL
SELECT a.*,
COALESCE(SUM(condition1 or condition2), 0) as countColumn
FROM table a
-- left joins with multiple tables
GROUP BY a.id;
Now, I'm trying to use it with JOOQ.
ctx.select(a.asterisk(),
coalesce(sum("How to get this ?")).as("columnCount"))
.from(a)
.leftJoin(b).on(someCondition)
.leftJoin(c).on(someCondition))
.leftJoin(d).on(someCondition)
.leftJoin(e).on(someCondition)
.groupBy(a.ID);
I'm having a hard time preparing the coalesce() part, and would really appreciate some help.
jOOQ's API is more strict about the distinction between Condition and Field<Boolean>, which means you cannot simply treat booleans as numbers as you can in MySQL. It's usually not a bad idea to be explicit about data types to prevent edge cases, so this strictness isn't necessarly a bad thing.
So, you can transform your booleans to integers as follows:
coalesce(
sum(
when(condition1.or(condition2), inline(1))
.else_(inline(0))
),
inline(0)
)
But even better than that, why not use a standard SQL FILTER clause, which can be emulated in MySQL using a COUNT(CASE ...) aggregate function:
count().filterWhere(condition1.or(condition2))

The Performance Consequences of Parameterizing Constants in PreparedStatement Queries

When using JDBC's PreparedStatements to query Oracle, consider this:
String qry1 = "SELECT col1 FROM table1 WHERE rownum=? AND col2=?";
String qry2 = "SELECT col1 FROM table1 WHERE rownum=1 AND col2=?";
String qry3 = "SELECT col1 FROM table1 WHERE rownum=1 AND col2=" + someVariable ;
The logic dictates that the value of rownum is always a constant (1 in this example). While the value of col2 is a changing variable.
Question 1: Are there any Oracle server performance advantages (query compilation, caching, etc.) to using qry1 where rownum value is parameterized, over qry2 where rownum's constant value is hardcoded?
Question 2: Ignoring non-performance considerations (such as SQL Injections, readability, etc.), are there any Oracle server performance advantages (query compilation, caching, etc.) to using qry2 over qry3 (in which the value of col2 is explicitly appended, not parameterized).
Answer 1: There are no performance advantages to using qry1 (a softcoded query) over qry2 (a query with reasonable bind variables).
Bind variables improve performance by reducing query parsing; if the bind variable is a constant there is no extra parsing to avoid.
(There are probably some weird examples where adding extra bind variables improves the performance of one specific query. Like with any forecasting program, occasionally if you feed bad information to the Oracle optimizer the result will be better. But it's important to understand that those are exceptional cases.)
Answer 2: There are many performance advantages to using qry2 (a query with reasonable bind variables) over qry3 (a hardcoded query).
Bind variables allow Oracle re-use a lot of the work that goes into query parsing (query compilation). For example, for each query Oracle needs to check that the user has access to view the relevant tables. With bind variables that work only needs to be done once for all executions of the query.
Bind variables also allow Oracle to use some extra optimization tricks that only occur after the Nth run. For example, Oracle can use cardinality feedback to improve the second execution of a query. When Oracle makes a mistake in a plan, for example if it estimates a join will produce 1 row when it really produces 1 million, it can sometimes record that mistake and use that information to improve the next run. Without bind variables the next run will be different and it won't be able to fix that
mistake.
Bind variables also allow for many different plan management features. Sometimes a DBA needs to change an execution plan without changing the text of the query. Features like SQL plan baselines, profiles, outlines, and DBMS_ADVANCED_REWRITE will not work if the query text is constantly changing.
On the other hand, there are a few reasonable cases where it's better to hard-code the queries. Occasionally an Oracle feature like partition pruning cannot understand the expression and it helps to hardcode the value. For large data warehouse queries the extra time to parse a query may be worth it if the query is going to run for a long time anyway.
(Caching is unlikely to affect either scenario. Result caching of a statement is rare, it's much more likely that Oracle will cache only the blocks of the tables used in the statement. The buffer cache probably does not care if those blocks are accessed by one statement many times or by many statements one time)

Can I build a jooq query into an offset?

I'm trying to build something like the following query using the jooq api.
select x.*
from x
offset greatest(0, (select count(*) - 1 from x));
by
select(x.fields()).from(x)
.offset(param(greatest(val(0), select(count().sub(1)).from(x).field(0, Integer.class))))
I'm pretty sure I'm using the offset(Param<Integer>) method incorrectly. It seems to be rendering null for the offset. Is building up offsets like this something that jooq can do? (It seems like the offset method is a bit restricted in what it can do, compared to the rest of the jooq api.)
(I know this query without context seems inefficient, but it's actually what I want to be doing.)
Thanks!
I don't think any database allows you to put a non-constant expression in their OFFSET and LIMIT clauses (it is possible in PostgreSQL, see dsmith's comments). In any case, jOOQ doesn't allow you to do it. You must provide either a constant int value, or a bind variable (a Param).
But you don't really need that feature in your case anyway. Your hypothetical syntax ...
select x.*
from x
offset greatest(0, (select count(*) - 1 from x));
Is equivalent to this:
select x.*
from x
order by <implicit ordering> desc
limit 1;
After all, your query seems to be looking for the last row (by some implicit ordering), so why not just make that explicit?

Better to query once, then organize objects based on returned column value, or query twice with different conditions?

I have a table which I need to query, then organize the returned objects into two different lists based on a column value. I can either query the table once, retrieving the column by which I would differentiate the objects and arrange them by looping through the result set, or I can query twice with two different conditions and avoid the sorting process. Which method is generally better practice?
MY_TABLE
NAME AGE TYPE
John 25 A
Sarah 30 B
Rick 22 A
Susan 43 B
Either SELECT * FROM MY_TABLE, then sort in code based on returned types, or
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'A' followed by
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'B'
Logically, a DB query from a Java code will be more expensive than a loop within the code because querying the DB involves several steps such as connecting to DB, creating the SQL query, firing the query and getting the results back.
Besides, something can go wrong between firing the first and second query.
With an optimized single query and looping with the code, you can save a lot of time than firing two queries.
In your case, you can sort in the query itself if it helps:
SELECT * FROM MY_TABLE ORDER BY TYPE
In future if there are more types added to your table, you need not fire an additional query to retrieve it.
It is heavily dependant on the context. If each list is really huge, I would let the database to the hard part of the job with 2 queries. At the opposite, in a web application using a farm of application servers and a central database I would use one single query.
For the general use case, IMHO, I will save database resource because it is a current point of congestion and use only only query.
The only objective argument I can find is that the splitting of the list occurs in memory with a hyper simple algorithm and in a single JVM, where each query requires a bit of initialization and may involve disk access or loading of index pages.
In general, one query performs better.
Also, with issuing two queries you can potentially get inconsistent results (which may be fixed with higher transaction isolation level though ).
In any case I believe you still need to iterate through resultset (either directly or by using framework's methods that return collections).
From the database point of view, you optimally have exactly one statement that fetches exactly everything you need and nothing else. Therefore, your first option is better. But don't generalize that answer in way that makes you query more data than needed. It's a common mistake for beginners to select all rows from a table (no where clause) and do the filtering in code instead of letting the database do its job.
It also depends on your dataset volume, for instance if you have a large data set, doing a select * without any condition might take some time, but if you have an index on your 'TYPE' column, then adding a where clause will reduce the time taken to execute the query. If you are dealing with a small data set, then doing a select * followed with your logic in the java code is a better approach
There are four main bottlenecks involved in querying a database.
The query itself - how long the query takes to execute on the server depends on indexes, table sizes etc.
The data volume of the results - there could be hundreds of columns or huge fields and all this data must be serialised and transported across the network to your client.
The processing of the data - java must walk the query results gathering the data it wants.
Maintaining the query - it takes manpower to maintain queries, simple ones cost little but complex ones can be a nightmare.
By careful consideration it should be possible to work out a balance between all four of these factors - it is unlikely that you will get the right answer without doing so.
You can query by two conditions:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B'
This will do both for you at once, and if you want them sorted, you could do the same, but just add an order by keyword:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B' ORDER BY TYPE ASC
This will sort the results by type, in ascending order.
EDIT:
I didn't notice that originally you wanted two different lists. In that case, you could just do this query, and then find the index where the type changes from 'A' to 'B' and copy the data into two arrays.

How to create a mutliple search SQL statement where all the parameters are optional?

I would like to know if there is any smart way of making a SQL statement for a search engine where there are 5 optional parameters. All parameters can be used or only one of them, or a mix of any of them.. This makes up to 3000+ different combinations.
The statement needs to be prepared to avoid SQL injections.
I've looked at this post, but it dosent quite cut.
What I'm looking for is something like,
String sql =SELECT * FROM table WHERE (optional1)=? AND (optional2)=? AND (optional3)=? AND (optional4)=? AND (optional5)=?
prepared.setString(1, optional1)
and so on...
Use your java code to add the options to the where clause based on the presence of your arguments (their length or existence, whichever). That way if the optional parameter is not needed, it won't even be part of your SQL expression. Simple.
#a1ex07 has given the answer for doing this as a single query. Using NULLs and checking for them in each condition.
WHERE
table.x = CASE WHEN #x IS NULL THEN table.x ELSE #x END
or...
WHERE
(#x IS NULL OR table.x = #x)
or...
WHERE
table.x = COALESCE(#x, table.x)
etc, etc.
There is one warning, however; As convenient as it is to make one query to do all of this, All of these answers are sub-optimal. Often they're horednous.
When you write ONE query, only ONE execution plan is created. And that ONE execution plan must be suitable for ALL possible combinations of values. But that fixes which indexes are searched, what order they're searched, etc. It yields the least worst plan for a one-size-fits-all query.
Instead, you're better adding the conditions as necessary. You still parameterise them, but you don't include a condition if you know the parameter is NULL.
This is a good link explaining it further, it's for MS SQL Server specifically but it's generally applicatble to any RDBMS that caches the plans after it compiles the SQL.
http://www.sommarskog.se/dyn-search.html
I believe it should work (haven't tested though)
SELECT * FROM table
WHERE
field1 = CASE
WHEN ? IS NULL THEN field1
ELSE ?
END AND
field2 = CASE
WHEN ? IS NULL THEN field2
ELSE ?
END AND .... etc
//java code
if ([optional1 is required])
{
prepared.setString(1, optional1) ;
prepared.setString(2, optional1) ;
}
else
{
prepared.setNull(1, java.sql.Types.VARCHAR) ;
prepared.setNull(2, java.sql.Types.VARCHAR) ;
}
etc.

Categories