Java SQL [ limit | rownum ... ] uniformization API

I'm faced with the task of adding "DB pagination" while executing the query, rather than while serving the user with results.
The problem is that each DB engine has its own SQL syntax for extracting a subset of results from the entire result set. To be more precise, if one wants results N to N+d:
mysql ---> select X from Y where Z limit N, d
oracle --> select * from ( select X, ROW_NUMBER() OVER ( ORDER BY Y.colName ) R from Y where Z ) where R between N and N+d
For now we provide support for Oracle & MySQL, but one can never know the clients' requests, so I am trying to have a general implementation available. Therefore I am looking for a library that provides functionality like this:
qWithSubset = performLimitationOverQuery( qNoSubset, offset, amount, sortOnSet )
Any suggestion is welcome. Thank you.

Any JPA implementation, such as Hibernate for example, provides a level of abstraction between your code and the database.
To be fair, it does much more than deal with pagination. You could, however, look at and/or borrow its implementation of database "dialects" to deal with pagination without changing the way you're dealing with the database in general, if you were so inclined.
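For example, plain JPA already exposes pagination portably and lets the provider render the dialect-specific SQL. A minimal sketch, assuming an EntityManager em and a mapped entity Y (the entity name and filter are placeholders standing in for your Y and Z):

// A minimal sketch, assuming an EntityManager "em" and a mapped entity Y.
// The provider (e.g. Hibernate) renders LIMIT/OFFSET, ROWNUM or ROW_NUMBER()
// depending on the configured dialect.
List<Y> page = em.createQuery(
        "select y from Y y where y.z = :z order by y.colName", Y.class)
    .setParameter("z", z)
    .setFirstResult(offset)  // N
    .setMaxResults(amount)   // d
    .getResultList();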

Related

How to use SUM inside COALESCE in JOOQ

Given below is a gist of the query, which I'm able to run successfully in MySQL
SELECT a.*,
COALESCE(SUM(condition1 or condition2), 0) as countColumn
FROM table a
-- left joins with multiple tables
GROUP BY a.id;
Now, I'm trying to use it with JOOQ.
ctx.select(a.asterisk(),
        coalesce(sum("How to get this ?")).as("countColumn"))
   .from(a)
   .leftJoin(b).on(someCondition)
   .leftJoin(c).on(someCondition)
   .leftJoin(d).on(someCondition)
   .leftJoin(e).on(someCondition)
   .groupBy(a.ID);
I'm having a hard time preparing the coalesce() part, and would really appreciate some help.
jOOQ's API is more strict about the distinction between Condition and Field<Boolean>, which means you cannot simply treat booleans as numbers as you can in MySQL. It's usually not a bad idea to be explicit about data types to prevent edge cases, so this strictness isn't necessarily a bad thing.
So, you can transform your booleans to integers as follows:
coalesce(
    sum(
        when(condition1.or(condition2), inline(1))
        .else_(inline(0))
    ),
    inline(0)
)
But even better than that, why not use a standard SQL FILTER clause, which can be emulated in MySQL using a COUNT(CASE ...) aggregate function:
count().filterWhere(condition1.or(condition2))
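Plugged back into your query, a sketch (reusing the question's placeholder tables and conditions) could look like this:

// A sketch reusing the question's placeholder names (a, b, someCondition, ...)
ctx.select(a.asterisk(),
        count().filterWhere(condition1.or(condition2)).as("countColumn"))
   .from(a)
   .leftJoin(b).on(someCondition)
   .groupBy(a.ID)
   .fetch();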

Can I build a jooq query into an offset?

I'm trying to build something like the following query using the jooq api.
select x.*
from x
offset greatest(0, (select count(*) - 1 from x));
by writing:
select(x.fields()).from(x)
.offset(param(greatest(val(0), select(count().sub(1)).from(x).field(0, Integer.class))))
I'm pretty sure I'm using the offset(Param<Integer>) method incorrectly. It seems to be rendering null for the offset. Is building up offsets like this something that jooq can do? (It seems like the offset method is a bit restricted in what it can do, compared to the rest of the jooq api.)
(I know this query without context seems inefficient, but it's actually what I want to be doing.)
Thanks!
Most databases don't allow you to put a non-constant expression in their OFFSET and LIMIT clauses (PostgreSQL is an exception; see dsmith's comments). In any case, jOOQ doesn't allow you to do it. You must provide either a constant int value, or a bind variable (a Param).
But you don't really need that feature in your case anyway. Your hypothetical syntax ...
select x.*
from x
offset greatest(0, (select count(*) - 1 from x));
is equivalent to this:
select x.*
from x
order by <implicit ordering> desc
limit 1;
After all, your query seems to be looking for the last row (by some implicit ordering), so why not just make that explicit?
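In jOOQ, a sketch of that explicit version (X.CREATED_AT is a hypothetical stand-in for whatever column defines your ordering):

// A sketch; X.CREATED_AT is a hypothetical stand-in for the ordering column
ctx.selectFrom(X)
   .orderBy(X.CREATED_AT.desc())
   .limit(1)
   .fetch();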

Efficient solution for grouping same values in a large dataset

At my job I was given the task of developing and implementing a solution for the following problem:
Given a dataset of 30M records, extract (key, value) tuples from a particular dataset field, group them by key and value, storing the number of same values for each key. Write the top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in the form of serialized XML.
I came up with a solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples, dump them on disk. Each tuple goes to a file with a name pattern like '<key>/chunk-<n>', so all values for a given key are stored in one directory. Within one file, values are stored sorted.
Step 2. Iterate over all '<key>' directories and merge their chunk files into one, grouping same values. Since the values are stored sorted, merging them is trivial at O(n * log k) complexity, where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words, for each key) sequentially read its values, using a PriorityQueue to maintain the top 5000 values without loading all the values into memory. Write the queue content to the database. (A sketch of this step follows.)
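A sketch of that top-N step (readValueCounts is a hypothetical helper that counts runs of equal values in the sorted, merged file):

// A sketch; readValueCounts(key) is a hypothetical helper yielding
// (value, count) entries from the sorted, merged file for one key.
// A min-heap capped at 5000 keeps only the most frequent values in memory.
PriorityQueue<Map.Entry<Integer, Integer>> topN =
        new PriorityQueue<>(Map.Entry.comparingByValue());
for (Map.Entry<Integer, Integer> e : readValueCounts(key)) {
    topN.offer(e);
    if (topN.size() > 5000) {
        topN.poll();  // evict the currently least frequent value
    }
}
// topN now holds the 5000 most frequent values for this key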
I spent about a week on this task, mainly because I had not worked with Spring-Batch previously and because I tried to put emphasis on scalability, which requires an accurate implementation of the multi-threading part.
The problem is that my manager considers this task way too easy to spend that much time on.
And the question is: do you know a more efficient solution, or maybe a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware of MapReduce-like frameworks, but I can't use them because the application is supposed to run on a simple PC with 3 cores and 1GB for the Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being the project manager or at least the task reviewer, would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over for each key? You are doing a lot of file management tasks in this solution, which is definitely not for free.
As you have three cores, you could parse three keys at the same time (as long as the file system can handle the load); see the sketch below.
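A rough sketch of that per-key parallelism (parseXmlForKey is a hypothetical stand-in for re-parsing the dataset and handling the values of a single key):

// A rough sketch; parseXmlForKey is a hypothetical stand-in for
// re-parsing the dataset and extracting values for a single key.
ExecutorService pool = Executors.newFixedThreadPool(3);  // one task per core
for (String key : allKeys) {
    pool.submit(() -> parseXmlForKey(datasetFile, key));
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);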
Your solution seems reasonable and efficient, however I'd probably use SQL.
While parsing the key/value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL Server 2008, but the concept should be workable in almost any modern RDBMS).
The SQL between /* START */ and /* END */ would be the statements you need to execute in your code.
BEGIN
    -- database table
    DECLARE @tbl TABLE (
        k INT -- key
        , v INT -- value
        , c INT -- count
        , UNIQUE CLUSTERED (k, v)
    )
    -- insertion loop (for testing)
    DECLARE @x INT
    SET @x = 0
    SET NOCOUNT ON
    WHILE (@x < 1000000)
    BEGIN
        --
        SET @x = @x + 1
        DECLARE @k INT
        DECLARE @v INT
        SET @k = CAST(RAND() * 10 AS INT)
        SET @v = CAST(RAND() * 100 AS INT)
        -- the INSERT / UPDATE code
        /* START this is the sql you'd run for each row */
        UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
        IF @@ROWCOUNT = 0
            INSERT INTO @tbl VALUES (@k, @v, 1)
        /* END */
        --
    END
    SET NOCOUNT OFF
    -- final select
    DECLARE @topN INT
    SET @topN = 50
    /* START this is the sql you'd run once at the end */
    SELECT
        a.k
        , a.v
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
            , k
            , v
        FROM @tbl
    ) a
    WHERE a.rid <= @topN
    /* END */
END
Gee, it doesn't seem like much work to try the old-fashioned way of just doing it in memory.
I would try just doing it first, then if you run out of memory, try one key per run (as per @Storstamp's answer).
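The "simple" version is basically a map of maps; a sketch, assuming integer keys and values:

// A sketch of the in-memory approach, assuming integer keys and values.
// For each key, count the occurrences of each value in a nested map.
Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
for (int[] tuple : tuples) {                  // tuple = {key, value}
    counts.computeIfAbsent(tuple[0], k -> new HashMap<>())
          .merge(tuple[1], 1, Integer::sum);  // increment the count
}
// Then, per key, pick the 5000 most frequent values from counts.get(key).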
If using the "simple" solution is not an option due to the size of the data, my next choice would be an SQL database. However, as most of these require quite a lot of memory (and slow to a crawl when heavily overloaded in RAM), maybe you should redirect your search to something like a NoSQL database such as MongoDB, which can be quite efficient even when mostly disk-based. (Which your environment basically requires, having only 1GB of heap available.)
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all indexes, sorting it), and may well do it more efficiently than your solution, because all data may be sorted and indexed already when inserted, removing the extra steps of sorting the lines in the '<key>/chunk-<n>' files, merging them, etc.
You will end up with a solution that is probably much easier to administer, and it will also allow you to set up different kinds of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect, however, I would object, because the solution is a bit hard to maintain and does not use proven technologies that basically do part of the same thing you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.

Hibernate getting position of a row in a result set

I need to get an equivalent to this SQL that can be run using Hibernate. It doesn't work as is due to special characters like @.
SELECT place from (select @curRow := @curRow + 1 AS place, time, id FROM `testing`.`competitor` JOIN (SELECT @curRow := 0) r order by time) competitorList where competitorList.id=4;
My application manages the results of running competitions. The above query selects, for a specific competitor, their place based on their overall time.
For simplicity I'll only list the COMPETITOR table structure (only the relevant fields). My actual query involves a few joins, but they are not relevant to the question:
CREATE TABLE competitor (
    id INT,
    name VARCHAR,
    time INT
)
Note that competitors are not already ordered by time, so the ID cannot be used as the rank. Also, two competitors can have the same overall time.
Any idea how I could make this work with Hibernate?
Hard to tell without a schema, but you may be able to use something like
SELECT COUNT(*) FROM testing ts
WHERE ts.score < $obj.score
where I am using the $ to stand for whatever Hibernate notation you need to refer to the live object.
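With the competitor table from the question, a sketch of this in Hibernate could be (assuming a mapped Competitor entity; the place is the count of strictly faster competitors plus one, so ties share a place):

// A sketch, assuming a mapped Competitor entity with a "time" property.
Long faster = (Long) session.createQuery(
        "select count(*) from Competitor c where c.time < :time")
    .setParameter("time", competitor.getTime())
    .uniqueResult();
long place = faster + 1;  // ties share the same place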
I couldn't find any way to do this, so I had to change the way I calculate the position. I now take the top results and build the ladder in Java, rather than in the SQL query.

SQL optimization options in Java

Let's say I have a basic query like:
SELECT a, b, c FROM x WHERE y=[Z]
In this query, [Z] is a "variable" with different values injected into the query.
Now consider a situation where we want to do the same query with 2 known different values of [Z], say Z1 and Z2. We can make two separate queries:
SELECT a, b, c FROM x WHERE y=Z1
SELECT a, b, c FROM x WHERE y=Z2
Or perhaps we can programmatically craft a different query like:
SELECT a, b, c FROM x WHERE y in (Z1, Z2)
Now we only have one query (1 < 2), but the query construction and result set deconstruction become slightly more complicated, since we're no longer doing straightforward simple queries.
Questions:
What is this kind of optimization called? (Is it worth doing?)
How can it be implemented cleanly from a Java application?
Do existing Java ORM technologies help?
What is this kind of optimization called?
I'm not sure if there is a "proper" term for it, but I've heard it called query batching or just plain batching.
(Is it worth doing?)
It depends on:
whether it is worth the effort optimizing the query at all,
the number of elements in the set; i.e. ... IN ( ... ),
the overheads of making a JDBC request versus the costs of query compilation, etc.
But in the right circumstances this is definitely a worthwhile optimization.
How can it be implemented cleanly from a Java application?
It depends on your definition of "clean" :-)
Do existing Java ORM technologies help?
It depends on the specific ORM technology you are talking about, but (for example) Hibernate's HQL supports the constructs that would allow you to do this kind of thing.
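For instance, Hibernate can bind a whole collection to an IN list. A sketch, assuming a Hibernate Session and a mapped entity X with a property y:

// A sketch, assuming a Hibernate Session "session" and a mapped entity X
List<?> rows = session.createQuery("from X x where x.y in (:ys)")
        .setParameterList("ys", Arrays.asList(z1, z2))
        .list();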
An RDBMS can normally return the result of a query with IN in equal or less time than it takes to execute two queries.
If there is no index on column Y, then a full table scan is required. With two queries, two table scans will be performed instead of one.
If there is an index, then the single value in the WHERE clause, or the values in the IN list, are used one at a time to look up the index. When some rows are found for one of the values in the IN list, they are added to the returned result.
So it is better to use the IN predicate from the performance point of view.
When Y represents a column with unique values, then it is easy to decompose the result. Otherwise, there is slightly more work; see the sketch below.
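For example, the rows of the combined result set can be partitioned back out by the Y column in a single pass (a sketch over a plain JDBC ResultSet):

// A sketch: split the combined result set by the y column, so each
// original query's rows can be handled separately.
Map<Integer, List<Object[]>> byY = new HashMap<>();
while (rs.next()) {
    Object[] row = { rs.getObject("a"), rs.getObject("b"), rs.getObject("c") };
    byY.computeIfAbsent(rs.getInt("y"), k -> new ArrayList<>()).add(row);
}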
I honestly can't say how much of a hit (if any) you would take from running these as two prepared queries (even using plain JDBC) versus combining them with an IN statement.
If you have an array or List of values, you can build the prepared statement manually using JDBC:
// Assuming values is an int[] and conn is a java.sql.Connection
// Also uses Apache Commons Lang StringUtils
StringBuilder query = new StringBuilder("SELECT a, b, c FROM x WHERE y IN (");
query.append(StringUtils.join(Collections.nCopies(values.length, "?"), ','));
query.append(")");
PreparedStatement stmt = conn.prepareStatement(query.toString());
for (int i = 0; i < values.length; i++) {
    stmt.setInt(i + 1, values[i]);  // JDBC parameters are 1-based
}
stmt.execute();
// Get results after this
Note: I haven't actually tested this. In theory, if you used this a lot, you'd generalize it and make it a method.
Note that an "in" (where blah in ( 1, 5, 10 )) is the same as writing "where blah = 1 OR blah = 5 OR blah = 10". This is important if you are using, say, Apache Torque, which would create lovely prepared statements except in the case of an "in" clause. (That might be fixed by now.)
And the difference in performance that we found between the unprepared "in" clause and the prepared ORs was huge.
So a number of ORMs handle it, but not all of them handle it well. Be sure to examine the queries sent to the database.
And while deconstructing the combined result set from a single query might be more difficult than handling a single result, it's probably a lot easier than trying to combine two result sets from two queries. And probably significantly faster if a lot of duplicates are involved.
