Jooq limit before join - java

I have two tables with a 1-n relation, i.e. a table Order which stores orders and a table OrderPosition which stores positions of an order.
When fetching from the DB while joining both tables, I want to limit the number of orders. Limiting after the join of the two tables obviously won't work, as one order may result in multiple records, depending on the number of positions the order has.
This is what I wrote with jOOQ:
final Table<Record> alias = context.select().from(ORDER).limit(1).asTable();
final Result<Record> result =
    context
        .select()
        .from(alias.join(ORDER_POSITION)
            .on(ORDER_POSITION.ORDER_ID.eq(alias.field(ORDER.ID))))
        .fetch();
This does not seem to limit the number of orders; it returns more than one order. On the other hand, if I replace the join with a leftJoin, the limit works exactly as intended (also for different limit parameters). I used an H2 DB to test the query (not sure if it matters).
I'm aware of the difference between join and leftJoin, but shouldn't the limit work as intended in both cases, or did I miss something?
This is what jOOQ generates with a join:
select
  "alias_129458832"."ID",
  "alias_129458832"."KEY",
  ...
  "PUBLIC"."ORDER_POSITION"."ID",
  "PUBLIC"."ORDER_POSITION"."ORDER_ID"
from (
  select
    "PUBLIC"."ORDER"."ID",
    "PUBLIC"."ORDER"."KEY",
    ...
  from "PUBLIC"."ORDER"
  limit ?
) "alias_129458832"
join "PUBLIC"."ORDER_POSITION"
  on "PUBLIC"."ORDER_POSITION"."ORDER_ID" = "alias_129458832"."ID"
If I replace the query with a leftJoin, it generates the same query except the join is replaced by 'left outer join'.
What I tested
I ran the tests with an in-memory H2 database. I have a couple of orders without any positions, one order with one position, and one order with three positions.
With the join and a limit of 1, I got all 4 joined records (which means both orders were returned); with a limit of 0, I got no records. Each order has a random key, which is freshly generated for each test. If I order the orders by this key with
context.select().from(ORDER).orderBy(ORDER.KEY).limit(4).asTable();
and a limit of 4, I sometimes got no records, sometimes one record (the order with one position), sometimes three records (the order with three positions), and sometimes all 4 records (i.e. both orders).
As mentioned, if I replace the join with a leftJoin, I of course also get the orders without any positions, but then the number of orders specified by the limit is always correct.

Related

Fastest way to compare millions of rows in one table with millions of rows in another [closed]

I want to compare two tables with millions of records in each and get the matching data from the comparison.
To get the matching data from both tables, we first require that the name in table1 is not equal to the name in table2. Then we require that the city in table1 is equal to the city in table2, and finally that the date_of_birth in table1 is within a +-1 year range of the date_of_birth in table2.
A single row in Table 1 can have multiple matches with data in Table 2.
Also, for each match I need a unique record ID, and multiple matches for a single Table 1 row must share the same record ID.
I tried Java code and a PL/SQL procedure, but both take hours, as this involves comparing millions of rows against millions of rows. Is there a faster way to do this matching?
"I tried using java by storing data from both tables in list via jdbc connection and then iterating one list with the other. But it was very slow and took many hours to complete, even got time out exception many time."
Congratulations. This is the first step on the road to enlightenment. Databases are much better at handling data than Java. Java is a fine general programming language but databases are optimized for relational data processing: they just do it faster, with less CPU, less memory and less network traffic.
"I also created an sql procedure for the same, it was some what faster
than java program but still took a lot time (couple of hours) to
complete."
You are on the verge of the second step to enlightenment: row-by-row processing (i.e. procedural iteration) is slow. SQL is a set-based paradigm. Set processing is much faster.
To give concrete advice we need some specifics about what you are really doing, but as an example this query would give you the set of matches for these columns in both tables:
select col1, col2, col3
from huge_table_1
INTERSECT
select col1, col2, col3
from huge_table_2
The MINUS operator would give you the rows in huge_table_1 which aren't in huge_table_2. Swap the tables to get the opposite set.
select col1, col2, col3
from huge_table_1
MINUS
select col1, col2, col3
from huge_table_2
Embrace the Joy of Sets!
"we are first comparing the name in huge_table_1 should not be equal
to name in huge_table_2. Then we are comparing city in huge_table_1
should be equal to city in huge_table_2 and then finally we are
comparing date_of_birth in huge_table_1 should be with in +-1 year
range of date_of-birth in huge_table_2"
Hmmm. Starting off with an inequality is usually bad, especially in large tables. Chances are you will have lots of non-matching names among the rows that satisfy the other criteria. But you could try something like this:
select * from huge_table_1 ht1
where exists
      ( select null from huge_table_2 ht2
        where ht2.city = ht1.city
        and ht1.date_of_birth between add_months(ht2.date_of_birth, -12)
                                  and add_months(ht2.date_of_birth, 12)
        and ht2.name != ht1.name )
/
Select data from both tables, sorted by the key fields, then iterate them in parallel and compare. The comparison itself should be fast, so the total run time should be only slightly more than the sum of the run times of the two ordered queries.
UPDATE
New information shows that a partial cross-join of the data is desired:
left.name <> right.name
left.city = right.city
abs(left.birthDate - right.birthDate) <= 1 year
So, given that there is one equality test, you can process the data in chunks, where a chunk is all records with the same city.
Comparison will progress as follows:
Select data from both tables, sorted by city.
Iterate both result sets in parallel.
Load all records from one result set (left) with the next city, i.e. load the next chunk. Store them in memory in a TreeMap<LocalDate, List<Person>>.
Iterate all records from the other result set (right) with the same city, i.e. process the chunk.
For each record in right, find records within 1 year of birthDate by calling subMap(), like this:
Collection<List<Person>> coll =
    leftTree.subMap(right.birthDate.minusYears(1), true,
                    right.birthDate.plusYears(1), true)
            .values();
Iterate those records and skip the ones with the same name. These are the left records that "match" the given right record.
If needed, you can flatten the result and filter the names using a stream:
List<Person> matches = coll.stream()
    .flatMap(List::stream)
    .filter(p -> !p.name.equals(right.name))
    .collect(Collectors.toList());
Optionally replacing the collect() with the actual processing logic.
When done processing the chunk as described in step 4, i.e. when you see the next city, clear the TreeMap, and repeat from step 3 for the next chunk, aka city.
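To tie the steps together, here is a rough JDBC sketch of the whole chunking loop. It is only an illustration under assumptions: the Person record, the column names, the connection URL, and the per-match processing are placeholders, not part of the original answer.

import java.sql.*;
import java.time.LocalDate;
import java.util.*;
import java.util.stream.Collectors;

public class ChunkedMatcher {

    // Minimal value object; field names are assumptions for this sketch.
    record Person(String name, String city, LocalDate birthDate) {}

    public static void main(String[] args) throws SQLException {
        // Step 1: both queries MUST be ordered by city so the chunks line up.
        // (This also assumes the DB sort order for city agrees with String.compareTo.)
        String sql = "select name, city, date_of_birth from %s order by city";
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//host/db", "user", "pw");
             ResultSet left = con.createStatement().executeQuery(String.format(sql, "huge_table_1"));
             ResultSet right = con.createStatement().executeQuery(String.format(sql, "huge_table_2"))) {

            Person l = next(left);
            Person r = next(right);
            while (l != null && r != null) {                // step 2: iterate both result sets in parallel
                int cmp = l.city().compareTo(r.city());
                if (cmp < 0) { l = next(left); continue; }  // city only on the left: no match possible
                if (cmp > 0) { r = next(right); continue; } // city only on the right: no match possible

                // Step 3: load the whole left chunk for this city into a TreeMap keyed by birth date.
                String city = l.city();
                TreeMap<LocalDate, List<Person>> leftTree = new TreeMap<>();
                while (l != null && l.city().equals(city)) {
                    leftTree.computeIfAbsent(l.birthDate(), d -> new ArrayList<>()).add(l);
                    l = next(left);
                }

                // Step 4: stream the right chunk for the same city against the TreeMap.
                while (r != null && r.city().equals(city)) {
                    Person current = r;
                    List<Person> matches = leftTree
                            .subMap(current.birthDate().minusYears(1), true,
                                    current.birthDate().plusYears(1), true)
                            .values().stream()
                            .flatMap(List::stream)
                            .filter(p -> !p.name().equals(current.name()))
                            .collect(Collectors.toList());
                    // ... replace with the actual processing of (current, matches) ...
                    r = next(right);
                }
                // The TreeMap goes out of scope here, so only one city chunk is held in memory at a time.
            }
        }
    }

    private static Person next(ResultSet rs) throws SQLException {
        return rs.next()
                ? new Person(rs.getString("name"), rs.getString("city"),
                             rs.getObject("date_of_birth", LocalDate.class)) // JDBC 4.2 DATE -> LocalDate
                : null;
    }
}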
Advantages of this logic:
Data is only sent once from the database server, i.e. the repetition of data caused by the partial cross-join is eliminated from the relatively slow data link.
The two queries can be sourced from two different databases, if needed.
Memory footprint is kept down, by only retaining data for one city of one of the queries at a time (chunk of left).
Matching logic can be multi-threaded, if needed, for extra performance, e.g.
Thread 1 loads left chunk into TreeMap, and gives it to thread 2 for processing, while thread 1 begins loading next chunk.
Thread 2 iterates right and finds matching records by calling subMap(), iterating the submap, giving matching left and right records to thread 3 for processing.
Thread 3 processes a matching pair.

Optimize oracle query with IN clause

I have two queries where I am using IN parameters and populating the PreparedStatement using setLong and setString operations.
Query 1
SELECT A, B FROM TABLE1 WHERE A in (SELECT A FROM TABLE2 WHERE C in (?,?,?) )
Query 2
SELECT A, B FROM TABLE1 WHERE A in (?,?)
I am being told that it creates a unique query for each possible set size and pollutes Oracle's SQL cache. Also, Oracle could choose different execution plans for each query here, as the size is not fixed.
What optimizations could be applied to make it better?
Would it be fine if I create an in-clause list of size 50 and populate the remaining slots with dummy/redundant values?
If I am not wrong, the select statement inside the in-clause will be difficult to optimize unless it is extracted out and reused as a separate statement.
I am being told that it creates a unique query for each possible set size and pollutes Oracle's SQL cache.
This is correct, assuming that the number of items in the IN list can change between requests. If the number of question marks inside the IN list remains the same, there would be no "pollution" of the cache.
Also, Oracle could choose different execution plans for each query here, as the size is not fixed.
That is correct, too. It's a good thing, though.
What optimizations could be applied to make it better? Would it be fine if I create in-clause list of size 50 and populate remaining ones using dummy/redundant variables?
Absolutely. I have used this trick many times: rather than generating a list of the exact size, I generated lists whose length was divisible by a certain number (I used 16, but 50 is also fine). If the size of the actual list wasn't divisible by 16, I repeated the last item as many times as required to reach the next multiple.
The only optimization this achieves is the reduction of items in the cache of query plans.
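For illustration, here is a hedged sketch of that padding trick with a plain PreparedStatement. The table and column names mirror Query 2 above, but the pad helper and the chunk size of 16 are just assumptions.

import java.sql.*;
import java.util.*;

public class PaddedInList {

    // Pads the bind values so the number of placeholders is always a multiple of 16,
    // repeating the last value; this keeps the number of distinct SQL texts (and plans) small.
    // Assumes the list is non-empty.
    static List<Long> pad(List<Long> values, int multiple) {
        List<Long> padded = new ArrayList<>(values);
        while (padded.size() % multiple != 0) {
            padded.add(values.get(values.size() - 1));
        }
        return padded;
    }

    static void query(Connection con, List<Long> ids) throws SQLException {
        List<Long> padded = pad(ids, 16);
        String placeholders = String.join(",", Collections.nCopies(padded.size(), "?"));
        String sql = "SELECT A, B FROM TABLE1 WHERE A IN (" + placeholders + ")";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (int i = 0; i < padded.size(); i++) {
                ps.setLong(i + 1, padded.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... process rs.getLong("A") and rs.getString("B") ...
                }
            }
        }
    }
}

The duplicated bind values do not change the result set of an IN query, so the padding is safe; only the number of distinct statement texts goes down.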

Better to query once, then organize objects based on returned column value, or query twice with different conditions?

I have a table which I need to query, then organize the returned objects into two different lists based on a column value. I can either query the table once, retrieving the column by which I would differentiate the objects and arrange them by looping through the result set, or I can query twice with two different conditions and avoid the sorting process. Which method is generally better practice?
MY_TABLE
NAME AGE TYPE
John 25 A
Sarah 30 B
Rick 22 A
Susan 43 B
Either SELECT * FROM MY_TABLE, then sort in code based on returned types, or
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'A' followed by
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'B'
Logically, a DB query from Java code will be more expensive than a loop within the code, because querying the DB involves several steps: connecting to the DB, creating the SQL query, firing the query, and getting the results back.
Besides, something can go wrong between firing the first and the second query.
With a single optimized query and a loop in the code, you can save a lot of time compared to firing two queries.
In your case, you can sort in the query itself if it helps:
SELECT * FROM MY_TABLE ORDER BY TYPE
In the future, if more types are added to your table, you will not need to fire an additional query to retrieve them.
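As an illustration of the single-query approach, here is a hedged JDBC sketch that runs one ordered query and splits the rows into per-type lists in one pass; the Row record is an assumption, while the table and columns mirror the example above.

import java.sql.*;
import java.util.*;

public class SplitByType {

    record Row(String name, int age, String type) {}

    static Map<String, List<Row>> fetchGroupedByType(Connection con) throws SQLException {
        Map<String, List<Row>> byType = new HashMap<>();
        String sql = "SELECT NAME, AGE, TYPE FROM MY_TABLE ORDER BY TYPE";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                Row row = new Row(rs.getString("NAME"), rs.getInt("AGE"), rs.getString("TYPE"));
                // One pass over the single result set; each row lands in the list for its TYPE.
                byType.computeIfAbsent(row.type(), t -> new ArrayList<>()).add(row);
            }
        }
        return byType;
    }
}

byType.get("A") and byType.get("B") then give the two lists, and any types added later are picked up without an extra query.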
It is heavily dependent on the context. If each list is really huge, I would let the database do the hard part of the job with two queries. At the other extreme, in a web application using a farm of application servers and a central database, I would use a single query.
For the general use case, IMHO, I would save database resources, since the database is a common point of congestion, and use only one query.
The only objective argument I can find is that splitting the list occurs in memory with a very simple algorithm in a single JVM, whereas each query requires a bit of initialization and may involve disk access or loading of index pages.
In general, one query performs better.
Also, by issuing two queries you can potentially get inconsistent results (which may be fixed with a higher transaction isolation level, though).
In any case, I believe you still need to iterate through the result set (either directly or by using a framework's methods that return collections).
From the database point of view, you optimally have exactly one statement that fetches exactly everything you need and nothing else. Therefore, your first option is better. But don't generalize that answer in way that makes you query more data than needed. It's a common mistake for beginners to select all rows from a table (no where clause) and do the filtering in code instead of letting the database do its job.
It also depends on your data volume. For instance, with a large data set, doing a select * without any condition might take some time, but if you have an index on your 'TYPE' column, adding a where clause will reduce the time taken to execute the query. If you are dealing with a small data set, doing a select * followed by your logic in the Java code is the better approach.
There are four main bottlenecks involved in querying a database.
The query itself - how long the query takes to execute on the server depends on indexes, table sizes etc.
The data volume of the results - there could be hundreds of columns or huge fields and all this data must be serialised and transported across the network to your client.
The processing of the data - java must walk the query results gathering the data it wants.
Maintaining the query - it takes manpower to maintain queries, simple ones cost little but complex ones can be a nightmare.
By careful consideration it should be possible to work out a balance between all four of these factors - it is unlikely that you will get the right answer without doing so.
You can query by two conditions:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B'
This will do both for you at once, and if you want them sorted, you could do the same, but just add an order by keyword:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B' ORDER BY TYPE ASC
This will sort the results by type, in ascending order.
EDIT:
I didn't notice that originally you wanted two different lists. In that case, you could just do this query, and then find the index where the type changes from 'A' to 'B' and copy the data into two arrays.

What is the best way to match over 10000 different elements in database?

Ok here's my scenario:
Programming language: Java
I have a MYSQL database which has around 100,000,000 entries.
I have a list of values in memory, say valueList, with around 10,000 entries.
I want to iterate through valueList and check whether each value in this list, has a match in the database.
This means I have to make at least 10,000 database calls, which is highly inefficient for my application.
Another way would be to load the entire database into memory once and then do the comparison in memory. This is fast but needs a huge amount of memory.
Could you guys suggest a better approach for this problem?
EDIT :
Suppose valueList consists of values like :
{"New","York","Brazil","Detroit"}
From the database, I'll have a match for Brazil and Detroit, but not for New and York, though New York would have matched. So the next step is, for any remaining unmatched values, to combine them and see if they match now. In this case, I combine New and York and then find the match.
In the approach I was following before (one database call per value), this was possible. But with the approach of creating a temp table, this won't be possible.
You could insert the 10k records in a temporary table with a single insert like this
insert into tmp_table (id_col)
values (1),
(3),
...
(7);
Then join the two tables to get the desired results.
I don't know your table structure, but it could be like this
select s.*
from some_table s
inner join tmp_table t on t.id_col = s.id
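Here is a hedged JDBC sketch of that idea, using a batched insert instead of one giant VALUES list. tmp_table, id_col, and the join column are placeholders taken from the snippets above, and the values are bound as strings to match the example list.

import java.sql.*;
import java.util.List;

public class TempTableMatch {

    static void findMatches(Connection con, List<String> valueList) throws SQLException {
        con.setAutoCommit(false);

        // Fill the temporary table with the in-memory values in batches,
        // so only a handful of round trips are needed for the 10k entries.
        try (PreparedStatement ins = con.prepareStatement("insert into tmp_table (id_col) values (?)")) {
            int count = 0;
            for (String value : valueList) {
                ins.setString(1, value);
                ins.addBatch();
                if (++count % 1000 == 0) {
                    ins.executeBatch();
                }
            }
            ins.executeBatch(); // flush the remainder
        }

        // One single join instead of 10,000 individual lookups.
        String sql = "select s.* from some_table s inner join tmp_table t on t.id_col = s.id";
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // ... collect the matched rows ...
            }
        }
        con.commit();
    }
}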

How to manage consecutive column values in table rows

A little presentation for what I want to do:
Consider the case where people from a firm get, once a year, an all-expenses-paid trip somewhere. There may be 1000 persons who could qualify for the trip, but only 16 places are available.
Each of these 16 spots has an associated index, which must be from 1 to 16. The ones on the reservation list have indexes starting from 17.
The first 16 persons that apply get a definite spot on the trip. The rest end up on the reservation list. If one of the first 16 persons cancels, the first person with a reservation gets his place and all the indexes are renumbered to compensate for the person that canceled.
All of this is managed in a Java web app with an Oracle DB.
Now, my problem:
I have to manage the index in a correct way (all sequential, no duplicate indexes), with possible hundreds of people that simultaneously apply for the trip.
When inserting a record in the table for the trip, the way of getting the index is by
SELECT MAX(INDEX_NR) + 1 AS NEXT_INDEX_NR FROM TABLE
and using this as the new index (this is done Java side and then a new query to insert the record). It is obvious why we have multiple spots or reservations with the same index. So, we get, let’s say, 19 people on the trip because 4 of them have index 10, for example.
How can I manage this? I have been thinking of 3 ways so far:
Use an isolation level of Serializable for the DB transactions (don’t like this one);
Insert a record with no INDEX_NR and then have a trigger manage the things… in some way (never worked with triggers before);
Each record also has an UPDATED column. Could I use this in some way? (Note that I can't lose INDEX_NR, since other parts of the app make use of it.)
Is there a best way to do this?
Why make it complicated?
Just insert all reservations as they are entered, together with a timestamp of when they reserved a spot.
Then in your query, just use the timestamp to sort them.
There is of course a chance that two people reserved a spot in the very same millisecond; in that case just use a random method to assign the order.
Why do you need to explicitly store the index? Instead you could store each person's order (which never changes) along with an active flag. In your example if person #16 pulls out you simply mark them as inactive.
To compute whether a person qualifies for the trip, you simply count the number of active people with a lower order than that person; the person gets a spot if that count is less than 16:
select count(*)
from CompetitionEntry
where PersonOrder < :thisPersonOrder
and Active = 1
This approach removes the need for bulk updates to the database (you only ever update one row) and hence mostly mitigates your problem of transactional integrity.
Another way would be to explicitly lock a record on another table on the select.
-- Initial Setup
CREATE TABLE NUMBER_SOURCE (ID NUMBER(4));
INSERT INTO NUMBER_SOURCE (ID) VALUES (0);
-- Your regular code
SELECT ID AS NEXT_INDEX_NR FROM NUMBER_SOURCE FOR UPDATE; -- lock!
UPDATE NUMBER_SOURCE SET ID = ID + 1;
INSERT INTO TABLE ....
COMMIT; -- releases lock!
No other transaction will be able to perform the query on the table NUMBER_SOURCE until the commit (or rollback).
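A hedged JDBC sketch of that locking pattern; the trip table, its columns, and the error handling are assumptions, and the +1 is applied on the Java side so that indexes start at 1 with the initial counter value of 0.

import java.sql.*;

public class IndexAllocator {

    // Allocates the next INDEX_NR under a row lock on NUMBER_SOURCE, so two
    // concurrent applicants can never be handed the same value.
    static int reserveSpot(Connection con, long personId) throws SQLException {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement()) {
            int nextIndex;
            // Lock the single counter row until commit/rollback.
            try (ResultSet rs = st.executeQuery("SELECT ID FROM NUMBER_SOURCE FOR UPDATE")) {
                rs.next();
                nextIndex = rs.getInt(1) + 1;
            }
            st.executeUpdate("UPDATE NUMBER_SOURCE SET ID = ID + 1");

            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO TRIP_APPLICATION (PERSON_ID, INDEX_NR) VALUES (?, ?)")) {
                ins.setLong(1, personId);
                ins.setInt(2, nextIndex);
                ins.executeUpdate();
            }
            con.commit(); // releases the lock
            return nextIndex;
        } catch (SQLException e) {
            con.rollback(); // also releases the lock
            throw e;
        }
    }
}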
When adding people to the table, give them an ID in such a way that the ID is ascending in the order in which they were added. This can be a timestamp.
Select all the records from the table which qualify, order by ID, and update their INDEX_NR
Select * from table where INDEX_NR <= 16 order by INDEX_NR
Step #2 seems complicated but it's actually quite simple:
update (
select *
from TABLE
where ...
order by ID
)
set INDEX_NR = INDEXSEQ.NEXTVAL
Don't forget to reset the sequence to 1.
Calculate your index in runtime:
CREATE OR REPLACE VIEW v_person
AS
SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS index_rn
FROM t_person
CREATE OR REPLACE TRIGGER trg_person_ii
INSTEAD OF INSERT ON v_person
BEGIN
INSERT
INTO t_person (id, name)
VALUES (:new.id, :new.name);
END;
CREATE OR REPLACE TRIGGER trg_person_iu
INSTEAD OF UPDATE ON v_person
BEGIN
UPDATE t_person
SET id = :new.id,
name = :new.name
WHERE id = :old.id;
END;
CREATE OR REPLACE TRIGGER trg_person_id
INSTEAD OF DELETE ON v_person
BEGIN
DELETE
FROM t_person
WHERE id = :old.id;
END;
INSERT
INTO v_person
VALUES (1, 'test', 1)
SELECT *
FROM v_person
--
id name index_rn
1 test 1
INSERT
INTO v_person
VALUES (2, 'test 2', 1)
SELECT *
FROM v_person
--
id name index_rn
1 test 1
2 test 2 2
DELETE
FROM v_person
WHERE id = 1
SELECT *
FROM v_person
--
id name index_rn
2 test 2 1
"I have to manage the index in a correct way (all sequential, no duplicate indexes), with possible hundreds of people that simultaneously apply for the trip.
When inserting a record in the table for the trip, the way of getting the index is by
SELECT MAX(INDEX_NR) + 1 AS NEXT_INDEX_NR FROM TABLE
and using this as the new index (this is done Java side and then a new query to insert the record). It is obvious why we have multiple spots or reservations with the same index."
Yeah. Oracle's MVCC ("snapshot isolation") used incorrectly by someone who shouldn't have been in IT to begin with.
Really, Peter is right. Your index number is, or rather should be, a sort of "ranking number" over the ordered timestamps that he mentions (this does require that the DBMS can guarantee that any timestamp value appears only once in the entire database).
You say you are concerned with "regression bugs". I say: why do you need to be concerned with "regression bugs" in an application that is DEMONSTRABLY beyond curing? Because your bosses paid a lot of money for the crap they've been given and you don't want to be the messenger who gets shot for delivering the news?
The solution depends on what you have under your control. I assume that you can change both the database and the Java code, but refrain from modifying the database schema, since you would otherwise have to adapt too much Java code.
A cheap solution might be to add a uniqueness constraint on the pair (trip_id, index_nr), or just on index_nr if there is only one trip. Additionally, add a check constraint check(index_nr > 0), unless index_nr is already unsigned. Everything else is then done in Java: when inserting a new applicant as you describe, you have to add code that catches the exception raised when someone else got inserted concurrently. If a record is updated or deleted, you either have to live with holes between sequence numbers (by selecting the 16 candidates with the lowest index_nr, as shown by Quassnoi in his view) or fill them up by hand (similarly to what Aaron suggested) after every update/delete.
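For illustration, a hedged sketch of that insert-and-catch retry with JDBC; the table and column names are placeholders on top of the constraints described above, and auto-commit is assumed (otherwise roll back before retrying).

import java.sql.*;

public class ApplicantInserter {

    // Tries to grab the next free index_nr; if another session took it first,
    // the unique constraint on (trip_id, index_nr) fires and we simply retry.
    static int insertApplicant(Connection con, long tripId, long personId) throws SQLException {
        while (true) {
            int nextIndex;
            try (PreparedStatement max = con.prepareStatement(
                    "SELECT NVL(MAX(INDEX_NR), 0) + 1 FROM TRIP_APPLICATION WHERE TRIP_ID = ?")) {
                max.setLong(1, tripId);
                try (ResultSet rs = max.executeQuery()) {
                    rs.next();
                    nextIndex = rs.getInt(1);
                }
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO TRIP_APPLICATION (TRIP_ID, PERSON_ID, INDEX_NR) VALUES (?, ?, ?)")) {
                ins.setLong(1, tripId);
                ins.setLong(2, personId);
                ins.setInt(3, nextIndex);
                ins.executeUpdate();
                return nextIndex; // success: the constraint guarantees this index is unique
            } catch (SQLIntegrityConstraintViolationException e) {
                // Someone else inserted the same index_nr concurrently: loop and try again.
                // (Depending on the driver, you may need to check SQLState 23000 instead.)
            }
        }
    }
}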
If index_nr is mostly read-only in the application, a better solution might be to combine the answers of Peter and Quassnoi: store either a timestamp (inserted automatically by the database by defining the current time as the default) or an auto-incremented integer (likewise filled in by the database as a default) in the table, and use a view (like the one defined by Quassnoi) to access the table and the automatically calculated index_nr from Java. But also define both constraints as in the cheap solution.
