rewrite Hibernate query without huge list parameter - java

In my database I have a zip table with a code column. The user can upload a list of Zip codes and I need to figure out which ones are already in the database. Currently, I do this using the following Hibernate query (HQL):
select zip.code from Zip zip
where zip.code in (:zipCodes)
The value of the :zipCodes parameter is the list of codes uploaded by the user. However, in the version of Hibernate I'm using there's a bug which limits the size of such list parameters and on occasions we're exceeding this limit.
So I need to find another way to figure out which of the (potentially very long) list of Zip codes are already in the database. Here are a few options I've considered:
Option A
Rewrite the query using SQL instead of HQL. While this will avoid the Hibernate bug, I suspect the performance will be terrible if there are 30,000 Zip codes that need to be checked.
Option B
Split the list of Zip codes into a series of sub-lists and execute a separate query for each sub-list. Again, this will avoid the Hibernate bug, but performance will likely still be terrible.
Option C
Use a temporary table, i.e. insert the Zip codes to be checked into a temporary table, then join that to the zip table. It seems the querying part of this solution should perform reasonably well, but the creation of the temporary table and insertion of up to 30,000 rows will not. But perhaps I'm not going about it the right way; here's what I had in mind in pseudo-Java code:
/**
 * Indicates which of the given Zip codes are already in the database.
 *
 * @param zipCodes the zip codes to check
 * @return the Zip entities whose codes already exist in the database
 * @throws IllegalArgumentException if the list is null or empty
 */
List<Zip> validateZipCodes(List<String> zipCodes) {
    if (zipCodes == null || zipCodes.isEmpty()) {
        throw new IllegalArgumentException("zipCodes must not be null or empty");
    }
    Session session = sessionFactory.getCurrentSession();
    Transaction tx = session.beginTransaction();
    try {
        // Temporary tables in Postgres are private to the creating session
        session.createSQLQuery(
                "CREATE TEMPORARY TABLE zip_tmp (code VARCHAR(255) NOT NULL)"
                + " ON COMMIT DELETE ROWS").executeUpdate();
        // Parameterized INSERT; concatenating the codes into the SQL string
        // would be vulnerable to SQL injection
        for (String code : zipCodes) {
            session.createSQLQuery("INSERT INTO zip_tmp (code) VALUES (:code)")
                   .setString("code", code)
                   .executeUpdate();
        }
        // Join the temporary table against the zip table
        @SuppressWarnings("unchecked")
        List<Zip> existing = session.createSQLQuery(
                "SELECT z.* FROM zip z JOIN zip_tmp zt ON z.code = zt.code")
                .addEntity(Zip.class)
                .list();
        return existing;
    } finally {
        // Roll back the transaction so that the temporary table is removed,
        // ensuring that concurrent invocations of this method do not
        // interfere with each other
        tx.rollback();
    }
}
Is there a more efficient way to implement this than in the pseudo-code above, or is there another solution that I haven't thought of? I'm using a Postgres database.

Load all the Zip codes in the database into a List, then on the user-input list of Zip codes do a removeAll(databaseList).
Problem solved!
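A minimal sketch of that idea, assuming a hypothetical zipDao.findAllCodes() helper that returns every code currently in the zip table:
List<String> databaseList = zipDao.findAllCodes(); // all codes already stored (zipDao is hypothetical)
List<String> uploaded = new ArrayList<String>(userCodes);
uploaded.removeAll(databaseList);  // codes NOT yet in the database
userCodes.retainAll(databaseList); // codes already in the database
Note that removeAll leaves the codes missing from the database, while retainAll keeps the ones that already exist, which is what the question asks for.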

Suppose you "validate" 1000 codes against a table of 100000 records in which the code is the primary key and has a clustered index.
Option A is not an improvement; Hibernate is going to build the same SELECT ... IN ... you could write on your own.
Option B, as well as your current query, might fail to use the index.
Option D might be good if you are sure the zip codes don't change at arbitrary times, which is unlikely, or if you can recover from trying to process existing codes.
Option C (Creating a temp table, issuing 1000 INSERT statements and joining 1000 rows against 100000 in a single SELECT) isn't competitive with just issuing 1000 simple and index-friendly queries for a single new code each:
SELECT COUNT(*) FROM Zip WHERE Zip.code = :newCode
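A rough sketch of that loop, reusing one HQL query (variable names are illustrative):
Query query = session.createQuery(
        "select count(*) from Zip z where z.code = :newCode");
List<String> missing = new ArrayList<String>();
for (String code : zipCodes) {
    long hits = (Long) query.setParameter("newCode", code).uniqueResult();
    if (hits == 0) {
        missing.add(code); // not in the database yet
    }
}
Each iteration is a single-row, index-friendly lookup, which is the point of this approach.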

Option D:
Load all existing zip codes from the database (possibly with pagination) and do the comparison in your application.
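A minimal sketch of that paginated load, using Hibernate's setFirstResult/setMaxResults (the page size is arbitrary):
Set<String> existing = new HashSet<String>();
int pageSize = 1000;
int first = 0;
while (true) {
    @SuppressWarnings("unchecked")
    List<String> page = session
            .createQuery("select z.code from Zip z order by z.code")
            .setFirstResult(first)
            .setMaxResults(pageSize)
            .list();
    existing.addAll(page);
    if (page.size() < pageSize) break; // last page reached
    first += pageSize;
}
uploadedCodes.retainAll(existing); // what remains are the codes already stored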
Regarding your Option A:
I remember a limitation on the SQL query length, but that was on DB2; I don't know if there is a limit on PostgreSQL.

There are around 45,000 Zip codes in the US and they seem to be updated annually. If this is an annual job, don't write it in Java. Create a SQL script which loads the zip codes into a new table and write an insert statement like
insert into zip (code) select code from ziptemp where code not in (select code from zip)
Have your operations guys run this two-line SQL script once a year and don't busy yourself with this in the Java code. Plus, if you keep this out of Java, you can basically take any approach, because no one cares if this runs for thirty minutes at off-peak times.
divide et impera

Have you tried to use subqueries with IN?
http://docs.jboss.org/hibernate/orm/3.5/api/org/hibernate/criterion/Subqueries.html
It would be something like this:
DetachedCriteria dc = DetachedCriteria.forClass(Zip.class, "zz");
dc.setProjection(Projections.property("zz.code")); // the subquery must select the code column
// add further restrictions to dc as needed
Criteria c = session.createCriteria(Zip.class, "z");
c.add(Subqueries.propertyIn("code", dc));
Sorry if I got the code wrong, it's been a while since I last used Hibernate.

Related

Checking Huge Volume of records in Oracle DB - shellscript job times out

I am working on a daily automated job which checks the records in a text file against an Oracle database.
We receive the text file every day from an external team; it contains around 100,000 records. The text file is in Unix format and has 6 columns separated by the | symbol.
eg,
HDR 1
home/sample/file|testvalue1|testvalue2|testval3|testval4|testval5
TRL
I need to check whether the values in testval3 and testval5 exist in my table in the Oracle database. The table has around 10 million records.
I am currently processing it through a shell script. Inside the shell script, I read the text file and iterate through each line in a loop. Inside the loop I pass the values from each line and run the query against the DB. If the records don't exist in the DB, I have to output them to a CSV file. The query below is used:
select 'testval3','testval5' from dual
where not exists (select primarykeycolumn
from mytable where mycolumn='testval3' and mycolumn2='testval5')
Since the input file has 100,000 entries, my loop will run the query 100,000 times, and every time it will check the table with 10 million records. This makes my batch job run for many hours and I have to terminate it. Is there a better way to handle this situation? I can also use Java if there is no better way to do this via shell script.
Below is a simple solution that avoids the timeout and means you don't need to scan the millions of records 100K times.
One-time setup:
Create a staging table:
create table a_staging_table(
testvalue1 varchar2(255),
testvalue2 varchar2(255),
testval3 varchar2(255),
testval4 varchar2(255),
testval5 varchar2(255)
);
----Recurring process
Load your CSV/text data into the staging table.
some_file_name.ctl: this file contains the load data command below:
load data
INFILE 'home/sample/file.csv'
APPEND
INTO TABLE a_staging_table
FIELDS TERMINATED BY '|'
(testvalue1,testvalue2,testval3,testval4,testval5)
Now, run SQL*Loader to load your data into the staging table:
sqlldr userid=dbUserName/dbUserPassword control=some_file_name.ctl log=some_file_name.log
Your data is now loaded into the staging table. Join the staging table with your_original_table to identify the records that do not exist.
First Way:
Spool the output from the below SQL using SQL*PLUS:
select s.testval3,testval5
from (select distinct testval3,testval5
from a_staging_table) s
where not exists
(select 1
from your_original_table
where mycolumn1=s.testval3
and mycolumn2=s.testval5);
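If you would rather finish the job in Java once the data is staged (the question allows it), here is a hedged JDBC sketch of the same anti-join that writes the non-matching pairs to a CSV file (connection details and file names are illustrative):
String sql = "select distinct s.testval3, s.testval5 from a_staging_table s"
        + " where not exists (select 1 from your_original_table t"
        + " where t.mycolumn1 = s.testval3 and t.mycolumn2 = s.testval5)";
try (Connection conn = DriverManager.getConnection(jdbcUrl, dbUser, dbPassword);
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(sql);
     PrintWriter csv = new PrintWriter(new FileWriter("missing_values.csv"))) {
    while (rs.next()) {
        csv.println(rs.getString(1) + "," + rs.getString(2));
    }
}
This runs the 10-million-row table scan exactly once instead of 100,000 times.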
Second Way:
Begin
for x in (
select s.testval3,testval5
from (select distinct testval3,testval5
from a_staging_table) s
where not exists
(select 1
from your_original_table
where mycolumn1=s.testval3
and mycolumn2=s.testval5)
) loop
DBMS_OUTPUT.put_line('testval3: '||x.testval3 || ' ------ '||'testval5: '||x.testval5);
--write all these values into another file saying that these are not matching values, using UTL_FILE.
--Then finally truncate the table "a_staging_table"
--so that this data will not be available next time, and the process can run again with a different file
end loop;
end;
/
A fast way to do this would be to gather all the available testval3 and testval5 combinations from the table at the beginning of the script and store them in a hashtable or similar structure, so that when you read each line you can easily query a local in-memory data structure.
It will use more memory, for sure, but it will run a single validation query and speed the program up many times.
The query to run would be select distinct mycolumn,mycolumn2 from mytable or an equivalent.
See
How do I (or can I) SELECT DISTINCT on multiple columns? and
ORACLE Select Distinct return many columns and where
So, in summary, the mechanism I'm proposing is:
Run a query to select all the distinct pairs of testval3 and testval5 from the table
Create a hash table and a specific Pair structure that you store on it. For example, in Java you could use A Java collection of value pairs? (tuples?) and then, if the types of your columns are strings, use something like
HashMap<Pair<String,String>, Boolean> pairMap
Make sure to implement the hashCode and equals methods as in the other example answer so you can use the pairs properly as keys on the map (if using Java or similar)
Store the result of the query in your hash table, having the pairs of testval3 and testval5 as keys on the table and true as values (how to iterate over the resultset is left as an exercise for the reader):
pairMap.put(new Pair<String,String>(testval3, testval5), true);
Read the file line by line
For each line, find whether a pair testval3 and testval5 exists in your hash table; if it doesn't, then output the line to the CSV. To do that, you can simply query your map and check for null (Key existence check in HashMap)
For example:
if (pairMap.get(new Pair<String,String>(testval3,testval5)) == null) {
//output to CSV
}
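Putting the steps together, a hedged end-to-end sketch; the Pair class below is illustrative (it implements the equals/hashCode contract mentioned above), and the column indexes follow the 6-column file layout from the question:
// Illustrative Pair with the equals/hashCode contract required for map keys
final class Pair<A, B> {
    final A first; final B second;
    Pair(A first, B second) { this.first = first; this.second = second; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Pair)) return false;
        Pair<?, ?> p = (Pair<?, ?>) o;
        return first.equals(p.first) && second.equals(p.second);
    }
    @Override public int hashCode() { return 31 * first.hashCode() + second.hashCode(); }
}

// 1) Load all distinct pairs from the table once
Map<Pair<String, String>, Boolean> pairMap = new HashMap<Pair<String, String>, Boolean>();
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("select distinct mycolumn, mycolumn2 from mytable");
while (rs.next()) {
    pairMap.put(new Pair<String, String>(rs.getString(1), rs.getString(2)), true);
}

// 2) Stream the file, outputting lines whose pair is unknown
BufferedReader in = new BufferedReader(new FileReader("home/sample/file"));
PrintWriter csv = new PrintWriter(new FileWriter("missing.csv"));
String line;
while ((line = in.readLine()) != null) {
    String[] cols = line.split("\\|");
    if (cols.length < 6) continue; // skip HDR/TRL records
    Pair<String, String> key = new Pair<String, String>(cols[3], cols[5]);
    if (pairMap.get(key) == null) {
        csv.println(cols[3] + "," + cols[5]);
    }
}
csv.close();
in.close();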
Finally, another option, as @Kaushik Nayak says and @Vivek implies in the comments, is to load the file into Oracle with its data loading tools and then run a single query for the values that don't exist.

Decode in SQL vs. If... Else in Java

I'm looking for a solution to a simple scenario. I need to check if a value is present in a table; if present I need 'Y', else 'N'.
I can do it in two ways: either fetch the count of rows from the database and code the logic in Java, or use DECODE(COUNT(*),0,'N','Y')
Which is better? Is there any advantage of one over the other? Or more specifically, is there any disadvantage of using DECODE() instead of doing it in Java?
The database I have is DB2.
You should use exists. I would tend to do this as:
select (case when exists (select 1 from . . . .)
then 'Y' else 'N'
end) as flag
from sysibm.sysdummy1;
The reason you want to use exists is because it is faster. When you use count(*), the SQL engine has to process all the (appropriate) data to get the count. With exists, it can stop at the first one.
The reason to prefer case over decode() is that the former is ANSI standard SQL, available in basically all databases.
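For completeness, a hedged sketch of reading that flag from Java over JDBC (the table and column names are illustrative):
String sql = "select case when exists"
        + " (select 1 from mytable where mycolumn = ?)"
        + " then 'Y' else 'N' end as flag"
        + " from sysibm.sysdummy1";
PreparedStatement ps = conn.prepareStatement(sql);
ps.setString(1, valueToCheck);
ResultSet rs = ps.executeQuery();
rs.next();                          // exactly one row comes back
String flag = rs.getString("flag"); // 'Y' or 'N'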
There shouldn't be any considerable difference between those two ways that you mentioned.
1) The DECODE will be simple and the IF will be simple.
2) You will be receiving an integer versus a CHAR(1) - which is not a significant difference.
So, I would consider another aspect: which of the two will make your code CLEARER?
And one more thing: if this is the ONLY thing that you're selecting in that query, you could try something like:
SELECT 'Y' FROM DUAL WHERE EXISTS (SELECT 1 FROM YOURTABLE WHERE YOURCONDITION = 1); --Oracle SQL - but should be fairly easy to translate it to DB2
This is an option to avoid making the DB count every occurrence of your condition just to check whether any exists.
Aggregate functions like count can be optimized with MQTs (Materialized Query Tables):
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/
connect to sample
alter table employee add unique (empno)
alter table department add unique (deptno)
create table count_emp_dpto_1 as (select d.deptno, e.empno, count(*) as cnt from employee e, department d where d.deptno = 1 and e.workdept = d.deptno group by d.deptno, e.empno) data initially deferred refresh immediate
set integrity for count_emp_dpto_1 immediate checked not incremental
select * from count_emp_dpto_1
connect reset

Better to query once, then organize objects based on returned column value, or query twice with different conditions?

I have a table which I need to query, then organize the returned objects into two different lists based on a column value. I can either query the table once, retrieving the column by which I would differentiate the objects and arrange them by looping through the result set, or I can query twice with two different conditions and avoid the sorting process. Which method is generally better practice?
MY_TABLE
NAME AGE TYPE
John 25 A
Sarah 30 B
Rick 22 A
Susan 43 B
Either SELECT * FROM MY_TABLE, then sort in code based on returned types, or
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'A' followed by
SELECT NAME, AGE FROM MY_TABLE WHERE TYPE = 'B'
Logically, a DB query from Java code will be more expensive than a loop within the code, because querying the DB involves several steps: connecting to the DB, creating the SQL query, firing the query and getting the results back.
Besides, something can go wrong between firing the first and second query.
With an optimized single query and looping within the code, you can save a lot of time compared to firing two queries.
In your case, you can sort in the query itself if it helps:
SELECT * FROM MY_TABLE ORDER BY TYPE
In future, if more types are added to your table, you will not need to fire an additional query to retrieve them.
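A hedged sketch of this single-query-then-split approach with JDBC (Person is an illustrative holder class, not from the question):
List<Person> typeA = new ArrayList<Person>();
List<Person> typeB = new ArrayList<Person>();
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT NAME, AGE, TYPE FROM MY_TABLE ORDER BY TYPE");
while (rs.next()) {
    // Person is a hypothetical (name, age) holder
    Person p = new Person(rs.getString("NAME"), rs.getInt("AGE"));
    if ("A".equals(rs.getString("TYPE"))) {
        typeA.add(p);
    } else {
        typeB.add(p);
    }
}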
It is heavily dependent on the context. If each list is really huge, I would let the database do the hard part of the job with 2 queries. At the other extreme, in a web application using a farm of application servers and a central database, I would use one single query.
For the general use case, IMHO, I would save database resources, because the database is a common point of congestion, and use only one query.
The only objective argument I can find is that the splitting of the list occurs in memory with a very simple algorithm in a single JVM, whereas each query requires a bit of initialization and may involve disk access or loading of index pages.
In general, one query performs better.
Also, with two queries you can potentially get inconsistent results (which may be fixed with a higher transaction isolation level, though).
In any case, I believe you still need to iterate through the resultset (either directly or by using a framework's methods that return collections).
From the database point of view, you optimally have exactly one statement that fetches exactly everything you need and nothing else. Therefore, your first option is better. But don't generalize that answer in way that makes you query more data than needed. It's a common mistake for beginners to select all rows from a table (no where clause) and do the filtering in code instead of letting the database do its job.
It also depends on your data set volume. For instance, if you have a large data set, doing a select * without any condition might take some time, but if you have an index on your TYPE column, then adding a where clause will reduce the time taken to execute the query. If you are dealing with a small data set, then doing a select * followed by your logic in the Java code is the better approach.
There are four main bottlenecks involved in querying a database.
The query itself - how long the query takes to execute on the server depends on indexes, table sizes etc.
The data volume of the results - there could be hundreds of columns or huge fields and all this data must be serialised and transported across the network to your client.
The processing of the data - Java must walk the query results, gathering the data it wants.
Maintaining the query - it takes manpower to maintain queries, simple ones cost little but complex ones can be a nightmare.
By careful consideration it should be possible to work out a balance between all four of these factors - it is unlikely that you will get the right answer without doing so.
You can query by two conditions:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B'
This will do both for you at once, and if you want them sorted, you could do the same, but just add an order by keyword:
SELECT * FROM MY_TABLE WHERE TYPE = 'A' OR TYPE = 'B' ORDER BY TYPE ASC
This will sort the results by type, in ascending order.
EDIT:
I didn't notice that originally you wanted two different lists. In that case, you could just run this query, then find the index where the type changes from 'A' to 'B' and copy the data into two arrays.

Efficient way to check if record (from a large set of data) is existing in the Database (JPA/Hibernate)

We have a large set of data (bulk data) that needs to be checked if the record is existing in the database.
We are using SQL Server2012/JPA/Hibernate/Spring.
What would be an efficient or recommended way to check if a record exists in the database?
Our entity ProductCodes has the following fields:
private Integer productCodeId // this is the PK
private Integer refCode1 // ref code 1-5 has a unique constraint
private Integer refCode2
private Integer refCode3
private Integer refCode4
private Integer refCode5
... other fields
The service that we are creating will be given a file where each line is a combination of refCode1-5.
The task of the service is to check and report all lines in the file that are already existing in the database.
We are looking at approaching this in two ways.
Approach 1: The usual approach.
Loop through each line and call the DAO to query whether the refCode1-5 combination exists in the DB.
//pseudo code
for each line in the file
call dao, passing refCode1-5 to the query
(select * from ProductCodes where refCode1=? and refCode2=? and refCode3=? and refCode4=? and refCode5=?)
Given a large list of lines to check, this might be inefficient, since we will be invoking the DAO many times. If the file consists of, say, 1000 lines to check, this means 1000 round trips to the DB.
Approach 2: Query all records in the DB.
We will query all records in the DB
Create a hash map with concatenated refCode1-5 as keys
Loop through each line in the file, validating against the hashmap
We think this is more efficient in terms of DB connections since it will not create 1000 connections to the DB. However, if the DB table has, for example, 5000 records, then Hibernate/JPA will create 5000 entities in memory and probably crash the application.
We are thinking of going for the first approach since refCode1-5 has a unique constraint and will benefit from the implicit index.
But is there a better way of approaching this problem aside from the first approach?
Try something like a batch select statement for, say, 100 refCodes instead of doing a single select for each refCode.
Construct a query like
select <whatever you want> from <table> where ref_code in (.....)
Construct the select projection in a way that not only gives you what you want but also includes the details of ref_code. Then in code you can do a count or a multi-threaded scan of the resultset if the DB returned fewer refCodes than the number of codes you put in the query.
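A hedged sketch of that batching idea with JPA; refCode here is an illustrative single-column stand-in for the refCode1-5 combination, and the chunk size of 100 is arbitrary:
int chunkSize = 100;
Set<String> found = new HashSet<String>();
for (int i = 0; i < codes.size(); i += chunkSize) {
    List<String> chunk = codes.subList(i, Math.min(i + chunkSize, codes.size()));
    found.addAll(entityManager
            .createQuery("select p.refCode from ProductCodes p"
                       + " where p.refCode in :chunk", String.class)
            .setParameter("chunk", chunk)
            .getResultList());
}
// anything in 'codes' but not in 'found' does not yet exist in the database
This cuts 1000 round trips down to 10 while keeping each IN list small enough to be index-friendly.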
You can try to use the concat operator.
select <your cols> from <your table> where concat(refCode1, refCode2, refCode3, refCode4, refCode5) IN (<set of concatenation from your file>);
I think this will be quite efficient, and it may be worth trying to see whether pre-sorting the lines and playing with the number of concatenations taken each time brings you some benefit.
I would suggest you create a temp table in your application where all records from the file are initially stored with a batch save, and then run a query joining the new temp table and the productCodes table to do the filtering however you like. This way you are not locking the productCodes table many times to check individual rows, as SQL Server locks rows on select statements as well.

How to execute a SQL statement in Java with many values in a single variable in where in clause

I have to execute below query through JDBC call
select primaryid from data where name in ('abc', 'adc', 'anx');
The issue is that inside the in clause I have to pass 11,000 strings. Can I use a prepared statement here? Or can anyone suggest another solution? I don't want to execute the query once per record, as that consumes too much time. I need to run this query in very little time.
I am reading the strings from an XML file using DOMParser, and I am using a SQL Server DB.
I'm just wondering why you would need a manual set of 11,000 items where you need to specify each item. It sounds like you need to bring the data into a staging table
(surely it hasn't been selected from the UI..?), then join to that to get your desired resultset.
Using an IN clause with 11k literal values is a really bad idea - off the top of my head, I know one major RDBMS (Oracle) that doesn't support more than 1k values in the IN list.
What you can do instead:
create some kind of (temporary) table T_NAMES to hold your names; if your RDBMS doesn't support "real" (session-specific) temporary tables, you'll have to add some kind of session ID
fill this table with the names you're looking for
modify your query to use the temporary table instead of the IN list: select primaryid from data where name in (select name from T_NAMES where session_id = ?session_id) or (probably even better) select primaryid from data join t_names on data.name = t_names.name and t_names.session_id = ?session_id (here, ?session_id denotes the bind variable used to pass your session id); a JDBC sketch of this flow follows
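A hedged JDBC sketch of that flow for SQL Server, where #names is a connection-private temporary table (so no session-ID column is needed; names are illustrative):
// 1) Connection-private temp table (dropped automatically when the connection closes)
Statement ddl = conn.createStatement();
ddl.execute("CREATE TABLE #names (name VARCHAR(255) NOT NULL)");
// 2) Fill it with a batched, parameterized insert
PreparedStatement ins = conn.prepareStatement("INSERT INTO #names (name) VALUES (?)");
for (String name : names) {
    ins.setString(1, name);
    ins.addBatch();
}
ins.executeBatch();
// 3) Join instead of an 11,000-element IN list
ResultSet rs = ddl.executeQuery(
        "SELECT d.primaryid FROM data d JOIN #names n ON d.name = n.name");
while (rs.next()) {
    // process rs.getLong("primaryid")
}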
A prepared statement will need to know the number of arguments in advance - something along the lines of :
PreparedStatement stmt = conn.prepareStatement(
        "select id, name from users where id in (?, ?, ?)");
stmt.setInt(1, id1); // setInt takes the parameter index and the value to bind
stmt.setInt(2, id2);
stmt.setInt(3, id3);
11,000 is a large number of parameters. It may be easiest to use a 'batch' approach as described here (in summary: looping over your parameters, using a prepared statement each time).
Note - if your 11,000 strings are the result of an earlier database select, then the best approach is to write a stored procedure to do the whole calculation in the database (avoiding passing the 11,000 strings back and forth with your code)
You can merge all your parameter strings into one big string, separated by the ';' char:
bigStrParameter=";abc;adc;anx;"
And use LOCATE to find the substring (LOCATE returns 0 when there is no match, so the test must be > 0):
select primaryid from data where LOCATE(concat(';',name,';'),?) > 0;
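A hedged sketch of building that parameter on the Java side (note that on SQL Server, which the question mentions, the equivalent of LOCATE is CHARINDEX):
// Build ";abc;adc;anx;" so every name is wrapped by separators
StringBuilder big = new StringBuilder(";");
for (String name : names) {
    big.append(name).append(';');
}
PreparedStatement ps = conn.prepareStatement(
        "select primaryid from data where LOCATE(concat(';', name, ';'), ?) > 0");
ps.setString(1, big.toString());
ResultSet rs = ps.executeQuery();
Be aware this forces a full scan of the data table, since the match is computed per row rather than via an index.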
