Partitioner for result set from Oracle database in Spring Batch - Java

I need to extract the results of a query into a flat file. Is there a way to partition the result set so that it can be accessed by multiple threads?
I tried partitioning based on ROWNUM without sorting, but when the same query is executed by multiple threads the ROWNUM values do not stay the same (I am not sorting because of the performance impact), which creates duplicates in the output.

Use ORA_HASH to split rows into deterministic buckets:
select *
from
(
select level, ora_hash(level, 2) bucket
from dual
connect by level <= 10
)
where bucket = 2;
LEVEL BUCKET
----- ------
1 2
3 2
6 2
10 2
It's a 0-based number. Use bucket = 0 and bucket = 1 to get the other two sets.
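Applied to the real source table, each partition then runs the same query with its own bucket value bound to it. A minimal sketch, assuming a source table named ORDERS with primary key ORDER_ID and four partitions (the bucket number would be placed in each partition's step execution context by the Spring Batch partitioner):
-- ora_hash(expr, max_bucket) returns 0..max_bucket, so four partitions use buckets 0..3
-- and every row lands in exactly one partition
select *
from orders
where ora_hash(order_id, 3) = :bucket;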

Use ROWID instead. ROWID is immutable for every record. Or just use the primary key (or any other field with enough distinct values, for that matter) to divide the data into subsets.
select *
from table
where SUBSTR(ROWIDTOCHAR(ROWID),-1) IN ('A','a','0');
select *
from table
where SUBSTR(ROWIDTOCHAR(ROWID),-1) IN ('B','b','1');
or
select *
from table
where SUBSTR(ROWIDTOCHAR(ROWID),-1) between 'A' and 'Z';
etc.
You'll have to experiment a little with the WHERE clause. As far as I know the last character of a ROWID can contain [A-Z], [a-z], [0-9], + and /.
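For the primary-key variant, a simple deterministic split is MOD over the key. A sketch, assuming a numeric primary key ID and four threads (names are illustrative):
-- each thread binds its own value 0..3 for :partition
select *
from my_table
where mod(id, 4) = :partition;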


Algorithm or SQL: find WHERE conditions on a set of columns which ensure the result set always has a value > 0 in a particular column

I am working on a Java/Oracle based project where I am stuck with a problem that seems to require an analytic solution.
I am looking for a solution, either a SQL query, an algorithm, or any free analytic tool, that I can follow to get the desired results.
Problem statement:
Let us say I have the table below with columns A-D and a final Score column. I want to find criteria on the values of each column which, when combined in a SQL WHERE clause, will always give me a positive value in the Score column. So basically, what combination of columns A-D will always give me a positive score?
columnA|columnB|columnC|columnD|Score
      1|     40|     10|      3|  -20
      0|     40|      2|      3|   10
      0|     10|      3|      3|   20
      1|     15|      3|      3|   -5
      0|     10|      2|      2|  -15
      0|     15|      6|      3|  -10
Expected result for above data set:-
Visual interpretation of the above data set gives me the condition: "columnA = 0 and columnB > 10 and columnC < 5 will ensure Score is always > 0". (Visually it is clear that columnD has no effect.)
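One way to verify a candidate criterion is to count the rows it matches whose Score is not positive; a valid criterion should match none. A sketch, assuming the data sits in a hypothetical table named scores with the column names used above:
select count(*) as violations
from scores
where columnA = 0
  and columnB > 10
  and columnC < 5
  and score <= 0;  -- expect 0 for a valid criterion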
Please note the above data set is simplified for the sake of the example. In reality my project contains around 40 columns and almost 2500 rows. One thing is for sure: each column has a finite range of values.
Below is a brute-force solution. This might also be a good question for the theoretical computer science site. I think this is an NP-complete problem similar to Boolean satisfiability, but that's just a wild guess. There may be a smarter way to solve this but I don't think I'll find it.
The basic idea is to cross join every expression with every distinct value for a column, and then cross join all the columns. The table will be queried with every expression list, and a count generated for positive and negative scores. If the expression returns the expected number of positive scores and none of the negative scores it is valid.
This assumes that you are only using the expressions >, <, and =. Every new column or expression will make this problem exponentially slower.
Test schema
drop table table1;
create table table1(a number, b number, c number, d number, score number);
insert into table1
select 1, 40, 10, 3, -20 from dual union all
select 0, 40, 2, 3, 10 from dual union all
select 0, 10, 3, 3, 20 from dual union all
select 1, 15, 3, 3, -5 from dual union all
select 0, 10, 2, 2, -15 from dual union all
select 0, 15, 6, 3, -10 from dual;
Wall of code
declare
v_inline_view varchar2(32767);
v_inline_views clob;
v_inline_view_counter number := 0;
v_and_expression varchar2(32767);
v_query clob;
v_sqls sys.odcivarchar2list;
v_dynamic_query_counter number := 0;
begin
--#1: Create inline views.
--One for every combination of expression and distinct value, per column.
for inline_views in
(
--Inline view for every possible expression for each column.
select
replace(q'[
(
select *
from
(
--Possible expressions.
select distinct
case
when operator is null then null
else ' AND %%COLUMN%% '||operator||' '||%%COLUMN%%
end %%COLUMN%%_expression
from
--All operators.
(
select '>' operator from dual union all
select '<' operator from dual union all
select '=' operator from dual union all
select null operator from dual
)
--All distinct values.
cross join
(
select distinct %%COLUMN%% from table1
)
)
--where %%COLUMN%%_expression is null or %%COLUMN%%_expression %%EXPRESSION_PERFORMANCE_EXCLUSIONS%%
)
]', '%%COLUMN%%', column_name) inline_view
from user_tab_columns
where table_name = 'TABLE1'
and column_name <> 'SCORE'
order by column_name
) loop
--Assign to a temporary variable so it can be modified.
v_inline_view := inline_views.inline_view;
--#1A: Optimize inline view - throw out expressions if they don't return any positive results.
declare
v_expressions sys.odcivarchar2list;
v_expressions_to_ignore varchar2(32767);
v_has_score_gt_0 number;
begin
--Gather expressions for one column.
execute immediate v_inline_view bulk collect into v_expressions;
--Loop through and test each expression.
for i in 1 .. v_expressions.count loop
--Always keep the null expression.
if v_expressions(i) is not null then
--Count the number of rows with a positive score.
execute immediate 'select nvl(max(case when score > 0 then 1 else 0 end), 0) from table1 where '||replace(v_expressions(i), ' AND ', null)
into v_has_score_gt_0;
--If the expression returns nothing positive then add it to exclusion.
if v_has_score_gt_0 = 0 then
v_expressions_to_ignore := v_expressions_to_ignore||','''||v_expressions(i)||'''';
end if;
end if;
end loop;
--Convert it into an IN clause.
if v_expressions_to_ignore is not null then
--Remove comment, replace placeholder with expression exclusions.
v_inline_view := regexp_replace(v_inline_view, '(.*)(--where)( .* )(%%EXPRESSION_PERFORMANCE_EXCLUSIONS%%)(.*)', '\1where\3 not in ('||substr(v_expressions_to_ignore, 2)||')');
end if;
end;
--Aggregate and count inline views.
if v_inline_view_counter <> 0 then
v_inline_views := v_inline_views||'cross join';
end if;
v_inline_views := v_inline_views||v_inline_view;
v_inline_view_counter := v_inline_view_counter + 1;
end loop;
--#2: Create an AND expression to combine all column expressions.
select listagg(column_name||'_expression', '||') within group (order by column_name)
into v_and_expression
from user_tab_columns
where table_name = 'TABLE1'
and column_name <> 'SCORE';
--#3: Create a query that will generate all possible expression combinations.
v_query :=
replace(replace(q'[
--8281 combinations
select '
select
'''||expressions||''' expression,
nvl(sum(case when score > 0 then 1 else 0 end), 0) gt_0_score_count,
nvl(sum(case when score <= 0 then 1 else 0 end), 0) le_0_score_count
from table1
where 1=1 '||expressions v_sql
from
(
--Combine expressions
select %%AND_EXPRESSION%% expressions
from
%%INLINE_VIEWS%%
) combined_expressions
]', '%%AND_EXPRESSION%%', v_and_expression), '%%INLINE_VIEWS%%', v_inline_views);
--TEST: It might be useful to see the full query here.
--dbms_output.put_line(v_query);
--#4: Gather expressions.
--With larger input you'll want to use a LIMIT
execute immediate v_query
bulk collect into v_sqls;
--#5: Test each expression.
--Look for any queries that return the right number of rows.
for i in 1 .. v_sqls.count loop
declare
v_expression varchar2(4000);
v_gt_0_score_count number;
v_le_0_score_count number;
begin
execute immediate v_sqls(i) into v_expression, v_gt_0_score_count, v_le_0_score_count;
v_dynamic_query_counter := v_dynamic_query_counter + 1;
--TODO: Dynamically generate "2".
if v_gt_0_score_count = 2 and v_le_0_score_count = 0 then
dbms_output.put_line('Expression: '||v_expression);
end if;
exception when others then
dbms_output.put_line('Problem with: '||v_sqls(i));
end;
end loop;
dbms_output.put_line('Queries executed: '||v_dynamic_query_counter);
end;
/
Results
The results appear correct. They are slightly different from yours because "columnB > 10" is incorrect.
Expression: AND A = 0 AND C < 6 AND D = 3
Expression: AND A = 0 AND C < 6 AND D > 2
Expression: AND A < 1 AND C < 6 AND D = 3
Expression: AND A < 1 AND C < 6 AND D > 2
Queries executed: 441
Problems
This brute-force approach is extremely inefficient in at least two ways. Even for this simple example it requires 6370 queries. And the results may include duplicates that are non-trivial to reduce. Or perhaps you'll get lucky and there are so few solutions that you can eyeball them.
There are a few things you can do to improve query performance. The easiest one would be to check every condition individually and throw it out if it does not "gain" anything for the counts.
Optimizations
Individual expressions that do not return any positive results are excluded. With the sample data, this reduces the number of query executions from 6370 to 441.
Running the process in parallel may also improve the performance by an order of magnitude. It would probably require a parallel pipelined function.
But even a 100x performance improvement may not help with an NP-complete problem. You may need to find some additional "short cuts" based on your sample data.
It may help to print out the query that generates the test queries, by un-commenting one of the dbms_output.put_line statements. Add a count(*) to see how many queries will be executed and run with a smaller data set to make an estimate for how long it will take.
If the estimate is a billion years, and you can't think of any tricks to make the brute-force method work faster, it may be time to ask this question on https://cstheory.stackexchange.com/
The idea of the solution is that the columns are independent. So it can be solved column by column.
So you can imagine that you are searching and building something in a multidimensional space. Each column represents a dimension with values from -inf to +inf, and you build the solution dimension by dimension.
For the 1st column the solution is: A=1 => false, A=0 => true.
Then you add the 2nd dimension, B. You have 5 values, so the dimension for column B is divided into 6 intervals. Some consecutive intervals can be joined; for example <10, 50> and <50, inf> both imply true.
And then you add 3rd dimension.
...
If you want to join dimension intervals at the SQL level you can use the LAG function. Using partitioning and windowing, you order the rows by one column, compute a true/false value in a derived column, and in the next derived column use the LAG function to detect whether the true/false flag changed compared to the previous row.
create table test
(
b number,
s number
);
insert into test values(10, -20);
insert into test values(50, 10);
insert into test values(15, 20);
insert into test values(18, 5);
select u.*,
case when LAG (b, 1, null) OVER (ORDER BY b) = b then 'Y' else 'N' end same_b_value,
LAG (score_flag, 1, null) OVER (ORDER BY b) AS score_flag_prev,
case when LAG (score_flag, 1, null) OVER (ORDER BY b) <> score_flag then 'Y' else 'N' end score_flag_changed
from
(
select t.*,
case when t.s >= 0 then 'Y' else 'N' end as score_flag
from test t
) u
order by b asc;
This query will show that the value B=15 is significant, because it is where score_flag changes.
I'm not sure about the value B=10 in the question, as it is linked to both positive and negative score values. Should it be included or excluded?
Very interesting problem. My proposition is based on the function check_column; code below. Execution examples:
select CHECK_COLUMN('col01') from dual; => "COL01 = 0"
select CHECK_COLUMN('col03') from dual; => "COL03 <= 2"
select column_name cn, check_column(column_name) crit
from all_tab_columns atc
where atc.table_name='SCORES' and column_name like 'COL%';
cn crit
COL01 COL01 = 0
COL02 COL02 >= 32
COL03 COL03 <= 2
COL04 COL04 = COL04
In your example I replaced the value 10 in row 3, columnB, with 32, because the example was not good and the condition "columnB > 10" was not right. Col04 is only there for presentation, as it's neutral. You need to stick the output strings together in Java or SQL, but that shouldn't be a problem.
I named the base table scores, then created the view positives. Instead of a view you can put the data in a temporary table; execution should be much faster.
create or replace view positives as
select distinct col01, col02, col03, col04
from scores where score>0
minus select COL01,COL02,COL03,COL04
from scores where score<0;
Function is:
create or replace function check_column(i_col in varchar2) return varchar2 as
v_tmp number;
v_cnt number;
v_ret varchar2(4000);
begin
-- candidate for neutral column ?
execute immediate 'select count(distinct '||i_col||') from positives' into v_tmp;
execute immediate 'select count(distinct '||i_col||') from scores' into v_cnt;
if v_tmp = v_cnt then
return i_col||' = '||i_col; -- empty string is better, this is for presentation
end if;
-- candidate for "column = some_value" ?
execute immediate 'select count(distinct '||i_col||') from positives' into v_cnt;
if v_cnt = 1 then
execute immediate 'select distinct '||i_col||' from positives' into v_tmp;
return i_col||' = '||v_tmp;
end if;
-- is this candidate for "column >= some_value" ?
execute immediate 'select min(distinct '||i_col||') from positives' into v_tmp;
execute immediate 'select count(1) from scores where '||i_col||
' not in (select '||i_col||' from positives) and '||i_col||'>'||v_tmp into v_cnt;
if v_cnt = 0 then
execute immediate 'select min('||i_col||') from scores' into v_cnt;
if v_cnt != v_tmp then
return i_col||' >= '||v_tmp;
end if;
end if;
-- is this candidate for "column <= some_value" ?
execute immediate 'select max(distinct '||i_col||') from positives' into v_tmp;
execute immediate 'select count(1) from scores where '||i_col||
' not in (select '||i_col||' from positives) and '||i_col||'<'||v_tmp into v_cnt;
if v_cnt = 0 then
execute immediate 'select max('||i_col||') from scores' into v_cnt;
if v_cnt != v_tmp then
return i_col||' <= '||v_tmp;
end if;
end if;
-- none of the above, have to list specific values
execute immediate 'select listagg('||i_col||', '', '') '
||'within group (order by '||i_col||') '
||'from (select distinct '||i_col||' from positives)' into v_ret;
return i_col||' in ('||v_ret||')';
end check_column;
This solution is neither optimized nor heavily tested, so please be careful.
If you have an Oracle version below 11, replace listagg with wmsys.wm_concat.
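To stick the per-column strings together in SQL, as mentioned above, the individual results can be aggregated into a single predicate. A sketch against the same SCORES table and COL% naming:
select listagg(check_column(column_name), ' and ')
         within group (order by column_name) as where_clause
from all_tab_columns
where table_name = 'SCORES'
  and column_name like 'COL%';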
Here's what I would do:
Check the minimum and maximum values for every "input" column
Check the minimum and maximum values for every "input" column for the subset where score > 0
Now, for each "input" column:
If the minimum and maximum values for point 1 and point 2 are the same, that column has no bearing
Otherwise, if the minimum and maximum values for point 2 are the same, that column is an "="
Otherwise, if the minimum is the same but the maximum is not, that column is a "<" with the maximum from point 2 as the reference
Otherwise, if the maximum is the same but the minimum is not, that column is a ">" with the minimum from point 2 as the reference
Otherwise, the column is a "< AND >"
Note that this all assumes that Score is (hypothetically) driven by continuous ranges in the "input" columns. It won't be able to spot conditions like "<5 or >10" or "<>12". Since neither of those appears in your example, I speculate it might be fine, but if it's not, you're back to NP-complete...
SQL to generate queries to output the conditions above for an arbitrary schema should be relatively easy to construct. Let me know if you want a hand with that and I'll look into it.
SELECT min(a), max(a) from MyTable WHERE score > 0;
SELECT min(a), max(a) from MyTable;
SELECT min(b), max(b) from MyTable WHERE score > 0;
SELECT min(b), max(b) from MyTable;
SELECT min(c), max(c) from MyTable WHERE score > 0;
SELECT min(c), max(c) from MyTable;
SELECT min(d), max(d) from MyTable WHERE score > 0;
SELECT min(d), max(d) from MyTable;
This will give you the range of each column for positive scores, and then the range of those columns over all scores. Where these ranges differ, you have a correlation.
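The two queries per column can also be collapsed into one pass with conditional aggregation, for example for column a (same MyTable naming as above):
select min(a)                              as min_all,
       max(a)                              as max_all,
       min(case when score > 0 then a end) as min_positive,
       max(case when score > 0 then a end) as max_positive
from MyTable;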
Here is an algorithm I started with (I need input to refine it further if someone thinks I am on the right track):
Preparation:
Create a list of all possible expressions, like A=0, B>10, C<5 (for 40 columns I finalized approximately 150 expressions in total).
Let us call it "expressions" variable.
Algorithm for the 1st run:
1. set totalPositiveRows= select count(*) from my table where score>0;
set totalNegativeRows= select count(*) from my table where score<0;
2. For each expr in expressions, calculate the following three variables (a SQL sketch of these counts follows after the algorithm):
set positivePercentage= find percentage of totalPositiveRows which satisfy this expr; //like if 60 rows out of total 100 rows having score>0 satisfy expr , then positivePercentage=60%
set negativePercentage= find percentage of totalNegativeRows which satisfy this expr; //like if 40 rows out of total 100 rows having score<0 satisfy expr , then negativePercentage=40%
set diffPercentage=positivePercentage-negativePercentage;
3. Set initialexpr=Choose expr having maximum value of diffPercentage
set initalPositivePercentage=choose corresponding positivePercentage value;
set initalNegativePercentage=choose corresponding negativePercentage value;
My thinking is that I need to now keep expanding initalexpr until initalNegativePercentage becomes 0.
Algorithm for subsequent runs, until initalNegativePercentage becomes 0:
1. For each expr in expressions, calculate following three variables
set newexpr=initialexpr+" and "+expr;
set positivePercentage= find percentage of totalPositiveRows which satisfy newexpr;
set negativePercentage= find percentage of totalNegativeRows which satisfy newexpr;
//calculate how much negative percentage it has reduced?
set positiveReduction=initalPositivePercentage-positivePercentage;
set negativeReduction=initalNegativePercentage-negativePercentage;
if(negativeReduction>=positiveReduction)
//note it down
else
//discard it
2. Choose the expr which gives the maximum negative reduction; that becomes the new initialexpr.
Set initialexpr=Choose expr having maximum value of negativeReduction
set initalPositivePercentage=choose corresponding value;
set initalNegativePercentage=choose corresponding value;
3. Repeat the algorithm above.
Please comment.
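As referenced in step 2, both percentages for one candidate expression can be computed in a single pass. A sketch against the table1 test data from the brute-force answer, using a = 0 as the example expression:
select 100.0 * sum(case when score > 0 and a = 0 then 1 else 0 end)
             / nullif(sum(case when score > 0 then 1 else 0 end), 0) as positive_percentage,
       100.0 * sum(case when score < 0 and a = 0 then 1 else 0 end)
             / nullif(sum(case when score < 0 then 1 else 0 end), 0) as negative_percentage
from table1;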
Here's a simple implementation that will result in a complicated set of rules.
Let A be the set of all inputs that result in a positive score, and B be the set of all inputs that don't result in a positive score.
If any set of inputs is in both A and B then no rule will give all the positives and no negatives. Regardless, A-B is a set of rules that will give only positive values, and no set of rules that excludes all non-positives can do better.
In your example, our rules are:
(colA=0, colB=40, colC=2, colD=3),
(colA=0, colB=10, colC=3, colD=3).
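In Oracle SQL, A-B is exactly what MINUS computes over the input columns. A sketch against the table1 test data from the brute-force answer:
-- inputs that occur with a positive score and never with a non-positive one
select a, b, c, d from table1 where score > 0
minus
select a, b, c, d from table1 where score <= 0;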

DBAdapter fetch random entries uniquely [duplicate]

In MySQL, you can select X random rows with the following statement:
SELECT * FROM table ORDER BY RAND() LIMIT X
This does not, however, work in SQLite. Is there an equivalent?
For much better performance, use:
SELECT * FROM table WHERE id IN (SELECT id FROM table ORDER BY RANDOM() LIMIT x)
SQL engines first load the projected fields of the rows into memory and then sort them. Here we just do a random sort on the id field of each row, which is in memory because it's indexed, then take X of them, and find the whole rows using those X ids.
So this consumes less RAM and CPU as the table grows!
SELECT * FROM table ORDER BY RANDOM() LIMIT X
SELECT * FROM table ORDER BY RANDOM() LIMIT 1
All the answers here are based on ORDER BY. This is very inefficient (i.e. unusable) for large sets, because you will evaluate RANDOM() for each record and then do an ORDER BY, which is a resource-expensive operation.
Another approach is to place abs(CAST(random() AS REAL))/9223372036854775808 < 0.5 in the WHERE clause to get, in this case, roughly a 0.5 hit chance per row.
SELECT *
FROM table
WHERE abs(CAST(random() AS REAL))/9223372036854775808 < 0.5
The large number is the maximum absolute value that random() can produce. The abs() is needed because the result is signed. The result is a uniformly distributed random variable between 0 and 1.
This has its drawbacks. You cannot guarantee a result, and if the threshold is large compared to the table, the selected data will be skewed towards the start of the table. But in some carefully designed situations it can be a feasible option.
This one handles negative RANDOM integers and keeps good performance on large datasets:
SELECT * FROM table LIMIT 1 OFFSET abs(random() % (select count(*) from table));
where:
abs(random() % n) gives you a positive integer in the range [0, n)
The accepted answer works, but requires a full table scan per query. This will get slower and slower as your table grows large, making it risky for queries that are triggered by end-users.
The following solution takes advantage of indexes to run in O(log(N)) time.
SELECT * FROM table
WHERE rowid > (
ABS(RANDOM()) % (SELECT max(rowid) FROM table)
)
LIMIT 1;
To break it down
SELECT max(rowid) FROM table - Returns the largest valid rowid for the table. SQLite is able to use the index on rowid to run this efficiently.
ABS(RANDOM()) % ... - Returns a random number between 0 and max(rowid) - 1. SQLite's random function generates a number between -9223372036854775808 and +9223372036854775807. The ABS makes sure it's positive, and the modulo operator caps it at max(rowid) - 1.
rowid > ... - Rather than using =, use > in case the random number generated corresponds to a deleted row. Using strictly greater than ensures that we return a row with a rowid between 1 (greater than 0) and max(rowid) (greater than max(rowid) - 1). SQLite uses the primary key index to return this result efficiently as well.
This also works for queries with WHERE clauses. Apply the WHERE clause to both the output and the SELECT max(rowid) subquery. I'm not sure which conditions this will run efficiently, however.
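A sketch of what "apply the WHERE clause to both places" looks like, using a hypothetical active flag (the efficiency caveat above still applies):
SELECT * FROM table
WHERE active = 1
  AND rowid > (ABS(RANDOM()) % (SELECT max(rowid) FROM table WHERE active = 1))
LIMIT 1;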
Note: This was derived from an answer in a similar question.

Get a random record from a huge database [duplicate]

How can I request a random row (or as close to truly random as is possible) in pure SQL?
See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):
Select a random row with MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
Select a random row with PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Select a random row with Microsoft SQL Server:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
Select a random row with IBM DB2
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
Select a random record with Oracle:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
Solutions like Jeremie's:
SELECT * FROM table ORDER BY RAND() LIMIT 1
work, but they need a sequential scan of the whole table (because the random value associated with each row needs to be calculated so that the smallest one can be determined), which can be quite slow even for medium-sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:
SELECT * FROM table WHERE num_value >= RAND() *
( SELECT MAX (num_value ) FROM table )
ORDER BY num_value LIMIT 1
This works in logarithmic time, regardless of the table size, if num_value is indexed. One caveat: this assumes that num_value is equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).
I don't know how efficient this is, but I've used it before:
SELECT TOP 1 * FROM MyTable ORDER BY newid()
Because GUIDs are pretty random, the ordering means you get a random row.
ORDER BY NEWID()
takes 7.4 milliseconds
WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table)
takes 0.0065 milliseconds!
I will definitely go with the latter method.
You didn't say which server you're using. In older versions of SQL Server, you can use this:
select top 1 * from mytable order by newid()
In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample that's repeatable:
SELECT FirstName, LastName
FROM Contact
TABLESAMPLE (1 ROWS) ;
For SQL Server
newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.
If possible, use stored statements to avoid the inefficiency of both indexes on RND() and creating a record number field.
PREPARE RandomRecord FROM "SELECT * FROM table LIMIT ?,1";
SET @n=FLOOR(RAND()*(SELECT COUNT(*) FROM table));
EXECUTE RandomRecord USING @n;
The best way is to put a random value in a new column just for that purpose, and use something like this (pseudocode + SQL):
randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo")
This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.
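A sketch of that wrap-around, in the same pseudo-SQL style as above (the ORDER BY is assumed; the second query runs only if the first one returns no rows):
-- first attempt (same query as above, with an ORDER BY so the nearest row is returned)
SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo ORDER BY MyTable.Randomness
-- fallback when no rows come back: wrap the random value around to zero
SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness >= 0 ORDER BY MyTable.Randomness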
newid() solution may require a full table scan so that each row can be assigned a new guid, which will be much less performant.
rand() solution may not work at all (i.e. with MSSQL) because the function will be evaluated just once, and every row will be assigned the same "random" number.
For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
I'm late, but got here via Google, so for the sake of posterity I'll add an alternative solution.
Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words when I want a random word.
SELECT TOP 1
word
FROM (
SELECT TOP(@idx)
word
FROM
dbo.DictionaryAbridged WITH(NOLOCK)
ORDER BY
word DESC
) AS D
ORDER BY
word ASC
Of course, @idx is some randomly-generated integer that ranges from 1 to COUNT(*) on the target table, inclusive. If your column is indexed, you'll benefit from it too. Another advantage is that you can use it in a function, since NEWID() is disallowed.
Lastly, the above query runs in about 1/10 of the execution time of a NEWID()-type query on the same table. YMMV.
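For completeness, @idx can also be generated in T-SQL right before the query (a sketch; RAND() returns a value in [0, 1), so the result lands in 1..COUNT(*)):
declare @idx int
select @idx = 1 + floor(rand() * count(*)) from dbo.DictionaryAbridged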
Instead of using RAND(), as it is not encouraged, you may simply get the max ID (=Max):
SELECT MAX(ID) FROM TABLE;
generate a random number between 1 and Max (=My_Generated_Random) in your application:
My_Generated_Random = rand_in_your_programming_lang_function(1..Max);
and then run this SQL:
SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1
Note that it will check for any rows whose IDs are EQUAL to or HIGHER than the chosen value.
It's also possible to hunt downwards in the table for a row and get an ID equal to or lower than My_Generated_Random; in that case modify the query like this:
SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1
As pointed out in @BillKarwin's comment on @cnu's answer...
When combining with a LIMIT, I've found that it performs much better (at least with PostgreSQL 9.1) to JOIN with a random ordering rather than to directly order the actual rows: e.g.
SELECT * FROM tbl_post AS t
JOIN ...
JOIN ( SELECT id, CAST(-2147483648 * RANDOM() AS integer) AS rand
FROM tbl_post
WHERE create_time >= 1349928000
) r ON r.id = t.id
WHERE create_time >= 1349928000 AND ...
ORDER BY r.rand
LIMIT 100
Just make sure that 'r' generates a 'rand' value for every possible key value in the complex query it is joined with, but still limit the number of rows of 'r' where possible.
The CAST to integer is especially helpful for PostgreSQL 9.2, which has specific sort optimisations for integer and single-precision floating types.
For MySQL, to get a random record:
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
More detail http://jan.kneschke.de/projects/mysql/order-by-rand/
With SQL Server 2012+ you can use the OFFSET FETCH query to do this for a single random row
select * from MyTable ORDER BY id OFFSET n ROW FETCH NEXT 1 ROWS ONLY
where id is an identity column and n is the row you want, calculated as a random number between 0 and count()-1 of the table (offset 0 is the first row, after all).
This works with holes in the table data, as long as you have an index to work with for the ORDER BY clause. It's also very good for the randomness, as you work that out yourself to pass in, so the niggles of the other methods are not present. In addition the performance is pretty good; on a smaller dataset it holds up well, though I've not tried serious performance tests against several million rows.
The SQL random function could help. Also, if you would like to limit the result to just one row, just add that at the end.
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
For SQL Server and needing "a single random row"..
If not needing a true sampling, generate a random value [0, max_rows) and use the ORDER BY..OFFSET..FETCH from SQL Server 2012+.
This is very fast if the COUNT and ORDER BY are over appropriate indexes - such that the data is 'already sorted' along the query lines. If these operations are covered it's a quick request and does not suffer from the horrid scalability of using ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.
declare @rows int
select @rows = count(1) from t
-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())
select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (@skip) rows
fetch first 1 row only
Make sure that the appropriate transaction isolation levels are used and/or account for 0 results.
For SQL Server and needing a "general row sample" approach..
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.
While a general sampling approach should be used with caution here, it's still potentially useful information in context of other answers (and the repetitious suggestions of non-scaling and/or questionable implementations). Such a sampling approach is less efficient than the first code shown and is error-prone if the goal is to find a "single random row".
Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable.
Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Here is the gist. See this answer for additional details and notes.
Naïve try:
declare @sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select @sample_percent = 100.0 / count(1) from t
-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
This can be largely remedied by a hybrid query, by mixing sampling and ORDER BY selection from the much smaller sample set. This limits the sorting operation to the sample size, not the size of the original table.
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t
declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
when @rows <= 1000 then 100 -- not enough rows
when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * @sample_size / @rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
SELECT * FROM table ORDER BY RAND() LIMIT 1
Most of the solutions here aim to avoid sorting, but they still need to make a sequential scan over the table.
There is also a way to avoid the sequential scan by switching to an index scan. If you know the index value of your random row, you can get the result almost instantly. The problem is: how to guess an index value.
The following solution works on PostgreSQL 8.4:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
limit 1;
In the above solution you guess 10 random index values from the range 0 .. [last value of id].
The number 10 is arbitrary - you may use 100 or 1000, as it (amazingly) doesn't have a big impact on the response time.
There is one problem though - if you have sparse ids you might miss. The solution is to have a backup plan :) - in this case a plain old order by random() query. Combined, it looks like this:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
union all (select * from cms_refs order by random() limit 1)
limit 1;
Note the UNION ALL clause. In this case, if the first part returns any data, the second one is NEVER executed!
You may also try using the newid() function.
Just write your query and use order by newid(). It's quite random.
Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.
Selection with Initial Seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
In MSSQL (tested on 11.0.5569) using
SELECT TOP 100 * FROM employee ORDER BY CRYPT_GEN_RANDOM(10)
is significantly faster than
SELECT TOP 100 * FROM employee ORDER BY NEWID()
For Firebird:
Select FIRST 1 column from table ORDER BY RAND()
In SQL Server you can combine TABLESAMPLE with NEWID() to get pretty good randomness and still have speed. This is especially useful if you really only want 1, or a small number, of rows.
SELECT TOP 1 * FROM [table]
TABLESAMPLE (500 ROWS)
ORDER BY NEWID()
I have to agree with CD-MaN: Using "ORDER BY RAND()" will work nicely for small tables or when you do your SELECT only a few times.
I also use the "num_value >= RAND() * ..." technique, and if I really want to have random results I have a special "random" column in the table that I update once a day or so. That single UPDATE run will take some time (especially because you'll have to have an index on that column), but it's much faster than creating random numbers for every row each time the select is run.
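A sketch of that maintenance pattern in MySQL, assuming a hypothetical indexed column named random_col (names are illustrative):
-- run once a day or so; RAND() is evaluated per row in an UPDATE
UPDATE MyTable SET random_col = RAND();

-- picking a row then becomes an index range scan
SELECT * FROM MyTable
WHERE random_col >= RAND()
ORDER BY random_col
LIMIT 1;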
Be careful, because TABLESAMPLE doesn't actually return a random sample of rows. It directs your query to look at a random sample of the 8KB pages that make up your rows. Then, your query is executed against the data contained in these pages. Because of how data may be grouped on these pages (insertion order, etc.), this could lead to data that isn't actually a random sample.
See: http://www.mssqltips.com/tip.asp?tip=1308
This MSDN page for TABLESAMPLE includes an example of how to generate an actually random sample of data.
http://msdn.microsoft.com/en-us/library/ms189108.aspx
It seems that many of the ideas listed here still use ordering.
However, if you use a temporary table, you are able to assign a random index (like many of the solutions have suggested), and then grab the first one that is greater than an arbitrary number between 0 and 1.
For example (for DB2):
WITH TEMP AS (
SELECT COLUMN, RAND() AS IDX FROM TABLE)
SELECT COLUMN FROM TEMP WHERE IDX > .5
FETCH FIRST 1 ROW ONLY
A simple and efficient way from http://akinas.com/pages/en/blog/mysql_random_row/
SET @i = (SELECT FLOOR(RAND() * COUNT(*)) FROM table);
PREPARE get_stmt FROM 'SELECT * FROM table LIMIT ?, 1';
EXECUTE get_stmt USING @i;
There is a better solution for Oracle than dbms_random.value, which requires a full scan to order rows by dbms_random.value and is quite slow for large tables.
Use this instead:
SELECT *
FROM employee sample(1)
WHERE rownum=1
For SQL Server 2005 and above, extending @GreyPanther's answer for the cases when num_value does not have continuous values. This also works for cases where the dataset is not evenly distributed and where num_value is not a number but a unique identifier.
WITH CTE_Table (SelRow, num_value)
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY ID) AS SelRow, num_value FROM table
)
SELECT * FROM table Where num_value = (
SELECT TOP 1 num_value FROM CTE_Table WHERE SelRow >= RAND() * (SELECT MAX(SelRow) FROM CTE_Table)
)
select r.id, r.name from table AS r
INNER JOIN(select CEIL(RAND() * (select MAX(id) from table)) as id) as r1
ON r.id >= r1.id ORDER BY r.id ASC LIMIT 1
This requires less computation time.

Subsequence as primary key

I have a scenario where I need to generate a batch number (primary key) in the format below.
Batch Number: ( X X ) ( X X X X X )
Location Sequence
Eg: 0100001
0100002
0200001
0100003
0200002
.......
The sequence starts at 00001 for each location. However, we cannot use a sequence number generator to do this. The possible solutions I have in mind are:
Create an extra table which holds the numbers. But there is a possibility that multiple users get the same sequence, as there may be uncommitted transactions. (A sketch of this option follows below.)
Every time an entity is saved, get the max(substring(batchnum,2)) from that column and add 1. But this will put a huge overhead on performance and also has the issue of multiple users getting the same sequence.
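For option 1 above, the usual way to stop two sessions from getting the same number is to lock the location's counter row for the duration of the transaction. A minimal sketch in Oracle, assuming a counter table named batch_counter; all names are illustrative, not from the question:
-- one counter row per location
create table batch_counter (
  location char(2) primary key,
  last_seq number(5) not null
);

-- inside the transaction that saves the entity
declare
  v_seq number(5);
begin
  -- FOR UPDATE serializes concurrent callers for the same location
  select last_seq + 1
    into v_seq
    from batch_counter
   where location = '01'
     for update;

  update batch_counter
     set last_seq = v_seq
   where location = '01';

  -- batch number = location || 5-digit sequence, e.g. 0100001
  dbms_output.put_line('01' || lpad(v_seq, 5, '0'));
end;
/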

How to group results by intervals?

I have a table containing events with a "speed" property.
In order to see the statistical distribution of this property, I'd like to group the results by intervals, let's say:
[0-49.99km/h] 3 objects
[50-100km/h] 13 objects
[100-150km/h] 50 objects
etc
This would let me see that most objects are in a certain interval.
Obviously that could be done with several queries with the appropriate Where conditions, such as:
select count(*) from GaEvent a where speed >= MIN and speed < MAX
but this is extremely inefficient.
Is there a better way of grouping these values?
Cheers!
A more efficient way to tackle this in SQL alone is to join the table in question against a derived table which contains the minimum and maximum values you want in your histogram.
For example:
select t.min, t.max, count(*)
from (
select 0 as min, 14.9 as max
union
select 15, 29.9
union
select 30, 44.9
union ...
) t
left outer join cars c on c.speed between t.min and t.max
group by t.min, t.max
order by t.min
min | max | count
-----------------
0 | 14.9 | 1
15 | 29.9 | 1
30 | 44.9 | 2
This is highly dependent on which database vendor you are using though. For example, PostgreSQL has a concept of window functions which may grossly simplify this type of query and prevent you from needing to generate the "histogram table" yourself.
When it comes to Hibernate, though, there seems to be very little in the way of Projections and support for aggregate functions that would apply to anything like this. This may very well be a scenario where you want to drop down to raw SQL for the query, and/or do the calculations in Java itself.
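For instance, PostgreSQL's width_bucket can produce the histogram without hand-building the derived table. A sketch, assuming the events live in a table named ga_event with a numeric speed column (names are illustrative):
-- bucket 1 = [0,50), 2 = [50,100), 3 = [100,150); 0 and 4 catch out-of-range speeds
select width_bucket(speed, 0, 150, 3) as bucket,
       count(*)                       as objects
from ga_event
group by bucket
order by bucket;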
If your intervals are all the same size, you can use something like this:
select 50*trunc(c.speed/50), count(*) from Car c group by 1
