Getting HBase table rows on the basis of timestamp in Java - java

I have been working with HBase for the last few weeks. My question is:
I have an HBase table with 100 records, each record having three columns in a single column family (there is just one column family). Now I want to retrieve the rows on the basis of timestamp, meaning the row that was added last should be retrieved first, like LIFO. Is this functionality available in HBase? If yes, how can I do it? I am using 0.98.3.
NOTE: While inserting data I did not specify timestamps manually.
I am trying to do it in Java.
Regards

Rows are naturally sorted lexicographically by row key (ascending); the timestamp is not involved at all when performing full table scans, so the first row retrieved will be the one with the lowest row key.
This would be the order in case of string row keys:
STRING ROW
0 \x30
00 \x30\x30
0000 \x30\x30\x30\x30
0001 \x30\x30\x30\x31
0002 \x30\x30\x30\x32
...
0010 \x30\x30\x31\x30
1 \x31
10 \x31\x30
2 \x32
a \x61
ab \x61\x62
...
zzzz \x7A\x7A\x7A\x7A
This would be the order in case of 4 byte signed integer row keys:
INT ROW
1 \x00\x00\x00\x01
2 \x00\x00\x00\x02
3 \x00\x00\x00\x03
4 \x00\x00\x00\x04
...
100 \x00\x00\x00\x64
...
10000 \x00\x00\x27\x10
...
MAX_INT \x7F\xFF\xFF\xFF
If you need the scan to work as a LIFO, you have to include the inverted timestamp as a prefix of your row key (although this design is not recommended for write-heavy environments because of hotspotting).
byte[] rowKey = Bytes.add(Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()), "-myRow".getBytes());
If you don't invert the timestamp, the scan will behave as FIFO (oldest rows first).
For more information take a look at this section of the HBase Book: https://hbase.apache.org/book/rowkey.design.html
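As a rough sketch of the idea (not from the original answer), assuming the 0.98-era client API and made-up table, column family and qualifier names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LifoScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "myTable");      // hypothetical table name

        // Write: the inverted timestamp prefix makes newer rows sort first.
        byte[] rowKey = Bytes.add(
                Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()),
                Bytes.toBytes("-myRow"));
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value"));
        table.put(put);

        // Read: a plain scan now returns the most recently written rows first.
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result result : scanner) {
            System.out.println(Bytes.toStringBinary(result.getRow()));
        }
        scanner.close();
        table.close();
    }
}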

Related

How to use FuzzyRowFilter to fetch time series data

I have a table, where I'll be storing event data. Possibly, billions of records a day.
Here is the structure of my rowKey:
salt_bucket-clientUUID-eventUUID-timestamp-timeUUID
salt_bucket is used to avoid region hot spotting; it is 1 byte that maps to my number of regions.
clientUUID. 16 bytes;
eventUUID. 16 bytes;
timestamp will be floored to the nearest time bucket (1 hour, 30 minutes or even more granular time period). 8 bytes;
timeUUID - UUID of version 1 used to guarantee uniqueness of my record, to prevent upserts. 16 bytes;
In my client Java app I want to perform a range scan. The input parameters will be a timeWindow (start and end date), a clientUUID and an eventUUID.
I thought that the best candidate for this job would be FuzzyRowFilter. It allows using a byte mask on the row key, so I'll be doing an exact match on the clientUUID and eventUUID components of the row key, whilst leaving ?s for salt_bucket and timeUUID.
Q: But what do I do with my time range? How do I make FuzzyRowFilter honor my time window, so that I can refine my scan operation?
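For illustration of the mask idea described in the question (not an answer to the time-range part), here is a rough sketch of how the fuzzy info could be set up; the byte offsets follow the key layout above, and the class/method names are made up:
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collections;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyMaskSketch {

    // Assumed key layout from the question:
    // 1 byte salt | 16 bytes clientUUID | 16 bytes eventUUID | 8 bytes timestamp | 16 bytes timeUUID
    static final int KEY_LENGTH = 1 + 16 + 16 + 8 + 16;

    public static Scan buildScan(byte[] clientUuid, byte[] eventUuid) {
        byte[] key = new byte[KEY_LENGTH];
        byte[] mask = new byte[KEY_LENGTH];
        Arrays.fill(mask, (byte) 1);                 // 1 = "fuzzy" (any byte matches)

        ByteBuffer buf = ByteBuffer.wrap(key);
        buf.position(1);                             // skip the salt byte (left fuzzy)
        buf.put(clientUuid);                         // fixed
        buf.put(eventUuid);                          // fixed
        // timestamp and timeUUID bytes are left fuzzy here

        // Mark the clientUUID + eventUUID region as fixed (0 = fixed).
        Arrays.fill(mask, 1, 1 + 16 + 16, (byte) 0);

        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
                Collections.singletonList(new Pair<byte[], byte[]>(key, mask))));
        return scan;
    }
}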

Efficient way to store and search number combinations

I have an application where I generate unique combinations of numbers, where each combination is unique. I want to store the combinations in a way that lets me retrieve a few random combinations efficiently later.
Here is an example of one combination
35 36 37 38 39 40 50
There are ~4 million combinations in all.
How should I store the data so that I can retrieve combinations later?
Since your combinations are unique and you don't actually have query criteria on your numbers, it does not matter how you store them in the database. Just insert them into some table. To retrieve X random combinations simply do:
SELECT * FROM table ORDER BY RANDOM() LIMIT X
See:
Select random row(s) in SQLite
On storing array of integers in SQLite:
Insert a table of integers - int[] - into SQLite database,
I think there might be a different solution; in the sense of: do you really have to store all those combinations?!
Assuming that those combinations are just "random", you could use some (smart) maths in a function getCombinationFor(), like
public List<Integer> getCombinationFor(long whatever)
that uses a fixed algorithm to create a unique result for each incoming input.
Like:
getCombinationFor(0): gives 0 1 2 3 10 20 30
getCombinationFor(1): gives 1 2 3 4 10 20 30 40
The above is of course pretty simple; depending on your requirements for those sequences you might need something much more complicated. But for sure, you can define such a function to return a permutation of a fixed set of numbers within a certain range!
The important thing is: this function returns a unique List for each and every input; and, equally important, given a certain sequence you can immediately determine the number that was used to create it.
So instead of generating a huge set of data containing unique sequences, you simply define an algorithm that knows how to create unique sequences in a deterministic way. If that works for you, it completely frees you from storing your sequences at all!
Edit: just remembered that I was looking into something kinda "close" to this question/answer ... see here!
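As a rough illustration of this idea (my own sketch, not code from the answer above): the combinatorial number system maps an integer rank deterministically to a unique k-element combination, so getCombinationFor() could look something like this, assuming 7 numbers drawn from 0..59:
import java.util.ArrayList;
import java.util.List;

public class CombinationUnranker {

    // Number of k-element subsets of an n-element set (small n/k, fits in a long).
    static long binomial(int n, int k) {
        if (k < 0 || k > n) return 0;
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // Returns the k-element combination of {0, ..., n-1} with the given
    // lexicographic rank. Each rank in [0, C(n, k)) yields a unique combination,
    // and the mapping is deterministic, so nothing needs to be stored.
    public static List<Integer> getCombinationFor(long rank, int n, int k) {
        List<Integer> combination = new ArrayList<>(k);
        int candidate = 0;
        for (int i = 0; i < k; i++) {
            // Skip candidates whose block of combinations lies entirely before 'rank'.
            while (binomial(n - 1 - candidate, k - 1 - i) <= rank) {
                rank -= binomial(n - 1 - candidate, k - 1 - i);
                candidate++;
            }
            combination.add(candidate);
            candidate++;
        }
        return combination;
    }

    public static void main(String[] args) {
        System.out.println(getCombinationFor(0, 60, 7));        // [0, 1, 2, 3, 4, 5, 6]
        System.out.println(getCombinationFor(1234567, 60, 7));  // some other unique 7-subset
    }
}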
Sending data:
Use a UNIQUE index. On a unique-violation exception just re-randomize and send the next record.
Retrieve:
The simplest approach to implement is: use a HashSet and feed it with random numbers (max is the number of combinations in your database minus 1) while hashSet.size() is less than the desired number of records to retrieve. Use the numbers from the hashSet as IDs or rownums of the records (combinations) to select the data: WHERE ID IN (ids here).
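A minimal Java sketch of that retrieval idea (the table name "combinations" and column names are made up for the example):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.StringJoiner;

public class RandomCombinationFetcher {

    // Picks 'howMany' distinct random IDs in [0, totalRows) and prints those rows.
    public static void printRandom(Connection conn, long totalRows, int howMany)
            throws Exception {
        Random random = new Random();
        Set<Long> ids = new HashSet<>();
        while (ids.size() < howMany) {
            ids.add((long) (random.nextDouble() * totalRows));
        }

        StringJoiner placeholders = new StringJoiner(", ", "(", ")");
        ids.forEach(id -> placeholders.add("?"));
        String sql = "SELECT * FROM combinations WHERE id IN " + placeholders; // hypothetical table

        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int index = 1;
            for (Long id : ids) {
                ps.setLong(index++, id);
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("numbers")); // hypothetical column
                }
            }
        }
    }
}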

Partitioner for result set from oracle database in spring batch

I need to extract the results of a query into a flat file. Is there a way to partition the result set so that it can be processed by multiple threads?
I tried partitioning based on ROWNUM without a sort, but when the same query is executed by multiple threads the ROWNUM assignment does not remain the same (because I am not sorting, due to the performance impact), which creates duplicates in the output.
Use ORA_HASH to split rows into deterministic buckets:
select *
from
(
select level, ora_hash(level, 2) bucket
from dual
connect by level <= 10
)
where bucket = 2;
LEVEL BUCKET
----- ------
1 2
3 2
6 2
10 2
It's a 0-based number. Use bucket = 0 and bucket = 1 to get the other 2 sets.
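If you take the ORA_HASH route in Spring Batch, a rough sketch (not from the answer above) of a Partitioner that hands each partitioned step a bucket number could look like this; each step's reader would then append something like "AND ora_hash(key_column, :maxBucket) = :bucket" to its query, reading the values from the step execution context:
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Assigns one ORA_HASH bucket per partition. The partitioned reader can reference
// the values via the step execution context, e.g. #{stepExecutionContext['bucket']}.
public class OraHashPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int bucket = 0; bucket < gridSize; bucket++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("bucket", bucket);
            context.putInt("maxBucket", gridSize - 1); // ORA_HASH's max_bucket argument
            partitions.put("partition" + bucket, context);
        }
        return partitions;
    }
}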
Use ROWID instead. ROWID is immutable for every record. Or just use the primary key (or any other field with enough distinct values, for that matter) to divide the data into subsets.
select *
from table
where SUBSTR(ROWIDTOCHAR(ROWID),-1) IN ('A','a','0');
select *
from table
where SUBSTR(ROWIDTOCHAR(ROWID),-1) IN ('B','b','1');
or
select *
from table
where SUBSTR(ROWIDTOCHAR(ROWID),-1) between 'A' and 'Z';
etc.
You'll have to experiment a little with the WHERE clause. As far as I know, the last character of a ROWID can contain [A-Z][a-z][0-9] plus + and /.

Get a random record from a huge database [duplicate]

How can I request a random row (or as close to truly random as is possible) in pure SQL?
See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):
Select a random row with MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
Select a random row with PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Select a random row with Microsoft SQL Server:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
Select a random row with IBM DB2
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
Select a random record with Oracle:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
Solutions like Jeremie's:
SELECT * FROM table ORDER BY RAND() LIMIT 1
work, but they need a sequential scan of all the table (because the random value associated with each row needs to be calculated - so that the smallest one can be determined), which can be quite slow for even medium sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:
SELECT * FROM table WHERE num_value >= RAND() *
( SELECT MAX (num_value ) FROM table )
ORDER BY num_value LIMIT 1
This works in logarithmic time, regardless of the table size, if num_value is indexed. One caveat: this assumes that num_value is equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).
I don't know how efficient this is, but I've used it before:
SELECT TOP 1 * FROM MyTable ORDER BY newid()
Because GUIDs are pretty random, the ordering means you get a random row.
ORDER BY NEWID()
takes 7.4 milliseconds
WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table)
takes 0.0065 milliseconds!
I will definitely go with the latter method.
You didn't say which server you're using. In older versions of SQL Server, you can use this:
select top 1 * from mytable order by newid()
In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample that's repeatable:
SELECT FirstName, LastName
FROM Contact
TABLESAMPLE (1 ROWS) ;
For SQL Server
newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.
If possible, use prepared statements to avoid the inefficiency of both indexing on RAND() and creating a record-number field.
PREPARE RandomRecord FROM "SELECT * FROM table LIMIT ?,1";
SET @n = FLOOR(RAND() * (SELECT COUNT(*) FROM table));
EXECUTE RandomRecord USING @n;
The best way is to put a random value in a new column just for that purpose, and use something like this (pseudo code + SQL):
randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo")
This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.
newid() solution may require a full table scan so that each row can be assigned a new guid, which will be much less performant.
rand() solution may not work at all (i.e. with MSSQL) because the function will be evaluated just once, and every row will be assigned the same "random" number.
For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
I'm late, but got here via Google, so for the sake of posterity, I'll add an alternative solution.
Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words, if I want a random word.
SELECT TOP 1
word
FROM (
SELECT TOP(@idx)
word
FROM
dbo.DictionaryAbridged WITH(NOLOCK)
ORDER BY
word DESC
) AS D
ORDER BY
word ASC
Of course, @idx is some randomly-generated integer that ranges from 1 to COUNT(*) of the target table, inclusive. If your column is indexed, you'll benefit from it too. Another advantage is that you can use it in a function, since NEWID() is disallowed there.
Lastly, the above query runs in about 1/10 of the execution time of a NEWID()-type query on the same table. YMMV.
Instead of using RAND(), which is discouraged, you may simply get the max ID (=Max):
SELECT MAX(ID) FROM TABLE;
get a random number between 1..Max (=My_Generated_Random)
My_Generated_Random = rand_in_your_programming_lang_function(1..Max);
and then run this SQL:
SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1
Note that it will check for any rows whose IDs are EQUAL to or HIGHER than the chosen value.
It's also possible to hunt for the row downwards in the table and get an ID equal to or lower than My_Generated_Random, then modify the query like this:
SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1
As pointed out in @BillKarwin's comment on @cnu's answer...
When combining with a LIMIT, I've found that it performs much better (at least with PostgreSQL 9.1) to JOIN with a random ordering rather than to directly order the actual rows: e.g.
SELECT * FROM tbl_post AS t
JOIN ...
JOIN ( SELECT id, CAST(-2147483648 * RANDOM() AS integer) AS rand
FROM tbl_post
WHERE create_time >= 1349928000
) r ON r.id = t.id
WHERE create_time >= 1349928000 AND ...
ORDER BY r.rand
LIMIT 100
Just make sure that the 'r' generates a 'rand' value for every possible key value in the complex query which is joined with it but still limit the number of rows of 'r' where possible.
The CAST as Integer is especially helpful for PostgreSQL 9.2 which has specific sort optimisation for integer and single precision floating types.
For MySQL, to get a random record:
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
More detail http://jan.kneschke.de/projects/mysql/order-by-rand/
With SQL Server 2012+ you can use the OFFSET FETCH query to do this for a single random row
select * from MyTable ORDER BY id OFFSET n ROW FETCH NEXT 1 ROWS ONLY
where id is an identity column, and n is the row you want, calculated as a random number between 0 and count()-1 of the table (offset 0 is the first row, after all).
This works with holes in the table data, as long as you have an index to work with for the ORDER BY clause. It's also very good for the randomness, since you work that value out yourself to pass in, and the niggles of the other methods are not present. In addition, the performance is pretty good; on a smaller dataset it holds up well, though I've not tried serious performance tests against several million rows.
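A rough JDBC sketch of that approach (table and column names are assumptions; it also assumes the table is non-empty):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ThreadLocalRandom;

public class RandomRowViaOffset {

    // Picks one random row using COUNT(*) plus OFFSET/FETCH (SQL Server 2012+ syntax).
    public static void printRandomRow(Connection conn) throws Exception {
        long count;
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM MyTable")) {
            rs.next();
            count = rs.getLong(1);
        }

        long offset = ThreadLocalRandom.current().nextLong(count); // 0 .. count-1
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM MyTable ORDER BY id OFFSET ? ROWS FETCH NEXT 1 ROWS ONLY")) {
            ps.setLong(1, offset);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println(rs.getLong("id"));
                }
            }
        }
    }
}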
The SQL random function could help. Also, if you would like to limit the result to just one row, just add that at the end.
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
For SQL Server and needing "a single random row"..
If not needing a true sampling, generate a random value [0, max_rows) and use the ORDER BY..OFFSET..FETCH from SQL Server 2012+.
This is very fast if the COUNT and ORDER BY are over appropriate indexes - such that the data is 'already sorted' along the query lines. If these operations are covered it's a quick request and does not suffer from the horrid scalability of using ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.
declare @rows int
select @rows = count(1) from t
-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())

select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (@skip) rows
fetch first 1 row only
Make sure that the appropriate transaction isolation levels are used and/or account for 0 results.
For SQL Server and needing a "general row sample" approach..
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.
While a general sampling approach should be used with caution here, it's still potentially useful information in context of other answers (and the repetitious suggestions of non-scaling and/or questionable implementations). Such a sampling approach is less efficient than the first code shown and is error-prone if the goal is to find a "single random row".
Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable.
Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Here is the gist. See this answer for additional details and notes.
Naïve try:
declare @sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select @sample_percent = 100.0 / count(1) from t

-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
This can be largely remedied by a hybrid query, by mixing sampling and ORDER BY selection from the much smaller sample set. This limits the sorting operation to the sample size, not the size of the original table.
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare #rows int
select #rows = count(1) from t
declare #sample_size int = 1000
declare #sample_percent decimal(7, 4) = case
when #rows <= 1000 then 100 -- not enough rows
when (100.0 * #sample_size / #rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * #sample_size / #rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
#sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * #sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
SELECT * FROM table ORDER BY RAND() LIMIT 1
Most of the solutions here aim to avoid sorting, but they still need to make a sequential scan over a table.
There is also a way to avoid the sequential scan by switching to an index scan. If you know the index value of your random row, you can get the result almost instantly. The problem is - how to guess an index value.
The following solution works on PostgreSQL 8.4:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
limit 1;
In the above solution you guess 10 random index values from the range 0 .. [last value of id].
The number 10 is arbitrary - you may use 100 or 1000, as it (amazingly) doesn't have a big impact on the response time.
There is also one problem - if you have sparse ids you might miss. The solution is to have a backup plan :) In this case a plain old order by random() query. Combined, it looks like this:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
union all (select * from cms_refs order by random() limit 1)
limit 1;
Note the UNION ALL clause. In this case, if the first part returns any data, the second one is NEVER executed!
You may also try using the NEWID() function.
Just write your query and use ORDER BY NEWID(). It's quite random.
Didn't quite see this variation in the answers yet. I had an additional constraint where, given an initial seed, I needed to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.
Selection with Initial Seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */

select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
In MSSQL (tested on 11.0.5569) using
SELECT TOP 100 * FROM employee ORDER BY CRYPT_GEN_RANDOM(10)
is significantly faster than
SELECT TOP 100 * FROM employee ORDER BY NEWID()
For Firebird:
Select FIRST 1 column from table ORDER BY RAND()
In SQL Server you can combine TABLESAMPLE with NEWID() to get pretty good randomness and still have speed. This is especially useful if you really only want 1, or a small number, of rows.
SELECT TOP 1 * FROM [table]
TABLESAMPLE (500 ROWS)
ORDER BY NEWID()
I have to agree with CD-MaN: Using "ORDER BY RAND()" will work nicely for small tables or when you do your SELECT only a few times.
I also use the "num_value >= RAND() * ..." technique, and if I really want to have random results I have a special "random" column in the table that I update once a day or so. That single UPDATE run will take some time (especially because you'll have to have an index on that column), but it's much faster than creating random numbers for every row each time the select is run.
Be careful because TableSample doesn't actually return a random sample of rows. It directs your query to look at a random sample of the 8KB pages that make up your row. Then, your query is executed against the data contained in these pages. Because of how data may be grouped on these pages (insertion order, etc), this could lead to data that isn't actually a random sample.
See: http://www.mssqltips.com/tip.asp?tip=1308
This MSDN page for TableSample includes an example of how to generate an actually random sample of data.
http://msdn.microsoft.com/en-us/library/ms189108.aspx
It seems that many of the ideas listed still use ordering
However, if you use a temporary table, you are able to assign a random index (like many of the solutions have suggested), and then grab the first one that is greater than an arbitrary number between 0 and 1.
For example (for DB2):
WITH TEMP AS (
SELECT COLUMN, RAND() AS IDX FROM TABLE)
SELECT COLUMN FROM TEMP WHERE IDX > .5
FETCH FIRST 1 ROW ONLY
A simple and efficient way from http://akinas.com/pages/en/blog/mysql_random_row/
SET @i = (SELECT FLOOR(RAND() * COUNT(*)) FROM table);
PREPARE get_stmt FROM 'SELECT * FROM table LIMIT ?, 1';
EXECUTE get_stmt USING @i;
There is a better solution for Oracle than using dbms_random.value, which requires a full scan to order rows by dbms_random.value and is quite slow for large tables.
Use this instead:
SELECT *
FROM employee sample(1)
WHERE rownum=1
For SQL Server 2005 and above, extending @GreyPanther's answer for cases when num_value does not have continuous values. This also works for cases where the dataset is not evenly distributed and where num_value is not a number but a unique identifier.
WITH CTE_Table (SelRow, num_value)
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY ID) AS SelRow, num_value FROM table
)
SELECT * FROM table Where num_value = (
SELECT TOP 1 num_value FROM CTE_Table WHERE SelRow >= RAND() * (SELECT MAX(SelRow) FROM CTE_Table)
)
select r.id, r.name from table AS r
INNER JOIN(select CEIL(RAND() * (select MAX(id) from table)) as id) as r1
ON r.id >= r1.id ORDER BY r.id ASC LIMIT 1
This requires less computation time.

Faster SQL data retrieval with Java and searching large data

I have a table with over 100 thousand rows consisting of number pairs. A sample is shown below.
A B
0010 0010
0010 0011
0010 0019
0010 0056
0011 0010
0011 0011
0011 0019
0011 0040
0019 0010
0019 0058
Here the numbers in column A have possible pairs present in column B.
Explanation: the user will have several of these numbers, ranging from 10 to 100. Now, as we can see, for 0010 both 0011 and 0019 are present. So if the user has a list containing 0010 along with 0011, a warning will be shown that this pair is not allowed, and vice versa.
How to approach this in Java?
Loading a hash map with all the data does not seem to be a good option, although the search will be much faster.
Please suggest. Thanks.
Testcase:
String num = "0010"; // value from the list which the user will be passing

void test(String num) {
    if (num.equals("0019") || num.equals("0011")) { // comparing with the database
        System.out.println("incompatible pair present");
    }
}
The above example is very simple pseudo code. The actual problem will be much more complex.
Until the question is more clear...
Handling large amounts of data which are already stored in a database, let me give you a recommendation: whatever you want to do here, consider solving it with SQL instead of Java. Or at least write SQL whose resulting ResultSet is easy to evaluate in Java afterwards.
But since the question is not that clear ...
Are you trying to find entries where A is the same value but B is different?
SELECT t1.a, t1.b, t2.b
FROM MyTable t1, MyTable t2
WHERE t1.a = t2.a AND t1.b <> t2.b
If you're worried about running out of heap space, you could try using a persistent cache like Ehcache. I suggest you check the actual memory consumed before going for this solution, though.
It seems like your problem is limited to a very small domain - why can't you instantiate a two-dimensional array of booleans and set an entry to true whenever the indexes of two numbers form an unsupported combination?
Example of usage:
if (forbidden[10][11] || forbidden[11][10])
{
    throw new Exception("pairs of '10' and '11' are not allowed");
}
You can populate this array from the database by going over the data once and setting the flags. You just need to translate 0010 to 10. You will have junk at indexes 0-9, but you can eliminate it by "translating" the index (e.g. subtracting 10 from it).
Does that answer your question?
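A hedged sketch of populating such a matrix from the database via JDBC (the table and column names are assumptions, and the numbers are assumed to fall in the 10-100 range mentioned in the question):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ForbiddenPairs {

    private static final int MAX = 101;              // assumed range 10..100 per the question
    private final boolean[][] forbidden = new boolean[MAX][MAX];

    // Load every incompatible pair once; afterwards lookups are O(1) array reads.
    public void load(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT a, b FROM pairs")) { // hypothetical table
            while (rs.next()) {
                int a = Integer.parseInt(rs.getString("a")); // "0010" -> 10
                int b = Integer.parseInt(rs.getString("b"));
                forbidden[a][b] = true;
                forbidden[b][a] = true;                      // treat the relation as symmetric
            }
        }
    }

    public boolean isIncompatible(int a, int b) {
        return forbidden[a][b];
    }
}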
If I have understood correctly what you want to do…
Create a unique index on t1(a,b). Put the user's new pair in an INSERT statement inside a try block. Catch the key-violation exception (it will be a SQLException, possibly a subclass depending on your RDBMS) and explain to the user that this is a forbidden pair.
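A minimal sketch of that approach (assuming a unique index on (a, b) and a driver that reports the violation as SQLIntegrityConstraintViolationException; some drivers may only throw a plain SQLException):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class PairInserter {

    // Tries to insert the pair; a unique-index violation means the pair already exists.
    public static boolean tryInsert(Connection conn, String a, String b) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO t1 (a, b) VALUES (?, ?)")) {
            ps.setString(1, a);
            ps.setString(2, b);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException e) {
            System.out.println("Forbidden pair: " + a + " / " + b);
            return false;
        }
    }
}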
A simple - definitely not scalable - solution, if your ranges really are 0000 - 9999:
Simply have a byte table with 100,000,000 entries (10,000 × 10,000).
Each entry is a simple 0 for allowed or 1 for not allowed.
You find an entry in the table by combining the two pair numbers (key = first * 10000 + second).
The more scalable database solution is to create a table with a composite primary key (pair1, pair2), the mere presence of an entry indicating a disallowed pair.
To clarify the question:
You have a table in which each record contains two numbers that are declared 'incompatible'.
You have a user list of numbers and you want to check whether this list contains 'incompatible' numbers. Right?
Here you go with a simple SQL statement (taking your example from the comment):
SELECT *
FROM incompatible
WHERE A IN (1, 14, 67) AND B IN (1, 14, 67);
This SQL returns all incompatibilities. When the result set is empty, there are no incompatibilities and everything is fine. If you only want to retrieve that fact, you can write SELECT 1 ... instead.
The SQL has to be built dynamically to contain the user's numbers in the IN clauses, of course.
To speed up queries you can create a (unique) index over both columns, so the database can do a (unique) index range scan.
If this table does not yet contain a primary key, you should create a primary key over both columns.
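A rough sketch of building that query dynamically with a PreparedStatement (the table and column names follow the answer above; the helper itself is an assumption):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;
import java.util.stream.Collectors;

public class IncompatibilityChecker {

    // Returns true if the user's list contains at least one incompatible pair.
    public static boolean hasIncompatibility(Connection conn, List<Integer> userNumbers)
            throws Exception {
        String placeholders = userNumbers.stream()
                .map(n -> "?")
                .collect(Collectors.joining(", "));
        String sql = "SELECT 1 FROM incompatible"
                + " WHERE A IN (" + placeholders + ") AND B IN (" + placeholders + ")";

        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int index = 1;
            for (int pass = 0; pass < 2; pass++) {       // bind the same list for A and B
                for (Integer n : userNumbers) {
                    ps.setInt(index++, n);
                }
            }
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();                        // any row means an incompatibility
            }
        }
    }
}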
