Fetching sorted data from server in chunks? - java

I need to implement a feature that displays customer names in ascending or descending order (along with other customer data) from an Oracle database.
Say I display the first 100 names from the DB in descending order.
There is a "Show more" button which will display the next 100 names.
I am planning to fetch the next records based on the last index. So in step 2 I will fetch names 101 to 200.
But the problem is: what if, just before step 2, a name was updated by some other user?
In that case a name can be skipped (if it was updated from X to A) or duplicated (if it was updated from A to Z) when I fetch records by index in step 2.
Consider that on the first page the displayed names run from Z to X.
How can I handle this scenario so that I display the correct records without skips or duplicates?
One way I can think of is to fetch all record IDs into memory (either web-server memory or cursor memory), store them as a temporary result, and then return the data from there instead of from the live table. But if I have millions of records, that will put load on memory, either on the web server or on the DB.
What is the best approach, and how do other sites handle this kind of scenario?

If you really want each user to view a fixed snapshot of the table data, then you will have to do some caching behind the scenes. You have a valid concern about what would happen if, when requesting page 2, several new records landed on what would have been page 1, causing the same information to be viewed again on page 2. Playing devil's advocate, I could also argue that a user might be viewing records which have since been deleted and are no longer there. That could be equally bad in terms of user experience.
The way I have usually seen this problem handled is to just run a fresh query for each page. Since you are using Oracle, you would likely use OFFSET ... FETCH (available since 12c). A duplicated/missing-record problem is still possible, but unless your data changes very rapidly it should be a minor one.
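If the skips and duplicates are the main worry, keyset ("seek") pagination is a common alternative to offsets: the client remembers the last name on the current page, and the next query continues strictly past that key, so rows changing position can no longer shift page boundaries. The sketch below simulates that query logic in memory (the table and names are made up); in Oracle the equivalent query would be roughly `SELECT name FROM customers WHERE name < :lastSeen ORDER BY name DESC FETCH FIRST 100 ROWS ONLY`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Keyset ("seek") pagination sketch: instead of an OFFSET, the client sends
// the last name it saw and the query continues strictly past that key. A
// renamed row can still appear or vanish, but rows can no longer shift
// whole page boundaries the way they do with index-based fetching.
public class KeysetPage {
    // Stands in for:
    //   SELECT name FROM customers WHERE name < :lastSeen
    //   ORDER BY name DESC FETCH FIRST :pageSize ROWS ONLY
    static List<String> nextPage(List<String> sortedDesc, String lastSeen, int pageSize) {
        return sortedDesc.stream()
                .filter(n -> lastSeen == null || n.compareTo(lastSeen) < 0)
                .limit(pageSize)
                .collect(Collectors.toList());
    }
}
```

The first page passes `null` as the last-seen key; each later page passes the final name of the previous page.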

Related

Performance of database call from JAVA to Database

Our team is building a small application in which a UI has about 10 drop-down list boxes (DDLBs).
These list boxes will be populated by selecting data from different tables.
Our Java person feels that making a separate database call for each list will be very expensive and wants to make a single database call for all the lists.
I feel it is impractical to populate all the lists in one database call, for the following reason:
a. Imagine an end user chooses state = 'NY' from one DDLB.
b. The next drop-down should be populated with values from the ZIP_CODES table for STATE='NY'.
Unless we know ahead of time which state a user will choose, our only option is to populate a Java structure with all values from the ZIP_CODES table, and after the user has selected the state, parse this structure for NY zip codes.
And imagine doing this for every DDLB in the form. This would be not only impractical but also resource intensive.
Any thoughts ?
If there are not many items in those lists, and the amount of memory allows, you could load all values for all drop-down boxes into memory at application startup and then filter the data in memory. That will be better than executing an SQL query for every action the user makes on those drop-downs.
You could also use a cache engine (like Ehcache) that can offload data to disk and keep only some fraction of it in memory.
You can run some timings to see, but I suspect you're sweating something that might take a hundredth of a second to execute. UI-design-wise, I never put zip codes in selection menus, because the list is too long and people already know their own well enough to just punch it in. When they leave the zip-code field, I query the city and state and pre-fill those fields if they're not already set.
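The load-once-and-filter idea above can be sketched as follows (class and column names are hypothetical): all ZIP codes are fetched with a single `SELECT state, zip FROM zip_codes` at startup and grouped by state, so each state change becomes an in-memory lookup instead of a new query.

```java
import java.util.List;
import java.util.Map;

// Sketch of the load-once-and-filter approach (names are hypothetical):
// the map is populated once at startup from a single query, and every
// later state selection is answered from memory.
public class DropDownCache {
    private final Map<String, List<String>> zipCodesByState;

    DropDownCache(Map<String, List<String>> zipCodesByState) {
        // populated once, e.g. from SELECT state, zip FROM zip_codes
        this.zipCodesByState = zipCodesByState;
    }

    List<String> zipsFor(String state) {
        return zipCodesByState.getOrDefault(state, List.of());
    }
}
```

Whether this pays off depends on the table sizes; for a few thousand rows per list it is trivial, for millions you would want a cache engine or per-selection queries instead.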

Processing a big list from DB in Java

I have a big list of over 20,000 items to be fetched from the DB and processed daily in a simple console-based Java app.
What is the best way to do that? Should I fetch the list in small sets and process them, or should I fetch the complete list into an array and process that? Keeping it all in an array means a huge memory requirement.
Note: there is only one column to process.
Processing means I have to pass the string in that column somewhere else as a SOAP request.
The 20,000 items are strings of length 15.
It depends. 20,000 is not really a big number. If you are only processing 20,000 short strings or numbers, the memory requirement isn't that large; if it were 20,000 images, that would be a different matter.
There's always a tradeoff. Multiple chunks of data mean multiple trips to the database, but a single trip means more memory. Which is more important to you? Also, can your data be chunked, or do you need, for example, record 1 in order to process record 1000?
These are all things to consider. Hopefully they help you arrive at the design that is best for you.
Correct me if I am wrong: fetch it little by little, and also provide a rollback operation for it.
If the job can be done at the database level, I would do it using SQL scripts. Should that be impossible, I recommend loading small pieces of your data with two columns: the ID column and the column that needs to be processed.
This will give you better performance during processing, and if anything crashes you will not lose all the processed data. In a crash case you will need to know which rows were already processed; this can be done with a third column, or by saving the last processed ID after each round.
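The little-by-little approach with a saved checkpoint might look like the sketch below (all names are hypothetical; a real version would fetch each chunk with JDBC using something like `SELECT id, value FROM t WHERE id > ? ORDER BY id FETCH FIRST ? ROWS ONLY`). After every chunk the last processed id is recorded, so a crash repeats at most one chunk instead of the whole run.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of fetch-little-by-little with a resume checkpoint (names are
// hypothetical). The in-memory list stands in for the table; the processed
// list stands in for the outgoing SOAP calls.
public class ChunkedJob {
    long checkpoint = 0;                              // last id confirmed processed
    final List<String> processed = new ArrayList<>(); // stands in for SOAP requests

    void run(List<String> table, int chunkSize) {
        while (checkpoint < table.size()) {
            long end = Math.min(checkpoint + chunkSize, table.size());
            for (long id = checkpoint + 1; id <= end; id++) {
                processed.add(table.get((int) id - 1)); // send SOAP request here
            }
            checkpoint = end; // persist the checkpoint durably after each chunk
        }
    }
}
```

On restart, the job would read the persisted checkpoint and resume fetching from `id > checkpoint` rather than from the beginning.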

Designing HBase schema to best support specific queries

I have an HBase schema-design related question. The problem is fairly simple - I am storing "notifications" in HBase, each of which has a status ("new", "seen", and "read"). Here are the APIs I need to provide:
Get all notifications for a user
Get all "new" notifications for a user
Get the count of all "new" notifications for a user
Update status for a notification
Update status for all of a user's notifications
Get all "new" notifications accross the database
Notifications should be scannable in reverse chronological order and allow pagination.
I have a few ideas, and I wanted to see if one of them is clearly best, or if I have missed a good strategy entirely. Common to all three, I think having one row per notification and having the user id in the rowkey is the way to go. To get chronological ordering for pagination, I need to have a reverse timestamp in there, too. I'd like to keep all notifications in one table (so I don't have to merge-sort for the "get all notifications for a user" call) and don't want to write batch jobs for secondary index tables (since updates to the count and status should be in real time).
The simplest way would be (1): the row key is "userId_reverseTimestamp", and status filtering is done on the client side. This seems naive, since we will be sending lots of unnecessary data over the network.
The next possibility is to (2) encode the status into the rowkey as well, as "userId_reverseTimestamp_status", and then do rowkey regex filtering on the scans. The first issue I see is needing to delete a row and copy the notification data to a new row when the status changes (which, presumably, should happen exactly twice per notification). Also, since the status is the last part of the rowkey, for each user we will be scanning lots of extra rows. Is this a big performance hit? Finally, in order to change a status, I will need to know what the previous status was (to build the row key), or else I will need to do another scan.
The last idea I had is to (3) have two column families, one for the static notif data, and one as a flag for the status, i.e. "s:read" or "s:new" with 's' as the cf and the status as the qualifier. There would be exactly one per row, and I can do a MultipleColumnPrefixFilter or SkipFilter w/ ColumnPrefixFilter against that cf. Here too, I would have to delete and create columns on status change, but it should be much more lightweight than copying whole rows. My only concern is the warning in the HBase book that HBase doesn't do well with "more than 2 or 3 column families" - perhaps if the system needs to be extended with more querying capabilities, the multi-cf strategy won't scale.
So (1) seems like it would have too much network overhead, (2) seems to waste cost copying data, and (3) might cause issues with too many families. Between (2) and (3), which type of filter should give better performance? In both cases the scan will have to look at every row for a user, most of which are presumably read notifications. I think I'm leaning towards (3) - are there other options (or tweaks) that I have missed?
You have put a lot of thought into this and I think all three are reasonable!
You want to have your main key be the username concatenated with the time stamp since most of your queries are "by user". This will help with easy pagination with a scan and can fetch user information pretty quickly.
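A common way to get the reverse-chronological ordering mentioned in the question is to subtract the timestamp from `Long.MAX_VALUE` when building the rowkey, so newer notifications sort first under HBase's lexicographic byte ordering. A minimal sketch (the exact string format is an assumption; production code would typically use raw bytes):

```java
// Sketch of a "userId_reverseTimestamp" rowkey. Zero-padding the reversed
// timestamp to a fixed width makes string comparison agree with numeric
// comparison, so a plain forward scan returns newest-first.
public class NotificationKey {
    static String rowKey(String userId, long timestampMillis) {
        // Long.MAX_VALUE has 19 digits, hence the %019d pad width
        return String.format("%s_%019d", userId, Long.MAX_VALUE - timestampMillis);
    }
}
```

With keys built this way, paginating a user's notifications is a prefix scan on `userId_` that naturally yields reverse chronological order.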
I think the crux of your problem is this changing-status part. In general, a "read" -> "delete" -> "rewrite" sequence introduces all kinds of concurrency issues. What happens if your task fails in between? Do you end up with data in an invalid state? Will you drop a record?
I suggest you instead treat the table as append-only. Basically, do what you suggest for #3, but instead of removing the old flag, keep it there. If a notification has been read, it can have both "s:seen" and "s:read" present (if it is new, we can just assume the family is empty). You could also be fancy and put a timestamp in each to show when that event happened. You shouldn't see much of a performance hit from doing this, and you don't have to worry about concurrency, since all operations are write-only and atomic.
I hope this is helpful. I'm not sure if I answered everything, since your question was so broad. Please follow up with additional questions and I'd be happy to elaborate or discuss further.
My solution is:
Don't save a status (seen, new) in HBase for each notification. Use a simple schema for the notifications themselves. Key: userid_timestamp - column: notification_message.
Once the client asks the API to "Get all new notifications", save that timestamp. Key: userid - column: All_new_notifications_pushed_time
Every notification with a timestamp lower than All_new_notifications_pushed_time is assumed "seen"; if higher, assume "new".
To get all new notifications:
first, get the value (timestamp) of All_new_notifications_pushed_time for the userid
then perform a range scan on the notification_message column by key, from current_timestamp down to All_new_notifications_pushed_time.
This significantly limits the affected rows, and most of them should still be in the memstore.
Count the new notifications on the client.
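The watermark idea can be sketched like this (names are hypothetical): one saved timestamp per user splits the stream, so anything newer than the watermark is "new" and everything else is "seen", with no per-notification status writes at all.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the per-user watermark: a single saved timestamp
// (All_new_notifications_pushed_time) classifies every notification,
// so no status column is ever written per notification.
public class WatermarkInbox {
    static List<Long> newOnes(List<Long> notificationTimestamps, long allNewPushedTime) {
        return notificationTimestamps.stream()
                .filter(ts -> ts > allNewPushedTime)  // newer than watermark => "new"
                .collect(Collectors.toList());
    }
}
```

The count of new notifications is then just the size of that list, computed on the client as suggested above.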

How to Iterate across records in a MySql Database using Java

I have a customer with a very small set of data and records that I'd normally just serialize to a data file and be done but they want to run extra reports and have expandability down the road to do things their own way. The MySQL database came up and so I'm adapting their Java POS (point of sale) system to work with it.
I've done this before and here was my approach in a nutshell for one of the tables, say Customers:
I set up a loop to store the primary keys in an ArrayList, then set up a form to go from one record to the next, running an SQL query based on the PK each time. The query pulls down the fname, lname, address, etc. and fills in the fields on the screen.
It seemed a little clunky to run an SQL query each time they click Next, so I'm looking for another approach to this problem. Any help is appreciated! I don't need exact code or anything; some concepts will do fine.
Thanks!
I would say the solution you suggest is not very good, not only because you run an SQL query every time a button is pressed, but also because you are iterating over primary keys, which are probably not sorted in any meaningful order...
What you want is to retrieve a certain number of records sorted sensibly (by first/last name or something) and keep them as a kind of cache in an ArrayList or similar... This can be done quite easily with SQL. When the user starts iterating over the results by pressing "Next", you can start loading more records in the background.
The key to good usability is to load some records before the user actually requests them, to keep latency small, while keeping in mind that you don't want to load the whole database at once...
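A minimal sketch of that cache-plus-lazy-loading idea (all names are hypothetical; `loadPage` stands in for the `ORDER BY ... LIMIT/OFFSET` query): "Next" only touches the database when the cursor runs past what is already cached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Sketch of a record cache for Next-button navigation: pages are loaded on
// demand through loadPage, so most clicks are served from memory and only
// every Nth click triggers a database round trip.
public class RecordCache {
    private final List<String> cache = new ArrayList<>();
    private final IntFunction<List<String>> loadPage; // page index -> records
    private int cursor = -1;
    int pagesLoaded = 0;

    RecordCache(IntFunction<List<String>> loadPage) {
        this.loadPage = loadPage;
    }

    String next() {
        cursor++;
        if (cursor >= cache.size()) {
            cache.addAll(loadPage.apply(pagesLoaded++)); // lazy page fetch
        }
        return cache.get(cursor);
    }
}
```

Prefetching the next page in a background thread when the cursor nears the end of the cache would hide the remaining latency.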
Take a look at indexing your database. http://www.informit.com/articles/article.aspx?p=377652
Use JPA with the built in Hibernate provider. If you are not familiar with one or both, then download NetBeans - it includes a very easy to follow tutorial you can use to get up to speed. Managing lists of objects is trivial with the new JPA and you won't find yourself reinventing the wheel.
The key concept here is pagination.
Let's say you set your page size to 10. This means you select 10 records from the database in a certain order, so your query should have an ORDER BY clause and a LIMIT clause at the end. You use this result set to display the form while the user navigates with the Previous/Next buttons.
When the user navigates off the page, you fetch another page.
https://www.google.com/search?q=java+sql+pagination
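A minimal sketch of the paging query described above (table and column names are hypothetical): each page is one query with an ORDER BY for a stable order plus LIMIT/OFFSET, where the offset is simply the page index times the page size.

```java
// Sketch of MySQL-style pagination: one parameterized query per page.
// The two placeholders are bound with PreparedStatement.setInt before
// each fetch; only the offset changes from page to page.
public class CustomerPager {
    static final String PAGE_SQL =
            "SELECT fname, lname, address FROM customers ORDER BY lname, fname LIMIT ? OFFSET ?";

    static int offsetFor(int pageIndex, int pageSize) {
        return pageIndex * pageSize; // bound as the second parameter of PAGE_SQL
    }
}
```

The ORDER BY matters: without it MySQL is free to return rows in any order, so pages could overlap or skip records between fetches.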

How to implement several threads in Java for downloading a single table data?

How can I implement several threads with multiple (or shared) connections, so that a single large table's data can be downloaded quickly?
In my application I am downloading a table of 12 lakh (1 lakh = 100,000) records, which takes at least 4 hours at normal connection speed, and more on a slow connection.
So I need to implement several threads in Java for downloading a single table's data with multiple/shared connection objects, but I have no idea how to do this.
How do I position a record pointer in several threads, and how do I then combine all the threads' records into a single large file?
Thanks in Advance
First of all, it is not advisable to fetch and download such a huge amount of data to the client. If you need the data for display purposes, then you don't need more records than fit on your screen; you can paginate the data and fetch one page at a time. If you are fetching it all and processing it in memory, you would surely run out of memory on the client.
If you need to do this regardless of that suggestion, you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (one to many pages). If you have, say, 100K records and 100 threads available, then each thread can pull 1K records. It is, again, not advisable to have 100 threads with 100 open connections to the DB; this is just an example. Limit the number of threads to some optimal value, and also limit the number of records each thread pulls. You can limit the number of records pulled from the DB on the basis of ROWNUM.
As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems there are multiple ways to multi-thread a read of a full table.
Zeroth way: if your problem is just "I run out of RAM reading the whole table into memory", then you could process one row (or one batch of rows) at a time, then the next batch, and so on, avoiding loading the entire table into memory (but still single-threaded, so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
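The first way can be sketched with a `BlockingQueue` and a poison-pill sentinel (the in-memory list here stands in for iterating a JDBC `ResultSet` with a large fetch size; all names are made up):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of single-reader fan-out: one "reader" feeds a queue, several
// workers drain it, and a poison pill ("<EOF>") shuts the workers down.
// Counting stands in for the real per-row work.
public class QueueFanOut {
    static int process(List<String> rows, int workers) {
        try {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            AtomicInteger processed = new AtomicInteger();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try {
                        while (!queue.take().equals("<EOF>")) {
                            processed.incrementAndGet(); // real work goes here
                        }
                        queue.put("<EOF>");              // pass the pill on
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (String row : rows) queue.put(row);      // the single "reader"
            queue.put("<EOF>");                          // end-of-data marker
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            return processed.get();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Each worker that takes the pill re-enqueues it, so a single sentinel terminates any number of workers.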
Second way: pagination, basically every thread somehow knows what chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something start with XXX limit 10". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads incremented by the calling thread].
The problem is the "order by something" it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of a table. If it's indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where column is nil limit 10 followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries or lock the table and/or rows appropriately. Something like that to avoid possible collision (postgres for instance requires special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, knowing the approximate size of the table, you can peel off a fraction of it: if you want 100 batches, something like "where x >= 0.01 and x < 0.02" selects one of them. (Idea inspired by how Wikipedia gets a "random" page: it assigns each row a random float at insert time.)
The thing you really want to avoid is some kind of change in sort order half way through. For instance if you don't specify a sort order, and just query like this select * from table_name start by XXX limit 10 from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns you rows half way through [for instance, if new data is added] meaning you may skip rows or what not.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.
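Chunking by a mostly-contiguous id column might look like the sketch below (summing the ids stands in for processing the rows returned by `SELECT ... WHERE id >= lo AND id <= hi`; all names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of id-range chunking: the [1, maxId] range is split into fixed-size
// chunks and each chunk is handed to the pool, as each worker's query would
// be in the real multi-connection version.
public class IdRangeChunks {
    static long parallelSum(long maxId, long chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> parts = new ArrayList<>();
        for (long lo = 1; lo <= maxId; lo += chunkSize) {
            final long from = lo, to = Math.min(lo + chunkSize - 1, maxId);
            parts.add(pool.submit(() -> {
                long sum = 0;                       // stands in for per-row work
                for (long id = from; id <= to; id++) sum += id;
                return sum;
            }));
        }
        pool.shutdown();
        long total = 0;
        try {
            for (Future<Long> f : parts) total += f.get(); // combine the chunks
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return total;
    }
}
```

Gaps in the id sequence just make some chunks cheap; they don't break correctness, which is why "mostly contiguous" is good enough.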
I just felt compelled to answer this old posting.
Note that this is a typical scenario for Big Data, not only acquiring the data in multiple threads but also further processing it in multiple threads. Such approaches do not always require all data to be accumulated in memory; it can be processed in groups and/or sliding windows, and only needs to either accumulate a result or pass the data on (to other permanent storage).
To process the data in parallel, typically a partitioning or splitting scheme is applied to the source data. If the data is raw text, this could be a random-sized cut somewhere in the middle. For databases, the partitioning scheme is nothing but an extra WHERE condition applied to your query to allow paging. This could be something like:
Driver program: split my data into 4 parts, and start 4 workers
4 x (worker program): give me part 1..4 of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.
