I need to implement a web service for a feed of videos and consume it with an Android client.
My implementation is a method getVideos(offset, quantity) backed by a MySQL table; it returns the result of the query SELECT * FROM videos ORDER BY id DESC LIMIT offset, quantity, where id is an auto-increment value.
But since it is a very active feed, I've detected the following problematic case:
The database has the videos 1, 2, 3 ... 10.
The Android client requests the videos with offset=0, quantity=5, so the items 10, 9, 8, 7, 6 are returned. The user starts to play some videos, and in the meantime 5 new videos are published, so the table now contains the items 1, 2, 3 ... 15. When the user scrolls to the end of the list, the client requests the next bundle, offset=5, quantity=5, but the same items are returned, producing duplicates (or adding nothing) in the ListView.
What is the best approach for this problem?
If you don't want data to repeat, then don't use OFFSET; use a WHERE clause on id instead.
Track the last id you were given and then run a query like:
SELECT * FROM videos WHERE id < :last_id ORDER BY id DESC LIMIT 0, :quantity
Not only does this guarantee that the results will not repeat, but it should also be faster, since the DB won't have to scan past all the offset rows.
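On the server side, a minimal JDBC sketch of this keyset approach could look like the following; the videos table and the quantity parameter follow the question, while the class name, the connection handling, and the id-only projection are illustrative simplifications:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class VideoFeedDao {
    private final Connection conn;

    public VideoFeedDao(Connection conn) {
        this.conn = conn;
    }

    // Returns the ids of the next page of videos strictly older than lastId.
    // Pass Long.MAX_VALUE (or the current max id) for the very first page.
    public List<Long> getVideosBefore(long lastId, int quantity) throws SQLException {
        String sql = "SELECT id FROM videos WHERE id < ? ORDER BY id DESC LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastId);
            ps.setInt(2, quantity);
            try (ResultSet rs = ps.executeQuery()) {
                List<Long> ids = new ArrayList<>();
                while (rs.next()) {
                    ids.add(rs.getLong("id"));
                }
                return ids;
            }
        }
    }
}

The Android client then only has to remember the smallest id it has received so far and send it back as lastId, so videos published after the first request can never shift the window.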
UPDATE
How about capturing the maximum value of the id column when you make the first query, and then adding a WHERE condition that all results have to be lower than or equal to that original value? That way you won't ever get duplicates unless a row changes position. Better yet, add a modification-time column to your rows and use the time of the first query. That way you won't show rows edited after that point, but at least they won't break the order.
SELECT *
FROM videos
WHERE mtime < :original_query_time
ORDER BY id DESC
LIMIT :offset, :quantity;
I am using DynamoDbEnhancedAsyncClient to query DynamoDB using a GSI and pagination. Below is the code that I am using to achieve this. I am trying to limit the number of items per page and the number of pages sent to the subscriber of the Mono using the code below. I need to sort the records in each page in descending order by the timestamp, which is the sort key in my GSI. For this I am using scanIndexForward(false) below. However, I am not getting any records in the page, even though there are 4 records in total present in DynamoDB.
SdkPublisher<Page<Customer>> query = secindex.query(QueryEnhancedRequest.builder()
        .queryConditional(queryconditional).scanIndexForward(false).limit(2).build());

Mono.from(PagePublisher.create(query.limit(1)));
secindex is the DynamoDbAsyncIndex, which is the GSI. As per the above code, 1 page should be returned to the client with 2 records; however, none are returned. Also, if I remove scanIndexForward(false) then the result is as expected, but sorted in ascending order. How do I make it return a limited number of records in descending order? Does pagination work differently when scanIndexForward() is supplied?
Without knowing 100% what your filters are on your Dynamo call, I can only guess, but I've seen this sort of thing many times.
Correction: the limit is applied before the query results are returned, not after. The numbered steps below originally got this wrong; but because any additional filters are applied after the return, this can indeed result in 2 items being returned that are then filtered out, for an ultimate return of 0 items.
End of correction.
A DynamoDB Query does not perform any filtering or limiting on the data before returning it. The only thing a standard query to Dynamo can do is check the Hash Key/Range Key, with some basic Range Key filtering (gt, lt, between, begins_with, etc.); all other filters on attributes that are not the Hash/Range keys are applied by the SDK you're using after it gets back a set of responses.
1 - Query the dynamo with a Hash Key/Range Key combination and any filtering on Range.
2 - All items that match are sent back, up to 1 MB of data. Anything more than that needs additional calls.
3 - Apply the limit to these results! (This was incorrect; the limit is applied before the results are returned.)
4 - Apply the filter to what's been limited.
5 - Then whatever is left is returned.
This means that, when you are using filter conditions on a Dynamo query, you often don't get back what you expect, because either the matching items are on the next page, or nothing on the current page matches the filter, so you get back 0 items.
Since you are also using Limit, and the data is sorted by the Sort Key in descending order (as scanIndexForward is false), if the first two items don't match your other filters you get 0 items back.
I would recommend you try querying all the items without any filters beyond just Hash/Range key - no other attribute filters.
Then filter the response manually on your side.
You should also be aware of the internal pagination of the SDKs for Dynamo: they will only return 1 MB of data from DynamoDB in a single call. Anything beyond that requires a second call that includes the LastEvaluatedKey returned with the first page of results. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html has more information.
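A rough sketch of that follow-up call with the plain (non-enhanced) v2 client is below; the table name, index name, and key attribute are invented for illustration, and the page size of 2 just mirrors the question:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class PagedQuery {
    public static void main(String[] args) {
        DynamoDbClient client = DynamoDbClient.create();

        Map<String, AttributeValue> startKey = null;   // null = start from the beginning
        do {
            QueryRequest.Builder req = QueryRequest.builder()
                    .tableName("Customer")                        // illustrative table name
                    .indexName("timestamp-index")                 // illustrative GSI name
                    .keyConditionExpression("customerId = :id")
                    .expressionAttributeValues(Map.of(
                            ":id", AttributeValue.builder().s("cust-1").build()))
                    .scanIndexForward(false)                      // newest first
                    .limit(2);
            if (startKey != null) {
                req.exclusiveStartKey(startKey);                  // resume where the last page stopped
            }

            QueryResponse page = client.query(req.build());
            page.items().forEach(System.out::println);

            startKey = page.hasLastEvaluatedKey() ? page.lastEvaluatedKey() : null;
        } while (startKey != null);
    }
}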
If your system cannot afford to do the filtering itself after the query is called, then you need to re-evaluate your Hash Key/Sort Key combinations. Dynamo is best aligned to an access-pattern schema; that is to say, I have X data and I will need Y data, so I make X the Hash Key and the Y values different Range Keys under that X.
As an example, take user data. You might have a Hash Key of "user_id".
Then you have several different patterns for Range Keys:
meta# (with attributes of email, username, hashed and salted passwords, etc.)
post#1
post#2
post#3
avatar#
So if you make a query on just the Hash Key of the user id, you get all the info. Or, if you have a page with just their posts, you can do a query on the hash key of the user id plus a range key condition (begins_with post#).
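With the enhanced client, those two access patterns map roughly onto the sketch below; the bean, the users table name, and the attribute names are invented to match the example:

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbBean;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbPartitionKey;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbSortKey;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryConditional;

@DynamoDbBean
class UserItem {
    private String userId;   // Hash Key, e.g. "user_42"
    private String sk;       // Range Key, e.g. "meta#", "post#1", "avatar#"
    private String body;

    @DynamoDbPartitionKey
    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }

    @DynamoDbSortKey
    public String getSk() { return sk; }
    public void setSk(String sk) { this.sk = sk; }

    public String getBody() { return body; }
    public void setBody(String body) { this.body = body; }
}

public class UserQueries {
    public static void main(String[] args) {
        DynamoDbEnhancedClient enhanced = DynamoDbEnhancedClient.create();
        DynamoDbTable<UserItem> table =
                enhanced.table("users", TableSchema.fromBean(UserItem.class));

        // Everything for one user: query on the Hash Key only.
        table.query(QueryConditional.keyEqualTo(
                Key.builder().partitionValue("user_42").build()))
             .items().forEach(System.out::println);

        // Only that user's posts: Hash Key plus begins_with on the Range Key.
        table.query(QueryConditional.sortBeginsWith(
                Key.builder().partitionValue("user_42").sortValue("post#").build()))
             .items().forEach(System.out::println);
    }
}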
This is the most important aspect of a good Dynamo schema: the ability to query anything you need with just a Hash Key, or a Hash Key and Range Key.
With a well-understood set of access patterns and a Dynamo table that is set up appropriately, you should need no filters or limits, because your combination of Hash/Range key queries will be enough.
(This does sometimes mean duplication of data! You may have the same information in a post# item as in the meta# item, i.e. they both contain usernames. This is OK: when you query for a post you need the username, just as you do when you query for the password/username to see if they match. Dynamo, as a NoSQL store, handles this very well and very fast; a given Hash/Range key combination is basically its own table in terms of access, making queries against it very fast.)
I am working on a trading application which is deployed on WebLogic and has a limitation that any request that takes more than one minute to process is killed automatically. The restriction is set at the kernel level, is not application dependent, and is not in our control to change.
There is search functionality in my application which fails when queried for more than 100 records in a given time frame, and I was assigned the task of finding possible solutions.
The first approach that I suggested was to use pagination instead of querying for all records at the same time. I was told that it won't help, as the database side would fetch all the records at the same time anyway. This was new to me, as I had understood until now that this is handled on the database side: the query fetches only the given number of records per page, and each previous/next request is handled the same way, reducing the overall response time.
I searched a lot before posting this question on how pagination works and how it helps reduce response time, but did not find a concrete answer. It would be really great if somebody could explain this. Thanks in advance!
The first approach that I suggested was to use pagination instead of querying for all records at the same time. I was told that it won't help, as the database side would fetch all the records at the same time anyway.
This is true if you are using LIMIT and OFFSET clauses in your query for pagination. In this case, the database loads the matched records (matched by the WHERE clause) from disk and then applies the OFFSET and LIMIT clauses. Since databases use B-trees for indexing, they can't jump directly to the OFFSET-th record without first loading the matched records into memory.
To load only a page's worth of records, you need to use key-based pagination. In this approach we avoid the OFFSET clause; instead we use the key of the record together with the LIMIT clause.
Example for key-based pagination:
Let's say you want to paginate the users
Request for first 10 records:
select * from user where userid > 0 order by userid asc limit 10
Let's say the last userid returned by the above query is 10.
Request for next 10 records:
select * from user where userid > 10 order by userid asc limit 10
I need to implement a feature that displays customer names in ascending or descending order (along with other customer data) from an Oracle database.
Say I display the first 100 names from the DB in descending order.
There is a Show More button which will display the next 100 names.
I am planning to fetch the next records based on the last index, so in step 2 I will fetch names 101 to 200.
But the problem is: what if, just before step 2, a name was updated by some other user?
In that case a name can be skipped (if the name was updated from X to A) or duplicated (if the name was updated from A to Z) when I fetch records by index in step 2.
Consider that the record names displayed on the first page run from Z to X.
How can I handle this scenario so that I display the correct records without skips or duplicates?
One way I can think of is to fetch all record IDs into memory (either web server memory or cursor memory), store them as a temporary result, and then return the data from there instead of the live data. But if I have millions of records, that will put a load on memory, either on the web server or in the DB.
What is the best approach, and how do other sites handle this kind of scenario?
If you really want each user to view a fixed snapshot of the table data, then you will have to do some caching behind the scenes. You have a valid concern about what would happen if, when requesting page 2, several new records landed on what would have been page 1, thus causing the same information to be viewed again on page 2. Playing devil's advocate, I could also argue that a user might be viewing records which were deleted and are no longer there. This could be equally bad in terms of user experience.
The way I have usually seen this problem handled is to just do a fresh query for each page. Since you are using Oracle, you would likely be using OFFSET and FETCH. It is possible that there could be a duplicated/missing record problem, but unless your data is changing very rapidly, it may be a minor one.
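For reference, a fresh query per page with Oracle's OFFSET/FETCH (12c and later), bound through JDBC, could look roughly like this; the customers table and customer_name column are placeholders:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerPageDao {
    private final Connection conn;

    public CustomerPageDao(Connection conn) {
        this.conn = conn;
    }

    // Fetches one page of customer names in descending order using OFFSET/FETCH.
    public void printPage(int pageNumber, int pageSize) throws SQLException {
        String sql = "SELECT customer_name FROM customers " +
                     "ORDER BY customer_name DESC " +
                     "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageNumber * pageSize);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("customer_name"));
                }
            }
        }
    }
}

Each Show More click simply asks for the next pageNumber; the trade-off is exactly the one described above: rows that move between requests can still appear twice or be skipped.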
We want to programmatically copy all records from one table to another periodically.
Currently I use SELECT * FROM users LIMIT 2 OFFSET <offset> to fetch records.
The table records look like this:
user_1
user_2
user_3
user_4
user_5
user_6
After I fetched the first page (user_1, user_2), the record "user_2" was deleted from the source table.
Now the second page I fetch is (user_4, user_5), and the third page is (user_6).
This means I lost the record "user_3" at the destination table.
The real source table may have 1,000,000 records. How can I resolve this problem effectively?
First, you should use a unique index on the source table and use it in an ORDER BY clause to make sure that the order of the rows is consistent over time. Next, do not use offsets, but start after the last element fetched.
Something like:
SELECT * FROM users ORDER BY id LIMIT 2;
for the first time, and then
SELECT * FROM users WHERE id > last_received_id ORDER BY id LIMIT 2;
for the next ones.
This will be immune to asynchronous deletions.
If you have no unique index but do have a non-unique one on your table, you can still apply the above solution with a non-strict comparison operator. You will consistently re-fetch the last rows, and it would certainly break with a LIMIT of 2, but it could work for reasonable batch sizes.
If you have no index at all, which is known to cause various other problems, the only reliable way is to run one single big SELECT and use a SQL cursor to page through it.
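Put together in Java, the keyset copy loop might look something like the sketch below; the users table follows the question, while the destination table, the name column, and the batch size are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TableCopier {
    private static final int BATCH_SIZE = 1000;   // batches larger than 2 make sense for ~1,000,000 rows

    public static void copyAll(Connection src, Connection dst) throws SQLException {
        String select = "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?";
        String insert = "INSERT INTO users_copy (id, name) VALUES (?, ?)";

        long lastReceivedId = 0;   // assumes ids start above 0
        boolean more = true;

        try (PreparedStatement sel = src.prepareStatement(select);
             PreparedStatement ins = dst.prepareStatement(insert)) {
            while (more) {
                sel.setLong(1, lastReceivedId);
                sel.setInt(2, BATCH_SIZE);
                more = false;
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        more = true;
                        lastReceivedId = rs.getLong("id");   // remember where this page ended
                        ins.setLong(1, lastReceivedId);
                        ins.setString(2, rs.getString("name"));
                        ins.addBatch();
                    }
                }
                ins.executeBatch();                           // flush this page to the destination
            }
        }
    }
}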
I have a scenario: we have a big table (split into a few smaller ones) and I want to use a trigger to track changes. We will insert rows into a tracking table whenever the big table has an insert, update, or delete event. I need to build a Java app that continuously checks the tracking table to see if there are rows left in it, fetches them, does some computation, updates a cache, and deletes them.
My question is what is the most efficient way to implement it?
Some concerns:
Continuously checking the DB is not great. Maybe sleep one second each time?
Some rows in the tracking table can be grouped together by ID. We only need to deal with each distinct ID once per pass.
Need to limit the returned rows, maybe 200 at a time.
This sounds like you are trying to implement a queue in a database. JMS may be a better choice.
You can periodically poll the table to find entries. If IDs have to be grouped together, I assume you need some way of knowing that an ID is complete.
If your IDs are increasing, you can include the next 200 IDs in your query, e.g. WHERE id < {id-up-to} + 200.
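If you stay with polling rather than JMS, a minimal sketch of such a loop could look like the following; the tracking-table layout (an entity_id column) and the SQL are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashSet;
import java.util.Set;

public class TrackingTablePoller {
    private final Connection conn;

    public TrackingTablePoller(Connection conn) {
        this.conn = conn;
    }

    public void run() throws SQLException, InterruptedException {
        while (true) {
            Set<Long> ids = fetchDistinctIds(200);      // cap each pass at 200 distinct ids
            for (long id : ids) {
                recompute(id);                           // computation + cache update for this id
                deleteTracked(id);                       // then clear its rows from the tracking table
            }
            if (ids.isEmpty()) {
                Thread.sleep(1000);                      // nothing to do: back off for a second
            }
        }
    }

    private Set<Long> fetchDistinctIds(int limit) throws SQLException {
        String sql = "SELECT DISTINCT entity_id FROM tracking ORDER BY entity_id LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, limit);
            try (ResultSet rs = ps.executeQuery()) {
                Set<Long> ids = new LinkedHashSet<>();
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
                return ids;
            }
        }
    }

    private void recompute(long id) {
        // placeholder: recompute whatever this id's rows represent and update the cache
    }

    private void deleteTracked(long id) throws SQLException {
        try (PreparedStatement ps =
                     conn.prepareStatement("DELETE FROM tracking WHERE entity_id = ?")) {
            ps.setLong(1, id);
            ps.executeUpdate();
        }
    }
}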