scanIndexForward() not working as expected while doing pagination in DynamoDB (Java)

I am using DynamoDbEnhancedAsyncClient to query DynamoDB through a GSI with pagination. Below is the code I am using. I am trying to limit both the number of items per page and the number of pages sent to the subscriber of the Mono. I also need the records in each page sorted in descending order by timestamp, which is the sort key of my GSI, so I am using scanIndexForward(false). However, I am not getting any records back in the page, even though there are four records in DynamoDB in total.
SdkPublisher<Page<Customer>> query = secindex.query(QueryEnhancedRequest.builder()
        .queryConditional(queryconditional)
        .scanIndexForward(false)
        .limit(2)
        .build());
Mono.from(PagePublisher.create(query.limit(1)))
secindex is the DynamoDbAsyncIndex for the GSI. As per the above code, one page with two records should be returned to the client, yet none are returned. If I remove scanIndexForward(false), the result is as expected, but sorted in ascending order. How do I make it return a limited number of records in descending order? Does pagination work differently when scanIndexForward() is supplied?

Without knowing 100% what the filters on your DynamoDB call are, I can only guess - but I've seen this sort of thing many times.
Correction: the limit is applied before the query result is returned, not after. The steps below originally got this wrong - but because any additional filters are applied after the limited results come back, this can still mean that two items are read and then filtered out, for an ultimate return of 0 items.
End correction.
A DynamoDB Query does not apply any attribute filters to the data before returning it. The only thing a standard query can do is match on the Hash Key/Range Key, with some basic Range Key conditions (gt, lt, between, begins_with, etc.) - all other filters, on attributes that are not the Hash/Range key, are applied by the SDK you're using after it has received a set of responses.
1 - Query DynamoDB with a Hash Key/Range Key combination and any conditions on the Range Key.
2 - All items that match are read - up to 1 MB of data. Anything more than that needs additional calls.
3 - Apply the limit to these results! (This was incorrect - the limit is applied before the results are returned.)
4 - Apply the filter to what has been limited.
5 - Whatever is left is returned.
This means that when you use filter conditions on a DynamoDB query, you often don't get back what you expect - either the matching items are on the next page, or nothing on the current page matches the filter, so you get back 0 items.
Since you are also using Limit, the data is read in descending sort-key order (because scan index forward is false), and if the first two items don't match your other filters, you get 0 items back.
I would recommend you try querying all the items without any filters beyond just Hash/Range key - no other attribute filters.
Then filter the response manually on your side.
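As a rough illustration of that approach, here is a minimal sketch with the enhanced async client: the key condition does all the server-side work, limit() controls the page size, and any extra attribute filtering happens in your own code afterwards. The Customer bean accessor, table name, index name, key value and the status check are all illustrative, not taken from the question.

import java.util.List;
import java.util.stream.Collectors;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.async.SdkPublisher;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbAsyncIndex;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbAsyncTable;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedAsyncClient;
import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.enhanced.dynamodb.model.Page;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryConditional;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryEnhancedRequest;

class CustomerQuery {

    // Sketch only: table name, index name, key value and the "status" check are illustrative.
    static Mono<List<Customer>> latestTwoCustomers(DynamoDbEnhancedAsyncClient enhancedClient) {
        DynamoDbAsyncTable<Customer> table =
                enhancedClient.table("customer", TableSchema.fromBean(Customer.class));
        DynamoDbAsyncIndex<Customer> secindex = table.index("customer-timestamp-index");

        QueryConditional byCustomerId = QueryConditional.keyEqualTo(
                Key.builder().partitionValue("CUST#123").build());

        // Key condition only - no filter expression - so limit(2) really means
        // "two items per page", read in descending sort-key (timestamp) order.
        SdkPublisher<Page<Customer>> pages = secindex.query(
                QueryEnhancedRequest.builder()
                        .queryConditional(byCustomerId)
                        .scanIndexForward(false) // newest first
                        .limit(2)                // page size; applied before any filtering
                        .build());

        // Take the first page only, then filter in application code if needed.
        return Mono.from(pages.limit(1))
                .map(page -> page.items().stream()
                        .filter(c -> !"DELETED".equals(c.getStatus())) // illustrative client-side filter
                        .collect(Collectors.toList()));
    }
}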
You should also be aware of the internal pagination of the DynamoDB SDKs - a single call returns at most 1 MB of data from DynamoDB. Anything beyond that requires a second call that includes the LastEvaluatedKey returned with the first page of results. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html has more information.
If your system cannot afford to do the filtering itself after the query returns, then you need to re-evaluate your Hash Key/Sort Key combinations. DynamoDB works best with an access-pattern-driven schema - that is to say: I have X data and I will need Y data, so I make X the Hash Key and the different Y values Range Keys under that X.
As an example, take user data. You might have a Hash Key of user_id, and then several different patterns of Range Keys:
meta# (with attributes such as email, username, hashed and salted password, etc.)
post#1
post#2
post#3
avatar#
So if you query on just the Hash Key of the user id, you get all of the user's info; or, for a page with just their posts, you can query on the Hash Key of the user id and a Range Key condition of begins_with(post#).
This is the most important aspect of a good DynamoDB schema - the ability to run any query you need with just a Hash Key, or a Hash Key and Range Key.
With a well-understood set of access patterns and a table that is set up appropriately, you should need no filters or limits, because your combination of Hash/Range key queries will be enough.
(This does sometimes mean duplication of data! You may have the same information in a post# item as in the meta# item - e.g. they both contain the username. That is OK: when you query for a post you need the username, just as you do when you query for the password/username to check a login. DynamoDB, as a NoSQL store, handles this very well and very fast - a given Hash/Range key combination is basically its own table in terms of access, which makes queries against it very fast.)
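To make that concrete, "everything for this user" versus "just this user's posts" with the enhanced client might look like the sketch below; the table name, the UserItem bean and the key values are hypothetical.

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbAsyncTable;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedAsyncClient;
import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.enhanced.dynamodb.model.PagePublisher;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryConditional;

class UserAccessPatterns {

    // Sketch only: "users", UserItem and the key values are hypothetical.
    static void queries(DynamoDbEnhancedAsyncClient enhancedClient) {
        DynamoDbAsyncTable<UserItem> users =
                enhancedClient.table("users", TableSchema.fromBean(UserItem.class));

        // "Give me everything about this user" - hash key only.
        PagePublisher<UserItem> allForUser = users.query(
                QueryConditional.keyEqualTo(
                        Key.builder().partitionValue("user#42").build()));

        // "Give me just this user's posts" - hash key plus begins_with on the range key.
        PagePublisher<UserItem> postsOnly = users.query(
                QueryConditional.sortBeginsWith(
                        Key.builder().partitionValue("user#42").sortValue("post#").build()));

        allForUser.items().subscribe(item -> { /* render the full profile page */ });
        postsOnly.items().subscribe(item -> { /* render the posts page */ });
    }
}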

Related

Fetching sorted data from server in chunks?

I need to implement a feature where I display customer names in ascending or descending order (along with other customer data) from an Oracle database.
Say I display the first 100 names from the DB in descending order.
There is a "show more" button which displays the next 100 names.
I am planning to fetch the next records based on the last index, so in step 2 I will fetch names 101 to 200.
But the problem is: what if, just before step 2, a name was updated by some other user?
In that case a name can be skipped (if it was updated from X to A) or duplicated (if it was updated from A to Z) when I fetch records by index in step 2.
Consider that the names displayed on the first page run from Z to X.
How can I handle this scenario so that I display the correct records without skips or duplicates?
One way I can think of is to fetch all record IDs into memory (either web-server memory or cursor memory), store them as a temporary result and then return the data from there instead of the live data. But if I have millions of records, that puts a load on memory, either on the web server or in the DB.
What is the best approach, and how do other sites handle this kind of scenario?
If you really want each user to view a fixed snapshot of the table data, then you will have to do some caching behind the scenes. You have a valid concern about what would happen if, when requesting page 2, several new records landed on what would have been page 1, causing the same information to be shown again on page 2. Playing devil's advocate, I could also argue that a user might be viewing records which have since been deleted and are no longer there. That could be equally bad in terms of user experience.
The way I have usually seen this problem handled is to simply run a fresh query for each page. Since you are using Oracle, you would likely use OFFSET and FETCH. It is still possible to get a duplicated or missing record, but unless your data changes very rapidly, it should be a minor problem.
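For what it's worth, a minimal JDBC sketch of that per-page query against Oracle 12c+ might look like this; the table and column names are made up. The id tie-breaker in the ORDER BY keeps page boundaries stable when names repeat, which reduces (but does not remove) the skip/duplicate risk.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

class CustomerPageDao {

    private final DataSource dataSource;

    CustomerPageDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Sketch only: "customer", "id" and "name" are illustrative names.
    void printPage(int pageNumber, int pageSize) throws SQLException {
        String sql = "SELECT id, name FROM customer "
                   + "ORDER BY name DESC, id "
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageNumber * pageSize); // rows to skip (page index is 0-based)
            ps.setInt(2, pageSize);              // rows to return
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}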

How to implement Proper Pagination in millions of records in Lucene

I have over 10 million documents in my Lucene indexes and I need to implement proper pagination in my application. Each document is a unique record of a college candidate. Currently I show 5 records per page and provide pagination controls on the front end.
As soon as the search is performed, 5 records are displayed for page number 1. There are buttons that take the user to the first page, next page, previous page and last page.
Now suppose my search query has 10 million total hits; when I click on Last Page I essentially jump to page number 2,000,000. In the back end I pass pageNumber * 5 as the maxSearch (int) to the Lucene search function. This takes a very long time to fetch the results.
(Screenshots of the front-end result and of the back-end search code were attached to the original post.)
My hit count is never computed; the process gets stuck at the search itself. Kindly suggest how to implement this correctly.
P.S. I am using Lucene 4.0.0.
Several approaches might help:
Leave all pagination to Lucene
You can avoid manually looping over hits.scoreDocs, as described in the accepted answer to the Lucene 4 Pagination question.
Cursors
If the performance of the Lucene-based pagination approach is not enough, you can try to implement cursors:
Every found document has a sort position, for example the tuple (sort field value, docId). The second element of the tuple breaks ties between documents with the same sort value.
So you can pass that sort position to the next and previous pages and use sort filters instead of iterating.
For example:
On the first page we see three documents (sorted by date):
(date: 2017-01-01, docId: 10), (date: 2017-02-02, docId: 3), (date: 2017-02-02, docId: 5).
The second page then starts from the first document (in sort order) matching (date > 2017-02-02 OR (date == 2017-02-02 AND docId > 5)).
It is also possible to cache these positions for several pages during the search.
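If you stay inside Lucene, the searchAfter() overloads already implement this idea: you hand back the last ScoreDoc of the previous page instead of re-collecting pageNumber * pageSize hits. Here is a rough sketch with hypothetical field names; the exact set of searchAfter overloads depends on your Lucene 4.x version, so check your IndexSearcher javadoc.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

class CursorPaging {

    // Sketch only: "date" and "candidateId" are hypothetical sort fields;
    // the second field breaks ties between documents with the same date.
    static void firstTwoPages(IndexReader reader, Query query, int pageSize) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        Sort sort = new Sort(
                new SortField("date", SortField.Type.LONG),
                new SortField("candidateId", SortField.Type.LONG));

        TopDocs firstPage = searcher.search(query, pageSize, sort);

        // The cursor is simply the last hit of the page just rendered.
        ScoreDoc cursor = firstPage.scoreDocs.length > 0
                ? firstPage.scoreDocs[firstPage.scoreDocs.length - 1]
                : null;

        // The next page starts strictly after the cursor position - no offset walking.
        TopDocs secondPage = searcher.searchAfter(cursor, query, pageSize, sort);
        System.out.println("second page hits: " + secondPage.scoreDocs.length);
    }
}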
Pagination issues on changing index
Pagination usually applies to a particular version of the index: if the index is updated in the middle of the user's interaction with the results, pagination can give a bad experience - document positions may shift because documents are added or removed, or because the sort field value of an existing document changes.
Sometimes we must present the search results "as of the moment of the search", displaying a snapshot of the index, but that is very tricky for big indexes.
Cursors stored on the client side (commonly as an opaque string) can seriously break pagination when the index is updated.
Usually only a few queries produce really huge result sets, and the backend can cache page cursors for those queries, e.g. in a WeakMap keyed by coreCacheKey.
Special last page handling
If and only if "Last page" is a frequent operation, you can sort results in reverse order and obtain last page documents as a reverse of top reversed results.
Keep in the mind same value issue when implementing a reverse order.
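A minimal sketch of that reverse trick; the field names are hypothetical, and note that the tie-breaker field is reversed along with the primary sort.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

class LastPageHelper {

    // Sketch only: "date" and "candidateId" are hypothetical sort fields.
    static List<ScoreDoc> lastPage(IndexSearcher searcher, Query query, int pageSize)
            throws IOException {
        // Reverse both the primary sort and the tie-breaker (third argument = reverse).
        Sort reversed = new Sort(
                new SortField("date", SortField.Type.LONG, true),
                new SortField("candidateId", SortField.Type.LONG, true));

        TopDocs topReversed = searcher.search(query, pageSize, reversed);

        // The top hits of the reversed search are the last page; flip them back for display.
        List<ScoreDoc> hits = new ArrayList<>(Arrays.asList(topReversed.scoreDocs));
        Collections.reverse(hits);
        return hits;
    }
}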

GAE Search API: How to prevent the 2000 bytes query limit?

We have been using the GAE Search API for quite some time but recently hit the query length limit of 2000 bytes:
java.lang.IllegalArgumentException: query string must not be longer
than 2000 bytes, was 2384
We basically save documents with a secondary id stored as an atomic field. Within our query we do some sorting and distance calculations and also exclude docs whose secondary ids match a list of ids, using a NOT statement like the following:
... AND NOT sec_id:(x AND y AND ...)
With a certain number of excluded ids we obviously hit the query length limit. I could split the query into separate queries with the same base query and a different set of excluded ids in each, but then the sorting becomes problematic.
So I am wondering whether there is another way to implement this kind of query, preferably with both a blacklist and a whitelist within one query (AND NOT sec_id:(..) as well as AND sec_id:(..)).

Avoiding for loop and try to utilize collection APIs instead (performance)

I have a piece of code from an old project.
The logic (in a high level) is as follows:
The user sends a series of {id, Xi} pairs, where id is the primary key of the object in the database.
The aim is that the database is updated while the series of Xi values always remains unique.
I.e. if the user sends {1,X1} and in the database we have {1,X2},{2,X1}, the input should be rejected; otherwise we would end up with duplicates, i.e. {1,X1},{2,X1} - X1 appears twice in different rows.
At a lower level, the user sends a series of custom objects that encapsulate this information.
Currently the implementation is brute force, i.e. repeated for-loops over the input and the JDBC ResultSet to ensure uniqueness.
I do not like this approach, and moreover the actual implementation has subtle bugs, but that is another story.
I am searching for a better approach, both in terms of coding and performance.
What I was thinking is the following:
Create a Set from the user's input list. If the Set has a different size than the list, the user's input contains duplicates - stop there.
Load the data via JDBC.
Create a HashMap<Long, String> from the user's input, keyed by the primary key.
Loop over the ResultSet. If the HashMap does not contain a key equal to the ResultSet row's id, add that row to the HashMap.
At the end, get the HashMap's values as a List. If it contains duplicates, reject the input.
This is the algorithm I came up with.
Is there a better approach (assuming the algorithm itself is not wrong)?
Purely from a performance point of view, why not let the database figure out that there are duplicates (like {1,X1},{2,X1})? Put a unique constraint on the table, and when the update statement fails by throwing an exception, catch it and deal with it however you want for that input. You may also want to run this as a single transaction, in case you need to roll back any partial updates. Of course, this assumes you don't have any other business rules driving the updates that you haven't mentioned here.
With your algorithm, you are spending too much time iterating over HashMaps and Lists to remove duplicates, IMHO.
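A rough sketch of the constraint-plus-transaction idea; the table, column and class names are made up, and the exact exception type a duplicate raises depends on your JDBC driver.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;
import javax.sql.DataSource;

class UniqueValueUpdater {

    // Assumes a constraint like: ALTER TABLE my_table ADD CONSTRAINT uq_value UNIQUE (value_col);
    // Returns true if all updates were applied, false if the batch was rejected as a duplicate.
    static boolean applyUpdates(DataSource dataSource, Map<Long, String> idToValue) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false); // single transaction so partial updates can be rolled back
            try (PreparedStatement ps =
                         conn.prepareStatement("UPDATE my_table SET value_col = ? WHERE id = ?")) {
                for (Map.Entry<Long, String> e : idToValue.entrySet()) {
                    ps.setString(1, e.getValue());
                    ps.setLong(2, e.getKey());
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();
                return true;
            } catch (SQLException duplicate) {
                // A unique-constraint violation typically surfaces as a
                // SQLIntegrityConstraintViolationException or BatchUpdateException.
                conn.rollback();
                return false;
            }
        }
    }
}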
Since you can't change the database, as stated in the comments, I would extend your Set idea. Create a HashMap<Long, String> and put all of the items from the database in it, then also create a HashSet<String> with all of the values from your database.
Then, as you go through the user input, check each key against the HashMap and see whether the values are the same. If they are, great - you don't have to do anything, because that exact input is already in your database.
If they aren't the same, check the value against the HashSet to see if it already exists. If it does, you have a duplicate.
This should perform much better than a loop.
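Something like this minimal sketch, assuming the user's input and the database rows have both been loaded into id-to-value maps (the names are illustrative):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DuplicateCheck {

    // Sketch only: dbRows is the {id -> value} map loaded from the database,
    // userInput is the {id -> value} map built from the request.
    static boolean isValidUpdate(Map<Long, String> userInput, Map<Long, String> dbRows) {
        Set<String> existingValues = new HashSet<>(dbRows.values());
        for (Map.Entry<Long, String> entry : userInput.entrySet()) {
            String current = dbRows.get(entry.getKey());
            if (entry.getValue().equals(current)) {
                continue;      // exact row already in the database, nothing to do
            }
            if (existingValues.contains(entry.getValue())) {
                return false;  // value already used by another row -> duplicate
            }
        }
        return true;
    }
}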
Edit:
For multiple updates, apply all of the updates to the HashMap created from your database, then once again check whether the Map's value set has a different size from its key set.
There might be a better way to do this, but this is the best I've got.
I'd opt for a database-side solution. Assuming a table with the columns id and value, build a list of all the values and use the following SQL:
select count(*) from tbl where value in (:values);
binding the :values parameter to the list of values in whatever way is appropriate for your environment. (Trivial when using Spring JDBC and a database that supports the IN operator, less so for lesser setups; as a last resort you can generate the SQL dynamically.) You will get a result set with one row and one column of a numeric type. If it is 0, you can go ahead and insert the new data; if it is 1, report a constraint violation. (If it is anything else, you have a whole new problem.)
If you need to check for every item in the user input, change the query to:
select value from tbl where value in (:values)
store the result in a set (called e.g. duplicates), and then loop over the user input items and check whether the value of the current item is in duplicates.
This should perform better than snarfing the entire dataset into memory.
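With Spring JDBC, for instance, that second query could look roughly like the sketch below; the table and column names are the same placeholder ones as above, and NamedParameterJdbcTemplate expands the :values list into the IN (...) clause for you.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

class DuplicateLookup {

    private final NamedParameterJdbcTemplate jdbc;

    DuplicateLookup(NamedParameterJdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Returns the subset of candidate values that already exist in the table.
    Set<String> findDuplicates(List<String> values) {
        MapSqlParameterSource params = new MapSqlParameterSource("values", values);
        List<String> existing = jdbc.queryForList(
                "select value from tbl where value in (:values)", params, String.class);
        return new HashSet<>(existing);
    }
}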

How do I implement Hibernate Pagination using a cursor (so the results stay consistent, despite new data being added to the table being paged)?

Is there any way to maintain a database cursor using Hibernate between web requests?
Basically, I'm trying to implement pagination, but the data that is being paged is consistently changing (i.e. new records are added into the database). We are trying to set it up such that when you do your initial search (returning a maximum of 5000 results), and you page through the results, those same records always appear on the same page (i.e. we're not continuously running the query each time next and previous page buttons are clicked). The way we're currently implementing this is by merely selecting 5000 (at most) primary keys from the table we're paging, storing those keys in memory, and then just using 20 primary keys at a time to fetch their details from the database. However, we want to get away from having to store these keys in memory and would much prefer a database cursor that we just keep going back to and moving backwards and forwards over the cursor to generate pages.
I tried doing this with Hibernate's ScrollableResults, but found that calling methods like next() and previous() causes an exception if you are in a different web request / Hibernate session (no surprise there).
Is there any way to reattach a ScrollableResults object to a Session, much the same way you would reattach a detached database object to make it persistent?
Never use offset-based paging: the database still has to read all of the rows before the offset, which is very inefficient.
Instead, order by an indexed, unique property, return the last item's property value from your API call, and use a WHERE clause to start from where you left off. That last item's property value is your cursor position. For example, a simple paginated query that uses the primary key id as the cursor would look like this:
List<MyEntity> entities = entityManager
        .createQuery("""
                FROM MyEntity e
                WHERE e.id > :cursorPosition
                ORDER BY e.id ASC
                """, MyEntity.class)
        .setParameter("cursorPosition", cursorPosition)
        .setMaxResults(pageSize)
        .getResultList();
On the first call to the API, the cursorPosition value can be 0. On subsequent calls, the client sends back the cursor it received from the previous call. See how the Google Maps paginated places query works with its nextPageToken attribute.
Your cursor has to be a string that identifies all the parameters of your query, so if you have additional parameters, they must be recoverable from the cursor.
I believe you can do this in multiple ways. One way is to concatenate all the parameters and the cursorPosition into one string, encode it into a URL-friendly string such as Base64, and, when you receive it back, decode it and split the string into the original parameters:
String nextPageToken = Base64.getUrlEncoder()
        .encodeToString("indexProperty=id&cursorPos=123&ageBiggerThan=65".getBytes());
Your API call will then return JSON like this:
{
"items": [ ... ],
"nextPageToken": "aW5kZXhQcm9wZXJ0eT1pZCZjdXJzb3JQb3M9MTIzJmFnZUJpZ2dlclRoYW49NjU="
}
And the client next call:
GET https://www.example.com/api/myservice/v1/myentity?pageToken=aW5kZXhQcm9wZXJ0eT1pZCZjdXJzb3JQb3M9MTIzJmFnZUJpZ2dlclRoYW49NjU=
The part where you concatenate and split the cursor string may be tiresome. I don't know whether there is a library that handles this work of creating and parsing the tokens - I am actually on this question because I was looking for one. But my guess is that Gson or Jackson can save you some lines of code here.
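For example, with Jackson you could keep the cursor state in a small POJO, serialize it to JSON and Base64-encode that as the opaque token, roughly like this sketch (the PageCursor fields are just illustrative):

import java.io.IOException;
import java.util.Base64;

import com.fasterxml.jackson.databind.ObjectMapper;

class PageCursor {
    public String indexProperty;
    public long cursorPos;
    public int ageBiggerThan;

    PageCursor() { } // needed by Jackson for deserialization

    PageCursor(String indexProperty, long cursorPos, int ageBiggerThan) {
        this.indexProperty = indexProperty;
        this.cursorPos = cursorPos;
        this.ageBiggerThan = ageBiggerThan;
    }
}

class PageTokens {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Cursor state -> JSON -> URL-safe Base64 token.
    static String encode(PageCursor cursor) throws IOException {
        byte[] json = MAPPER.writeValueAsBytes(cursor);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(json);
    }

    // Token from the client -> cursor state for the next query.
    static PageCursor decode(String token) throws IOException {
        byte[] json = Base64.getUrlDecoder().decode(token);
        return MAPPER.readValue(json, PageCursor.class);
    }
}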
Essentially you're on your own for this one. What you want to do is take a look at the OpenSessionInView filter and build your own, so that instead of creating a new Hibernate Session per request, you pull one out of a cache associated with the user's web session.
If you don't have a framework like Spring Web Flow that gives you some conversation structure, you're going to need to build that too, since you probably want some way to manage the lifecycle of that Hibernate session beyond "when the web session expires". You also most likely do not want two user threads from the same web session but different browser tabs sharing a Hibernate session. (Hilarity is likely to ensue.)
