I am trying to write an ES search method that returns records/documents that have `Tickets` at the top of the list. So, for example, I do a search on 'Smith' and get 200 results. I want all results ordered so that those who have tickets appear first. I have this query working; however, whenever I try to also sort by first name, the query no longer sorts by those with tickets first and then by first name.
Does anyone have any thoughts on how I might keep the original sort order (by Tickets)?
I am using DynamoDbEnhancedAsyncClient to query DynamoDB using a GSI and pagination. Below is the code that I am using to achieve this. I am trying to limit the number of items per page and the number of pages sent to the subscriber of the Mono using the code below. I need to sort the records in each page in descending order by timestamp, which is the sort key in my GSI; for this I am using scanIndexForward(false) below. However, I am not getting any records in the page, even though there are 4 records in total present in DynamoDB.
SdkPublisher<Page<Customer>> query = secindex.query(QueryEnhancedRequest.builder()
        .queryConditional(queryconditional).scanIndexForward(false)
        .limit(2).build());
Mono.from(PagePublisher.create(query.limit(1)));
secindex is the DynamoDbAsyncIndex for the GSI. As per the above code, 1 page should be returned to the client with 2 records, however none are returned. Also, if I remove scanIndexForward(false), then the result is as expected, but sorted in ascending order. How do I make it return a limited number of records in descending order? Does the pagination work differently when scanIndexForward() is supplied?
Without knowing 100% what your filters are on your DynamoDB call, I can only guess - but I've seen this sort of thing many times.
Correction: the limit is applied before the query result is returned, not after. The steps below were incorrect on this point - but because of the nature of additional filters being applied after the return, this could indeed result in 2 items being returned that are then filtered out, and an ultimate return of 0.
End correction.
A DynamoDB Query does not perform any filters/limits on the data before returning it. The only thing a standard query to DynamoDB can do is check the Hash Key/Range Key, with some basic Range Key filtering (gt, lt, between, begins_with, etc.) - all other filters on attributes that are not Hash/Range are done by the SDK you're using after getting back a set of responses.
1 - Query DynamoDB with a Hash Key/Range Key combination and any filtering on the Range Key.
2 - All items that match are sent back - up to 1 MB of data. Anything more than that needs additional calls.
3 - Apply the limit to these results! (this was incorrect - the limit is applied before the results are returned)
4 - Apply the filter to what's been limited!
5 - Then whatever is left is returned.
This means that when you use filter conditions on a DynamoDB query, you often may not actually get back what you expect - because either the matching items are on the next page, or nothing on the current page matches the filter, so you get back 0.
Since you are also using Limit, the items are read in descending order of the Sort Key (because scan index forward is false), and if the first two items read don't match your other filters, you get 0 items back.
I would recommend you try querying all the items without any filters beyond just Hash/Range key - no other attribute filters.
Then filter the response manually on your side.
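As a rough sketch of that idea - keys only in the query, attribute filtering done client-side (reusing secindex and queryconditional from the question; the "status" attribute, its getStatus() accessor and the use of Reactor's Flux are illustrative assumptions, not part of the original code):

import reactor.core.publisher.Flux;
import software.amazon.awssdk.core.async.SdkPublisher;
import software.amazon.awssdk.enhanced.dynamodb.model.Page;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryEnhancedRequest;

// Query with key conditions only - no attribute filters, no limit.
SdkPublisher<Page<Customer>> pages = secindex.query(QueryEnhancedRequest.builder()
        .queryConditional(queryconditional)
        .scanIndexForward(false)            // descending by the GSI sort key
        .build());

// Attribute filtering and limiting happen in application code.
Flux.from(pages)
        .flatMapIterable(Page::items)
        .filter(c -> "ACTIVE".equals(c.getStatus()))   // hypothetical attribute filter
        .take(2)                                       // page size, applied after filtering
        .collectList()
        .subscribe(customers -> System.out.println("got " + customers.size() + " records"));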
You also should be aware of the internal pagination of the SDKs for DynamoDB - a single call will only return up to 1 MB of data from DynamoDB. Anything beyond that requires a second call that includes the LastEvaluatedKey returned in the first page of results. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html has more information.
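That pagination loop looks roughly like this with the plain (non-enhanced) v2 client - a sketch with hypothetical table, index and key names:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

DynamoDbClient ddb = DynamoDbClient.create();
Map<String, AttributeValue> startKey = null;

do {
    QueryRequest.Builder request = QueryRequest.builder()
            .tableName("customers")                 // hypothetical table name
            .indexName("customer-timestamp-index")  // hypothetical GSI name
            .keyConditionExpression("customerId = :id")
            .expressionAttributeValues(
                    Map.of(":id", AttributeValue.builder().s("123").build()))
            .scanIndexForward(false);
    if (startKey != null) {
        request.exclusiveStartKey(startKey);        // continue from the previous 1 MB page
    }
    QueryResponse response = ddb.query(request.build());
    response.items().forEach(System.out::println);  // process this page
    startKey = response.hasLastEvaluatedKey() ? response.lastEvaluatedKey() : null;
} while (startKey != null);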
If your system cannot afford to do the filtering itself after the query is called, then you need to re-evaluate your Hash Key/Sort Key combinations. DynamoDB is best aligned to an access-pattern schema - that is to say: I have X data and I will need Y data, so I make X a Hash Key and the Y values different Range Keys under that X.
As an example: user data. You might have a Hash Key of "user_id".
Then you have several different patterns of Range Keys:
meta# (with attributes of email, username, hashed and salted password, etc.)
post#1
post#2
post#3
avatar#
So if you make a query on just the Hash Key of the user id, you get all the info. Or, if you have a page with just their posts, you can do a query with the hash key of the user id and a range key condition of begins_with(post#).
This is the most important aspect of a good DynamoDB schema - the ability to do queries on anything you need with just a Hash Key, or a Hash Key and Range Key.
With a well-understood set of access patterns that you will need, and a table that is set up appropriately, you should need no filters or limits, because your combination of Hash/Range key queries will be enough.
(This does sometimes mean duplication of data! You may have the same information in a post# item as in the meta# item - i.e. they both contain the username. This is OK: when you query for a post you need the username, just as when you query for the password/username to see if they match. DynamoDB, as a NoSQL store, handles this very well and very fast - a given Hash/Range key combination is basically its own table in terms of access, making queries against it VERY fast.)
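As a rough sketch of those two access patterns with the AWS SDK v2 enhanced client (userTable is a hypothetical sync DynamoDbTable, its item type and the key values are made up; the point is that only key conditions are needed):

import software.amazon.awssdk.enhanced.dynamodb.Key;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryConditional;
import software.amazon.awssdk.enhanced.dynamodb.model.QueryEnhancedRequest;

// Everything for the user: query on the hash key only.
QueryConditional allForUser = QueryConditional.keyEqualTo(
        Key.builder().partitionValue("42").build());          // "42" = the user_id value

// Just the posts: hash key plus a begins_with condition on the range key.
QueryConditional postsOnly = QueryConditional.sortBeginsWith(
        Key.builder().partitionValue("42").sortValue("post#").build());

userTable.query(QueryEnhancedRequest.builder()
        .queryConditional(postsOnly)
        .build())
        .items()
        .forEach(System.out::println);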
I have over 10 million documents in my Lucene indexes and I need to implement PROPER pagination in my application. Each document is a unique record of a college candidate. Currently I am showing 5 records per page and providing pagination on the front end for the user.
As soon as the search is performed, 5 records are displayed for page number 1. There are buttons that take the user to the first page, next page, previous page and last page.
Now, for example, my search query has 10 million total hits, and when I click on Last Page I am basically going to page number 2,000,000. In the back end I am passing pageNumber*5 as the maxSearch (int) in the Lucene search function. This takes a very long time to fetch the results.
Please refer to the screenshot to see the result on the front end.
And this is what I am doing on the back end:
My hits are never calculated; the process gets stuck at the search. Kindly suggest a solution to implement this correctly.
P.S. I am using Lucene 4.0.0.
Several approaches might help:
Leave all pagination to Lucene
You can avoid manually looping over hits.scoreDocs, as described in the accepted answer to the Lucene 4 Pagination question.
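A minimal sketch of that approach with searchAfter (available in Lucene 4; searcher and query stand for whatever your existing search code already builds, and candidateName is a hypothetical stored field):

import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

int pageSize = 5;
ScoreDoc after = null;  // null = first page; keep the last ScoreDoc of each page between requests

TopDocs page = searcher.searchAfter(after, query, pageSize);
for (ScoreDoc sd : page.scoreDocs) {
    System.out.println(searcher.doc(sd.doc).get("candidateName"));  // render the current page
}
// Cursor for the "next page" request:
if (page.scoreDocs.length > 0) {
    after = page.scoreDocs[page.scoreDocs.length - 1];
}

searchAfter only moves forward page by page; jumping around more freely is what the cursor section below addresses.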
Cursors
If the performance of the Lucene-based pagination approach is not enough, you can try to implement cursors:
Any (found) document has a sort position, for example the tuple (sort field value, docId). The second tuple element disambiguates documents that share the same sort value.
So you can pass the sort position to the next and previous page and use sort filters instead of iteration.
For example:
On the first page we see three documents (sorted by date):
(date: 2017-01-01, docId: 10), (date: 2017-02-02, docId: 3), (date: 2017-02-02, docId: 5).
The second page will start from the first (by sort) document with date > 2017-02-02 OR (date == 2017-02-02 AND docId > 5).
It is also possible to cache these positions for several pages during a search.
Pagination issues on changing index
Pagination usually applies to a particular index version (if the index is updated in the middle of the user's interaction with the results, pagination may provide a bad experience - document positions may vary due to documents being added or removed, or the sort field value of an existing document being modified).
Sometimes we must provide search results "at the moment of search", displaying a "snapshot" of the index, but this is very tricky for big indexes.
Cursors stored on the client side (commonly as an opaque string) can seriously break pagination when the index is updated.
Usually there are only a few queries that produce really huge results, and for these queries page cursors can be cached by the backend, e.g. in a WeakMap keyed by coreCacheKey.
Special last page handling
If, and only if, "Last page" is a frequent operation, you can sort the results in reverse order and obtain the last-page documents as the reverse of the top reversed results.
Keep in mind the same-value issue when implementing a reverse order.
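In Lucene terms the last-page trick looks roughly like this (the date sort field and its type are hypothetical; the true flag in the SortField constructor reverses the order):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

int pageSize = 5;
// Reverse the usual sort and take the *first* page of the reversed results.
Sort reversed = new Sort(new SortField("date", SortField.Type.LONG, true));
TopDocs tail = searcher.search(query, pageSize, reversed);

// The last page in normal order is that reversed page read backwards.
List<ScoreDoc> lastPage = new ArrayList<>(Arrays.asList(tail.scoreDocs));
Collections.reverse(lastPage);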
I am creating a Java program to process a MongoDB collection as a queue. So when I dequeue, I want the document that was inserted first.
To do that, I have a field called created, which represents the timestamp of the document creation, and my initial idea was to use the aggregation $min to find the document with the smallest created field.
However, it occurred to me: why not use findOne() without any argument? It will always return the first document in the collection.
So my question is: should I do that? Would it be a good approach to use findOne() and dequeue the first record from the Mongo queue? And what are the drawbacks if I do so?
PS: The Mongo queue program is created to serve the requests of devices on a first-come, first-served basis. Since it takes some time to execute a request and a device can't accept another request while it is processing one, I am using the queue to process requests one by one so that no request is dropped.
Interesting how many people here have commented incorrectly, but you are right in that a raw .findOne() with a blank query, or .findOne({}), will return the first document in the collection - that being "the document with the lowest _id value".
Ideally for a queue processing system, you want to remove the document at the same time as doing this. For this purpose the Java API supports a .findAndRemove() method:
DBCollection data = mongoOperation.getCollection("data");
// Empty query: match anything, return and remove the first document found
DBObject removed = data.findAndRemove(new DBObject());
So that will return the first document in the collection as described and "remove" it from the collection so that no other operations can find it.
You can alternatively call .findAndModify() and set all the options yourself, but if all you are after is the "oldest document first", which is what _id guarantees, then this is all you need.
findOne() returns elements in natural order. This is not necessarily the same as insertion order; it is the order in which documents appear on disk. It may appear that documents are retrieved in insertion order, but with deletes and inserts you will start seeing documents appear out of order.
One of the ways to guarantee that elements always appear in insertion order is to use capped collections. If your application is not impacted by their restrictions, they might be the simplest way to implement a queue.
Capped collections can also be used with a tailable cursor, so that the logic retrieving items from the queue can continue to wait for items if none are available to process.
Update: If you cannot use a capped collection, you would have to sort the result by _id if it is an ObjectId, or keep a timestamp-based field in the collection and order the result by that field.
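For example, a sketch with the legacy Java driver used elsewhere in this thread (mongoOperation and the "data" collection name are taken from the other answer; created is the timestamp field from the question):

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

DBCollection data = mongoOperation.getCollection("data");

// Atomically take the oldest document: empty query, ascending sort, remove = true.
DBObject oldest = data.findAndModify(
        new BasicDBObject(),                 // query: match any document
        null,                                // fields: return the whole document
        new BasicDBObject("created", 1),     // sort: oldest first
        true,                                // remove the matched document
        null,                                // no update (we are removing)
        false,                               // returnNew: irrelevant for remove
        false);                              // upsert: no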
findOne() returns documents using the $natural order of the internal MongoDB B-tree that exists behind the scenes.
The function does not, by default, sort by _id, nor will it pick the lowest _id.
If you find that it regularly returns the lowest _id, that is because of document positioning within the $natural index.
Getting the first document of the collection and the first document of a sorted set are two totally different things.
If you wanted to use findAndModify to grab a document off the pile (though I would personally recommend an optimistic lock instead), then you would need to use:
findAndModify({
    sort: { _id: -1 },
    remove: true
})
The reason why I would not recommend this approach is that if the process crashes or the server goes down in the distributed worker set, then you have lost that data point. Instead you want a temporary (optimistic-type) lock which can be released in the event that the document has not been processed correctly.
I need to implement a web service for a feed of videos and consume it with an Android client.
Currently my implementation is a method getVideos(offset, quantity) over a MySQL table that returns the result of the query SELECT * FROM videos ORDER BY id DESC LIMIT offset,quantity, where the id is an auto-increment value.
But, since it is a very active feed, I've detected the following erroneous case:
The database has the videos 1,2,3...10.
The Android client requests the videos with offset=0, quantity=5, so the items 10,9,8,7,6 are returned. The user starts to play some videos and in the meanwhile 5 new videos are published, so the table now contains the items 1,2,3...15. Then the user continues scrolling and, on reaching the end of the list, the client requests the next bundle: offset=5, quantity=5 - but the same items are returned, appearing as duplicates (or adding nothing) in the ListView.
What is the best approach for this problem?
If you don't want data to repeat, then don't use OFFSET; use a WHERE clause on id instead.
Check the last id you were given and then run a query like:
SELECT * FROM videos WHERE id<:last_id ORDER BY id DESC LIMIT 0,:quantity
Not only does this guarantee that the results will not repeat, it should actually also be faster, since the database won't have to scan past all the offset rows.
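A sketch of that query with plain JDBC (Video and mapRow are placeholders; connection, lastId and quantity come from whatever the getVideos service already has):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

List<Video> nextPage(Connection connection, long lastId, int quantity) throws SQLException {
    String sql = "SELECT * FROM videos WHERE id < ? ORDER BY id DESC LIMIT ?";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
        ps.setLong(1, lastId);    // smallest id the client already received
        ps.setInt(2, quantity);   // page size
        List<Video> page = new ArrayList<>();
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                page.add(mapRow(rs));  // hypothetical row-to-Video mapper
            }
        }
        return page;
    }
}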
UPDATE
How about getting the maximum value of the id column when you make the first query and then adding to the WHERE clause that all the results have to be lower than or equal to that original value? That way you won't ever get duplicates unless you update some position. Better yet, add a modification-time column to your rows and use the time of the first query. That way you won't show edited rows, but at least they won't break the order.
SELECT *
FROM videos
WHERE mtime < :original_query_time
ORDER BY id DESC
LIMIT :offset, :quantity;
I have a piece of code from an old project.
The logic (in a high level) is as follows:
The user sends a series of {id,Xi} where id is the primary key of the object in the database.
The aim is that the database is updated but the series of Xi values remains unique.
I.e. if the user sends {1,X1} and in the database we have {1,X2},{2,X1}, the input should be rejected; otherwise we end up with duplicates, i.e. {1,X1},{2,X1} - X1 appears twice in different rows.
In lower level the user sends a series of custom objects that encapsulate this information.
Currently the implementation for this uses "brute force", i.e. nested for-loops over the input and the JDBC ResultSet to ensure uniqueness.
I do not like this approach, and moreover the actual implementation has subtle bugs, but that is another story.
I am searching for a better approach, both in terms of coding and performance.
What I was thinking is the following:
Create a Set from the user's input list. If the Set has a different size than the list, then the user's input has duplicates. Stop there.
Load data from jdbc.
Create a HashMap<Long,String> from the user's input. The key is the primary key.
Loop over the result set. If the HashMap does not contain a key equal to the ResultSet row's id, add that row to the HashMap.
In the end, get the HashMap's values as a List. If it contains duplicates, reject the input.
This is the algorithm I came up with.
Is there a better approach than this? (I assume that I am not erroneous on the algorithm it self)
Purely from a performance point of view, why not let the database figure out that there are duplicates (like {1,X1},{2,X1})? Put a unique constraint in place on the table, and then, when the update statement fails by throwing an exception, catch it and deal with it however you want under these input conditions. You may also want to run this as a single transaction in case you need to roll back any partial updates. Of course, this assumes that you don't have any other business rules driving the updates that you haven't mentioned here.
With your algorithm, you are spending too much time iterating over HashMaps and Lists to remove duplicates, IMHO.
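A sketch of that approach with plain JDBC (tbl, value and id are placeholder names, the table is assumed to carry a UNIQUE(value) constraint, and Item stands in for the user's custom input object):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.List;

void updateAll(Connection connection, List<Item> userInput) throws SQLException {
    connection.setAutoCommit(false);
    try (PreparedStatement ps = connection.prepareStatement(
            "UPDATE tbl SET value = ? WHERE id = ?")) {
        for (Item item : userInput) {
            ps.setString(1, item.getValue());
            ps.setLong(2, item.getId());
            ps.executeUpdate();            // fails here if UNIQUE(value) would be violated
        }
        connection.commit();               // all updates succeeded
    } catch (SQLIntegrityConstraintViolationException e) {
        connection.rollback();             // undo any partial updates
        // report the duplicate back to the caller / user
    }
}

Whether the driver surfaces this as SQLIntegrityConstraintViolationException or as a plain SQLException with a vendor error code depends on the JDBC driver, so check what yours actually throws.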
Since you can't change the database, as stated in the comments, I would probably extend your Set idea. Create a HashMap<Long, String> and put all of the items from the database in it, then also create a HashSet<String> with all of the values from your database in it.
Then, as you go through the user input, check the key against the HashMap and see if the values are the same. If they are, great - you don't have to do anything, because that exact input is already in your database.
If they aren't the same, then check the value against the HashSet to see if it already exists. If it does, then you have a duplicate.
This should perform much better than a loop.
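A minimal sketch of that check (Row, Item, dbRows and userInput are placeholders for however the database rows and the user's input are represented):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

Map<Long, String> dbById = new HashMap<>();
Set<String> dbValues = new HashSet<>();
for (Row row : dbRows) {                     // dbRows: everything loaded from the database
    dbById.put(row.getId(), row.getValue());
    dbValues.add(row.getValue());
}

for (Item item : userInput) {
    if (item.getValue().equals(dbById.get(item.getId()))) {
        continue;                            // exact {id, value} pair already stored - nothing to do
    }
    if (dbValues.contains(item.getValue())) {
        throw new IllegalArgumentException("Duplicate value: " + item.getValue());
    }
}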
Edit:
For multiple updates, perform all of the updates on the HashMap created from your database, then once again check whether the size of the Map's value set differs from the size of its key set.
There might be a better way to do this, but this is the best I've got.
I'd opt for a database-side solution. Assuming a table with the columns id and value, you should make a list of all the "values" and use the following SQL:
select count(*) from tbl where value in (:values);
binding the :values parameter to the list of values in whatever way is appropriate for your environment. (This is trivial when using Spring JDBC and a database that supports the IN operator, less so for lesser setups; as a last resort you can generate the SQL dynamically.) You will get a result set with one row and one column of a numeric type. If it's 0, you can then insert the new data; if it's 1, report a constraint violation. (If it's anything else, you have a whole new problem.)
If you need to check for every item in the user input, change the query to:
select value from tbl where value in (:values)
store the result in a set (called e.g. duplicates), and then loop over the user input items and check whether the value of the current item is in duplicates.
This should perform better than snarfing the entire dataset into memory.
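For instance, a sketch of that second variant with Spring's NamedParameterJdbcTemplate (jdbcTemplate's DataSource, Item and userInput are placeholders; tbl and value are the column names assumed above):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

NamedParameterJdbcTemplate jdbcTemplate = new NamedParameterJdbcTemplate(dataSource);

// All values the user sent, bound to the IN clause.
List<String> values = userInput.stream().map(Item::getValue).collect(Collectors.toList());
MapSqlParameterSource params = new MapSqlParameterSource("values", values);

List<String> found = jdbcTemplate.queryForList(
        "select value from tbl where value in (:values)", params, String.class);

Set<String> duplicates = new HashSet<>(found);
for (Item item : userInput) {
    if (duplicates.contains(item.getValue())) {
        // reject this item (or the whole request) - the value already exists in tbl
    }
}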