What is the best way for pagination on mongodb using java

I am trying to create simple pagination in MongoDB with the code below.
collection.find().skip(n).limit(n);
But doesn't it look like there will be a performance issue? In Java terms, it seems that find() would first return all the records (say I have 2 million), then pass them to skip(), and then to limit(). Does that mean every query fetches all the records from the database, or do the MongoDB drivers work differently? What am I missing?

When talking about pagination in MongoDB, it is easy to write this code:
collection.find().skip(pageSize*(pageNum-1)).limit(pageSize);
The above is the native solution supported by MongoDB, but it is not efficient if there are a huge number of documents in the collection. Suppose you have 100M documents and you want to get the data from a middle offset (the 50-millionth). MongoDB has to build up the full result set and walk from the beginning to the specified offset, which gives poor performance. As the offset increases, performance keeps degrading.
The root cause is the skip() command, which is not efficient and cannot benefit much from an index.
Below is another solution to improve performance on large data pagination:
The typical usage scenario of pagination is a table or list showing the data of a specified page, along with 'Previous Page' and 'Next Page' buttons to load the data of the previous or next page.
If you have the '_id' of the last document on the current page, you can use find() instead of skip(). Use _id > currentPage_LastDocument._id as one of the criteria to find the next page's data. Here is pseudocode:
//Page 1
collection.find().limit(pageSize);
//Get the _id of the last document in this page
last_id = ...
//Page 2
users = collection.find({'_id': {$gt: last_id}}).limit(pageSize);
//Update the last id with the _id of the last document in this page
last_id = ...
This avoids making MongoDB walk through a large amount of data the way skip() does. A Java sketch of this approach follows.
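For the Java driver, a minimal sketch of this idea might look like the following. It assumes the newer MongoCollection<Document> API and an ObjectId _id; names such as nextPage, lastId and pageSize are placeholders, not part of the original answer.
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.types.ObjectId;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;

public class IdBasedPagination {

    // Loads one page; pass null as lastId for the first page.
    static List<Document> nextPage(MongoCollection<Document> collection,
                                   ObjectId lastId, int pageSize) {
        Bson filter = (lastId == null)
                ? new Document()                 // first page: no range filter
                : Filters.gt("_id", lastId);     // later pages: _id > last seen _id
        return collection.find(filter)
                .sort(Sorts.ascending("_id"))    // stable order so the range filter stays valid
                .limit(pageSize)
                .into(new ArrayList<>());
    }
}
The caller keeps the _id of the last document returned and passes it back for the next page, exactly as in the pseudocode above; the default index on _id makes the range query cheap.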

Related

How to implement Proper Pagination in millions of records in Lucene

I have over 10 million Documents in my Lucene Indexes and I need to implement PROPER pagination in my application. Each document is a unique record of a College Candidate. Currently I am showing 5 records per page and providing pagination on the front end for the User.
As soon as the search is performed, 5 records are displayed for page number 1. There are also buttons that take the user to the first page, next page, previous page and last page.
Now, for example, my search query has 10 million total hits, and when I click on Last Page I am basically going to page number 2,000,000 (2 million). In the back end I am passing pageNumber*5 as the maxSearch (int) in the Lucene search function. This takes a very long time to fetch the results.
Please refer to the screenshot to see the result on the front end.
And this is what I am doing on the back end,
My hits are never calculated. The process gets stuck at the search. Kindly suggest a solution to implement this correctly.
P.S. I am using Lucene 4.0.0.
Several approaches might help:
Leave all pagination to Lucene
You can avoid manually iterating over hits.scoreDocs, as described in the accepted answer to the Lucene 4 Pagination question.
Cursors
If performance of Lucene-based pagination approach is not enough, you can try to implement cursors:
Any (found) document has a sort position, for example a tuple (sort field value, docId). The second tuple element breaks ties between documents with the same sort value.
So you can pass the sort position to the next and previous page requests and use sort filters instead of iterating over all preceding hits.
For example:
In first page we see three documents (sorted by date):
(date: 2017-01-01, docId: 10), (date: 2017-02-02, docId: 3), (date: 2017-02-02, docId: 5).
The second page will start from the first document (by sort order) matching date > 2017-02-02 OR (date == 2017-02-02 AND docId > 5).
It is also possible to cache these positions for several pages during a search. A sketch of this cursor approach follows.
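A minimal sketch of this cursor idea using Lucene's sort-aware searchAfter (available since Lucene 4.0, so it should match your version); the searcher, query, the "date" field and PAGE_SIZE are assumptions, not taken from your code:
// Sort by date, tie-broken by internal docId, matching the (sort value, docId) tuple above
Sort sort = new Sort(new SortField("date", SortField.Type.LONG), SortField.FIELD_DOC);

// Page 1
TopDocs page1 = searcher.search(query, PAGE_SIZE, sort);

// The last hit of the current page is the "cursor" handed to the next request
// (assumes the page is non-empty)
ScoreDoc cursor = page1.scoreDocs[page1.scoreDocs.length - 1];

// Page 2: Lucene resumes collecting right after the cursor instead of
// collecting and discarding the first pageNumber * PAGE_SIZE hits
TopDocs page2 = searcher.searchAfter(cursor, query, PAGE_SIZE, sort);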
Pagination issues on changing index
Pagination usually applies to a particular index version (if the index is updated in the middle of the user's interaction with the results, pagination may provide a bad experience: document positions may shift because rows were added or removed, or because the sort field value of an existing document was modified).
Sometimes we must provide search results "as of the moment of the search", displaying a "snapshot" of the index, but this is very tricky for big indexes.
Cursors stored on the client side (commonly as an opaque string) can seriously break pagination when the index is updated.
Usually only a few queries produce really huge result sets, and for these queries the page cursors can be cached by the backend, e.g. in a WeakMap keyed by coreCacheKey.
Special last page handling
If (and only if) "Last page" is a frequent operation, you can sort the results in reverse order and obtain the last page's documents by reversing the top results of the reversed sort.
Keep in mind the same-sort-value issue when implementing the reverse order.

Google App-Engine Datastore is extremely slow

I need help in understanding why the below code is taking 3 to 4 seconds.
UPDATE: Use case for my application is to get the activity feed of a person since last login. This feed could contain updates from friends or some new items outside of his network that he may find interesting. The Activity table stores all such activities and when a user logs in, I run a query on the GAE-DataStore to return above activities. My application supports infinite scrolling too, hence I need the cursor feature of GAE. At a given time, I get around 32 items but the activities table could have millions of rows (as it contains data from all the users).
Currently the Activity table is small, containing only 25 records, and the Java code below reads only 3 records from it.
Each record in the Activity table has 4 UUID fields.
I cannot imagine how the query would behave if the table contained millions of rows and result contained 100s of rows.
Is there something wrong with the code I have below?
(I am using Objectify and app-engine cursors)
Filter filter = new FilterPredicate("creatorID", FilterOperator.EQUAL, userId);
Query<Activity> query = ofy().load().type(Activity.class).filter(filter);
query = query.startAt(Cursor.fromWebSafeString(previousCursorString));
QueryResultIterator<Activity> itr = query.iterator();
while (itr.hasNext())
{
Activity a = itr.next();
System.out.println (a);
}
I have gone through Google App Engine Application Extremely slow and verified that response time improves if I keep on refreshing my page (which calls the above code). However, the improvement is only ~30%
Compare this with any other database and the response time for such tiny data is in milliseconds, not even 100s of milliseconds.
Am I wrong in expecting a regular database kind of performance from the GAE DataStore?
I do not want to turn on memcache just yet as I want to improve this layer without caching first.
Not exactly sure what your query is supposed to do, but it doesn't look like it requires a cursor query. In my humble opinion the only valid use case for cursor queries is a paginated query for data with a limited count of result rows. Since your query does not have a limit, I don't see why you would want to use a cursor at all.
When you need millions of results you're probably doing ad-hoc analysis of the data (no human could ever interpret millions of raw data rows), and you might be better off using BigQuery instead of the App Engine Datastore. I'm just guessing here, but for normal front-end apps you rarely need millions of rows in a result, only a few (maybe hundreds at times) which you filter from the total available rows.
Another thing:
Are you sure that it is the query itself that takes long? It might just as well be the wrapper around the query. Since you are using cursors, you have to re-issue the query until there are no more results, and handling this could be costly.
Lastly:
Are you testing on App Engine itself or on the local development server? The devserver obviously cannot simulate a cloud and thus could be slower (or faster) than the real thing at times. The devserver does not know about instance warm-up times either, when your query spawns new instances.
Speaking of cloud: the thing about cloud databases is not that they have the best performance for very little data, but that they scale and perform consistently whether you have a couple of hundred or a couple of billion rows.
Edit:
After performing a retrieval operation, the application can obtain a cursor, which is an opaque base64-encoded string marking the index position of the last result retrieved.
[...]
The cursor's position is defined as the location in the result list after the last result returned. A cursor is not a relative position in the list (it's not an offset); it's a marker to which the Datastore can jump when starting an index scan for results. If the results for a query change between uses of a cursor, the query notices only changes that occur in results after the cursor. If a new result appears before the cursor's position for the query, it will not be returned when the results after the cursor are fetched.
(Datastore Queries)
These two statements make me believe that query performance should be consistent with or without cursors.
Here are some more things you might want to check:
How do you register your entity classes with objectify?
What does your actual test code look like? I'd like to see how and where you measure.
Can you share a comparison between cursor query and query without cursors?
The improvement you see with repeated requests could be the result of Objectify's integrated caching. You might want to disable caching for Datastore performance tests; a hedged sketch of such a test follows.
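For that last point, here is a hedged sketch of a timing run with Objectify's global cache bypassed; it assumes the Objectify 4-style ofy().cache(false) call, reuses the filter from the question, and takes the 32-item limit from the page size mentioned above:
Filter filter = new FilterPredicate("creatorID", FilterOperator.EQUAL, userId);

long start = System.currentTimeMillis();
List<Activity> page = ofy().cache(false)        // bypass the memcache-backed global cache
        .load().type(Activity.class)
        .filter(filter)
        .limit(32)                              // page size from the question
        .list();
long elapsed = System.currentTimeMillis() - start;
System.out.println(page.size() + " activities loaded in " + elapsed + " ms");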

Database entry for every webpage view (analytics)

I am playing around (as a learning experience) with writing an analytics system using the Play! Framework (2, Java).
I want to write efficient code, and because of this I am struggling to decide on the following:
For every view a page gets, a record is added specifying the website (example.org), the page (/index.html) and the date it was viewed. As you can guess, the number of rows is going to be huge.
To use the data, I then select all rows where the website is "example.org", loop through the results to build a hash map of each date and how many views there were on that date, and then use this to build a graph.
There must be a better way of doing this.
For example, rather than having a row per view, would it be better to update an existing record, adding one view to its count?
Any assistance would be appreciated.
Thank you
You can just add some more conditions (like the date) to your WHERE clause and then perform a COUNT over the result. This way you get the result directly from your database.
The query would look like:
SELECT COUNT(*)
FROM YOUR_TABLE
WHERE SITE = 'thesite'
AND DATE = '<date>'
GROUP BY SITE, DATE
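If you want the whole date-to-views map for the graph in a single round trip, you can group by date instead of filtering on one date. Below is a plain JDBC sketch; the PAGE_VIEW table and its SITE/DATE columns are assumptions chosen to match the query above, and in a Play 2 app you would more likely go through Ebean or JPA:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

public class ViewsPerDate {

    // Returns date -> number of views for one site, with the counting done by the database.
    static Map<LocalDate, Long> load(String jdbcUrl, String site) throws Exception {
        String sql = "SELECT DATE, COUNT(*) FROM PAGE_VIEW WHERE SITE = ? "
                   + "GROUP BY DATE ORDER BY DATE";
        Map<LocalDate, Long> viewsPerDate = new LinkedHashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, site);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    viewsPerDate.put(rs.getDate(1).toLocalDate(), rs.getLong(2));
                }
            }
        }
        return viewsPerDate;
    }
}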
There must be a better way of doing this,
The web server logs HTTP requests. Most analytics systems use the web server logs.
Since you mentioned that you're doing this to learn, you're gathering statistics in the most flexible way possible.
My only suggestion would be to remove all indexes from the statistics table that your web applications are writing to. Make a copy of the statistics table for generating the statistics. The copy would have all of the necessary indexes.
This way, you get the fastest writes, because there are no indexes to update.
If necessary, you can have a primary index or clustering index on the write table.

JPA paging with numbers and next, previous

I apologize if this question has already been asked, and if it is perhaps a dumb question, but I am new to this and would love to learn from your expert advice and experience.
I want to add paging to an application, displaying 200 records per page. The bottom of the page should have page numbers. If there are 1050 records, the first page will display 200 and the bottom of the page will show the numbers 1, 2, 3, 4, 5 & 6.
What is the general logic to accomplish this?
I know that the database must select 200 every time and I must keep track of the first record.
Is there a way to know how many records will be returned in total, so that I can know how many numbers to display at the bottom of the page? Does it require a select count() statement or something else?
For the 1050 records, the numbers 1, 2, 3, 4, 5 & 6 will be displayed, and clicking each one requires a call to the server. Is there a way to know how many records will be returned by the next call to the server? Does it require a select count() statement or something else?
You can use the JPA Criteria API to accomplish this. Assuming a TypedQuery, you would use setFirstResult and setMaxResults to limit the records returned from the database. Of course the values for these methods depend on which page was requested and how many records are displayed per page.
firstResult = (page - 1) * pageSize;
maxResults = pageSize;
Note: This assumes the first page is 1 (as opposed to zero).
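As an illustration, a minimal sketch of that calculation with a TypedQuery; the MyEntity entity and the JPQL string are placeholders:
// page is 1-based, pageSize is the number of records per page (e.g. 200)
TypedQuery<MyEntity> query = entityManager.createQuery(
        "SELECT e FROM MyEntity e ORDER BY e.id", MyEntity.class);
query.setFirstResult((page - 1) * pageSize);  // offset of the first record of this page
query.setMaxResults(pageSize);                // return at most one page of records
List<MyEntity> pageOfRecords = query.getResultList();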
To get a record count you execute a standard Criteria query. As an example:
final CriteriaBuilder builder = entityManager.getCriteriaBuilder();
final CriteriaQuery<Long> countQuery = builder.createQuery(Long.class);
countQuery.select(builder.count(countQuery.from(MyEntity.class)));
final Long count = entityManager.createQuery(countQuery)
.getSingleResult();
You can customize the above code to perform the count relative to another query as well. Finally, you need some way of communicating the total count back to the client. One way of doing this is to wrap the result set of your query in another object that carries an attribute for the count, and return that object from your DAO call. Alternatively, you could store the count attribute in an object in your request scope.
public class ResultListWrapper<T> {
    private Long count;
    private Collection<T> results;
    // constructor, getters, setters
}
You will usually perform the same query, except using count in the select list instead of the columns, prior to running the actual query, to find out how many pages there are. Getting a specific page in Hibernate can then be done with something like:
int maxRecords = pageSize;            // page size
int startRow = pageNumber * pageSize; // zero-based page number * page size
Query query = session.createQuery("...");
query.setMaxResults(maxRecords);
query.setFirstResult(startRow);
If performing the extra query is too expensive an operation, then you could consider just providing next/previous links, or alternatively loading extra pages as needed (e.g. as the last of the loaded data comes into view) via Ajax.
For displaying paged results, including page-number links, I would suggest (assuming you are using JSP pages to display this data) using the Display Tag JSP tag library. Display Tag handles paging and sorted display of tabular data, assuming you can get the data to the JSP.
Yes, you have to do a select count before getting the rows themselves.
Have a look at Spring Data JPA. It basically works the way Perception suggested earlier, except that you don't have to write it yourself; the framework does it for you. You get paging and sorting by default (you still need to do the filtering yourself if you don't want to use Querydsl).
They have a one-hour intro video on the main page, but you'll only need to watch about 15 minutes of it to get an idea of what it does. A minimal repository sketch follows.
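A minimal sketch of what this looks like, assuming Spring Data JPA 2.x (older versions use new PageRequest(page, size) instead of PageRequest.of); the MyEntity entity, repository and service are placeholders:
import java.util.List;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.jpa.repository.JpaRepository;

// Spring Data derives both the paging query and the count query for you.
interface MyEntityRepository extends JpaRepository<MyEntity, Long> {
}

class MyEntityService {
    private final MyEntityRepository repository;

    MyEntityService(MyEntityRepository repository) {
        this.repository = repository;
    }

    // Returns one page of 200 records; pageNumber is zero-based.
    List<MyEntity> loadPage(int pageNumber) {
        Page<MyEntity> page = repository.findAll(PageRequest.of(pageNumber, 200));
        long totalRecords = page.getTotalElements();  // tells you how many page numbers to show
        System.out.println(totalRecords + " records, " + page.getTotalPages() + " pages");
        return page.getContent();
    }
}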

MongoDB + Solr performance

I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text searches on some properties of those documents, so I guess Solr is the best way to do this.
What I want to know is how I should configure/execute everything so that it has good performance. Right now, here's what I do (and I know it's not optimal):
1- When inserting an object in MongoDB, I then add it to Solr
SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();
2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then update the Solr index with all the properties of the object (existing and new) and do something like
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();
3- When querying, I first query Solr and then, while going through the returned list of documents (SolrDocumentList), for each document I:
get the id of the document
get the object from MongoDB having the same id to be able to retrieve the properties from there
4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java.
So, does anybody have suggestions on how to do each of the scenarios described here more efficiently? For example, a process that won't take an hour to rebuild the index when there are a lot of documents in Solr and documents are added one at a time? My requirement here is that users may want to add one document at a time, many times, and I'd like them to be able to retrieve it right afterwards.
Your approach is actually good. Some popular frameworks like Compass perform what you describe at a lower level in order to automatically mirror to the index the changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure both Solr and Mongo stay in sync (this is probably not as long an operation as you might think, depending on the number of documents, the number of fields, the number of tokens per field and the performance of the analyzers: I often build indexes of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers; just ensure your RAM buffer is not too small and do not commit/optimize until all documents have been added).
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).
You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). The warm-up time will increase, though, so you should ensure that:
there are typical (especially for sorts, in order to load the field caches) but not too complex queries in the firstSearcher and newSearcher parameters of your solrconfig.xml config file, and
useColdSearcher is set to false in order to have good search performance, or to true if you want changes to the index to be taken into account faster, at the price of slower searches.
Moreover, if it is acceptable for you that the data becomes searchable only some X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of the UpdateHandler. This way Solr will have to commit less often; a sketch follows.
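For illustration, a minimal SolrJ sketch of commitWithin; the 10-second window is an arbitrary example, and depending on your SolrJ version you can also set it on an UpdateRequest instead:
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
// ... other fields ...

// Ask Solr to make the document visible within 10 seconds instead of
// calling commit() explicitly after every add
server.add(document, 10000);   // commitWithin, in milliseconds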
For more information about Solr performance factors, see
http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query:
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
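A short sketch of both delete options with SolrJ; documentId stands for the value of your unique key field:
// Delete by unique key (the field declared as <uniqueKey> in schema.xml) ...
server.deleteById(documentId);

// ... or delete everything matching a query
server.deleteByQuery("id:" + documentId);

server.commit();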
You can also wait for more documents and index them only every X minutes. (Of course this highly depends on your application & requirements.)
If your documents are small and you don't need all of the data (which is stored in MongoDB), you can put only the fields you need in the Solr document by storing them but not indexing them:
<field name="nameofyourfield" type="stringOrAnyTypeYouUse" indexed="false" stored="true"/>
