Large Dataset Searches In Elastic Search Via Java API - java

I am starting to use Elasticsearch for a project and am a bit conflicted about how to go about searching. My impression was that real-time search is very fast, and I'm impressed with the speed of Kibana, but I'm having a horrible time trying to find the best way to query large datasets (upwards of 5 million documents) via Java.
I have read online that the best option is to use a scroll search, but it is also stated that this is not intended for real-time search, which makes sense when I see a query take upwards of 4 minutes to return 5 million documents (something that would be much faster via a SQL database). Can someone clarify whether real-time search in ES is only fast when returning the top results but not when returning large datasets? I also need clarification on whether a scroll search with QUERY_AND_FETCH makes the most sense for large queries; any other tips would be helpful.
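For reference, a minimal sketch of a scroll search via the Java REST high-level client might look like the following. The "tweets" index name, match-all query, and page size are assumptions for illustration, and the client classes and TimeValue package differ between Elasticsearch versions:

    // Minimal sketch, not production code: scroll through a large result set
    // with the REST high-level client. Index name, query, and batch size are
    // placeholders, not taken from the question.
    import org.elasticsearch.action.search.ClearScrollRequest;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchScrollRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class ScrollAll {
        static void scrollAll(RestHighLevelClient client) throws Exception {
            TimeValue keepAlive = TimeValue.timeValueMinutes(1);

            SearchRequest request = new SearchRequest("tweets");   // assumed index name
            request.scroll(keepAlive);
            request.source(new SearchSourceBuilder()
                    .query(QueryBuilders.matchAllQuery())
                    .size(1000));                                   // batch size per scroll round trip

            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            String scrollId = response.getScrollId();

            while (response.getHits().getHits().length > 0) {
                for (SearchHit hit : response.getHits().getHits()) {
                    // process hit.getSourceAsMap() here
                }
                SearchScrollRequest scroll = new SearchScrollRequest(scrollId);
                scroll.scroll(keepAlive);
                response = client.scroll(scroll, RequestOptions.DEFAULT);
                scrollId = response.getScrollId();
            }

            ClearScrollRequest clear = new ClearScrollRequest();    // free the server-side scroll context
            clear.addScrollId(scrollId);
            client.clearScroll(clear, RequestOptions.DEFAULT);
        }
    }

As the question notes, scroll is meant for exporting a whole result set in batches, not for serving a user-facing page of results quickly.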

Related

When to do indexing in lucene

I have a REST service which works with data from a database (MongoDB). I want to add the Apache Lucene library to implement full-text search.
I have never used Lucene before, so I am trying to understand how it works by checking tutorials, but one thing is still unclear to me:
When should I index the DB data? In my DB, some data is added and removed frequently, and some is updated rarely. How should I structure things so that search requests always run against up-to-date data?
Should I update the index on every data update, or will that be done automatically so that indexing once is enough? If reindexing is needed, how often should it be done?
If you want live data to be searchable, then you should add, update and delete data in the Lucene index at the same time as you add, update and delete data in your database (see the sketch below).
That will work perfectly fine for indexing, but do not optimize your index on every operation.
You can optimize your index once a day, or according to your usage. Optimizing the index will give you faster search results.
Refer to this tutorial to get started with a basic Lucene application.
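As a rough illustration of keeping the index in step with DB writes (not taken from the tutorial; the "id" and "body" field names and the commit-per-write policy are assumptions), a plain-Lucene helper like this could be called from the same code paths that write to MongoDB:

    // Sketch of mirroring DB writes into a Lucene index.
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneSync {
        private final IndexWriter writer;

        public LuceneSync(String indexDir) throws Exception {
            writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)),
                    new IndexWriterConfig(new StandardAnalyzer()));
        }

        // Call from the same code path that inserts or updates a DB record.
        public void onInsertOrUpdate(String id, String text) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));   // key field, stored
            doc.add(new TextField("body", text, Field.Store.NO));  // analyzed full-text field
            writer.updateDocument(new Term("id", id), doc);        // add or replace by id
            writer.commit();  // committing on every write is simple but costly; batching is usually better
        }

        // Call from the same code path that deletes a DB record.
        public void onDelete(String id) throws Exception {
            writer.deleteDocuments(new Term("id", id));
            writer.commit();
        }
    }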
You can try MongoDB's own feature for this (see the Mongo docs). It probably does not have the flexibility and is not as powerful as Lucene, but it comes for free.
You really asked the problematic question: "When to do indexing?" The answer depends heavily on your requirements. However, you can look at this post to see how it is technically done: offline, i.e. you will always be more or less behind in indexing.

How to perform searching on large amounts of data efficiently with Elasticsearch APIs?

I want to use Elasticsearch for indexing and searching mechanisms in Java. My question is that I don't know what I should do if there are large amounts of data to index and large search results.
What is the proper search API for big data and real-time user requests in Elasticsearch? Or do you have any ideas about this?
Thanks for help/comments.
At indexing time, you have a bulk API dedicated to performing lots of operations in one single call (a sketch follows at the end of this answer).
At search time, you only retrieve 10 results by default. You can use pagination by setting the from/size parameters, and to browse larger result sets you have a scroll API (documentation here), which is used a bit like a cursor with a DB.
About the real-time nature of your search, be aware that results are not visible immediately; you may have to wait up to 1s (the default refresh_interval value). You can force the refresh operation or lower the refresh_interval value, but this is costly and should be avoided when indexing lots of documents.
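A minimal sketch of the bulk API via the Java REST high-level client (the "docs" index name and the Map-based sources are placeholders, not from the question):

    // Sketch: index a batch of documents in one round trip with the bulk API.
    import java.util.List;
    import java.util.Map;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;

    public class BulkIndexer {
        static void indexBatch(RestHighLevelClient client,
                               List<Map<String, Object>> batch) throws Exception {
            BulkRequest bulk = new BulkRequest();
            for (Map<String, Object> doc : batch) {
                bulk.add(new IndexRequest("docs").source(doc)); // assumed index name
            }
            BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage()); // report any per-item failures
            }
        }
    }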

CouchDB data replication

I have 30 GB of Twitter data stored in CouchDB. I am aiming to process each tweet in Java, but the Java program is not able to hold such a large amount of data at a time. In order to process the entire dataset, I am planning to divide it into smaller sets with the help of the filtered replication supported by CouchDB. But, as I am new to CouchDB, I am facing a lot of problems in doing so. Any better ideas for doing this are welcome. Thanks.
You can always query CouchDB for a dataset that is small enough for your Java program, so there should be no reason to replicate subsets to smaller databases. See this Stack Overflow answer for a way to get paged results from CouchDB (a rough sketch follows below). You might even employ CouchDB itself for the processing with map/reduce, but that depends on your problem.
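For illustration, a rough sketch (not the exact code from the linked answer) of reading CouchDB in pages over plain HTTP; the database name "tweets" and the paging parameters are assumptions:

    // Sketch: fetch one page of documents from CouchDB with Java 11's HttpClient.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CouchDbPage {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // First page: 1000 rows, including the document bodies.
            URI uri = URI.create(
                    "http://localhost:5984/tweets/_all_docs?include_docs=true&limit=1000");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // The body is a JSON object with a "rows" array. Real code would parse it,
            // process each doc, remember the key of the last row, and request the next
            // page with ...&startkey="<lastKey>"&skip=1 instead of loading all 30 GB.
            System.out.println(response.body()
                    .substring(0, Math.min(200, response.body().length())));
        }
    }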
Depending on the complexity of the queries and the changes you make when processing your dataset, you should be fine with one instance.
As the previous poster said, you can use paged results; I tend to do something different:
I have a document for social likes. It always refers to a URL, and I want to try to update it every 2-3 hours.
I have a view that sorts the URL documents by the age of the last update request and the last update.
I query this view so that I exclude the articles that had a request within the last 30 minutes or have been updated less than 2 hours ago.
I use RabbitMQ for enqueuing the jobs, and if they are not picked up within 30 minutes, they expire.

Solr search query time increases as the start keeps on increasing

I currently have over 25 million documents in Solr, and the volume will gradually increase. I need to search for records over Solr indexes of this size. The query response time is low when start is low, e.g. 0. But as start increases, e.g. 100000, searching in Solr also takes more time. How can I make the search faster even with a high start value over large datasets in Solr? The rows parameter remains constant; only start keeps increasing. I don't want the response time to grow as start increases; instead, the result returned for start=100000 should take the same time as for start=0 with, say, rows=1000, as this is a performance issue. Any help would be appreciated.
The problem you are facing is called Deep Paging. There is a good article about it on solr.pl and an incomplete issue on Solr's tracker.
The solution mentioned in the article requires you to sort your results; if that is not feasible for you, the solution will not work. The idea is to sort by a stable attribute (in the article that is price) and then filter with a price range, like fq=price:[9000+TO+10000].
If you combine that fq with a suitable start - like start=100030 - you will get better performance, as Solr will not collect the documents that do not match the fq.
But you will need to make at least one query in advance to fetch suitable metadata, like how many docs were found in total.
With the release of Solr 4.7 a new feature has been introduced: cursors. This was done exactly to address the problem of deep paging. If you still have the problem and you can upgrade to Solr 4.7, this is the best option for you (see the sketch after the references below).
Some references about deep paging with Solr
https://lucene.apache.org/solr/guide/7_7/pagination-of-results.html#performance-problems-with-deep-paging
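A minimal SolrJ sketch of cursor-based deep paging, assuming a hypothetical collection URL and an "id" uniqueKey field (the client class shown comes from newer SolrJ releases; around Solr 4.7 it was HttpSolrServer):

    // Sketch: walk a large result set with cursorMark instead of a growing start value.
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class DeepPaging {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build();  // assumed URL

            SolrQuery query = new SolrQuery("*:*");
            query.setRows(1000);
            query.setSort("id", SolrQuery.ORDER.asc);   // sort must include the uniqueKey field

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse response = solr.query(query);
                for (SolrDocument doc : response.getResults()) {
                    // process each document here
                }
                String nextCursorMark = response.getNextCursorMark();
                done = cursorMark.equals(nextCursorMark);  // no progress means the end was reached
                cursorMark = nextCursorMark;
            }
            solr.close();
        }
    }

Unlike start/rows paging, the cost of each request stays roughly constant no matter how deep into the result set you are.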

Retrieving data faster from SQL database with hibernate

My application contains a lot of data in the database.
Every day we process around 60K records.
My problem is that, since the data is growing every day, is there a way to make the user-generated searches from my application faster, as it takes quite a bit of time to load the records onto the UI? I am using Java with Spring and Hibernate.
I am trying to improve the user experience as we are getting lots of complaints from the users about the searches being slow.
Appreciate any help.
There is no simple answer to this. It boils down to looking at your application, its schemas and the queries that are generated, and figuring out where the bottlenecks are. Depending on that, the solution might be:
to add indexes to certain tables,
to redesign parts of the data model or the queries,
to reduce the size of the resultsets you are reading (e.g. to use paging, as sketched below),
to make user queries simpler, or
to do something else.
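For the paging point above, a minimal sketch with JPA/Hibernate; the Item entity and the query are made up for illustration:

    // Sketch: cap each user search to one page of results with setFirstResult/setMaxResults.
    import java.util.List;
    import javax.persistence.Entity;
    import javax.persistence.EntityManager;
    import javax.persistence.Id;
    import javax.persistence.TypedQuery;

    @Entity
    class Item {
        @Id Long id;
        String name;
    }

    public class PagedSearch {
        public List<Item> search(EntityManager em, String term, int page, int pageSize) {
            TypedQuery<Item> query = em.createQuery(
                    "select i from Item i where i.name like :term order by i.id", Item.class);
            query.setParameter("term", "%" + term + "%");
            query.setFirstResult(page * pageSize); // offset into the result set
            query.setMaxResults(pageSize);         // number of rows for this page
            return query.getResultList();
        }
    }

Whether this helps depends on the bottleneck: if the database itself scans too many rows, indexes or query redesign matter more than paging.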
