I have to implement an autocomplete with over 500,000 names which may later increase to over 4 million names.
The backend is a Java REST web service using Spring. Should I use MongoDB, Redis, or Elasticsearch for storing and querying/searching the names?
This is a search-critical use case. MongoDB and Redis are perfect for key-based lookups but are not built for search, while Elasticsearch is a distributed search engine built specifically for this kind of use case.
Before choosing a system, you should understand how your feature works internally. Below are the considerations for selecting one.
Non-functional requirements for your feature
What is the expected total number of search queries per second (QPS)?
How frequently will you be updating the documents (i.e., the names in your example)?
What is the SLA for an updated name to start appearing in search results?
What is the latency SLA for the search results themselves?
Some functional requirements.
How should autocomplete behave: prefix or infix matching on names?
What is the minimum number of characters a user should type before autocomplete results are shown?
How frequently might the above requirements change?
Elasticsearch indexes documents in an inverted index and works on
token matching (which can easily be customized to suit business
requirements), hence it is super fast at searching. Redis and MongoDB do
not have this structure internally and shouldn't be used for this
use case. You shouldn't have any doubt about choosing Elasticsearch over
these to implement autocomplete.
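To make the point concrete, here is a toy sketch (not how Elasticsearch implements it) of why a sorted, search-oriented structure beats a plain key-value lookup for autocomplete: all names sharing a prefix are contiguous in sorted order, so a prefix query becomes a cheap range scan. Names are normalized to lowercase on insert, so suggestions come back lowercase.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Toy prefix autocomplete: a sorted set makes "all names starting with
// a prefix" a contiguous range, so suggesting is a range scan rather
// than a full scan or a key-by-key lookup.
public class PrefixAutocomplete {
    private final TreeSet<String> names = new TreeSet<>();

    public void add(String name) {
        names.add(name.toLowerCase());
    }

    // Returns up to 'limit' names starting with the given prefix.
    public List<String> suggest(String prefix, int limit) {
        String lo = prefix.toLowerCase();
        // '\uffff' sorts after any realistic character, closing the range.
        String hi = lo + '\uffff';
        return names.subSet(lo, true, hi, true).stream()
                    .limit(limit)
                    .collect(Collectors.toList());
    }
}
```

A production system at 500k-4M names would instead use Elasticsearch's analyzers (e.g. edge n-grams) or its completion suggester, which add scoring, fuzziness, and distribution on top of the same basic idea.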
You can use RediSearch (https://oss.redislabs.com/redisearch/). It's a full-text search engine built on top of Redis as a Redis module. It also has an autocomplete feature.
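For reference, RediSearch's suggestion commands look roughly like this (illustrative; check the RediSearch docs for the exact syntax in your version):

```
FT.SUGADD names "John Smith" 1.0
FT.SUGADD names "John Doe" 1.0
FT.SUGGET names "jo" FUZZY MAX 5
```

`FT.SUGADD` builds a suggestion dictionary with per-entry scores, and `FT.SUGGET` returns completions for a typed prefix.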
Related
My current search engine involves two desktop applications based on Lucene (java). One is dedicated to indexing internal documents, the other one to searching.
Now I have been asked to offer the search engine as a web page. So my first thought was to use Solr, and I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether a PDF originates from a scanned document, or we limit the number of pages that will be OCRed in a scanned PDF, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of if statements!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I go about it?
Thank you very much in advance for your advice.
While this is rather opinion-based, I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, external to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much of the indexing code as possible lives outside of Solr, since Solr is then only concerned with the actual text and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.
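The decoupling described above can be sketched like this (names are invented for illustration): the PDF pipeline talks to a small interface, so swapping direct Lucene writes for SolrJ, or later Elasticsearch, only touches one implementation class. In a real setup the sink would wrap SolrJ's `SolrClient`; here it just records documents so the sketch is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// The indexing pipeline depends only on this interface, not on
// Lucene, SolrJ, or Elasticsearch directly.
interface IndexSink {
    void submit(Map<String, Object> document);
}

// Stand-in implementation; a real one would call SolrClient.add(...).
class InMemorySink implements IndexSink {
    final List<Map<String, Object>> submitted = new ArrayList<>();
    public void submit(Map<String, Object> document) {
        submitted.add(document);
    }
}

class PdfIndexer {
    private final IndexSink sink;
    PdfIndexer(IndexSink sink) { this.sink = sink; }

    // Custom PDF handling (scan detection, OCR page limits) stays here,
    // outside of Solr; only the extracted text is submitted.
    void index(String id, String extractedText) {
        sink.submit(Map.of("id", id, "text", extractedText));
    }
}
```

With this shape, dropping Solr for another HTTP-fronted engine later means replacing one `IndexSink` implementation, leaving the PDF-specific code untouched.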
I want to implement search functionality in the web application I am building using Java technology. I would have to search through the database, depending on the user query, and display the results. How can I go about doing this (please note that I am using Java)? Thanks.
You can use a product like Lucene (http://lucene.apache.org/core/) or Solr (http://lucene.apache.org/solr/) for this instead of writing it on your own.
Lucene is a high-performance search engine for documents.
Solr is built on top of Lucene and provides additional features (like hit highlighting, faceted search, database integration, and rich-document (Word, PDF, ...) search).
Lucene will analyze your text data and build up an index. When performing a search, you run a Lucene query against this index.
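To illustrate what "analyze and build an index" means conceptually, here is a minimal sketch (not the Lucene API): text is split into tokens, an inverted index maps each token to the documents containing it, and a query intersects those postings. Real Lucene adds scoring, stemming, phrase queries, and on-disk segment files on top of this idea.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal inverted index: token -> set of document ids.
public class TinyInvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        // "Analysis": lowercase and split on non-word characters.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns ids of documents containing every query token (AND semantics).
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            Set<Integer> hits = index.getOrDefault(token, Collections.emptySet());
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

Because lookups go token-first rather than document-first, query time depends on how many documents match a term, not on the total corpus size scanned linearly.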
Assuming you mean free text searching of the data in the database...
For free-text searching, Lucene and/or Solr are very good solutions. They work by creating a separate index of the data in your database. It is up to you either to pull the data from the database and index it using Lucene/Solr, or to arrange for the code that writes to the database to also update the Lucene/Solr index. Given what you have said, it sounds like this is being retrofitted to an existing database, so pulling the data and indexing it may be the best solution. In this case Solr is probably a better fit, as it is a packaged solution.
Another option would be Hibernate Search. Again, this is a solution best suited to a project that is starting out; it would be more difficult to add after the fact.
Also bear in mind that some databases support free-text searching in addition to normal relational queries, and these could be worth a look. SQL Server certainly has text search capabilities, and I would imagine other databases have some sort of support. I am not too sure how you access these, but I would expect to be able to do it using SQL via JDBC; it is likely to be database-specific though.
If you just mean normal SQL searching, then there is a whole range of Java EE technologies: plain JDBC, Spring templates, ORM technologies (JPA, JDO, Hibernate, etc.). The list goes on, and it would be difficult to suggest any particular approach without a lot more info.
I am making an application where I am going to need full-text search, so I found Compass, but that project is no longer maintained and has been replaced by elasticsearch. However, I don't understand it. Is it its own server that I need to make requests (GET, PUT, etc.) against and then parse the JSON response? Are there no annotations like in Compass? I don't understand how this is a replacement, or how I use it with Java EE.
Or are there other better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass, you'll love it. Have a look at this answer that the author gave here, in which he explains why he went ahead and created elasticsearch after Compass. In fact, elasticsearch and Solr both make the use of Lucene pretty easy, while also adding some features to it. You basically get a whole search engine server which is able to index your data, which you can then query in order to retrieve the data that you indexed.
Elasticsearch exposes RESTful APIs and it's JSON based, but if you are looking for annotations in the Compass style you can have a look at the Object Search Engine Mapper for ElasticSearch.
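To give a feel for the REST/JSON interface, an interaction looks roughly like this (the index and field names here are invented for illustration): first index a document, then query it.

```
PUT /articles/_doc/1
{ "title": "Getting started with search" }

GET /articles/_search
{ "query": { "match": { "title": "search" } } }
```

The search response comes back as JSON containing the matching documents and their relevance scores, which is what your Java EE code would parse (or have a client library parse for you).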
I would say give Lucene or Solr a try. Lucene builds an inverted index of your documents on the file system, which makes searching fast.
I would recommend either
Elasticsearch, if the system is big and needs clustering.
Lucene or Solr, if you want to code at a low level.
Hibernate Search, if you are using Hibernate as your ORM.
We are using AppEngine and the datastore for our application, where we have a moderately large table containing a list of entries.
I would like to summarize the list of entries in a report specifying how many times each one appears. E.g., in SQL I would normally just use SELECT DISTINCT on a column, then loop over every entry and use SELECT COUNT(x) WHERE value = valueOfEntry.
While the count portion is easily done, the distinct part is "a problem". The only solution I could find remotely close to this is MapReduce, and most of the samples are based on Python. There is this blog entry, which is very helpful but somewhat outdated since it predates the reduce portion. Then there is the video here, and a few more resources I was able to find.
However, it's really hard for me to understand how to build the summary table if I can't write to a separate entity and I don't have a reduce stage.
This seems like something trivial and simple to accomplish, but it requires jumping through so many hoops. Is there no sample or existing reporting engine I can just plug into AppEngine without all the friction?
I saw BigQuery, but it seems like a huge hassle to move the data out of AppEngine and into that store. I tried downloading the data as CSV but ran into many issues with that as well. It doesn't seem like a practical solution in the long run either.
There is a document explaining some of the concepts of MapReduce for Java. Although it is incomplete, it shares most of its architecture with the Python version. In that document there is also a pointer to a complete Java sample MapReduce app that reads from the datastore.
For writing the results, you specify an Output class. To write the results to a new datastore entity you would need to create your own Output class, but you could also use the blobstore (see BlobFileOutput.java).
Another alternative is that whenever you write one of your entities, you also write/update another entry in an EntityDistinct data model.
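That write-time aggregation idea can be sketched as follows. The map below stands in for a datastore kind like "EntityDistinct" (a hypothetical name); on AppEngine each increment would be a transactional get/put on a counter entity keyed by the distinct value, rather than an in-memory map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Instead of computing distinct values + counts with a MapReduce over
// the whole table, keep one counter per distinct value and bump it on
// every write. The report then just reads the counters.
public class ValueCounter {
    private final Map<String, Long> countsByValue = new ConcurrentHashMap<>();

    // Call this alongside every entity write.
    public void recordWrite(String value) {
        countsByValue.merge(value, 1L, Long::sum);
    }

    // The "summary table": value -> number of occurrences.
    public Map<String, Long> report() {
        return Map.copyOf(countsByValue);
    }
}
```

The trade-off is classic NoSQL denormalization: you pay a little on every write (and must handle deletes/updates symmetrically) to make the report a cheap read instead of a full-table scan.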
If you plan on performing complex reports and you can anticipate all your needs now, I would suggest you look again at BigQuery. BigQuery is really powerful and works perfectly on very massive datasets. You can inspect http://code.google.com/p/log2bq/, which is a Python project that loads logs into BigQuery using MapReduce. Or you could have a cron job that every once in a while fetches all new entities and moves them into BigQuery.
Regarding the friction, remember that this is a NoSQL database, and as such it has some advantages, but some things are inherently different from SQL. Remember you can always use Google Cloud SQL, given that your dataset is of limited size, but you would lose the replication and fault-tolerance capabilities.
I think this could help you: http://jjmpsj.blogspot.ro/2008/05/appengine-output-tricks-reporting.html?m=1
We have an in-house webapp running for internal use over the intranet for our firm.
Recently we decided to implement an efficient search facility, and we would like input from experts here about which APIs are available and which would be most useful for the following use cases:
The objects are divided into business groups (BGs) in our firm, i.e., an object can have various attributes, and the attributes are not common between any two objects from different BGs.
Users might want to search for a specific attribute of an object.
Users belong to a business group, hence they have an idea of the kind of attributes related to their group.
The API should be generic enough to support full-text/partial-text search when passed a list of objects, the name of an attribute, and the search text. More importantly, it should be able to index the result.
As this is an internal app, there are no restrictions on space as such, but we need a fast and generic API.
I am sure Java already has something which suits our needs.
More info on the technology stack:
Language:Java
Server: Apache Tomcat
Stack : Spring, iBatis, Struts
Cache in place : ECache
Other API : Shindig API
Thanks
Neeraj
You can use Solr (built on Apache Lucene) if text-based search is a priority. It might be more than what you need, but have a look.
http://lucene.apache.org/solr/
http://lucene.apache.org/
Solr is a great tool for search. The downside is that it may require some work to get it working the way you want.
With it, you can set different fields for a document and give them custom priority in each query.
You can easily create facets from those fields, as on Amazon. Sorting is easy and quick, and it has a spellchecker and suggestions engine built in.
Documents are matched using the dismax query mode, which you can customize.
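For illustration, a dismax request might look like this (the field names and boosts are invented; split across lines for readability):

```
/select?defType=dismax
       &q=ipod+mini
       &qf=title^3.0+description^1.0
       &mm=2
```

Here `qf` lists the fields to search with their per-field boosts (a title hit counts three times as much as a description hit), and `mm` sets the minimum number of query terms that must match.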