We have an in-house webapp running for internal use over the intranet for our firm.
Recently we decided to implement an efficient search facility, and we would like input from the experts here on which APIs are available and which one would be most useful for the following use cases:
The objects are divided into business groups in our firm, i.e. an object can have various attributes, and those attributes are not shared between objects from different Business Groups (BG).
Users might want to search for a specific attribute of an object.
Users belong to a business group, so they know which attributes are relevant to their group.
The API should be generic enough to run a full-text or partial-text search when it is given a list of objects, the name of the attribute, and the search text. More importantly, it should be able to index the result.
As this is an internal app, there are no restrictions on the space as such, but we need a fast and generic API.
I am sure Java already has something which suits our needs.
More info on the technology stack:
Language: Java
Server: Apache Tomcat
Stack: Spring, iBatis, Struts
Cache in place: Ehcache
Other API : Shindig API
Thanks
Neeraj
You can use Solr, which is built on Apache Lucene, if text-based search is the priority. It might be more than you need, but have a look.
http://lucene.apache.org/solr/
http://lucene.apache.org/
Solr is a great tool for search. The downside is that it may require some work to get it the way you want it.
With it, you can set different fields for a document and give them custom priority in each query.
You can easily create facets from those fields, as Amazon does. Sorting is easy and quick. It also has a spellchecker and a suggestions engine built in.
Documents are matched using the dismax query mode, which you can customize.
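To make the field weighting and faceting a bit more concrete, here is a minimal sketch that only builds the query string for a Solr /select request. The parameter names (defType, q, qf, facet, facet.field) are standard Solr request parameters; the host, the core name "objects", and the field names name, description and businessGroup are assumptions made up for this illustration, so substitute whatever your schema defines.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class SolrDismaxQuerySketch {

    // Builds the query string for a Solr /select request that uses the dismax parser.
    // The field names passed in are hypothetical; use the ones from your own schema.
    static String buildQuery(String userText, String weightedFields, String facetField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "dismax");        // select the dismax query parser
        params.put("q", userText);              // raw text typed by the user
        params.put("qf", weightedFields);       // fields to search, with optional boosts
        params.put("facet", "true");            // turn faceting on
        params.put("facet.field", facetField);  // field to compute facet counts for
        return params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Hypothetical host and core name.
        String base = "http://localhost:8983/solr/objects/select";
        System.out.println(base + "?" + buildQuery("pump valve", "name^3 description", "businessGroup"));
    }
}
```

Here "name^3 description" asks dismax to search both fields and to weight matches in name three times as heavily, and the facet on businessGroup would give per-group counts alongside the results. In a real application you would more likely send the same parameters through SolrJ than build the URL by hand.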
I have installed StormCrawler including the Elasticsearch integration. I also completed the information videos found on Youtube from the creator of StormCrawler. This was a good introduction. I am also familiar with Apache Storm.
However, I find that there's a lack of how-to information and videos about how to go from there.
Now, this raises the question of how to customize StormCrawler. Between which bolts should additional functionality be implemented? Also, how can I find out which fields are passed between these bolts, so that I know what information can be extracted? In addition, when saving documents to Elasticsearch, should I update the schema for Elasticsearch, or can additional fields simply be sent to the Elasticsearch bolt?
Now, this raises the question how to customize StormCrawler. Between
which bolts should additional functionality be implemented?
Well, this depends on what you want to achieve. Can you give us an example?
Also, how
can I find out which fields are passed between these bolts, so that I
find what information can be extracted?
You can look at the declareOutputFields methods of the bolts you are using, for instance this one for the parser bolt. All bolts will have the URL and metadata object as input, some will have the binary content or text, depending on where they are in the chain.
In addition, when saving
documents to Elasticsearch, should I update the scheme for
Elasticsearch, or can additional fields simply be send to the
Elasticsearch bolt?
I think this is mentioned in one of the videos. ES does a pretty good job of guessing the type of a field from its content, but you might want to declare the fields explicitly to have full control over how they are indexed in ES.
Now for a practical answer based on the comment below. The good news is that everything you need should already be available out of the box, so there is no need to implement a custom bolt. What you need is the Tika module, which will extract the text and metadata from the PDFs. The difference from the README instructions is that you don't need to connect the output of the redirection bolt to the indexing bolt, as you are not interested in indexing the non-PDF documents. The last thing is to change parser.mimetype.whitelist so that only PDF docs are parsed with Tika.
Don't forget to connect the Tika bolt to the statusupdaterbolt if you are using one.
I have to implement an autocomplete with over 500,000 names which may later increase to over 4 million names.
The backend is a Java REST web service built with Spring. Should I use MongoDB, Redis or Elasticsearch for storing and querying/searching the names?
This is a critical search use case. MongoDB and Redis are well suited to key-based lookups, not to search, while Elasticsearch is a distributed search engine built specifically for this kind of use case.
Before choosing a system, you should know how your feature works internally. Below are the considerations for selecting one.
Non-functional requirements for your feature:
What is the total number of search queries per second (QPS)?
How frequently will you update the documents (i.e. the names in your example)?
What is the SLA for an updated name to appear in the search results?
What is the SLA for your search results?
Some functional requirements:
What should autocomplete look like: prefix or infix search on names?
What is the minimum number of characters a user should type before autocomplete results are shown?
How frequently can the above requirements change?
Elasticsearch indexes documents in an inverted index and works on token matching (which can easily be customized to suit business requirements), hence it is very fast at searching. Redis and MongoDB do not have this structure internally and shouldn't be used for this use case. You shouldn't have any doubt about choosing Elasticsearch over them to implement autocomplete.
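To illustrate why token matching makes prefix autocomplete cheap, here is a small in-memory sketch in plain Java. It is not how Elasticsearch is implemented and it is not a replacement for it; it only mimics the idea of storing each name under the prefixes of its words at index time, so that a lookup becomes a single key match. The class name, the method names and the minChars setting are all made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class PrefixAutocompleteSketch {
    // prefix token -> names containing a word that starts with that prefix
    private final Map<String, TreeSet<String>> index = new HashMap<>();
    private final int minChars;

    PrefixAutocompleteSketch(int minChars) {
        this.minChars = minChars; // minimum typed length before suggesting anything
    }

    // Index time: store the name under every prefix (of at least minChars) of each word.
    void add(String name) {
        for (String word : name.toLowerCase(Locale.ROOT).split("\\s+")) {
            for (int end = minChars; end <= word.length(); end++) {
                index.computeIfAbsent(word.substring(0, end), k -> new TreeSet<>()).add(name);
            }
        }
    }

    // Query time: one map lookup, then take the first 'limit' names in sorted order.
    List<String> suggest(String typed, int limit) {
        String key = typed.trim().toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        if (key.length() < minChars) {
            return out; // too short, show nothing yet
        }
        for (String name : index.getOrDefault(key, new TreeSet<>())) {
            if (out.size() == limit) {
                break;
            }
            out.add(name);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixAutocompleteSketch ac = new PrefixAutocompleteSketch(2);
        ac.add("Anna Smith");
        ac.add("Annabel Lee");
        ac.add("John Smithson");
        System.out.println(ac.suggest("smi", 10));
    }
}
```

The trade-off is the usual one for this technique: the index is larger than the raw list of names, in exchange for lookups that do not scan anything. This sketch only handles a single typed word and prefix matching; infix search or multi-word input would need a different tokenization.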
You can use RediSearch (https://oss.redislabs.com/redisearch/). It's a free-text search engine built on top of Redis as a Redis module. It also has an autocomplete feature.
My current search engine involves two desktop applications based on Lucene (java). One is dedicated to indexing internal documents, the other one to searching.
Now I have been asked to offer the search engine as a web page. My first thought was to use Solr, so I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether a PDF originates from a scanned document, and we limit the number of pages that will be OCRed in scanned PDFs, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I go about it?
Thank you very much in advance for your advice.
While this is rather opinion based - I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, externally to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code is outside of Solr, since Solr will only be concerned with the actual text - and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.
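As a rough sketch of the separation I mean, assuming nothing about your actual classes: keep the PDF-specific logic in your own code and hide the search server behind one small interface. Everything below (IndexSink, prepare, the field names, the page-limit rule) is hypothetical and only stands in for your existing processing; the real implementation of the sink would wrap SolrJ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExternalIndexerSketch {

    // The only part that knows about the search server. A SolrJ-backed
    // implementation would live behind this (hypothetical) interface, and an
    // Elasticsearch-backed one could replace it later.
    interface IndexSink {
        void submit(Map<String, String> document);
    }

    // Stand-in for the existing custom PDF handling. Hypothetical rule:
    // for scanned PDFs keep only the first maxPages pages of text.
    static Map<String, String> prepare(String id, List<String> pageTexts, boolean scanned, int maxPages) {
        List<String> kept = scanned && pageTexts.size() > maxPages
                ? pageTexts.subList(0, maxPages)
                : pageTexts;
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("scanned", String.valueOf(scanned));
        doc.put("text", String.join("\n", kept));
        return doc;
    }

    public static void main(String[] args) {
        // In-memory sink for the demo; it just collects what would be sent.
        List<Map<String, String>> received = new ArrayList<>();
        IndexSink sink = received::add;
        sink.submit(prepare("report-1", List.of("page one", "page two", "page three"), true, 2));
        System.out.println(received);
    }
}
```

With this shape, swapping Solr for another HTTP-based engine means writing one new IndexSink, and running several indexers in parallel means starting several processes that share nothing but the sink's target.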
I want to implement search functionality in a web application that I am building with Java. I need to search the database based on the user's query and display the results. How can I go about doing this (please note that I am using Java)? Thanks.
You can use a product like http://lucene.apache.org/core/ or http://lucene.apache.org/solr/ for this instead of writing this on your own.
Lucene is a high-performance search engine for documents.
Solr is built on top of Lucene and provides additional features (such as hit highlighting, faceted search, database integration and rich document (Word, PDF, ...) search).
Lucene will analyze your text data and build up an index. When performing a search, you run a Lucene query against this index.
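If "analyze and build up an index" sounds abstract, the toy class below shows the general idea of an inverted index in plain Java. It is only an illustration of the concept, not Lucene's API or its actual data structures, and all names in it are invented for the example: text is split into lowercase tokens, each token maps to the ids of the documents containing it, and a query is answered by intersecting those id sets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndexSketch {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Analysis": lowercase the text and split it on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Indexing: record the document id under each of its tokens.
    void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Searching: a document matches if it contains every token of the query (AND semantics).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(token, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Quarterly sales report");
        index.add(2, "Sales forecast for Q3");
        System.out.println(index.search("sales report"));
    }
}
```

A real engine adds a great deal on top of this (relevance scoring, stemming, phrase queries, on-disk storage), which is the reason to use Lucene or Solr rather than writing it yourself.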
Assuming you mean free text searching of the data in the database...
For free-text searching, Lucene and/or Solr are very good solutions. They work by creating a separate index of the data in your database. It is up to you either to pull the data from the database and index it using Lucene/Solr, or to arrange for the code that writes to the database to also update the Lucene/Solr index. Given what you have said, it sounds like this is being retrofitted to an existing database, so pulling the data and indexing it may be the best solution. In this case Solr is probably a better fit, as it is a packaged solution.
Another option would be Hibernate Search. Again this would be a solution to use if you are starting out. It would be more difficult to add after the fact.
Also bear in mind some databases support free text searching in addition to normal relational queries and could be worth a look. SQL Server certainly has text search capabilities and I would imagine other databases have some sort of support. I am not too sure how you access these but I would expect to be able to do it using SQL via JDBC. It is likely to be database specific though.
If you just mean normal SQL searching then there are a whole load of Java EE technologies, plain JDBC, Spring templates, ORM technologies (JPA, JDO, Hibernate etc). The list goes on and it would be difficult to suggest any particular approach without a lot more info.
I am building an application that needs full-text search, so I found Compass, but that project is no longer maintained and has been replaced by Elasticsearch. However, I don't understand it. Is it a separate server that I need to send requests (GET, PUT, etc.) to and then parse the JSON response? Are there no annotations like in Compass? I don't understand how this is a replacement or how to use it with Java EE.
Or are there other better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass you'll love it. Have a look at the answer the author gave, in which he explains why he went ahead and created Elasticsearch after Compass. In fact, Elasticsearch and Solr both make using Lucene pretty easy, and also add some features to it. You basically get a whole search engine server that indexes your data, which you can then query to retrieve what you indexed.
Elasticsearch exposes RESTful APIs and it's JSON based, but if you are looking for annotations in the Compass style you can have a look at the Object Search Engine Mapper for ElasticSearch.
I would say give Lucene or Solr a try. It creates a DocumentFileSystem for faster indexing.
I would recommend one of:
Elasticsearch, if the system is big and needs clustering.
Lucene or Solr, if you want to code at a low level.
Hibernate Search, if you are using Hibernate as your ORM.