I am building an application that needs full-text search. I found Compass, but that project is no longer maintained and has been replaced by Elasticsearch. However, I don't understand it. Is it its own server that I need to send requests (GET, PUT, etc.) to and then parse the JSON response? Are there no annotations like in Compass? I don't understand how this is a replacement or how I would use it with Java EE.
Or are there other, better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass you'll love it. Have a look at this answer, in which the author explains why he went on to create Elasticsearch after Compass. Both Elasticsearch and Solr make Lucene much easier to use and add some features on top of it. You basically get a whole search engine server that indexes your data, which you can then query to retrieve what you indexed.
Elasticsearch exposes RESTful APIs and is JSON-based, but if you are looking for Compass-style annotations you can have a look at the Object Search Engine Mapper for ElasticSearch.
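To make the "RESTful and JSON-based" part concrete, here is a minimal sketch of what talking to the server from plain Java could look like, using only the JDK's java.net.http classes (Java 11+). The base URL, the index name and the document fields are assumptions for illustration, and the /_doc and /_search paths are those of recent Elasticsearch versions; older releases used different paths. The code only builds the requests, it does not send them.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch only: builds the HTTP requests without sending them.
class EsRequests {
    // Assumed address of a local Elasticsearch node.
    static final String BASE = "http://localhost:9200";

    // PUT /{index}/_doc/{id} stores (or replaces) one JSON document.
    static HttpRequest indexRequest(String index, String id, String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // GET /{index}/_search?q=... runs a simple query-string search.
    static HttpRequest searchRequest(String index, String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/" + index + "/_search?q=" + q))
                .GET()
                .build();
    }
}
```

Sending a request would then be something like HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), after which you parse the JSON body with whatever JSON library you already use.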
I would say give Lucene or Solr a try. It creates a DocumentFileSystem for faster indexing.
I would recommend one of the following:
Elasticsearch, if the system is big and needs clustering.
Lucene or Solr, if you want to code at a low level.
Hibernate Search, if you are using Hibernate as your ORM.
My current search engine involves two desktop applications based on Lucene (java). One is dedicated to indexing internal documents, the other one to searching.
Now I have been asked to offer the search engine as a web page. My first thought was to use Solr, so I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether a PDF originates from a scanned document, and we limit the number of pages that are OCRed in a scanned PDF, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I go about it?
Thank you very much in advance for your advice.
While this is rather opinion based - I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, externally to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code is outside of Solr, since Solr will only be concerned with the actual text and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.
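As a rough illustration of that separation, here is a minimal sketch in plain Java. All names in it (ExternalIndexer, DocumentSink, ParsedPdf, maxOcrPages) are hypothetical and not part of Solr or Lucene; ParsedPdf stands in for whatever your existing PDF code produces. The custom rule (only the first pages of a scanned PDF are kept) lives in your own code, and the search backend is hidden behind one small interface.

```java
import java.util.List;

// Hypothetical names throughout: none of these types come from Solr or Lucene.
class ExternalIndexer {
    // The only piece that knows about the search backend.
    interface DocumentSink {
        void submit(String id, String text);
    }

    // Stand-in for the output of the existing PDF processing.
    static final class ParsedPdf {
        final String id;
        final boolean scanned;
        final List<String> pages;

        ParsedPdf(String id, boolean scanned, List<String> pages) {
            this.id = id;
            this.scanned = scanned;
            this.pages = pages;
        }
    }

    private final DocumentSink sink;
    private final int maxOcrPages;

    ExternalIndexer(DocumentSink sink, int maxOcrPages) {
        this.sink = sink;
        this.maxOcrPages = maxOcrPages;
    }

    // Scanned PDFs contribute only their first maxOcrPages pages;
    // all other PDFs are indexed in full.
    void index(ParsedPdf pdf) {
        int limit = pdf.scanned
                ? Math.min(maxOcrPages, pdf.pages.size())
                : pdf.pages.size();
        String text = String.join("\n", pdf.pages.subList(0, limit));
        sink.submit(pdf.id, text);
    }
}
```

A SolrJ-backed DocumentSink would presumably build a SolrInputDocument and pass it to a SolrClient; switching to Elasticsearch later would mean writing another DocumentSink and leaving index() untouched.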
I want to implement search functionality in a web application that I am building with Java. I would have to search through the database based on the user's query and display the results. How can I go about doing this (please note that I am using Java)? Thanks.
You can use a product like http://lucene.apache.org/core/ or http://lucene.apache.org/solr/ for this instead of writing this on your own.
Lucene is a high-performance search engine for documents.
Solr is built on top of Lucene and provides additional features (such as hit highlighting, faceted search, database integration, and rich document (Word, PDF, ...) search).
Lucene will analyze your text data and build up an index. When performing a search, you run a Lucene query against this index.
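To give a feel for what "analyze and build up an index" means, here is a toy inverted index in plain Java. This is not the Lucene API and leaves out everything that makes Lucene fast and useful (scoring, stemming, on-disk storage); the class and method names are made up for the example. It only shows the idea: text is split into terms, each term points to the documents containing it, and a query is answered from that map instead of scanning the documents.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only; not how Lucene is implemented or used.
class TinyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Analysis" here is just lowercasing and splitting on non-letters/digits.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: returns the documents that contain every query term.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```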
Assuming you mean free text searching of the data in the database...
For free text searching, Lucene and/or Solr are very good solutions. These work by creating a separate index of the data in your database. It is up to you to either pull the data from the database and index it using Lucene/Solr, or arrange for the code that writes to the database to also update the Lucene/Solr index. Given what you have said, it sounds like this is being retrofitted to an existing database, so pulling the data and indexing it may be the best solution. In this case Solr is probably a better fit, as it is a packaged solution.
Another option would be Hibernate Search. Again this would be a solution to use if you are starting out. It would be more difficult to add after the fact.
Also bear in mind some databases support free text searching in addition to normal relational queries and could be worth a look. SQL Server certainly has text search capabilities and I would imagine other databases have some sort of support. I am not too sure how you access these but I would expect to be able to do it using SQL via JDBC. It is likely to be database specific though.
If you just mean normal SQL searching then there are a whole load of Java EE technologies, plain JDBC, Spring templates, ORM technologies (JPA, JDO, Hibernate etc). The list goes on and it would be difficult to suggest any particular approach without a lot more info.
I intend to make a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies.
I followed this tutorial http://wiki.apache.org/nutch/NutchTutorial and have successfully installed Apache and Solr on my Ubuntu system. I was also able to inject the seed URLs into the webdb and perform the crawl.
Using the Solr interface at http://localhost:8983/solr/admin, I can also query the crawled results, but the output I receive is raw rather than a readable results page.
Am I missing something here? The earlier apache-nutch-0.7 had a WAR that generated clear HTML output. How do I achieve this? Or if anyone could point me to a recent tutorial or guidebook, it would be highly appreciated.
A couple of things:
If you are just starting, do not use Solr 3.6, go straight to latest 4.1+. A bunch of things have changed and a lot of new features are added.
You seem to be saying that you will expose Solr + UI directly to the general web. That's a really bad idea, as Solr is completely unsecured and allows web-based delete queries. You really want a business layer in the middle.
With Solr 4.1, there is a pretty Admin UI and, also, there is a /browse page that shows how to use Velocity to do the pages backed by Solr. Or have a look at something like Project Blacklight for an example of how to get UI over Solr.
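On the business layer point, here is a minimal sketch of the idea in plain Java: your web tier accepts the user's request, copies only an allow-list of parameters, clamps them, and builds the query string it will send to Solr's select handler itself. The class name, the limits and the set of allowed parameters are assumptions for illustration; a real layer would also handle authentication, rate limiting and so on.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Sketch of a gateway that never forwards raw client parameters to Solr.
class SearchGateway {
    static final int MAX_ROWS = 50;       // assumed upper bound per page
    static final int DEFAULT_ROWS = 10;

    // Builds the query string for a read-only select request.
    // Only q, start and rows are honoured; everything else is dropped.
    static String toSolrQuery(Map<String, String> incoming) {
        String q = incoming.getOrDefault("q", "").trim();
        if (q.isEmpty()) {
            throw new IllegalArgumentException("q is required");
        }
        int rows = Math.max(1, Math.min(parseOr(incoming.get("rows"), DEFAULT_ROWS), MAX_ROWS));
        int start = Math.max(0, parseOr(incoming.get("start"), 0));
        return "q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&start=" + start
                + "&rows=" + rows;
    }

    // Returns fallback when the value is missing or not a number.
    private static int parseOr(String value, int fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

The application then appends this string to its own, server-side Solr URL, so clients never choose the handler and never reach the update endpoint.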
I found below link
http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-nutch-and-solr/
which answered my query.
I agree; after reading the content at the above link, I felt very angry at myself.
The Solr package provides all the objects required to query Solr.
In fact, the essential jars are just solr-solrj-3.4.0.jar, commons-httpclient-3.1.jar and slf4j-api-1.6.4.jar.
Anyone can build a java search engine using these objects to query the database and have a fancy UI.
Thanks again.
I am using Spring Data (Mongo) for my web application (close to a social networking website). Now, I wish to provide search capabilities over the content written within the application (such as posts, tags, friends, etc.).
I believe Lucene/Solr is one of the better libraries to go for such cases, but am not sure how to use (integrate?) it with Spring Data (or maybe there is some inherent support within Spring for it).
Would appreciate help (documentation, links, blog posts, etc.) on this!
Though the post has been around for a while, you may have a look at this one https://github.com/SpringSource/spring-data-solr/
The Spring Data for Solr project provides a natural Spring Data like API for querying data from Solr. Read the examples for a quick overview.
I found a good read here - http://adeithzya.wordpress.com/2011/08/25/using-apache-solr-with-spring-framework - that hits the nail on its head!
Integrating them is relatively easy, the difficult part is maintaining data consistency between them. For example, how would you answer these questions:
How and when do you intend to perform CRUD with Mongo and Solr? Do you write to Mongo first (with or without waiting for a confirmation?) and then to Solr?
If you're using async writes with Mongo, what happens when you send the data to Solr and then get an exception from Mongo (the data exists in Solr, but not in Mongo)?
What happens if you get an error while trying to write to Solr (the data exists in Mongo, but not in Solr)?
What if you delete something from Mongo, and right after that someone performs a search where Solr returns that very document because Solr still has it indexed?
The point is there'll be an inconsistency window where mongo and solr are not in sync, and you probably want to handle at least some of the issues.
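One possible way to handle part of this, sketched in plain Java: write to the database of record first, only then to the index, and queue documents whose index write failed so they can be retried later. Store, DualWriter and the in-memory queue are hypothetical names for the example; in a real system the queue would need to be durable, and deletes need the same treatment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: primary store first, index second, failed index writes are retried.
class DualWriter {
    // Hypothetical abstraction over "Mongo" and "Solr".
    interface Store {
        void save(String id, String body);
    }

    private final Store primary;
    private final Store index;
    private final Deque<String[]> pending = new ArrayDeque<>();

    DualWriter(Store primary, Store index) {
        this.primary = primary;
        this.index = index;
    }

    // If the primary write throws, nothing reaches the index.
    // If the index write throws, the document is remembered for a retry.
    void save(String id, String body) {
        primary.save(id, body);
        try {
            index.save(id, body);
        } catch (RuntimeException e) {
            pending.add(new String[] {id, body});
        }
    }

    // Retries each queued document once; returns how many are still pending.
    int retryPending() {
        int attempts = pending.size();
        for (int i = 0; i < attempts; i++) {
            String[] doc = pending.poll();
            try {
                index.save(doc[0], doc[1]);
            } catch (RuntimeException e) {
                pending.add(doc);
            }
        }
        return pending.size();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

This does not close the inconsistency window, it only makes sure the index eventually catches up; searches in between can still miss a fresh document.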
We have an in-house webapp running for internal use over the intranet for our firm.
Recently we decided to implement an efficient search facility, and we would like input from the experts here about which APIs are available and which one would be most useful for the following use cases:
The objects are divided into business groups in our firm, i.e. an object can have various attributes, and the attributes are not common between any two objects from different BGs (Business Groups).
Users might want to search for a specific attribute amongst an object
Users are from a business group, hence they have an idea about the kind of attributes related to their group
The API should be generic enough to do a full-text or partial-text search when a list of objects is passed to it, together with the name of the attribute and the search text. More importantly, it should be able to index this result.
As this is an internal app, there are no restrictions on the space as such, but we need a fast and generic API.
I am sure Java already has something which suits our needs.
More info on the technology stack:
Language:Java
Server: Apache Tomcat
Stack : Spring, iBatis, Struts
Cache in place : ECache
Other API : Shindig API
Thanks
Neeraj
You can use Solr, which is built on Apache Lucene, if text-based search has priority. It might be more than what you want, but have a look.
http://lucene.apache.org/solr/
http://lucene.apache.org/
Solr is a great tool for search. The downside is that it may require some work to get it the way you want it.
With it, you can set different fields for a document and give them custom priority in each query.
You can easily create facets from those fields, like on Amazon. Sorting is easy and quick, and it has a spellchecker and suggestions engine built in.
The documents are matched using the dismax query mode, which you can customize.
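As a small illustration of per-field priority and facets with dismax, here is a sketch in plain Java that builds the request parameters for a Solr select query. defType, q, qf, facet and facet.field are Solr request parameters; the class name and the field names used in the example (name, description, businessGroup) are assumptions. It only builds the query string, it does not call Solr.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: builds dismax request parameters with field boosts and facets.
class DismaxQuery {
    static String build(String userText, Map<String, Double> fieldBoosts, List<String> facetFields) {
        // qf lists the fields to search, each as field^boost.
        StringJoiner qf = new StringJoiner(" ");
        for (Map.Entry<String, Double> e : fieldBoosts.entrySet()) {
            qf.add(e.getKey() + "^" + e.getValue());
        }
        StringBuilder sb = new StringBuilder("defType=dismax");
        sb.append("&q=").append(enc(userText));
        sb.append("&qf=").append(enc(qf.toString()));
        if (!facetFields.isEmpty()) {
            sb.append("&facet=true");
            for (String field : facetFields) {
                sb.append("&facet.field=").append(enc(field));
            }
        }
        return sb.toString();
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

For the business-group case, passing a higher boost for name than for description and faceting on a businessGroup field would rank name matches first and return hit counts per group alongside the results.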