How to modify search result page given by Solr? - java

I intend to build a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies.
I followed this tutorial http://wiki.apache.org/nutch/NutchTutorial and have successfully installed Nutch and Solr on my Ubuntu system. I was also successful in injecting the seed URL into the webdb and performing the crawl.
Using the Solr interface at http://localhost:8983/solr/admin, I can also query the crawled results, but all I get back is the raw Solr response rather than a readable results page.
Am I missing something here? The earlier apache-nutch-0.7 shipped a WAR which generated clear HTML output. How do I achieve this? If anyone could point me to an up-to-date tutorial or guidebook, it would be highly appreciated.

A couple of things:
If you are just starting, do not use Solr 3.6; go straight to the latest 4.1+. A bunch of things have changed and a lot of new features have been added.
You seem to be saying that you will expose Solr + UI directly to the general web - that's a really bad idea, as Solr is completely unsecured and allows web-based delete queries. You really want a business layer in the middle.
With Solr 4.1 there is a pretty Admin UI, and there is also a /browse page that shows how to use Velocity to build result pages backed by Solr. Or have a look at something like Project Blacklight for an example of how to put a UI over Solr.

I found the link below,
http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-nutch-and-solr/
which answered my query.
I agree - after reading the content available at the above link, I felt rather annoyed with myself.
The Solr package provides all the required objects to query Solr.
In fact, the essential jars are just solr-solrj-3.4.0.jar, commons-httpclient-3.1.jar and slf4j-api-1.6.4.jar.
Anyone can build a Java search engine using these objects to query the index and put a fancy UI on top.
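To make that concrete, here is a minimal SolrJ query sketch; the Solr URL and the field names ("title", "url") are assumptions based on a default Nutch/Solr setup, not something taken from my installation.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimpleSolrSearch {
    public static void main(String[] args) throws Exception {
        // Point SolrJ at the running Solr instance (URL is an assumption).
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Build a query against the crawled index and ask for the first 10 hits.
        SolrQuery query = new SolrQuery("nutch");
        query.setRows(10);

        // Execute the query and render each hit however your UI needs.
        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title") + " - " + doc.getFieldValue("url"));
        }
    }
}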
Thanks again.

Related

Should I use SolrJ to convert a Lucene project into a browser-based search engine?

My current search engine involves two desktop applications based on Lucene (java). One is dedicated to indexing internal documents, the other one to searching.
Now I have been asked to offer the search engine as a web page. My first thought was to use Solr, so I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether the PDF originates from a scanned document, or we limit the number of pages that will be OCRed in scanned PDFs, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(j) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I proceed?
Thank you very much in advance for your advice.
While this is rather opinion based - I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, externally to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
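A rough sketch of that hand-off, assuming SolrJ 7.x and a core named "documents"; the core name, the field names, and the helper standing in for your existing PDF pipeline are all hypothetical:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ExternalIndexer {
    public static void main(String[] args) throws Exception {
        // The client talks to Solr over HTTP; the core URL is an assumption.
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();

        // Keep all the custom PDF handling (scan detection, OCR page limits, ...)
        // in your own code, exactly as it works today.
        String text = extractTextWithExistingPipeline("/path/to/file.pdf");

        // Then push the result to Solr instead of writing to a Lucene index directly.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "file.pdf");
        doc.addField("content", text);
        solr.add(doc);
        solr.commit();
        solr.close();
    }

    // Hypothetical stand-in for the existing Lucene-era PDF processing code.
    private static String extractTextWithExistingPipeline(String path) {
        return "text extracted by the current OCR/PDF logic";
    }
}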
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much of the indexing code as possible is outside of Solr, since Solr will only be concerned with the actual text and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.

Implementing full text search in Java EE

I am making an application where I am going to need full text search, so I found Compass, but that project is no longer maintained and has been replaced by elasticsearch. However, I don't understand it. Is it its own server that I need to make requests (GET, PUT, etc.) against and then parse the JSON response? Are there no annotations like in Compass? I don't understand how this is a replacement and how I am supposed to use it with Java EE.
Or are there other better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass you'll love it. Have a look at this answer that the author gave here, in which he explains why he went ahead and created elasticsearch after Compass. In fact, elasticsearch and Solr both make the use of Lucene pretty easy, while adding some features to it. You basically get a whole search engine server which is able to index your data, which you can then query in order to retrieve the data that you indexed.
Elasticsearch exposes RESTful APIs and it's JSON based, but if you are looking for annotations in the Compass style you can have a look at the Object Search Engine Mapper for ElasticSearch.
I would say give Lucene or Solr a try. They build a document index on disk for fast searching.
I would recommend either:
Elasticsearch, if the system is big and needs clustering.
Lucene or Solr, if you want to code at a lower level (see the sketch below).
Hibernate Search, if you are using Hibernate as your ORM.
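For the plain Lucene option, a minimal index-and-search sketch looks roughly like this (Lucene 8.x API assumed; the field name and analyzer choice are just illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FullTextDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document with a full-text "body" field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Compass was replaced by elasticsearch", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index for "elasticsearch" and print the stored field of each hit.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(new QueryParser("body", analyzer).parse("elasticsearch"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}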

Using Lucene/Solr with Spring Data

I am using Spring Data (Mongo) for my web application (close to a social networking website). Now, I wish to provide search capabilities over the content written within the application (such as posts, tags, friends, etc.).
I believe Lucene/Solr is one of the better options for such cases, but I am not sure how to use (integrate?) it with Spring Data (or maybe there is some inherent support within Spring for it).
Would appreciate help (documentation, links, blog posts, etc.) on this!
Though the post has been around for a while, you may want to have a look at this project: https://github.com/SpringSource/spring-data-solr/
The Spring Data for Solr project provides a natural, Spring Data-style API for querying data from Solr. Read the examples for a quick overview.
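A minimal sketch of what that API looks like, assuming Spring Data Solr 2.x (the core name, field names, and derived-query method are illustrative, not from the project docs):

import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.solr.core.mapping.Indexed;
import org.springframework.data.solr.core.mapping.SolrDocument;
import org.springframework.data.solr.repository.SolrCrudRepository;

// A document mapped to a Solr core, Spring Data style.
@SolrDocument(solrCoreName = "posts")
class Post {
    @Id
    @Indexed(name = "id")
    private String id;

    @Indexed(name = "content")
    private String content;
    // getters and setters omitted for brevity
}

// Spring Data derives the Solr query from the method name.
interface PostRepository extends SolrCrudRepository<Post, String> {
    List<Post> findByContent(String term);
}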
I found a good read here - http://adeithzya.wordpress.com/2011/08/25/using-apache-solr-with-spring-framework - that hits the nail on the head!
Integrating them is relatively easy; the difficult part is maintaining data consistency between them. For example, how would you answer these questions:
How and when do you intend to perform CRUD with Mongo and Solr? Do you write to Mongo first (with or without waiting for confirmation?) and then to Solr?
If you're using async writes with Mongo, what happens when you send the data to Solr and then get an exception from Mongo (the data exists in Solr but not in Mongo)?
What happens if you get an error while trying to write to Solr (the data exists in Mongo but not in Solr)?
What happens if you delete something from Mongo, and right after that someone performs a search where Solr returns that very deleted document, because Solr still has it indexed?
The point is that there will be an inconsistency window where Mongo and Solr are not in sync, and you probably want to handle at least some of these issues.
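One way to narrow that window is to write to the system of record first and compensate if the index write fails. A hypothetical sketch (the PostStore interface stands in for whatever Spring Data Mongo repository you use, and all names are made up):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PostService {

    // Placeholder for a Spring Data Mongo repository or similar store.
    public interface PostStore {
        void save(String id, String content);
        void delete(String id);
    }

    private final PostStore mongo;
    private final SolrClient solr;

    public PostService(PostStore mongo, SolrClient solr) {
        this.mongo = mongo;
        this.solr = solr;
    }

    public void createPost(String id, String content) throws Exception {
        mongo.save(id, content);                 // 1. write to the system of record first
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("content", content);
            solr.add(doc);                       // 2. then index in Solr
            solr.commit();
        } catch (Exception e) {
            mongo.delete(id);                    // 3. compensate (or queue a retry) so the stores converge
            throw e;
        }
    }
}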

What is the best strategy for a log analysis application?

I will develop a web application to view and analyze log files, both from remote machines and locally, and I am planning to use Java. At first glance it seems the application must work with big data sets effectively. For example, to list a log file in the browser I should implement a paginated list backed by AJAX (the server returns data according to the current page number). I would like to use AJAX throughout.
My question is how I should design an application like this. I have three possibilities:
AJAX with a RESTful service (sketched below)
JSP and servlets
JSF with AJAX
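For the first option, a hypothetical JAX-RS resource that serves one page of a log file to the AJAX front end might look like this (the log path and page size are assumptions, and a JSON provider such as Jackson is expected on the classpath):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/logs")
public class LogResource {

    private static final int PAGE_SIZE = 100;

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<String> getPage(@QueryParam("page") int page) throws IOException {
        // Stream the file and skip to the requested page so only one page is held in memory.
        try (Stream<String> lines = Files.lines(Paths.get("/var/log/app.log"))) {
            return lines.skip((long) page * PAGE_SIZE)
                        .limit(PAGE_SIZE)
                        .collect(Collectors.toList());
        }
    }
}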
I would suggest you have a look at Chainsaw - http://logging.apache.org/chainsaw/index.html - and Lilith - http://lilith.huxhorn.de/ - to see how others have approached this.
The released version of Chainsaw is pretty old - a major update will be released shortly. If you want to try out a pre-release version, you can see a screenshot and get the tarball or Mac DMG here:
http://people.apache.org/~sdeboy/

Automatic sitemap generation

We have recently installed a Google Search Appliance in order to power our internal search (via the Java API), and all seems to be well, however I have a question regarding 'automatic' site-map generation that I'm hoping you guys may know the answer to.
We are aware of the GSA's ability to auto-generate site maps for each of its collections; however, this process is rather manual, and considering that we have around 10 regional sites that need to be updated as often as possible, it's not ideal to have to log into the admin interface on a regular basis in order to export them to the site root where search engines can find them.
Unfortunately there doesn't seem to be any API support for this, at least none that I can find, so I was wondering if anyone had any ideas for a solution or workaround, or, if all else fails, the best alternative.
At present I'm thinking that if we can get the full index back from the API in the form of a list, we can write an XML file out the old-fashioned way using a cron job or similar. However, this seems like a bit of a clumsy solution - any better ideas?
You could try the GSA Admin Toolkit, or simply write some code yourself which just logs in on the administration page and then uses that session to invoke the sitemap export URL (which is basically what the Admin Toolkit does).
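A rough illustration of that log-in-and-reuse-the-session idea; the login and export URLs and form field names below are placeholders, not documented GSA endpoints, so check the Admin Toolkit source for the real ones:

import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SitemapExporter {
    public static void main(String[] args) throws Exception {
        // Let HttpURLConnection remember the admin session cookie across requests.
        CookieHandler.setDefault(new CookieManager());

        // 1. Log in to the admin interface (URL and form fields are placeholders).
        HttpURLConnection login = (HttpURLConnection) new URL("https://gsa.example.com:8443/login").openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        login.getOutputStream().write("userName=admin&password=secret".getBytes(StandardCharsets.UTF_8));
        login.getResponseCode(); // force the request so the session cookie is stored

        // 2. Call the sitemap export URL with the same session and save the result.
        HttpURLConnection export = (HttpURLConnection) new URL("https://gsa.example.com:8443/sitemap-export").openConnection();
        try (InputStream in = export.getInputStream()) {
            Files.copy(in, Paths.get("sitemap.xml"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}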
