Enabling indexing for the version store in Alfresco (solr-search subsystem) - java

Is there any way to enable indexing for version2Store in Alfresco?
I'm using Alfresco 4.2.c and the solr-search subsystem.
My requirement is as follows:
Users will search based on content in Alfresco. If a word is not present in the latest version of a document but is present in a previous version, the search results should include the older version. But if I run a Lucene query against version2Store, nothing is listed, since that store is not indexed. How do I enable indexing for version2Store? Please help me in this regard.
Thanks in advance

As discussed via comments we are talking about the SOLR search subsystem here.
Please take a look at the Alfresco wiki here and post another, more detailed question if you're still struggling: http://wiki.alfresco.com/wiki/Alfresco_And_SOLR#Configuring_query_against_additional_stores
You should also keep in mind that just adding the store to the index does not automatically include the store in your queries. Take a look at org.alfresco.service.cmr.search.SearchParameters.addStore(...).
AFAIK it's only possible to execute a search against a single store, as the stores are handled as separate SOLR cores internally.
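For illustration, a minimal sketch of an explicit query against the version store via the foundation API; the store reference workspace://version2Store and the Spring-injected SearchService are assumptions, and the store still has to be indexed by the subsystem for this to return anything:

```java
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchParameters;
import org.alfresco.service.cmr.search.SearchService;

public class VersionStoreSearch {

    private SearchService searchService; // assumed to be injected, e.g. via Spring

    public ResultSet searchVersionStore(String term) {
        SearchParameters sp = new SearchParameters();
        // assumption: the version store reference is workspace://version2Store
        sp.addStore(new StoreRef("workspace", "version2Store"));
        sp.setLanguage(SearchService.LANGUAGE_LUCENE);
        sp.setQuery("TEXT:\"" + term + "\"");

        // the caller is responsible for closing the returned ResultSet
        return searchService.query(sp);
    }
}
```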

Related

Customizing StormCrawler

I have installed StormCrawler including the Elasticsearch integration. I have also gone through the videos from the creator of StormCrawler on YouTube, which were a good introduction, and I am familiar with Apache Storm.
However, I find that there is a lack of how-to information and videos about where to go from there.
Now, this raises the question of how to customize StormCrawler. Between which bolts should additional functionality be implemented? Also, how can I find out which fields are passed between these bolts, so that I can see what information can be extracted? In addition, when saving documents to Elasticsearch, should I update the schema for Elasticsearch, or can additional fields simply be sent to the Elasticsearch bolt?
Now, this raises the question of how to customize StormCrawler. Between which bolts should additional functionality be implemented?
Well, this depends on what you want to achieve. Can you give us an example?
Also, how can I find out which fields are passed between these bolts, so that I can see what information can be extracted?
You can look at the declareOutputFields methods of the bolts you are using, for instance, this one for the parser bolt. All bolts will have the URL and metadata object as input; some will also have the binary content or the text, depending on where they are in the chain.
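As a sketch of what that looks like in practice, here is a pass-through bolt written against the plain Storm API. The field names ("url", "content", "metadata", "text") are assumptions based on what the StormCrawler parser bolt typically emits, so check the actual declareOutputFields of the bolts in your own topology:

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Pass-through bolt inserted between the parser and the indexer.
// The field names below are assumptions; verify them against the
// declareOutputFields of the upstream bolt in your topology.
public class MyEnrichmentBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String url = input.getStringByField("url");
        String text = input.getStringByField("text");
        // ... custom enrichment goes here, e.g. inspect the metadata object ...
        collector.emit(input, new Values(url, input.getValueByField("content"),
                input.getValueByField("metadata"), text));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // must match what the downstream bolt expects as its input fields
        declarer.declare(new Fields("url", "content", "metadata", "text"));
    }
}
```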
In addition, when saving documents to Elasticsearch, should I update the schema for Elasticsearch, or can additional fields simply be sent to the Elasticsearch bolt?
I think this is mentioned in one of the videos. ES does a pretty good job of guessing what type a field is based on its content, but you might want to declare the fields explicitly to have full control over how they are indexed in ES.
Now for a practical answer based on the comment below. The good news is that everything you need should already be available out of the box; there is no need to implement a custom bolt. What you need is the Tika module, which will extract the text and metadata from the PDFs. The difference from the README instructions is that you don't need to connect the output of the redirection bolt to the indexing bolt, as you are not interested in indexing the non-PDF documents. The last thing is to change parser.mimetype.whitelist so that only PDF docs are parsed with Tika.
Don't forget to connect the Tika bolt to the StatusUpdaterBolt if you are using one.
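As a very rough Java sketch of that extra wiring (the component ids such as "fetch", the bolt class names and the "tika" stream id follow the storm-crawler-tika README as I recall it, so treat them as assumptions and check the version you are running):

```java
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.tika.ParserBolt;
import com.digitalpebble.stormcrawler.tika.RedirectionBolt;

public class PdfTopologyWiring {

    /** Adds the Tika branch to an existing crawl topology. */
    public static void addTikaBranch(TopologyBuilder builder) {
        // JSoup handles HTML; anything it cannot parse is shunted onto the "tika" stream
        builder.setBolt("jsoup", new JSoupParserBolt())
               .localOrShuffleGrouping("fetch");
        builder.setBolt("shunt", new RedirectionBolt())
               .localOrShuffleGrouping("jsoup");
        builder.setBolt("tika", new ParserBolt())
               .localOrShuffleGrouping("shunt", "tika");
        // Connect only the Tika output to your indexing bolt (and its status
        // stream to the status updater bolt, if you use one); the plain HTML
        // branch is left unconnected to the indexer since only PDFs matter here.
    }
}
```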

Should I use SolrJ to convert a Lucene project into a browser-based search engine?

My current search engine involves two desktop applications based on Lucene (java). One is dedicated to indexing internal documents, the other one to searching.
Now I have been asked to offer the search engine as a web page. So my first thought was to use Solr, and I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html), but then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether the PDF originates from a scanned document, or we limit the number of pages that will be OCRed in a scanned PDF, since only the first pages are valuable for search. For now everything works via calls to the Lucene API in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how would I go about it?
Thank you very much in advance for your advice.
While this is rather opinion-based, I'll offer my opinion. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, externally to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
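A rough sketch of what that submission step could look like with SolrJ 7.x; the Solr URL, core name and field names below are placeholders, and the actual content would come from your existing PDF/OCR pipeline:

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PdfIndexer {

    public static void main(String[] args) throws IOException, SolrServerException {
        // the Solr URL and core name "documents" are placeholders
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {

            // run the existing PDF pipeline (scanned-page detection, OCR page limit,
            // text extraction) exactly as today, then map the result onto a document
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "report-42.pdf");   // placeholder values
            doc.addField("title", "Quarterly report");
            doc.addField("content", "text extracted by your own PDF/OCR code");

            solr.add(doc);
            solr.commit();
        }
    }
}
```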
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code is outside of Solr, since Solr will only be concerned with the actual text and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.

How to modify search result page given by Solr?

I intend to make a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies.
I followed this tutorial http://wiki.apache.org/nutch/NutchTutorial and have successfully installed Nutch and Solr on my Ubuntu system. I was also successful in injecting the seed URL into the webdb and performing the crawl.
Using the Solr interface at http://localhost:8983/solr/admin, I can also query the crawled results. But this is the output I receive.
Am I missing something here? The earlier apache-nutch-0.7 had a WAR which generated clear HTML output like this. How do I achieve this? Or, if anyone could point me to a more recent tutorial or guidebook, it would be highly appreciated.
A couple of things:
If you are just starting, do not use Solr 3.6; go straight to the latest 4.1+. A bunch of things have changed and a lot of new features have been added.
You seem to be saying that you will expose Solr + UI directly to the general web. That's a really bad idea, as Solr is completely unsecured and allows web-based delete queries. You really want a business layer in the middle.
With Solr 4.1, there is a pretty Admin UI and there is also a /browse page that shows how to use Velocity to build pages backed by Solr. Or have a look at something like Project Blacklight for an example of how to put a UI over Solr.
I found the link below, which answered my query:
http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-nutch-and-solr/
I agree; after reading the content available at the above link, I felt rather annoyed with myself.
The Solr package provides all the required objects to query Solr.
In fact, the essential jars are just solr-solrj-3.4.0.jar, commons-httpclient-3.1.jar and slf4j-api-1.6.4.jar.
Anyone can build a Java search engine using these objects to query the index and put a fancy UI on top.
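For illustration, a minimal query using just those jars and the old 3.x SolrJ API; the Solr URL and the field names (url, title, content) follow the usual Nutch schema, but treat them as assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimpleSearch {

    public static void main(String[] args) throws Exception {
        // URL of the Solr instance populated by the Nutch crawl (placeholder)
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("content:nutch"); // field names are assumptions
        query.setRows(10);

        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("url") + " : " + doc.getFieldValue("title"));
        }
    }
}
```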
Thanks again.

Juddi publish and find service

I have successfully set up an Apache jUDDI v3 installation (the Tomcat version) on my computer. What I want now is to publish a service whose WSDL is found at
http://localhost:8080/axis2/services/CmmdcService/wsdl
To achieve this, I created a standalone Java application (starting from the jUDDI documentation) that publishes the service found at the above location.
The publish part looks OK, but when I then query the jUDDI database for the service, the field that should contain the found services (getServiceInfos()) is always null. I really don't know what is wrong, and I didn't find any good documentation or tutorial about this on the internet.
Here you can find the sources of the program. Just unarchive it and go to the ./publish folder. The application is found there.
Without much Apache knowledge, it sounds as if the getServiceInfos() function is trying to retrieve information from the wrong sub-folder when you do a query. Try changing the location of the search function so that it searches all folders/locations, or the specific folder/location where the database is located.
I could be wrong (I have limited skills with Apache).
Good luck, and sorry if this confused you or did not help.
Edit: Sorry, I misread the question. I'm not sure what search criteria you've specified, but the server didn't return any results.
When using the "approximateMatch" find qualifier, you really need to specify a wildcard character, such as % (any number of characters) or _ (a single character).
Long story short, this is probably a bug that has since been fixed. Try a newer version.
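To illustrate the wildcard point, here is a hedged sketch of such an inquiry with the jUDDI v3 client; the config path META-INF/uddi.xml, the node name "default" and the service name pattern are assumptions taken from the jUDDI examples, so adapt them to your setup:

```java
import org.apache.juddi.v3.client.UDDIConstants;
import org.apache.juddi.v3.client.config.UDDIClient;
import org.uddi.api_v3.FindQualifiers;
import org.uddi.api_v3.FindService;
import org.uddi.api_v3.Name;
import org.uddi.api_v3.ServiceList;
import org.uddi.v3_service.UDDIInquiryPortType;

public class FindServiceExample {

    public static void main(String[] args) throws Exception {
        // the config file and the node name "default" are assumptions from the jUDDI samples
        UDDIClient client = new UDDIClient("META-INF/uddi.xml");
        UDDIInquiryPortType inquiry = client.getTransport("default").getUDDIInquiryService();

        FindService fs = new FindService();
        FindQualifiers fq = new FindQualifiers();
        fq.getFindQualifier().add(UDDIConstants.APPROXIMATE_MATCH);
        fs.setFindQualifiers(fq);

        Name name = new Name();
        name.setValue("%CmmdcService%"); // the % wildcards matter with approximateMatch
        fs.getName().add(name);

        ServiceList result = inquiry.findService(fs);
        if (result.getServiceInfos() == null) {
            System.out.println("no matches");
        } else {
            result.getServiceInfos().getServiceInfo()
                  .forEach(si -> System.out.println(si.getServiceKey()));
        }
    }
}
```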

Automatic sitemap generation

We have recently installed a Google Search Appliance in order to power our internal search (via the Java API), and all seems to be well, however I have a question regarding 'automatic' site-map generation that I'm hoping you guys may know the answer to.
We are aware of the GSA's ability to auto-generate sitemaps for each of its collections; however, this process is rather manual, and considering that we have around 10 regional sites that need to be updated as often as possible, it's not ideal to have to log into the admin interface on a regular basis in order to export them to the site root where search engines can find them.
Unfortunately there doesn't seem to be any API support for this, at least none that I can find, so I was wondering if anyone had any ideas for a solution or workaround or, if all else fails, the best alternative.
At present I'm thinking that if we can get the full index back from the API in the form of a list, then we can write an XML sitemap out the old-fashioned way using a cron job or similar, but this seems like a bit of a clumsy solution. Any better ideas?
You could try the GSA Admin Toolkit, or simply write some code yourself which just logs in on the administration page and then uses that session to invoke the sitemap export URL (which is basically what the Admin Toolkit does).
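If you go the write-it-yourself route, the flow the answer describes (log in, keep the session, hit the export URL) could look roughly like the sketch below; every URL, port and form field in it is a hypothetical placeholder, so capture the real ones from your appliance's admin console (or from what the Admin Toolkit scripts do) first:

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class SitemapExport {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())   // keeps the admin session cookie
                .build();

        // 1) log in to the admin console (hypothetical URL and form fields)
        HttpRequest login = HttpRequest.newBuilder()
                .uri(URI.create("https://gsa.example.com:8443/EnterpriseController"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("userName=admin&password=secret"))
                .build();
        http.send(login, HttpResponse.BodyHandlers.discarding());

        // 2) call the sitemap export URL within the same session (hypothetical URL),
        //    then drop the result into the site root, e.g. via a scheduled job
        HttpRequest export = HttpRequest.newBuilder()
                .uri(URI.create("https://gsa.example.com:8443/sitemap-export?collection=intranet"))
                .GET()
                .build();
        http.send(export, HttpResponse.BodyHandlers.ofFile(Path.of("sitemap.xml")));
    }
}
```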
