Customizing StormCrawler - java

I have installed StormCrawler, including the Elasticsearch integration. I have also completed the introductory videos on YouTube from the creator of StormCrawler, which were a good introduction, and I am familiar with Apache Storm.
However, I find that there is a lack of how-to information and videos about how to go on from there.
Now, this raises the question of how to customize StormCrawler. Between which bolts should additional functionality be implemented? Also, how can I find out which fields are passed between these bolts, so that I can find what information can be extracted? In addition, when saving documents to Elasticsearch, should I update the schema for Elasticsearch, or can additional fields simply be sent to the Elasticsearch bolt?

Now, this raises the question of how to customize StormCrawler. Between
which bolts should additional functionality be implemented?
Well, this depends on what you want to achieve. Can you give us an example?
Also, how
can I find out which fields are passed between these bolts, so that I
can find what information can be extracted?
You can look at the declareOutputFields methods of the bolts you are using, for instance the one in the parser bolt. All bolts take the URL and metadata object as input; some will also have the binary content or the text, depending on where they are in the chain.
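To make this concrete, here is a minimal sketch of a custom bolt slotted into the chain after the parser. The field names url, text and metadata are assumptions based on the parser bolt's typical output; always check the actual declareOutputFields of the bolt you connect to, since the names must match.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical enrichment bolt placed between the parser and the indexer.
public class EnrichmentBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Read the fields declared by the upstream bolt.
        String url = tuple.getStringByField("url");
        String text = tuple.getStringByField("text");
        Object metadata = tuple.getValueByField("metadata");

        // ...add your own enrichment here, e.g. modify the metadata...

        collector.emit(tuple, new Values(url, text, metadata));
        collector.ack(tuple);
    }

    // This is the method to inspect (and mirror) when chaining bolts:
    // the fields declared here must match what the next bolt reads.
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "text", "metadata"));
    }
}
```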
In addition, when saving
documents to Elasticsearch, should I update the schema for
Elasticsearch, or can additional fields simply be sent to the
Elasticsearch bolt?
I think this is mentioned in one of the videos. ES does a pretty good job of guessing what type a field is based on its content, but you might want to declare the fields explicitly to have full control over how they are indexed in ES.
Now for a practical answer based on the comment below. The good news is that everything you need should already be available out of the box; there is no need to implement a custom bolt. What you need is the Tika module, which will extract the text and metadata from the PDFs. The difference with the README instructions is that you don't need to connect the output of the redirection bolt to the indexing bolt, as you are not interested in indexing the non-PDF documents. The last thing is to change parser.mimetype.whitelist so that only PDF documents are parsed with Tika.
Don't forget to connect the Tika bolt to the StatusUpdaterBolt if you are using one.
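As a rough sketch, the wiring could look like the following. The component names, the "tika" stream and the bolt classes are recalled from the Tika module's README and should be verified against your StormCrawler version; IndexerBolt and StatusUpdaterBolt stand in for whichever indexing and status bolts (e.g. the Elasticsearch ones) you use.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.tika.ParserBolt;
import com.digitalpebble.stormcrawler.tika.RedirectionBolt;

public class PdfTopology {

    static void wireParsing(TopologyBuilder builder) {
        // ...spout, URL partitioner and fetcher bolts as in the standard topology,
        // plus a "jsoup" parser bolt upstream of this...

        // The redirection bolt passes documents JSoup could not parse
        // (e.g. PDFs) to Tika on a dedicated "tika" stream.
        builder.setBolt("shunt", new RedirectionBolt())
               .localOrShuffleGrouping("jsoup");
        builder.setBolt("tika", new ParserBolt())
               .localOrShuffleGrouping("shunt", "tika");

        // Connect ONLY the Tika bolt to the indexer: the redirection bolt's
        // default stream (non-PDF documents) is deliberately left unconnected,
        // so nothing but PDFs gets indexed.
        builder.setBolt("indexer", new IndexerBolt())
               .localOrShuffleGrouping("tika");

        // And keep the status stream flowing if you use a status updater.
        builder.setBolt("status", new StatusUpdaterBolt())
               .fieldsGrouping("tika", "status", new Fields("url"));
    }
}
```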

Related

Should I use SolrJ to convert a Lucene project into a browser-based search engine?

My current search engine involves two desktop applications based on Lucene (Java). One is dedicated to indexing internal documents, the other to searching.
Now I have been asked to offer the search engine as a web page, so my first thought was to use Solr, and I read the manual (https://lucene.apache.org/solr/guide/7_4/overview-of-searching-in-solr.html). But then I realized that during the indexing phase we have special processing for PDFs. For example, we detect whether a PDF originates from a scanned document, or we limit the number of pages that will be OCRed in a scanned PDF, since only the first pages are valuable for search. For now everything works via calls to the Lucene API, in classes with lots of ifs!
So my question is: should I use SolrJ to customize the indexing to our needs, should I keep the current indexing part and only use Solr(J) for searching, or should I override some Solr classes to meet our needs and avoid reinventing the wheel? For the latter (overriding Solr classes), how should I go about it?
Thank you very much in advance for your advice.
While this is rather opinion-based, I'll offer mine. All your suggested solutions would work, but the best one is to write the indexing code as a separate process, external to Solr (i.e. re-use your existing code that pushes data to a Lucene index directly today).
Take the tool you have today, and instead of writing data to a Lucene index, use SolrJ and submit the document to Solr instead. That will abstract away the Lucene part of the code you're using today, but will still allow you to process PDFs in your custom way. Keeping the code outside of Solr will also make it far easier to update Solr in the future, or switch to a newer version of the PDF library you're using for parsing without having to coordinate and integrate it into Solr.
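As a sketch of that last step, assuming a local Solr core named documents (the URL and field names are illustrative and must match your core and schema):

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PdfIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();

        // Your existing pipeline still produces the extracted text
        // (scanned-PDF detection, page-limited OCR, etc.);
        // only this last step replaces the direct Lucene writes.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("title", "Quarterly report");
        doc.addField("text", "...text produced by your existing PDF/OCR pipeline...");

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}
```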
It'll also allow you to run the indexing code completely separate from Solr, and if you decide to drop Solr for another HTTP interfaced technology in the future (for example Elasticsearch which is also based on Lucene), you can rip out the small-ish part that pushes content to Solr and push it to Elasticsearch instead.
Running multiple indexing processes in parallel is also easier when as much as possible of the indexing code lives outside of Solr, since Solr will only be concerned with the actual text and doesn't have to spend time processing and parsing PDFs when it should just be responding to user queries (and your updates) instead.

Apache Camel data types between components

I could not find any documentation around this on Apache Camel website.
What types are supported in the communication between the components in Apache Camel, and how? I would like to understand the magic that happens inside it, because the docs just say you consume data from a file system or FTP, a message from JMS, SQL data and tons of other possibilities, and send them to a producer that magically seems to accept anything and output to tons of possibilities as well.
Did they write converters from every type to every type in the framework?
And I wonder the same about enrichers. All these connectors seem extremely flexible, and I could not find any reference to what supports what. I'm willing to write a component for a system, and I couldn't find a good way to do it.
Do I have to write converters for all the possible types that can come in?
I have seen that Camel works with the Exchange class, and that it uses it to send the messages back and forth between components. It is pretty vague in my mind how the components deal with the different possible message types.
I recommend taking a look at the "Camel in Action" book by Claus Ibsen and Jonathan Anstey. I used to have questions like those, and they are answered well there. Chapter 11.3 will guide you through creating your own component. Also, you can check out this GitHub link to start with; it has an example of how to create your own component.
Camel may not know what types you pass in the message body, so it offers you multiple ways to transform the payload, from creating a Processor for the transformation to using the Java DSL transform method, which accepts an Expression.
Just be ready to handle the case where an unknown object is consumed; don't worry about covering all possible incoming objects.
It all depends on how the Consumers are implemented.
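A minimal sketch of the idea: asking for getBody(String.class) lets Camel's type-converter registry do the conversion (no hand-written all-to-all converters needed), while a Processor gives you full control over the payload. The endpoint URIs here are illustrative, and the JMS endpoint assumes a configured JMS component.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class BodyConversionExample {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // The file component puts a file object in the body; asking
                // for a String triggers Camel's type-converter registry.
                from("file:inbox")
                    .process(exchange -> {
                        String body = exchange.getIn().getBody(String.class);
                        exchange.getIn().setBody(body.toUpperCase());
                    })
                    .to("jms:queue:out");
            }
        });
        context.start();
    }
}
```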

Data abstraction or Data Connector framework for Java

Note: There is a good chance I'm not using the correct terminology here, and that may be the reason I'm not finding answers to my question. I apologize upfront if this has already been answered; please just direct me there.
I am looking for an open source framework written in Java that would allow me to build pluggable data connectors (and obviously have some built in already) and almost have a query language (abstraction layer) that would translate into any of those connections.
For example: I would be able to say:
Fetch 1 record from a Mongo DB that matches name='John Doe'
and get JSON as a response
or I could say
Fetch all records from a MySQL DB that matches name='John Doe'
and get a JSON as a response
If not exactly what I described, I am willing to work with anything that would have a part of this solved.
Thank you in advance!
You're not going to find a "Swiss army knife" data-abstraction framework that does all of the above. Perhaps the closest thing to what you ask for would be JPA providers for both Mongo and MySQL (Hibernate is a well-regarded JPA provider for MySQL, and a quick Google search shows Kundera, DataNucleus and Hibernate OGM for Mongo). This will let you map your data to Java objects, which might be a step further than what you ask for since you explicitly asked for JSON; however, there are numerous options for mapping the resulting objects into JSON if you need to present JSON to a user or another system (Jackson comes to mind for this).
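A minimal sketch of that combination: the same mapped class serves both stores, and Jackson turns the fetched object into the JSON you asked for. The entity and field names are made up, and the JPA query is left as a comment since it needs a configured EntityManager from your chosen provider.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import com.fasterxml.jackson.databind.ObjectMapper;

// The same entity is usable with Hibernate (MySQL) or
// Hibernate OGM / Kundera (Mongo).
@Entity
public class Person {
    @Id
    public Long id;
    public String name;
}

class Demo {
    public static void main(String[] args) throws Exception {
        // With a configured EntityManager em, the fetch would look like:
        // Person p = em.createQuery(
        //         "SELECT p FROM Person p WHERE p.name = :n", Person.class)
        //     .setParameter("n", "John Doe")
        //     .setMaxResults(1)
        //     .getSingleResult();
        Person p = new Person();
        p.id = 1L;
        p.name = "John Doe";

        // Jackson maps the result to JSON for the caller.
        String json = new ObjectMapper().writeValueAsString(p);
        System.out.println(json);
    }
}
```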
Try YADA, an open source data-abstraction framework.
From the README:
YADA is like a Universal Remote Control for data.
For example, what if you could access
any data set
at any data source
in any format
from any environment
using just a URL
with just one-time configuration?
You can with YADA.
Or, what if you could get data
from multiple sources
in different formats
merging the results
into a single set
on-the-fly
with uniform column names
using just one URL?
You can with YADA.
Full disclosure: I am the creator of YADA.

Implementing full text search in Java EE

I am making an application where I am going to need full-text search, so I found Compass, but that project is no longer maintained and has been replaced by Elasticsearch. However, I don't understand it. Is it its own server that I need to make requests (GET, PUT, etc.) against and then parse the JSON response from? Are there no annotations like in Compass? I don't understand how this is a replacement and how I would use it with Java EE.
Or are there other better projects to use?
Elasticsearch is a great choice nowadays; if you liked Compass you'll love it. Have a look at this answer that the author gave here, in which he explains why he went ahead and created Elasticsearch after Compass. In fact, Elasticsearch and Solr both make the use of Lucene pretty easy, while also adding some features to it. You basically have a whole search-engine server which is able to index your data, and which you can then query in order to retrieve the data that you indexed.
Elasticsearch exposes RESTful APIs and it's JSON based, but if you are looking for annotations in the Compass style you can have a look at the Object Search Engine Mapper for ElasticSearch.
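If you just want to see what "requests against a server" means in practice before picking a client library, here is a plain-JDK sketch that indexes one document over HTTP. The index name and document are made up; recent Elasticsearch versions use the single _doc type in the URL.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EsClient {

    // Builds e.g. http://localhost:9200/articles/_doc/1
    static String indexUrl(String host, int port, String index, String id) {
        return "http://" + host + ":" + port + "/" + index + "/_doc/" + id;
    }

    // PUTs a JSON document; Elasticsearch replies with a JSON acknowledgement.
    static int indexDocument(String url, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

Calling indexDocument(indexUrl("localhost", 9200, "articles", "1"), "{\"title\":\"Hello\"}") against a running node should create the document; in real code you would use one of the official clients instead of raw HTTP.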
I would say give Lucene or Solr a try. They build a document index on the file system for faster searching.
I would recommend one of the following:
Elasticsearch, if the system is big and needs clustering.
Lucene or Solr, if you want to code at a low level.
Hibernate Search, if you are using Hibernate as your ORM.

Which API should I use to implement search on my website

We have an in-house webapp running for internal use over the intranet for our firm.
Recently we decided to implement an efficient search facility, and would like input from the experts here about which APIs are available and which one would be most useful for the following use cases:
The objects are divided into business groups in our firm, i.e. an object can have various attributes, and the attributes are not shared between any two objects from different BGs (business groups)
Users might want to search for a specific attribute amongst an object
Users are from a business group, hence they have an idea about the kind of attributes related to their group
The API should be generic enough to allow a full-text/partial-text search when a list of objects is passed to it, along with the name of the attribute and the search text. More importantly, it should be able to index the result.
As this is an internal app, there are no restrictions on the space as such, but we need a fast and generic API.
I am sure Java already has something which suits our needs.
More info on the technology stack:
Language:Java
Server: Apache Tomcat
Stack : Spring, iBatis, Struts
Cache in place : ECache
Other API : Shindig API
Thanks
Neeraj
You can use Solr, from the Apache Lucene project, if text-based search has priority. It might be more than what you want, though; have a look.
http://lucene.apache.org/solr/
http://lucene.apache.org/
Solr is a great tool for search. The downside is that it may require some work to get it the way you want it.
With it, you can set different fields for a document and give them custom priority in each query.
You can create facets easily from those fields like with Amazon. Sorting is easy and quick. And has a spellchecker and suggestions engine built in.
The documents are matched using the dismax query mode, which you can customize.
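As an illustration of what a customized dismax request looks like, here is a plain-Java sketch that builds the query URL. The field names and boosts in qf are invented; q, defType, qf and fl are standard Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DismaxQuery {

    // Builds a dismax select URL: the user input goes in q as-is,
    // while qf decides which fields are searched and how they are boosted.
    static String buildQuery(String baseUrl, String userInput)
            throws UnsupportedEncodingException {
        return baseUrl + "/select"
                + "?q=" + URLEncoder.encode(userInput, "UTF-8")
                + "&defType=dismax"
                + "&qf=" + URLEncoder.encode("title^2.0 body^0.8", "UTF-8")
                + "&fl=id,title,score";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildQuery("http://localhost:8983/solr/docs", "annual report"));
    }
}
```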
