Solr query results - need searched text and a few lines around it - java

I am completely lost.
I think I am definitely missing something fundamental here. Everybody has such awesome stuff to say about Solr but I fail to see it.
I indexed a structured pdf document in Solr.
The problem is when I search for a simple string - I get the entire content field as the response!
I don't know how to change that.
My requirement is that if, let's say, I search for "metadata",
it should give me
"MetadataDiscussion . . . 4 matches
... make sure that Tika users have a chance to get to all of the metadata created and/or extracted by Tika. == Original Problem == The original inspiration for this page was a Tika ...
10.7k - rev: 2 (current)
last modified: 2010-08-02 18:09:45
"
But it gives me the whole document - the entire string that was indexed!
It seems like Lucene can only tell me in which field the match occurred, not where within the field it occurred.
Any help will be greatly appreciated!!

Lucene/Solr is primarily a retrieval engine - it retrieves documents that match a query. So this behavior is desirable and expected. Now as for your requirement, you can use the highlighting feature of Solr to give you exactly that. Suppose your document text is stored in a field named text - then you would pass the following parameters to Solr:
&hl=true&hl.fl=text&hl.snippets=5&hl.fragsize=200
Look through the other parameters to customize it even further.
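For example, the equivalent request through SolrJ might look like this (a minimal sketch; the core name "docs" and the field name "text" are assumptions to adapt to your own schema):

    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightExample {
        public static void main(String[] args) throws Exception {
            // Core name "docs" is an assumption -- substitute your own.
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build();

            SolrQuery query = new SolrQuery("metadata");
            query.setHighlight(true);          // &hl=true
            query.addHighlightField("text");   // &hl.fl=text
            query.setHighlightSnippets(5);     // &hl.snippets=5
            query.setHighlightFragsize(200);   // &hl.fragsize=200

            QueryResponse response = solr.query(query);

            // Highlights are keyed by document id, then by field name.
            Map<String, Map<String, List<String>>> highlighting = response.getHighlighting();
            highlighting.forEach((id, fields) ->
                    fields.getOrDefault("text", List.of())
                          .forEach(snippet -> System.out.println(id + ": " + snippet)));

            solr.close();
        }
    }

Each hit then gives you up to five roughly 200-character snippets with the match marked up, instead of the whole stored field.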
Solr is amazing :)

Related

Reject URLs after fetching based on a condition in Nutch

I want to know whether it's possible to filter the URLs that are fetched, based on a condition (for example, published date or time). I know that we can filter the URLs with regex-urlfilter for fetching.
In my case I don't want to index old documents. So, if a document was published before 2017, it has to be rejected. Is a date-filter plugin needed, or is one already available?
Any help will be appreciated. Thanks in advance.
If you only want to avoid indexing old documents you could write your own IndexingFilter that will check your condition and prevent the indexing of those documents. You don't mention your Nutch version, but assuming that you're using v1, we have a new PR (it will be ready for the next release) that will offer this feature out of the box, using JEXL expressions to allow/prevent documents from being indexed.
If you can grab the PR, test it, and provide some feedback, that would be amazing!
You could write your own custom plugin if you want; check the mimetype-filter for something similar (in that case we apply the filtering based on the mimetype).
Also, a word of warning: at the moment the fetchTime or modifiedTime that Nutch uses come from the headers that the webserver sends when the resource is fetched. Keep in mind that these values should not be trusted (unless you are 100% sure) because in most cases you'll get wrong dates. NUTCH-1414 proposes a better approach to extracting the publication date from the content of the page, or you can implement your own parser.
Note that with this approach you still fetch/parse the old documents; you just skip the indexing step.
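For illustration, a minimal sketch of such a custom IndexingFilter for Nutch 1.x (the "publish_date" field is hypothetical; it assumes some parse filter has already extracted a publication date like "2016-11-03" before this runs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class PublishDateIndexingFilter implements IndexingFilter {
        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                    CrawlDatum datum, Inlinks inlinks)
                throws IndexingException {
            // "publish_date" is a hypothetical field; a parse filter would
            // have to populate it before this filter runs.
            Object published = doc.getFieldValue("publish_date");
            if (published != null
                    && Integer.parseInt(published.toString().substring(0, 4)) < 2017) {
                return null; // returning null drops the document from indexing
            }
            return doc; // anything else passes through unchanged
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
    }

The class would still need the usual plugin.xml descriptor and an entry in plugin.includes to be picked up.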

querying multiple results from MediaWiki / Wikipedia using Android or Java

I am currently using MediaWiki's URL example to query HTTP GET requests on android.
I am simply getting information through a URL like this:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content
However, in this example, I always need some sort of direct title and only get one result back (titles=some name here).
I know that Wikipedia has more complex search methods, explained here:
http://en.wikipedia.org/wiki/Help:Searching
I would like to offer a few "previews" of multiple Wikipedia articles per search, since what users type might not always be what they want.
Is there any way to query these special "search" results?
Any help would be appreciated.
It looks like the MediaWiki search API may be what you're after. That particular page discusses getting previews of search results.
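As a rough illustration, querying the search API's list=search module from plain Java might look like this (a sketch using java.net.http from Java 11+; on Android you would use HttpURLConnection or an HTTP library instead, and the query string here is just an example):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class WikiSearch {
        public static void main(String[] args) throws Exception {
            // srsearch is the user's query; srlimit caps the number of hits.
            String query = URLEncoder.encode("james bond", StandardCharsets.UTF_8);
            String url = "https://en.wikipedia.org/w/api.php"
                    + "?format=json&action=query&list=search"
                    + "&srsearch=" + query + "&srlimit=5";

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            // Each hit in the JSON carries a "title" and an HTML "snippet" --
            // exactly the preview material you're after.
            System.out.println(response.body());
        }
    }

From there you would parse the JSON and render each title/snippet pair as one of your "previews".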

Searching for words like "UTTD_Equip_City_TE" in Lucene

Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried with StandardAnalyzer, but it seems to break down the word into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhitespaceAnalyzer, but it doesn't seem to be working... (I don't know why).
Could you help me with how I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there are other texts in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file are tables with variables. Variables are wrapped with « and ». I'm trying to search for those variables in the documents. I've tried searching "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.
KeywordAnalyzer tokenizes the entire field as a single string. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters within it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both Indexing and Querying?
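One quick way to answer that yourself is to print the tokens an analyzer actually emits (a minimal sketch against a recent Lucene analysis API):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new WhitespaceAnalyzer();
            try (TokenStream ts = analyzer.tokenStream(
                    "text", "SHIP TO LESSEE: «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // e.g. «UTTD_Equip_City_TE»,
                }
                ts.end();
            }
        }
    }

Note that with pure whitespace tokenization the « » wrappers and any trailing comma stay attached to the token, so a query for UTTD_Equip_City_TE alone will not match - which may well be why WhitespaceAnalyzer appeared not to work.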
I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that it emits text separated by '_' as a single token (refer to Generating a custom Tokenizer for new TokenStream API using JFlex/JavaCC). Build your Analyzer using this new Tokenizer along with LowerCaseFilter.
Option 2
Write a custom Analyzer that is made of WhitespaceTokenizer and custom TokenFilters. In these TokenFilters you decide how to act on the tokens returned by WhitespaceTokenizer.
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
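As a rough sketch of Option 2 against the newer analysis API (the 3.x API linked above differs slightly), an analyzer that keeps whitespace-separated tokens and strips the « » wrappers might look like:

    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

    public class VariableNameAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Split on whitespace first, then strip « » and trailing commas
            // so «UTTD_Equip_City_TE», is indexed as uttd_equip_city_te.
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            TokenStream stream = new PatternReplaceFilter(
                    tokenizer, Pattern.compile("[«»,]"), "", true);
            stream = new LowerCaseFilter(stream);
            return new TokenStreamComponents(tokenizer, stream);
        }
    }

Whatever you choose, make sure the same analyzer is used at both index time and query time.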

Indexing semi-structured data

I would like to index a set of documents that will contain semi-structured data, typically key-value pairs such as #author Joe Bloggs. These keywords should then be available as searchable attributes of the document which can be queried individually.
I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to be able to search over the documents using a typical word search, as I already can, and so would like something more than a custom regex extraction.
Any help would be greatly appreciated.
Niall
I wrote a source code search engine using Lucene as part of my bachelor thesis. One of the key features was that the source code was treated as structured information and was therefore searchable as such, i.e. searchable according to attributes as you describe above.
Here you can find more information about this project. If that is too extensive for you, I can sum up some things:
I created separate search fields for all the attributes which should be searchable. In my case those were, for example, 'method name', 'commentary', or 'class name'.
It can be advantageous to have the content of these fields overlap; however, this will blow up your index (but only linearly with the amount of redundant data in the searchable fields).
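To illustrate the separate-fields idea (a minimal sketch; the field names and the in-memory directory are just examples, not a prescribed setup):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class AttributeIndexer {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    new ByteBuffersDirectory(),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                // Parsed out of a "#author Joe Bloggs" line -- queryable on
                // its own as author:"joe bloggs" ...
                doc.add(new TextField("author", "Joe Bloggs", Field.Store.YES));
                // ... and also folded into a catch-all field so an ordinary
                // word search still finds it (this is the overlap mentioned).
                doc.add(new TextField("content", "#author Joe Bloggs ...", Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }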

Java API : downloading and calculating tf-idf for a given web page

I am new to IR techniques.
I am looking for a Java-based API or tool that does the following:
Download the given set of URLs
Extract the tokens
Remove the stop words
Perform Stemming
Create Inverted Index
Calculate the TF-IDF
Kindly let me know how Lucene can be helpful to me.
Regards
Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than the whole document.
If you just want the TF-IDF per term in a document, maybe use this method, without ever touching Lucene.
If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.
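To make the per-term point concrete, here is a tiny self-contained sketch of one common tf-idf variant, tfidf(t, d) = tf(t, d) * log(N / df(t)), with no Lucene involved (the toy documents are assumed to be already tokenized, stopped, and stemmed):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;

    public class TfIdf {
        public static void main(String[] args) {
            // Three toy "documents" as token lists.
            List<List<String>> docs = List.of(
                    List.of("lucene", "index", "search"),
                    List.of("lucene", "solr", "search", "search"),
                    List.of("tika", "metadata"));

            // Document frequency: in how many documents each term appears.
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs)
                for (String term : new HashSet<>(doc))
                    df.merge(term, 1, Integer::sum);

            // Score every distinct term of the second document.
            int n = docs.size();
            List<String> doc = docs.get(1);
            for (String term : new HashSet<>(doc)) {
                long tf = doc.stream().filter(term::equals).count();
                double tfidf = tf * Math.log((double) n / df.get(term));
                System.out.printf("%s: %.3f%n", term, tfidf);
            }
        }
    }

Notice that each score belongs to a (term, document) pair, not to a document as a whole.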
