I am working on an information retrieval application using Lucene 5.3.1 (the latest as of this writing). I have managed to index the terms from a text file and then search within it. The text file happens to contain chapter numbers like 2.1, 3.4.2, and so on.
The problem is that I don't need these numbers indexed, as I have no need to search for them, but I haven't been able to find out how to exclude certain terms from tokenizing. I know the Analyzer uses a stop-word set to exclude several terms, but as far as I know it does nothing with numbers.
The simplest answer I can come up with: remove the numbers from the text before indexing. You can use regular expressions for that. This solution has one side effect: PositionIncrementAttribute will be calculated without those numbers, since they no longer appear in the text, and that can break some of your PhraseQuery instances.
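For example, a minimal sketch (the pattern is an assumption; adjust it to your numbering scheme):

// Strip standalone chapter numbers like "2.1" or "3.4.2" before indexing
String cleaned = text.replaceAll("\\b\\d+(?:\\.\\d+)*\\b", " ");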
Another option, as was already mentioned: write a custom TokenFilter to strip the numbers out. But you should remember:
to tune your Analyzer not to split terms on dots; otherwise 2.1 will become two terms instead of one, which again can cause problems with PhraseQuery;
to correctly adjust the value of PositionIncrementAttribute (increment it) while removing terms from the TokenStream; see the sketch below.
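A minimal sketch against the Lucene 5.x API (the class name and pattern are mine, not from the question); FilteringTokenFilter does the position-increment bookkeeping for dropped tokens for you:

import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

// Drops tokens that look like chapter numbers, e.g. "2.1" or "3.4.2"
public class ChapterNumberFilter extends FilteringTokenFilter {
    private static final Pattern CHAPTER = Pattern.compile("\\d+(\\.\\d+)*");
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public ChapterNumberFilter(TokenStream in) {
        super(in);
    }

    @Override
    protected boolean accept() {
        // Keep every token that is not a pure chapter-number pattern
        return !CHAPTER.matcher(termAtt).matches();
    }
}

Wire it into your analysis chain right after the tokenizer; StandardTokenizer already keeps 2.1 as a single token, so the filter should see it whole.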
Related
I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking of instead simply doing a search based on the term, and then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.
I wrote the Chunked Element Range Indexes blog post, and found out last-minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques, such as wildcard matching, were suddenly much faster. That surprised me, because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, especially if I tried introducing a wildcard search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do it, as I believe they'd add some overhead. Call cts:search or one of the other cts calls to match a lexicon; I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index, which is held entirely in memory, so it's faster. Turn on all the wildcard indexes on your db if you can.
It should be called from a custom XQuery script in a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as streamlined as possible to handle most type-ahead scenarios correctly (that is, fast enough).
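A minimal sketch of such a script, assuming an element range index on a hypothetical <title> element (the element name, request field, and limit are all placeholders):

xquery version "1.0-ml";

(: Wildcard-match the user's partial input against the in-memory value
   lexicon of the range index, and cap the number of suggestions :)
declare variable $q as xs:string := xdmp:get-request-field("q", "")[1];

fn:subsequence(
  cts:element-value-match(xs:QName("title"), fn:concat($q, "*")),
  1, 10)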
I'd suggest you find ways to whittle the set of values in the range index down to fewer than 100,000, so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if the user has already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned, since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top.

Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train of thought, so don't interrupt them if you're going to suggest search phrases that won't help them get what they want. I've seen that far too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
Hope that helps!
You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons would produce suggestions based on tokenized character data, not entire values of elements (or JSON properties). It might be worth looking into that.
Alternatively, since you mention wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wildcarded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual document contents.
HTH!
Just double-checking on this: I assume there is no way to find out from an index itself which Version and Analyzer were used to create it, and that if you want to keep such info somehow bundled up with the index files in your index directory, you have to work out a way to do it yourself.
Obviously you might be using different Analyzers for different directories, and 99% of the time it is pretty important to use the right one when constructing a QueryParser: if your QueryParser uses a different one, all sorts of inaccuracies might crop up in the results.
Equally, getting the wrong Version of the index files might, for all I know, not result in a complete failure: again, you might instead get inaccurate results.
I wonder whether the Lucene people have ever considered bundling up this sort of info with the index files? Equally, I wonder whether any of the Lucene-derived apps, like Elasticsearch, incorporate such a mechanism?
Actually, just looking inside the "_0" files (_0.cfe, _0.cfs and _0.si) of an index, all three do contain the word "Lucene", seemingly followed by version info. Hmmm...
PS: other related thoughts occur. Say you are indexing a text document of some kind (or 1000 documents)... and you want to keep your index up to date each time it is opened. One obvious way to do this would be to compare the last-modified date of each file with the last time the index was updated: any documents which are now out of date would need to have the info pertaining to them removed from the index, and then be re-indexed.
This need must occur all the time in connection with Lucene indices. How is it generally tackled in the absence of helpful "meta info" included with the index files proper?
Anyone interested in this issue:
It does appear, from what I said above, that the Version is contained in the index files. I looked at the CheckIndex class and the various info you can get from it, e.g. CheckIndex.Status.SegmentInfoStatus, without finding a way to obtain the Version. I'm starting to assume this is deliberate, and that the idea is just to let Lucene handle the updating of the index as required. Not an entirely satisfactory state of affairs, if so...
As for getting at other things, such as the Analyzer class, it appears you have to implement this sort of "metadata" mechanism yourself if you want it. This could be done by including a text file alongside the other index files, or by using the index's commit user data; your Version could also be stored this way.
For writing such info, see IndexWriter.setCommitData().
For retrieving such info, you have to use one of several (?) subclasses of IndexReader, such as DirectoryReader.
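A minimal sketch of both directions against the Lucene 5.x API (the directory path, keys, and values are just placeholders):

import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CommitDataDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));

        // Write: record the analyzer class and an app-level version
        // as commit user data; it is persisted with the next commit
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Map<String, String> meta = new HashMap<>();
            meta.put("analyzer", StandardAnalyzer.class.getName());
            meta.put("appIndexVersion", "1.0");
            writer.setCommitData(meta);
            writer.commit();
        }

        // Read: the user data travels with the commit point
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            Map<String, String> meta = reader.getIndexCommit().getUserData();
            System.out.println(meta.get("analyzer"));
        }
    }
}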
I've stored a Lucene document with a single TextField that contains words without stems.
I need to implement a search program that allows users to search for both words and exact words,
but if I've stored the words without stemming, a stemmed search cannot be done.
Is there a way to search for both exact words and/or stemmed words in documents without
storing two fields?
Thanks in advance.
Indexing two separate fields seems like the right approach to me.
Stemmed and unstemmed text require different analysis strategies, and so require you to provide different Analyzers to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers; that is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (particularly heavier scoring on terms that are not touched by the stemmer).
There is no need to store the text in both of these fields; it only needs to be indexed in the two separate fields.
You can apply a different analyzer to different fields by using a PerFieldAnalyzerWrapper, by the way. Like:
Map<String,Analyzer> analyzerPerField = new HashMap<String,Analyzer>();
analyzerPerField.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerPerField.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
// Fields not listed above fall back to the default analyzer (first argument)
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerPerField);
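Pass the same wrapper to both indexing and querying so the two sides stay consistent, e.g.:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
QueryParser parser = new QueryParser(Version.LUCENE_44, "stemmedText", analyzer);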
I can see a couple of possibilities to accomplish it though, if you really want to.
One would be to create your own stem filter, based on (or possibly extending) the one you wish to use, and add in the ability to keep the original tokens after stemming. Mind your position increments in this case; phrase queries and the like may be problematic. See the sketch below.
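For that first possibility, a minimal sketch using stock Lucene filters (4.4-era API, to match the code above): KeywordRepeatFilter emits each token twice at the same position and flags one copy as a keyword so the stemmer leaves it alone, which also takes care of the position increments:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_44, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_44, source);
        stream = new KeywordRepeatFilter(stream);         // duplicate each token
        stream = new PorterStemFilter(stream);            // stem the non-keyword copy
        stream = new RemoveDuplicatesTokenFilter(stream); // drop copies stemming left unchanged
        return new TokenStreamComponents(source, stream);
    }
};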
The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after stemming it manually. Two values added under the same field name are effectively concatenated. You'd want to store the text in a separate field, in this case. Expect wonky scoring.
Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.
When I use an analyzer with edge ngrams (min=3, max=7, front) plus term_vector=with_positions_offsets,
with a document having text = "CouchDB",
and I search for "couc",
my highlight is on "cou" and not on "couc".
It seems my highlight covers only the minimum matching token "cou", while I would expect it to cover the exact token (if possible), or at least the longest token found.
It works fine when the text is analyzed without term_vector=with_positions_offsets.
What's the impact on performance of removing term_vector=with_positions_offsets?
When you set term_vector=with_positions_offsets for a specific field it means that you are storing the term vectors per document, for that field.
When it comes to highlighting, term vectors allow you to use the Lucene fast vector highlighter, which is faster than the standard highlighter. The reason is that the standard highlighter doesn't have any fast way to highlight, since the index doesn't contain enough information (positions and offsets). It can only re-analyze the field content, recompute offsets and positions, and highlight based on that information. This can take quite a while, especially with long text fields.
Using term vectors, you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which will increase notably. I must add, though, that since Lucene 4.2 term vectors are better compressed and stored in an optimized way. There's also the newer PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.
Elasticsearch automatically uses the best way to do highlighting based on the information available: if term vectors are stored, it will use the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done by the standard highlighter. It will be slower, but the index will be smaller.
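For reference, this trade-off is toggled per field in the mapping; a sketch in the pre-2.0 mapping syntax, with the type, field, and analyzer names as placeholders:

{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "edge_ngram_analyzer",
          "search_analyzer": "standard",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}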
Regarding ngram fields, the behaviour you describe is odd, since the fast vector highlighter is supposed to have better support for ngram fields; I would have expected exactly the opposite result.
I know this question is old, but it has not yet been answered completely:
There is another option that can lead to such strange behaviour:
You have to set require_field_match to true if you don't want matches against other fields to influence the highlighting of the current field; see: http://www.elasticsearch.org/guide/reference/api/search/highlighting/
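A sketch of a search request with that flag set (the field name is a placeholder):

{
  "query": { "match": { "name": "couc" } },
  "highlight": {
    "require_field_match": true,
    "fields": { "name": {} }
  }
}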
Background:
I am developing a program that iterates over all the movies & TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.
I extract the movie name by removing all the unnecessary text, such as '.avi', '720p' etc., from the file name.
I am using Java.
Problem:
Some folders contain movie files such as:
Episode 301 Rainforest Schmainforest.avi
Episode 302 Spontaneous Combustion.avi
The word 'Episode' and numbers are valid and common words in movie titles, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.
Another folder might contain:
720p.S5.E1.cripple fight.avi
720p.S5.E2.towelie.avi
Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these arbitrary patterns so I can extract the keywords. It would be infeasible to write a regex for each case.
Summary:
Is there a tool or API that I can use to find complex repetitive patterns (it must be able to match sequences of numbers)? [Something like a longest-common-subsequence library.]
Well, you could simply take all the filtered names in your dir and do a simple word count. You could give extra weight to words that occur in (roughly) the same spot every time.
In the end you'd have a count and a weight, and you need to decide where to draw the lines. The cutoff is probably not every file in the dir (because of, say, images or samples), but if most files share a certain word, it's not "the" or something like that, and maybe they all appear "at the start" or "in the second spot", you can filter them. A sketch follows below.
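A minimal sketch of that word-count idea (the tokenization and the more-than-half threshold are assumptions, not a tested recipe):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class BoilerplateStripper {
    // Drop any word that appears in more than half of the file names
    public static List<String> cleanNames(List<String> fileNames) {
        Pattern sep = Pattern.compile("[._\\s]+");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String name : fileNames) {
            // Count each word once per file name
            for (String w : new HashSet<String>(Arrays.asList(sep.split(name.toLowerCase())))) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        int threshold = fileNames.size() / 2;
        List<String> cleaned = new ArrayList<String>();
        for (String name : fileNames) {
            StringBuilder sb = new StringBuilder();
            for (String w : sep.split(name)) {
                if (counts.getOrDefault(w.toLowerCase(), 0) <= threshold) {
                    sb.append(w).append(' ');
                }
            }
            cleaned.add(sb.toString().trim());
        }
        return cleaned;
    }
}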
But this wouldn't work for, to pick a random example, Friends episodes: they're all called "The One Where...". That would be filtered out by every sane version of your sought-after algorithm.
The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.
The only thing you can do is make a blacklist of stuff you want to filter, like you already seem to do with the avi / 720 thing.
I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.
Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.
Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately, quite a few of the algorithms developed in active research areas are encumbered by patents and such...
I have a kind-of answer posted here
http://pastebin.com/Eb0cQyKd
I wanted to remove non-unique parts of file names, such as '720dpi', 'Episode', 'xvid' and 'ac3', without specifying in advance what they would be, but I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept changing.
The code linked above uses Python (not Java) to remove all non-unique words in a file name.
Basically, it builds a list of all the words used in the file names, and any word which comes up in most of the files gets put into a dictionary. Then it iterates through the files and deletes all of these dictionary words from them.
The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames. I convert all these to spaces.
I have used it a lot recently and it works well.