Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

When I use an analyzer with edgengram (min=3, max=7, front) together with term_vector=with_positions_offsets,
on a document whose text is "CouchDB",
and I search for "couc",
the highlight is on "cou" and not "couc".
It seems the highlight lands only on the minimum matching token "cou", while I would expect it on the exact token (if possible) or at least on the longest token found.
It works fine when the text is analyzed without term_vector=with_positions_offsets.
What is the performance impact of removing term_vector=with_positions_offsets?

When you set term_vector=with_positions_offsets for a specific field it means that you are storing the term vectors per document, for that field.
When it comes to highlighting, term vectors allow you to use the Lucene fast vector highlighter, which is faster than the standard highlighter. The reason is that the standard highlighter has no fast way to highlight, since the index doesn't contain enough information (positions and offsets). It can only re-analyze the field content, intercept offsets and positions, and highlight based on that information. This can take quite a while, especially with long text fields.
With term vectors you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which will increase noticeably. I must add, though, that since Lucene 4.2 term vectors are better compressed and stored in an optimized way. There's also the new PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.
Elasticsearch automatically picks the best highlighter based on the information available. If term vectors are stored, it will use the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done with the standard highlighter. It will be slower but the index will be smaller.
Regarding ngram fields, the described behaviour is weird, since the fast vector highlighter should have better support for ngram fields; I would expect exactly the opposite result.
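To make the situation in the question concrete, here is a minimal, library-free sketch of which edge n-grams the document and the query have in common. It is not Elasticsearch's analyzer or highlighter. It assumes the same edge n-gram analyzer (with lowercasing) runs at both index and search time, which the question does not state explicitly; the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a toy front edge n-gram generator, not Elasticsearch code.
final class EdgeNgramSketch {

    // Front edge n-grams of a lowercased token, lengths minGram..maxGram.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        String lower = token.toLowerCase();
        int longest = Math.min(maxGram, lower.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(lower.substring(0, len));
        }
        return grams;
    }

    // Grams produced by both the document token and the query text,
    // ASSUMING the same analyzer is applied at index and search time.
    static List<String> sharedGrams(String docToken, String query, int minGram, int maxGram) {
        List<String> shared = new ArrayList<>(edgeNgrams(docToken, minGram, maxGram));
        shared.retainAll(edgeNgrams(query, minGram, maxGram));
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("CouchDB", 3, 7));
        System.out.println(sharedGrams("CouchDB", "couc", 3, 7));
    }
}
```

Under those assumptions both "cou" and "couc" match, so a highlighter that marks only the shortest shared gram would produce exactly the "cou" highlight described in the question.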

I know this question is old, but it has not been answered completely yet.
There is another option that can lead to such strange behaviour:
You have to set require_field_match to true if you don't want query matches on other fields to influence the highlighting of the current field, see: http://www.elasticsearch.org/guide/reference/api/search/highlighting/

Related

How to force dependency / linkage of genes in a genetic algorithm?

For a current project, I want to use genetic algorithms - currently I had a look at the jenetics library.
How can I force some genes to be dependent on each other? I want to map CSS onto the genes; for example, I have genes indicating whether an image is displayed and, if it is, also its height and width. So I want to keep those genes together as a group, as it would make no sense if, after a crossover, the chromosome indicated something like "no image" - height 100px - width 0px.
Is there a method to do so? Or maybe another library (in java) which supports this?
Many thanks!

You want to embed more knowledge into your system to reduce the search space.
If it would be knowledge about the structure of the solution, I would propose taking a look at grammatical evolution (GE). Your knowledge appears to be more about valid combinations of codons, so GE is not easily applicable.
It might be possible to combine a few features into a single codon, but this may be undesirable and/or infeasible (e.g. due to the large number of possible combinations).
But in fact you don't have an issue here:
it's fine to have meaningless genotypes — they will be removed due to the selection pressure
it's fine to have meaningless codon sequences — it's called "bloat"; bloat is quite common to some evolutionary algorithms (usually discussed in the context of genetic programming) and is not strictly bad; fighting with bloat too much can reduce the search performance

If you know how your genome is encoded, that is, you know which sequences of genes form groups, then you could extend (since you mention jenetics) io.jenetics.MultiPointCrossover to avoid splitting groups. (Source code available on GitHub.)
It could be as simple as storing the ranges of genes which form groups; if one of the random cut indexes would split a group, adjust the index to the nearest end of the group. (Of course this would cause a statistically higher likelihood of cuts at the ends of groups; it would probably be better to generate a new random location until it doesn't intersect a group.)
But it's also valid (as Pete notes) to have genes which aren't meaningful (ignored) based on other genes; if the combination is anti-survival it will be selected out.
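The "generate a new random location until it doesn't intersect a group" idea could look roughly like the sketch below. It is plain Java and does not use the Jenetics API; the class name, the group layout and the cut convention (a cut at index c separates gene c-1 from gene c) are assumptions made for the illustration. Wiring it into a MultiPointCrossover subclass is left out.

```java
import java.util.Random;

// Illustration only: picks crossover cut indexes that never fall strictly
// inside a group of dependent genes. Not part of Jenetics.
final class GroupAwareCut {

    // groups[i] = {start, endExclusive}; genes in that range must stay together.
    static boolean splitsGroup(int cut, int[][] groups) {
        for (int[] g : groups) {
            if (cut > g[0] && cut < g[1]) {
                return true; // cutting here would separate genes of one group
            }
        }
        return false;
    }

    // Re-draws the cut until it is valid, which avoids the bias toward group
    // ends that snapping to the nearest boundary would introduce.
    // ASSUMES chromosomeLength >= 2 and that at least one valid cut exists,
    // otherwise this loop would never terminate.
    static int randomCut(int chromosomeLength, int[][] groups, Random random) {
        int cut;
        do {
            cut = 1 + random.nextInt(chromosomeLength - 1); // 1..length-1
        } while (splitsGroup(cut, groups));
        return cut;
    }

    public static void main(String[] args) {
        // Hypothetical layout: genes 2..4 are showImage, height, width.
        int[][] groups = { {2, 5} };
        System.out.println(randomCut(8, groups, new Random()));
    }
}
```

With the hypothetical layout above, cuts at 3 and 4 are rejected, while cuts at 2 and 5 (the group boundaries) remain allowed.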

Practical to use snippets as search suggest?

I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking instead of simply doing a search based on the term, then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.

I wrote the Chunked Element Range Indexes blog post, and found out last-minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques such as wildcard matching were suddenly much faster. That surprised me because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, especially if I tried introducing a wildcard search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do that, as I believe they'd add some overhead. Call cts:search or one of the other cts calls to match a lexicon. I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index, and since range indexes are all in memory, it is faster. Turn on all the wildcard indexes on your database if you can.
It should be called from a custom XQuery script in a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as stream-lined as possible to do most type-ahead scenarios correctly (that is, fast enough).
I'd suggest you find ways to whittle down the set of values in the range index to less than 100,000 so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if a user already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top. Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train-of-thought, so don't interrupt them if you're going to suggest search phrases that won't help them get what they want. I've seen that way too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
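The filter, rank and limit steps could be sketched like this on the application side. This is plain Java rather than XQuery and is not MarkLogic API; the ranking rule (prefix matches first, then shorter values, then alphabetical) is only an assumption for the example, and you'd want to replace it with whatever your usage measurements say is helpful.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustration only: post-processing of candidate values for type-ahead.
final class SuggestionSketch {

    static List<String> suggest(List<String> values, String typed, int limit) {
        String needle = typed.toLowerCase(Locale.ROOT);
        return values.stream()
                // keep values containing the typed text anywhere (like *typed*)
                .filter(v -> v.toLowerCase(Locale.ROOT).contains(needle))
                // hypothetical ranking: prefix matches first, then shorter values
                .sorted(Comparator
                        .comparing((String v) -> !v.toLowerCase(Locale.ROOT).startsWith(needle))
                        .thenComparingInt(String::length)
                        .thenComparing(Comparator.naturalOrder()))
                // never return a long list to the user
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = List.of("Element range index", "Range index", "Lexicon");
        System.out.println(suggest(values, "range", 5));
    }
}
```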
Hoping that helps!

You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons would produce suggestions based on tokenized character data, not entire values of elements (or json properties). It might be worth looking into that.
Alternatively, since you are mentioning wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wild-carded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual contents.
HTH!

How to exclude numbers from Lucene Indexing?

I am working on an information retrieval application, using Lucene 5.3.1 (latest as of now), I managed to index the terms from a text file and then search within it. The text file happens to contain chapter numbers like 2.1, 3.4.2 and so on and so forth.
The problem is that I don't need these numbers indexed, as I have no need to search for them, and I haven't been able to find out how to exclude certain terms from tokenizing. I know the Analyzer uses the StopWords set to exclude several terms, but it doesn't do anything with numbers as far as I know.

The simplest answer I can come up with: remove the numbers from the text before indexing. You can use regular expressions for that. This solution has one side effect: PositionIncrementAttribute will be calculated without those numbers, as they no longer appear in the text. This can break some of your PhraseQuery instances.
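A minimal sketch of the regular-expression approach, using only the JDK. The pattern is an assumption about what "chapter numbers" look like (digits optionally separated by dots, standing alone); numbers glued to letters such as "x86" are deliberately kept. Adjust it to your text.

```java
import java.util.regex.Pattern;

// Illustration only: strips standalone numbers before the text is indexed.
final class ChapterNumberStripper {

    // Matches standalone numbers such as "2", "2.1" or "3.4.2".
    // ASSUMPTION: digits attached to letters (e.g. "x86") should be kept.
    private static final Pattern NUMBER =
            Pattern.compile("(?<![\\p{L}\\p{N}.])\\d+(?:\\.\\d+)*(?![\\p{L}\\p{N}])");

    static String strip(String text) {
        String withoutNumbers = NUMBER.matcher(text).replaceAll(" ");
        // collapse the whitespace left behind by the removed numbers
        return withoutNumbers.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("2.1 Introduction to Lucene 3.4.2 Analyzers"));
    }
}
```

The cleaned string is what you would then hand to the indexer in place of the original text.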
Another option, as was already mentioned, is to write a custom TokenFilter that strips the numbers out. But you should remember:
to tune the Analyzer so that it does not split terms on dots. Otherwise 2.1 will become two terms instead of one, which again can cause problems with PhraseQuery;
to correctly change the value of PositionIncrementAttribute (increment it) when removing terms from the TokenStream.
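The position-increment bookkeeping from the second point can be shown without Lucene. The sketch below models a token stream as a plain list; the Token class and method names are hypothetical. In a real TokenFilter the same arithmetic would be applied through PositionIncrementAttribute, and Lucene's FilteringTokenFilter base class appears to do comparable bookkeeping, so it is worth checking before writing this by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a library-free model of dropping tokens while keeping
// the positions of the surviving terms intact.
final class NumberDroppingSketch {

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // ASSUMES the tokenizer kept "2.1" as one term (first point above).
    static boolean isNumber(String term) {
        return term.matches("\\d+(\\.\\d+)*");
    }

    // Drops numeric tokens and adds their increments to the next kept token.
    static List<Token> dropNumbers(List<Token> tokens) {
        List<Token> kept = new ArrayList<>();
        int skipped = 0;
        for (Token t : tokens) {
            if (isNumber(t.term)) {
                skipped += t.positionIncrement; // remember the gap
            } else {
                kept.add(new Token(t.term, t.positionIncrement + skipped));
                skipped = 0;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Token> in = List.of(new Token("chapter", 1), new Token("2.1", 1), new Token("indexing", 1));
        for (Token t : dropNumbers(in)) {
            System.out.println(t.term + " +" + t.positionIncrement);
        }
    }
}
```

Because the gap is preserved, an exact phrase query will not treat "chapter" and "indexing" as adjacent, which matches what the original text says.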

Search stem and exact words in Lucene 4.4.0

I've stored a Lucene document with a single TextField that contains words without stemming.
I need to implement a search program that allows users to search for both stemmed and exact words,
but since I've stored the words without stemming, a stem search cannot be done.
Is there a way to search for exact words and/or stemmed words in documents without
storing two fields?
Thanks in advance.

Indexing two separate fields seems like the right approach to me.
Stemmed and unstemmed text require different analysis strategies, and so require you to provide a different Analyzer to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers. That is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (heavier scoring on terms that are not touched by the stemmer, particularly).
There is no need to store the text in each of these fields, but it only makes sense to index them in separate fields.
You can apply a different analyzer to different fields by using a PerFieldAnalyzerWrapper, by the way. Like:
Map<String,Analyzer> analyzerList = new HashMap<String,Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);
If you really want to do it with a single field, though, I can see a couple of possibilities.
One would be to create your own stem filter, based on (or possibly extending) the one you wish to use already, and add in the ability to keep the original tokens after stemming. Mind your position increments, in this case. Phrase queries and the like may be problematic.
The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after manually stemming. Two fields added with the same name will be effectively concatenated. You'd want to store in a separate field, in this case. Expect wonky scoring.
Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.

Can I make Lucene return an unlimited number of search results?

I am using Lucene 3.0.1 in a Java 5 environment.
I've been researching this issue a little bit, but the documentation hasn't given any direct answers.
Using the search method
TopFieldDocs search(Weight weight, Filter filter, int nDocs, Sort sort)
I always need to provide a maximum number of search results nDocs.
What if I wanted to have all matching results? It feels like setting nDocs to Integer.MAX_VALUE is a kind of hacky way to do this (and would result in speed and memory performance drop?).
Anyone else who has any idea?

You are using a search method that returns the top n hits for a query.
There are other (more low-level) methods that do not have the limitation, and it says in the documentation that "applications should only use this if they need all of the matching documents. The high-level search API (search(Query, int)) is usually more efficient, as it skips non-high-scoring hits.".
So if you really need all documents, you can use the low-level API. I doubt that it makes a big difference in performance compared to passing a really high limit to the high-level API. If you need all documents (and there really are a lot of them), it is going to be slow either way, especially if sorting is involved.
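To illustrate the trade-off the documentation quote describes, here is a library-free sketch of the two collection styles. It is not Lucene code: the score array stands in for a query result (a score of 0 meaning "no match" is an assumption of the sketch), and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: "top n" versus "collect everything".
final class HitCollectionSketch {

    // "Top n" style: a min-heap bounded to n entries, so low-scoring hits are
    // discarded early and memory stays proportional to n.
    static List<Float> topScores(float[] scores, int n) {
        List<Float> top = new ArrayList<>();
        if (n <= 0) {
            return top;
        }
        PriorityQueue<Float> heap = new PriorityQueue<>(); // smallest on top
        for (float score : scores) {
            if (score <= 0f) {
                continue; // ASSUMPTION: score 0 means the document did not match
            }
            if (heap.size() < n) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();
                heap.add(score);
            }
        }
        top.addAll(heap);
        top.sort(Collections.reverseOrder());
        return top;
    }

    // "Collect everything" style: every matching document id is recorded,
    // unsorted, so memory grows with the number of matches.
    static List<Integer> allMatches(float[] scores) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < scores.length; docId++) {
            if (scores[docId] > 0f) {
                docIds.add(docId);
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0f, 0.9f, 0.5f};
        System.out.println(topScores(scores, 2));
        System.out.println(allMatches(scores));
    }
}
```

Either way, every match has to be visited once; the bounded heap only saves memory and sorting work when n is much smaller than the number of matches.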
