Does anyone know if it's possible to easily get query expansion terms programmatically using Java?
For example, when you do a Google search, at the bottom of the results page there is a "Searches related to [term]" section. Is there a way to harvest those terms? I feel like this would be easiest.
I don't want to create my own query expansion algorithm because of time constraints; I'd just like a quick and easy way to get the terms.
for example "fashion design" -> ["fashion design courses", "fashion design careers", "fashion design sketches"]
thank you in advance for any help
-MC
You could scrape Google's response to your query (by making an HTTP request), capture the content near the "Searches related to…" section, and extract the suggested terms.
Pro:
You use Google (without implementing your own query-expansion algorithm).
Con:
You need an internet connection.
The Wikipedia page for query expansion points to two Java implementations that might be helpful:
LucQE
LuceneQE
QueryParser.parse(string);
This simple yet powerful call uses Lucene's QueryParser to parse the string and returns a Query object that contains all the terms in a tree-like structure. Use it to convert a natural-language search query like content:(whatever) in:(inbox,sent) AND body:(something) NOT (nothing) into a tree that retains all the logical conditions (AND, OR, NOT, etc.) and is easy to traverse. Now your Java app can understand simple Google-like syntax. Of course, you also need a powerful back end to support such searches; Lucene can help you with that as well.
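As a toy illustration of that tree structure (plain-Java stand-ins with hypothetical names, not Lucene's actual Query/BooleanQuery classes), here is how such a parsed tree can be represented and traversed:

```java
import java.util.List;

// Hypothetical stand-ins for the kind of tree a query parser returns;
// Lucene's real Query/BooleanQuery classes are much richer than this.
interface Query {}
record Term(String field, String text) implements Query {}
record BoolQuery(String op, List<Query> clauses) implements Query {}

public class QueryTreeDemo {
    // Recursively walk the tree, rendering it back to a query string.
    static String render(Query q) {
        if (q instanceof Term t) {
            return t.field() + ":" + t.text();
        }
        BoolQuery b = (BoolQuery) q;
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < b.clauses().size(); i++) {
            if (i > 0) sb.append(' ').append(b.op()).append(' ');
            sb.append(render(b.clauses().get(i)));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // body:(something) AND content:(whatever)
        Query q = new BoolQuery("AND", List.of(
                new Term("body", "something"),
                new Term("content", "whatever")));
        System.out.println(render(q)); // (body:something AND content:whatever)
    }
}
```

The same recursive walk works for translating the tree into whatever your back end expects (SQL, a Lucene BooleanQuery, etc.).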
I came across this Python library, https://pypi.org/project/weighted-levenshtein/, which allows you to specify different costs/weights for the different operations (insertion, substitution, deletion, and transposition). This is very helpful for detecting and correcting keystroke errors.
I have been searching through Lucene's FuzzySearch, which uses Damerau–Levenshtein distance, to check whether it supports specifying different costs/weights for different operations, but I have not been able to find anything.
Please let me know if there is a way to specify custom costs/weights within Lucene's fuzzy search.
Thanks in advance!
To accomplish this you would have to extend and/or edit Lucene's code. To support fuzzy matching, Lucene compiles an automaton using the LevenshteinAutomata class, which implements this algorithm; it not only lacks support for edit weights, it also only supports matching for 0 to 2 edits.
How one might edit this algorithm to produce an automaton that supports weighted edits is beyond my knowledge, but it could be worth a try, as it would make your customization simple (you would only have to override the getAutomaton method) and would (theoretically) keep performance consistent.
The alternative would be to forgo the idea of an automaton to support fuzzy matching and simply implement a weighted levenshtein algorithm, like the one you have linked to, directly in the actual fuzzy match check. By doing this, however, you could pay a rather high performance cost depending on the nature of the fuzzy queries you handle and the content of your index.
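As a sketch of that alternative (plain Java, covering insertion/deletion/substitution only, with made-up example costs rather than anything taken from Lucene or the linked library), a weighted edit distance is just the classic dynamic program with per-operation costs:

```java
/**
 * Weighted Levenshtein distance sketch. The cost values below are
 * illustrative; a real keyboard model would assign a cost per character
 * pair. Transposition is omitted for brevity.
 */
public class WeightedLevenshtein {

    static final double INSERT_COST = 1.0;
    static final double DELETE_COST = 1.0;

    // Substitution cost: cheaper for an example adjacent-key typo.
    static double substituteCost(char a, char b) {
        if (a == b) return 0.0;
        if ((a == 'a' && b == 's') || (a == 's' && b == 'a')) return 0.5;
        return 1.0;
    }

    static double distance(String s, String t) {
        int n = s.length(), m = t.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + DELETE_COST;
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + INSERT_COST;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                d[i][j] = Math.min(
                        d[i - 1][j - 1] + substituteCost(s.charAt(i - 1), t.charAt(j - 1)),
                        Math.min(d[i - 1][j] + DELETE_COST,
                                 d[i][j - 1] + INSERT_COST));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("cat", "cst"));  // 0.5 (cheap a->s typo)
        System.out.println(distance("cat", "czt"));  // 1.0 (full-cost substitution)
    }
}
```

Running this check per candidate term is O(n*m) per comparison, which is exactly the performance cost mentioned above compared to Lucene's automaton approach.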
We use Hibernate Search 6 CR2 with Elasticsearch and Spring Boot 2.4.0. Is there any way to collapse duplicates in search results?
We tried to kind of "collapse" them like this:
searchResults = searchSession.search(Items.class)
        .select(f -> f.field(field.getCode(), String.class))
        .where(f -> f.phrase()
                .field(field.getCode())
                .matching(phrase)
                .slop(SLOP))
        .fetchHits(20)
        .stream()
        .distinct()
        .collect(Collectors.toList());
...but this approach only works on a small number of results (fewer than the fetchHits size) and only when there are not many identical hits. When we tried it on another index with thousands of hits (~28M docs), it did not work as expected because of the fetchHits limit: some search results that should have been returned were lost. And of course, the main issue is that this method doesn't deduplicate results during the search itself; the deduplication happens after the original search, so it's not the best solution.
Another solution was found here but it's a bit outdated and not an actual answer for our question.
On the Hibernate Search forums we found another solution for a similar task; we tried to implement it and it worked, but as a downside we doubled the number of index document fields (8 fields now instead of 4).
So, after all that: is it possible to tune Hibernate Search to collapse duplicates in search results without the help of these extra fields? If not, that's okay too; we'll remember this and use it as a solution in future cases.
P.S.: we are implementing a search-as-you-type prediction service, so it's not necessary to extract the original entities.
The solution you linked is the most straightforward way to get a list of all values in matched documents for a given field. It is what aggregations are for.
Yes, it requires additional fields. Generally speaking, you can't get performance out of thin air: to get a smaller execution time, you need to use more memory.
That being said, if what you want is suggestions, you should probably have a look at Elasticsearch's suggester feature.
There is no API for this in Hibernate Search (yet), so you will have to transform JSON in order to leverage this feature. It's relatively easy, and you even have an example for your very use case in the reference documentation (have a look at the second example).
Of course if you really want to use phrase queries, it's going to be more complicated. I'd suggest you have a look at the phrase suggester or maybe the completion suggester.
Should you need to register a field with a type that is not supported out of the box by Hibernate Search (e.g. completion), it's possible too: you will just need a custom bridge. See this example.
I have an application that searches text indexed in a MSSQL database. My current search functionality works fine. I just need to display the search results with the surrounding text of the search terms (like google does). The only tool I could find to do this is Lucene's text highlighting. I read about it from this question: Displaying sample text from the Lucene Search Results. I haven't looked into Lucene for very long, but I'm guessing I'd have to create documents for each search hit.
I was wondering if what I want to do is even possible with Lucene, and whether it'd be overkill to use a tool like this for my purpose. Are there any other tools I could/should use for this?
It depends on the size of the text you are trying to highlight, but if it is rather small, you could use Lucene highlighting functionality on top of your search backend. See Highlighter documentation for more information.
In case this is not fast enough for you (if you want to highlight large chunks of text, for example), Lucene can make highlighting faster by using term vectors, but this would require you to move your back end from MSSQL to Lucene.
If you can already get the surrounding text of the found keywords, and this is really the only thing you need, then yes, Lucene is overkill: just surround your keywords with highlighting tags.
However, in most cases, as time goes by, people start thinking about other advanced options, such as stemming (a search for "highlight" also finds "highlighting" and "highlighter"), synonym search, language detection, etc. If you think you may ever need such things, or if you don't already have an algorithm to find text snippets containing the keywords (the surrounding text), I highly recommend diving into the Lucene world. The best option I can think of is to index all your text fields from MSSQL and base all your text search on Lucene.
If you are put off by heavy Lucene coding, you may use Solr, a Lucene-based search server with an extremely wide range of capabilities, easily configured with XML files. Solr has both a simple web interface and a number of programming interfaces (SolrJ for Java).
It could be overkill: Lucene is a complete search/indexing engine with stemming, scoring, and other features. It's likely better than what you're doing now, but it depends on your goals.
If you're just doing simple keyword highlighting, consider a regex to insert highlighting tags.
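A minimal sketch of that regex approach (the tag and names here are just illustrative), wrapping each case-insensitive whole-word match of the query term in highlighting tags:

```java
import java.util.regex.Pattern;

// Regex-based keyword highlighting: no search library required.
public class SimpleHighlighter {
    static String highlight(String text, String keyword) {
        // \b anchors to word boundaries; Pattern.quote escapes the keyword.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                Pattern.CASE_INSENSITIVE);
        // $0 refers to the whole match, preserving its original casing.
        return p.matcher(text).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene makes search easy.", "search"));
        // Lucene makes <b>search</b> easy.
    }
}
```

Note that this does exact matching only; stemming and synonyms are precisely where a regex stops being enough and Lucene starts paying off.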
We are going to provide an advanced search option on a system that will let users find events that match a name (textual search), have one or more tags assigned to them, and start before or after a given date. Should I consider using Hibernate Search or something similar, or should I just write some JPQL queries to get that search feature working?
Use Hibernate Search; that is what it is there for, and you will get better performance.
Trying to construct the queries yourself over the criteria you mentioned:
name
date
date range
tag
plus support for boolean queries
is just too complex.
I'd suggest taking a good look at Hibernate Search so you can leverage the power of Lucene.
If you have relatively simple search requirements initially, perhaps implementing it yourself won't be so bad, but as you add features and scale up your search, you will have to write more and more code and the feature will become more complicated. Why not reuse a powerful, well-known library that already does all of this (and more)?
I need to index a lot of text. The search results must give me the name of the files containing the query and all of the positions where the query matched in each file - so, I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?
update: Lucene has been suggested. Can you give me some info on how should I use Lucene to achieve this? (I have seen examples where the search query returned only the matching files)
For Java, try Lucene.
I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word-position information in order to get the snippets you are looking for. The Token API may help.
It all depends on how you are going to access it, and of course on how many users are going to access it. Read up on MapReduce.
If you are going to roll your own, you will need to create an index file, which is essentially a map from unique words to tuples like (file, line, offset). Of course, you can also consider other in-memory data structures such as a trie (prefix tree), a Judy array, and the like...
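A toy sketch of such an inverted index (in-memory plain Java, nothing production-grade): each word maps to the (file, line, offset) positions where it occurs, so matches can be reported without re-reading the files:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal in-memory inverted index: word -> list of (file, line, offset).
public class InvertedIndex {
    record Posting(String file, int line, int offset) {}

    private static final Pattern WORD = Pattern.compile("\\w+");
    private final Map<String, List<Posting>> index = new HashMap<>();

    // Tokenize one line of a file and record every word occurrence.
    void addLine(String file, int lineNo, String text) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            index.computeIfAbsent(m.group().toLowerCase(Locale.ROOT),
                    k -> new ArrayList<>()).add(new Posting(file, lineNo, m.start()));
        }
    }

    // Return every position of the word, across all indexed files.
    List<Posting> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addLine("a.txt", 1, "the quick brown fox");
        idx.addLine("b.txt", 3, "a quick test");
        System.out.println(idx.search("quick"));
    }
}
```

A real implementation would persist this map to disk and stream postings lazily; this is what Lucene's index files do, far more efficiently.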
Some 3rd party solutions are listed here.
Have a look at http://www.compass-project.org/. It can be seen as a wrapper on top of Lucene: Compass simplifies common usage patterns of Lucene such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also has built-in optimizations for concurrent commits and merges.
The Overview can give you more info
http://www.compass-project.org/overview.html
I integrated this into a Spring project in no time. It is really easy to use and gives your users what they will perceive as Google-like results.
Lucene - Java
It's open source as well, so you are free to use and deploy it in your application.
As far as I know, the Eclipse IDE's help system is powered by Lucene, so it is tested by millions.
Also take a look at Lemur Toolkit.
Why don't you try constructing a state machine by reading all the files? Transitions between states are letters, and states are either final (some files contain the word in question, in which case the list of files is available there) or intermediate.
As for multiple-word lookups, you'll have to handle each word independently before intersecting the results.
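A small sketch of that intersection step (plain Java, with hypothetical names): look up each word's file set independently, then keep only the files that contain every word:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Multi-word lookup as set intersection over per-word posting sets.
public class IntersectDemo {
    static Set<String> matchAll(Map<String, Set<String>> wordToFiles, String... words) {
        Set<String> result = null;
        for (String w : words) {
            Set<String> files = wordToFiles.getOrDefault(w, Set.of());
            if (result == null) {
                result = new HashSet<>(files);   // first word seeds the result
            } else {
                result.retainAll(files);         // keep only common files
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = Map.of(
                "fashion", Set.of("a.txt", "b.txt"),
                "design", Set.of("b.txt", "c.txt"));
        System.out.println(matchAll(index, "fashion", "design")); // [b.txt]
    }
}
```

For large posting lists you would intersect the smallest set first to cut the work down early.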
I believe the Boost::Statechart library may be of some help for that matter.
I'm aware you asked for a library, just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).