Nesting searches in Lucene without duplicating keywords - java

I want to set up a search in Lucene (actually Lucene.NET, but I can convert from Java as necessary) using the following logic:
1. Search string is: A B C
2. Search one field in the index for anything that matches A, B, or C. (Query: (field1:A field1:B field1:C))
3. For each term that didn't match in step 2, search a second field for it while keeping the results from the first search (Query: (+(field1:A) +(field2:B field2:C)))
4. For each term that didn't match in step 3, search a third field...
5. Continue until running out of fields, or until there's a search which has used every term.
Currently, my code can test whether a given search produces NO results, and ANDs together all the ones that do produce results. But I have no way to stop it before it tests against every field (which unnecessarily limits the results) - it's currently ending up with a query like: (+(field1:A field1:B field1:C) +(field3:A field3:B field3:C)) when I want it to be (+(field1:A field1:C) +(field3:B)). I can't just look at the results from the first search and remove words from the search string because the Analyzer mangles the words when it parses it for search, and I have no way to un-mangle them to figure out which of the original search terms it corresponds to.
Any suggestions?
Edit:
Ok, generally I prefer describing my problems in the abstract, but I think some part of it is getting lost in the process, so I'll be more concrete.
I'm building a search engine for a site which needs to have several layers of search logic. A few example searches which I'll trace out are:
Headphones
Monster Headphones
White Monster Headphones
White Foobar Headphones
The index contains documents with seven fields - the relevant ones to this example are:
"datattype": A string representing what type of item this document represents (product, category, brand), so we know how to display it
"brand": The brand(s) that are relevant (categories have multiple brands, products and brands have one each)
"path": The path to a given category (i.e. "Audio Headphones In-Ear" for "Audio > Headphones > In-Ear")
"keywords": Various things that describe the product that don't go anywhere else.
In general, the logic for each step of the search is as follows:
Check to see if we have a match.
If so, filter the results based on that match, and continue parsing the rest of the search terms in the next step.
If not, parse the search terms in the next step.
Each step is something like:
Search for a category
Search for a brand
Search for keywords
So here's how those three example searches should play out:
Headphones
Search for a category: +path:headphones +datatype:Category
There are matches (the Headphone category), and no words from the original query are left, so we return it.
Monster Headphones
Search for a category: +(path:monster path:headphones) +datatype:Category
Matches were found for path:headphones and datatype:Category, leaving "Monster" unmatched
Search for a brand: +path:headphones +brand:monster
Matches were found for path:headphones and brand:monster, and no words from the original query are left, so we return all the headphones by Monster.
White Monster Headphones
Search for a category: +(path:monster path:headphones path:white) +datatype:Category
Matches were found for path:headphones and datatype:Category, leaving "White" and "Monster" unmatched
Search for a brand: +path:headphones +(brand:monster brand:white)
Matches were found for path:headphones and brand:monster, leaving "White" unmatched
Search keywords: +path:headphones +brand:monster +keywords:white
There are matches, and no words from the original query are left, so we return them.
White Foobar Headphones
Search for a category: +(path:foobar path:headphones path:white) +datatype:Category
Matches were found for path:headphones and datatype:Category, leaving "White" and "Foobar" unmatched
Search for a brand: +path:headphones +(brand:foobar brand:white)
Nothing was found, so we continue.
Search keywords: +path:headphones +(keywords:white keywords:foobar)
Matches were found for path:headphones and keywords:white, leaving "Foobar" unmatched
... (continue searching other fields, including product description) ...
There are search terms still unmatched ("Foobar"), so we return "No results found"
The problem I have is twofold:
I don't want the search to continue once everything's matched (only products have descriptions, so once it reaches that step we'll never return anything that's not a product). I could manage this by using denis's GetHitTerms from here, except that I then end up searching for the first matched term in all subsequent fields until everything matches (i.e. in example #2, I'd end up with +path:headphones +(brand:headphones brand:monster)).
Despite my example above, my actual search query on the path field looks like +path:headphon +datatype:Taxonomy because I'm mangling it for searching. So I can't take the matched term and just remove that from the original query (because "headphon" != "headphones").
Hopefully that makes it clearer what I'm looking for.

I don't understand your use case, but you sound like you're asking about the BooleanQuery API. You can get the clauses of your query by calling getClauses.
A simple example:
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("field1", "a")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("field1", "b")), BooleanClause.Occur.SHOULD);
BooleanClause[] clauses = bq.getClauses();
EDIT: maybe you're just asking for a search algorithm. In pseudocode:
generate_query(qs_that_matched, qs_that_didnt_match, level):
    new_query = qs_that_matched AND level:qs_that_didnt_match
    qs_still_unmatched = ...
    qs_which_just_matched = ...
    if qs_still_unmatched is not empty:
        return generate_query(qs_that_matched AND qs_which_just_matched, qs_still_unmatched, level + 1)
    else:
        return qs_that_matched AND qs_which_just_matched
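For concreteness, here is one loose Java rendering of that pseudocode. It is a sketch only: fields[] (the field names to try, in order) and hitTerms() (returning which of the original terms actually produced hits, e.g. via the GetHitTerms approach mentioned in the question) are hypothetical helpers you would need to supply.

Query generateQuery(Query matchedSoFar, Set<String> unmatched, int level) {
    // OR the still-unmatched terms against the field for this level
    BooleanQuery levelQuery = new BooleanQuery();
    for (String term : unmatched) {
        levelQuery.add(new TermQuery(new Term(fields[level], term)), BooleanClause.Occur.SHOULD);
    }
    // AND it with everything that has matched so far
    BooleanQuery candidate = new BooleanQuery();
    if (matchedSoFar != null) {
        candidate.add(matchedSoFar, BooleanClause.Occur.MUST);
    }
    candidate.add(levelQuery, BooleanClause.Occur.MUST);

    Set<String> justMatched = hitTerms(candidate, unmatched); // hypothetical helper
    Set<String> stillUnmatched = new HashSet<String>(unmatched);
    stillUnmatched.removeAll(justMatched);

    // recurse only while terms remain unmatched and fields remain to try
    if (!stillUnmatched.isEmpty() && level + 1 < fields.length) {
        return generateQuery(candidate, stillUnmatched, level + 1);
    }
    return candidate;
}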

In the end, I built a QueryTree class and stored the queries in a tree structure. Each node stores a reference to a function that builds a query, a list of terms to pump into that query, whether it should AND or OR those terms, and a list of children (which represent unique combinations of matching terms).
To perform the next level of searching, I just call Evaluate(Func<string, QueryParser.Operator, Query> newQuery) on the deepest nodes in my tree, with a reference to a function which takes terms and an operator and returns the correct Query for that set of logic. The Evaluate function then tests that new query against the list of unmatched terms that have been passed down to it and the result sets of all ancestral queries (by ANDing with the parent, which ANDs with its parent and so on). It then creates children for each set of matching terms, using GetHitTerms, and gives the unmatched terms to the child. Repeat for each level of search.
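A rough Java sketch of the node structure described (the original is C#/Lucene.NET; the names here are illustrative, not the actual implementation):

import java.util.ArrayList;
import java.util.List;

class QueryTree {
    Query query;                 // the query this node contributed
    List<String> unmatchedTerms; // terms still unmatched at this node
    QueryTree parent;
    List<QueryTree> children = new ArrayList<QueryTree>();

    // AND this node's query with every ancestor's query, walking up the tree.
    Query ancestralQuery() {
        BooleanQuery combined = new BooleanQuery();
        for (QueryTree n = this; n != null; n = n.parent) {
            if (n.query != null) {
                combined.add(n.query, BooleanClause.Occur.MUST);
            }
        }
        return combined;
    }
}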
I suspect that there's a better way to do this - I didn't even look into Bobo that Xodarap mentioned, and I never really got faceted searching (as per denis) working. However, it's working, which means it's time to move on to other aspects of the site.

Related

How to handle synonyms and stop words when building a fuzzy query with Hibernate Search Query DSL

Using Hibernate Search (5.8.2.Final) Query DSL against an Elasticsearch server.
Given a field analyzer that does lowercase, standard stop-words, then a custom synonym with:
company => co
and finally, a custom stop-word:
co
And we've indexed a vendor name: Great Spaulding Company, which boils down to 2 terms in Elasticsearch after synonyms and stop-words: great and spaulding.
I'm trying to build my query so that each term 'must' match, fuzzy or exact, depending on the term length.
I get the results I want except when one of the terms happens to be a synonym or stop-word and is long enough that my code adds fuzziness to it, like company~1. In that case it is no longer treated as a synonym or stop-word, and my query returns no match, since 'company' was never stored in the first place: it becomes 'co' and is then removed as a stop word.
Time for some code. It may seem a bit hacky, but I've tried numerous ways and using simpleQueryString with withAndAsDefaultOperator and building my own phrase seems to get me the closest to the results I need (but I'm open to suggestions). I'm doing something like:
// assume a passed-in search String of "Great Spaulding Company"
// (assumed imports: java.util.*, java.util.stream.Collectors, com.google.common.collect.Lists)
String vendorName = "Great Spaulding Company";
List<String> vendorNameTerms = Arrays.asList(vendorName.split(" "));
List<String> qualifiedTerms = Lists.newArrayList();
vendorNameTerms.forEach(term -> {
    int editDistance = getEditDistance(term); // 1..5 = 0, 6..10 = 1, > 10 = 2
    int prefixLength = getPrefixLength(term); // appears of no use with simpleQueryString
    String fuzzyMarker = editDistance > 0 ? "~" + editDistance : "";
    qualifiedTerms.add(String.format("%s%s", term, fuzzyMarker));
});
// join my terms back together with their optional fuzziness marker
String phrase = qualifiedTerms.stream().collect(Collectors.joining(" "));
// qb is the Hibernate Search QueryBuilder, bool a BooleanJunction from surrounding code
bool.should(
    qb.simpleQueryString()
        .onField("vendorNames.vendorName")
        .withAndAsDefaultOperator()
        .matching(phrase)
        .createQuery()
);
As I said above, I'm finding that as long as I don't add any fuzziness to a possible synonym or stop-word, the query finds a match. So these phrases return a match:
"Great Spaulding~1" or "Great Spaulding~1 Co" or "Spaulding Co"
But since my code doesn't know which terms are synonyms or stop-words, it blindly looks at term length (oh, 'Company' is longer than 5 characters, make it fuzzy) and builds these sorts of phrases, which do NOT return a match:
"Great Spaulding~1 Company~1" or "Great Company~1"
Why is Elasticsearch not processing Company~1 as a synonym?
Any idea on how I can make this work with simpleQueryString or
another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?
[Edit] The same issue happens with punctuation that my analyzer would normally remove. I cannot include any punctuation in the fuzzy search string in my query because the ES analyzer doesn't seem to process it the way it would a non-fuzzy term, and I don't get a match result.
Example based on the above search string: Great Spaulding Company., gets built in my code into the phrase Great Spaulding~1 Company.,~1 and ES doesn't remove the punctuation or recognize the synonym word Company.
I'm going to try a hack of calling the ES _analyze REST API so that it tells me what tokens I should include in the query, although this will add overhead to every query I build. Something like http://localhost:9200/myEntity/_analyze?analyzer=vendorNameAnalyzer&text=Great Spaulding Company., which produces 3 tokens: great, spaulding and company.
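A minimal sketch of that workaround (assuming ES 5.x, where _analyze also accepts a JSON body; the index and analyzer names are the ones from the URL above):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

HttpURLConnection conn = (HttpURLConnection)
        new URL("http://localhost:9200/myEntity/_analyze").openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "application/json");
conn.setDoOutput(true);
String body = "{\"analyzer\": \"vendorNameAnalyzer\", \"text\": \"Great Spaulding Company.,\"}";
try (OutputStream os = conn.getOutputStream()) {
    os.write(body.getBytes(StandardCharsets.UTF_8));
}
int status = conn.getResponseCode(); // fire the request
// The response JSON has a "tokens" array (great, spaulding, company);
// those are the tokens to build the fuzzy clauses from.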
Why is Elasticsearch not processing Company~1 as a synonym?
I'm going to guess it's because fuzzy queries are "term-level" queries, which means they operate on exact terms instead of analyzed text. If your term, once analyzed, resolves to multiple tokens, I don't think it would be easy to define an acceptable behavior for a fuzzy query.
There's a more detailed explanation there (I believe it still applies to the Lucene version used in Elasticsearch 5.6).
Any idea on how I can make this work with simpleQueryString or another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?
You could try reversing your synonym: use co => company instead of company => co, so that a query such as compayn~1 will match even if "compayn" is not analyzed. But that's not a satisfying solution, of course, since other examples requiring analysis still won't work, such as Company~1.
Below are alternative solutions.
Solution 1: "match" query with fuzziness
This article describes a way to perform fuzzy searches, and in particular explains the difference between several types of fuzzy queries.
Unfortunately it seems that fuzzy queries in "simple query string" queries are translated into the type of query that does not perform analysis.
However, depending on your requirements, the "match" query may be enough. In order to access all the settings provided by Elasticsearch, you will have to fall back to native query building:
QueryDescriptor query = ElasticsearchQueries.fromJson(
        "{ 'query': {"
        + "'match' : {"
        + "'vendorNames.vendorName': {"
        // Note that using a proper JSON framework would be better here, to avoid
        // problems with quotes in the terms
        + "'query': '" + userProvidedTerms + "',"
        + "'operator': 'and',"
        + "'fuzziness': 'AUTO'"
        + "}"
        + "}"
        + " } }"
);
List<?> result = session.createFullTextQuery( query ).list();
See this page for details about what "AUTO" means in the above example.
Note that until Hibernate Search 6 is released, you can't mix native queries like shown above with the Hibernate Search DSL. Either you use the DSL, or native queries, but not both in the same query.
Solution 2: ngrams
In my opinion, your best bet when the queries originate from your users, and those users are not Lucene experts, is to avoid parsing the queries altogether. Query parsing involves (at least in part) text analysis, and text analysis is best left to Lucene/Elasticsearch.
Then all you can do is configure the analyzers.
One way to add "fuzziness" with these tools would be to use an NGram filter. With min_gram = 3 and max_gram = 3, for example:
An indexed string such as "company" would be indexed as ["com", "omp", "mpa", "pan", "any"]
A query such as "compayn", once analyzed, would be translated to (essentially) com OR omp OR mpa OR pay OR ayn
Such a query would potentially match a lot of documents, but when sorting by score, the document for "Great Spaulding Company" would come up to the top, because it matches almost all of the ngrams.
I used parameter values min_gram = 3 and max_gram = 3 for the example, but in a real world application something like min_gram = 3 and max_gram = 5 would work better, since the added, longer ngrams would give a better score to search terms that match a longer part of the indexed terms.
Of course if you can't sort by score, or if you can't accept too many trailing partial matches in the results, then this solution won't work for you.
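For illustration, a sketch of how such an ngram analyzer could be declared with Hibernate Search 5 annotations (the analyzer name and gram sizes are illustrative; you would still reference it from the indexed field via @Analyzer):

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.*;

@AnalyzerDef(name = "ngramAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class, params = {
                        @Parameter(name = "minGramSize", value = "3"),
                        @Parameter(name = "maxGramSize", value = "3")
                })
        })
public class VendorName {
    // annotate the field with @Field(analyzer = @Analyzer(definition = "ngramAnalyzer"))
}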

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list, where of course, the best matching result is at the first position, but the next results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which, of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure all words appear, regardless of which field they appear in, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach is to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
From Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience defaultOperator behaves the same as an operator specified in the query; I can't say why, that's just what I've observed.
Please try the query with the param {!q.op=AND}
I'm guessing you're using the default query parser; correct me if I'm wrong.
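For reference, that LocalParams prefix goes directly into the q parameter, e.g. (illustrative query):

q={!q.op=AND}berlin house john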

Lucene search on a Hibernate List field

I have a Hibernate annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example:
"Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, If a user searches on TestClass for "Foo Snafu" then the record will be found, I am guessing because the token Foo and the token Snafu are both tokens in the List<String> for this record.
Is there a way I can prevent this from happening?
The real world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for Lewis Smith, or Joe Clay.
Update: It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
Update 2: It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".
Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields. You could just append the index to the field name you are already using. While I don't have complete information on your needs, of course, doing something like this is probably the better solution.
If you just want to search for the precise text "Foo Snafu", you can use a PhraseQuery. If you want to be sure your PhraseQuery doesn't cross from one list item to the next (i.e., if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.
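Along the lines of the positionIncrementGap idea from Update 2, here is a sketch of the plain-Lucene equivalent (assuming Lucene 3.x; the gap of 100 is arbitrary, just larger than any phrase slop you would use, and defendants is a placeholder list):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.util.Version;

class GapAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_36);

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }

    @Override
    public int getPositionIncrementGap(String fieldName) {
        return 100; // phrase queries can no longer match across list entries
    }
}

// Each list member is added as a separate value of the same field:
Document doc = new Document();
for (String defendant : defendants) {
    doc.add(new Field("defendant", defendant, Field.Store.YES, Field.Index.ANALYZED));
}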

Adding words to query phrase should filter results in Lucene

I'm going to bounty +100 this question when possible, even if it's already answered and accepted
I'm using Lucene 3.2, here's what I have in my index and code:
More than 10 fields per each indexed document.
OR operator between the query terms (i.e.: "my lucene search" becomes "my OR lucene OR search").
MultiFieldQueryParser with Occur.SHOULD in all fields.
A specific default field containing all other fields (as proposed in this solution: How to do a Multi field - Phrase search in Lucene?).
What am I trying to reach? A sort of Google-like search, let me explain:
Search in all fields
Scored results (with boost for specific fields, etc.)
Adding words to the query phrase should filter results
I'm reaching every aspect but this last one. My problems are the following:
If I search only in the default field containing all other fields, I don't get well-scored results
Searching only with the AND operator I get overly filtered results, only the ones that have the whole query phrase in one field.
Searching only with the OR operator works perfectly with just one word in the query, but when adding more words to the query phrase, results increase significantly instead of being filtered down (the way Google does it).
I don't know how to filter one query's results by another query
This is my actual call to the query parser:
MultiFieldQueryParser.parse(
    Version.LUCENE_31,
    OrQueryWords,  // query words separated with the OR operator
    searchFields,  // String[] searchFields; all fields
    occurs,        // Occur[] occurs; {Occur.SHOULD, Occur.SHOULD, etc.}
    getFullTextSession().getSearchFactory().getAnalyzer(Product.class)
);
The toString() of this query prints something like this:
(field1:"word1 word2" (field1:word1 field1:word2)) (field2:"word1 word2" (...)) etc.
Right now I'm trying to add the default field (the one containing all other fields) with query words separated with the AND operator and Occur.MUST:
MultiFieldQueryParser.parse(
    Version.LUCENE_31,
    AndQueryWords,  // query words separated with the AND operator
    new String[] {"defaultField"},
    new Occur[] {Occur.MUST},
    getFullTextSession().getSearchFactory().getAnalyzer(Product.class)
);
The toString() of this query prints this:
+(default:"word1 word2" (+default:word1 +default:word2))
How can I intersect both queries? Is there any other solution to reach it?
I am not sure I understand exactly what you want to achieve, so I am going to give you a few hints on how to customize your scoring when dealing with multi-field, multi-term queries.
Intersection of two queries
You seem to be happy with the result set of your conjunctive query on the default field, and with the scoring of your disjunctive query on all fields. You can get the best of both worlds by using the latter as your main query and the former as a filter.
For example:
Query mainQuery, filterQuery; // assume both have been built already
BooleanQuery query = new BooleanQuery();
// add the main query for scoring
query.add(mainQuery, Occur.SHOULD);
// prevent the filter query from participating in the scoring
filterQuery.setBoost(0);
// make the filter query required
query.add(filterQuery, Occur.MUST);
Minimum should match clauses
If AND-ing all clauses is too restrictive, and OR-ing all clauses is not restrictive enough, then you could do something in between by setting the minimum number of SHOULD clauses that must match so that a document appears in the resultset.
Then the difficult part is to find the right formula to compute the minimum number of SHOULD clauses which must match for optimal user experience.
For example, let's say you want the ceil of 3/4 of the SHOULD clauses to match. Starting with a two-clause query and adding clauses up to 5 clauses would yield the following evolution of the number of results.
2 terms => ceil(2 * 3 / 4) = 2: all clauses must match
3 terms => ceil(3 * 3 / 4) = 3: 3/3 clauses must match (the new clause is required, fewer results)
4 terms => ceil(4 * 3 / 4) = 3: 3/4 clauses must match (one of the clauses is optional, more results)
5 terms => ceil(5 * 3 / 4) = 4: 4/5 clauses must match (maybe more, maybe fewer results, depending on the co-occurrences of the new term with the first 4)
Anyway, with this feature, the only way for the number of results to shrink as the number of clauses increases is to have a purely conjunctive query.
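A sketch of how that could look with the Lucene 3.x BooleanQuery API (queryWords and the field name are placeholders):

BooleanQuery query = new BooleanQuery();
for (String word : queryWords) {
    query.add(new TermQuery(new Term("defaultField", word)), BooleanClause.Occur.SHOULD);
}
// require ceil(3/4) of the SHOULD clauses to match
int required = (int) Math.ceil(query.clauses().size() * 3.0 / 4.0);
query.setMinimumNumberShouldMatch(required);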
The approach I've used for solving a similar problem is based on limiting the number of results by score.
Unfortunately, Lucene doesn't provide such a feature out of the box, and they also discourage this approach (http://wiki.apache.org/lucene-java/ScoresAsPercentages). The main concern is that a score's absolute value is meaningless.
I used the score's relative value for filtering: I picked the highest score, then calculated the minimal accepted score from it (let's say maxScore / 5) and kept only those results which satisfied this criterion.
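In Lucene 3.x terms, a minimal sketch of that filtering (the divisor 5 mirrors the maxScore / 5 heuristic above):

TopDocs topDocs = searcher.search(query, 1000);
float minScore = topDocs.getMaxScore() / 5;
List<ScoreDoc> kept = new ArrayList<ScoreDoc>();
for (ScoreDoc sd : topDocs.scoreDocs) {
    if (sd.score >= minScore) {
        kept.add(sd); // keep only hits within 1/5 of the top score
    }
}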

Advice on how to improve a current fuzzy search implementation

I'm currently working on implementing a fuzzy search for a terminology web service and I'm looking for suggestions on how I might improve the current implementation. It's too much code to share, but I think an explanation might suffice to prompt thoughtful suggestions. I realize it's a lot to read but I'd appreciate any help.
First, the terminology is basically just a number of names (or terms). For each term, we split it into tokens by space and then iterate through each character to add it to the trie. On a terminal node (such as when the character y in strawberry is reached) we store in a list an index into the master term list. So a terminal node can have multiple indices (since the terminal node for strawberry will match 'strawberry' and 'allergy to strawberry').
As for the actual search, the search query is also broken up into tokens by space. The search algorithm is run for each token. The first character of the search token must be a match (so traw will never match strawberry). After that, we go through the children of each successive node. If there is a child with a character that matches, we continue the search with the next character of the search token. If a child does not match the given character, we look at its children using the current character of the search token (so not advancing it). This is the fuzziness part, so 'stwb' will match 'strawberry'.
When we reach the end of the search token, we search through the rest of the trie structure at that node to get all potential matches (since the indices into the master term list are only on the terminal nodes). We call this the roll up. We store the indices by setting their bits on a BitSet. Then, we simply AND together the BitSets from each search token's results. We then take, say, the first 1000 or 5000 indices from the ANDed BitSets and find the actual terms they correspond to. We use Levenshtein to score each term and then sort by score to get our final results.
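A condensed sketch of that walk as described (the TrieNode shape and the recursion are my reading of the description, not the actual code):

import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    List<Integer> termIndices = new ArrayList<Integer>(); // filled on terminal nodes
}

// Consume the search character when a child matches it, otherwise descend one
// level without consuming it; at the end of the token, roll up everything below.
void search(TrieNode node, String token, int pos, BitSet hits) {
    if (pos == token.length()) {
        rollUp(node, hits);
        return;
    }
    char c = token.charAt(pos);
    for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
        search(e.getValue(), token, e.getKey() == c ? pos + 1 : pos, hits);
    }
}

void rollUp(TrieNode node, BitSet hits) {
    for (int idx : node.termIndices) {
        hits.set(idx);
    }
    for (TrieNode child : node.children.values()) {
        rollUp(child, hits);
    }
}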
This works fairly well and is pretty fast. There are over 390k nodes in the tree and over 1.1 million actual term names. However, there are problems with this as it stands.
For example, searching for 'car cat' will return Catheterization, when we don't want it to (since the search query is two words, the result should have at least two). That would be easy enough to check, but it doesn't take care of a situation like Catheterization Procedure, since that is two words. Ideally, we'd want it to match something like Cardiac Catheterization.
Based on the need to correct this, we came up with some changes. For one, we go through the trie in a mixed depth/breadth search. Essentially we go depth first as long as a character matches. Those child nodes that didn't match get added to a priority queue. The priority queue is ordered by edit distance, which can be calculated while searching the trie (since if there's a character match, distance remains the same and if not, it increases by 1). By doing this, we get the edit distance for each word.
We are no longer using the BitSet. Instead, it's a map from the index to a TermInfo object. This object stores the index of the query phrase and the term phrase and the score. So if the search is "car cat" and a term matched is "Catheterization procedure" the term phrase indices will be 1 as will the query phrase indices. For "Cardiac Catheterization" the term phrase indices will be 1,2 as will the query phrase indices. As you can see, it's very simple afterward to look at the count of term phrase indices and query phrase indices and if they aren't at least equal to the search word count, they can be discarded.
After that, we add up the edit distances of the words, remove the words from the term that match the term phrase index, and count the remaining letters to get the true edit distance. For example, if you matched the term "allergy to strawberries" and your search query was "straw" you would have a score of 7 from strawberries, then you'd use the term phrase index to discard strawberries from the term, and just count "allergy to" (minus the spaces) to get the score of 16.
This gets us the accurate results we expect. However, it is far too slow. Where before we could get 25-40 ms on a one-word search, now it could be as much as half a second. It's largely from things like instantiating TermInfo objects, using .add() operations, .put() operations and the fact that we have to return a large number of matches. We could limit each search to return only 1000 matches, but there's no guarantee that the first 1000 results for "car" would match any of the first 1000 matches for "cat" (remember, there are over 1.1 million terms).
Even for a single query word, like cat, we still need a large number of matches. This is because if we search for 'cat' the search is going to match car and roll up all the terminal nodes below it (which will be a lot). However, if we limited the number of results, it would place too heavy an emphasis on words that begin with the query and not the edit distance. Thus, words like catheterization would be more likely to be included than something like coat.
So, basically, are there any thoughts on how we could handle the problems that the second implementation fixed, but without as much of the speed slow down that it introduced? I can include some selected code if it might make things clearer but I didn't want to post a giant wall of code.
Wow... tough one.
Well, why don't you use Lucene? It is the current state of the art when it comes to problems like yours, AFAIK.
However I want to share some thoughts...
Fuzziness isn't something like straw*; rather, it covers the mistyping of words, where every missing/wrong character adds 1 to the distance.
It's generally very, very hard to have partial matching (wildcards) and fuzziness at the same time!
Tokenizing is generally a good idea.
Everything also heavily depends on the data you get. Are there spelling mistakes in the source files or only in the search queries?
I have seen some pretty nice implementations using multi dimensional range trees.
But I really think if you want to accomplish all of the above you need a pretty neat combination of a graph set and a nice indexing algorithm.
You could for example use a semantic database like Sesame and, when importing your documents, import every token and document as a node. Then, depending on position in the document etc., you can add a weighted relation.
Then you need the tokens in some structure where you can do efficient fuzzy matches, such as BK-trees.
I think you could index the tokens in a MySQL database and do bitwise comparison functions to get differences. There's a function that returns all matching bits; if you transliterate your strings to ASCII and group the bits, you could achieve something pretty fast.
However, if you matched the tokens to the string, you can construct a hypothetical perfect-match entity and query your semantic database for the nearest neighbours.
You would have to break the words apart into partial words when tokenizing to achieve partial matches.
However, you can also do wildcard matches (prefix, suffix or both), but no fuzziness then.
You can also index the whole word or different concatenations of tokens.
However, there may be special BK-tree implementations that support this, but I have never seen one.
I did a number of iterations of a spelling corrector ages ago, and here's a recent description of the basic method. Basically the dictionary of correct words is in a trie, and the search is a simple branch-and-bound. I used a repeated depth-first trie walk, bounded by Levenshtein distance: since each additional increment of distance results in much more of the trie being walked, the cost, for small distance, is basically exponential in the distance, so going to a combined depth/breadth search doesn't save much but makes it a lot more complicated.
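For reference, the usual shape of that bounded walk keeps one Levenshtein DP row per trie level and prunes when the row minimum exceeds the bound. A sketch, reusing the TrieNode shape from the sketch further up; start it once per root child with prevRow = {0, 1, ..., word.length()}:

void walk(TrieNode node, char letter, String word, int[] prevRow,
          int maxDist, StringBuilder prefix, List<String> results) {
    prefix.append(letter);
    int cols = word.length() + 1;
    int[] row = new int[cols];
    row[0] = prevRow[0] + 1;
    int rowMin = row[0];
    for (int i = 1; i < cols; i++) {
        int insertOrDelete = Math.min(row[i - 1], prevRow[i]) + 1;
        int replace = prevRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
        row[i] = Math.min(insertOrDelete, replace);
        rowMin = Math.min(rowMin, row[i]);
    }
    if (row[cols - 1] <= maxDist && !node.termIndices.isEmpty()) {
        results.add(prefix.toString()); // a dictionary word within maxDist
    }
    if (rowMin <= maxDist) { // bound: abandon branches that can't recover
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            walk(e.getValue(), e.getKey(), word, row, maxDist, prefix, results);
        }
    }
    prefix.deleteCharAt(prefix.length() - 1);
}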
(Aside: You'd be amazed how many ways physicians can try to spell "acetylsalicylic acid".)
I'm surprised at the size of your trie. A basic dictionary of acceptable words is maybe a few thousand. Then there are common prefixes and suffixes. Since the structure is a trie, you can connect together sub-tries and save a lot of space. Like the trie of basic prefixes can connect to the main dictionary, and then the terminal nodes of the main dictionary can connect to the trie of common suffixes (which can in fact contain cycles). In other words, the trie can be generalized into a finite state machine. That gives you a lot of flexibility.
REGARDLESS of all that, you have a performance problem. The nice thing about performance problems is, the worse they are, the easier they are to find. I've been a real pest on StackOverflow pointing this out. This link explains how to do it, links to a detailed example, and tries to dispel some popular myths. In a nutshell, the more time it is spending doing something that you could optimize, the more likely you will catch it doing that if you just pause it and take a look. My suspicion is that a lot of time is going into operations on overblown data structure, rather than just getting to the answer. That's a common situation, but don't fix anything until samples point you directly at the problem.
