Auto Suggest: Substring matching - Java

I am trying to implement auto suggest using a ternary search tree (TST), but a TST is only useful for prefix searches. How can we implement auto suggest for substring matches as well?
Is there any other data structure which can be used?
Example of a substring match:
When I search for UML using auto suggest, even the string "Beginners Guide for UML" should match.

This is off the top of my head, not a proper, proven data structure/algorithm:
Select a mapping of all legal characters to N symbols (for simplicity: 26 symbols for Latin letters and similar non-Latin letters, ignoring case, plus 1 for non-letters = 27 symbols total).
From your dictionary, create a shallow tree with a max branching factor of N (i.e. quite high). Leaf nodes would contain references to all words which contain the symbol combo leading from the root to that leaf (intermediate nodes might contain references to words which are shorter than the depth of a leaf node, or you could just ignore words that short).
The depth of the tree would be variable, probably in the range 1..4, so that each leaf node contains about the same number of words (the same word is of course listed under many leaves, e.g. MATCH under the leaves MAT, ATC and TCH if the tree depth happened to be 3).
As the user enters letters, follow the tree as far as it goes, until you're left with a relatively small set of words. Then, once you're at a leaf and the user enters more text, do linear filtering on the remaining words. Optionally filter out symbol matches which aren't actual character matches, though it might be nice to also match äöO when the user enters ao0, etc.
Optimize the number of symbols you map your characters to so the tree has a good branching factor. Optimize the number of words per leaf to keep memory usage decent without leaving too many words to filter linearly after reaching a leaf.
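Sketched as (untested) Java, the idea might look roughly like this; for simplicity it uses a flat map keyed by length-3 symbol strings instead of an explicit tree, plus the 27-symbol mapping from the first step. The class and method names are made up for illustration:

    import java.util.*;

    class SubstringSuggester {
        private static final int DEPTH = 3;                               // the "tree depth", i.e. key length
        private final Map<String, Set<String>> buckets = new HashMap<>();

        // map every character onto one of 27 symbols: 'a'..'z' plus '#' for non-letters
        private static char symbol(char c) {
            char lower = Character.toLowerCase(c);
            return (lower >= 'a' && lower <= 'z') ? lower : '#';
        }

        private static String key(String s, int from) {
            StringBuilder k = new StringBuilder();
            for (int i = from; i < from + DEPTH; i++) k.append(symbol(s.charAt(i)));
            return k.toString();
        }

        void add(String phrase) {
            for (int i = 0; i + DEPTH <= phrase.length(); i++) {
                buckets.computeIfAbsent(key(phrase, i), k -> new LinkedHashSet<>()).add(phrase);
            }
        }

        // phrases containing the typed text anywhere, e.g. suggest("UML") matches "Beginners Guide for UML"
        List<String> suggest(String typed) {
            if (typed.length() < DEPTH) return Collections.emptyList();   // too little input to narrow down yet
            List<String> result = new ArrayList<>();
            for (String phrase : buckets.getOrDefault(key(typed, 0), Collections.emptySet())) {
                // linear filter over the (hopefully small) bucket, dropping symbol-only matches
                if (phrase.toLowerCase().contains(typed.toLowerCase())) result.add(phrase);
            }
            return result;
        }
    }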
Of course there are actual researched algorithms for finding a string (what the user entered) in a large piece of text (all the phrases/words you want to match), like Aho–Corasick and Rabin–Karp, which are probably worth investigating.

Related

Storing Paragraph # in Trie

I am building a Trie in Java. When searching the trie for a keyword, the entry for the keyword also needs to store which paragraphs of the text the keyword appears in. Does anyone have some insight into how I would go about storing the paragraph number in the trie with the word? Do I index the whole text and then put it into the trie? I'm a little stumped!
Usually a trie is a tree built from some node type that has a list of child nodes of the same type, where each child again has a list, and so on. Every word corresponds to exactly one node in the trie, so if you add an extra field to the node type you can store additional information there, such as a paragraph number.
To construct this, simply loop through every word and add it to the trie by walking down the trie and adding any missing nodes, then mark the node corresponding to the word with the paragraph number (not every node on the way to the word, only the last one).
Note that since a word may appear in several paragraphs, you probably want a list of paragraph numbers in each node. This way you can also have an empty list in the nodes for words which don't exist in the text.
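A minimal (untested) Java sketch of that idea; the class and method names are just for illustration:

    import java.util.*;

    class ParagraphTrie {
        private final Map<Character, ParagraphTrie> children = new HashMap<>();
        private final List<Integer> paragraphs = new ArrayList<>();     // stays empty for words not in the text

        // record one occurrence of a word in the given paragraph
        void add(String word, int paragraphNumber) {
            ParagraphTrie node = this;
            for (char c : word.toLowerCase().toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new ParagraphTrie());
            }
            node.paragraphs.add(paragraphNumber);                       // only the last node is marked
        }

        List<Integer> paragraphsFor(String word) {
            ParagraphTrie node = this;
            for (char c : word.toLowerCase().toCharArray()) {
                node = node.children.get(c);
                if (node == null) return Collections.emptyList();       // the word was never indexed
            }
            return node.paragraphs;
        }
    }

Building it is then just: for each paragraph number p, for each word w in that paragraph, call add(w, p).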

OCR-specific approximate string matching library

I have text extracted from an image using OCR. Some of the words in the text are not correctly recognized, as follows:
'DRDER 0F OFF1CE RESTAURAUT, QNE THO...'
As you can see, some characters are optically easy to mistake for others: 1 -> I, O -> D -> Q, H -> W, U -> N, and so on.
Question: Apart from standard algorithms like Levenshtein distance, is there a Java or Python library implementing an OCR-specific algorithm that can compare words to a predefined dictionary and give a score, taking into account possible OCR character mix-ups?
I don't know of anything OCR-specific, but you might be able to make this work with Biopython, because the basic problem of comparing one string to another using a matrix that scores each character's similarity to every other character is very common in bioinformatics. We call it a sequence alignment problem.
Have a look at the pairwise2 module that Biopython provides; you would be able to compare each input word against each dictionary word with pairwise2.align.globaldx, using a dict that has all the pairwise character similarities. There are also functions in there for scoring deleted/inserted characters.
Computing the pairwise character similarities would be something you'd have to do yourself, maybe by rendering each character in your chosen font and comparing the images, or maybe manually by just rating which characters look similar to you. You could also have a look at this other SO answer where characters are broken into classes based on the presence/absence of strokes.
If you want something better than O(input * dictionary), you'd have to switch from brute force comparison to some kind of seed-match-based algorithm. If you assume that you'll always have a 2-character perfect match for example, you can index your dictionary by which words contain each length-2 string, and only compare the input words against the dictionary words that share a length-2 string with them.
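If you end up rolling your own scoring in Java instead, a plain dynamic-programming edit distance with a hand-made confusion table captures the same scoring-matrix idea. The pairs and weights below are made-up examples, not a calibrated matrix:

    import java.util.*;

    class OcrDistance {
        // example confusion costs, keyed by the two characters in ascending order; tune these yourself
        private static final Map<String, Double> CONFUSION = new HashMap<>();
        static {
            CONFUSION.put("1i", 0.2);   // 1 <-> I
            CONFUSION.put("0o", 0.2);   // 0 <-> O
            CONFUSION.put("do", 0.3);   // D <-> O
            CONFUSION.put("oq", 0.3);   // O <-> Q
            CONFUSION.put("hw", 0.4);   // H <-> W
            CONFUSION.put("nu", 0.3);   // N <-> U
        }

        private static double substitutionCost(char a, char b) {
            a = Character.toLowerCase(a);
            b = Character.toLowerCase(b);
            if (a == b) return 0.0;
            String key = a < b ? "" + a + b : "" + b + a;
            return CONFUSION.getOrDefault(key, 1.0);                 // confusable pairs cost less than 1.0
        }

        // standard Levenshtein dynamic programming, with weighted substitutions
        static double distance(String s, String t) {
            double[][] d = new double[s.length() + 1][t.length() + 1];
            for (int i = 0; i <= s.length(); i++) d[i][0] = i;
            for (int j = 0; j <= t.length(); j++) d[0][j] = j;
            for (int i = 1; i <= s.length(); i++) {
                for (int j = 1; j <= t.length(); j++) {
                    double sub = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                    d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                }
            }
            return d[s.length()][t.length()];
        }
    }

With that, distance("0FF1CE", "OFFICE") comes out far lower than distance("0FF1CE", "ORANGE"), so ranking dictionary words by this score favours plausible OCR fixes.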

Fill in the Blank String

I am studying for an interview and having trouble with this question.
Basically, you have a word that has spaces in it like c_t.
You have a word bank and have to find all the possible words that can be made with the given string. So in this case, if cat was in the word bank, we would return true.
Any help on solving this question (like an optimal algorithm) would be appreciated.
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.
Step 1.) Eliminate all words in the word bank that don't have the same length as the specified string.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented, like c_ter_il_ar, then for each word left in the bank check whether it contains the isolated sequences (such as ter and il) at those exact same indexes, and eliminate those that don't.
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty (a rough sketch in code follows below).
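Put together as (untested) Java, with '_' standing for a blank, the whole check can be as simple as this; pre-filtering by length and by the fixed fragments, as in the steps above, just reduces how many words reach matches():

    import java.util.*;

    class BlankMatcher {
        // true if the pattern (with '_' for each blank) matches the word exactly
        static boolean matches(String pattern, String word) {
            if (pattern.length() != word.length()) return false;     // step 1: same length
            for (int i = 0; i < pattern.length(); i++) {
                char p = pattern.charAt(i);
                if (p != '_' && p != word.charAt(i)) return false;   // steps 2-3: fixed letters must line up
            }
            return true;
        }

        // true if any word in the bank fits the pattern
        static boolean anyMatch(String pattern, Collection<String> bank) {
            for (String word : bank) {
                if (matches(pattern, word)) return true;             // step 4: at least one viable word exists
            }
            return false;
        }
    }

For example, anyMatch("c_t", Arrays.asList("dog", "cat")) returns true because "cat" fits the pattern.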
It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.
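If you did go the SQL route from Java, a minimal JDBC sketch could look like the following; the connection URL and the dictionary table are placeholders, and the pattern is passed as a parameter rather than concatenated into the query:

    import java.sql.*;
    import java.util.*;

    class SqlLookup {
        static List<String> wordsMatching(String pattern) throws SQLException {
            List<String> words = new ArrayList<>();
            // placeholder URL; point it at whatever database holds your dictionary table
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:dict");
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT word FROM dictionary WHERE word LIKE ?")) {
                stmt.setString(1, pattern);                          // e.g. "c_t"
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) words.add(rs.getString("word"));
                }
            }
            return words;
        }
    }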

Algorithm and Data Structure for Checking letters in a word with another set of letters

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!
For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s}, and I want to check the words "big" and "buff". All the letters of "big" are a subset of the original set, so "big" is what I want, while "buff" is not, because there is only one "f" in the original set.
This is what I want to do.
This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.
[Note that because we sorted the words, they lose their uniqueness, so one sorted word can point to multiple different words. ie your Trie needs to store a list or array at its data nodes]
You can save this structure out if you need to load it quickly later without all the sorting steps.
What you then do is take your input letters and sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letters, you also allow the current letter to be dropped.
And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.
If you have repeated letters in your input, you may need extra logic to prevent multiple instances of the same word being reported (unless you want that). That logic can be invoked during the step that 'leaves out' a letter: just skip past all the repeated letters to the next distinct one.
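A rough (untested) Java sketch of that walk, assuming lowercase a-z input; the class name is made up:

    import java.util.*;

    class AnagramTrie {
        private final Map<Character, AnagramTrie> children = new HashMap<>();
        private final List<String> words = new ArrayList<>();        // words whose sorted letters end at this node

        void add(String word) {
            char[] key = word.toCharArray();
            Arrays.sort(key);                                        // "word" becomes "dorw"
            AnagramTrie node = this;
            for (char c : key) {
                node = node.children.computeIfAbsent(c, k -> new AnagramTrie());
            }
            node.words.add(word);                                    // sorted forms collide, so keep a list
        }

        // every dictionary word that can be made from the given letters
        Set<String> wordsFrom(String letters) {
            char[] sorted = letters.toCharArray();
            Arrays.sort(sorted);
            Set<String> result = new LinkedHashSet<>();              // a set absorbs duplicates from repeated letters
            collect(sorted, 0, result);
            return result;
        }

        private void collect(char[] letters, int i, Set<String> result) {
            result.addAll(words);                                    // any value found along the way is a word
            if (i == letters.length) return;
            AnagramTrie child = children.get(letters[i]);
            if (child != null) child.collect(letters, i + 1, result); // use the current letter...
            collect(letters, i + 1, result);                          // ...or drop it as unused
        }
    }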
[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.
This is simple.
Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.
You can use an array to mark the letter set. Each element in the array stands for a letter; to convert a letter to its position, just subtract the code of 'a' (or of 'A' for uppercase). So the first element stands for 'a', the second for 'b', and so on; if you want to treat uppercase letters as distinct, append another 26 elements so that 'A' becomes the 27th. The element value stands for the number of occurrences; for example, the array {2, 0, 1, 0, ...} stands for {a, a, c}. In Java the check could look roughly like this (lowercase only, for brevity):
    boolean canMake(String word, int[] letterCounts) {        // for each word in the dictionary
        int[] remaining = letterCounts.clone();               // copy the array to a new one
        for (char c : word.toCharArray()) {                   // for each letter in the word
            int position = c - 'a';                           // element position of the letter
            if (--remaining[position] < 0) return false;      // decrease the count; negative means not found
        }
        return true;                                          // every letter was available
    }
Sort the set, then sort each word and do a "merge"-like operation.
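For example (an untested sketch, assuming both the letter set and the word have already been sorted into char arrays):

    class MergeCheck {
        // merge-style check: walk both sorted arrays in step, skipping letters the word doesn't need
        static boolean canMake(char[] sortedLetters, char[] sortedWord) {
            int i = 0;
            for (char c : sortedWord) {
                while (i < sortedLetters.length && sortedLetters[i] < c) i++;  // skip smaller, unused letters
                if (i == sortedLetters.length || sortedLetters[i] != c) return false;
                i++;                                                           // consume the matched letter
            }
            return true;
        }
    }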

String matching algorithms used by Lucene

I want to know the string matching algorithms used by Apache Lucene. I have been going through the index file format used by Lucene, given here. It seems that Lucene stores all words occurring in the text as-is, with their frequency of occurrence in each document.
But as far as I know, for efficient string matching it would need to preprocess the words occurring in the documents.
Example:
Search for "iamrohitbanga is a user of stackoverflow" (using fuzzy matching) in some documents.
It is possible that there is a document containing the string "rohit banga".
To find that the substrings rohit and banga are present in the search string, it would use some efficient substring matching.
I want to know which algorithm it is. Also, if it does some preprocessing, which function call in the Java API triggers it?
As Yuval explained, in general Lucene is geared at exact matches (by normalizing terms with analyzers at both index and query time).
In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.
The way this works is that a Lucene term dictionary itself is really a form of a suffix tree. You can see this in the file formats that you mentioned in a few places:
Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
So low-level it is a suffix tree, but at the higher level, we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite state automaton (DFA):
Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.
Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.
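For illustration, the prefix compression quoted above ("bone" followed by "boy" is stored as prefix length 2 plus suffix "y") can be sketched like this; this is just the encoding idea, not Lucene's actual code:

    import java.util.*;

    class PrefixDeltaDictionary {
        // encode a sorted term list as (sharedPrefixLength, suffix) pairs, as in the file-format description above
        static List<Map.Entry<Integer, String>> encode(List<String> sortedTerms) {
            List<Map.Entry<Integer, String>> out = new ArrayList<>();
            String previous = "";
            for (String term : sortedTerms) {
                int shared = 0;
                int max = Math.min(previous.length(), term.length());
                while (shared < max && previous.charAt(shared) == term.charAt(shared)) shared++;
                out.add(new AbstractMap.SimpleEntry<Integer, String>(shared, term.substring(shared)));
                previous = term;
            }
            return out;
        }
    }

encode(Arrays.asList("bone", "boy")) yields (0, "bone") and (2, "y"), matching the example from the file-format docs.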
The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens. During this process, it may collate equivalent strings (e.g. upper and lower case, stemmed strings, removed diacritics, etc.).
The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP.
However, FuzzyQuery and WildCardQuery use something similar, first searching for matching terms and then using them for the full match. Please see Robert Muir's Blog Post about AutomatonQuery for a new, efficient approach to this problem.
As you pointed out, Lucene stores only the list of terms that occurred in documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply splits on whitespace. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.
Please look at the Lucene API docs for the TokenFilter class.
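As a toy illustration of what such an analyzer could emit (plain Java, not an actual Lucene TokenFilter), a brute-force pass over all substrings against a word list produces exactly those tokens:

    import java.util.*;

    class SubwordTokens {
        // emit the original token plus every dictionary word found inside it
        static List<String> tokens(String input, Set<String> dictionary) {
            Set<String> out = new LinkedHashSet<>();
            out.add(input);                                          // keep the original token too
            for (int i = 0; i < input.length(); i++) {
                for (int j = i + 1; j <= input.length(); j++) {
                    String candidate = input.substring(i, j);
                    if (dictionary.contains(candidate)) out.add(candidate);
                }
            }
            return new ArrayList<>(out);
        }
    }

tokens("iamrohitbanga", new HashSet<>(Arrays.asList("i", "am", "rohit", "banga"))) yields ['iamrohitbanga', 'i', 'am', 'rohit', 'banga']. A real TokenFilter would do the same kind of work inside incrementToken(), typically with a smarter dictionary structure than brute force.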
