Java Binary Search Tree Prefixes - java

I am writing a program in Java that takes words and definitions, places both the word and definition in a node object, and places that node in a binary search tree dictionary sorted lexicographically by word.
I am trying to create an option for the user to find all tree words that begin with a certain prefix of letters. For instance, given the input "ap", the program might return the words "appease", "apple", "apply", "apron", etc. However, I have no idea how to implement this. My binary search tree class has a find method and a traversal method (using iterators), but I don't know how to use those to search through the node objects, as the dictionary class (that stores the nodes in the tree) cannot handle anything like that. Does anyone have any ideas on how to tackle this?

Related

Storing Paragraph # in Trie

I am building a trie in Java. When searching the trie for a keyword, the entry for the keyword also needs to store which paragraphs of the text the keyword appears in. Does anyone have some insight into how I would go about storing the paragraph number in the trie with the word? Do I index the whole text and then put it into the trie? I'm a little stumped!
Usually a trie is a tree built from a single node type that holds a list of child nodes of the same type, where each child again has a list, and so on. Every node in the trie corresponds to exactly one string (the characters on the path from the root to that node) and vice versa, so if you add an extra field to the node type you can store additional information there, such as a paragraph number.
In order to construct this, simply loop through every word and add it to the trie by walking down the trie and adding missing nodes, then mark the node corresponding to the word with the paragraph number. (not every node on the way to the word, only the last node)
Note that since a word may appear in several paragraphs, you probably want a list of paragraph numbers in each node. This way you can also have an empty list in the nodes for words which don't exist in the text.
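The construction described above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation: the class name ParagraphTrie, the methods add and paragraphsOf, and the sample text are all made up for the example, and paragraphs are assumed to be numbered from 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; a sketch of a trie whose nodes carry paragraph numbers.
class ParagraphTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Paragraphs in which the word ending at this node occurs; empty if no word ends here.
        final List<Integer> paragraphs = new ArrayList<>();
    }

    private final Node root = new Node();

    // Walk down the trie, creating missing nodes, and mark only the last node.
    void add(String word, int paragraph) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        if (!node.paragraphs.contains(paragraph)) {
            node.paragraphs.add(paragraph);
        }
    }

    // Returns a copy of the paragraph list, or an empty list if the word never occurs.
    List<Integer> paragraphsOf(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        return new ArrayList<>(node.paragraphs);
    }

    public static void main(String[] args) {
        String[] text = { "the cat sat", "the dog ran", "a cat ran" };
        ParagraphTrie trie = new ParagraphTrie();
        for (int p = 0; p < text.length; p++) {
            for (String word : text[p].split("\\s+")) {
                trie.add(word, p + 1); // paragraphs numbered from 1 (an assumption)
            }
        }
        System.out.println(trie.paragraphsOf("cat")); // [1, 3]
        System.out.println(trie.paragraphsOf("ca"));  // [] (a prefix, not a word)
    }
}
```

Keeping the list on the node means a lookup for a prefix that is not itself a word simply returns an empty list, as the answer suggests.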

How to get the expression tree of a regex in Java?

I'm working on the algorithm that converts a regular expression into a DFA. The algorithm is limited to the operators *, | and . (concatenation).
Those who do not know the meaning of each operator can check this.
The algorithm analyzes the nodes of a tree that is built from the regex.
Here I attached an image showing a table with the functions nullable, first position and last position, which are applied to each node of the tree.
For example, apply the analysis with the table to the regex (a|b)*a(a|b).
The first step of the algorithm is to add the symbol # at the end, giving (a|b)*a(a|b)#, and to number each symbol:
Then the tree is constructed (my problem) and each node is analyzed with the table above, with the following result. To the right of each node, in {}, is PmraPos (first position), and to the left, in {}, is UtmaPos (last position).
Problem: In Java I tried to build this tree with a Stack but did not get good results because, as you can see in the picture, not all nodes have two children. I would like help building the tree.
Note: To try to build the tree, I first converted the regular expression to its postfix form.
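One possible way to handle nodes with a single child when building the tree from postfix is to pop one operand for * and two for | and . (concatenation). The sketch below assumes the postfix string uses an explicit . for concatenation and treats every other character as an operand; the names RegexTree, Node and fromPostfix are hypothetical, and the firstpos/lastpos computation is not included.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: builds a syntax tree from a postfix regex using a stack.
class RegexTree {
    static final class Node {
        final char symbol;
        final Node left;  // the only child for '*', null for leaves
        final Node right; // null for '*' and for leaves

        Node(char symbol, Node left, Node right) {
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            if (left == null) {
                return String.valueOf(symbol);            // leaf
            }
            if (right == null) {
                return left.toString() + symbol;          // unary operator
            }
            return "(" + left + symbol + right + ")";    // binary operator
        }
    }

    static Node fromPostfix(String postfix) {
        Deque<Node> stack = new ArrayDeque<>();
        for (char c : postfix.toCharArray()) {
            if (c == '*') {
                // Unary operator: consumes one operand, so the node has one child.
                stack.push(new Node(c, stack.pop(), null));
            } else if (c == '|' || c == '.') {
                // Binary operator: the right operand is on top of the stack.
                Node right = stack.pop();
                Node left = stack.pop();
                stack.push(new Node(c, left, right));
            } else {
                stack.push(new Node(c, null, null)); // operand, including '#'
            }
        }
        Node root = stack.pop();
        if (!stack.isEmpty()) {
            throw new IllegalArgumentException("malformed postfix: " + postfix);
        }
        return root;
    }

    public static void main(String[] args) {
        // Postfix form of (a|b)*.a.(a|b).# with explicit concatenation.
        Node root = fromPostfix("ab|*a.ab|.#.");
        System.out.println(root); // ((((a|b)*.a).(a|b)).#)
    }
}
```

With the tree in this shape, a post-order traversal can visit each node and apply the table's rules, using right == null to distinguish the * case.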

Fill in the Blank String

I am studying for an interview and having trouble with this question.
Basically, you have a word that has blanks in it, like c_t.
You have a word bank and have to find all the possible words that can be made with the given string. So in this case, if cat was in the word bank we would return true.
Any help on solving this question (like an optimal algorithm) would be appreciated.
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.
Step 1.) Eliminate all words in the word bank that don't have the same length as the specified one.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented like c_ter_il_ar, for each word left in the bank check if it contains the isolated sequences at those exact same indexes, such as ter and il, and eliminate those that don't have them.
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty.
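A rough sketch of these steps follows. Since steps 2 and 3 both amount to "every known character must sit at the same index", the sketch folds them into a single per-character check. The class and method names (BlankMatcher, matches, candidates) are hypothetical, and '_' is assumed to mark a blank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the elimination steps; '_' is assumed to mark a blank.
class BlankMatcher {
    static boolean matches(String pattern, String word) {
        if (pattern.length() != word.length()) {
            return false; // step 1: lengths must agree
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '_' && p != word.charAt(i)) {
                return false; // steps 2 and 3: known characters must match at the same index
            }
        }
        return true;
    }

    // Returns the words that survive the elimination, in word bank order.
    static List<String> candidates(String pattern, List<String> bank) {
        List<String> result = new ArrayList<>();
        for (String word : bank) {
            if (matches(pattern, word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> bank = Arrays.asList("cat", "cot", "cut", "cart", "dog", "caterpillar");
        System.out.println(candidates("c_t", bank));         // [cat, cot, cut]
        System.out.println(candidates("c_ter_il_ar", bank)); // [caterpillar]
        // Step 4: the answer is true when at least one word survives.
        System.out.println(!candidates("c_t", bank).isEmpty()); // true
    }
}
```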
It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.

Auto Suggest : Substring matching

I am trying to implement auto suggest using a ternary search tree (TST), but a TST is only useful for prefix searches. How can we implement auto suggest for substring matches as well?
Is there any other data structure that can be used?
Example of a substring match:
When I search for UML using auto suggest, even the string "Beginners Guide for UML" should match.
This is from the top of my head, not any proper and proven data structure/algorithm:
Select a mapping of all legal characters to N symbols (for simplicity: 26 symbols for Latin letters and similar non-Latin letters, ignoring case, plus 1 for non-letters, for a total of 27 symbols).
From your dictionary, create a shallow tree with a maximum branching factor of N (i.e. quite high). Leaf nodes would contain references to all words that contain the symbol combination leading from the root to that leaf. Intermediate nodes might contain references to words that are shorter than the depth of a leaf node, or you could just ignore words that are that short.
The depth of the tree would be variable, probably in the range 1..4, so that each leaf node contains about the same number of words. The same word is of course listed under many leaves, like MATCH under the leaves MAT, ATC and TCH if the tree depth happened to be 3.
When the user is entering letters, follow the tree as far as it goes, until you're left with a relatively small set of words. Once you're at a leaf of the tree and the user enters more text to match, do linear filtering on the remaining words. Optionally filter out symbol matches that aren't actually character matches, though it might be nice to also match äöO when the user enters ao0, etc.
Optimize the number of symbols you map your characters to, to get a good branching factor for the tree. Optimize the number of words per leaf to get decent memory usage without having too many words to filter linearly after reaching a leaf of the tree.
Of course there are actual researched algorithms for finding a string (what user entered) in a large piece of text (all the phrases/words you want to match), like Aho–Corasick and Rabin–Karp, which are probably worth investigating.
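A much simplified sketch of the shallow-tree idea follows. It makes several assumptions that are not in the description above: the depth is fixed at 3 instead of variable, a hash map keyed by 3-character sequences stands in for the tree, lowercasing stands in for the 27-symbol mapping, and queries shorter than 3 characters fall back to a linear scan. The names SubstringSuggester, add and suggest are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: buckets of phrases keyed by every 3-character sequence they contain.
class SubstringSuggester {
    private static final int DEPTH = 3; // fixed here; the description suggests tuning it
    private final Map<String, Set<String>> buckets = new HashMap<>();
    private final List<String> all = new ArrayList<>();

    void add(String phrase) {
        all.add(phrase);
        String key = phrase.toLowerCase(Locale.ROOT);
        // List the phrase under every DEPTH-character window, like MATCH under MAT, ATC, TCH.
        for (int i = 0; i + DEPTH <= key.length(); i++) {
            buckets.computeIfAbsent(key.substring(i, i + DEPTH), k -> new LinkedHashSet<>())
                   .add(phrase);
        }
    }

    List<String> suggest(String typed) {
        String query = typed.toLowerCase(Locale.ROOT);
        // Short queries scan everything; longer ones start from the bucket of their first window.
        Collection<String> candidates = all;
        if (query.length() >= DEPTH) {
            candidates = buckets.getOrDefault(query.substring(0, DEPTH), Collections.emptySet());
        }
        // Linear filtering on the (hopefully small) remaining set.
        List<String> result = new ArrayList<>();
        for (String phrase : candidates) {
            if (phrase.toLowerCase(Locale.ROOT).contains(query)) {
                result.add(phrase);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SubstringSuggester suggester = new SubstringSuggester();
        suggester.add("Beginners Guide for UML");
        suggester.add("UML Distilled");
        suggester.add("Java Generics");
        System.out.println(suggester.suggest("uml"));   // [Beginners Guide for UML, UML Distilled]
        System.out.println(suggester.suggest("UML D")); // [UML Distilled]
    }
}
```

The trade-off is the one described above: more windows per phrase cost memory, while larger buckets cost more linear filtering.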

String matching algorithms used by Lucene

I want to know which string matching algorithms Apache Lucene uses. I have been going through the index file format used by Lucene given here. It seems that Lucene stores all words occurring in the text as is, with their frequency of occurrence in each document.
But as far as I know, efficient string matching would require preprocessing the words occurring in the documents.
Example:
Search for "iamrohitbanga is a user of stackoverflow" (using fuzzy matching) in some documents.
It is possible that there is a document containing the string "rohit banga".
To find that the substrings rohit and banga are present in the search string, it would need some efficient substring matching.
I want to know which algorithm it is. Also, if it does some preprocessing, which function call in the Java API triggers it?
As Yuval explained, in general Lucene is geared at exact matches (by normalizing terms with analyzers at both index and query time).
In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.
The way this works is that a Lucene term dictionary itself is really a form of a suffix tree. You can see this in the file formats that you mentioned in a few places:
Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
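The "bone" / "boy" example from the file format can be illustrated with a small sketch of prefix sharing between consecutive sorted terms. This is not Lucene's code or its on-disk encoding; the class PrefixDelta, its methods, and the "length:suffix" string form are made up for the illustration.

```java
// Hypothetical sketch of storing a term as (shared prefix length, suffix) relative to the previous term.
class PrefixDelta {
    // Number of leading characters the two terms have in common.
    static int sharedPrefixLength(String previous, String current) {
        int max = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < max && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    // Illustrative text form "prefixLength:suffix"; not a real file format.
    static String encode(String previous, String current) {
        int n = sharedPrefixLength(previous, current);
        return n + ":" + current.substring(n);
    }

    // Rebuilds the current term from the previous term and the stored pair.
    static String decode(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }

    public static void main(String[] args) {
        System.out.println(encode("bone", "boy"));    // 2:y
        System.out.println(decode("bone", 2, "y"));   // boy
    }
}
```

Because each term depends on the one before it, decoding has to start from a fully stored term, which is why an index over every Nth term is needed for random access.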
So low-level it is a suffix tree, but at the higher level, we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite state automaton (DFA):
Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.
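The contract quoted above (an ordered enumeration that can be positioned at a given term) is enough to enumerate all terms with a given prefix. The sketch below shows the idea with java.util.TreeSet as a stand-in for the term dictionary; it is an analogy in plain Java collections, not Lucene's API, and the names SortedTermScan and termsWithPrefix are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch: prefix enumeration over a sorted set of terms.
class SortedTermScan {
    static List<String> termsWithPrefix(NavigableSet<String> terms, String prefix) {
        List<String> result = new ArrayList<>();
        // Position at the first term >= prefix, then read forward in sorted order.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // terms sharing a prefix are contiguous, so the scan can stop here
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("apron", "banana", "apple", "ant", "apply", "appease"));
        System.out.println(termsWithPrefix(terms, "ap")); // [appease, apple, apply, apron]
    }
}
```

The scan touches only the matching terms plus one extra, regardless of how many terms the set holds.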
Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.
The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens. During this process, it may collate equivalent strings (e.g. by folding upper and lower case, stemming, removing diacritics, etc.).
The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP.
However, FuzzyQuery and WildcardQuery use something similar, first searching for matching terms and then using them for the full match. Please see Robert Muir's Blog Post about AutomatonQuery for a new, efficient approach to this problem.
As you pointed out, Lucene stores only the list of terms that occurred in the documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply breaks the text into words separated by spaces. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.
Please look at the Lucene API docs for the TokenFilter class.
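How the five tokens are produced is left open above. One possible approach, sketched here in plain Java without any Lucene classes, is to split the run-together string against a known vocabulary using dynamic programming. The class RunTogetherSplitter, the method tokens and the vocabulary are all assumptions for the illustration; plugging this into Lucene would still require a custom filter as described in the API docs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: emit the original string plus its vocabulary-based parts.
class RunTogetherSplitter {
    static List<String> tokens(String source, Set<String> vocabulary) {
        int n = source.length();
        // cut[end] = start of the last word in a split of source[0, end), or -1 if none exists.
        int[] cut = new int[n + 1];
        Arrays.fill(cut, -1);
        cut[0] = 0; // the empty prefix is trivially splittable
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (cut[start] >= 0 && vocabulary.contains(source.substring(start, end))) {
                    cut[end] = start;
                    break;
                }
            }
        }
        List<String> result = new ArrayList<>();
        result.add(source); // always keep the original token
        if (n > 0 && cut[n] >= 0) {
            // Walk the cut points backwards to recover the parts in order.
            List<String> parts = new ArrayList<>();
            for (int end = n; end > 0; end = cut[end]) {
                parts.add(0, source.substring(cut[end], end));
            }
            result.addAll(parts);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = new HashSet<>(Arrays.asList("i", "am", "rohit", "banga"));
        System.out.println(tokens("iamrohitbanga", vocabulary));
        // [iamrohitbanga, i, am, rohit, banga]
    }
}
```

If the string cannot be split entirely into vocabulary words, only the original token is returned.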
