Storing Paragraph # in Trie - java

I am building a Trie in Java. When searching the trie for a keyword, the entry for the keyword also needs to store which paragraphs of the text the keyword appears in. Does anyone have some insight into how I would go about storing the paragraph number in the trie with the word? Do I index the whole text and then put it into the trie? I'm a little stumped!

Usually a trie is a tree built from a node type that holds a list of child nodes of the same type, where each child again has a list, and so on. Every node in the trie corresponds to exactly one string (the letters on the path from the root to that node), and every inserted word ends at exactly one node, so if you add an extra field to the node type you can store additional information there, such as a paragraph number.
To construct this, loop through every word and add it to the trie by walking down the trie and adding missing nodes, then mark the node corresponding to the word with the paragraph number (only the last node, not every node on the way to the word).
Note that since a word may appear in several paragraphs, you probably want a list of paragraph numbers in each node. This way you can also have an empty list in the nodes for words which don't exist in the text.
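A minimal sketch of that idea, under a few assumptions that are not in the question: paragraphs are numbered from 1, words are split on non-word characters and lower-cased, and the names ParagraphTrie, add, paragraphsOf and index are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trie whose word-ending nodes carry a list of paragraph numbers.
class ParagraphTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Stays empty for nodes that are only prefixes of words.
        final List<Integer> paragraphs = new ArrayList<>();
    }

    private final Node root = new Node();

    // Walk down the trie, creating missing nodes, and mark only the last node.
    void add(String word, int paragraph) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        List<Integer> list = node.paragraphs;
        // Paragraphs arrive in increasing order, so checking the tail avoids duplicates.
        if (list.isEmpty() || list.get(list.size() - 1) != paragraph) {
            list.add(paragraph);
        }
    }

    // Returns the paragraph numbers for a word, or an empty list if it never occurs.
    List<Integer> paragraphsOf(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        return node.paragraphs;
    }

    // Assumption: one String per paragraph, numbered from 1.
    static ParagraphTrie index(List<String> paragraphs) {
        ParagraphTrie trie = new ParagraphTrie();
        for (int i = 0; i < paragraphs.size(); i++) {
            for (String w : paragraphs.get(i).toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    trie.add(w, i + 1);
                }
            }
        }
        return trie;
    }
}
```

So there is no need to index the whole text separately first: the paragraph number is known while the words of that paragraph are being inserted.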

Related

Word Search: two string arrays in alphabetical order using merge sort

For my class project, we have to go through a Shakespeare sonnet and check whether each word is in the dictionary or not. I now have two String arrays, both in alphabetical order: one consists of the words from the sonnet and the other consists of the words from the dictionary. I am asked to use merge sort to check whether each word in the sonnet exists in the dictionary. Can anyone give me an idea of how I can implement this? Thanks in advance!
The idea is to:
1. Sort both of the arrays (with merge sort).
2. Remove any duplicates.
3. Iterate through both of the sorted arrays simultaneously (this can be done using the merging procedure in merge sort) and check whether the next word in the sonnet list equals the next word in the dictionary. If it does not, remove it and mark it as "not in dictionary"; if it does, mark it as "in the dictionary" and proceed to the next element in both lists.
However, this approach assumes that all of the words in the dictionary are contained in the sonnet. If this is not the case, you would have to remove those words up front.
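As a hedged sketch, here is a variant of the merge step that compares the two current words and advances only the side that is behind, which means the dictionary may contain words the sonnet does not. It assumes both arrays are already sorted by String.compareTo; the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: merge-style walk over two sorted arrays.
class SortedLookup {
    // Returns the sonnet words that do not occur in the dictionary.
    static List<String> missingWords(String[] sortedSonnet, String[] sortedDictionary) {
        List<String> missing = new ArrayList<>();
        int s = 0; // position in the sonnet array
        int d = 0; // position in the dictionary array
        while (s < sortedSonnet.length) {
            if (d == sortedDictionary.length) {
                // Dictionary exhausted: everything left in the sonnet is unknown.
                missing.add(sortedSonnet[s++]);
                continue;
            }
            int cmp = sortedSonnet[s].compareTo(sortedDictionary[d]);
            if (cmp == 0) {
                s++; // found; d stays put so a repeated sonnet word still matches
            } else if (cmp < 0) {
                missing.add(sortedSonnet[s++]); // the dictionary is already past this word
            } else {
                d++; // dictionary word is smaller, skip it
            }
        }
        return missing;
    }
}
```

Each array is traversed once, so after sorting the check itself is linear in the combined length.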
Really though, this doesn't sound like a sorting problem.
The best approach would be to use a HashSet (or a HashMap) and put all the dictionary words in it. Then you could iterate through the sonnet and check each word for existence in the set.
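A short sketch of the hash-based check, using a HashSet since only membership is needed. The names DictionaryCheck and check are illustrative, and the arrays are assumed to hold already-normalized words (same case, no punctuation).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: look every sonnet word up in a hash set of dictionary words.
class DictionaryCheck {
    // Maps each distinct sonnet word to true if the dictionary contains it.
    static Map<String, Boolean> check(String[] sonnet, String[] dictionary) {
        Set<String> known = new HashSet<>(Arrays.asList(dictionary));
        Map<String, Boolean> result = new LinkedHashMap<>(); // keeps sonnet order
        for (String word : sonnet) {
            result.put(word, known.contains(word));
        }
        return result;
    }
}
```

No sorting is required here, and each lookup takes expected constant time.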

Algorithm and Data Structure for Checking letters in a word with another set of letters

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!
For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s} and I want to check the words "big" and "buff". All letters of "big" are contained in the original set, so "big" is what I want, while "buff" is not, because there is only one "f" in the original set.
This is what I want to do.
This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.
[Note that because we sorted the letters of each word, the keys lose their uniqueness, so one sorted key can point to multiple different words, i.e. your Trie needs to store a list or array at its data nodes.]
You can save this structure out if you need to load it quickly later without all the sorting steps.
What you then do is take your input letters and sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letters, you also allow the current letter to be dropped.
And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.
If you have repeated letters in your input, you may need extra logic to prevent the same word from being reported multiple times (unless you want that). That logic can be invoked during the step that 'leaves out' a letter: you skip past all the repetitions of that letter and move on to the next distinct letter.
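The approach described above could be sketched roughly as follows. This is an illustration rather than a tuned implementation: AnagramTrie, add and wordsFrom are hypothetical names, and the "drop a letter" step is expressed as a loop over the remaining sorted letters that skips repeats at the same level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trie keyed by the sorted letters of each dictionary word.
class AnagramTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // All dictionary words whose sorted letters end at this node.
        final List<String> words = new ArrayList<>();
    }

    private final Node root = new Node();

    private static char[] sortedLetters(String s) {
        char[] letters = s.toCharArray();
        Arrays.sort(letters);
        return letters;
    }

    // "word" is stored under the path d-o-r-w.
    void add(String word) {
        Node node = root;
        for (char c : sortedLetters(word)) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.words.add(word);
    }

    // Every dictionary word that can be built from the given letters.
    List<String> wordsFrom(String letters) {
        List<String> found = new ArrayList<>();
        collect(root, sortedLetters(letters), 0, found);
        return found;
    }

    private static void collect(Node node, char[] letters, int from, List<String> found) {
        found.addAll(node.words);
        for (int i = from; i < letters.length; i++) {
            // Skipping a repeated letter at the same level avoids duplicate results.
            if (i > from && letters[i] == letters[i - 1]) {
                continue;
            }
            Node child = node.children.get(letters[i]);
            if (child != null) {
                collect(child, letters, i + 1, found); // use letters[i], drop those before it
            }
        }
    }
}
```

With the letters from the question, "big" would be reported and "buff" would not, since the sorted key of "buff" needs a second f that the input does not have.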
[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.
This is simple.
Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.
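A minimal sketch of the histogram check, assuming only lowercase letters a to z; LetterHistogram and canMake are names chosen for this example.

```java
// Hypothetical helper for the histogram test; assumes lowercase a-z input only.
class LetterHistogram {
    static boolean canMake(String word, String available) {
        int[] have = new int[26];
        for (char c : available.toCharArray()) {
            have[c - 'a']++; // count each available letter
        }
        int[] used = new int[26];
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            // Stop as soon as the word needs more of a letter than is available.
            if (++used[i] > have[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For the letters in the question, canMake("big", "bgeftuitgncmmwcs") is true and canMake("buff", "bgeftuitgncmmwcs") is false. When checking many words against the same letters, the have array only needs to be built once.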
You can use an array to represent the letter set. Each element in the array stands for a letter, and the element value is the number of occurrences. To convert a lowercase letter to its element position, subtract the ASCII code of 'a': the first element then stands for 'a', the second for 'b', and so on. If uppercase letters must be kept separate, they can start at the 27th element, using letter - 'A' + 26 as the position. For example, the array {2, 0, 1, 0, ...} stands for the letters {a, a, c}. The pseudocode could be:
for each word
    copy the array to a new one
    for each letter in the word
        get the element position of the letter: position = letter - 'a'
        decrease the element value in the new array by one: new_array[position]--
        if the value is negative, return not found: if new_array[position] < 0 {return not found;}
    if no value went negative, the word can be made
Another option: sort the set, then sort the letters of each word, and do a "merge"-like operation over the two sorted sequences.

Printing match from tree

I am trying to create a "word completion" tree program in Java from a dictionary that is a text file, but I am not sure where to go from here. The word completion program will match any words that start with the string entered. I am new to Java and to programming. I have designed the tree as a multi-way tree with each node storing a character as a letter and a boolean variable to indicate whether it is the end of a word (amongst other things).
I am at the point where I am trying to see whether reading the file into the tree is working correctly. However, when I try to print my tree, it does not print correctly: the first letter of every word after the first word is wrong. Instead of reading from the file, for testing purposes I am simply adding only 4 words to the tree (Base, Basement, Ma, Matthew).
So my question is can anyone tell me why it is not printing correctly AND what I need to do next in order to finish the word completion?
Thank you so much in advance to everyone for taking the time to help me with my problem
It's this part:
while(t!=null) {
    if(t.down!=null && t.right!=null) {
        //System.out.println(t.letter + " children");
        //System.out.print(t.letter);
        print(t.down);
    }
    t=t.right;
}
When you reach another word that you should print, you start printing it from t.down, so the letters above that node are missing. You can, for example, keep all the letters on the path to that node on a shared stack, print them first, and then continue with the remaining letters from the tree.
The issue here is that t.down is the next letter (from the point of view of the current node) of some other word.
Try adding more words with common starting substrings to see my point more easily.
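Here is one way the stack idea might look, as a sketch only. The Node layout (letter, endOfWord, down, right) is guessed from the question, a StringBuilder plays the role of the stack, and CompletionTree, add, words and complete are invented names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical child/sibling tree; the field names are assumptions based on the question.
class CompletionTree {
    static final class Node {
        final char letter;
        boolean endOfWord;
        Node down;  // first child: the next letter of the word
        Node right; // next sibling: another letter at the same position
        Node(char letter) { this.letter = letter; }
    }

    private Node root; // first node of the top-level sibling list

    void add(String word) {
        Node parent = null;
        for (char c : word.toCharArray()) {
            Node node = (parent == null) ? root : parent.down;
            Node last = null;
            while (node != null && node.letter != c) { last = node; node = node.right; }
            if (node == null) { // letter missing at this level: append a new sibling
                node = new Node(c);
                if (last != null) last.right = node;
                else if (parent == null) root = node;
                else parent.down = node;
            }
            parent = node;
        }
        if (parent != null) parent.endOfWord = true;
    }

    // All stored words, each rebuilt from the letters on its path.
    List<String> words() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    // All stored words that start with the given prefix.
    List<String> complete(String prefix) {
        List<String> out = new ArrayList<>();
        Node level = root;
        Node match = null;
        for (char c : prefix.toCharArray()) {
            match = level;
            while (match != null && match.letter != c) match = match.right;
            if (match == null) return out; // no stored word has this prefix
            level = match.down;
        }
        if (match != null && match.endOfWord) out.add(prefix);
        collect(level, new StringBuilder(prefix), out);
        return out;
    }

    private static void collect(Node t, StringBuilder path, List<String> out) {
        while (t != null) {
            path.append(t.letter);              // push the letter of this node
            if (t.endOfWord) out.add(path.toString());
            collect(t.down, path, out);         // longer words sharing this path
            path.setLength(path.length() - 1);  // pop before moving to the sibling
            t = t.right;
        }
    }
}
```

After adding Base, Basement, Ma and Matthew, words() returns all four in that order, and complete("Ba") returns Base and Basement. The same walk also covers the word completion step asked about in the question.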

Auto Suggest : Substring matching

I am trying to implement auto suggest using a ternary search tree (TST), but a TST is only useful for prefix searches. How can we implement auto suggest for substring matches as well?
Is there any other data structure which can be used?
Example of a substring match:
When I search for UML using auto suggest, even the string "Beginners Guide for UML" should match.
This is off the top of my head, not a proper and proven data structure/algorithm:
Select a mapping of all legal characters to N symbols (for simplicity: 26 symbols for latin letters and similar non-latin letters ignoring case + 1 for non-letters = total 27 symbols).
From your dictionary, create a shallow tree with a max branching factor of N (i.e. quite high). Leaf nodes would contain references to all words which contain the symbol combo leading from the root to that leaf (intermediate nodes might contain references to words which are shorter than the depth of a leaf node, or you could just ignore words which are that short).
Depth of tree would be variable, probably in range of 1..4, so that each leaf node would contain about same number of words (same word of course listed under many leaves, like MATCH under leaves MAT, ATC, TCH if tree depth happened to be 3).
When user is entering letters, follow the tree as far as it goes, until you're left with relatively small set of words. Then do linear filtering on remaining words after you're at leaf of the tree and user enters more text to match. Optionally filter out symbol matches which actually aren't character matches, though it might be nice to match also äöO when user enters ao0, etc.
Optimize number of symbols you map your chars to, to have good branching factor for the tree. Optimize words per leaf to have decent memory usage without having too many words to filter linearly after reaching leaf of the tree.
Of course there are actual researched algorithms for finding a string (what user entered) in a large piece of text (all the phrases/words you want to match), like Aho–Corasick and Rabin–Karp, which are probably worth investigating.
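A simplified sketch of the shallow-tree idea from this answer, with some liberties that should be read as assumptions: the tree is flattened into a hash map keyed by fixed-length substrings (n-grams) instead of having variable depth, matching ignores case, and SubstringSuggester, add and suggest are invented names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical n-gram index: every length-n substring points at the phrases containing it.
class SubstringSuggester {
    private final int n; // fixed key length, standing in for the tree depth
    private final Map<String, Set<String>> index = new HashMap<>();
    private final List<String> phrases = new ArrayList<>();

    SubstringSuggester(int n) {
        this.n = n;
    }

    void add(String phrase) {
        phrases.add(phrase);
        String key = phrase.toLowerCase();
        // MATCH with n = 3 is listed under mat, atc and tch.
        for (int i = 0; i + n <= key.length(); i++) {
            index.computeIfAbsent(key.substring(i, i + n), k -> new LinkedHashSet<>()).add(phrase);
        }
    }

    List<String> suggest(String typed) {
        String query = typed.toLowerCase();
        Collection<String> candidates;
        if (query.length() < n) {
            candidates = phrases; // too short for the index: filter everything
        } else {
            Set<String> bucket = index.get(query.substring(0, n));
            candidates = (bucket != null) ? bucket : Collections.<String>emptySet();
        }
        List<String> out = new ArrayList<>();
        for (String phrase : candidates) {
            // Linear filtering on the (hopefully small) candidate set.
            if (phrase.toLowerCase().contains(query)) {
                out.add(phrase);
            }
        }
        return out;
    }
}
```

With n = 3, typing "uml" would find "Beginners Guide for UML" through the bucket for "uml". Memory grows with the total length of all phrases, so n and the bucket sizes would need tuning, as the answer suggests.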

Java Binary Search Tree Prefixes

I am writing a program in Java that takes words and definitions, places both the word and definition in a node object, and places that node in a binary search tree dictionary sorted lexicographically by word.
I am trying to create an option for the user to find all tree words that begin with a certain prefix of letters. For instance, given the input "ap", the program might return the words "appease", "apple", "apply", "apron", etc. However, I have no idea how to implement this. My binary search tree class has a find method and a traversal method (using iterators), but I don't know how to use those to search through the node objects, as the dictionary class (that stores the nodes in the tree) cannot handle anything like that. Does anyone have any ideas on how to tackle this?
