shortest path between two strings - java

I have a problem where I need to build an adjacency matrix and then find the shortest path between two strings that the user enters. I have already read the data file full of strings and have built the adjacency matrix for the case where the shortest path between strings is one. I am confused about how to do this if the shortest path is 2, 3, 4, 5, etc. Two strings are connected if the first word's last three characters, last two characters, or last character match the second word's first three characters, first two characters, or first character, respectively.
An example I was given is "everyday" and "daytime", because the last three and first three characters ("day") match.
If the last two and first two characters match, an example is "brother" and "eraser".
If the last character and first character match, an example is "scorpion" and "night".
int i, j;
String[] s = new String[sizeOfFile];          // words read from the data file
int[][] a = new int[sizeOfFile][sizeOfFile];  // adjacency matrix
// note: this assumes every word has at least 3 characters
for (i = 0; i < sizeOfFile; i++)
{
    for (j = 0; j < sizeOfFile; j++)
    {
        // last three characters of s[i] match the first three of s[j]
        if (s[i].charAt(s[i].length() - 3) == s[j].charAt(0)
                && s[i].charAt(s[i].length() - 2) == s[j].charAt(1)
                && s[i].charAt(s[i].length() - 1) == s[j].charAt(2))
        {
            a[i][j] = 1;
        }
        // last two characters match the first two
        else if (s[i].charAt(s[i].length() - 2) == s[j].charAt(0)
                && s[i].charAt(s[i].length() - 1) == s[j].charAt(1))
        {
            a[i][j] = 1;
        }
        // last character matches the first character
        else if (s[i].charAt(s[i].length() - 1) == s[j].charAt(0))
        {
            a[i][j] = 1;
        }
        else
        {
            a[i][j] = 0;
        }
        //Prints adjacency matrix
    }
}

First, to find the shortest path without wasted effort, you need to do a breadth-first search, not a depth-first search.
E.g. to illustrate what I mean, imagine a directory search for a file by name. You search the current directory and for each sub-directory you call the search method recursively. That is a depth-first search, because it'll search the full hierarchy of the first sub-directory before continuing, and that's not what you want.
Instead, collect all the sub-directory names; then, if the file is not found yet, iterate the list to search each sub-directory, collecting the list of sub-sub-directories, and if the file is still not found, repeat using the new list.
That is what you want. Collect a list of all the words that are "connected" to the current word; then, for each word in that list, build a new list of all words connected to those words, and repeat until the target word is found.
That was the overall algorithm. The search for connected words should be done by adding all words to a TreeSet.
For each suffix (1-, 2-, or 3-letter), call subSet(suffix, nextSuffix), where nextSuffix is the same as suffix except last character incremented by one.
Check each subset for the target word and stop if found.
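For example, a small sketch of that lookup (assuming the words are loaded into a TreeSet<String> called dictionary and word is the word whose neighbours we want, neither of which is shown above):

// All words starting with a given suffix, e.g. "day" -> every word in ["day", "daz")
String suffix = word.substring(word.length() - 3);
String nextSuffix = suffix.substring(0, suffix.length() - 1)
        + (char) (suffix.charAt(suffix.length() - 1) + 1);
SortedSet<String> candidates = dictionary.subSet(suffix, nextSuffix);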
Keep a separate HashSet, called used, of words found so far, to prevent infinite looping and to prevent using a longer path to a word that was already reached by a shorter path. Remove from the subset any words that are already in the used set.
Add the remaining (unused) words in the subset to the used set, and (together with the "path" of words traveled to reach them) add them to the list of words to process on the next iteration.
Keep iterating until done, or list is empty (no path found).
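A rough sketch of that breadth-first search, run here over the adjacency matrix a[][] and word array s[] the question already builds (a simplification of the TreeSet approach above: neighbours are found by scanning a matrix row; assumes java.util.* is imported, and shortestPath is an illustrative name):

// Returns the shortest chain of words from s[from] to s[to], or null if none exists.
static List<String> shortestPath(String[] s, int[][] a, int from, int to) {
    int n = s.length;
    int[] prev = new int[n];                 // predecessor of each word on its shortest path
    Arrays.fill(prev, -1);
    boolean[] used = new boolean[n];         // words already reached by a shorter path
    Deque<Integer> queue = new ArrayDeque<>();
    queue.add(from);
    used[from] = true;
    while (!queue.isEmpty()) {
        int cur = queue.remove();
        if (cur == to) {                     // target found: rebuild the path backwards
            LinkedList<String> path = new LinkedList<>();
            for (int w = to; w != -1; w = prev[w]) {
                path.addFirst(s[w]);
            }
            return path;
        }
        for (int next = 0; next < n; next++) {
            if (a[cur][next] == 1 && !used[next]) {
                used[next] = true;
                prev[next] = cur;
                queue.add(next);
            }
        }
    }
    return null;                             // the list emptied without reaching the target
}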

Related

Finding the 5 most common words in a text

I have a school task where part of the task asks us to make a method that finds the 5 most common words in a .txt file.
The task asks us to put all the words in an ArrayList, which I have already done. The real problem is making the program print out the top 5 words in the text file.
The only "clue" I have is the method name, which is:
public Words[] common5(){
}
Iterate through the ArrayList; for each word in the list, put the word into a HashMap where the key is the word and the value is an Integer that you increase every time you see the word again. At the end, iterate through the HashMap, find the 5 largest counts, and print the corresponding words.
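A minimal sketch of that counting approach (returning the words as a String[] here, since the Words class from the assignment isn't shown):

import java.util.*;

public static String[] common5(List<String> words) {
    // count how many times each word occurs
    Map<String, Integer> counts = new HashMap<>();
    for (String w : words) {
        counts.merge(w, 1, Integer::sum);
    }
    // sort the entries by count, highest first
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((x, y) -> y.getValue() - x.getValue());
    // keep the first five words
    String[] top5 = new String[Math.min(5, entries.size())];
    for (int i = 0; i < top5.length; i++) {
        top5[i] = entries.get(i).getKey();
    }
    return top5;
}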

Google Challenge Dilemma, Insights into possible errors?

I am currently passing 4 of the 5 hidden test cases for this challenge and would like some input.
Quick problem description:
You are given two input strings, String chunk and String word.
The string "word" has been inserted into "chunk" some number of times.
The task is to find the shortest string possible when all instances of "word" have been removed from "chunk".
Keep in mind that during removal, more instances of "word" might be created in "chunk". "word" can also be inserted anywhere, including between other "word" instances.
If there is more than one shortest possible string after removal, return the one that is lexicographically the earliest.
This is easier understood with examples:
Inputs:
(string) chunk = "lololololo"
(string) word = "lol"
Output:
(string) "looo" (since "looo" is eariler than "oolo")
Inputs:
(string) chunk = "goodgooogoogfogoood"
(string) word = "goo"
Output:
(string) "dogfood"
Right now I am iterating forwards and then backwards, removing all instances of word, and then comparing the results of the two iterations.
Is there a case I am overlooking? Is it possible there is a case where you have to remove from the middle first, or something along those lines?
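A simplified sketch of what one of those passes looks like (the forward pass; the backward pass is the same idea using lastIndexOf; removeForward is just an illustrative name):

// Forward pass: repeatedly delete the first remaining occurrence of word,
// so new occurrences created by a removal are picked up on the next scan.
static String removeForward(String chunk, String word) {
    StringBuilder sb = new StringBuilder(chunk);
    int idx;
    while ((idx = sb.indexOf(word)) != -1) {
        sb.delete(idx, idx + word.length());
    }
    return sb.toString();
}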
Any insight is appreciated.
I am not sure, but I would avoid matching the first and last characters of chunk, and replace all the other occurrences.

Algorithm and Data Structure for Checking letters in a word with another set of letters

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!
For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s}, and I want to check the words "big" and "buff". All the letters of "big" are in the original set, so "big" is what I want, while "buff" is not what I want, because it needs two "f"s and there is only one "f" in the original set.
This is what I want to do.
This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.
[Note that because we sorted the words, they lose their uniqueness, so one sorted word can point to multiple different words. ie your Trie needs to store a list or array at its data nodes]
You can save this structure out if you need to load it quickly later without all the sorting steps.
What you then do is take your input letters and sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letters, you also allow the current letter to be dropped.
And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.
If you have repeated letters in your input, you may need extra logic to prevent multiple instances of the same word being given (unless you want that). That logic can be invoked during the step that 'leaves out' a letter: you just skip past all the repeated letters to the next distinct letter.
[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.
This is simple.
Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.
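A minimal sketch of that check (assuming lower-case letters a..z only; letterCounts here is the histogram of your available letters, and canMake is an illustrative name):

// Returns true if word can be built from the available letters (with multiplicity).
static boolean canMake(String word, int[] letterCounts) {
    int[] needed = new int[26];              // histogram built from the test word
    for (char c : word.toCharArray()) {
        int i = c - 'a';
        needed[i]++;
        if (needed[i] > letterCounts[i]) {   // this bucket exceeds what's available
            return false;
        }
    }
    return true;                             // every letter of the word is covered
}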
You can use an array to represent the letter set. Each element in the array stands for a letter; to convert a letter to its element position, you just subtract the character code of 'a' (or 'A' for upper case). The first element then stands for 'a', the second for 'b', and so on (if you also need upper case, a second block of elements can start at the 27th position for 'A'). The element value holds the number of occurrences. For example, the array {2, 0, 1, 0, ...} represents the multiset {a, a, c}. The pseudo code could be:
for each word:
    copy the array to a new one
    for each letter in the word:
        get the element position of the letter: position = letter - 'a'
        decrease the element value in the new array by one: new_array[position]--
        if the value is negative, return not found: if new_array[position] < 0 { return not found; }
Sort the set, then sort each word and do a "merge"-like operation.
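A sketch of that merge idea (both arrays sorted ascending; mergeCheck is just an illustrative name):

// Returns true if sortedWord's letters are contained (with multiplicity) in sortedSet.
static boolean mergeCheck(char[] sortedWord, char[] sortedSet) {
    int i = 0, j = 0;
    while (i < sortedWord.length && j < sortedSet.length) {
        if (sortedWord[i] == sortedSet[j]) {         // letter matched, consume both
            i++;
            j++;
        } else if (sortedWord[i] > sortedSet[j]) {   // skip an unused set letter
            j++;
        } else {                                     // needed letter is missing from the set
            return false;
        }
    }
    return i == sortedWord.length;                   // all letters of the word were matched
}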

Printing match from tree

I am trying to create a "word completion" tree java program from a dictionary that is a text file but I am not sure where to go from here. The word completion program will match any words that start with the string entered. I am new to java/ programming. I have designed the tree as a multi way tree with each node storing a character as a letter and a boolean variable to indicate if it is the end of the word (amongst other things).
I am at the point where I am trying to see if my reading in of the file into the tree is working correct. However when I try to print my tree, it is not working correctly. It is not displaying the first letter correctly in every word after the first word. Instead of reading in from file, for testing purposes I am simply adding only 4 words to tree (Base, Basement, Ma, Matthew).
So my question is can anyone tell me why it is not printing correctly AND what I need to do next in order to finish the word completion?
Thank you so much in advance to everyone for taking the time to help me with my problem
It's this part:
while (t != null) {
    if (t.down != null && t.right != null) {
        //System.out.println(t.letter + " children");
        //System.out.print(t.letter);
        print(t.down);
    }
    t = t.right;
}
When you encounter another word that you should print, you start it from t.down. You could, for example, store all the letters on the path down to that node on a shared stack, print them, and then proceed to printing the other letters from the tree.
The issue here is: t.down is the next letter (from the point of view of the current node) in some other word.
Try adding more words with common starting substrings to see the point easily.
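A sketch of that idea (assuming node fields letter, down, right and an end-of-word flag called endOfWord, matching the description in the question; the StringBuilder plays the role of the shared stack):

// Prints every complete word in the tree, carrying the letters seen so far.
static void print(Node t, StringBuilder prefix) {
    while (t != null) {
        prefix.append(t.letter);                  // push this node's letter
        if (t.endOfWord) {
            System.out.println(prefix);           // a whole word ends at this node
        }
        print(t.down, prefix);                    // longer words sharing this prefix
        prefix.deleteCharAt(prefix.length() - 1); // pop before moving to a sibling
        t = t.right;
    }
}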

Auto Suggest: Substring matching

I am trying to implement auto suggest using a ternary search tree (TST), but a TST is useful when we are looking for prefix searches. How can we implement auto suggest for substring matches as well?
Is there any other data structure which can be used?
Example of a substring match:
When I am trying to search for UML using auto suggest, even the string "Beginners Guide for UML" should match.
This is off the top of my head, not a proper, proven data structure/algorithm:
Select a mapping of all legal characters to N symbols (for simplicity: 26 symbols for Latin letters and similar non-Latin letters, ignoring case, + 1 for non-letters = 27 symbols in total).
From your dictionary, create a shallow tree with a max branching factor of N (i.e. quite high). Leaf nodes would contain references to all words which contain the symbol combo leading from the root to that leaf (intermediate nodes might contain references to words which are shorter than the depth of a leaf node, or you could just ignore words that short).
The depth of the tree would be variable, probably in the range 1..4, so that each leaf node would contain about the same number of words (the same word is of course listed under many leaves, like MATCH under the leaves MAT, ATC and TCH if the tree depth happened to be 3).
When the user is entering letters, follow the tree as far as it goes, until you're left with a relatively small set of words. Then do linear filtering on the remaining words once you're at a leaf of the tree and the user enters more text to match. Optionally, filter out symbol matches which actually aren't character matches, though it might be nice to also match äöO when the user enters ao0, etc.
Optimize the number of symbols you map your chars to, to get a good branching factor for the tree. Optimize the words per leaf to get decent memory usage without having too many words to filter linearly after reaching a leaf of the tree.
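A very rough sketch of that indexing idea, simplified to a plain HashMap from 3-character keys to phrases instead of an explicit tree (depth fixed at 3, lower-casing as the symbol mapping; all names here are made up for illustration):

import java.util.*;

// Build: every phrase is indexed under each 3-character window it contains.
static Map<String, Set<String>> buildIndex(List<String> phrases) {
    Map<String, Set<String>> index = new HashMap<>();
    for (String phrase : phrases) {
        String key = phrase.toLowerCase();           // crude symbol mapping: lower-case chars
        for (int i = 0; i + 3 <= key.length(); i++) {
            index.computeIfAbsent(key.substring(i, i + 3), k -> new LinkedHashSet<>())
                 .add(phrase);
        }
    }
    return index;
}

// Query: follow the first 3 typed characters into the index, then filter linearly.
static List<String> suggest(Map<String, Set<String>> index, String typed) {
    if (typed.length() < 3) {
        return Collections.emptyList();              // too short to reach a "leaf"
    }
    String lower = typed.toLowerCase();
    List<String> result = new ArrayList<>();
    for (String phrase : index.getOrDefault(lower.substring(0, 3), Collections.emptySet())) {
        if (phrase.toLowerCase().contains(lower)) {  // linear filter on the remaining text
            result.add(phrase);
        }
    }
    return result;
}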
Of course there are actual researched algorithms for finding a string (what the user entered) in a large piece of text (all the phrases/words you want to match), like Aho–Corasick and Rabin–Karp, which are probably worth investigating.
