I am trying to write a "word completion" tree program in Java from a dictionary stored in a text file, but I am not sure where to go from here. The program should match every word that starts with the string entered. I am new to Java and to programming. I have designed the tree as a multiway tree in which each node stores a character (the letter) and a boolean indicating whether it ends a word (among other things).
I am at the point where I want to check whether reading the file into the tree works correctly. However, when I try to print my tree, the output is wrong: the first letter of every word after the first word is not displayed correctly. For testing purposes, instead of reading from the file I am adding only 4 words to the tree (Base, Basement, Ma, Matthew).
So my question is: can anyone tell me why it is not printing correctly, and what I need to do next in order to finish the word completion?
Thank you so much in advance to everyone for taking the time to help me with my problem.
The problem is in this part:
while(t!=null) {
    if(t.down!=null && t.right!=null) {
        //System.out.println(t.letter + " children");
        //System.out.print(t.letter);
        print(t.down);
    }
    t=t.right;
}
When you reach another word that should be printed, you start printing it from t.down. The issue is that t.down is the next letter (from the point of view of the current node) of some other word, so the letters leading up to it are never printed. You can, for example, store all the letters up to that node on a shared stack, print them, and then proceed to print the remaining letters from the tree.
Try adding more words with common starting substrings and the problem becomes easier to see.
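The stack idea above can be sketched as follows. This is only a minimal illustration, not the asker's actual code: the down and right links follow the snippet in the question, while the TrieNode and Trie classes, the isWord flag, and the insert, complete and collect methods are hypothetical names. A StringBuilder plays the role of the stack of letters seen so far, so every word is emitted with its full prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node layout: first-child (down) / next-sibling (right) links.
class TrieNode {
    char letter;
    boolean isWord;   // assumed name for the end-of-word flag
    TrieNode down;    // first child: a possible next letter
    TrieNode right;   // next sibling: an alternative letter at the same position
    TrieNode(char letter) { this.letter = letter; }
}

class Trie {
    private final TrieNode root = new TrieNode('\0'); // dummy root, holds no letter

    void insert(String word) {
        TrieNode parent = root;
        for (char c : word.toCharArray()) {
            TrieNode child = parent.down;
            while (child != null && child.letter != c) child = child.right;
            if (child == null) {
                child = new TrieNode(c);
                child.right = parent.down; // prepend to the sibling list
                parent.down = child;
            }
            parent = child;
        }
        parent.isWord = true;
    }

    // Returns every stored word that starts with the given prefix.
    List<String> complete(String prefix) {
        List<String> out = new ArrayList<>();
        TrieNode node = root;
        for (char c : prefix.toCharArray()) {
            node = node.down;
            while (node != null && node.letter != c) node = node.right;
            if (node == null) return out; // no word has this prefix
        }
        if (node.isWord) out.add(prefix);
        collect(node.down, new StringBuilder(prefix), out);
        return out;
    }

    // path acts as the stack of letters leading to the current node.
    private void collect(TrieNode t, StringBuilder path, List<String> out) {
        for (; t != null; t = t.right) {
            path.append(t.letter);              // push
            if (t.isWord) out.add(path.toString());
            collect(t.down, path, out);
            path.setLength(path.length() - 1);  // pop
        }
    }
}
```

Calling complete("") lists every word in the tree, which is one way to check that loading the dictionary worked.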
Related
I have a school task where part of the task asks us to write a method that finds the 5 most common words in a .txt file.
The task asks us to put all the words in an ArrayList, which I have already done. The real problem is making the program print out the top 5 words in the text file.
The only "clue" I have is the method name, which is:
public Words[] common5(){
}
Iterate through the ArrayList and, for each word in the list, put the word into a HashMap where the key is the word and the value is an Integer that you increase every time you find the word again. At the end, iterate through the HashMap and find the top 5 integers. Print the words that belong to them.
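A minimal sketch of that counting approach is shown here. The Words type from the task is not shown in the question, so this version returns plain strings; the WordCounter class and mostCommon method are hypothetical names, and breaking ties alphabetically is an assumption added to make the result deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordCounter {
    // Returns up to `limit` words, most frequent first (ties broken alphabetically).
    static List<String> mostCommon(List<String> words, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

For the school task, mostCommon(words, 5) would give the five words to print, and wrapping them in the Words type is left to the real method.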
I have to solve a problem that creates an adjacency matrix and then finds the shortest path between two strings that the user enters. I have already read the data file full of strings and have built the adjacency matrix for the case where the shortest path between strings is one. I am confused about how to do this if the shortest path is 2, 3, 4, 5, etc. Two strings are connected if the first word's last three characters, last two characters, or last character match the second word's first three characters, first two characters, or first character.
An example I was given is "everyday" and "daytime" because the last and first three match.
If the last two and first two match an example is "brother" and "eraser".
If the last character and first character matches an example is "scorpion" and "night".
int i,j;
String[] s = new String[sizeOfFile]; // the words read from the data file
int[][] a = new int[sizeOfFile][sizeOfFile];
for(i=0;i<sizeOfFile;i++)
{
for(j=0;j<sizeOfFile;j++)
{
if((s[i].charAt(s[i].length()-3) == s[j].charAt(0) && s[i].charAt(s[i].length()-2) == s[j].charAt(1) && s[i].charAt(s[i].length()-1) == s[j].charAt(2)))
{
a[i][j]=1;
}
else if(s[i].charAt(s[i].length()-1) == s[j].charAt(1) && s[i].charAt(s[i].length()-2) == s[j].charAt(0))
{
a[i][j]=1;
}
else if(s[i].charAt(s[i].length()-1) == s[j].charAt(0))
{
a[i][j]=1;
}
else
{
a[i][j]=0;
}
//Prints adjacency matrix
}
}
First, to find the shortest path without wasted effort, you need to do a breadth-first search, not a depth-first search.
E.g. to illustrate what I mean, imagine a directory search for a file by name. You search the current directory and for each sub-directory you call the search method recursively. That is a depth-first search, because it'll search the full hierarchy of the first sub-directory before continuing, and that's not what you want.
Instead, collect all the sub-directory names, then if file not found yet, iterate the list to search each sub-directory, collecting the list of sub-sub-directories, and if file still not found, repeat using new list.
That is what you want. Collect a list of all the words that are "connected" to the current word, then for each word in that list, build a new list of all words connected to those words, and repeat until the target word is found.
That was the overall algorithm. The search for connected words should be done by adding all words to a TreeSet.
For each suffix (1-, 2-, or 3-letter), call subSet(suffix, nextSuffix), where nextSuffix is the same as suffix except last character incremented by one.
Check each subset for the target word and stop if found.
Keep a separate HashSet called used of words found so far, to prevent infinite looping and to prevent using longer paths to a word that was already found using a shorter path. Remove any words in the subset that are already in the used set.
Add unused words in subset to used set, and (together with "path" of words traveled to get to the word) add to list of words to process on next iteration.
Keep iterating until done, or list is empty (no path found).
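The steps above can be sketched roughly as follows. This is one possible reading of the answer, not a definitive implementation: the WordPath class and shortestPath method are hypothetical names, and a cameFrom map stands in for both the used set and the stored "path" of words, since each word remembers the word it was reached from. The sketch assumes no suffix ends in the largest possible char value, because incrementing it would overflow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class WordPath {
    // Breadth-first search; returns the words on a shortest path, or an empty list.
    static List<String> shortestPath(Collection<String> words, String start, String target) {
        TreeSet<String> sorted = new TreeSet<>(words);
        Map<String, String> cameFrom = new HashMap<>(); // word -> predecessor; doubles as the used set
        cameFrom.put(start, null);
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                List<String> path = new ArrayList<>();
                for (String w = current; w != null; w = cameFrom.get(w)) path.add(w);
                Collections.reverse(path);
                return path;
            }
            // Try the 1-, 2- and 3-letter suffixes of the current word.
            for (int k = 1; k <= 3 && k <= current.length(); k++) {
                String suffix = current.substring(current.length() - k);
                // Same as suffix, but with the last character incremented by one.
                String nextSuffix = suffix.substring(0, k - 1) + (char) (suffix.charAt(k - 1) + 1);
                // subSet(from, to) holds exactly the words that start with suffix.
                for (String neighbor : sorted.subSet(suffix, nextSuffix)) {
                    if (!cameFrom.containsKey(neighbor)) {
                        cameFrom.put(neighbor, current);
                        queue.add(neighbor);
                    }
                }
            }
        }
        return Collections.emptyList(); // queue ran empty: no path found
    }
}
```

With the example words from the question, a search from "everyday" to "eraser" goes through "daytime", because "daytime" ends in "e" and "eraser" starts with it.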
I have a text file with thousands and thousands of lines of gibberish. Hidden somewhere inside is a string of words in English.
What would be the most efficient way to search through the text without having to read it line by line?
Is there a script I could write to read through the file?
I can post the file if anyone's interested.
edit: If someone would be willing to show me how to check for words with a BufferedReader in Java, that would be really cool!
If you know nothing more than that there is one streak of valid English words somewhere in the file, you will have to read in the file and check each word against a set of valid words (a dictionary). On the first hit, you continue to read in the file until the first non-valid word occurs.
This assumes there are no accidentally valid words within the gibberish. If there are, you would have to find all streaks of valid words, and then probably have a human (you) decide which is the right one.
edit: another thing you can do is define a minimum streak length n, if you know that the string of words you are looking for consists of at least n valid words. This could at least spare you dealing with all the false-positive 1-word streaks of single accidentally valid words within the gibberish.
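Since the question asks about BufferedReader, here is a minimal sketch of the streak search under a few assumptions: the dictionary is already loaded into a Set of lowercase words, tokens are separated by whitespace, and a streak may continue across a line break. The StreakFinder class and the findStreaks and flush methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StreakFinder {
    // Returns every run of at least minLength consecutive dictionary words.
    static List<String> findStreaks(Reader source, Set<String> dictionary, int minLength) {
        List<String> streaks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String token : line.split("\\s+")) {
                    if (token.isEmpty()) continue; // produced by leading whitespace
                    if (dictionary.contains(token.toLowerCase())) {
                        current.add(token);        // streak continues
                    } else {
                        flush(current, minLength, streaks); // streak broken by gibberish
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(current, minLength, streaks); // a streak may run to the end of the file
        return streaks;
    }

    private static void flush(List<String> current, int minLength, List<String> streaks) {
        if (!current.isEmpty() && current.size() >= minLength) {
            streaks.add(String.join(" ", current));
        }
        current.clear();
    }
}
```

For a real file, the source could be a java.io.FileReader; raising minLength filters out the short accidental streaks.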
I need to write a parser for text files (at least 20 kb), and I need to determine whether words out of a set of words (about 400 words and numbers) appear in the text file. So I am looking for the most efficient way to do this (if a match is found, I need to do some further processing of this line and its previous line).
What I currently do is exclude lines that definitely do not contain any information (metadata-like lines) and then compare word by word, but I don't think that comparing only word by word is the most efficient way.
Can anyone please provide some tips/hints/ideas/...
Thank you very much
It depends on what you mean by "efficient".
If you want a very straightforward way to code it, keep in mind that the String object in Java has the method String.contains(CharSequence sequence).
Then, you could put the file content into a String and then iterate on your keywords you want to check to see if any of those appear in String, using the method contains().
How about the following:
Put all your keywords in a HashSet (Set<String> keywords;)
Read the file one line at a time
For each line in file:
    Tokenize to words
    For each word in line:
        If word is contained in keywords (keywords.contains(word))
            Process actual line
            If previous line is available
                Process previous line
    Keep track of previous line (prevLine = line;)
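The outline above could look roughly like this in Java. It is a sketch under assumptions: words are split on non-word characters, matching is case-sensitive, and "processing" is reduced to collecting each matching line together with the line before it. The KeywordScanner class and scan method are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class KeywordScanner {
    // Each hit is {previousLine, matchingLine}; previousLine is null for the first line.
    static List<String[]> scan(Reader source, Set<String> keywords) {
        List<String[]> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String prevLine = null;
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\W+")) {  // tokenize to words
                    if (keywords.contains(word)) {        // constant-time lookup on average
                        hits.add(new String[] {prevLine, line});
                        break;                            // one hit per line is enough here
                    }
                }
                prevLine = line;                          // keep track of the previous line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

Because each word costs one hash lookup, the work grows with the size of the file rather than with the number of keywords.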
For the last project of the semester, the goal is to run searches for a particular phrase on a lyric String inside a Song object, then rank the results based on the length of the substring match. The lyrics were read from a file and match the line breaks in that file.
For example, searching for "She loves you" would return these in the sample matches:
The Beatles: "... She loves you, yeah, yeah, yeah ..." Rank= 13 characters
Bonnie Raitt: "... She just loves you ..." Rank= 18 characters
Elvis Presley: "... You're asking if she loves me\r\nWell, you don't know..." Rank= 23 characters
As you can see from the last example, matches can span multiple lines.
I have all the songs in a TreeMap<String, TreeSet<Song>>, so I get all the songs that match the first word in the query. The difficulty I'm having is searching the String for matches, since a regex won't work in this situation.
When the Song object is constructed, I dump the lyrics into a Set to run searches for a single word, and to do that I use String.split("[^a-zA-Z]") to separate out the individual words and weed out the punctuation marks. So I want to run my search on that array. The process I'm using goes like this:
break up the query into a String array
for each Song in the set
    if (song.lyrics.contains(query))
        great, break loop to next song
    otherwise
        int queryCounter=0;
        find first index point in String array that matches query[queryCounter]
        using that as the start point, iterate through the String array for matches
When the iteration is complete, a Rank object is created to hold the Song, search phrase, start point and end points of the array section that matches. In the Rank object is a method to count the number of characters and compensate for whitespace to calculate the rank. This is then inserted into a PriorityQueue, where the top ten matches will be pulled from the original matchSet.
The problem is that this doesn't prevent false positives, and match ranks can get skewed. For example, Aerosmith's Beyond Beautiful contains "... she loves me she loves you not ..." With my process, I will match "... she loves me she loves you not...", so instead of a rank of 13, I will get a rank of 27.
What changes are necessary for me to weed out the false positives and incorrect rankings?
I would like to add to what jjinguy said:
Basically, in the 'otherwise' block, after you find the first index that matches the start, you also have to look for other possible start points, and reset your start if you find another one.
I would keep a list of all possible matches in a song, and finally use the one that has the best rank. Simply resetting the start point might not catch the match with the best rank.
Maybe that isn't the best way, but the concern is still there.
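One way to keep track of every possible match and pick the best one is sketched here. It is only an illustration under assumptions: the lyrics are already split into a word array, the rank is counted as the matched words plus one space between neighbours (so punctuation and line breaks in the raw lyrics are ignored), and PhraseRanker and bestRank are hypothetical names. Every index that matches the first query word is tried as a start point, and the smallest rank wins.

```java
class PhraseRanker {
    // Returns the smallest rank over all start points, or -1 if the phrase never matches.
    static int bestRank(String[] lyricWords, String[] query) {
        if (query.length == 0) return -1;
        int best = -1;
        for (int start = 0; start < lyricWords.length; start++) {
            if (!lyricWords[start].equalsIgnoreCase(query[0])) continue; // not a start point
            int matched = 0; // number of query words found so far, in order
            int chars = 0;   // characters spanned from start to the current word
            for (int i = start; i < lyricWords.length && matched < query.length; i++) {
                chars += lyricWords[i].length() + (i > start ? 1 : 0); // word plus separating space
                if (lyricWords[i].equalsIgnoreCase(query[matched])) matched++;
            }
            if (matched == query.length && (best == -1 || chars < best)) {
                best = chars; // shorter span than any earlier start point
            }
        }
        return best;
    }
}
```

For the words "she loves me she loves you not" and the query "she loves you", the start at index 0 spans 26 characters and the start at index 3 spans 13, so the sketch returns 13 instead of the inflated rank.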