Find string section that contains another String, with possible intervening words - java

For the last project of the semester, the goal is to run searches of a particular phrase on a lyric String inside an Song object, then rank the results based on the length of the substring match. The lyrics were read from a file and match the line breaks in that file.
For example, searching for "She loves you" would return these in the sample matches:
The Beatles: "... She loves you, yeah, yeah, yeah ..." Rank= 13 characters
Bonnie Raitt: "... She just loves you ..." Rank= 18 characters
Elvis Presley: "... You're asking if she loves me\r\nWell, you don't know..." Rank= 23 characters
As you can see from the last example, matches can span multiple lines.
I have all the songs in a TreeMap<String, TreeSet<Song>>, so I get all the songs that match the first word in the query. The difficulty I'm having is searching the String for matches, since a regex won't work in this situation.
When the Song object is constructed, I dumped the lyrics into a Set to run searches for a single word, and to do that I used String.split("[^a-zA-Z}") to separate out the individual words and weed out the punctuation marks. So I want to run my search on that array. The process I'm using goes like:
break up the query into a String array
for each Song in the set
if (song.lyrics.contains(query)
great, break loop to next song
otherwise
int queryCounter=0;
find first index point in String array that matches query[queryCounter]
using that as the start point, iterate through the String array for matches
When the iteration is complete, a Rank object is created to hold the Song, search phrase, start point and end points of the array section that matches. In the Rank object is a method to count the number of characters and compensate for whitespace to calculate the rank. This is then inserted into a PriorityQueue, where the top ten matches will be pulled from the original matchSet.
The problem is that this doesn't prevent false positives, and match ranks can get skewed. For example, Aerosmith's Beyond Beautiful contains "... she loves me she loves you not ..." With my process, I will match "... she loves me she loves you not...", so instead of a rank of 13, I will get a rank of 27.
What changes are necessary for me to weed out the false positives and incorrect rankings?

I would like to add to what jjinguy said:
Basically, in the 'otherwise' block, after you find the first index that matches the start, you also have to look for possible other start points, and reset your start if you find another one
I would keep a list of all possible matches in a song, and finally use the one that has the best rank. Simply resetting the start point might not catch the match with the best rank.
Maybe that isn't the best way, but the concern is still there.

Related

Why is my String array length 3 instead of 2?

I'm trying to understand regex. I wanted to make a String[] using split to show me how many letters are in a given string expression?
import java.util.*;
import java.io.*;
public class Main {
public static String simpleSymbols(String str) {
String result = "";
String[] alpha = str.split("[\\+\\w\\+]");
int alphaLength = alpha.length;
// System.out.print(alphaLength);
String[] charCount = str.split("[a-z]");
int charCountLength = charCount.length;
System.out.println(charCountLength);
}
}
My input string is "+d+=3=+s+". I split the string to count the number of letters in string. The array length should be two but I'm getting three. Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
So, a few things pop out to me:
First, your regex looks correct. If you're ever worried about how your regex will perform, you can use https://regexr.com/ to check it out. Just put your regex on the top and enter your string in the bottom to see if it is matching correctly
Second, upon close inspection, I see you're using the split function. While it is convenient for quickly splitting strings, you need to be careful as to what you are splitting on. In this case, you're removing all of the strings that you were initially looking at, which would make it impossible to find. If you print it out, you would notice that the following shows (for an input string of +d+=3=+s+):
+
+=3=+
+
Which shows that you accidentally cut out what you were looking to find in the first place. Now, there are several ways of fixing this, depending on what your criteria is.
Now, if what you wanted was just to separate on all +s and it doesn't matter that you find only what is directly bounded by +s, then split works awesome. Just do str.split("+"), and this will return you a list of the following (for +d+=3=+s+):
d
=3=
s
However, you can see that this poses a few problems. First, it doesn't strip out the =3= that we don't want, and second, it does not truly give us values that are surrounded by a +_+ format, where the underscore represents the string/char you're looking for.
Seeing as you're using +w, you intend to find words that are surrounded by +s. However, if you're just looking to find one character, I would suggest using another like [a-z] or [a-zA-Z] to be more specific. However, if you want to find multiple alphabetical characters, your pattern is fine. You can also add a * (0 or more) or a + (1 or more) at the end of the pattern to dictate what exactly you're looking for.
I won't give you the answer outright, but I'll give you a clue as to what to move towards. Try using a pattern and a matcher to find the regex that you listed above and then if you find a match, make sure to store it somewhere :)
Also, for future reference, you should always start a function name with a lower case, at least in Java. Only constants and class names should start in a capital :)
I am trying to use split to count the number of letters in that string. The array length should be two, but I'm getting three.
The regex in the split functions is used as delimiters and will not be shown in results. In your case "str.split([a-z])" means using alphabets as delimiters to separate your input string, which makes three substrings "(+)|d|(+=3=+)|s|(+)".
If you really want to count the number of letters using "split", use 'str.split("[^a-z]")'. But I would recommend using "java.util.regex.Matcher.find()" in order to find out all letters.
Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
Similarly, check the functions in "java.util.regex.Matcher".

Fill in the Blank String

I am studying for an interview and having trouble with this question.
Basically, you have a word that has spaces in it like c_t.
You have a word bank and have to find all the possible words that can be made with the given string. So for in this case, if cat was in the word bank we would return true.
Any help on solving this question (like an optimal algorithm would be appreciated).
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.
Step 1.) Eliminate all words in the wordbook that don't have the same length as the specified one.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented like c_ter_il_ar, for each word left in the bank check if it contains the isolated sequences at those exact same indexes such as ter and il and eliminate those that don't have it
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty
It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.

Method to get position of match in String within a range

My question is very similar to this one
Java: method to get position of a match in a String?
Except that I will be searching through very long strings (hundreds of megabytes) so I would like for the method to give up after a certain index.
Something like this except with an end index as well. Is there a library that provides this functionality?
Ye you can co this in 2 steps.
Get the Substring in which you want to search i.e. from beginning to the endIndex(or Certain index) using subString(0, certainIndex)
Then use indexOf(matchString) to find the location i=of the first occurence of the pattern into the String.

Replacing parts of a string with characters JAVA

I know I was here earlier asking something similar, but I think I have narrowed down what i want to ask.
Ok, so I am making a program that plays the game of hangman on the jedit console. The user will guess one character at a time. At the beginning of the game, the program will display asterisks the same length of the word they are guessing. They have as many guesses as letters in the word. When they get a letter correct, the program will display the letters in place of asterisks. Here is an example of what the console should look like.
if the word is homework ********
they guess the letter e ***e**** (the bold e just happened because stars so that, it doesn't need to be bold)
then they guess the letter h h**e****
etc until there are no more asterisks
So I created a method that prints out the number of asterisks based on the number of letters in the word. I don't know how to place the letters in the place of the asterisks. I want to know if I should make a method that replaces the asterisks, or how else I can go about this. Thank you in advance for the help.
p.s I am not asking for anyone to dump code on me, that is not what I want. Just having help, and me having someone to ask questions to about things that I don't understand would be nice. by the way, I am in an intro to computer science class, so my knowledge of java is fairly low.
There are many ways you could approach this. The first that popped into my head is that you could start with a char[] the same length as the answer string. Look up the Arrays class for an easy way to fill it with asterisks. As the user guesses letters, search the answer string for that letter and replace the corresponding indexes of the char[]. Then construct a String from the char[] and display it.
Why not make something more clever, make a list of all the chars guessed so far and each time you want to print the word just go over each letter and replace it with * if not in the set.
Short: make a set of all the guesses so far. You don't have to work on the same data structure as you show the user.
I would use a list of characters instead of String for ****.
List<Character> hiddenWord = new ArrayList<Character>();
Instantiate the list with the number of * you need.
Create a function that will receive the guessed letter.
Check if the word contains that letter (use indexOf(int ch, int fromIndex) repeatedly until you get -1 - read about it here), and for each result you get that is !=-1, set the position in the array to be that letter (something like hiddenWord.set(poz, letter), where poz is the result of indexOf and letter is the guessed letter).
You can use StringBuffer insead String. In class StringBuffer exists method setCharAt.
Breifly, you will have variable String word - for guessing word, and StringBuilder guess for asterisks and guesed letters. When letter is guessed you will update guess with setCharAt.

Algorithm and Data Structure for Checking letters in a word with another set of letters

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!
For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s}, I wanna check if word "big" and "buff". All letters of "big" are a subset of the original set then "big" is what i want while "buff" is not what i want because there is only one "f" in the original set.
This is what i wanna do.
This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.
[Note that because we sorted the words, they lose their uniqueness, so one sorted word can point to multiple different words. ie your Trie needs to store a list or array at its data nodes]
You can save this structure out if you need to load it quickly later without all the sorting steps.
What you then do is take your input letters and you sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letter, you also allow the current letter to be dropped.
And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.
If you have repeated letters in your input, you may need extra logic to prevent multiple instances of the same word being given (unless you want that). That logic can be invoked during the step that 'leaves out' a letter (you just skip past all the repeated letters) to the next letter.
[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.
This is simple.
Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.
You can use an array to mark the letter set. Each element in the array stands for a letter. To convert the letter to the element position, just need to subtract the ASCII code of 'a' or 'A'. Then the first element stands for 'a', then the second is 'b', and so on. Then the 27th is 'A'. The element value stands for the occurrences. For example, the array {2, 0, 1, 0, ...} stands for like {a, c, a}. The pseudo code could be:
for each word
copy the array to a new one
for each letter in the word
get the element position of the letter: position = letter - 'a'
decrease the element value in the new array by one: new_array[position]--
if the value is negative, return not found: if array[position] < 0 {return not found;}
sort the set, then sort each word and do a "merge"-like operation

Categories