There are many questions like this one out there on stackoverflow, I know.
Despite that, I cannot seem to find a clue for a solution.
I have to write a class for a program that can decrypt Vigenère using the Kasiski test. My job is to write the class "Repetition", which finds repetitions of character chains in a string, among other things.
Example: String s = "abcfghkjngfabcdfgkjdfabc"
Repetitions must be at least 3 characters long, like "abc" in this case. If it makes any difference, the longer the character chain the better.
I have been reading about Maps for some time now, but the solutions I found deal with words that need to be counted, separated by commas or spaces, which does not work for my String.
I am not asking for complete code, since I'm eager to find the exact solution myself (I'd like to get better at programming, not copy and paste) - a clue about where to search would already help me out. (By the way, I am a newbie to programming.)
Thanks in advance.
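Since only a pointer is wanted: the usual starting point is a Map from every length-3 substring to the list of positions where it occurs; any entry with two or more positions is a repetition. A minimal sketch of that idea (only the class name "Repetition" comes from the question; the method name and everything else is hypothetical):

```java
import java.util.*;

public class Repetition {
    // Collect the start positions of every length-n substring; entries
    // with fewer than two positions are not repetitions and are dropped.
    static Map<String, List<Integer>> findRepetitions(String s, int n) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            positions.computeIfAbsent(s.substring(i, i + n), k -> new ArrayList<>())
                     .add(i);
        }
        positions.values().removeIf(p -> p.size() < 2);
        return positions;
    }

    public static void main(String[] args) {
        String s = "abcfghkjngfabcdfgkjdfabc";
        System.out.println(findRepetitions(s, 3)); // {abc=[0, 11, 21]}
    }
}
```

For the Kasiski test, the distances between those positions (11 and 10 here, i.e. 11 - 0 and 21 - 11) are what you would then factor.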
Related
We all know the regular balanced brackets algorithm and it's appeared here in many variations, but I have another twist.
I know how to answer it with or without a stack, and I have read all of the threads relating to this problem here and on Google, but I haven't found any answer that matches my problem.
Let's invent a different kind of math: instead of '()', '{}' and '[]' we use '**', '$$' and '##'. Now show an algorithm to check whether the brackets are balanced.
for example: *##$$##*** is legit, equivalent to ([[{}]])() or even ([]{}[])().
It probably wouldn't be the best notation if this were real math, since the same symbol would mean different things, but it's not math we're dealing with...
I tried taking the solution for the 'regular' symbols and adjusting it to work with these new symbols, but so far I've failed. The problem, of course, is that we can't tell an opening symbol from a closing one. Can anyone suggest a solution?
This was given to me as one of the questions in a job interview, right after I had solved it with the regular symbols. Not that it matters, but I was asked to answer in Java.
You can just be greedy and pop any time the top of the stack matches the current character. Since opening and closing symbols look identical, a symmetric pair can only ever be closed by its own character, so popping on a match is always safe.
def is_valid(string):
    stack = []
    for char in string:
        if stack and char == stack[-1]:
            stack.pop()
        else:
            stack.append(char)  # lists use append, not push
    return not stack
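Since the interview asked for Java, the same greedy idea translates directly (a sketch using ArrayDeque; the class and method names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SymmetricBrackets {
    // Greedy matcher for symmetric bracket symbols such as '*', '$', '#':
    // pop whenever the current character equals the top of the stack,
    // otherwise push it. The string is balanced iff the stack ends empty.
    static boolean isValid(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : s.toCharArray()) {
            if (!stack.isEmpty() && stack.peek() == c) {
                stack.pop();
            } else {
                stack.push(c);
            }
        }
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isValid("*##$$##***")); // true
        System.out.println(isValid("*#*#"));       // false
    }
}
```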
So I have a long string of chars for example - "wdllwdwwwlldd"
The string only contains those same chars - w, l and d (try and guess what I'm doing ;))
The string will be quite long, approx 420 chars long.
I want to find, if they exist, any patterns in the string.
For example if the string was - "wllddwllddwlldd"
then it "wlldd" would be the pattern that was found.
So I basically want to find any repeated sequences in the string.
Having done a bit of research, suffix trees and suffix arrays seem to get mentioned a lot on these problems.
Is that correct, or is there another way to do this?
I can tell that this is quite a large task and could potentially take a long time.
Thanks in advance.
So what you want is to extract all occurrences of some pattern from some string, did I get your point? If so, something very similar was discussed in this thread. It should at least send you in the right direction.
In your case, a regular expression such as w+l+d+ should do the trick.
EDIT
The question has been clarified a bit more ... so the algorithm you're looking for is explained in detail in this post.
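For the special case where the whole string is one unit repeated end to end (like "wllddwllddwlldd"), a regex backreference can recover the shortest repeating unit; this is an additional sketch, not from the linked post, and for repeats buried inside longer strings the suffix-tree/suffix-array route mentioned in the question scales better:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedPattern {
    // "(.+?)\1+" anchored over the whole string: a lazily-grown group
    // followed by one or more repetitions of itself. Returns the unit,
    // or null if the string is not a pure repetition.
    static String repeatingUnit(String s) {
        Matcher m = Pattern.compile("(.+?)\\1+").matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatingUnit("wllddwllddwlldd")); // wlldd
        System.out.println(repeatingUnit("wdllwdw"));         // null
    }
}
```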
This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here are a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]*a[^\s]*e[^\s]*n\b. Note that this is the bare regular expression; in Java source you have to double the backslashes: "\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b". The * (rather than +) lets the plain word "faen" match as well, while the leading \b keeps "sofaen" out, since there is no word boundary between "so" and "f".
Note also that this isn't perfect, but does handle the situations that you have suggested.
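A sketch of how that pattern could be wired up against the four examples from the question (class and method names are illustrative):

```java
import java.util.regex.Pattern;

public class CurseFilter {
    // The word boundary before 'f' rules out "sofaen" (no boundary between
    // 'o' and 'f'), while [^\s]* tolerates stray vowels and punctuation
    // inside the word. As noted above, this is not a perfect filter.
    static final Pattern BAD = Pattern.compile("\\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b");

    static boolean isBad(String message) {
        return BAD.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(isBad("String containing faen"));                      // true
        System.out.println(isBad("String containing sofaen"));                    // false
        System.out.println(isBad("Non-letter-censored string with f-a#a-e.n"));   // true
        System.out.println(isBad("Non-letter-censored string with sof-a#a-e.n")); // false
    }
}
```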
It's a terrible idea to begin with. You think your users would write something like "f-aeen" to avoid your filter, but would not come up with "ffaen" or "-faen" or whatever variation you did not prepare for? This is a race you cannot win, and the real loser is usability.
Let's say I have about 1000 sentences that I want to offer as suggestions when the user is typing into a field.
I was thinking about running an in-memory Lucene search and then feeding the results into the suggestion set.
The trigger for running the searches would be the space character and exit from the input field.
I intend to use this with GWT, so the client will just be getting the results from the server.
I don't want to do what Google is doing, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on that. Sort of like when I'm typing the title for a question here on Stack Overflow.
Did anyone do something like this before? Is there already library I could use?
I was working on a similar solution. The paper titled Effective Phrase Prediction was quite helpful for me. You will have to prioritize the suggestions as well.
If you've only got 1000 sentences, you probably don't need a powerful indexer like lucene. I'm not sure whether you want to do "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions. Here are solutions to both:
Assuming that you want to complete the sentence input by the user, then you could put all of your strings into a SortedSet, and use the tailSet method to get a list of strings that are "greater" than the input string (since the string comparator considers a longer string A that starts with string B to be "greater" than B). Then, iterate over the top few entries of the set returned by tailSet to create a set of strings where the first inputString.length() characters match the input string. You can stop iterating as soon as the first inputString.length() characters don't match the input string.
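The tailSet idea above could be sketched like this (names are illustrative, not from the original post):

```java
import java.util.*;

public class SentenceSuggester {
    // Prefix completion via SortedSet.tailSet: every string that starts
    // with the prefix sorts at or after the prefix itself, so we walk the
    // tail set and stop as soon as an entry no longer shares the prefix.
    static List<String> suggest(SortedSet<String> sentences, String prefix, int max) {
        List<String> out = new ArrayList<>();
        for (String s : sentences.tailSet(prefix)) {
            if (!s.startsWith(prefix) || out.size() >= max) break;
            out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        SortedSet<String> sentences = new TreeSet<>(List.of(
            "how do i sort a list",
            "how do i split a string",
            "what is a suffix tree"));
        System.out.println(suggest(sentences, "how do i s", 5));
        // [how do i sort a list, how do i split a string]
    }
}
```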
If you want to do keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are, and how many unique words there are in the sentences. If this set is small enough, you'll be able to get away with a HashMap<String,Set<String>>, where you mapped keywords to the sentences that contained them. Then you could handle multiword queries by intersecting the sets.
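And the keyword variant, as an inverted index with set intersection (again a sketch with made-up names, splitting on whitespace for simplicity):

```java
import java.util.*;

public class KeywordIndex {
    // Inverted index from lower-cased keyword to the sentences containing
    // it; a multiword query is answered by intersecting the per-word sets.
    static Map<String, Set<String>> build(List<String> sentences) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("\\s+")) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(sentence);
            }
        }
        return index;
    }

    static Set<String> query(Map<String, Set<String>> index, String... words) {
        Set<String> result = null;
        for (String w : words) {
            Set<String> hits = index.getOrDefault(w.toLowerCase(), Set.of());
            if (result == null) result = new HashSet<>(hits);
            else result.retainAll(hits);  // intersect with the next word's set
        }
        return result == null ? Set.of() : result;
    }
}
```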
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). I don't think either solution would scale to hundreds of thousands of suggestions either. Do either of those do what you want? Happy to provide code if you'd like it.
I happened to answer a question where the original problem involved splitting a sentence into separate words.
The author suggested using BreakIterator to tokenize input strings, and some people liked this idea.
I just don't get the madness: how can 25 lines of complicated code be better than a simple one-liner with a regexp?
Please explain the pros of using BreakIterator, and the real cases when it should be used.
If it's really so cool and proper, then I wonder: do you actually use the BreakIterator approach in your projects?
From looking at the code posted at that answer, it looks like BreakIterator takes into consideration the language and locale of the text. Getting that level of support via regex will surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
The BreakIterator gives you some nice explicit control and iterates cleanly, in a nested way, over each sentence and word. I'm not familiar with exactly what specifying the locale does for you, but I'm sure it's quite helpful sometimes as well.
It didn't strike me as complicated at all. Just set up one iterator for the sentence level, another for the word level, and nest the word one inside the sentence one.
If the problem changed into something different, the solution you had on the other question might just have gone out the window. That pattern of iterating through sentences and words, however, can do a lot.
Find the sentence in which any single word occurs the most times, and output it along with that word.
Find the word used the most times throughout the whole string.
Find all words that occur in every sentence.
Find all words that occur a prime number of times in two or more sentences.
The list goes on...
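For reference, the nested sentence/word iteration described above might look like this (a minimal sketch; filtering tokens to those starting with a letter, to skip spaces and punctuation, is my own assumption):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceWords {
    // Outer BreakIterator over sentences, inner one over the words of each
    // sentence. The Locale selects which language's break rules apply.
    static List<List<String>> tokenize(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next(); sEnd != BreakIterator.DONE;
                 sStart = sEnd, sEnd = sentenceIt.next()) {
            String sentence = text.substring(sStart, sEnd);
            List<String> words = new ArrayList<>();
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                     wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                // keep only real words, not spaces or punctuation
                if (Character.isLetter(token.charAt(0))) {
                    words.add(token);
                }
            }
            result.add(words);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He sat down. She stood up.", Locale.US));
    }
}
```

With that list-of-lists in hand, each of the tasks above reduces to counting words per sentence or across sentences.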