Problem
I want to decode a message encrypted with classic Viginere. I know that the key has a length of exactly 6 characters.
The message is:
BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM
Question
I tried a brute-force approach but unfortunately this yields an extreme amount of combinations, too many to compute.
Do you have any idea how to go from here or how to approach this problem in general?
Attempt
Here is what i have so far:
public class Main {
// instance variables - replace the example below with your own
private String message;
private String answer;
private String first;
/**
* Constructor for objects of class Main
*/
public Main()
{
// initialise instance variables
message ="BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM";
for (int x = 0; x < message.length() / 6; x++) {
int index = x * 6;
first = new StringBuilder()
.append(first)
.append(message.charAt(index))
.toString();
}
System.out.println(first);
}
}
Non-text message
In case the raw message is not actual text (like english text that makes sense) or you have no information about its content, you will be out of luck.
Especially if the text is actually hashed or double-encrypted, i.e. random stuff.
Breaking an encryption scheme requires knowledge about the algorithm and the messages. Especially in your situation, you will need to know the general structure of your messages in order to break it.
Prerequisites
For the rest of this answer, let me assume your message is actually plain english text. Note that you can easily adopt my answer to other languages. Or even adopt the techniques to other message formats.
Let me also assume that you are talking about classic Vigenere (see Wikipedia) and not about one of its many variants. That means that your input consists only of the letters A to Z, no case, no interpunction, no spaces. Example:
MYNAMEISJOHN // Instead of: My name is John.
The same also applies to your key, it only contains A to Z.
Classic Viginere then shifts by the offset in the alphabet, modulo the alphabet size (which is 26).
Example:
(G + L) % 26 = R
Dictionary
Before we talk about attacks we need to find a way to, given a generated key, find out whether it is actually correct or not.
Since we know that the message consists of english text, we can just take a dictionary (a huge list of all valid english words) and compare our decrypted message against the dictionary. If the key was wrong, the resulting message will not contain valid words (or only a few).
This can be a bit tricky since we lack interpunction (in particular, no spaces).
N-grams
Good thing that there is actually a very accurate way of measuring how valid a text is, which also solves the issue with the missing interpunction.
The technique is called N-grams (see Wikipedia). You choose a value for N, for example 3 (then called tri-grams) and start splitting your text into pairs of 3 characters. Example:
MYNAMEISJOHN // results in the trigrams:
$$M, $$MY, MYN, YNA, NAM, AME, MEI, ISJ, SJO, JOH, OHN, HN$, N$$
What you need now is a frequency analysis of the most common tri-grams in english text. There exist various sources online (or you can run it yourself on a big text corpus).
Then you simply compare your tri-gram frequency to the frequency for real text. Using that, you compute a score of how well your frequency matches the real frequency. If your message contains a lot of very uncommon tri-grams, it is highly likely to be garbage data and not real text.
A small note, mono-grams (1-gram) result in a single character frequency (see Wikipedia#Letter frequency). Bi-grams (2-gram) are used commonly for cracking Viginere and yield good results.
Attacks
Brute-Force
The first and most straightforward attack is always brute-force. And, as long as the key and the alphabet is not that big, the amount of combinations is relatively low.
Your key has length 6, the alphabet has size 26. So the amount of different key combinations is 6^26, which is
170_581_728_179_578_208_256
So about 10^20. This number might appear huge, but do not forget that CPUs operate already in the Gigahertz range (10^9 operations per second, per core). That means that a single core with 1 GHz will have generated all solutions in about 317 years. Now replace that by a powerful CPU (or even GPU) and with a multi-core machine (there are clusters with millions of cores), then this is computed in less than a day.
But okay, I get that you most likely do not have access to such a hardcore cluster. So a full brute-force is not feasible.
But do not worry. There are simple tricks to speed this up. You do not have to compute the full key. How about limiting yourself to the first 3 characters instead of the full 6 characters. You will only be able to decrypt a subset of the text then, but it is enough to analyze whether the outcome is valid text or not (using dictionaries and N-grams, as mentioned before).
This small change already drastically cuts down computation time since you then only have 3^26 combinations. Generating those takes around 2 minutes for a single 1 GHz core.
But you can do even more. Some characters are extremely rare in english text, for example Z. You can simply start by not considering keys that would translate to those values in the text. Let us say you remove the 6 least common characters by that, then your combinations are only 3^20. This takes around 100 milliseconds for a single 1 GHz core. Yes, milliseconds. That is fast enough for your average laptop.
Frequency Attack
Enough brute-force, let us do something clever. A letter frequency attack is a very common attack against those encryption schemes. It is simple, extremely fast and very successful. In fact, it is so simple that there are quite some online tools that offer this for free, for example guballa.de/vigenere-solver (it is able to crack your specific example, I just tried it out).
While Viginere changes the message to unreadable garbage, it does not change the distribution of letters, at least not per digit of the key. So if you look at, let's say the second digit of your key, from there on, every sixth letter (length of the key) in the message will be shifted by the exact same offset.
Let us take a look at a simple example. The key is BAC and the message is
CCC CCC CCC CCC CCC // raw
DCF DCF DCF DCF DCF // decrypted
As you notice, the letters repeat. Looking at the third letter, it is always F. So that means that the sixth and ninth letter, which are also F, all must be the exact same original letter. Since they where all shifted by the C from the key.
That is a very important observation. It means that letter frequency is, within a multiple of a digit of the key (k * (i + key_length)), preserved.
Let us now take a look at the letter distribution in english text (from Wikipedia):
All you have to do now is to split your message into its blocks (modulo key-length) and do a frequency analysis per digit of the blocks.
So for your specific input, this yields the blocks
BYOIZR
LAUMYX
XPFLPW
BZLMLQ
PBJMSC
...
Now you analyze the frequency for digit 1 of each block, then digit 2, and so on, until digit 6. For the first digit, this are the letters
B, L, X, B, P, ...
The result for your specific input is:
[B=0.150, E=0.107, X=0.093, L=0.079, Q=0.079, P=0.071, K=0.064, I=0.050, O=0.050, R=0.043, F=0.036, J=0.036, A=0.029, S=0.029, Y=0.021, Z=0.021, C=0.014, T=0.014, D=0.007, V=0.007]
[L=0.129, O=0.100, H=0.093, A=0.079, V=0.071, Y=0.071, B=0.057, K=0.057, U=0.050, F=0.043, P=0.043, S=0.043, Z=0.043, D=0.029, W=0.029, N=0.021, C=0.014, I=0.014, J=0.007, T=0.007]
[W=0.157, Z=0.093, K=0.079, L=0.079, V=0.079, A=0.071, G=0.071, J=0.064, O=0.050, X=0.050, D=0.043, U=0.043, S=0.036, Q=0.021, E=0.014, F=0.014, N=0.014, M=0.007, T=0.007, Y=0.007]
[M=0.150, P=0.100, Q=0.100, I=0.079, B=0.071, Z=0.071, L=0.064, W=0.064, K=0.057, V=0.043, E=0.036, A=0.029, C=0.029, N=0.029, U=0.021, H=0.014, S=0.014, D=0.007, G=0.007, J=0.007, T=0.007]
[L=0.136, Y=0.100, A=0.086, O=0.086, P=0.086, U=0.086, H=0.064, K=0.057, V=0.050, Z=0.050, S=0.043, J=0.029, M=0.021, T=0.021, W=0.021, G=0.014, I=0.014, B=0.007, C=0.007, N=0.007, R=0.007, X=0.007]
[I=0.129, M=0.107, X=0.100, L=0.086, W=0.079, S=0.064, R=0.057, H=0.050, Q=0.050, K=0.043, E=0.036, C=0.029, T=0.029, V=0.029, F=0.021, J=0.021, P=0.021, G=0.014, Y=0.014, A=0.007, D=0.007, O=0.007]
Look at it. You see that for the first digit the letter B is very common, 15%. And then letter E with 10% and so on. There is a high chance that letter B, for the first digit of the key, is an alias for E in the real text (since E is the most common letter in english text) and that the E stands for the second most common letter, namely T.
Using that you can easily reverse-compute the letter of the key used for encryption. It is obtained by
B - E % 26 = X
Note that your message distribution might not necessary align with the real distribution over all english text. Especially if the message is not that long (the longer, the more accurate is the distribution computation) or mainly consists of weird and unusual words.
You can counter that by trying out a few combinations among the highest of your distribution. So for the first digit you could try out whether
B -> E
E -> E
X -> E
L -> E
Or instead of mapping to E only, also try out the second most common character T:
B -> T
E -> T
X -> T
L -> T
The amount of combinations you get with that is very low. Use dictionaries and N-grams (as mentioned before) to validate whether the key is correct or not.
Java Implementation
Your message is actually very interesting. It perfectly aligns with the real letter frequency over english text. So for your particular case you actually do not need to try out any combinations, nor do you need to do any dictionary/n-gram checks. You can actually just translate the most common letter in your encrypted message (per digit) to the most common character in english text, E, and get the real actual key.
Since that is so simple and trivial, here is a full implementation in Java for what I explained before step by step, with some debug outputs (it is a quick prototype, not really nicely structured):
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public final class CrackViginere {
private static final int ALPHABET_SIZE = 26;
private static final char FIRST_CHAR_IN_ALPHABET = 'A';
public static void main(final String[] args) {
String encrypted =
"BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM";
int keyLength = 6;
char mostCommonCharOverall = 'E';
// Blocks
List<String> blocks = new ArrayList<>();
for (int startIndex = 0; startIndex < encrypted.length(); startIndex += keyLength) {
int endIndex = Math.min(startIndex + keyLength, encrypted.length());
String block = encrypted.substring(startIndex, endIndex);
blocks.add(block);
}
System.out.println("Individual blocks are:");
blocks.forEach(System.out::println);
// Frequency
List<Map<Character, Integer>> digitToCounts = Stream.generate(HashMap<Character, Integer>::new)
.limit(keyLength)
.collect(Collectors.toList());
for (String block : blocks) {
for (int i = 0; i < block.length(); i++) {
char c = block.charAt(i);
Map<Character, Integer> counts = digitToCounts.get(i);
counts.compute(c, (character, count) -> count == null ? 1 : count + 1);
}
}
List<List<CharacterFrequency>> digitToFrequencies = new ArrayList<>();
for (Map<Character, Integer> counts : digitToCounts) {
int totalCharacterCount = counts.values()
.stream()
.mapToInt(Integer::intValue)
.sum();
List<CharacterFrequency> frequencies = new ArrayList<>();
for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
double frequency = entry.getValue() / (double) totalCharacterCount;
frequencies.add(new CharacterFrequency(entry.getKey(), frequency));
}
Collections.sort(frequencies);
digitToFrequencies.add(frequencies);
}
System.out.println("Frequency distribution for each digit is:");
digitToFrequencies.forEach(System.out::println);
// Guessing
StringBuilder keyBuilder = new StringBuilder();
for (List<CharacterFrequency> frequencies : digitToFrequencies) {
char mostFrequentChar = frequencies.get(0)
.getCharacter();
int keyInt = mostFrequentChar - mostCommonCharOverall;
keyInt = keyInt >= 0 ? keyInt : keyInt + ALPHABET_SIZE;
char key = (char) (FIRST_CHAR_IN_ALPHABET + keyInt);
keyBuilder.append(key);
}
String key = keyBuilder.toString();
System.out.println("The guessed key is: " + key);
System.out.println("Decrypted message:");
System.out.println(decrypt(encrypted, key));
}
private static String decrypt(String encryptedMessage, String key) {
StringBuilder decryptBuilder = new StringBuilder(encryptedMessage.length());
int digit = 0;
for (char encryptedChar : encryptedMessage.toCharArray())
{
char keyForDigit = key.charAt(digit);
int decryptedCharInt = encryptedChar - keyForDigit;
decryptedCharInt = decryptedCharInt >= 0 ? decryptedCharInt : decryptedCharInt + ALPHABET_SIZE;
char decryptedChar = (char) (decryptedCharInt + FIRST_CHAR_IN_ALPHABET);
decryptBuilder.append(decryptedChar);
digit = (digit + 1) % key.length();
}
return decryptBuilder.toString();
}
private static class CharacterFrequency implements Comparable<CharacterFrequency> {
private final char character;
private final double frequency;
private CharacterFrequency(final char character, final double frequency) {
this.character = character;
this.frequency = frequency;
}
#Override
public int compareTo(final CharacterFrequency o) {
return -1 * Double.compare(frequency, o.frequency);
}
private char getCharacter() {
return character;
}
private double getFrequency() {
return frequency;
}
#Override
public String toString() {
return character + "=" + String.format("%.3f", frequency);
}
}
}
Decrypted
Using above code, the key is:
XHSIHE
And the full decrypted message is:
ERWASNOTCERTAINDISESTEEMSURELYTHENHEMIGHTHAVEREGARDEDTHATABHORRENCEOFTHEUNINTACTSTATEWHICHHEHADINHERITEDWITHTHECREEDOFMYSTICISMASATLEASTOPENTOCORRECTIONWHENTHERESULTWASDUETOTREACHERYAREMORSESTRUCKINTOHIMTHEWORDSOFIZZHUETTNEVERQUITESTILLEDINHISMEMORYCAMEBACKTOHIMHEHADASKEDIZZIFSHELOVEDHIMANDSHEHADREPLIEDINTHEAFFIRMATIVEDIDSHELOVEHIMMORETHANTESSDIDNOSHEHADREPLIEDTESSWOULDLAYDOWNHERLIFEFORHIMANDSHEHERSELFCOULDDONOMOREHETHOUGHTOFTESSASSHEHADAPPEAREDONTHEDAYOFTHEWEDDINGHOWHEREYESHADLINGEREDUPONHIMHOWSHEHADHUNGUPONHISWORDSASIFTHEYWEREAGODSANDDURINGTHETERRIBLEEVENINGOVERTHEHEARTHWHENHERSIMPLESOULUNCOVEREDITSELFTOHISHOWPITIFULHERFACEHADLOOKEDBYTHERAYSOFTHEFIREINHERINABILITYTOREALIZETHATHISLOVEANDPROTECTIONCOULDPOSSIBLYBEWITHDRAWNTHUSFROMBEINGHERCRITICHEGREWTOBEHERADVOCATECYNICALTHINGSHEHADUTTEREDTOHIMSELFABOUTHERBUTNOMANCANBEALWAYSACYNI
Which is more or less valid english text:
er was not certain disesteem surely then he might have regarded that
abhorrence of the unintact state which he had inherited with the creed
of my sticismas at least open to correction when the result was due to
treachery are morse struck into him the words of izz huett never quite
still ed in his memory came back to him he had asked izz if she loved
him and she had replied in the affirmative did she love him more than
tess did no she had replied tess would lay down her life for him and she
herself could do no more he thought of tess as she had appeared on the day
of the wedding how here yes had lingered upon him how she had hung upon
his words as if they were a gods and during the terrible evening over
the hearth when her simple soul uncovered itself to his how pitiful her
face had looked by the rays of the fire inherinability to realize that
his love and protection could possibly be withdrawn thus from being her
critiche grew to be her advocate cynical things he had uttered to
himself about her but noman can be always acyn I
Which, by the way, is a quote from the british novel Tess of the d'Urbervilles: A Pure Woman Faithfully Presented. Phase the Sixth: The Convert, Chapter XLIX.
Standard Vigenere interleaves Caesar shift cyphers, specified by the key. If the Vigenere key is six characters long, then letters 1, 7, 13, ... of the ciphertext are on one Caesar shift -- every sixth character uses the first character of the key. Letter 2, 8, 14 ... of the ciphertext use a different (in general) Caesar shift and so on.
That gives you six different Caesar shift ciphers to solve. The text will not be in English, due to picking every sixth letter, so you will need to solve it by letter frequency. That will give you a few good options for each position of the key. Try them in order of probability to see which gives the correct decryption.
I have a dictionary with many words. And i hope search the longest concatenated word (that is, the longest word that is comprised entirely of
shorter words in the file). I give the method a descending word from their length. How can I check that all the symbols have been used from the dictionary?
public boolean tryMatch(String s, List dictionary) {
String nextWord = new String();
int contaned = 0;
//Цикл перебирающий каждое слово словаря
for(int i = 1; i < dictionary.size();i++) {
nextWord = (String) dictionary.get(i);
if (nextWord == s) {
nextWord = (String) dictionary.get(i + 1);
}
if (s.contains(nextWord)) {
contaned++;
}
}
if(contaned >1) {
return true;
}
return false;
}
If you have a sorted list of words, finding compound words is easy, but it will only perform well if the words are in a Set.
Let's look at the compound word football, and of course assume that both ball and foot are in the work list.
By definition, any compound word using foot as the first sub-word must start with foot.
So, when iterating the list, remember the current active "stem" words, e.g. when seeing foot, remember it.
Now, when seeing football, you check if the word starts with the stem word. If not, clear the stem word, and make new word the stem word.
If it does, the new word (football) is a candidate for being a compound word. The part after the stem is ball, so we need to check if that is a word, and if so, we found a compound word.
Checking is easy for simple case, i.e. wordSet.contains(remain).
However, compound words can be made up of more than 2 words, e.g. whatsoever. So after finding that it is a candidate from the stem word what, the remain is soever.
You can simply try all lengths of that (soever, soeve, soev, soe, so, s), and if one of the shorter ones are words, you repeat the process.
I just attempted a programming challenge, which I was not able to successfully complete. The specification is to read 2 lines of input from System.in.
A list of 1-100 space separated words, all of the same length and between 1-10 characters.
A string up to a million characters in length, which contains a permutation of the above list just once. Return the index of where this permutation begins in the string.
For example, we may have:
dog cat rat
abcratdogcattgh
3
Where 3 is the result (as printed by System.out).
It's legal to have a duplicated word in the list:
dog cat rat cat
abccatratdogzzzzdogcatratcat
16
The code that I produced worked providing that the word that the answer begins with has not occurred previously. In the 2nd example here, my code will fail because dog has already appeared before where the answer begins at index 16.
My theory was to:
Find the index where each word occurs in the string
Extract this substring (as we have a number of known words with a known length, this is possible)
Check that all of the words occur in the substring
If they do, return the index that this substring occurs in the original string
Here is my code (it should be compilable):
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Solution {
public static void main(String[] args) throws Exception {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line = br.readLine();
String[] l = line.split(" ");
String s = br.readLine();
int wl = l[0].length();
int len = wl * l.length;
int sl = s.length();
for (String word : l) {
int i = s.indexOf(word);
int z = i;
//while (i != -1) {
int y = i + len;
if (y <= sl) {
String sub = s.substring(i, y);
if (containsAllWords(l, sub)) {
System.out.println(s.indexOf(sub));
System.exit(0);
}
}
//z+= wl;
//i = s.indexOf(word, z);
//}
}
System.out.println("-1");
}
private static boolean containsAllWords(String[] l, String s) {
String s2 = s;
for (String word : l) {
s2 = s2.replaceFirst(word, "");
}
if (s2.equals(""))
return true;
return false;
}
}
I am able to solve my issue and make it pass the 2nd example by un-commenting the while loop. However this has serious performance implications. When we have an input of 100 words at 10 characters each and a string of 1000000 characters, the time taken to complete is just awful.
Given that each case in the test bench has a maximum execution time, the addition of the while loop would cause the test to fail on the basis of not completing the execution in time.
What would be a better way to approach and solve this problem? I feel defeated.
If you concatenate the strings together and use the new string to search with.
String a = "dog"
String b = "cat"
String c = a+b; //output of c would be "dogcat"
Like this you would overcome the problem of dog appearing somewhere.
But this wouldn't work if catdog is a valid value too.
Here is an approach (pseudo code)
stringArray keys(n) = {"cat", "dog", "rat", "roo", ...};
string bigString(1000000);
L = strlen(keys[0]); // since all are same length
int indices(n, 1000000/L); // much too big - but safe if only one word repeated over and over
for each s in keys
f = -1
do:
f = find s in bigString starting at f+1 // use bigString.indexOf(s, f+1)
write index of f to indices
until no more found
When you are all done, you will have a series of indices (location of first letter of match). Now comes the tricky part. Since the words are all the same length, we're looking for a sequence of indices that are all spaced the same way, in the 10 different "collections". This is a little bit tedious but it should complete in a finite time. Note that it's faster to do it this way than to keep comparing strings (comparing numbers is faster than making sure a complete string is matched, obviously). I would again break it into two parts - first find "any sequence of 10 matches", then "see if this is a unique permutation".
sIndx = sort(indices(:))
dsIndx = diff(sIndx);
sequence = find {n} * 10 in dsIndx
for each s in sequence
check if unique permutation
I hope this gets you going.
Perhaps not the best optimized version, but how about following theory to give you some ideas:
Count length of all words in row.
Take random word from list and find the starting index of its first
occurence.
Take a substring with length counted above before and after that
index (e.g. if index is 15 and 3 words of 4 letters long, take
substring from 15-8 to 15+11).
Make a copy of the word list with earlier random word removed.
Check the appending/prepending [word_length] letters to see if they
match a new word on the list.
If word matches copy of list, remove it from copy of list and move further
If all words found, break loop.
If not all words found, find starting index of next occurence of
earlier random word and go back to 3.
Why it would help:
Which word you pick to begin with wouldn't matter, since every word
needs to be in the succcessful match anyway.
You don't have to manually loop through a lot of the characters,
unless there are lots of near complete false matches.
As a supposed match keeps growing, you have less words on the list copy left to compare to.
Can also keep track or furthest index you've gone to, so you can
sometimes limit the backwards length of picked substring (as it
cannot overlap to where you've already been, if the occurence are
closeby to each other).
I was asked in interview following question. I could not figure out how to approach this question. Please guide me.
Question: How to know whether a string can be segmented into two strings - like breadbanana is segmentable into bread and banana, while breadbanan is not. You will be given a dictionary which contains all the valid words.
Build a trie of the words you have in the dictionary, which will make searching faster.
Search the tree according to the following letters of your input string. When you've found a word, which is in the tree, recursively start from the position after that word in the input string. If you get to the end of the input string, you've found one possible fragmentation. If you got stuck, come back and recursively try another words.
EDIT: sorry, missed the fact, that there must be just two words.
In this case, limit the recursion depth to 2.
The pseudocode for 2 words would be:
T = trie of words in the dictionary
for every word in T, which can be found going down the tree by choosing the next letter of the input string each time we move to the child:
p <- length(word)
if T contains input_string[p:length(intput_string)]:
return true
return false
Assuming you can go down to a child node in the trie in O(1) (ascii indexes of children), you can find all prefixes of the input string in O(n+p), where p is the number of prefixes, and n the length of the input. Upper bound on this is O(n+m), where m is the number of words in dictionary. Checking for containing will take O(w) where w is the length of word, for which the upper bound would be m, so the time complexity of the algorithm is O(nm), since O(n) is distributed in the first phase between all found words.
But because we can't find more than n words in the first phase, the complexity is also limited to O(n^2).
So the search complexity would be O(n*min(n, m))
Before that you need to build the trie which will take O(s), where s is the sum of lengths of words in the dictionary. The upper bound on this is O(n*m), since the maximum length of every word is n.
you go through your dictionary and compare every term as a substring with the original term e.g. "breadbanana". If the first term matches with the first substring, cut the first term out of the original search term and compare the next dictionary entries with the rest of the original term...
let me try to explain that in java:
e.g.
String dictTerm = "bread";
String original = "breadbanana";
// first part matches
if (dictTerm.equals(original.substring(0, dictTerm.length()))) {
// first part matches, get the rest
String lastPart = original.substring(dictTerm.length());
String nextDictTerm = "banana";
if (nextDictTerm.equals(lastPart)) {
System.out.println("String " + original +
" contains the dictionary terms " +
dictTerm + " and " + lastPart);
}
}
The simplest solution:
Split the string between every pair of consecutive characters and see whether or not both substrings (to the left of the split point and to the right of it) are in the dictionary.
One approach could be:
Put all elements of dictionary in some set or list
now you can use contains & substring function to remove words which matches dictionary. if at the end string is null -> string can be segmented else not. You can also take care of count.
public boolean canBeSegmented(String s) {
for (String word : dictionary.getWords()) {
if (s.contains(word) {
String sub = s.subString(0, s.indexOf(word));
s = sub + s.subString(s.indexOf(word)+word.length(), s.length()-1);
}
return s.equals("");
}
}
This code checks if your given String can be fully segmented. It checks if a word from the dictionary is inside your string and then subtracks it. If you want to segment it in the process you have to order the subtracted sementents in the order they are inside the word.
Just two words makes it easier:
public boolean canBeSegmented(String s) {
boolean wordDetected = false;
for (String word : dictionary.getWords()) {
if (s.contains(word) {
String sub = s.subString(0, s.indexOf(word));
s = sub + s.subString(s.indexOf(word)+word.length(), s.length()-1);
if(!wordDetected)
wordDetected = true;
else
return s.equals("");
}
return false;
}
}
This code checks for one Word and if there is another word in the String and just these two words it returns true otherwise false.
this is a mere idea , you can implement it better if you want
package farzi;
import java.util.ArrayList;
public class StringPossibility {
public static void main(String[] args) {
String str = "breadbanana";
ArrayList<String> dict = new ArrayList<String>();
dict.add("bread");
dict.add("banana");
for(int i=0;i<str.length();i++)
{
String word1 = str.substring(0,i);
String word2 = str.substring(i,str.length());
System.out.println(word1+"===>>>"+word2);
if(dict.contains(word1))
{
System.out.println("word 1 found : "+word1+" at index "+i);
}
if(dict.contains(word2))
{
System.out.println("word 2 found : "+ word2+" at index "+i);
}
}
}
}