Shortening a string in Java

I have a requirement to shorten a 6 character string like "ABC123" into a unique 4 character string. It has to be repeatable so that the input string will always generate the same output string. Does anyone have any ideas how to do this?

It is not possible to do a fully unique mapping from a 6 character string to a 4 character string. This is an example of a simple hash function. Because the range space is smaller than the domain space, you are necessarily going to have some hash collisions. You can try to minimize the number of collisions based on the type of data you're going to be accepting, but ultimately it's impossible to map every 6 character string to a unique 4 character string: you would run out of 4 character strings.

You need some restrictions on the input string, otherwise math will inevitably bite you.
For example, let's assume you know that it consists of upper case letters and digits only. Therefore, there are 36^6 possible input strings.
The result needs to have fewer restrictions, e.g. you allow 216 different characters (printable extended ASCII or something like that).
By pure coincidence, 216^4 = 36^6, so what you need is a mapping. That's easy, just use the algorithm for converting number representations from one radix to another.
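A minimal sketch of that radix conversion, assuming the input is exactly six characters from 0-9 and A-Z. The class name RadixPack and the choice of output characters (216 consecutive code points starting at U+0100) are assumptions made for illustration; any 216 distinct characters would do.

```java
import java.util.Locale;

class RadixPack {
    private static final int OUT_BASE = 216;
    // Assumption: output digit d is written as the character U+0100 + d.
    private static final int OFFSET = 0x100;

    static String pack(String s) {
        if (!s.matches("[0-9A-Z]{6}")) {
            throw new IllegalArgumentException("expected 6 characters from 0-9, A-Z");
        }
        long n = Long.parseLong(s, 36);      // read the input as a base 36 number
        char[] out = new char[4];
        for (int i = 3; i >= 0; i--) {       // write it as 4 base 216 digits
            out[i] = (char) (OFFSET + n % OUT_BASE);
            n /= OUT_BASE;
        }
        return new String(out);
    }

    static String unpack(String packed) {
        long n = 0;
        for (char c : packed.toCharArray()) {
            n = n * OUT_BASE + (c - OFFSET);
        }
        String s = Long.toString(n, 36).toUpperCase(Locale.ROOT);
        return "000000".substring(s.length()) + s; // restore leading zeros
    }
}
```

Because 36^6 equals 216^4 exactly, every input gets its own output, and unpack(pack(s)) returns s again.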

Not sure this can be done, as I would bet there are some business constraints (like a user has to be able to type in the key).
The idea is to "hash" down the value into a smaller number of places. This requires a character set large enough to handle all combinations.
Let's assume the original key is case insensitive: you have 26 + 10 = 36 symbols, raised to the 6th power (2,176,782,336 unique combinations). To match this in only 4 characters, you have to use a character set with 216 unique characters, as 216 ^ 4 is 2,176,782,336, the first fourth power that covers all combinations of a case insensitive key with numbers. (Case sensitivity plus numerics only takes you to 62.)
If we take the standard US keyboard, we have 26 letters x 2 cases = 52
10 numbers
10 special characters on number keys
11 other special character keys * 2 = 22
This is 94 unique characters, or less than half the uniques you need just to get a case insensitive 6 digit code into 4 digits. Now, on the Planet Klingon, where keyboards are much larger ... ;-)
If the key is case sensitive, your character set has to expand to 489 unique characters to fit in a 4 digit "hash". Ouch!
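The alphabet sizes can be double-checked with a few lines of arithmetic. This sketch (the class and method names are made up for illustration) searches for the smallest output alphabet whose short strings cover every possible input string:

```java
import java.math.BigInteger;

class AlphabetSize {
    // Smallest k such that k^outputLength >= inputAlphabet^inputLength.
    static int minAlphabet(int inputAlphabet, int inputLength, int outputLength) {
        BigInteger needed = BigInteger.valueOf(inputAlphabet).pow(inputLength);
        int k = 1;
        while (BigInteger.valueOf(k).pow(outputLength).compareTo(needed) < 0) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        System.out.println(minAlphabet(36, 6, 4)); // letters without case, plus digits
        System.out.println(minAlphabet(62, 6, 4)); // letters with case, plus digits
    }
}
```

The first call gives 216 and the second 489, both well above the 94 characters of a standard US keyboard.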

Assumption: The input string can only have characters with ASCII decimal values below 128... otherwise, as others have stated, this won't work.
public class Foo {
    public static int crunch(String str) {
        int y = 0;
        // Only the first 6 characters contribute to the result.
        int limit = Math.min(str.length(), 6);
        for (int i = 0; i < limit; ++i) {
            // Weight each character by its distance from the end.
            y += str.charAt(i) * (limit - i);
        }
        return y;
    }
    public static void main(String[] args) {
        String[] words = new String[]{
            "abcdef", "acdefb", "fedcba", "}}}}}}", "ZZZZZZ", "123", "!"
        };
        for (int idx = 0; idx < words.length; ++idx) {
            System.out.printf("in=%-6s out=%04d\n",
                    words[idx], crunch(words[idx]));
        }
    }
}
Generates:
in=abcdef out=2072
in=acdefb out=2082
in=fedcba out=2107
in=}}}}}} out=2625
in=ZZZZZZ out=1890
in=123    out=0298
in=!      out=0033

You have to make assumptions about the range of values the characters can have and what is an acceptable encoded character. There are any number of ways you can do this. You could pack the String to 1, 2, 3, 4 or 5 characters depending on your assumptions.
One simple example which would give you 4 characters is to assume the last three letters are a number.
public static String pack(String text) {
    // Keep the 3 letters and store the 3-digit number as a single char.
    return text.substring(0, 3) + (char) Integer.parseInt(text.substring(3));
}
public static String unpack(String text) {
    // Adding 1000 and dropping the leading "1" restores leading zeros.
    return text.substring(0, 3) + ("" + (1000 + text.charAt(3))).substring(1);
}
public static void main(String[] args) {
    String text = "ABC123";
    String packed = pack(text);
    System.out.println("packed length= " + packed.length());
    String unpacked = unpack(packed);
    System.out.println("unpacked= '" + unpacked + '\'');
}
prints
packed length= 4
unpacked= 'ABC123'

Related

Extremely compact UUID (using all alphanumeric characters)

I need an extremely compact UUID, the shorter the better.
To that end, I wrote:
public String getBase36UIID() {
    // More compact version of UUID
    String strUUID = UUID.randomUUID().toString().replace("-", "");
    return new BigInteger(strUUID, 16).toString(36);
}
By executing this code, I get, for example:
5luppaye6086d5wp4fqyz57xb
That's good, but it's not the best. Base 36 uses all numeric digits and lowercase letters, but does not use uppercase letters.
If it were possible to use uppercase letters as separate digits from lowercase letters, it would be possible to theorize a numerical base 62, composed of these digits:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
I could theorize numerical bases also using accented characters such as "è" or "é", or special characters such as "$" or "!", further increasing the number of digits available.
The use of these accented or special characters, however, may cause me problems, so for the moment I prefer not to consider them.
After all these premises, how can I convert the BigInteger representing my UUID into the base 62 above theorized, in order to make it even more compact? Thanks
I have already verified that a code like the following is not usable, because every base over 36 is treated as base 10:
return new BigInteger(strUUID, 16).toString(62);
After all, in mathematics there is no base 62 as I imagined it, but I suppose that in Java it can be created.

The general algorithm for converting a number to any base is based on division with remainder.
You start by dividing the number by the base. The remainder gives you the last digit of the number - you map it to a symbol. If the quotient is nonzero, you divide it by the base. The remainder gives you the second to last digit. And you repeat the process with the quotient.
In Java, with BigInteger:
String toBase62(BigInteger number) {
    String symbols = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    BigInteger base = BigInteger.valueOf(symbols.length());
    StringBuilder result = new StringBuilder();
    do {
        BigInteger[] quotientAndRemainder = number.divideAndRemainder(base);
        number = quotientAndRemainder[0];
        result.append(symbols.charAt(quotientAndRemainder[1].intValue()));
    } while (number.compareTo(BigInteger.ZERO) > 0);
    return result.reverse().toString();
}
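Going the other way uses the same symbol table: multiply the running value by the base and add each digit. A sketch, where the class name Base62 and the method fromBase62 are assumptions for illustration and not part of the code above:

```java
import java.math.BigInteger;

class Base62 {
    private static final String SYMBOLS =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static BigInteger fromBase62(String text) {
        BigInteger base = BigInteger.valueOf(SYMBOLS.length());
        BigInteger result = BigInteger.ZERO;
        for (char c : text.toCharArray()) {
            int digit = SYMBOLS.indexOf(c);   // position in the table is the digit value
            if (digit < 0) {
                throw new IllegalArgumentException("not a base 62 symbol: " + c);
            }
            result = result.multiply(base).add(BigInteger.valueOf(digit));
        }
        return result;
    }
}
```

Calling toString(16) on the returned BigInteger gives back the hexadecimal digits of the UUID, although leading zeros are lost and would have to be padded back to 32 digits.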
Do you need the identifier to be a UUID though? Couldn't it be just any random sequence of letters and numbers? If that's acceptable, you don't have to deal with number base conversions.
String randomString(int length) {
    String symbols = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    Random rnd = new Random();
    StringBuilder str = new StringBuilder();
    for (int i = 0; i < length; i++) {
        str.append(symbols.charAt(rnd.nextInt(symbols.length())));
    }
    return str.toString();
}

This should not be difficult. Converting a number to a string is a basic programming task. The fact that you're using base 62 makes no difference.
Decide how many characters you're willing to use, and then convert your large number to that base. Map each "digit" onto one of the characters.
Pseudocode:
b = the base (say, 62)
valid_chars = an array of 'b' characters
u = the uuid
result = ""
while u != 0:
    digit = u % b
    char = valid_chars[digit]
    result = char + result
    u = u / b   (integer division)
This produces the digits right-to-left but you should get the idea.

The main idea is the same as in the previous posts, but the implementation has some differences.
Also note that if you want a different occurrence probability for each character, this can be adjusted as well (mainly by adding a character more times to the data structure, which changes its probability).
Here each character has a fair probability (equal, 1/62):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class RCode {
    String symbols = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public static void main(String[] args)
    {
        RCode r = new RCode();
        System.out.println("symbols="+r.symbols.length());
        System.out.println("code_10(+1)="+r.generate(10));
        System.out.println("code_70(+2)="+r.generate(70));
        //System.out.println("code_124(+3)="+r.generate(124));
    }
    public String generate(int length)
    {
        int num = length/symbols.length()+1;
        List<Character> list = new ArrayList<Character>();
        for(int i=0; i<symbols.length(); i++)
        {
            //if needed to change probability of char occurrence then adapt here
            for(int j=0;j<=num;j++)
            {
                list.add(symbols.charAt(i));
            }
        }
        //basically is the same as random
        Collections.shuffle(list);
        StringBuffer sb = new StringBuffer();
        for(int i=0; i<length; i++)
        {
            sb.append(list.get(i));
        }
        return sb.toString();
    }
}
Output:
symbols=62
//each char is added once(+1)
code_10(+1)=hFW9ZFEAeU
code_70(+2)=hrHQCEdQ3F28apcJPnfjAaOu55Xso12xabkJ7MrU97U0HYkYhWwGEqVAiLOp3X3QSuq6qp
Note: The algorithm has a defect; try to figure out why the sequence aaaaaaaaaa will never be generated for length 10. It is easy to fix, but I was focused on the idea.
As it stands, each character can appear at most num + 1 times in the output. (The output is still random, and maybe it will be useful to someone.)

Breaking Vigenere only knowing key length

Problem
I want to decode a message encrypted with classic Vigenere. I know that the key has a length of exactly 6 characters.
The message is:
BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM
Question
I tried a brute-force approach but unfortunately this yields an extreme amount of combinations, too many to compute.
Do you have any idea how to go from here or how to approach this problem in general?
Attempt
Here is what I have so far:
public class Main {
    // instance variables - replace the example below with your own
    private String message;
    private String answer;
    private String first;
    /**
     * Constructor for objects of class Main
     */
    public Main()
    {
        // initialise instance variables
message ="BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM";
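        // Collect every sixth letter, i.e. the letters encrypted with the first key character.
        // A local builder avoids appending the initial null value of the field.
        StringBuilder firstLetters = new StringBuilder();
        for (int x = 0; x < message.length() / 6; x++) {
            int index = x * 6;
            firstLetters.append(message.charAt(index));
        }
        first = firstLetters.toString();
        System.out.println(first);
    }
}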

Non-text message
In case the raw message is not actual text (like english text that makes sense) or you have no information about its content, you will be out of luck.
Especially if the text is actually hashed or double-encrypted, i.e. random stuff.
Breaking an encryption scheme requires knowledge about the algorithm and the messages. Especially in your situation, you will need to know the general structure of your messages in order to break it.
Prerequisites
For the rest of this answer, let me assume your message is actually plain English text. Note that you can easily adapt my answer to other languages, or even adapt the techniques to other message formats.
Let me also assume that you are talking about classic Vigenere (see Wikipedia) and not about one of its many variants. That means that your input consists only of the letters A to Z, no case, no punctuation, no spaces. Example:
MYNAMEISJOHN // Instead of: My name is John.
The same also applies to your key, it only contains A to Z.
Classic Vigenere then shifts each letter by the offset of the key letter in the alphabet, modulo the alphabet size (which is 26).
Example:
(G + L) % 26 = R
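The shift can be written in a few lines of Java. This is only a sketch under the assumptions stated above (text and key consist of the letters A to Z only); the class name Vigenere and its method names are chosen for illustration:

```java
class Vigenere {
    private static final int ALPHABET_SIZE = 26;

    static String encrypt(String plain, String key) {
        return shift(plain, key, 1);
    }

    static String decrypt(String cipher, String key) {
        return shift(cipher, key, -1);
    }

    // Shifts every letter by the matching key letter, forwards or backwards.
    private static String shift(String text, String key, int direction) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            int t = text.charAt(i) - 'A';
            int k = key.charAt(i % key.length()) - 'A';   // the key repeats
            out.append((char) ('A' + Math.floorMod(t + direction * k, ALPHABET_SIZE)));
        }
        return out.toString();
    }
}
```

Math.floorMod is used instead of % so that the result stays in the range 0 to 25 when decrypting, where the difference can be negative.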
Dictionary
Before we talk about attacks we need to find a way to, given a generated key, find out whether it is actually correct or not.
Since we know that the message consists of english text, we can just take a dictionary (a huge list of all valid english words) and compare our decrypted message against the dictionary. If the key was wrong, the resulting message will not contain valid words (or only a few).
This can be a bit tricky since we lack punctuation (in particular, no spaces).
N-grams
Good thing that there is actually a very accurate way of measuring how valid a text is, which also solves the issue with the missing punctuation.
The technique is called N-grams (see Wikipedia). You choose a value for N, for example 3 (then called tri-grams) and start splitting your text into overlapping groups of 3 characters. Example:
MYNAMEISJOHN // results in the trigrams:
$$M, $MY, MYN, YNA, NAM, AME, MEI, EIS, ISJ, SJO, JOH, OHN, HN$, N$$
What you need now is a frequency analysis of the most common tri-grams in english text. There exist various sources online (or you can run it yourself on a big text corpus).
Then you simply compare your tri-gram frequency to the frequency for real text. Using that, you compute a score of how well your frequency matches the real frequency. If your message contains a lot of very uncommon tri-grams, it is highly likely to be garbage data and not real text.
A small note, mono-grams (1-gram) result in a single character frequency (see Wikipedia#Letter frequency). Bi-grams (2-gram) are used commonly for cracking Vigenere and yield good results.
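A sketch of how such a score could be computed. The table of trigram log-probabilities is assumed to come from a corpus you analyzed yourself, so it is passed in as a parameter here; the names NGramScore, trigrams and score, and the floor value used for unseen trigrams, are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class NGramScore {
    // Splits the text into overlapping trigrams, padded with '$' at both ends.
    static List<String> trigrams(String text) {
        String padded = "$$" + text + "$$";
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            result.add(padded.substring(i, i + 3));
        }
        return result;
    }

    // Sums the log-probability of every trigram; unseen trigrams get the floor value.
    // A higher (less negative) sum means the text looks more like the reference corpus.
    static double score(String text, Map<String, Double> logProbabilities, double floor) {
        double sum = 0;
        for (String trigram : trigrams(text)) {
            sum += logProbabilities.getOrDefault(trigram, floor);
        }
        return sum;
    }
}
```

To compare candidate keys, decrypt the message with each key and keep the key whose text gets the highest score.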
Attacks
Brute-Force
The first and most straightforward attack is always brute-force. And, as long as the key and the alphabet are not that big, the number of combinations is relatively low.
Your key has length 6, the alphabet has size 26. So the number of different key combinations is 26^6, which is
308_915_776
So about 3 * 10^8. This number might appear huge, but do not forget that CPUs already operate in the gigahertz range (roughly 10^9 simple operations per second, per core). Merely enumerating all keys therefore takes on the order of a second on a single core.
The expensive part is that every candidate key also has to be used to decrypt the message and to score the result. With a message of several hundred characters this multiplies the work to the order of 10^11 basic operations or more, so a naive full brute-force quickly becomes slow on an ordinary machine.
But do not worry. There are simple tricks to speed this up. You do not have to compute the full key. How about limiting yourself to the first 3 characters instead of the full 6 characters? You will only be able to decrypt a subset of the text then, but it is enough to analyze whether the outcome is valid text or not (using dictionaries and N-grams, as mentioned before).
This small change already drastically cuts down computation time since you then only have 26^3 = 17_576 combinations.
But you can do even more. Some characters are extremely rare in English text, for example Z. You can simply start by not considering keys that would translate to those values in the text. Let us say this rules out 6 candidates per key position, then only 20^3 = 8_000 combinations remain. That is fast enough for your average laptop.
Frequency Attack
Enough brute-force, let us do something clever. A letter frequency attack is a very common attack against those encryption schemes. It is simple, extremely fast and very successful. In fact, it is so simple that there are quite some online tools that offer this for free, for example guballa.de/vigenere-solver (it is able to crack your specific example, I just tried it out).
While Vigenere changes the message to unreadable garbage, it does not change the distribution of letters, at least not per digit of the key. So if you look at, let's say, the second digit of your key, then starting from there, every sixth letter (the length of the key) in the message will be shifted by the exact same offset.
Let us take a look at a simple example. The key is BAC and the message is
CCC CCC CCC CCC CCC // raw
DCE DCE DCE DCE DCE // encrypted
As you notice, the letters repeat. Looking at the third letter, it is always E. So that means that the sixth and ninth letter, which are also E, all must be the exact same original letter, since they were all shifted by the C from the key.
That is a very important observation. It means that letter frequency is preserved among all letters that share the same digit of the key, that is, among the positions i + k * key_length for a fixed i.
Let us now take a look at the letter distribution in English text (see Wikipedia): E is the most common letter, followed by T.
All you have to do now is to split your message into its blocks (modulo key-length) and do a frequency analysis per digit of the blocks.
So for your specific input, this yields the blocks
BYOIZR
LAUMYX
XPFLPW
BZLMLQ
PBJMSC
...
Now you analyze the frequency for digit 1 of each block, then digit 2, and so on, until digit 6. For the first digit, these are the letters
B, L, X, B, P, ...
The result for your specific input is:
[B=0.150, E=0.107, X=0.093, L=0.079, Q=0.079, P=0.071, K=0.064, I=0.050, O=0.050, R=0.043, F=0.036, J=0.036, A=0.029, S=0.029, Y=0.021, Z=0.021, C=0.014, T=0.014, D=0.007, V=0.007]
[L=0.129, O=0.100, H=0.093, A=0.079, V=0.071, Y=0.071, B=0.057, K=0.057, U=0.050, F=0.043, P=0.043, S=0.043, Z=0.043, D=0.029, W=0.029, N=0.021, C=0.014, I=0.014, J=0.007, T=0.007]
[W=0.157, Z=0.093, K=0.079, L=0.079, V=0.079, A=0.071, G=0.071, J=0.064, O=0.050, X=0.050, D=0.043, U=0.043, S=0.036, Q=0.021, E=0.014, F=0.014, N=0.014, M=0.007, T=0.007, Y=0.007]
[M=0.150, P=0.100, Q=0.100, I=0.079, B=0.071, Z=0.071, L=0.064, W=0.064, K=0.057, V=0.043, E=0.036, A=0.029, C=0.029, N=0.029, U=0.021, H=0.014, S=0.014, D=0.007, G=0.007, J=0.007, T=0.007]
[L=0.136, Y=0.100, A=0.086, O=0.086, P=0.086, U=0.086, H=0.064, K=0.057, V=0.050, Z=0.050, S=0.043, J=0.029, M=0.021, T=0.021, W=0.021, G=0.014, I=0.014, B=0.007, C=0.007, N=0.007, R=0.007, X=0.007]
[I=0.129, M=0.107, X=0.100, L=0.086, W=0.079, S=0.064, R=0.057, H=0.050, Q=0.050, K=0.043, E=0.036, C=0.029, T=0.029, V=0.029, F=0.021, J=0.021, P=0.021, G=0.014, Y=0.014, A=0.007, D=0.007, O=0.007]
Look at it. You see that for the first digit the letter B is very common, 15%. And then letter E with about 11% and so on. There is a high chance that letter B, for the first digit of the key, is an alias for E in the real text (since E is the most common letter in English text) and that the E stands for the second most common letter, namely T.
Using that you can easily reverse-compute the letter of the key used for encryption. It is obtained by
(B - E) % 26 = X
Note that your message distribution might not necessarily align with the real distribution over all English text, especially if the message is not that long (the longer the message, the more accurate the distribution) or mainly consists of weird and unusual words.
You can counter that by trying out a few combinations among the highest of your distribution. So for the first digit you could try out whether
B -> E
E -> E
X -> E
L -> E
Or instead of mapping to E only, also try out the second most common character T:
B -> T
E -> T
X -> T
L -> T
The amount of combinations you get with that is very low. Use dictionaries and N-grams (as mentioned before) to validate whether the key is correct or not.
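A sketch of how those candidate key letters could be listed for one digit of the key. The class and method names are illustrative assumptions; the inputs are the most frequent ciphertext letters of that digit and the plaintext letters you want to try them against:

```java
import java.util.ArrayList;
import java.util.List;

class KeyCandidates {
    // Key letter that turns the assumed plaintext letter into the observed ciphertext letter.
    static char keyLetter(char cipherLetter, char assumedPlainLetter) {
        return (char) ('A' + Math.floorMod(cipherLetter - assumedPlainLetter, 26));
    }

    // Tries every frequent ciphertext letter against every likely plaintext letter,
    // keeping the order of the inputs and dropping duplicates.
    static List<Character> candidates(String topCipherLetters, String likelyPlainLetters) {
        List<Character> result = new ArrayList<>();
        for (char cipher : topCipherLetters.toCharArray()) {
            for (char plain : likelyPlainLetters.toCharArray()) {
                char key = keyLetter(cipher, plain);
                if (!result.contains(key)) {
                    result.add(key);
                }
            }
        }
        return result;
    }
}
```

For the first digit, candidates("BEXL", "ET") yields at most 8 key letters, starting with X (from B -> E). Doing this for all 6 digits gives a small candidate set per position to validate with dictionaries or N-grams.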
Java Implementation
Your message is actually very interesting. It perfectly aligns with the real letter frequency over english text. So for your particular case you actually do not need to try out any combinations, nor do you need to do any dictionary/n-gram checks. You can actually just translate the most common letter in your encrypted message (per digit) to the most common character in english text, E, and get the real actual key.
Since that is so simple and trivial, here is a full implementation in Java for what I explained before step by step, with some debug outputs (it is a quick prototype, not really nicely structured):
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public final class CrackViginere {
    private static final int ALPHABET_SIZE = 26;
    private static final char FIRST_CHAR_IN_ALPHABET = 'A';
    public static void main(final String[] args) {
        String encrypted =
"BYOIZRLAUMYXXPFLPWBZLMLQPBJMSCQOWVOIJPYPALXCWZLKXYVMKXEHLIILLYJMUGBVXBOIRUAVAEZAKBHXBDZQJLELZIKMKOWZPXBKOQALQOWKYIBKGNTCPAAKPWJHKIAPBHKBVTBULWJSOYWKAMLUOPLRQOWZLWRSLEHWABWBVXOLSKOIOFSZLQLYKMZXOBUSPRQVZQTXELOWYHPVXQGDEBWBARBCWZXYFAWAAMISWLPREPKULQLYQKHQBISKRXLOAUOIEHVIZOBKAHGMCZZMSSSLVPPQXUVAOIEHVZLTIPWLPRQOWIMJFYEIAMSLVQKWELDWCIEPEUVVBAZIUXBZKLPHKVKPLLXKJMWPFLVBLWPDGCSHIHQLVAKOWZSMCLXWYLFTSVKWELZMYWBSXKVYIKVWUSJVJMOIQOGCNLQVXBLWPHKAOIEHVIWTBHJMKSKAZMKEVVXBOITLVLPRDOGEOIOLQMZLXKDQUKBYWLBTLUZQTLLDKPLLXKZCUKRWGVOMPDGZKWXZANALBFOMYIXNGLZEKKVCYMKNLPLXBYJQIPBLNMUMKNGDLVQOWPLEOAZEOIKOWZZMJWDMZSRSMVJSSLJMKMQZWTMXLOAAOSTWABPJRSZMYJXJWPHHIVGSLHYFLPLVXFKWMXELXQYIFUZMYMKHTQSMQFLWYIXSAHLXEHLPPWIVNMHRAWJWAIZAAWUGLBDLWSPZAJSCYLOQALAYSEUXEBKNYSJIWQUKELJKYMQPUPLKOLOBVFBOWZHHSVUIAIZFFQJEIAZQUKPOWPHHRALMYIAAGPPQPLDNHFLBLPLVYBLVVQXUUIUFBHDEHCPHUGUM";
        int keyLength = 6;
        char mostCommonCharOverall = 'E';
        // Blocks
        List<String> blocks = new ArrayList<>();
        for (int startIndex = 0; startIndex < encrypted.length(); startIndex += keyLength) {
            int endIndex = Math.min(startIndex + keyLength, encrypted.length());
            String block = encrypted.substring(startIndex, endIndex);
            blocks.add(block);
        }
        System.out.println("Individual blocks are:");
        blocks.forEach(System.out::println);
        // Frequency
        List<Map<Character, Integer>> digitToCounts = Stream.generate(HashMap<Character, Integer>::new)
            .limit(keyLength)
            .collect(Collectors.toList());
        for (String block : blocks) {
            for (int i = 0; i < block.length(); i++) {
                char c = block.charAt(i);
                Map<Character, Integer> counts = digitToCounts.get(i);
                counts.compute(c, (character, count) -> count == null ? 1 : count + 1);
            }
        }
        List<List<CharacterFrequency>> digitToFrequencies = new ArrayList<>();
        for (Map<Character, Integer> counts : digitToCounts) {
            int totalCharacterCount = counts.values()
                .stream()
                .mapToInt(Integer::intValue)
                .sum();
            List<CharacterFrequency> frequencies = new ArrayList<>();
            for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
                double frequency = entry.getValue() / (double) totalCharacterCount;
                frequencies.add(new CharacterFrequency(entry.getKey(), frequency));
            }
            Collections.sort(frequencies);
            digitToFrequencies.add(frequencies);
        }
        System.out.println("Frequency distribution for each digit is:");
        digitToFrequencies.forEach(System.out::println);
        // Guessing
        StringBuilder keyBuilder = new StringBuilder();
        for (List<CharacterFrequency> frequencies : digitToFrequencies) {
            char mostFrequentChar = frequencies.get(0)
                .getCharacter();
            int keyInt = mostFrequentChar - mostCommonCharOverall;
            keyInt = keyInt >= 0 ? keyInt : keyInt + ALPHABET_SIZE;
            char key = (char) (FIRST_CHAR_IN_ALPHABET + keyInt);
            keyBuilder.append(key);
        }
        String key = keyBuilder.toString();
        System.out.println("The guessed key is: " + key);
        System.out.println("Decrypted message:");
        System.out.println(decrypt(encrypted, key));
    }
    private static String decrypt(String encryptedMessage, String key) {
        StringBuilder decryptBuilder = new StringBuilder(encryptedMessage.length());
        int digit = 0;
        for (char encryptedChar : encryptedMessage.toCharArray())
        {
            char keyForDigit = key.charAt(digit);
            int decryptedCharInt = encryptedChar - keyForDigit;
            decryptedCharInt = decryptedCharInt >= 0 ? decryptedCharInt : decryptedCharInt + ALPHABET_SIZE;
            char decryptedChar = (char) (decryptedCharInt + FIRST_CHAR_IN_ALPHABET);
            decryptBuilder.append(decryptedChar);
            digit = (digit + 1) % key.length();
        }
        return decryptBuilder.toString();
    }
    private static class CharacterFrequency implements Comparable<CharacterFrequency> {
        private final char character;
        private final double frequency;
        private CharacterFrequency(final char character, final double frequency) {
            this.character = character;
            this.frequency = frequency;
        }
        @Override
        public int compareTo(final CharacterFrequency o) {
            // Sort descending, most frequent character first
            return -1 * Double.compare(frequency, o.frequency);
        }
        private char getCharacter() {
            return character;
        }
        private double getFrequency() {
            return frequency;
        }
        @Override
        public String toString() {
            return character + "=" + String.format("%.3f", frequency);
        }
    }
}
Decrypted
Using above code, the key is:
XHSIHE
And the full decrypted message is:
ERWASNOTCERTAINDISESTEEMSURELYTHENHEMIGHTHAVEREGARDEDTHATABHORRENCEOFTHEUNINTACTSTATEWHICHHEHADINHERITEDWITHTHECREEDOFMYSTICISMASATLEASTOPENTOCORRECTIONWHENTHERESULTWASDUETOTREACHERYAREMORSESTRUCKINTOHIMTHEWORDSOFIZZHUETTNEVERQUITESTILLEDINHISMEMORYCAMEBACKTOHIMHEHADASKEDIZZIFSHELOVEDHIMANDSHEHADREPLIEDINTHEAFFIRMATIVEDIDSHELOVEHIMMORETHANTESSDIDNOSHEHADREPLIEDTESSWOULDLAYDOWNHERLIFEFORHIMANDSHEHERSELFCOULDDONOMOREHETHOUGHTOFTESSASSHEHADAPPEAREDONTHEDAYOFTHEWEDDINGHOWHEREYESHADLINGEREDUPONHIMHOWSHEHADHUNGUPONHISWORDSASIFTHEYWEREAGODSANDDURINGTHETERRIBLEEVENINGOVERTHEHEARTHWHENHERSIMPLESOULUNCOVEREDITSELFTOHISHOWPITIFULHERFACEHADLOOKEDBYTHERAYSOFTHEFIREINHERINABILITYTOREALIZETHATHISLOVEANDPROTECTIONCOULDPOSSIBLYBEWITHDRAWNTHUSFROMBEINGHERCRITICHEGREWTOBEHERADVOCATECYNICALTHINGSHEHADUTTEREDTOHIMSELFABOUTHERBUTNOMANCANBEALWAYSACYNI
Which is more or less valid english text:
er was not certain disesteem surely then he might have regarded that
abhorrence of the unintact state which he had inherited with the creed
of my sticismas at least open to correction when the result was due to
treachery are morse struck into him the words of izz huett never quite
still ed in his memory came back to him he had asked izz if she loved
him and she had replied in the affirmative did she love him more than
tess did no she had replied tess would lay down her life for him and she
herself could do no more he thought of tess as she had appeared on the day
of the wedding how here yes had lingered upon him how she had hung upon
his words as if they were a gods and during the terrible evening over
the hearth when her simple soul uncovered itself to his how pitiful her
face had looked by the rays of the fire inherinability to realize that
his love and protection could possibly be withdrawn thus from being her
critiche grew to be her advocate cynical things he had uttered to
himself about her but noman can be always acyn I
Which, by the way, is a quote from the British novel Tess of the d'Urbervilles: A Pure Woman Faithfully Presented. Phase the Sixth: The Convert, Chapter XLIX.

Standard Vigenere interleaves Caesar shift ciphers, specified by the key. If the Vigenere key is six characters long, then letters 1, 7, 13, ... of the ciphertext are on one Caesar shift: every sixth character uses the first character of the key. Letters 2, 8, 14, ... of the ciphertext use a different (in general) Caesar shift, and so on.
That gives you six different Caesar shift ciphers to solve. The text will not be in English, due to picking every sixth letter, so you will need to solve it by letter frequency. That will give you a few good options for each position of the key. Try them in order of probability to see which gives the correct decryption.

Converting letters to alphabet position with two JLists

I'm trying to replace all words (alphabet letters) from JList1 with the number corresponding to their place in the alphabet and put the result in JList2 at the press of the Run button. (ex. A to 01) And if it's not an English alphabet letter then leaving it as it is. Capitalization doesn't matter (a and A is still 01) and spaces should be kept.
For visual purposes:
"Apple!" should be converted to "0116161205!"
"stack Overflow" to "1920010311 1522051806121523"
"über" to "ü020518"
I have tried a few methods I found on here, but had zero clue how to add the extra 0 in front of the first 9 letters or keep the spaces. Any help is much appreciated.

Here is a solution:
// Create a map from each letter to its zero-padded position in the alphabet
Map<Character, String> lettersToNumber = new HashMap<>();
int i = 1;
for (char c = 'a'; c <= 'z'; c++) {
    lettersToNumber.put(c, String.format("%02d", i++));
}
// Loop over the characters of your input and append the corresponding number
String result = "";
for (char c : "Apple!".toCharArray()) {
    char x = Character.toLowerCase(c);
    result += lettersToNumber.containsKey(x) ? lettersToNumber.get(x) : c;
}
Input, Output
Apple! => 0116161205!
stack Overflow => 1920010311 1522051806121523
über => ü020518

So given...
(ex. A to 01) And if it's not an English alphabet letter then leaving it as it is. Capitalization doesn't matter (a and A is still 01) and spaces should be kept.
This raises some interesting points:
We don't care about non-english characters, so we can dispense with issues around UTF encoding
Capitalization doesn't matter
Spaces should be kept
The reason these points are interesting to me is it means we're only interested in a small subset of characters (1-26). This immediately screams "ASCII" to me!
This provides an immediate lookup table which doesn't require us to produce anything up front, it's immediately accessible.
A quick look at any ASCII table provides us with all the information we need. A-Z is in the range of 65-90 (since we don't care about case, we don't need to worry about the lower case range).
But how does that help us!?
Well, this now means the primary question becomes, "How do we convert a char to an int?", which is amazingly simple! A char can be both a "character" and a "number" at the same time, because a char is stored as its numeric character code, which for A-Z matches the ASCII table!
So if you were to print out (int)'A', it would print 65! And since all the characters are in order, we just need to subtract 64 from 65 to get 1!
That's basically your entire problem solved right there!
Oh, okay, you need to deal with the edge cases of characters not falling between A-Z, but that's just a simple if statement.
A solution based on the above "might" look something like...
public static String convert(String text) {
    // 'A' is 65, so subtracting 64 maps A-Z to 1-26
    int offset = 64;
    StringBuilder sb = new StringBuilder(32);
    for (char c : text.toCharArray()) {
        char input = Character.toUpperCase(c);
        int value = ((int) input) - offset;
        if (value < 1 || value > 26) {
            // Not a letter from A-Z, keep the original character
            sb.append(c);
        } else {
            sb.append(String.format("%02d", value));
        }
    }
    return sb.toString();
}
Now, there are a number of ways you might approach this, I've chosen a path based on my understanding of the problem and my experience.
And based on your example input...
String[] test = {"Apple!", "stack Overflow", "über"};
for (String value : test) {
    System.out.println(value + " = " + convert(value));
}
would produce the following output...
Apple! = 0116161205!
stack Overflow = 1920010311 1522051806121523
über = ü020518

Generate all Palindromic numbers in a given number system?

I need to generate all palindromic numbers for a given number base (which may be as large as 10,000), in a given range. I need an efficient way to do it.
I stumbled upon this answer, which is related to base 10 directly. I'm trying to adapt it to work for "all" bases:
public static Set<String> allPalindromic(long limit, int base, char[] list) {
    Set<String> result = new HashSet<String>();
    for (long i = 0; i <= base-1 && i <= limit; i++) {
        result.add(convert(i, base, list));
    }
    boolean cont = true;
    for (long i = 1; cont; i++) {
        StringBuffer rev = new StringBuffer("" + convert(i, base, list)).reverse();
        cont = false;
        for (char d : list) {
            String n = "" + convert(i, base, list) + d + rev;
            if (convertBack(n, base, list) <= limit) {
                cont = true;
                result.add(n);
            }
        }
    }
    return result;
}
The convert() method converts a number to its string representation in a given base, using a list of chars for digits.
convertBack() converts the string representation of a number back to base 10.
When testing my method for base 10, it leaves out two-digit palindromes and then the next ones it leaves out are 1001,1111,1221... and so on.
I'm not sure why.
Here are the conversion methods if needed.
Turns out, this gets slower with my other code because of the constant conversions, since I need all the numbers in order and in decimal. I'll just stick to iterating over every integer, converting it to every base, and then checking if it's a palindrome.

I don't have enough reputation to comment, but if you are only missing even-length palindromes, then most probably there is something wrong with your list. Most probably you forgot to add an empty entry to list: to generate 1001, it has to be built as num(10) + empty("") + rev(01).

There aren't enough suitable characters to serve as digits in every possible base (hex can spell things like 0xDEADBEEF, and I suppose that convert has some limit like 36), so forget about exotic digits and use simple lists or arrays of integers like [8888, 123, 5583] for the digits in base 10000.
Then convert limit into the needed base and store it.
Now generate symmetric arrays of odd and even length like [175, 2, 175] or [13, 221, 221, 13]. If the length is the same as the limit's length, compare the array values and reject numbers that are too high.
You can also use the limit array as a starting point and generate only palindromes with lesser values.
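A minimal sketch of the mirroring idea, under some assumptions: since the asker wants the values in decimal and in order, it works arithmetically on long values rather than on digit arrays, which is a substitution for the array representation described above. The names BasePalindromes, palindromes and mirror are hypothetical. Each palindrome is built by taking a "left half" and appending its digits in reverse, once with the middle digit shared (odd length) and once without (even length).

```java
import java.util.ArrayList;
import java.util.List;

class BasePalindromes {
    // Returns every palindromic number in [0, limit] for the given base, in ascending order.
    static List<Long> palindromes(long limit, int base) {
        List<Long> result = new ArrayList<>();
        if (limit < 0 || base < 2) {
            return result;
        }
        result.add(0L);
        long lo = 1; // smallest left half (leading digit non-zero) for the current length
        for (int len = 1; ; len++) {
            long hi = lo * base; // exclusive upper bound for left halves of this length
            boolean any = false;
            for (long left = lo; left < hi; left++) {
                long value = mirror(left, base, len % 2 == 1, limit);
                if (value < 0) {
                    break; // for a fixed length, larger left halves only give larger values
                }
                result.add(value);
                any = true;
            }
            if (!any) {
                break; // longer palindromes are larger still
            }
            if (len % 2 == 0) {
                lo = hi; // the left half gains a digit when moving from even to odd length
            }
        }
        return result;
    }

    // Appends the reversed digits of left to itself (sharing the middle digit for odd
    // lengths). Returns -1 if the result would exceed limit.
    static long mirror(long left, int base, boolean oddLength, long limit) {
        if (left > limit) {
            return -1;
        }
        long value = left;
        long rest = oddLength ? left / base : left;
        while (rest > 0) {
            long digit = rest % base;
            if (value > (limit - digit) / base) {
                return -1; // value * base + digit would exceed limit
            }
            value = value * base + digit;
            rest /= base;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(palindromes(10, 2)); // binary palindromes up to ten
    }
}
```

Because the limit check happens before each multiplication, the sketch does not rely on the limit being far below Long.MAX_VALUE.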

How to find all permutations of a given word in a given text?

This is an interview question (phone screen): write a function (in Java) to find all permutations of a given word that appear in a given text. For example, for word abc and text abcxyaxbcayxycab the function should return abc, bca, cab.
I would answer this question as follows:
Obviously I can loop over all permutations of the given word and use a standard substring function. However it might be difficult (for me right now) to write code to generate all word permutations.
It is easier to loop over all text substrings of the word size, sort each substring and compare it with the "sorted" given word. I can code such a function immediately.
I can probably modify some substring search algorithm but I do not remember these algorithms now.
How would you answer this question?

This is probably not the most efficient solution algorithmically, but it is clean from a class design point of view. This solution takes the approach of comparing "sorted" given words.
We can say that a word is a permutation of another if it contains the same letters with the same counts. This means that you can convert the word from a String to a Map<Character,Integer>. This conversion has complexity O(n), where n is the length of the String, assuming that insertions in your Map implementation cost O(1).
The Map will contain as keys all the characters found in the word and as values the frequencies of the characters.
Example. abbc is converted to [a->1, b->2, c->1]
bacb is converted to [a->1, b->2, c->1]
So to find out whether two words are permutations of each other, you can convert them both into maps and then invoke Map.equals.
Then you iterate over the text string and apply the transformation to every substring that has the same length as the word you are looking for.
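A minimal sketch of this map comparison, without any rolling update, so it rebuilds the map for every window. The names MapPermutationSearch, counts and find are hypothetical and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapPermutationSearch {
    // Converts a string into a map from character to number of occurrences.
    static Map<Character, Integer> counts(String s) {
        Map<Character, Integer> map = new HashMap<>();
        for (char c : s.toCharArray()) {
            map.merge(c, 1, Integer::sum);
        }
        return map;
    }

    // Returns every substring of text that is a permutation of word, in order of appearance.
    static List<String> find(String word, String text) {
        List<String> found = new ArrayList<>();
        Map<Character, Integer> target = counts(word);
        int n = word.length();
        for (int i = 0; i + n <= text.length(); i++) {
            String window = text.substring(i, i + n);
            if (counts(window).equals(target)) {
                found.add(window);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("abc", "abcxyaxbcayxycab")); // prints [abc, bca, cab]
    }
}
```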
Improvement proposed by Inerdial
This approach can be improved by updating the Map in a "rolling" fashion.
I.e. if you're matching at index i=3 in the example haystack in the OP (the substring xya), the map will be [a->1, x->1, y->1]. When advancing in the haystack, decrement the character count for haystack[i], and increment the count for haystack[i+needle.length()].
(Dropping zeroes to make sure Map.equals() works, or just implementing a custom comparison.)
Improvement proposed by Max
What if we also introduce a matchedCharactersCnt variable (matchedLength in the pseudocode)? At the beginning of the haystack it will be 0. Every time you change your map towards the desired value, you increment the variable. Every time you change it away from the desired value, you decrement the variable. On each iteration you check whether the variable is equal to the length of the needle. If it is, you've found a match. That is faster than comparing the full map every time.
Pseudocode provided by Max:
needle = "abbc"
haystack = "abbcbbabbcaabbca"
needleSize = needle.length()
//Map of needle character counts; a character missing from a map reads as 0
targetMap = [a->1, b->2, c->1]
matchedLength = 0
curMap = [a->0, b->0, c->0]
//Initial map initialization: count the first window
for (int i=0;i<needleSize;i++) {
    //Only a character we still need counts as matched
    if (curMap[haystack[i]] < targetMap[haystack[i]]) {
        matchedLength++
    }
    curMap[haystack[i]]++
}
if (matchedLength == needleSize) {
    System.out.println("Match found at: 0");
}
//Search itself
for (int i=0;i<haystack.length()-needleSize;i++) {
    int targetValue1 = targetMap[haystack[i]]; //Reading from hashmap, O(1)
    int curValue1 = curMap[haystack[i]]; //Another read
    //If we are removing a beneficial character
    if (targetValue1 > 0 && curValue1 > 0 && curValue1 <= targetValue1) {
        matchedLength--;
    }
    curMap[haystack[i]] = curValue1 - 1; //Write to hashmap, O(1)
    int targetValue2 = targetMap[haystack[i+needleSize]] //Read
    int curValue2 = curMap[haystack[i+needleSize]] //Read
    //We are adding a beneficial character
    if (targetValue2 > 0 && curValue2 < targetValue2) { //A letter we don't need, or already have enough of, does not count
        matchedLength++;
    }
    curMap[haystack[i+needleSize]] = curValue2 + 1; //Write
    if (matchedLength == needleSize) {
        System.out.println("Match found at: "+(i+1));
    }
}
//Basically with 4 reads and 2 writes which are
//independent of the size of the needle,
//we get to the maximal possible performance: O(n)

To find a permutation of a string you can use number theory.
But you will have to know the 'theory' behind this algorithm in advance before you can answer the question using it.
There is a method where you calculate a hash of a string using prime numbers.
Every permutation of the same string gives the same hash value. A string that is not a permutation will usually give a different value, but with a weighted sum this is not guaranteed, so a hit should be verified if collisions matter.
The hash value is calculated as c1 * p1 + c2 * p2 + ... + cn * pn
where ci is a unique value for the current char in the string and pi is a unique prime number assigned to the ci char.
Here is an implementation, hard-coded for three-letter words and lowercase text.
public class Main {
    static int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17,
            19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
            73, 79, 83, 89, 97, 101, 103 };
    public static void main(String[] args) {
        final char[] text = "abcxaaabbbccyaxbcayaaaxycab"
                .toCharArray();
        char[] abc = new char[]{'a','b','c'};
        int match = val(abc);
        for (int i = 0; i < text.length - 2; i++) {
            char[] _123 = new char[]{text[i],text[i+1],text[i+2]};
            if(val(_123)==match){
                System.out.println(new String(_123) );
            }
        }
    }
    static int p(char c) {
        return primes[(int)c - (int)'a'];
    }
    static int val(char[] cs) {
        return
            p(cs[0])*(int)cs[0] + p(cs[1])*(int)cs[1] + p(cs[2])*(int)cs[2];
    }
}
The output of this is:
abc
bca
cab
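A variation on this idea, sketched under assumptions rather than taken from the answer: multiplying one prime per letter, instead of summing weighted values, gives a key that is the same for two strings exactly when they contain the same letters with the same counts (by unique factorization). BigInteger is used here so long words cannot overflow. The names PrimeProductSearch, key and find are hypothetical, and the word is assumed to consist of lowercase a-z only.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class PrimeProductSearch {
    // One prime per letter a-z.
    static final int[] PRIMES = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
            43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 };

    // Product of the primes of all letters in s, or zero if s has a character outside a-z.
    static BigInteger key(String s) {
        BigInteger product = BigInteger.ONE;
        for (char c : s.toCharArray()) {
            if (c < 'a' || c > 'z') {
                return BigInteger.ZERO; // can never equal the key of an a-z word
            }
            product = product.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
        }
        return product;
    }

    // Returns every substring of text that is a permutation of word (word assumed a-z).
    static List<String> find(String word, String text) {
        List<String> found = new ArrayList<>();
        BigInteger target = key(word);
        int n = word.length();
        for (int i = 0; i + n <= text.length(); i++) {
            String window = text.substring(i, i + n);
            if (key(window).equals(target)) {
                found.add(window);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("abc", "abcxaaabbbccyaxbcayaaaxycab")); // prints [abc, bca, cab]
    }
}
```

The key is recomputed for every window, so this sketch costs O(M * N) multiplications for a text of length M and a word of length N.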

You should be able to do this in a single pass. Start by building a map that contains all the characters in the word you're searching for. So initially the map contains [a, b, c].
Now, go through the text one character at a time. The loop looks something like this, in pseudo-code.
found_string = "";
for each character in text
    if character is in map
        remove character from map
        append character to found_string
        if map is empty
            output found_string
            found_string = ""
            add all characters back to map
        end if
    else
        // not a permutation of the string you're searching for
        refresh map with characters from found_string
        found_string = ""
    end if
end for
If you want unique occurrences, change the output step so that it adds the found strings to a map. That'll eliminate duplicates.
There's the issue of words that contain duplicated letters. If that's a problem, make the key the letter and the value a count. 'Removing' a character means decrementing its count in the map. If the count goes to 0, then the character is in effect removed from the map.
The algorithm as written won't find overlapping occurrences. That is, given the text abcba, it will only find abc. If you want to handle overlapping occurrences, you can modify the algorithm so that when it finds a match, it moves the index back by the length of the found string minus one.
That was a fun puzzle. Thanks.

This is what I would do - set up a flag array with one
element equal to 0 or 1 to indicate whether that character
in STR had been matched
Set the first result string RESULT to empty.
for each character C in TEXT:
Set an array X equal to the length of STR to all zeroes.
for each character S in STR:
If C is the JTH character in STR, and
X[J] == 0, then set X[J] <= 1 and add
C to RESULT.
If the length of RESULT is equal to STR,
add RESULT to a list of permutations
and set the elements of X[] to zeroes again.
If C is not any character J in STR having X[J]==0,
then set the elements of X[] to zeroes again.

The second approach seems very elegant to me and should be perfectly acceptable. I think it scales at O(M * N log N), where N is word length and M is text length.
I can come up with a somewhat more complex O(M) algorithm:
1. Count the occurrence of each character in the word
2. Do the same for the first N (i.e. length(word)) characters of the text
3. Subtract the two frequency vectors, yielding subFreq
4. Count the number of non-zeroes in subFreq, yielding numDiff
5. If numDiff equals zero, there is a match
6. Update subFreq and numDiff in constant time by updating for the first and after-last character in the text
7. Go to 5 until reaching the end of the text
EDIT: See that several similar answers have been posted. Most of this algorithm is equivalent to the rolling frequency counting suggested by others. My humble addition is also updating the number of differences in a rolling fashion, yielding an O(M+N) algorithm rather than an O(M*N) one.
EDIT2: Just saw that Max has basically suggested this in the comments, so brownie points to him.
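The steps above could be sketched as follows. This is one possible reading, not the answerer's code: the names RollingPermutationSearch, find and bump are hypothetical, the frequency vector is a plain int array indexed by char value, and the result is a list of start indices rather than the matched strings.

```java
import java.util.ArrayList;
import java.util.List;

class RollingPermutationSearch {
    // Returns the start index of every window of text that is a permutation of word.
    static List<Integer> find(String word, String text) {
        List<Integer> starts = new ArrayList<>();
        int n = word.length();
        if (n == 0 || n > text.length()) {
            return starts;
        }
        // subFreq[c] = occurrences of c in the current window minus occurrences in word
        int[] subFreq = new int[Character.MAX_VALUE + 1];
        int numDiff = 0; // number of non-zero entries in subFreq
        for (int i = 0; i < n; i++) {
            numDiff += bump(subFreq, word.charAt(i), -1);
            numDiff += bump(subFreq, text.charAt(i), +1);
        }
        if (numDiff == 0) {
            starts.add(0);
        }
        for (int i = n; i < text.length(); i++) {
            numDiff += bump(subFreq, text.charAt(i - n), -1); // character leaving the window
            numDiff += bump(subFreq, text.charAt(i), +1);     // character entering the window
            if (numDiff == 0) {
                starts.add(i - n + 1);
            }
        }
        return starts;
    }

    // Adds delta (+1 or -1) to subFreq[c]; returns the change in the count of non-zero entries.
    static int bump(int[] subFreq, char c, int delta) {
        int before = subFreq[c];
        subFreq[c] += delta;
        if (before == 0) {
            return 1;  // entry became non-zero
        }
        if (subFreq[c] == 0) {
            return -1; // entry became zero
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(find("abc", "abcxyaxbcayxycab")); // prints [0, 7, 13]
    }
}
```

Each step of the main loop touches two array entries, so the work per text character does not depend on the word length.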

This code should do the work:
import java.util.ArrayList;
import java.util.List;
public class Permutations {
    public static void main(String[] args) {
        final String word = "abc";
        final String text = "abcxaaabbbccyaxbcayxycab";
        List<Character> charsActuallyFound = new ArrayList<Character>();
        StringBuilder match = new StringBuilder(3);
        for (Character c : text.toCharArray()) {
            if (word.contains(c.toString()) && !charsActuallyFound.contains(c)) {
                charsActuallyFound.add(c);
                match.append(c);
                if (match.length()==word.length())
                {
                    System.out.println(match);
                    match = new StringBuilder(3);
                    charsActuallyFound.clear();
                }
            } else {
                match = new StringBuilder(3);
                charsActuallyFound.clear();
            }
        }
    }
}
The charsActuallyFound List is used to keep track of the characters already found in the loop. It is needed to avoid matching "aaa", "bbb" and "ccc" (added by me to the text you specified).
After further reflection, I think my code only works if the given word has no duplicate characters.
The code above correctly prints
abc
bca
cab
but if you search for the word "aaa", then nothing is printed, because each char cannot be matched more than once. Inspired by Jim Mischel's answer, I edited my code, ending up with this:
import java.util.ArrayList;
import java.util.List;
public class Permutations {
    public static void main(String[] args) {
        final String text = "abcxaaabbbccyaxbcayaaaxycab";
        printMatches("aaa", text);
        printMatches("abc", text);
    }
    private static void printMatches(String word, String text) {
        System.out.println("matches for "+word +" in "+text+":");
        StringBuilder match = new StringBuilder(3);
        StringBuilder notYetFounds=new StringBuilder(word);
        for (Character c : text.toCharArray()) {
            int idx = notYetFounds.indexOf(c.toString());
            if (idx!=-1) {
                notYetFounds.replace(idx,idx+1,"");
                match.append(c);
                if (match.length()==word.length())
                {
                    System.out.println(match);
                    match = new StringBuilder(3);
                    notYetFounds=new StringBuilder(word);
                }
            } else {
                match = new StringBuilder(3);
                notYetFounds=new StringBuilder(word);
            }
        }
        System.out.println();
    }
}
This gives me the following output:
matches for aaa in abcxaaabbbccyaxbcayaaaxycab:
aaa
aaa
matches for abc in abcxaaabbbccyaxbcayaaaxycab:
abc
bca
cab
I did some benchmarking: the code above found 30815 matches of "abc" in a random string of 36M characters in just 4.5 seconds. As Jim already said, thanks for this puzzle...
