Hashing string into integer in Java applet - how does it work? - java

I'm working on a challenge from the website http://www.net-force.nl/challenges/ and I've run into an interesting problem I can't solve. I'm not asking for the whole solution (that would break the rules), but I need help with the programming theory behind the hash function.
Basically, it's based on a Java applet with one text field, where the user has to enter the right password. When I decompile the .class file, one of the methods I get is this hash method.
The String s contains the entered password, which is passed straight to the method:
private int hash(String s)
{
    int i = 0;
    for (int j = 0; j < s.length(); j++)
        i += s.charAt(j);
    return i;
}
The problem is that the method returns an integer as the "hash", but how can characters be converted to an integer at all? One idea was that the password might be a number, but that doesn't lead anywhere. Another idea involves ASCII, but I still got nothing.
Thanks for any help or tips.

The trick is that it's converting each character into an integer. Each character (char) in Java is a UTF-16 code unit. For the most part [1], you can just think of that as each character being mapped to a number between 0 and 65535 inclusive, in a scheme called Unicode. For example, 65 is the number for 'A', and if you'd typed in the Euro symbol, that would map to Unicode U+20AC (8364).
Your hashing function basically adds together the numbers for each character in the string. It's a very poor hash (in particular it gives the same results for the same characters regardless of ordering), but hopefully you'll get the idea.
[1] Things get trickier when you need to bear in mind surrogate pairs, where a single Unicode character is actually made up of two UTF-16 code units - that's for characters with a Unicode number of more than 65535. Let's stick to the basics for the moment though :)
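To make the char-to-int widening concrete, here is a minimal sketch that reuses the logic of the decompiled method. The class name CharSumHash and the sample inputs are made up for illustration; they are not part of the challenge.

```java
class CharSumHash {
    // Same logic as the decompiled method: add up the UTF-16 code unit values.
    static int hash(String s) {
        int sum = 0;
        for (int j = 0; j < s.length(); j++) {
            sum += s.charAt(j); // the char is implicitly widened to int
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');      // 65
        System.out.println(hash("AB"));     // 65 + 66 = 131
        System.out.println(hash("BA"));     // 131 again: ordering does not matter
        System.out.println((int) '\u20AC'); // 8364, the Euro sign
    }
}
```

Because "AB" and "BA" give the same value, many different strings share one hash.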

The hash function you present is about the simplest hashing function you could possibly write for a string.
It is easy to implement and really fast to compute.
It is problematic though, since it doesn't distribute the input well.
Assuming ASCII chars, each character contributes between 0 and 127, so for an 8-character string, for example, the hash can only take values from 0 to 1016 (8 × 127).
I.e. each character in the string is "treated" as its ASCII equivalent (for a more advanced analysis check John's answer).
Anyway, you should note that strings containing the same characters in a different order map to the same hash value with this function. Perhaps this is of interest to you in the challenge you are trying to attack?

Related

Store data in Byte array in java

I am trying to convert a string like "password" to hex values and then store them in a long array. The loop works fine until it reaches the value "6F" (the hex value for the o char), then I get a java.lang.NumberFormatException:
String password = "password";
char array[] = password.toCharArray();
long[] data = new long[array.length];
int index = 0;
for (char c : array) {
    String hex = (Integer.toHexString((int) c));
    data[index] = Long.parseLong(hex); // NumberFormatException once hex is "6f"
    index++;
}
How can I store the 6F values inside a byte array, given that 6F is greater than 1 byte? Please help me with this.
Long.parseLong parses decimal numbers. It turns the string "10" into the number 10. If the input is hex, that is incorrect - the string "10" is supposed to be turned into the number 16. The fix is to use the Long.parseLong(String input, int radix) method. The radix you want is 16, though writing that as 0x10 may be more readable - it's the same thing to the compiler, purely a personal style choice. Thus, Long.parseLong(hex, 0x10) is what you want.
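A minimal sketch of the radix fix, assuming the goal is simply one long per character. The class and method names (HexParseDemo, toCodes) are invented for this example.

```java
import java.util.Arrays;

class HexParseDemo {
    // Converts each char to a hex string and parses it back, this time with radix 16.
    static long[] toCodes(String password) {
        long[] data = new long[password.length()];
        for (int i = 0; i < password.length(); i++) {
            String hex = Integer.toHexString(password.charAt(i));
            data[i] = Long.parseLong(hex, 16); // "6f" -> 111 instead of an exception
        }
        return data;
    }

    public static void main(String[] args) {
        // Prints [112, 97, 115, 115, 119, 111, 114, 100]
        System.out.println(Arrays.toString(toCodes("password")));
    }
}
```

Note that the original loop did not only fail on "6f": the earlier values were silently wrong too, because "70" (the hex form of 'p') was parsed as decimal 70 rather than 112.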
Note that in practice char has numbers that go from 0 to 65535, which doesn't fit in bytes. In effect, you must put a marker down that passwords must not contain any characters that aren't ASCII characters (so no umlauts, snowmen, emoji, funny quotes, etc).
If you fail to check this, Integer.toHexString((int) c) will turn into something like 16F or worse (3 to 4 hex digits), and it may also turn into a single hex digit.
More generally, converting from char c to a hex string, and then parsing the hex string into a number, is completely pointless. It's turning 15 into "F" and then turning "F" into 15. If you just want to shove a char into a byte: data[index++] = (byte) c; is all you need - that is the only line you need in your for loop.
But, heed this:
This really isn't how you're supposed to do that!
What you're doing is converting character data to a byte array. This is not actually simple - there are only 256 possible bytes, and there are way more characters that folks have invented. Literally hundreds of thousands of them.
Thus, to convert characters to bytes or vice versa, you must apply an encoding. Encodings have wildly varying properties. The most commonly used encoding, however, is 'UTF-8'. It can represent every Unicode symbol, and has the interesting property that basic ASCII characters look exactly the same. However, it has the downside that any given character is smeared out into 1, 2, 3, or even 4 bytes, depending on what character it is. Fortunately, Java has plenty of tools for this, so you don't need to care. What you really want is this:
byte[] data = password.getBytes(StandardCharsets.UTF_8);
That's asking the string to turn itself into a byte array, using the UTF-8 encoding. That means "password" turns into the sequence '112 97 115 115 119 111 114 100' which is no doubt what you want, but you can also have as password, say, außgescheignet ☃, and that works too - it's turned into bytes, and you can get back to your snowman enabled password:
String in = "außgescheignet ☃";
byte[] data = in.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(data, StandardCharsets.UTF_8);
assert in.equals(andBackAgain); // true
If you stick this in a source file, make sure you save it as UTF-8 in whatever text editor you use, and that javac compiles it that way too (javac has an -encoding parameter to enforce this).
If you think this is going to cause issues on whatever you send this to, and you want to restrict it to what someone with a rather USA-centric view would call 'normal' characters, then use StandardCharsets.US_ASCII instead. Be aware that password.getBytes(StandardCharsets.US_ASCII) does not throw on non-ASCII characters: it silently replaces each of them with the charset's replacement byte ('?'). To fail fast, encode through StandardCharsets.US_ASCII.newEncoder().encode(...), which throws a CharacterCodingException on the first character it cannot map. That's a good thing: your infrastructure would not deal with it correctly, we just posited that in this hypothetical exercise. Throwing an exception early in the process on a relevant line is exactly what you want.
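As a sketch of strict ASCII encoding that really does throw, the helper below goes through a CharsetEncoder instead of String.getBytes. The names StrictAscii and encodeAscii are hypothetical, and wrapping the checked exception in an IllegalArgumentException is just one possible design choice.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictAscii {
    // Encodes s as US-ASCII and throws if any character cannot be represented.
    static byte[] encodeAscii(String s) {
        try {
            ByteBuffer buf = StandardCharsets.US_ASCII.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Password contains non-ASCII characters", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeAscii("password").length); // 8
        // getBytes does not throw; the snowman is replaced by '?' (byte 63).
        System.out.println("\u2603".getBytes(StandardCharsets.US_ASCII)[0]); // 63
        try {
            encodeAscii("\u2603");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```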

implement an algorithm to determine if a string has all unique characters (characters greater than U+FFFF)

I was practicing example interview questions and one of them was:
"implement an algorithm to determine if a string has all unique characters".
It's easy when we assume that the string is ASCII/ANSI.
implement-an-algorithm-to-determine-if-a-string-has-all-unique-charact
But my question is: how should that be solved if, let's say, the string can contain hieroglyphic symbols or whatever (code points greater than U+FFFF)?
If I understand it correctly, I can easily think of a solution when the given string only contains characters from U+0000 to U+FFFF - they can each be converted into a 16-bit char. But what if I encounter a character whose code point is greater than U+FFFF?
Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
But I have no idea how to solve this puzzle in that case. How do I handle those surrogate pairs?
Thanks!
Java 8 has a CharSequence#codePoints method that produces an IntStream of the Unicode codepoints in a string. From there it just becomes a matter of writing code to test uniqueness of elements in the IntStream.
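One way to sketch the Java 8 approach: compare the number of distinct code points with the total number of code points. The class and method names (UniqueCodePoints, allUnique) are made up for this example.

```java
class UniqueCodePoints {
    // True if no Unicode code point occurs more than once in s.
    static boolean allUnique(String s) {
        long distinct = s.codePoints().distinct().count();
        return distinct == s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(allUnique("abc"));  // true
        System.out.println(allUnique("abca")); // false
        // U+1F600 and U+1F601 share the high surrogate \uD83D, so a char-based
        // check would wrongly report a duplicate here.
        System.out.println(allUnique("\uD83D\uDE00\uD83D\uDE01")); // true
    }
}
```

This trades a little extra work for brevity; a loop with a Set that stops at the first duplicate would avoid scanning the whole string.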
If you're still on Java 7 or below, there are codepoint-based methods that can be used to solve this as well, but they are much more complex to use. You'd have to loop over the chars of the string and examine each one's value to tell whether you're dealing with surrogate pairs or not. Something like (thoroughly untested):
for (int i = 0; i < str.length(); ) {
    // codePointAt combines a valid surrogate pair into one code point;
    // an unpaired surrogate (malformed UTF-16) is returned as-is.
    int codepoint = str.codePointAt(i);
    // do stuff with codepoint...
    // Advance by 2 chars for a supplementary character, otherwise by 1.
    i += Character.charCount(codepoint);
}

I am having trouble creating a 16bit char in java

How can I create a variable character that can hold a four byte value?
I am trying to write a program to encrypt messages in Java, for fun. I figured out how to use RSA, and managed to write a program that will encrypt a message and save it to a .txt file.
For example if "Quiet" is entered the outcome will be "041891090280". I wrote my code so that the number would always have length that is a multiple of six. So I thought that I could convert the numbers into a hash code. The first three letters are "041" so I could convert that into ")".
However, I am having trouble creating a char with a number greater than 255. I have looked around online and found a few examples, but I can't figure out how to implement them. I created a new method just to test them.
int a = 256;
char b = (char) a;
char c = 0xD836;
char[] cc = Character.toChars(0x1D50A);
System.out.println(b);
System.out.println(c);
System.out.println(cc);
The program outputs
?
?
?
I am only getting two bytes. I read that Java uses Unicode which should go up to 65535 which is four bytes. I am using eclipse if that makes a difference.
I apologize for the noob question.
And thanks in advance.
edit
I am sorry, I think I gave too much information and ended up being confusing.
What I want to do is store a string of numbers as a string of Unicode characters. The only way I know how to do that is to break up the number string into pieces small enough to fit into a character, then add the characters one by one to a new string. But I don't know how to add a variable Unicode character to a string.
All chars are 16-bit already. The values 0 to 65535 need exactly 16 bits, since 2^16 = 65536.
Note: not all char values are valid characters on their own. In particular, 0xD800 to 0xDFFF are surrogates, used in pairs to encode code points beyond 65535.
If you want to be able to store all possible 16-bit values I suggest you use short instead. It holds the same 65536 bit patterns (read as signed values from -32768 to 32767), and it may be less confusing to use.
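To illustrate the point about 16-bit chars and surrogates, here is a small sketch that builds a string from numeric values. The class and method names (CharWidthDemo, fromCodePoints) are invented. The "?" output in the question is probably a console encoding or font limitation rather than a problem with the char values themselves, though that is an assumption about the asker's setup.

```java
class CharWidthDemo {
    // Appends each numeric code point to a string; values above 0xFFFF become two chars.
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char b = (char) 256;
        System.out.println((int) b);                            // 256: the value is stored fine
        System.out.println(Character.toChars(0x1D50A).length);  // 2: a surrogate pair
        String s = fromCodePoints(41, 256, 0x1D50A);
        System.out.println(s.length());                         // 4 chars
        System.out.println(s.codePointCount(0, s.length()));    // 3 code points
    }
}
```

Printing the numeric value, as above, is a way to check what a char holds without depending on what the console can display.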

offsetByCodePoints vs integer iterator

Is there any advantage to using String.offsetByCodePoints instead of just using an integer index to keep track of where you are in a string?
It might be useful if the string contains characters from the Unicode Supplementary Planes (unusual characters with a high code point / character code). Java strings use UTF-16 encoding internally, which means that some Unicode characters must be represented as a sequence of two char values, also known as a surrogate pair. Thus, although s.charAt(i) will give you the i'th char of s, this might not actually be the i'th character. s.offsetByCodePoints(0, i) will tell you the index where the i'th character starts.
If you are unfamiliar with some of the terms above, you should read Joel Spolsky's excellent article on character sets.
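A short sketch of the difference between a char index and a code point index. The class and method names (OffsetDemo, codePointAtIndex) and the sample string are hypothetical.

```java
class OffsetDemo {
    // Returns the n-th code point (0-based) of s, counting surrogate pairs as one.
    static int codePointAtIndex(String s, int n) {
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (a surrogate pair), 'b'
        System.out.println(s.length());                             // 4 chars
        System.out.println(s.offsetByCodePoints(0, 2));             // 3: where the third character starts
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true: charAt(2) is half a character
        System.out.println((char) codePointAtIndex(s, 2));          // b
    }
}
```

For strings without supplementary characters the two indexes coincide, so a plain integer index is enough there.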

Java Regex Range for ASCII

Reworded question as it seems I wasn't specific enough;
Given an RSA system with p = 263, q = 587, public key e = 683 and private key d = 81599. Therefore n = pq = 154381. For a message, say "I AM A STUDENT", the encryption is conducted as follows:
Convert any letter (including blank space) into a 3-digit ASCII code, i.e. 073 032 065 077 032 065 032 083 084 085 068 069 078 084.
Join every two adjacent ASCII codes to form a block, i.e. 073032 065077 032065 032083 084085 068069 078084. (use 000 if last letter has nothing to join with).
Using the encryption algorithm c = m^e mod n to encrypt every block; c1 = 73032^683 mod 154381 = 103300, etc.
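The steps above can be sketched with BigInteger.modPow, using the p, q, e and d given in the question. The class and method names (ToyRsa, block, encrypt, decrypt) are made up for illustration, and keys this small are only suitable as an exercise.

```java
import java.math.BigInteger;

class ToyRsa {
    static final BigInteger N = BigInteger.valueOf(263L * 587L); // 154381
    static final BigInteger E = BigInteger.valueOf(683);
    static final BigInteger D = BigInteger.valueOf(81599);

    // Joins two characters into one block, e.g. 'I' and ' ' give 073032, i.e. 73032.
    static long block(char first, char second) {
        return first * 1000L + second;
    }

    // c = m^e mod n
    static long encrypt(long m) {
        return BigInteger.valueOf(m).modPow(E, N).longValue();
    }

    // m = c^d mod n
    static long decrypt(long c) {
        return BigInteger.valueOf(c).modPow(D, N).longValue();
    }

    public static void main(String[] args) {
        long m = block('I', ' ');
        long c = encrypt(m);
        System.out.println(m);          // 73032
        System.out.println(c);          // the ciphertext block c1
        System.out.println(decrypt(c)); // 73032 again
    }
}
```

Decryption only round-trips because every block is smaller than n, which holds here since the largest possible block of two ASCII codes is 127127.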
Assume you are the receiver of a message: 33815872282353670979238213794429016637939017111351. What is the content?
After a bit more consideration: since I have to decode in parts anyway (decode 33815, then 87228, and so on), I'm thinking I should just split each decoded part in half and check whether each half is in the ASCII range; if not, go back to the original and split it differently. Does this sound like a better solution than trying to hack something out with regex?
P.S. The decoding is considered homework, I have done this by hand and know that the message decodes to "i hate cryptography" (it seems my lecturer has a sense of humor), so you're not helping me do my homework. Turning this into a program is just something extra curricular that I thought might be fun/interesting.
It is generally an incredibly bad idea to have variable length records without a delimiter or index. In this case, the best approach is having a fixed width integer, with leading zeros.
That said, you do actually have an implicit delimiter, assuming you're always reading from start to end of the string without skipping at all: take a leading 0 or 1 to indicate a 3-digit number, and a leading 2-9 to indicate a 2-digit number. Something like this would work:
[01][0-9][0-9]|[2-9][0-9]
But really - just print your numbers into the string with leading zeros. Or look into 2 character hexadecimal encoding if you're worried about space. Or Base 64, or one of the other printable encodings.
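As a sketch of how the pattern above could be applied, the helper below walks a digit string from left to right and turns each match into a character. The names AsciiSplit and decode are hypothetical, and the input is assumed to contain only well-formed codes; Matcher.find() silently skips anything that does not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiSplit {
    // A leading 0 or 1 means a 3-digit code, a leading 2-9 means a 2-digit code.
    private static final Pattern CODE = Pattern.compile("[01][0-9][0-9]|[2-9][0-9]");

    static String decode(String digits) {
        StringBuilder sb = new StringBuilder();
        Matcher m = CODE.matcher(digits);
        while (m.find()) {
            sb.append((char) Integer.parseInt(m.group()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 105 32 104 97 116 101 written without delimiters
        System.out.println(decode("1053210497116101")); // i hate
    }
}
```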
