I am trying to convert a string like "password" to hex values and store them in a long array. The loop works fine until it reaches the value "6F" (the hex value for the character o), and then I get a java.lang.NumberFormatException:
String password = "password";
char array[] = password.toCharArray();
int index = 0;
for (char c : array) {
String hex = (Integer.toHexString((int) c));
data[index] = Long.parseLong(hex);
index++;
}
How can I store the 6F value inside a byte array, given that 6F seems to be greater than 1 byte? Please help me with this.
Long.parseLong(String) parses decimal numbers. It turns the string "10" into the number 10. If the input is hex, that is incorrect: the string "10" is supposed to be turned into the number 16, and a string such as "6F" is not a decimal number at all, hence the NumberFormatException. The fix is to use the Long.parseLong(String input, int radix) method. The radix you want is 16, though writing that as 0x10 may be more readable; it's the same thing to the compiler, purely a personal style choice. Thus, Long.parseLong(hex, 0x10) is what you want.
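A minimal sketch of the radix fix applied to the loop from the question. The class name HexToLongDemo and the helper toCodes are made up for illustration; only Integer.toHexString and Long.parseLong(String, int) come from the discussion above.

```java
import java.util.Arrays;

class HexToLongDemo {
    // Converts each char to its hex string, then parses that string back with radix 16.
    // (Hypothetical helper; it mirrors the loop in the question with the radix added.)
    static long[] toCodes(String password) {
        long[] data = new long[password.length()];
        for (int i = 0; i < password.length(); i++) {
            String hex = Integer.toHexString(password.charAt(i)); // 'o' becomes "6f"
            data[i] = Long.parseLong(hex, 16);                    // "6f" becomes 111
        }
        return data;
    }

    public static void main(String[] args) {
        // Prints [112, 97, 115, 115, 119, 111, 114, 100]
        System.out.println(Arrays.toString(toCodes("password")));
    }
}
```

Note that the round trip through a hex string yields exactly the numeric value of each char, which is why the rest of this answer suggests skipping it.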
Note that in practice char has numbers that go from 0 to 65535, which doesn't fit in bytes. In effect, you must put a marker down that passwords must not contain any characters that aren't ASCII characters (so no umlauts, snowmen, emoji, funny quotes, etc).
If you fail to check this, Integer.toHexString((int) c) will turn into something like 16F or worse (3 to 4 characters), and it may also turn into a single character.
More generally, converting from char c to a hex string, and then parsing the hex string back into a number, is completely pointless. It's turning 15 into "F" and then turning "F" into 15. If you just want to shove a char into a byte: data[index++] = (byte) c; is all you need - that is the only line you need in your for loop.
But, heed this:
This really isn't how you're supposed to do that!
What you're doing is converting character data to a byte array. This is not actually simple - there are only 256 possible bytes, and there are way more characters that folks have invented. Literally hundreds of thousands of them.
Thus, to convert characters to bytes or vice versa, you must apply an encoding. Encodings have wildly varying properties. The most commonly used encoding, however, is 'UTF-8'. It can represent every Unicode symbol, and has the interesting property that basic ASCII characters look exactly the same. However, it has the downside that any given character is smeared out into 1, 2, 3, or even 4 bytes, depending on what character it is. Fortunately, Java has plenty of tools for this, so you don't need to care. What you really want is this:
byte[] data = password.getBytes(StandardCharsets.UTF_8);
That's asking the string to turn itself into a byte array, using the UTF-8 encoding. That means "password" turns into the sequence '112 97 115 115 119 111 114 100', which is no doubt what you want, but you can also have as password, say, außgescheignet ☃, and that works too - it's turned into bytes, and you can get back to your snowman-enabled password:
String in = "außgescheignet ☃";
byte[] data = in.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(data, StandardCharsets.UTF_8);
assert in.equals(andBackAgain); // true
If you stick this in a source file, make sure you save it as UTF-8 in whatever text editor you use, and that javac compiles it that way too (javac has an -encoding parameter to enforce this).
If you think this is going to cause issues on whatever you send this to, and you want to restrict it to what someone with a rather USA-centric view would call 'normal' characters, then the charset you want is StandardCharsets.US_ASCII. Be careful, though: password.getBytes(StandardCharsets.US_ASCII) does not throw on non-ASCII characters; it silently replaces each one with '?'. To get an error instead, encode through a CharsetEncoder, e.g. StandardCharsets.US_ASCII.newEncoder().encode(CharBuffer.wrap(password)), which throws a CharacterCodingException on input it cannot map. That's a good thing: your infrastructure would not deal with such characters correctly, as we just posited in this hypothetical exercise. Throwing an exception early in the process on a relevant line is exactly what you want.
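A hedged sketch of strict ASCII validation using the standard java.nio.charset API. The class name StrictAsciiDemo and the helpers encodeStrict and isEncodable are illustrative names, not part of any library; the REPORT error actions are set explicitly so the behaviour does not depend on defaults.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictAsciiDemo {
    // Encodes s as US-ASCII and throws if any character cannot be represented.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer buf = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Convenience wrapper (hypothetical): true when s is pure ASCII.
    static boolean isEncodable(String s) {
        try {
            encodeStrict(s);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("password"));      // plain ASCII
        System.out.println(isEncodable("snow \u2603"));   // contains a snowman
        // For comparison: String.getBytes substitutes '?' instead of failing.
        System.out.println("\u2603".getBytes(StandardCharsets.US_ASCII)[0]);
    }
}
```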
I have searched for some time on this matter and didn't find a proper answer anywhere.
Let's say I have a string:
"The quick brown fox jumps over the lazy dog"
I need to find the unique words in this string, their byte positions, and also the byte distance between occurrences of the same word.
OK, I can manage to find the words, but what are their byte positions, and any ideas on how to track the distance in bytes? For example, is 5 the position of the string "quick" once it is converted to bytes?
I hope this doesn't sound too stupid (I am fairly new to Java).
Finding unique words should be fairly easy: split on whitespace, add the strings to a Set, and whatever's in the Set at the end of the method will be the unique words in the file. This can be made arbitrarily complex though, depending on what defines a unique word, and whether characters other than whitespace separate words.
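The split-and-collect idea above can be sketched as follows. This assumes whitespace is the only separator and that words are compared case-sensitively; the class and method names are made up for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWordsDemo {
    // Splits on runs of whitespace and keeps each distinct word once,
    // in order of first appearance. Comparison is case-sensitive.
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("The quick brown fox jumps over the lazy dog"));
    }
}
```

With this definition "The" and "the" count as two different words; lowercasing each word before adding it would merge them.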
The byte position/distance question is a bit harder. If memory serves, String objects in Java are wrappers around char[] objects, and chars are 16-bit unicode characters in Java (http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html).
So I'm guessing byte distance is just a linear function of the character position?
If you're working with other encodings though the getBytes() method might be useful.
http://docs.oracle.com/javase/tutorial/i18n/text/string.html
So for something like that, a naive solution would be to determine the number of bytes for each character, which would allow for really easy calculation of byte positions/distances, but determining that probably isn't that efficient. It should, however, yield correct results if done correctly.
Positions are counted from 0, not 1, so "quick" has character position 4 rather than 5. For US-ASCII text that is also the byte position. Maybe character positions suffice.
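// Requires: import java.nio.charset.StandardCharsets;
String s = "The quick brown fox jumps over the lazy dog";
int charsIndex = s.indexOf("quick"); // 4
int charsLength = "The ".length(); // 4
// Using the Charset constant avoids the checked UnsupportedEncodingException of getBytes("UTF-8")
int bytesLength = "The ".getBytes(StandardCharsets.UTF_8).length; // 4
char ch = s.charAt(4); // 'q'
int c = s.codePointAt(4); // (int) 'q'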
In Java, text (String) is always Unicode, hence all chars are possible and combinable.
Bytes (byte[]) are in some encoding and may vary per encoding.
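To illustrate how the byte position depends on the encoding, here is a small sketch that measures the encoded length of everything before a given character index. The helper byteOffset is a hypothetical name, and it assumes the index does not fall in the middle of a surrogate pair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOffsetDemo {
    // Byte position of the character at charIndex when text is encoded with cs:
    // simply the encoded length of the prefix that precedes it.
    // Assumes charIndex does not split a surrogate pair.
    static int byteOffset(String text, int charIndex, Charset cs) {
        return text.substring(0, charIndex).getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "The quick brown fox jumps over the lazy dog";
        int charIndex = s.indexOf("quick");
        System.out.println(byteOffset(s, charIndex, StandardCharsets.UTF_8));
        System.out.println(byteOffset(s, charIndex, StandardCharsets.UTF_16BE));
    }
}
```

The byte distance between two occurrences of a word is then the difference between their two byte offsets in the chosen encoding.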
I know that ASCII codes are between 0 and 127 in decimal (0000 0000 to 0111 1111 in binary), and that values between 128 and 255 are extended ASCII.
I also thought that int accepts 9 digits (I was wrong: the range of int is -2,147,483,648 to 2,147,483,647), so if we cast every number between 0 and the maximum int value to a char, there will be a great many symbols; for example:
(char)999,999,999 gives 짿, which is a Korean symbol (I don't know what it means; even Google Translate can't find any meaning!).
The same thing happens with values between the minimum int value and 0.
It doesn't make sense that those symbols were entered one by one.
I don't understand: how could they assign each of those big numbers its own character?
I don't understand how they assign those big numbers to have it's own symbol?
The assignments are made by the Unicode consortium. See http://unicode.org for details.
In your particular case, however, you are doing something completely nonsensical. You have the integer 999999999, which in hex is 0x3B9AC9FF. You then cast that to char, which discards the top two bytes and gives you 0xC9FF. If you then look that up at Unicode.org: http://www.unicode.org/cgi-bin/Code2Chart.pl you discover that yes, it is a Korean character.
Unicode code points can in fact be quite large; there are over a million of them. But you can't get to them just by casting. To get to Unicode code points that are outside of the "normal" range using UTF-16 (as C# does), you need to use two characters. See http://en.wikipedia.org/wiki/UTF-16, the section on surrogate pairs.
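The answer above talks about C#, but Java's char is likewise a 16-bit UTF-16 code unit, so the same two effects can be sketched in Java. The class and helper names are illustrative, and U+1F600 is just an arbitrary example of a code point above 0xFFFF.

```java
class CodePointDemo {
    // Casting an int to char keeps only the low 16 bits.
    static int lowSixteenBits(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        // 999999999 is 0x3B9AC9FF; the cast leaves 0xC9FF.
        System.out.println(Integer.toHexString(lowSixteenBits(999_999_999)));

        // A code point above 0xFFFF needs two chars (a surrogate pair) in UTF-16.
        char[] pair = Character.toChars(0x1F600);
        System.out.println(pair.length);
        System.out.println(Character.isHighSurrogate(pair[0]) && Character.isLowSurrogate(pair[1]));

        // Reading it back as a code point recovers the original number.
        System.out.println(Integer.toHexString(new String(pair).codePointAt(0)));
    }
}
```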
To address some of the other concerns in your question:
I know that ACCII codes are between (0-127) in decimal and (0000 0000 to 0000 1111) in binary.
That's ASCII, not ACCII, and 127 in binary is 01111111, not 00001111
Also we know that int accepts 9 digits, so if we cast every number between
The range of an int is larger than that.
don't know what it mean even Google translate can't find any meaning
Korean is not like Chinese, where each glyph represents a word. Those are letters. They don't have a meaning unless they happen to accidentally form a word. You'd have about as much luck googling randomly chosen English letters and trying to find their meaning; maybe sometimes you'd choose CAT at random, but most of the time you'd choose XRO or some such thing that is not a word.
Read this if you want to understand how the Korean alphabet works: http://en.wikipedia.org/wiki/Hangul
I'm now working on a challenge from the website http://www.net-force.nl/challenges/ and I've run into an interesting problem I can't solve. I'm not asking for the whole result (as that would be breaking the rules), but I need help with the programming theory behind a hash function.
Basically, it's based on a Java applet with one text field, where the user has to enter the right password. When I decompile the .class file, one of the methods I get is this hash method.
String s contains the entered password, which is passed straight to the method:
private int hash(String s)
{
    int i = 0;
    // Add up the numeric value of every character in the string
    for (int j = 0; j < s.length(); j++)
        i += s.charAt(j);
    return i;
}
The problem is that the method returns an integer as the "hash", but how can characters be converted to an integer at all? I had an idea that maybe the password is a number, but it doesn't lead anywhere. Another idea involves ASCII, but still nothing.
Thanks for any help or tips.
The trick is that it's converting each character into an integer. Each character (char) in Java is a UTF-16 code unit. For the most part [1], you can just think of that as each character is mapped to a number between 0 and 65535 inclusive, in a scheme called Unicode. For example, 65 is the number for 'A', and if you'd typed in the Euro symbol, that would map to Unicode U+20AC (8364).
Your hashing function basically adds together the numbers for each character in the string. It's a very poor hash (in particular it gives the same results for the same characters regardless of ordering), but hopefully you'll get the idea.
[1] Things get trickier when you need to bear in mind surrogate pairs, where a single Unicode character is actually made up of two UTF-16 code units - that's for characters with a Unicode number of more than 65535. Let's stick to the basics for the moment though :)
The hash function you present is the simplest hashing function you could possibly write for a string.
It is easy to implement and really fast to compute.
It is problematic, though, since it doesn't distribute the input well.
Assuming ASCII chars, the hash of an n-character string lies between 0 and 127 * n, since an ASCII char is between 0 and 127; for an 8-character string, for example, that is 0 to 1016.
I.e. each character in the string is "treated" as its ASCII equivalent (for a more advanced analysis, check John's answer).
Anyway, you should note that strings containing the same characters but in a different order map to the same hash value with this function. Perhaps this is of interest to you in the challenge you are trying to attack (??)
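A short sketch showing the order-insensitivity mentioned above. The hash method is copied from the question (made static for convenience); the class name and the sample strings are made up.

```java
class AdditiveHashDemo {
    // Same logic as the decompiled method: the sum of all char values.
    static int hash(String s) {
        int sum = 0;
        for (int j = 0; j < s.length(); j++) {
            sum += s.charAt(j); // char is widened to its numeric value
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("A"));   // 65
        System.out.println(hash("abc")); // 97 + 98 + 99 = 294
        System.out.println(hash("cba")); // 294 again: ordering is ignored
        System.out.println(hash("bbb")); // 294 as well: different letters, same sum
    }
}
```

Any string whose character values add up to the target number passes the check, which is why such a sum is a weak hash for passwords.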
Suppose I have
String input = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";
I want to encode it into a string with fewer characters that also hides the actual information by representing it in Roman characters, i.e. the above encodes to something like "Adqwqkjlhs". It must be possible to decode the encoded string back to the original string.
The string input is actually something I parse from the hash of an URL, but the original format is lengthy and open to manipulation.
Any ideas?
Thanks
Edit #1
Each number can be from 0 to 99, and the numbers are separated by commas so that String.split(",") can retrieve the String[]
Edit #2 (Purpose of encoded string)
Suppose the above string encodes to bmtwva1131gpefvb1xv; then I can have a URL like www.shortstring.com/input#bmtwva1131gpefvb1xv. From there I would decode bmtwva1131gpefvb1xv into comma-separated numbers.
This isn't really much of an improvement from Nathan Hughes' solution, but the longer the Strings are, the more of a savings you get.
Encoding: create a String starting with "1", making each of the numbers in the source string 2 digits, thus "0" becomes "00", "5" becomes "05", "99" becomes "99", etc. Represent the resulting number in base 36.
Decoding: Take the base 36 number/string, change it back to base 10, skip the first "1", then turn every 2 numbers/letters into an int and rebuild the original string.
Example Code:
// Requires: import java.math.BigInteger; import java.util.StringTokenizer;
String s = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";
// ENCODE the string
StringTokenizer tokenizer = new StringTokenizer(s, ",");
StringBuilder b = new StringBuilder();
b.append("1"); // This is a primer character, in case we end up with a bunch of zeroes at the beginning
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken().trim();
    if (token.length() == 1) {
        // Pad single-digit numbers to two digits
        b.append("0");
        b.append(token);
    }
    else {
        b.append(token);
    }
}
System.out.println(b);
// We get this String: 101020000000000000000000000000000000000010202030004000000040003
String encoded = (new BigInteger(b.toString())).toString(36);
System.out.println(encoded);
// We get this String: kcocwisb8v46v8lbqjw0n3oaad49dkfdbc5zl9vn
// DECODE the string
String decoded = (new BigInteger(encoded, 36)).toString();
System.out.println(decoded);
// We should get this String: 101020000000000000000000000000000000000010202030004000000040003
StringBuilder p = new StringBuilder();
int index = 1; // we skip the first "1", it was our primer
while (index < decoded.length()) {
    if (index > 1) {
        p.append(",");
    }
    // Every two digits form one of the original numbers
    p.append(Integer.parseInt(decoded.substring(index, index + 2)));
    index = index + 2;
}
System.out.println(p);
// We should get this String: 1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3
I don't know of an easy way to turn a large number into base 64. Carefully chosen symbols (like +, _, -) are OK to use in a URL, so 0-9, a-z, A-Z, plus "_" and "-" makes 64. The BigInteger.toString(int radix) method only accepts a radix up to Character.MAX_RADIX, which is 36 (no uppercase letters). If you can find a way to take a large number and change it to base 64, then the resulting encoded String will be even shorter.
EDIT: looks like this does it for you: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html
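As a hedged alternative to the Commons Codec class linked above, the JDK's own java.util.Base64 (available since Java 8) has a URL-safe variant that uses "-" and "_". The sketch below converts the primed digit string to a BigInteger, takes its bytes, and encodes those; the class and method names are made up.

```java
import java.math.BigInteger;
import java.util.Base64;

class Base64UrlDemo {
    // digits is the primed decimal string, e.g. "1" followed by two digits per number.
    static String encode(String digits) {
        byte[] raw = new BigInteger(digits).toByteArray();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    // Reverses encode(): Base64 text to bytes to BigInteger to decimal string.
    static String decode(String encoded) {
        byte[] raw = Base64.getUrlDecoder().decode(encoded);
        return new BigInteger(raw).toString();
    }

    public static void main(String[] args) {
        String primed = "101020000000000000000000000000000000000010202030004000000040003";
        String encoded = encode(primed);
        System.out.println(encoded + " (" + encoded.length() + " characters)");
        System.out.println(decode(encoded).equals(primed));
    }
}
```

The leading "1" primer keeps the number positive and preserves leading zeroes, so the round trip through toByteArray() and back is lossless.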
How about saving it as a base 36 number?
In Java that would be
new java.math.BigInteger("120000000000000000012230400403").toString(36)
which would evaluate to "bmtwva1131gpefvb1xv"
You would get the original number back with
new java.math.BigInteger("bmtwva1131gpefvb1xv", 36)
It's a good point that this doesn't handle leading 0s (Thilo's suggestion of adding a leading 1 would work). About the commas: if the numbers were equally sized (01 instead of 1) then I think there wouldn't be a need for commas.
I suggest you look at base64, which provides 6 bits of information per character -- in general your encoding efficiency is log2(K) bits per symbol, where K is the number of symbols in the set of allowable symbols.
For an 8-bit character set, many of the characters are impermissible in URLs, so you need to choose some subset that are legal URL characters.
Just to clarify: I didn't mean encode your "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3" string as base64 -- I meant figure out what information you really want to encode, expressed as a string of raw binary bytes, and encode that in base64. It will exclude control characters (although you might want to use an alternate form where all 64 characters can be used in URLs without escaping) and be more efficient than converting numbers to a printable number form.
The number can be from 0 to 99, and each number is separate by a comma for String.split(",") to retrieve the String[]
OK, now you have a clear definition. Here's a suggestion:
Convert your information from its original form to a binary number / byte array. If all you have is a string of comma-separated numbers from 0-99, then here are two options:
(slow) -- treat as numbers in base 100, convert to a BigInteger (e.g. n = n * 100 + x[i] for each number x in the array), convert to a byte array, and be sure to precede the whole thing by its length, so that "0,0,0,0" can be distinguished from "0,0" (numerically equal in base 100, but with a different length). Then convert the result to base64.
(more efficient) -- treat as numbers in base 128 (since that is a power of 2), and use any number from 100-127 as a termination character. Each block of 6 numbers therefore contains 42 (=6*7) bits of information, which can be encoded as a string of 7 characters using base64. (Pad with termination characters as needed to reach an even multiple of 6 of the original numbers.)
Because you have a potentially variable-length array of numbers as inputs, you need to encode the length somehow -- either directly as a prefix, or indirectly by using a termination character.
For the inverse algorithm, just reverse the steps and you'll get an array of numbers from 0 to 99 -- using either the prefixed length or termination character to determine the size of the array -- which you can convert to a human-readable string separated with commas.
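The "more efficient" option and its inverse can be sketched roughly as below. This is one possible reading of the description, not a reference implementation: the URL-safe alphabet, the choice of 127 as the termination value, and all names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class Base128PackDemo {
    // Assumed URL-safe 64-symbol alphabet (A-Z, a-z, 0-9, '-', '_').
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    static final int TERMINATOR = 127; // any value from 100 to 127 would do

    static String encode(int[] numbers) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < numbers.length; start += 6) {
            long block = 0;
            for (int k = 0; k < 6; k++) { // 6 numbers x 7 bits = 42 bits
                int idx = start + k;
                int v = idx < numbers.length ? numbers[idx] : TERMINATOR;
                if (v < 0 || (idx < numbers.length && v > 99)) {
                    throw new IllegalArgumentException("value out of range: " + v);
                }
                block = (block << 7) | v;
            }
            for (int shift = 36; shift >= 0; shift -= 6) { // 42 bits = 7 symbols
                out.append(ALPHABET.charAt((int) ((block >> shift) & 63)));
            }
        }
        return out.toString();
    }

    static int[] decode(String encoded) {
        List<Integer> values = new ArrayList<>();
        outer:
        for (int start = 0; start + 7 <= encoded.length(); start += 7) {
            long block = 0;
            for (int k = 0; k < 7; k++) {
                int symbol = ALPHABET.indexOf(encoded.charAt(start + k));
                if (symbol < 0) throw new IllegalArgumentException("bad symbol");
                block = (block << 6) | symbol;
            }
            for (int shift = 35; shift >= 0; shift -= 7) {
                int v = (int) ((block >> shift) & 127);
                if (v >= 100) break outer; // termination value: stop decoding
                values.add(v);
            }
        }
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                       1, 2, 2, 3, 0, 4, 0, 0, 0, 4, 0, 3};
        String encoded = encode(input);
        System.out.println(encoded);
        System.out.println(java.util.Arrays.equals(decode(encoded), input));
    }
}
```

Because every block is padded with the termination value, "0,0" and "0,0,0,0" produce different strings without a separate length prefix.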
If you have access to the original information in a raw binary form before it's encoded as a string, use that instead. (but please post a question with the input format requirements for that information)
If the numbers are between 0 and 255, you can create a byte array out of them. Once you have a byte array, you have many choices:
Use base64 on the byte array, which will create a compact string (almost) URL compatible
Convert them to chars, using your own algorithm based on maximum values
Convert them to longs, and then use Long.toString(x,31).
To convert back, you'll obviously have to apply the chosen algorithm in the opposite way.
Modified UUENCODE:-
Split the binary into groups of 6 bits
Make an array of 64 characters (choose ones allowable and keep in ASCII order for easy search):- 0..9, A..Z, _, a..z, ~
Map between the binary and the characters.
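The three steps above might look like this in Java. The 64-character alphabet is the one listed above; how trailing bits are padded (with zeroes) and all class and method names are assumptions made for this sketch.

```java
import java.io.ByteArrayOutputStream;

class SixBitCodecDemo {
    // 0..9, A..Z, _, a..z, ~ in ASCII order, as listed above (64 symbols).
    static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~";

    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // holds bits not yet emitted
        int bits = 0;   // number of valid bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 6) { // emit one symbol per complete 6-bit group
                bits -= 6;
                out.append(ALPHABET.charAt((buffer >> bits) & 63));
            }
            buffer &= (1 << bits) - 1; // keep only the leftover bits
        }
        if (bits > 0) { // pad the final partial group with zero bits
            out.append(ALPHABET.charAt((buffer << (6 - bits)) & 63));
        }
        return out.toString();
    }

    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;
        int bits = 0;
        for (int i = 0; i < text.length(); i++) {
            int value = ALPHABET.indexOf(text.charAt(i));
            if (value < 0) throw new IllegalArgumentException("bad symbol");
            buffer = (buffer << 6) | value;
            bits += 6;
            if (bits >= 8) { // emit one byte per complete 8-bit group
                bits -= 8;
                out.write((buffer >> bits) & 0xFF);
                buffer &= (1 << bits) - 1;
            }
        }
        return out.toByteArray(); // leftover padding bits are dropped
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 0, 0, 99, (byte) 255};
        String encoded = encode(data);
        System.out.println(encoded);
        System.out.println(java.util.Arrays.equals(decode(encoded), data));
    }
}
```

Since the alphabet is kept in ASCII order, the indexOf lookup could be swapped for a binary search or a lookup table if speed matters.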