I have a byte array that holds the hash of a file. It is produced with MessageDigest, so there is padding. I then make a shorthash, which is just the first two bytes of the hash, like this:
byte[] shorthash = new byte[2];
System.arraycopy(hash, 0, shorthash, 0, 2);
To make it readable for the user and to save it in a DB, I'm converting it to String with a Base64 Encoder:
Base64.getUrlEncoder().encodeToString(hash); //Same for shorthash
What I don't understand is:
Why is the String representing my shorthash four characters long? I thought a char was one or two bytes, so since I'm copying only two bytes, I shouldn't have more than two chars, right?
Why isn't my shorthash String the same as the start of the hash String?
For example, I get:
Hash: LE5D8vCsMp3Lcf-RBwBRbO1v4soGq7BBZ9kB_2SJnGY=
Shorthash: Rak=
You can see the = at the end of each; it certainly comes from the MessageDigest padding, so it is normal for the hash, but why for the shorthash? It should be the two FIRST bytes, and the = is at the end!
Moreover, since I wanted to get rid of this padding, I did the following:
String finalHash = Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
byte[] shorthash = new byte[2];
System.arraycopy(hash.getBytes(), 0, shorthash, 0, 2);
String finalShorthash = Base64.getUrlEncoder().encodeToString(shorthash);
I didn't want to copy the String directly, since I'm not really sure what two bytes in a String would be.
Then the = is gone for my hash, but not for my shorthash. I guess I need to add the "withoutPadding" option to my shorthash, but I don't understand why, since it's a copy of my hash, which shouldn't have padding anymore. Unless the padding is gone only from the String representation and not from the bytes behind it?
Can someone explain this behavior? Does it come from the conversion between byte[] and String?
"Why is the String representing my shorthash four characters long?"
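Because you Base64-encoded it. Each Base64 digit represents exactly 6 bits of data. You have 16 bits. Two digits are not enough (just 12 bits), so you need three digits to represent those bits. The fourth digit is padding, because Base64 output is usually padded to a multiple of four digits. The = therefore comes from the Base64 encoding, not from MessageDigest, and withoutPadding() only affects the String that the encoder produces, never the byte array it was given.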
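A minimal sketch of the arithmetic above, assuming the first two bytes of the hash are 0x2C and 0x4E (the values that the posted hash string "LE5D..." decodes to). The class and method names are made up for illustration.

```java
import java.util.Base64;

class ShortHashBase64 {
    // Encodes with the URL-safe alphabet, optionally dropping the '=' padding.
    static String encode(byte[] data, boolean padded) {
        Base64.Encoder encoder = Base64.getUrlEncoder();
        if (!padded) {
            encoder = encoder.withoutPadding();
        }
        return encoder.encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] firstThree = {0x2C, 0x4E, 0x43}; // 24 bits = exactly 4 digits
        byte[] firstTwo = {0x2C, 0x4E};         // 16 bits = 3 digits + 1 pad

        System.out.println(encode(firstThree, true)); // LE5D
        System.out.println(encode(firstTwo, true));   // LE4=
        System.out.println(encode(firstTwo, false));  // LE4
    }
}
```

The first two digits match the start of the full hash string, but the third digit differs: in the full hash it carries 4 bits of byte two plus 2 bits of byte three, while in the shorthash those last 2 bits are zero.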
Related
I have String hash in hex form ("e6fb06210fafc02fd7479ddbed2d042cc3a5155e") and I would like to compare it to crypt.digest().
One way, which works fine, is to convert crypt.digest() to hex, but I would like to avoid multiple conversions and rather convert hash from hex form (above) to byte array.
What I tried was:
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
but it does not match with crypt.digest(). When I convert hashBytes back to hex I get "00e6fb06210fafc02fd7479ddbed2d042cc3a5155e".
The leading zeros seem to be the reason why I fail to match byte arrays. Why do they occur? How can I get the same result using crypt.digest() and toByteArray?
The reason for the extra 00 is that e6 has its high (sign) bit set.
BigInteger.toByteArray() returns a two's-complement representation, so it prepends a redundant 00 byte to keep the value positive.
String hash = "e6fb06210fafc02fd7479ddbed2d042cc3a5155e";
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
hashBytes = hashBytes.length > 1 && hashBytes[0] == 0
? Arrays.copyOfRange(hashBytes, 1, hashBytes.length) : hashBytes;
System.out.println(Arrays.toString(hashBytes));
The question arises, what if the hash actually starts with a 00?
Then you need the hash length, or do a lenient comparison.
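One way to use the known hash length, sketched under the assumption that the caller knows how many bytes the digest has (20 for SHA-1). The helper name toFixedLength is hypothetical; it right-aligns whatever BigInteger returns into an array of the expected size, which handles both the extra sign byte and genuine leading zero bytes.

```java
import java.math.BigInteger;

class FixedLengthHash {
    // Converts a hex string to exactly `length` bytes, right-aligned.
    // An extra leading sign byte is dropped; missing leading zeros are restored.
    static byte[] toFixedLength(String hex, int length) {
        byte[] raw = new BigInteger(hex, 16).toByteArray();
        byte[] out = new byte[length];
        int n = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - n, out, length - n, n);
        return out;
    }

    public static void main(String[] args) {
        byte[] sha1 = toFixedLength("e6fb06210fafc02fd7479ddbed2d042cc3a5155e", 20);
        System.out.println(sha1.length); // 20
    }
}
```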
The answer can be found in the following answer from a thread about the highly related question Convert a string representation of a hex dump to a byte array using Java?:
The issue with BigInteger is that there must be a "sign bit". If the leading byte has the high bit set then the resulting byte array has an extra 0 in the 1st position. But still +1.
– Gray Oct 28 '11 at 16:20
Since the first bit has a special meaning (indicating the sign, 0 for positive, 1 for negative), BigInteger will prefix the data with an additional 0 in case your data started with a 1 on the high bit. Otherwise it would be interpreted as negative although it was not negative to begin with.
I.e. data like
101110
is turned into
0101110
You could easily undo this manually by using Arrays.copyOfRange(data, 1, data.length) if it happens.
However, instead of fixing that code, I would suggest using one of the other solutions posted in the linked thread. They are cleaner and easier to read and maintain.
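As an illustration of the kind of alternative meant here (not necessarily the exact code from the linked thread), a direct hex decoder that never goes through BigInteger could look like this. The class and method names are made up for the example.

```java
class HexDecode {
    // Decodes two hex digits per byte, so leading zeros and high bits are preserved as-is.
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at byte " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] hashBytes = hexToBytes("e6fb06210fafc02fd7479ddbed2d042cc3a5155e");
        System.out.println(hashBytes.length); // 20
    }
}
```

The result can then be compared to crypt.digest() with Arrays.equals.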
I have been reading about encoding, Unicode, and Java 9 compact Strings over the last two days and I am getting on quite well, but there is something I don't understand.
About the byte data type
1). It is 8-bit storage with a range from -128 to 127.
Questions
1). Why didn't Java implement it as unsigned, like the unsigned 16-bit char? I mean with a range of 0..255, because with 0 to 127 I can only hold an ASCII value. What would happen if I set the value 200 (an extended ASCII value)? It would overflow to -56.
2). Does the negative value mean something? I have tried a simple example using Java 11:
final char value = (char)200;//in byte would overflow
final String stringValue = new String(new char[]{value});
System.out.println(stringValue);//THE SAME VALUE OF JAVA 8
I checked the String.value field and I see a byte array:
System.out.println(value[0]);//-56
The same questions as before arise: does the -56 (the negative value) mean something? In other languages, is this overflow detected and turned back into the value 200? How can Java know that the value -56 is the same as 200 in a char?
I have tried harder examples, like code point 128048, and in the String.value field I see an array of bytes like this:
0 = 61
1 = -40
2 = 48
3 = -36
I know this code point takes 4 bytes, and I understand how the char[] is transformed to a byte[], but I don't know how String handles this byte[] data.
Sorry if this question is simple, and sorry for any typos; English is not my native language. Thanks a lot.
Why didn't Java implement it as unsigned, like the unsigned 16-bit char? I mean with a range of 0..255, because with 0 to 127 I can only hold an ASCII value. What would happen if I set the value 200 (an extended ASCII value)? It would overflow to -56.
Java’s primitive data types were settled with Java 1.0 a quarter century ago. The compact strings were introduced in Java 9, less than two years ago. This new feature, which is merely an implementation detail, did not justify fundamental changes to Java’s type system.
Besides that, you are looking at one interpretation of the data stored in a byte. For the sake of representing iso-latin-1 units, it is entirely irrelevant whether interpreting the same data as Java’s built-in signed byte would result in a positive or negative number.
Likewise, Java’s I/O API allows reading a file into a byte[] array and writing byte[] arrays back to files, and these two operations are already sufficient to copy a file losslessly, regardless of its file format, which would only be relevant when interpreting its content.
So the following works since Java 1.1:
byte[] bytes = "È".getBytes("iso-8859-1");
System.out.println(bytes[0]);
System.out.println(bytes[0] & 0xff);
-56
200
The two numbers, -56 and 200 are just different interpretations of the bit pattern 11001000 whereas the iso-latin-1 interpretation of a byte containing the bit pattern 11001000 is the character È.
A char value is also just an interpretation of a two byte quantity, i.e. as UTF-16 code unit. Likewise, a char[] array is a sequence of bytes in the computer’s memory with a standard interpretation.
We can also interpret other byte sequences this way.
StringBuilder sb = new StringBuilder().appendCodePoint(128048);
byte[] array = new byte[4];
StandardCharsets.UTF_16LE.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
System.out.println(Arrays.toString(array));
will print the value you’ve seen, [61, -40, 48, -36].
The advantage of using a byte[] array inside the String class is that the interpretation can now be chosen: iso-latin-1 when all characters are representable in that encoding, or utf-16 otherwise.
The possible numeric interpretations are irrelevant to the string. However, when you ask “How can Java know that -56 value is the same as 200”, you should ask yourself, how does it know that the bit pattern 11001000 of a byte is -56 in the first place?
System.out.println(value[0]);
involves an operation that is actually expensive compared to ordinary computer arithmetic: the conversion of a byte (or an int) to a String. This conversion is often overlooked because it has been defined as the default way of printing a byte, but it is no more natural than a conversion to a String that interprets the value as an unsigned quantity. For further reading, I recommend Two's complement.
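A small sketch of the three interpretations discussed above, applied to the same 8 bits. The class and method names are invented for the example; Byte.toUnsignedInt is the standard library call (available since Java 8).

```java
class BytePatterns {
    // The raw bit pattern, zero-padded to 8 digits.
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xff);
        return String.format("%8s", s).replace(' ', '0');
    }

    // Java's default signed interpretation (-128..127).
    static int signed(byte b) {
        return b;
    }

    // The unsigned interpretation (0..255).
    static int unsigned(byte b) {
        return Byte.toUnsignedInt(b);
    }

    // The iso-latin-1 interpretation: the unsigned value is the code point.
    static char latin1(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 200;
        System.out.println(bits(b));     // 11001000
        System.out.println(signed(b));   // -56
        System.out.println(unsigned(b)); // 200
        System.out.println(latin1(b));   // È (if the console supports it)
    }
}
```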
This is because not all bytes in a string are interpreted the same way. The interpretation depends on the string's character encoding.
Example:
in a single-byte encoding such as ISO-8859-1 (and for ASCII characters in UTF-8), each character takes 8 bits.
in UTF-16, each character in the Basic Multilingual Plane takes 16 bits.
etc...
This means that ASCII text decoded as UTF-8 is read 1 byte per character, while the same bytes decoded as UTF-16 are read 2 bytes per character.
Look at this code: the same byte array is decoded to a String using UTF-8 and UTF-16.
byte[] data = new byte[] {97, 98, 99, 100};
System.out.println(new String(data, StandardCharsets.UTF_8));
System.out.println(new String(data, StandardCharsets.UTF_16));
The output of this code is:
abcd // 4 bytes = 4 chars, 1 byte per char
慢捤 // 4 bytes = 2 chars, 2 bytes per char
Going back to the question, what motivated the developers to do so was reducing the memory footprint of strings. Not all strings use all 16 bits that a char offers.
Let's say I have a byte array and I try to decode it as UTF-8 using the following:
String tekst = new String(result2, StandardCharsets.UTF_8);
System.out.println(tekst);
//where result2 is the byte array
Then, I get the bytes using getBytes() with values from 0 to 128
byte[] orig = tekst.getBytes();
And then I wish to do a frequency count of my byte[] orig using the following:
int[] frequencies = new int[256];
for (byte b: orig){
frequencies[b]++;
}
Everything goes well till I encounter an error which states
java.lang.ArrayIndexOutOfBoundsException: -61
Does that mean that my byte still contains negative values despite converting it to UTF-8? Is there something wrong with what I'm doing? Can someone please give me some clarity on this, because I'm still a beginner on the subject. Thank you.
Answering the specific question
Does that mean that my byte still contains negative values despite converting it to UTF-8?
Yes, absolutely. That's because byte is signed in Java. A byte value of -61 would be 195 as an unsigned value. You should expect to get bytes which aren't in the range 0-127 when you encode any non-ASCII text with UTF-8.
The fix is easy: just map the byte to the range 0-255 with a bit mask:
frequencies[b & 0xff]++;
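Put together, a minimal runnable version of the frequency count with the mask applied might look like this. The class name and the sample text are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;

class ByteFrequencies {
    // Counts how often each unsigned byte value (0..255) occurs.
    static int[] count(byte[] data) {
        int[] frequencies = new int[256];
        for (byte b : data) {
            frequencies[b & 0xff]++; // -61 becomes index 195
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // 'é' is encoded in UTF-8 as the two bytes 0xC3 0xA9, i.e. -61 and -87 as signed bytes.
        byte[] orig = "\u00e9a".getBytes(StandardCharsets.UTF_8);
        int[] frequencies = count(orig);
        System.out.println(frequencies[0xC3]); // 1
        System.out.println(frequencies['a']);  // 1
    }
}
```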
Addressing what you're attempting to do
This line:
String tekst = new String(result2, StandardCharsets.UTF_8);
... is only appropriate if result2 is genuinely UTF-8-encoded text. It's not appropriate if result2 is some arbitrary binary data such as an image, compressed data, or even text encoded in some other encoding.
If you want to preserve arbitrary binary data as a string, you should use something like Base64 or hex. Basically, you need to determine whether your data is inherently textual (in which case, you should use strings for as much of the time as possible, and use an appropriate Charset to convert to binary where necessary) or inherently binary (in which case you should use bytes for as much of the time as possible, and use base64 or hex to convert to text where necessary).
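To illustrate the difference, here is a sketch that checks whether a byte array survives a round trip through a String. The helper names are hypothetical; the sample data contains 0xFF, which is never valid in UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryAsText {
    // Decoding arbitrary bytes as UTF-8 replaces invalid sequences, so this can lose data.
    static boolean survivesUtf8(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, s.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 maps any byte sequence to text and back without loss.
    static boolean survivesBase64(byte[] data) {
        String s = Base64.getEncoder().encodeToString(data);
        return Arrays.equals(data, Base64.getDecoder().decode(s));
    }

    public static void main(String[] args) {
        byte[] binary = {(byte) 0xff, 0x00, (byte) 0xc3};
        System.out.println(survivesUtf8(binary));   // false
        System.out.println(survivesBase64(binary)); // true
    }
}
```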
This line:
byte[] orig = tekst.getBytes();
... is almost always a bad idea. It uses the platform-default encoding to convert a string to bytes. If you really, really want to use the platform-default encoding, I would make that explicit:
byte[] orig = tekst.getBytes(Charset.defaultCharset());
... but this is an extremely unusual requirement these days. It's almost always better to stick to UTF-8 everywhere.
How would I go about doing that? I tried using SHA-1 and MD5 but the output is too long for my requirements and truncation would not make it unique.
Input : String containing numbers e.g. (0302160123456789)
Received output : 30f2bddc3e2fba9c05d97d04f8da4449
Desired Output: Unique number within range (0000000000000000 - FFFFFFFFFFFFFFFF) and 16 characters long
Any help/ pointers are greatly appreciated.
How big is your input domain? If it is bigger than your output domain, then the pigeonhole principle applies and you can't get unique output by definition.
If the input domain is smaller or equal to the output domain, then you can easily accomplish this with a Pseudo-Random Permutation (PRP) that block ciphers provide.
The output of 16 hexits is equivalent to 8 bytes, i.e. 64 bits. DES (and Triple DES) is a block cipher that has this block size.
Parse the input string to a byte array in a compact fashion. If the input always consists of numerical digits, you can use Ebbe M. Pedersen's approach with
byte[] plaintext = new BigInteger("0302160123456789").toByteArray();
Then you can generate some random, but fixed key of 24 bytes for Triple DES and instantiate the cipher with:
Cipher c = Cipher.getInstance("DESede/ECB/PKCS5Padding");
c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "DESede"));
byte[] ciphertext = c.doFinal(plaintext);
Use some kind of Hex converter to get the representation you want.
You can "hash" numbers up to 36028797018963968 (2^55) with this. If you want larger numbers (up to 9223372036854775808, i.e. 2^63), then you need to use "DESede/ECB/NoPadding" and pad the input yourself to a full 8-byte block.
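A sketch of the NoPadding variant, assuming the input fits into a long so that it can be written as one 8-byte block. The key below is a placeholder for the example only (a real key should be random and kept secret), and the class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

class PermutationId {
    // Placeholder 24-byte Triple DES key for illustration; not a real secret.
    private static final byte[] KEY = new byte[24];
    static {
        for (int i = 0; i < KEY.length; i++) {
            KEY[i] = (byte) (i + 1);
        }
    }

    private static byte[] run(int mode, byte[] block) {
        try {
            Cipher c = Cipher.getInstance("DESede/ECB/NoPadding");
            c.init(mode, new SecretKeySpec(KEY, "DESede"));
            return c.doFinal(block); // exactly 8 bytes in, 8 bytes out
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Maps a number to a 16-character hex string; distinct inputs give distinct outputs.
    static String toId(long number) {
        byte[] plaintext = ByteBuffer.allocate(8).putLong(number).array();
        byte[] ciphertext = run(Cipher.ENCRYPT_MODE, plaintext);
        return String.format("%016X", ByteBuffer.wrap(ciphertext).getLong());
    }

    // Inverse mapping, which shows that the transformation is a permutation.
    static long fromId(String id) {
        byte[] ciphertext = ByteBuffer.allocate(8).putLong(Long.parseUnsignedLong(id, 16)).array();
        return ByteBuffer.wrap(run(Cipher.DECRYPT_MODE, ciphertext)).getLong();
    }

    public static void main(String[] args) {
        String id = toId(Long.parseLong("0302160123456789"));
        System.out.println(id + " -> " + fromId(id));
    }
}
```

Because the mapping is reversible with the key, uniqueness follows from the cipher being a permutation on 64-bit blocks rather than from any hashing property.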
Are you going to receive more than FFFFFFFFFFFFFFFF different strings?
If not, then it's a simple problem of generating integers: the first string gets 0, the next gets 1, and so on. You just keep a list of the strings you have seen and check whether the same one appears again.
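A minimal in-memory sketch of that idea, assuming the mapping only has to live as long as the process (a persistent version would need a database or similar). The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class SequentialIds {
    private final Map<String, Long> ids = new HashMap<>();

    // Returns the same 16-digit hex ID for a repeated input, and the next free number otherwise.
    String idFor(String input) {
        Long id = ids.get(input);
        if (id == null) {
            id = (long) ids.size();
            ids.put(input, id);
        }
        return String.format("%016X", id);
    }

    public static void main(String[] args) {
        SequentialIds registry = new SequentialIds();
        System.out.println(registry.idFor("0302160123456789")); // 0000000000000000
        System.out.println(registry.idFor("0302160123456790")); // 0000000000000001
        System.out.println(registry.idFor("0302160123456789")); // 0000000000000000
    }
}
```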
You could just convert your number to hex using BigInteger like this:
String id = new BigInteger("0302160123456789").toString(16);
System.out.println(id);
That gives:
112d022d2ed15
This is part of a larger assignment that I've mostly got done except for this one part, which is a bit embarrassing because it sounds really simple on paper.
So basically, I've got a large amount of compressed data. I've been keeping a running checksum of the input using a CRC32:
CRC32 checksum = new CRC32();
...
//read input into buffer
checksum.update(buff, 0, bytesRead);
So it updates every time more data is read in. I've also kept track of the uncompressed length using
uncompressedLength += manage.read(buff);
So it is an int value that holds the number of bytes of the original file. This is a little-endian machine.
From what I can tell, what I need is a four-byte CRC, for which I used
public byte[] longToBytes(long x) {
ByteBuffer buffer = ByteBuffer.allocate(8);
buffer.putLong(x);
return buffer.array();
}
byte[] c = longToBytes(checksum.getValue());
BUT this is 8 bytes. CRC32.getValue() returns a long. Can I convert it to an int in this case without losing information I need?
And then the ISIZE is supposed to be the four-byte uncompressed length modulo 2^32. I've got the variable uncompressedLength, which is an int. I think I just have to convert it to bytes and that's all?
I've been hexdumping the result from gzip and the result from my program, and my header and data are right; I'm just missing my trailer.
As for why I'm doing this manually, it's because of an assignment. Trust me, I'd love to just use GZIPOutputStream if I could.
CRC32 has 32 bits; the class returns a long because of the Checksum interface it implements.
The uncompressed length should be a long, since files larger than 2 GB aren't uncommon nowadays.
So in both cases, you need to convert the lowest 32 bits of a long to 4 bytes.
static byte[] lower4bytes(long v)
{
return new byte[] {
(byte)(v ),
(byte)(v>> 8),
(byte)(v>>16),
(byte)(v>>24)
};
}
To write an integer in little-endian form, simply write the low byte of the integer (i.e. modulo 256 or anded with 0xff), then shift it down eight bits or divide by 256, then write the resulting low byte, and repeat that two more times. You'll write four bytes. Since you only write four, you will automatically be writing the length modulo 2^32.
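The same result can be sketched with a little-endian ByteBuffer, which does the byte shuffling for both trailer fields at once. The class and method names are assumptions for the example; the cast to int keeps only the low 32 bits of each long.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class GzipTrailerSketch {
    // Builds the 8-byte gzip trailer: CRC32 then ISIZE, both little-endian.
    static byte[] trailer(long crc, long uncompressedLength) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) crc)                // low 32 bits of the checksum
                .putInt((int) uncompressedLength) // length modulo 2^32
                .array();
    }

    // Convenience helper for in-memory data.
    static byte[] trailerFor(byte[] data) {
        CRC32 checksum = new CRC32();
        checksum.update(data, 0, data.length);
        return trailer(checksum.getValue(), data.length);
    }

    public static void main(String[] args) {
        byte[] t = trailerFor("123456789".getBytes(StandardCharsets.US_ASCII));
        StringBuilder hex = new StringBuilder();
        for (byte b : t) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim()); // 26 39 f4 cb 09 00 00 00
    }
}
```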