String hex hash to bytes - java

I have a hash as a hex String ("e6fb06210fafc02fd7479ddbed2d042cc3a5155e") and I would like to compare it to crypt.digest().
One way, which works fine, is to convert crypt.digest() to hex, but I would like to avoid multiple conversions and rather convert hash from hex form (above) to byte array.
What I tried was:
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
but it does not match with crypt.digest(). When I convert hashBytes back to hex I get "00e6fb06210fafc02fd7479ddbed2d042cc3a5155e".
The leading zeros seem to be the reason why I fail to match byte arrays. Why do they occur? How can I get the same result using crypt.digest() and toByteArray?

The reason for the extra 00 is that e6 has its high (sign) bit set.
BigInteger.toByteArray() returns a two's-complement representation, so it prepends a redundant 00 byte to keep the value positive.
String hash = "e6fb06210fafc02fd7479ddbed2d042cc3a5155e";
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
hashBytes = hashBytes.length > 1 && hashBytes[0] == 0
? Arrays.copyOfRange(hashBytes, 1, hashBytes.length) : hashBytes;
System.out.println(Arrays.toString(hashBytes));
The question arises, what if the hash actually starts with a 00?
Then you need the hash length, or do a lenient comparison.
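If the hash length is known, both cases (an extra sign byte, or a hash that really starts with 00) can be handled by copying into a fixed-size array. The following is only a sketch: the class and method names are made up for illustration, and it assumes a 20-byte digest because the example hash has 40 hex digits.

```java
import java.math.BigInteger;
import java.util.Arrays;

class FixedLengthHex {
    // Hypothetical helper: converts a hex string to exactly `length` bytes.
    // Assumes the value fits in `length` bytes; anything beyond that is dropped.
    static byte[] toFixedLength(String hex, int length) {
        byte[] raw = new BigInteger(hex, 16).toByteArray();
        byte[] out = new byte[length];
        // Copy the low-order bytes: this skips a redundant sign byte and
        // zero-pads on the left when leading 00 bytes were lost.
        int n = Math.min(raw.length, length);
        System.arraycopy(raw, raw.length - n, out, length - n, n);
        return out;
    }

    public static void main(String[] args) {
        String hash = "e6fb06210fafc02fd7479ddbed2d042cc3a5155e";
        System.out.println(toFixedLength(hash, 20).length);             // 20
        System.out.println(Arrays.toString(toFixedLength("00ff", 2)));  // [0, -1]
    }
}
```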

The answer can be found in the following answer from a thread about the highly related question Convert a string representation of a hex dump to a byte array using Java?:
The issue with BigInteger is that there must be a "sign bit". If the leading byte has the high bit set then the resulting byte array has an extra 0 in the 1st position. But still +1.
– Gray Oct 28 '11 at 16:20
Since the first bit has a special meaning (indicating the sign, 0 for positive, 1 for negative), BigInteger will prefix the data with an additional 0 in case your data started with a 1 on the high bit. Otherwise it would be interpreted as negative although it was not negative to begin with.
I.e. data like
101110
is turned into
0101110
You could easily undo this manually by using Arrays.copyOfRange(data, 1, data.length) if it happens.
However, instead of fixing that code, I would suggest using one of the other solutions posted in the linked thread. They are cleaner and easier to read and maintain.
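For illustration, one such alternative is to parse the hex digits pairwise, which never involves a sign byte. This is a minimal sketch, not the code from the linked thread; the class name, the "SHA-1" algorithm and the input string are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

class HexDecode {
    // Hypothetical helper: two hex digits become one byte, so leading
    // zeros are preserved and no extra sign byte appears.
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] expected = decode("e6fb06210fafc02fd7479ddbed2d042cc3a5155e");
        // Assumption: the digest in the question is SHA-1 (40 hex digits).
        MessageDigest crypt = MessageDigest.getInstance("SHA-1");
        byte[] actual = crypt.digest("some input".getBytes(StandardCharsets.UTF_8));
        System.out.println(MessageDigest.isEqual(expected, actual));
    }
}
```

On Java 17 and later, java.util.HexFormat.of().parseHex(hash) should do the same job without a hand-written loop.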

Related

Reading 31 bit Integer from Java InputStream

I'm trying to read a 31 bit long Integer from an InputStream in Java and I can't figure out a way to do this. I receive four bytes from the InputStream; the first bit of the first byte is a reserved bit which is always unset (0x0) and the rest is a 31 bit long integer. Here is a visualization of what I described:
+-+-------------+---------------+-------------------------------+
|R| 31 bit long Integer |
+-+-------------------------------------------------------------+
I would appreciate it if you could help me come up with a solution. Thanks!
I'm trying to read a 31 bit long Integer from an InputStream
That is impossible.
The minimum size of thing you can read is a byte, which is 8 bits; all things you can read from them are a multiple of 8.
and the first bit of the first byte is a reversed byte which is always onset (0x0)
This sentence doesn't make any sense. The first 'bit of a byte' cannot be a 'reversed byte'. Given that bits are a 1-dimensional concept, there is no such thing as a 'reversed bit', and 'onset', if it means anything, means '1' and not '0', and bits are not as a rule communicated in '0x' syntax, which is hexadecimal.
I conclude you must be confused about the API.
However, to be a bit more helpful: If you have 4 bytes of data that contains a 31-bit-length integer, then:
You need to know if it is 'big endian' or 'little endian'. It will be someplace in the docs; usually protocols are big endian.
That first bit can trivially be stripped away or isolated, which should help.
Assuming big endian:
try (InputStream raw = socket.getInputStream();
DataInputStream data = new DataInputStream(raw)) {
int v = data.readInt();
boolean isolatedBit = (v >>> 31) != 0;
v = v & 0x7FFFFFFF;
}
DataInputStream has the readInt() call that takes care of business.
isolatedBit will be false if that 'R' bit is unset, and true if it is set.
Even if this R thing is set, that last line will ensure that the value of v has that bit unset. As a consequence, the number will be between 0 and 2^31-1 (thus, always positive).
NB: After some corrections to the original question, this is much simpler:
Given that the reserved bit is always unset, you can just call int v = data.readInt(), that's the only thing in the try block that would then be required. Had the 'reserved bit' always been a 1 - you would need that & 0x7FFFFFFF to get rid of it.
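To try this without a socket, the same logic can be run against an in-memory stream. This is a sketch under the same big-endian assumption; the class name, method name and sample bytes are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class Read31Bit {
    // Reads four big-endian bytes and clears the reserved top bit.
    static int read31(DataInputStream in) throws IOException {
        return in.readInt() & 0x7FFFFFFF;
    }

    public static void main(String[] args) throws IOException {
        // Sample data with the reserved bit set on purpose, to show the mask at work.
        byte[] wire = {(byte) 0x80, 0x00, 0x01, 0x02};
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            System.out.println(read31(in)); // 258
        }
    }
}
```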

Java 11 Compact Strings magic behind char[] to byte[]

I have been reading about Unicode encoding and Java 9 compact Strings for the last two days and I am getting along quite well. But there is something that I don't understand.
About the byte data type
1). It is an 8-bit storage type that ranges from -128 to 127
Questions
1). Why didn't Java implement it as unsigned, like the 16-bit char? I mean it would then be in the range 0 to 255. From 0 to 127 I can only hold an ASCII value, but what happens if I set the value 200, an extended ASCII value? It would overflow to -56.
2). Does the negative value mean something? I tried a simple example using Java 11
final char value = (char)200;//in byte would overflow
final String stringValue = new String(new char[]{value});
System.out.println(stringValue);//THE SAME VALUE OF JAVA 8
I checked the String.value field and I see a byte array:
System.out.println(value[0]);//-56
The same question as before arises: does the -56 mean something, I mean the negative value? In other languages, is this overflow detected so that the value returns to 200? How can Java know that the value -56 is the same as 200 in char?
I have tried harder examples like codepoint 128048 and I see in the String.value field an array of bytes like this.
0 = 61
1 = -40
2 = 48
3 = -36
I know this codepoint takes 4 bytes, and I get how the char[] is transformed to byte[], but I don't know how String handles this byte[] data.
Sorry if this question is simple, and sorry for any typos; English is not my native language. Thanks a lot.
Why didn't Java implement it as unsigned, like the 16-bit char? I mean it would then be in the range 0 to 255. From 0 to 127 I can only hold an ASCII value, but what happens if I set the value 200, an extended ASCII value? It would overflow to -56.
Java’s primitive data types were settled with Java 1.0 a quarter century ago. The compact strings were introduced in Java 9, less than two years ago. This new feature, which is merely an implementation detail, did not justify fundamental changes to Java’s type system.
Besides that, you are looking at one interpretation of the data stored in a byte. For the sake of representing iso-latin-1 units, it is entirely irrelevant whether interpreting the same data as Java’s built-in signed byte would result in a positive or negative number.
Likewise, Java’s I/O API allows reading a file into a byte[] array and writing byte[] arrays back to files, and these two operations are already sufficient to copy a file losslessly, regardless of its file format, which would only be relevant when interpreting its content.
So the following works since Java 1.1:
byte[] bytes = "È".getBytes("iso-8859-1");
System.out.println(bytes[0]);
System.out.println(bytes[0] & 0xff);
-56
200
The two numbers, -56 and 200 are just different interpretations of the bit pattern 11001000 whereas the iso-latin-1 interpretation of a byte containing the bit pattern 11001000 is the character È.
A char value is also just an interpretation of a two byte quantity, i.e. as UTF-16 code unit. Likewise, a char[] array is a sequence of bytes in the computer’s memory with a standard interpretation.
We can also interpret other byte sequences this way.
StringBuilder sb = new StringBuilder().appendCodePoint(128048);
byte[] array = new byte[4];
StandardCharsets.UTF_16LE.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
System.out.println(Arrays.toString(array));
will print the value you’ve seen, [61, -40, 48, -36].
The advantage of using a byte[] array inside the String class is that the interpretation can now be chosen: iso-latin-1 when all characters are representable in that encoding, utf-16 otherwise.
The possible numeric interpretations are irrelevant to the string. However, when you ask “How can Java know that -56 value is the same as 200”, you should ask yourself, how does it know that the bit pattern 11001000 of a byte is -56 in the first place?
System.out.println(value[0]);
involves an operation that is actually expensive compared to ordinary computer arithmetic: the conversion of a byte (or an int) to a String. This conversion is often overlooked because it has been defined as the default way of printing a byte, but it is no more natural than a conversion to a String that interprets the value as an unsigned quantity. For further reading, I recommend Two's complement.
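The choice between the two interpretations can be imitated in a few lines. This is only an illustrative sketch, not the JDK's internal code: the class and method names are made up, and little-endian UTF-16 is assumed because that matches the byte values seen in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CompactSketch {
    // One byte per char if every char fits in iso-latin-1,
    // otherwise two bytes per char (UTF-16, little-endian assumed).
    static byte[] encode(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return s.getBytes(StandardCharsets.UTF_16LE);
            }
        }
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00C8 (È) fits in one byte.
        System.out.println(Arrays.toString(encode("\u00C8"))); // [-56]
        // Code point 128048 needs a surrogate pair, so four bytes.
        String dog = new StringBuilder().appendCodePoint(128048).toString();
        System.out.println(Arrays.toString(encode(dog))); // [61, -40, 48, -36]
    }
}
```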
This is because not all bytes in a string are interpreted the same way. It depends on the string's character encoding.
Example:
in a UTF-8 string, each code unit is 8 bits, and a character takes one to four of them.
in a UTF-16 string, each code unit is 16 bits, and a character takes one or two of them.
etc...
This means that if the bytes are decoded as UTF-8, characters are built by reading 1 byte at a time (more for non-ASCII characters); if they are decoded as UTF-16, characters are built by reading 2 bytes at a time.
Look at this code: a single byte array data is transformed to string using UTF-8 and UTF-16.
byte[] data = new byte[] {97, 98, 99, 100};
System.out.println(new String(data, StandardCharsets.UTF_8));
System.out.println(new String(data, StandardCharsets.UTF_16));
The output of this code is:
abcd // 4 bytes = 4 chars, 1 byte per char
慢捤 // 4 bytes = 2 chars, 2 bytes per char
Going back to the question: what motivated the developers was reducing the memory footprint of strings. Not all strings use all 16 bits that a char offers.

Converting String to UTF-8 byte array returns a negative value in Java

Let's say I have a byte array and I try to decode it as UTF-8 using the following
String tekst = new String(result2, StandardCharsets.UTF_8);
System.out.println(tekst);
//where result2 is the byte array
Then, I get the bytes using getBytes() with values from 0 to 128
byte[] orig = tekst.getBytes();
And then, I wish to do a frequency count of my byte[] orig using the following:
int[] frequencies = new int[256];
for (byte b: orig){
frequencies[b]++;
}
Everything goes well till I encounter an error which states
java.lang.ArrayIndexOutOfBoundsException: -61
Does that mean that my byte still contains negative values despite converting it to UTF-8? Is there something wrong that I'm doing? Can someone please give me clarity on this cause I'm still a beginner on the subject. Thank you.
Answering the specific question
Does that mean that my byte still contains negative values despite converting it to UTF-8?
Yes, absolutely. That's because byte is signed in Java. A byte value of -61 would be 195 as an unsigned value. You should expect to get bytes which aren't in the range 0-127 when you encode any non-ASCII text with UTF-8.
The fix is easy: mask the byte to get its unsigned value in the range 0-255:
frequencies[b & 0xff]++;
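Put together, the corrected loop might look like the following sketch. The class name and sample text are made up; "é" is used because its first UTF-8 byte is 0xC3, which is the -61 from the exception.

```java
import java.nio.charset.StandardCharsets;

class ByteFrequencies {
    // Counts how often each unsigned byte value (0-255) occurs.
    static int[] count(byte[] data) {
        int[] frequencies = new int[256];
        for (byte b : data) {
            frequencies[b & 0xff]++; // the mask turns -61 into index 195
        }
        return frequencies;
    }

    public static void main(String[] args) {
        byte[] orig = "\u00E9".getBytes(StandardCharsets.UTF_8); // é is C3 A9 in UTF-8
        System.out.println(orig[0]);             // -61
        System.out.println(count(orig)[195]);    // 1
    }
}
```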
Addressing what you're attempting to do
This line:
String tekst = new String(result2, StandardCharsets.UTF_8);
... is only appropriate if result2 is genuinely UTF-8-encoded text. It's not appropriate if result2 is some arbitrary binary data such as an image, compressed data, or even text encoded in some other encoding.
If you want to preserve arbitrary binary data as a string, you should use something like Base64 or hex. Basically, you need to determine whether your data is inherently textual (in which case, you should use strings for as much of the time as possible, and use an appropriate Charset to convert to binary where necessary) or inherently binary (in which case you should use bytes for as much of the time as possible, and use base64 or hex to convert to text where necessary).
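As a sketch of the binary-to-text route, java.util.Base64 round-trips arbitrary bytes without loss. The class name, helper names and sample bytes are made up for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

class BinaryAsText {
    // Arbitrary bytes to a String that is safe to store or print.
    static String toText(byte[] binary) {
        return Base64.getEncoder().encodeToString(binary);
    }

    // The String back to exactly the original bytes.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Sample data that is not valid UTF-8 and would be damaged by new String(...).
        byte[] binary = {(byte) 0xC3, 0x28, (byte) 0xFF, 0x00};
        byte[] back = fromText(toText(binary));
        System.out.println(Arrays.equals(binary, back)); // true
    }
}
```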
This line:
byte[] orig = tekst.getBytes();
... is almost always a bad idea. It uses the platform-default encoding to convert a string to bytes. If you really, really want to use the platform-default encoding, I would make that explicit:
byte[] orig = tekst.getBytes(Charset.defaultCharset());
... but this is an extremely unusual requirement these days. It's almost always better to stick to UTF-8 everywhere.

Combining elements of a byte[] array into 16-bit numbers

This is an excerpt of code from a music tuner application. A byte[] array is created, audio data is read into the buffer arrays, and then the for loop iterates through buffer and combines the values at indices n,n+1, to create an array of 16-bit numbers that is half the length.
byte[] buffer = new byte[2*1200];
targetDataLine.read(buffer, 0, buffer.length);
for ( int i = 0; i < n; i+=2 ) {
int value = (short)((buffer[i]&0xFF) | ((buffer[i+1]&0xFF) << 8)); //**Don't understand**
a[i >> 1] = value;
}
So far, what I have is this:
From a different SO post, I learned that every byte being stored in a larger type must be & with 0xFF, due to its conversion to a 32-bit number. I guess the leading 24 bits are filled with 1s (though I don't know why it isn't filled with zeros... wouldn't leading with 1s change the value of the number? 000000000010 (2) is different from 111111110010 (-14), after all.), so the purpose of 0xff is to only grab the last 8 bits (which is the whole byte).
When buffer[i+1] is shifted left by 8 bits, this makes it so that, when ORing, the eight bits from buffer[i+1] are in the most significant positions, and the eight bits from buffer[i] are in the least significant eight bits. We wind up with a 16-bit number that is of the form buffer[i+1] + buffer[i]. (I'm using + but I understand it's closer to concatenation.)
First, why are we ORing buffer[i] | buffer[i+1] << 8? This seems to destroy the original sound information unless we pull it back out in the same way; while I understand that OR will combine them into one value, I don't see how that value can be useful or used in calculations later. And the only way this data is accessed later is as its literal values:
diff += Math.abs(a[j]-a[i+j]);
If I have 101 and 111, added together I should get 12, or 1100. Yet 101 | 111 << 3 gives 111101, which is equal to 61. The closest I got to understanding was that 101 (5) | 111000 (56) is the same as adding 5+56=61. But the order matters -- doing the reverse 101 <<3 | 111 is completely different. I really don't understand how the data can remain useful, when it is OR'd in this way.
The other problem I'm having is that, because Java uses signed bytes, the eighth position doesn't indicate the value, but the sign. If I'm ORing two binary signed numbers, then in the resulting 16-bit number, the bit at 2⁷ is now acting as a value instead of a placeholder. If I had a negative byte before running the OR, then in my final value post-operation, it would now erroneously be acting as though the original number had a positive 2⁷ in it. 0xff doesn't get rid of this, because it preserves the eighth, signed byte, so shouldn't this be a problem?
For example, 1111 (-1) and 0101, when OR'd, might give 01011111. But 1111 wasn't representing POSITIVE 1111, it was representing the signed version; yet in the final answer, it now is acting as a positive 2³.
UPDATE: I marked the accepted answer, but it took that + a little extra work to figure out where I went wrong. For anyone who may read this in the future:
As far as the signing goes, the code I have uses signed bytes. My only guess as to why this doesn't mess anything up was that all of the values received might be of positive sign. Except that this doesn't make sense, given that a waveform varies in amplitude over [-1,1]. I was going to play around with this to try and figure it out. If there are negative signs, the implementation of the code here doesn't seem to remove the 1 when ORing, so I suspected that it doesn't affect the computation too much (given that we're dealing with really large values: diff += means diff will be really large, so a few extra 1s shouldn't hurt the outcome given the code and the comparisons it relies on). So this was all wrong. I gave it some more thought and it's really simple, actually -- the only reason this was such a problem is because I didn't know about big-endian, and then once I read about it, I misunderstood exactly how it is implemented. Endian-ness is explained in the next bullet point.
Regarding the order in which the bits are placed, destroying the sound, etc. The code I'm using sets bigEndian=false, meaning that the byte order goes from least significant byte to most significant byte. For this reason, combining the two indices of buffer requires taking the second index, placing its bits first, and placing the first index as second (so we are now in big-endian byte order). One of the problems I had was the impression that "endian-ness" determines the bit order. I thought 10010101 big-endian would become 10101001 little-endian. Turns out this is not the case -- the bits in each byte remain in their original order; the difference is that the bytes are ordered "backward". So 10110101 11100001 big-endian becomes 11100001 10110101 -- same bit order within each byte; however, different byte order.
Finally, I'm not sure why, but the accepted answer is correct: targetDataLine.read() may place the bits into a byte array only (not just in my code, but in all Java code using targetDataLine -- read() only accepts arguments where the destination var is a byte array), but the data is in fact one short split into two bytes. It is for this reason that every two indices must be combined together.
Coming back to the signing, it should be obvious by now why this isn't an issue. These are the comments that I now have in the code, which explain more coherently what it took all of the above to explain:
/* The Javadoc explains that the targetDataLine will only read to a byte-typed array.
However, because the sample size is 16-bit, it is actually storing 16-bit numbers
there (shorts), auto-parsing them every eight bits. Additionally, because it is storing
them in little-endian, bits [2^0,2^7] are stored in index[i] in normal order (powers 76543210)
while bits [2^8,2^15] are stored in index[i+1]. So, together they currently read as [7-6-5-4-3-2-1-0 15-14-13-12-11-10-9-8],
which is a problem. In the next for loop, we take care of this and re-organize the bytes by swapping every pair (remember the bits are ok, but the bytes are out of order).
Also, although the array is signed, this will not matter when we combine bytes, because the sign-bit (2^15) will be placed
back at the beginning like it normally is; although 2^7 currently exists as the most significant bit in its byte,
it is not a sign-indicating bit,
because it is really the middle of the short which was split. */
This combines the byte stream from the input, which is in low-byte-first order, into a stream of shorts in internal byte order.
With sign extension, it is more a question of the sign encoding of the original byte stream. If the original byte stream is unsigned (coding values from 0 to 255), then the & 0xFF overcomes the otherwise unwanted effects of Java treating the values as signed. So an educated guess is that the external byte stream encodes unsigned bytes.
Judging whether the code is plausible needs information on what external encoding is being treated and what internal encoding is used. E.g. (wild guess, could be totally wrong!): the two byte chunks read could belong to 2 channels of a stereo sound encoding and are put into a single short for ease of internal processing. You should look at the encoding being read and the use of the converted data within the application.
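To make the low-byte-first combination concrete, here is a small sketch that decodes the same bytes two ways: with the shift-and-OR from the question, and with a little-endian ByteBuffer. The class name, method names and sample bytes are made up; it assumes 16-bit signed little-endian samples, as the question's update describes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class PcmDecode {
    // Same expression as in the question: low byte first, high byte shifted up.
    static short[] manual(byte[] buffer) {
        short[] a = new short[buffer.length / 2];
        for (int i = 0; i + 1 < buffer.length; i += 2) {
            a[i >> 1] = (short) ((buffer[i] & 0xFF) | ((buffer[i + 1] & 0xFF) << 8));
        }
        return a;
    }

    // Equivalent decoding that lets ByteBuffer handle the byte order.
    static short[] viaByteBuffer(byte[] buffer) {
        short[] a = new short[buffer.length / 2];
        ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(a);
        return a;
    }

    public static void main(String[] args) {
        // 0x1234 stored as 34 12, and -2 (0xFFFE) stored as FE FF.
        byte[] buffer = {0x34, 0x12, (byte) 0xFE, (byte) 0xFF};
        System.out.println(Arrays.toString(manual(buffer)));        // [4660, -2]
        System.out.println(Arrays.toString(viaByteBuffer(buffer))); // [4660, -2]
    }
}
```

The cast to short is what restores the sign: bit 15 of the combined value becomes the sign bit again, while bit 7 of the low byte is just an ordinary value bit.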

XOR Encryption in Java: losing data after decryption

I'm currently writing a very small Java program to implement a one-time-pad, where the pad (or key) itself is generated as a series of bytes using a SecureRandom object, which is seeded using a simple string with the SHA-512 algorithm.
Generating the one-time-pad hasn't caused any problems, and if I supply the same seed string each time, as expected I get the same sequence of pseudo-random numbers, making the decryption process possible as long as the person decrypting has the seed string used to encrypt.
When I try to encrypt a file, the program reads in the data 64 chars at a time (except for the end of file, which is generally an odd number), and generates 64 bytes (or a matching amount) of pseudo-random bytes. XOR is performed between the elements of both arrays, the resulting char array containing the cipher characters is written to file, and the process repeats until all text in the file has been read.
Now, because Java treats all primitives as signed numbers (the data type byte ranges from -128 to 127, not 0 to 255) this means that the XOR operation can (and does) result in some negative values (-128 to -1). It seems that Java does not recognise these values as valid ASCII, and simply writes a ? (question mark) to the file for any negative values. When it comes to reading from the file to decrypt the cipher text, the negative value that resulted in the ? to be written to file is lost, replaced with 63, the valid ASCII code for a question mark.
This means that XORing this value is useless, without the original value there is no way to produce the plaintext. Incidentally, if I reproduce the behaviour of encrypting some data and then decrypting the data immediately after, in the same program run, and printing status along the way, there are no problems. Only if the data is written to file is the information lost.
I should also mention that I did try adding 128 to each encryption XOR result, and then subtracting it before performing the decryption XOR (to put each value in a valid ASCII range), but the ? problem still showed up because the 32 codes from 128 to 159 can't be read back and appear as ?
I've been banging my head off the wall on this for a while now, any help is appreciated.
Cheers.
This is very confused. If you are processing a char array, the elements are 16 bits wide, they are unsigned, and not all values are valid. So (a) you can't possibly be having a problem with signs or bytes, and (b) you shouldn't be doing that at all. You should be reading the file into a byte array, XOR-ing, and writing out the byte array directly to the output file. No Readers or Writers, no chars, no Strings.
I guess the problem is in the way you write the file. Write the converted byte array directly to a FileOutputStream and do not try to convert it to a string first. For reading, do the same thing: read it into a byte array.
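A byte-only round trip through a file could look like the sketch below. The class name, temp file and unseeded SecureRandom are assumptions for illustration, not the asker's program; the pad must be at least as long as the data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.Arrays;

class XorBytes {
    // XORs data with the pad byte by byte; negative byte values are fine.
    static byte[] xor(byte[] data, byte[] pad) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ pad[i]);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] pad = new byte[plain.length];
        new SecureRandom().nextBytes(pad);

        // Write and read raw bytes only: no Writer, no char[], no String.
        Path file = Files.createTempFile("cipher", ".bin");
        Files.write(file, xor(plain, pad));
        byte[] decrypted = xor(Files.readAllBytes(file), pad);
        Files.delete(file);

        System.out.println(Arrays.equals(plain, decrypted)); // true
    }
}
```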
