I'm trying to replicate this hashing code in Python, but the two languages handle bytes differently and produce very different outputs.
Can someone guide me here?
Java Code (Original Code)
public static String hash(String filePath, String salt) {
    String finalHash = null;
    Path path = Paths.get(filePath);
    try {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] data = Files.readAllBytes(path);
        byte[] dataDigest = md.digest(data);
        byte[] hashDigest = md.digest(salt.getBytes("ISO-8859-1"));
        byte[] xorBytes = new byte[dataDigest.length];
        for (int i = 0; i < dataDigest.length && i < hashDigest.length; i++) {
            xorBytes[i] = (byte) (dataDigest[i] << 1 ^ hashDigest[i] >> 1);
        }
        finalHash = (new HexBinaryAdapter()).marshal(xorBytes);
    } catch (IOException | NoSuchAlgorithmException e) {
        e.printStackTrace();
    }
    return finalHash;
}
Python Code (Translated by me)
def generate_hash(file_path: str, salt: bytes) -> str:
    with open(file_path, 'rb') as f:
        data = f.read()
    hashed_file = sha1(data).digest()
    hashed_salt = sha1(salt).digest()
    xor_bytes = []
    for i in range(len(hashed_file)):
        xor_bytes.append((hashed_file[i] << 1 ^ hashed_salt[i] >> 1))
    return ''.join(map(chr, xor_bytes))  # This is probably not equivalent of HexBinaryAdapter
There are the following issues:
The shift operations are wrongly implemented in the Python code:
In the Python code the generated hash is stored in a bytes-like object as a sequence of unsigned integer values between 0 and 255 [1], e.g. 0xc8 = 11001000 = 200. In Java, integers are stored as signed values, whereby the two's complement is used to represent negative numbers [2][3]. The value 0xc8 would therefore be interpreted as -56 if stored in a byte variable.
The >>-operator produces a different result on the binary level for signed and unsigned values, because it is an arithmetic shift operator which preserves the sign [4][5][6]. Example:
signed -56 >> 1 = 1110 0100 = -28
unsigned 200 >> 1 = 0110 0100 = 100
The <<-operator, on the other hand, does not cause the above problem, but can lead to values that cannot be represented by a byte. Example:
signed -56 << 1 = 1 1001 0000 = -112
unsigned 200 << 1 = 1 1001 0000 = 400
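The difference is easy to reproduce in Python (a small illustrative snippet, reusing the 0xC8 byte from the example above):

b = 0xC8                 # 200 as an unsigned Python byte value, -56 as a signed Java byte
print(b >> 1)            # 100  -> 0110 0100 (shift of the unsigned value)
print((b - 256) >> 1)    # -28  -> ...1110 0100 (Python's >> on a negative int is arithmetic, like Java's)
print((b << 1) & 0xFF)   # 144  -> masking keeps only the low byte, like Java's (byte) cast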
For these reasons, in the Python code the following line
xor_bytes.append((hashed_file[i] << 1 ^ hashed_salt[i] >> 1))
has to be replaced by
xor_bytes.append((hashed_file[i] << 1 ^ tc(hashed_salt[i]) >> 1) & 0xFF)
where
def tc(val):
    if val > 127:
        val = val - 256
    return val
determines the negative value of the two's complement representation (a more sophisticated variant using bitwise operators can be found in [7]).
The bitwise AND (&) with 0xFF ensures that only the low-order byte is taken into account in the Python code, analogous to the (byte) cast in the Java code [5].
There are several ways to convert the list/bytes-like object into a hexadecimal string (as in the Java code), e.g. with [8][9]
bytes(xor_bytes).hex()
or with [8][10] (as binary string)
binascii.b2a_hex(bytes(xor_bytes))
In the Python code the encoding of the salt must be taken into account. Since the salt is passed to the Python function as a bytes object (in the Java code it is passed as a String and encoded inside the method), the encoding must be performed before the function is called:
saltStr = 'MySalt'
salt = saltStr.encode('ISO-8859-1')
For a functional consistency with the Java code, the salt would have to be passed as a string and the encoding would have to be performed within the function.
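Putting all of this together, a corrected version of the Python function could look like the following sketch (the salt is encoded inside the function for parity with the Java code; HexBinaryAdapter/printHexBinary emits upper-case hex digits, hence the .upper()):

from hashlib import sha1

def tc(val):
    """Reinterpret an unsigned byte (0..255) as a signed two's complement byte."""
    return val - 256 if val > 127 else val

def generate_hash(file_path: str, salt: str) -> str:
    with open(file_path, 'rb') as f:
        data = f.read()
    hashed_file = sha1(data).digest()
    hashed_salt = sha1(salt.encode('ISO-8859-1')).digest()
    xor_bytes = []
    for i in range(len(hashed_file)):
        # shift the signed salt byte arithmetically, XOR, and keep only the low byte
        xor_bytes.append((hashed_file[i] << 1 ^ tc(hashed_salt[i]) >> 1) & 0xFF)
    return bytes(xor_bytes).hex().upper()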
Related
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class JavaMD5 {
    public static void main(String[] args) {
        String passwordToHash = "MyPassword123";
        String generatedPassword = null;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(passwordToHash.getBytes());
            byte[] bytes = md.digest();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < bytes.length; i++) {
                sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
            }
            generatedPassword = sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println(generatedPassword);
    }
}
This line is the problem:
sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
What does each part do in this structure?
Thanks, and sorry for asking; I'm new to Java.
Presumably most of the code is clear and the only mystery for you here is this expression:
(bytes[i] & 0xff) + 0x100
The first part:
bytes[i] & 0xff
widens the byte at position i to an int value with zeros in bit positions 8-31. In Java, the byte data type is a signed integer value, so the widening sign-extends the value; without the & 0xff, byte values greater than 0x7f would end up as negative int values. The rest is then fairly obvious: it adds 0x100, which simply turns on the bit at index 8 (since that bit is guaranteed to be 0 in (bytes[i] & 0xff)). It is then converted to a hex String value by the call to Integer.toString(..., 16).
The reason for first adding 0x100 and then stripping off the 1 (done by the substring(1) call, which takes the substring starting at position 1 through the end) is to guarantee two hex digits in the end result. Otherwise, byte values below 0x10 would end up as one-character strings when converted to hex.
It's debatable whether all that has better performance (it certainly isn't clearer) than:
sb.append(String.format("%02x", bytes[i]));
It's a really messy way of translating to a hexadecimal string.
& 0xFF performs a binary AND, forcing the value into the range 0 to 255 (undoing the sign extension that happens when the byte is widened to an int)
+ 0x100 adds 256 to the result to ensure the result is always three hex digits
Integer.toString(src, 16) converts the integer to a string with radix 16 (hexadecimal)
Finally .substring(1) strips the first character (the 1 from step 2)
So, this is a very elaborate and obfuscated way to convert a byte to an always 2-character hexadecimal string.
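The same trick mirrored in Python, for comparison (an illustrative sketch; format(b & 0xFF, '02x') produces the same result in one step):

def byte_to_hex(b):
    # b & 0xFF : reduce to an unsigned byte (also works for a negative "Java byte")
    # + 0x100  : force a third hex digit so there are always two digits left after slicing
    # [1:]     : drop that leading '1', i.e. Java's substring(1)
    return format((b & 0xFF) + 0x100, 'x')[1:]

print(byte_to_hex(0x0B))   # 0b
print(byte_to_hex(-56))    # c8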
I have the following code...
int Val=-32768;
String Hex=Integer.toHexString(Val);
This equates to ffff8000
int FirstAttempt=Integer.parseInt(Hex,16); // Error "Invalid Int"
int SecondAttempt=Integer.decode("0x"+Hex); // Error "Invalid Int"
So, initially, it converts the value -32768 into a hex string ffff8000, but then it can't convert the hex string back into an Integer.
In .Net it works as I'd expect, and returns -32768.
I know that I could write my own little method to convert this myself, but I'm just wondering if I'm missing something, or if this is genuinely a bug?
int val = -32768;
String hex = Integer.toHexString(val);
int parsedResult = (int) Long.parseLong(hex, 16);
System.out.println(parsedResult);
That's how you can do it.
The reason why it doesn't work your way: Integer.parseInt expects a signed value, while toHexString produces an unsigned result. So if you pass in something higher than 0x7FFFFFFF, an error is thrown. If you parse it as a long instead, the value still fits; when you then cast it back to int, it wraps around to the correct (negative) value.
It overflows because the hex string represents a negative number, whose unsigned form doesn't fit in the positive range parseInt expects.
Try this and it will work:
int n = (int) Long.parseLong("ffff8000", 16);
int to Hex :
Integer.toHexString(intValue);
Hex to int :
Integer.valueOf(hexString, 16).intValue();
You may also want to use long instead of int (if the value does not fit the int bounds):
Hex to long:
Long.valueOf(hexString, 16).longValue()
long to Hex
Long.toHexString(longValue)
It's worth mentioning that Java 8 has the methods Integer.parseUnsignedInt and Long.parseUnsignedLong that do what you want, specifically:
Integer.parseUnsignedInt("ffff8000", 16) == -32768
The name is a bit confusing, since the parsed value still lands in a signed int, but it does the job.
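For comparison, the same reinterpretation in Python (a sketch; Python's int() has no 32-bit limit, so the two's complement wrap has to be applied manually):

val = int("ffff8000", 16)        # 4294934528 -- Python ints are unbounded, so no overflow
signed = val - 0x1_0000_0000 if val >= 0x8000_0000 else val
print(signed)                    # -32768, matching Integer.parseUnsignedInt("ffff8000", 16)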
Try using the BigInteger class; it works.
int Val=-32768;
String Hex=Integer.toHexString(Val);
//int FirstAttempt=Integer.parseInt(Hex,16); // Error "Invalid Int"
//int SecondAttempt=Integer.decode("0x"+Hex); // Error "Invalid Int"
BigInteger i = new BigInteger(Hex,16);
System.out.println(i.intValue());
Since Integer.toHexString(byte/integer) does not work when you are trying to convert signed bytes (such as UTF-16 decoded characters), you have to use:
Integer.toString(byte/integer, 16);
or
String.format("%02X", byte/integer);
For the reverse you can use
Integer.parseInt(hexString, 16);
Java's parseInt method is actually a bunch of code that rejects what it considers "false" hex: if you want to translate -32768, you should convert the absolute value into hex and then prepend the string with '-'.
Here is an excerpt from the Integer.java source file:
public static int parseInt(String s, int radix)
The description is quite explicit :
* Parses the string argument as a signed integer in the radix
* specified by the second argument. The characters in the string
...
...
* parseInt("0", 10) returns 0
* parseInt("473", 10) returns 473
* parseInt("-0", 10) returns 0
* parseInt("-FF", 16) returns -255
Using Integer.toHexString(...) is a good answer, but I personally prefer to use String.format(...).
Try this sample as a test.
byte[] values = new byte[64];
Arrays.fill(values, (byte) 8);   // fills the array with 8s, just for the test
String valuesStr = "";
for (int i = 0; i < values.length; i++)
    valuesStr += String.format("0x%02x", values[i] & 0xff) + " ";
valuesStr = valuesStr.trim();    // trim() returns a new string, so the result must be reassigned
The code below works:
int a=-32768;
String a1=Integer.toHexString(a);
int parsedResult=(int)Long.parseLong(a1,16);
System.out.println("Parsed Value is " +parsedResult);
Hehe, curious. I think this is an "intentional bug", so to speak.
The underlying reason is how the Integer class is written. Basically, parseInt is "optimized" for positive numbers. When it parses the string, it builds the result cumulatively, but negated. Then it flips the sign of the end-result.
Example:
66 = 0x42
parsed like:
4*(-1) = -4 (hex 4 parsed)
-4 * 16 = -64
-64 - 2 = -66 (hex 2 parsed)
return -66 * (-1) = 66
Now, let's look at your example
FFFF8000
15*(-1) = -15 (first F parsed)
-15 * 16 = -240
-240 - 15 = -255 (second F parsed)
-255 * 16 = -4080
-4080 - 15 = -4095 (third F parsed)
-4095 * 16 = -65520
-65520 - 15 = -65535 (fourth F parsed)
-65535 * 16 = -1048560
-1048560 - 8 = -1048568 (8 parsed)
-1048568 * 16 = -16777088
-16777088 - 0 = -16777088 (first 0 parsed)
-16777088 * 16 = -268433408
-268433408 - 0 = -268433408 (second 0 parsed)
Here it blows up, since -268433408 < -Integer.MAX_VALUE / 16 (-134217727).
Attempting to execute the next logical step in the chain (-268433408 * 16)
would cause an integer overflow error.
Edit (addition): in order for parseInt() to also accept the "unsigned" hex form that toHexString produces for negative numbers, they would have had to implement logic to wrap around when the cumulative result passes -Integer.MAX_VALUE, starting over at the max-end of the integer range and continuing downwards from there. Why they did not do this, one would have to ask Josh Bloch or whoever implemented it in the first place. It might just be an optimization.
However,
Hex=Integer.toHexString(Integer.MAX_VALUE);
System.out.println(Hex);
System.out.println(Integer.parseInt(Hex.toUpperCase(), 16));
works just fine, for just this reason. In the source for Integer you can find this comment.
// Accumulating negatively avoids surprises near MAX_VALUE
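A rough Python sketch of this negative accumulation (simplified, positive inputs only; not the actual JDK source) makes the overflow point visible:

def parse_hex_int(s: str) -> int:
    """Accumulate negatively with 32-bit overflow checks, in the spirit of Integer.parseInt."""
    limit = -(2**31 - 1)              # -Integer.MAX_VALUE
    multmin = -((2**31 - 1) // 16)    # Java truncates toward zero: -134217727
    result = 0
    for ch in s:
        digit = int(ch, 16)
        if result < multmin:          # the next "result * 16" would already overflow an int
            raise ValueError("Invalid int: " + s)
        result *= 16
        if result < limit + digit:    # the next "result - digit" would pass the limit
            raise ValueError("Invalid int: " + s)
        result -= digit
    return -result

print(parse_hex_int("7fffffff"))      # 2147483647
try:
    parse_hex_int("ffff8000")
except ValueError as e:
    print(e)                          # blows up on the last digit, like Integer.parseInt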
When I calculate the SHA-256 of a string in Java with the following method
public static void main(String[] args) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    byte[] hash = md.digest("password".getBytes());
    StringBuffer sb = new StringBuffer();
    for (byte b : hash) {
        sb.append(Integer.toHexString(b & 0xff));
    }
    System.out.println(sb.toString());
}
I get :
5e884898da2847151d0e56f8dc6292773603dd6aabbdd62a11ef721d1542d8
On the command line I do the following (I need the -n so that no newline is added):
echo -n "password" | sha256sum
and get
5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
If we compare these more closely, I find two subtle differences:
5e884898da2847151d0e56f8dc6292773603dd6aabbdd62a11ef721d1542d8
5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
or :
5e884898da28 47151d0e56f8dc6292773603d d6aabbdd62a11ef721d1542d8
5e884898da28 0 47151d0e56f8dc6292773603d 0 d6aabbdd62a11ef721d1542d8
Which of the two is correct here?
Result: Both are, but I was wrong...
I fixed it by using:
StringBuffer sb = new StringBuffer();
for (byte b : hash) {
    sb.append(String.format("%02x", b));
}
Thanks!
I'll take a reasonable guess: both are outputting the same digest, but in your Java code that outputs the byte[] result as a hex string, you are outputting small byte values (less than 16) without a leading 0. So a byte with value 0x0d is being written as "d", not "0d".
The culprit is the toHexString. It appears to be outputting 6 for the value 6 whereas the sha256sum one is outputting 06. The Java docs for Integer.toHexString() state:
This value is converted to a string of ASCII digits in hexadecimal (base 16) with no extra leading 0s.
The other zeros in the string aren't affected, since they sit in the low nibble of their bytes (e.g., 0x30).
One way to fix it would be to change:
for (byte b : hash) {
    sb.append(Integer.toHexString(b & 0xff));
}
to:
for (byte b : hash) {
    if (b < 16) sb.append("0");
    sb.append(Integer.toHexString(b & 0xff));
}
They're both right - it's your Java code that is at fault, because it is not printing out the leading 0 for a hex value less than 0x10.
You still need "echo -n" to prevent the trailing \n
The one generated by sha256sum seems correct. Your implementation seems to drop those two zeroes.
Using #paxdiablo's idea I had a problem with larger values, since they appear as negative bytes, so
Instead of:
for (byte b : hash) {
    sb.append(Integer.toHexString(b & 0xff));
}
you could do:
for (byte b : hash) {
    if (b >= 0 && b < 16) {   // >= 0 so that a zero byte is also padded to "00"
        sb.append("0");
    }
    sb.append(Integer.toHexString(b & 0xff));
}
And read #Sean Owen's answer.
You can also get the right result using this (note that BigInteger.toString(16) does not pad leading zeros, so a digest whose first byte is below 0x10 still comes out shorter):
MessageDigest md = MessageDigest.getInstance("SHA-256");
byte[] hash = md.digest("password".getBytes());
BigInteger bI = new BigInteger(1, hash);
System.out.println(bI.toString(16));
Hey friends, I'm trying to implement a Java "hash" function in Ruby.
Here's the java side:
import java.nio.charset.Charset;
import java.security.MessageDigest;

/**
 * @return most significant 8 bytes of the MD5 hash of the string, as a long
 */
protected long hash(String value) {
    byte[] md5hash;
    md5hash = md5Digest.digest(value.getBytes(Charset.forName("UTF8")));
    long hash = 0L;
    for (int i = 0; i < 8; i++) {
        hash = hash << 8 | md5hash[i] & 0x00000000000000FFL;
    }
    return hash;
}
So far, my best guess in ruby is:
# WRONG - doesn't work properly.
#!/usr/bin/env ruby -wKU
require 'digest/md5'
require 'pp'

md5hash = Digest::MD5.hexdigest("0").unpack("U*")
pp md5hash

hash = 0
0.upto(7) do |i|
  hash = hash << 8 | md5hash[i] & 0x00000000000000FF
end
pp hash
Problem is, this ruby code doesn't match the java output.
For reference, the above java code given these strings returns the corresponding long:
"00038c53790ecedfeb2f83102e9115a522475d73" => -2059313900129568948
"0" => -3473083983811222033
"001211e8befc8ac22dd265ecaa77f8c227d0007f" => 3234260774580957018
Thoughts:
I'm having problems getting the UTF8 bytes from the ruby string
In ruby I'm using hexdigest, I suspect I should be using just digest instead
The java code is taking the md5 of the UTF8 bytes whereas my ruby code is taking the bytes of the md5 (as hex)
Any suggestions on how to get the exact same output in ruby?
require 'digest/md5'

class String
  def my_hash
    hi1, hi2, mid, lo = *Digest::MD5.digest(self).unpack('cCnN')
    hi1 << 56 | hi2 << 48 | mid << 32 | lo
  end
end

require 'test/unit'

class TestMyHash < Test::Unit::TestCase
  def test_that_my_hash_hashes_the_string_0_to_negative_3473083983811222033
    assert_equal -3473083983811222033, '0'.my_hash
  end
end
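For completeness, the same hash is easy to express in Python as well (a sketch; the expected value for "0" is taken from the list in the question):

import hashlib

def my_hash(value: str) -> int:
    # most significant 8 bytes of the MD5 hash, interpreted as a signed big-endian long
    digest = hashlib.md5(value.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big', signed=True)

print(my_hash('0'))   # expected: -3473083983811222033 (per the question)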
I'm a student of computer science and we have to use BaseX (a pure Java OSS XML database) in one of our courses. While browsing through the code I discovered the following piece of code:
/**
 * Returns a md5 hash.
 * @param pw password string
 * @return hash
 */
public static String md5(final String pw) {
    try {
        final MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(Token.token(pw));
        final TokenBuilder tb = new TokenBuilder();
        for (final byte b : md.digest()) {
            final int h = b >> 4 & 0x0F;
            tb.add((byte) (h + (h > 9 ? 0x57 : 0x30)));
            final int l = b & 0x0F;
            tb.add((byte) (l + (l > 9 ? 0x57 : 0x30)));
        }
        return tb.toString();
    } catch (final Exception ex) {
        Main.notexpected(ex);
        return pw;
    }
}
(source: https://svn.uni-konstanz.de/dbis/basex/trunk/basex/src/main/java/org/basex/util/Token.java)
Just out of interest: what is happening there? Why these byte operations after the MD5? The docstring says it returns an MD5 hash... does it?
I didn't look up the definitions for the classes used, but the byte operations seem to be encoding the returned byte array into a string of hex characters.
for (final byte b : md.digest()) {
    // get the high 4 bits of the current byte
    final int h = b >> 4 & 0x0F;
    // convert into a hex digit (0x30 is '0', while 0x57 + 10 is 'a')
    tb.add((byte) (h + (h > 9 ? 0x57 : 0x30)));
    // the same for the bottom 4 bits
    final int l = b & 0x0F;
    tb.add((byte) (l + (l > 9 ? 0x57 : 0x30)));
}
This is a great example of why using magic numbers is bad. I, for one, honestly couldn't remember that 0x57+10 is the ASCII/Unicode codepoint for 'a' without checking it in a Python interpreter.
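Translated into Python for comparison (a sketch of the same nibble-to-ASCII trick; in practice hashlib.md5(...).hexdigest() already returns this string):

def to_hex(digest: bytes) -> str:
    out = []
    for b in digest:
        for nibble in (b >> 4 & 0x0F, b & 0x0F):
            # 0x30 is '0'; 0x57 + 10 is 'a'
            out.append(chr(nibble + (0x57 if nibble > 9 else 0x30)))
    return ''.join(out)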
I guess Matti is right - md.digest() returns a byte[], and BaseX uses Tokens in favor of Strings (hence the TokenBuilder).
So the conversion from md.digest() to String is done by converting the digest's hex representation to a Token.
It's not exactly easy to read, but it is quite similar to what Apache Commons does in its Codec library to get the String value of an MD5 hash.
This is a great example of why using magic numbers is bad.
Well, this is a core method, which isn't supposed to be modified by others – and this looks like the most efficient way to do it. But, true, the documentation could be better. Talking about core methods, it's worthwhile looking at code like Integer.getChars():
http://www.docjar.com/html/api/java/lang/Integer.java.html