Murmurhash3 between Java and C++ is not aligning


I have two separate applications, one in Java and the other in C++. I am using Murmurhash3 in both. However, C++ gives me a different result from Java for the same string.
Here is the one from C++: https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp?r=144
I am using the following function:
void MurmurHash3_x86_32 ( const void * key, int len,
uint32_t seed, void * out )
Here is the one for Java: http://search-hadoop.com/c/HBase:hbase-common/src/main/java/org/apache/hadoop/hbase/util/MurmurHash3.java||server+void+%2522hash
There are many versions of the same Java code above.
This is how I am making a call for Java:
String s = new String("b2622f5e1310a0aa14b7f957fe4246fa");
System.out.println(MurmurHash3.murmurhash3_x86_32(s.getBytes(), 0, s.length(), 2147368987));
The output I get from Java:
-1868221715
The output I get from C++
3297211900
When I tested for some other sample strings like
"7c6c5be91430a56187060e06fd64dcb8" and "7e7e5f2613d0a2a8c591f101fe8c7351" they match in Java and C++.
Any pointers are appreciated

There are two problems I can see. First, C++ is using uint32_t, and giving you a value of 3,297,211,900. This number is larger than can fit in a signed 32-bit int, and Java uses only signed integers. However, -1,868,221,715 is not equal to 3,297,211,900, even accounting for the difference between signed and unsigned ints.
(In Java 8 they have added Integer.toUnsignedString(int), which will convert a signed 32-bit int to its unsigned string representation. In earlier versions of Java, you can cast the int to a long and then mask off the high bits: ((long) i) & 0xffffffffL.)
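A minimal sketch of both conversions, applied to the two values from the question. The class name UnsignedView and the helper toUnsigned are made up for illustration; Integer.toUnsignedString is the standard Java 8 method mentioned above.

```java
final class UnsignedView {
    // Pre-Java 8 idiom: widen to long, then mask off the sign-extended high bits.
    static long toUnsigned(int i) {
        return ((long) i) & 0xffffffffL;
    }

    public static void main(String[] args) {
        int javaHash = -1868221715;                               // printed by the Java program
        System.out.println(Integer.toUnsignedString(javaHash));   // 2426745581
        System.out.println(toUnsigned(javaHash));                 // 2426745581

        long cppHash = 3297211900L;                               // printed by the C++ program
        System.out.println((int) cppHash);                        // -997755396, same bits read as signed
    }
}
```

Read either way, the two hashes are different 32-bit values, so signedness alone does not explain the mismatch.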
The second problem is that you are using the wrong version of getBytes(). The one that takes no argument converts a Unicode String to a byte[] using the default platform encoding, which may vary depending on how your system is set up. It could be giving you UTF-8, Latin1, Windows-1252, KOI8-R, Shift-JIS, EBCDIC, etc.
Never, ever, ever call the no arguments version of String.getBytes(), under any circumstances. It should be deprecated, decimated, defenestrated, destroyed, and deleted.
Use s.getBytes("UTF-8") (or whatever encoding you're expecting to get) instead.
As the Zen of Python says, "Explicit is better than implicit."
I can't tell if there may be any other problems beyond these two.
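A small sketch of encoding with an explicit charset. The class and method names are hypothetical; StandardCharsets.UTF_8 is available from Java 7 and avoids the checked exception that getBytes("UTF-8") declares. It also shows why the length passed to the hash should be the byte count rather than s.length(): the two only coincide for ASCII input such as the hex strings in the question.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    // Encode with a fixed charset so the bytes do not depend on the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "b2622f5e1310a0aa14b7f957fe4246fa";
        String accented = "caf\u00e9"; // "cafe" with an e-acute, which needs two bytes in UTF-8

        System.out.println(utf8(ascii).length + " bytes, " + ascii.length() + " chars");       // 32 bytes, 32 chars
        System.out.println(utf8(accented).length + " bytes, " + accented.length() + " chars"); // 5 bytes, 4 chars
    }
}
```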

I had the same problem as you, although my Java version of Murmurhash3 is different from yours. After making some changes to the C++ version of Murmurhash3, I got the two versions to generate the same hash values. Here is my solution, which you can use to check whether it also works for you.
Probably the biggest difference between the Java and C++ versions lies in the right shift operation: Java has both >> and >>>, while C++ only has >>. Integers in Java are all signed, while in C++ you can use signed or unsigned integers. In Java, >> is an arithmetic right shift and >>> is a logical right shift. In C++, >> on an unsigned type is a logical shift, while on a negative signed value it is implementation-defined before C++20 and an arithmetic shift on common compilers.
The original C++ version of Murmurhash3 uses unsigned integers. To generate negative hash values like Java does, first change every unsigned type uint32_t to the signed type int32_t in C++. Then locate each >>> in the Java code and adjust the corresponding >> in C++ so that it still shifts in zeros. For me, that meant changing from:
inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
return (x << r) | (x >> (32 - r));
}
to:
inline int32_t rotl32 ( int32_t x, int8_t r )
{
return (x << r) | (int32_t)((uint32_t)x >> (32 - r)); //similar to >>> in Java
}
and from:
FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
to:
FORCE_INLINE int32_t fmix32 ( int32_t h )
{
h ^= (int32_t)((uint32_t)h >> 16); // similar to >>> in Java
h *= 0x85ebca6b;
h ^= (int32_t)((uint32_t)h >> 13);
h *= 0xc2b2ae35;
h ^= (int32_t)((uint32_t)h >> 16);
return h;
}
In this way, my two versions of Murmurhash3 in Java and C++ generate the same hash value.
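For comparison, here is a sketch of what the Java side of those two helpers typically looks like. It is not taken from any particular Java port; the class name is made up, and the constants are the ones in the C++ code above. The point is only that >>> in Java plays the role of >> on uint32_t in C++.

```java
final class Murmur3Shifts {
    // Rotate left; >>> shifts in zeros, like >> on uint32_t in C++.
    static int rotl32(int x, int r) {
        return (x << r) | (x >>> (32 - r));
    }

    // Finalization mix, written with Java's logical right shift.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(-8 >> 1);   // -4: arithmetic shift keeps the sign
        System.out.println(-8 >>> 1);  // 2147483644: logical shift fills with zeros
        System.out.println(rotl32(0x80000001, 1) == Integer.rotateLeft(0x80000001, 1)); // true
    }
}
```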

Related

Fixing "incompatible types: possible lossy conversion from int to byte" in Java

I have a question regarding my code here:
public class Main
{
public static void main(String[] args) {
System.out.println("Hello World\n");
int x = 36;
byte b1 = ((byte) x) & ((byte) 0xff); // it seems it is the part after &, but I have 0xff cast to byte by using (byte)0xff, so not sure where exactly the error is coming from.
System.out.println(b1);
}
}
I am not sure exactly which part is causing this compiler error:
incompatible types: possible lossy conversion from int to byte

You appear to be confused.
There is no point in your code. Taking any number, calculating that & 0xFF, and then storing it in a byte is always a no-op: the mask does nothing.
You additionally get an error because & inherently always produces at least an int (it'll upcast anything smaller to match), so you're trying to assign an int to a byte.
What are you trying to accomplish?
"I want to have my byte be unsigned"!
No can do. Java doesn't have unsigned bytes. A Java byte is signed. Period. It can hold a value between -128 and +127. For calculation purposes, -1 and 255 are identical (they are both the bit sequence 1111 1111, in hex 0xFF, and they act identically under all relevant arithmetic, though it does get tricky when converting them to another numeric type such as int).
"I just want to store 255"!
Then use int. This is where most & 0xFF you'll ever see in java code comes from: When you have a byte value which java inherently treats as signed, but you wish to treat it as unsigned and, therefore (given that in java bytes can't do that), you want to upcast it to an int, containing the unsigned representation. This is how to do that:
int x = y & 0xFF;
Where y is any byte.
You presumably saw this somewhere and are now trying to apply it, but assigning the result of y & 0xFF to a byte doesn't mean anything. You'd assign it to an int variable, or just use it as expression in a further calculation (y & 0xFF is an int - make sure you add the appropriate parentheses, & has perhaps unexpected precedence).
int x = 36;
byte b1 = ((byte) x) & ((byte) 0xff);
Every imaginable way of this actually working would mean that b1 is... still 36.
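A short sketch of the byte-to-int idiom described above. The class and helper names are illustrative; Byte.toUnsignedInt is a standard method from Java 8 onward that does the same thing.

```java
final class UnsignedByteDemo {
    // Widen a byte to int without sign extension.
    static int toUnsigned(byte y) {
        return y & 0xFF;
    }

    public static void main(String[] args) {
        byte y = (byte) 0xFF;                        // bit pattern 1111 1111
        System.out.println(y);                       // -1
        System.out.println(toUnsigned(y));           // 255
        System.out.println(Byte.toUnsignedInt(y));   // 255 (Java 8+)
        System.out.println((byte) (36 & 0xFF));      // 36: masking before a byte cast changes nothing
    }
}
```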

To compute x & y where the two operands are bytes, they must first be promoted to int values. There is no & between bytes. The result is therefore of type int.
That is, what you wrote is effectively evaluated as if you'd written it as the following, making explicit what the language gives you implicitly:
byte b1 = ((int) (byte) x) & ((int) (byte) 0xff);
Just do the arithmetic and then cast the result to byte.
byte b1 = (byte)(x & 0xff);
Link to Java Language Specification
Edited to add, thanks to @rzwitserloot: masking a byte value with 0xff is pointless, however. If you need the assignment from an integer to a byte, just write the cast:
byte b1 = (byte)x;

Get the same shift left in Python as Java

Specifically I want to take this number:
x = 1452610545672622396
and perform
x ^= (x << 21) // In Python I do x ^= (x << 21) & 0xffffffffffffffff
I want to get: -6403331237455490756, which is what I get in Java
instead of: 12043412836254060860, which is what I get in Python (which is what I don't want)
EDIT: In Java I do:
long x = 1452610545672622396L;
x ^= (x << 21);

You can get a 64-bit signed int like Java's by using ctypes.c_longlong. See the example below:
from ctypes import c_longlong as ll
x = 1452610545672622396
output = ll(x^(x<<21))
print(output)
print(output.__class__)

The difference comes from overflow: a Java long is 64 bits wide, while a Python int has no size limit. Note that Java's Long wrapper class has the same 64-bit range as long; the Java type without a fixed size limit is BigInteger.
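To see that the two numbers in the question are the same 64 bits, here is a sketch of the Java side (the class name is made up). The signed result is what Java prints, and its unsigned reading is the value Python produces after masking with 0xffffffffffffffff. Going the other way, a masked value of 2^63 or more becomes the signed value by subtracting 2^64.

```java
final class XorShiftStep {
    // One step as in the question; the left shift silently wraps to 64 bits.
    static long step(long x) {
        return x ^ (x << 21);
    }

    public static void main(String[] args) {
        long x = step(1452610545672622396L);
        System.out.println(x);                         // -6403331237455490756
        System.out.println(Long.toUnsignedString(x));  // 12043412836254060860
    }
}
```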

Sign extension, bit shifting in JAVA. Help understanding a C-code bit

I have the following C-code (from FFMPEG):
static inline av_const int sign_extend(int val, unsigned bits)
{
unsigned shift = 8 * sizeof(int) - bits;
union { unsigned u; int s; } v = { (unsigned) val << shift };
return v.s >> shift;
}
I'm trying to reproduce this in JAVA. But I have difficulties understanding this. No matter how I toss the bits around, I don't get very close.
As for the value parameter: it takes unsigned byte value as int.
Bits parameter: 4
If the value is 255 and bits is 4. It returns -1. I can't reproduce this in JAVA. Sorry for such fuzzy question. But can you help me understand this code?
The big picture is that I'm trying to encode EA ADPCM audio in JAVA. In FFMPEG:
https://gitorious.org/ffmpeg/ffmpeg/source/c60caa5769b89ab7dc4aa41a21f87d5ee177bd30:libavcodec/adpcm.c#L981

Strictly speaking, the result of running this code with this input data is implementation-defined, because signed right shift in C is only fully defined in circumstances that this scenario does not meet. From the C99 standard:
The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has unsigned type or if E1 has signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
(Emphasis mine)
But let's assume that your implementation defines signed right shift to extend the sign, meaning that the space on the left will be filled with ones if the sign bit is set and zeroes otherwise; the ffmpeg code clearly expects this to be the case. The following is happening: shift has the value of 28 (assuming 32-bit integers). In binary notation:
00000000 00000000 00000000 11111111 = val
11110000 00000000 00000000 00000000 = (unsigned) val << shift
Note that when interpreting (unsigned) val << shift as signed integer, as the code proceeds to do (assuming two's complement representation, as today's computers all use [1]), the sign bit of that integer is set, so a signed shift to the right fills up with ones from the left, and we get
11110000 00000000 00000000 00000000 = v.s
11111111 11111111 11111111 11111111 = v.s >> shift
...and in two's complement representation, that is -1.
In Java, this trick works the same way -- except better, because there the behavior is actually guaranteed. Simply:
public static int sign_extend(int val, int bits) {
int shift = 32 - bits; // int always has 32 bits in Java
int s = val << shift;
return s >> shift;
}
Or, if you prefer:
public static int sign_extend(int val, int bits) {
int shift = 32 - bits;
return val << shift >> shift;
}
[1] Strictly speaking, this conversion does not have a well-defined value in the C standard either, for historical reasons. There used to be computers that used different representations, and the same bit pattern with a set sign bit has a completely different meaning in (for example) signed magnitude representation.
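A runnable sketch of the walkthrough above, using the input from the question (255 with 4 bits) plus two further 4-bit values. The class name is illustrative; the method body is the shorter of the two Java versions.

```java
final class SignExtendDemo {
    static int signExtend(int val, int bits) {
        int shift = 32 - bits;          // int always has 32 bits in Java
        return val << shift >> shift;   // >> copies the sign bit back down
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(255 << 28)); // 11110000000000000000000000000000
        System.out.println(signExtend(255, 4));  // -1
        System.out.println(signExtend(7, 4));    // 7
        System.out.println(signExtend(8, 4));    // -8
    }
}
```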

The reason why the code looks so odd is that C language is full of undefined behaviours that in Java are well-defined. For example in C bit-shifting a signed integer left so that the sign-bit changes is undefined behaviour and at that point the program can do anything - whatever the compiler causes the program to do - crash, print 42, make true = false, anything can happen, and the compiler still compiled it correctly.
Now the code uses a trick to shift the integer left: it uses a union that lays the bytes of its members on top of each other, making an unsigned and a signed integer occupy the same bytes. The left shift is well defined for the unsigned integer, so the code does the shift on that, then shifts back using the signed shift. (The code assumes that the right shift of a negative number produces a properly sign-extended negative number, which is also not guaranteed by the standard, but these kinds of libraries usually have a configuration utility that can refuse compilation on such an esoteric platform. Likewise, this program assumes that CHAR_BIT is 8, whereas C only guarantees that a char is at least 8 bits wide.)
In Java, you do not need anything like a union to accomplish this; instead you do:
static int signExtend(int val, int bits) {
int shift = 32 - bits; // fixed size
int v = val << shift;
return v >> shift;
}
In Java the width of an int is always 32 bits; << can be used for both signed and unsigned shift; and there is no undefined behaviour for extending to the sign bit; >> can be used for signed shift (>>> would be unsigned).
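Since the question mentions 4-bit ADPCM values, here is a hedged sketch of how signExtend might be applied to the two nibbles of one byte. The nibbles helper and the high-then-low ordering are assumptions for illustration only, not FFmpeg's actual decoding layout.

```java
final class NibbleDecode {
    static int signExtend(int val, int bits) {
        int shift = 32 - bits;
        return (val << shift) >> shift;
    }

    // Hypothetical helper: split one unsigned byte value (0..255) into two signed 4-bit samples.
    static int[] nibbles(int unsignedByte) {
        int high = signExtend(unsignedByte >> 4, 4);
        int low = signExtend(unsignedByte, 4); // upper bits are shifted out, so no mask is needed
        return new int[] { high, low };
    }

    public static void main(String[] args) {
        int[] n = nibbles(0xF7);
        System.out.println(n[0] + ", " + n[1]); // -1, 7
    }
}
```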

given this code:
static inline av_const int sign_extend(int val, unsigned bits)
{
unsigned shift = 8 * sizeof(int) - bits;
union { unsigned u; int s; } v = { (unsigned) val << shift };
return v.s >> shift;
}
The 'static' modifier says the function is not visible outside the current file.
The 'inline' modifier is a request to the compiler to place the code inline wherever the function is called, rather than having a separate function with the associated call/return code sequences.
'sign_extend' is the name of the function.
In C, a right shift of a negative signed value is implementation-defined, but on common compilers it propagates the sign bit.
In C, a right shift of an unsigned value will zero fill.
It looks like your Java is doing the zero fill.
regarding this line:
unsigned shift = 8 * sizeof(int) - bits;
on a 32bit machine, an integer is 32 bits and size of int is 4
so the variable 'shift' will contain (8*4)-bits
regarding this line:
union { unsigned u; int s; } v = { (unsigned) val << shift };
left shift of unsigned will shift the bits left,
with the upper bits being dropped into the bit bucket
and the lower bits being zero filled.
regarding this line:
return v.s >> shift;
this shifts the bits back to their original position,
while propagating the (new) sign bit
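If a Java port is indeed zero filling, the likely cause is >>> where >> was intended. A small sketch contrasting the two (the class and method names are made up):

```java
final class ShiftFillDemo {
    // Shift back with >>: the new sign bit is propagated.
    static int signFill(int val, int bits) {
        int shift = 32 - bits;
        return (val << shift) >> shift;
    }

    // Shift back with >>>: zeros are shifted in, so the value is only masked.
    static int zeroFill(int val, int bits) {
        int shift = 32 - bits;
        return (val << shift) >>> shift;
    }

    public static void main(String[] args) {
        System.out.println(signFill(255, 4)); // -1
        System.out.println(zeroFill(255, 4)); // 15
    }
}
```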

What is the equivalent of unsigned long in java

i have written these following three functions for my project to work:
WORD shuffling(WORD x)
{
// WORD - 4 bytes - 32 bits
//given input - a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15- b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15
//output required - a0,b0,a1,b1,a2,b2,a3,b3,a4,b4,a5,b5,a6,b6,a7,b7 - a8,b8,a9,b9,a10,b10,a11,b11,a12,b12,a13,b13,a14,b14,a15,b15
x = (x & 0X0000FF00) << 8 | (x >> 8) & 0X0000FF00 | x & 0XFF0000FF;
x = (x & 0X00F000F0) << 4 | (x >> 4) & 0X00F000F0 | x & 0XF00FF00F;
x = (x & 0X0C0C0C0C) << 2 | (x >> 2) & 0X0C0C0C0C | x & 0XC3C3C3C3;
x = (x & 0X22222222) << 1 | (x >> 1) & 0X22222222 | x & 0X99999999;
return x;
}
WORD t_function(WORD n)
{
WORD t_result=0;
WORD64 var = 2*((n*n)& 0xFFFFFFFF)+n; // n*n reduced mod 2^32 becomes a 32-bit word
t_result = (WORD) ((var)& 0xFFFFFFFF);
return t_result;
}
WORD lfsr(WORD t_result)
{
WORD returnValue = t_result;
WORD flag = 0;
flag = returnValue & 0x80000000; // Checking if MSB is 1 or 0
// Left shift the input
returnValue = returnValue << 1;
// If MSB is 1 then XOR the result with the primitive polynomial
if(flag > 0)
{
returnValue = returnValue ^ 0x4C11DB7;
}
return returnValue;
}
WORD is unsigned long.
This code is in C, and now I have to implement it in Java. Everything compiles and runs fine, but where the C code uses unsigned long, I have used int in Java, since I am operating on 32 bits at a time. The problem is that whenever a result goes out of the range of int, the Java output deviates from the output of the C code. Is there any way to represent the unsigned long range of values in Java?

Update – Java 8 can treat signed int & long as if unsigned
In Java, the primitive integer data types (byte, short, int, and long) are signed (positive or negative).
As of Java 8 both int and long can be treated explicitly as if they are unsigned. Officially a feature now, but kind of a hack nonetheless. Some may find it useful in certain limited circumstances. See the Java Tutorial.
int: By default, the int data type is a 32-bit signed two's complement integer, which has a minimum value of -2³¹ and a maximum value of 2³¹-1. In Java SE 8 and later, you can use the int data type to represent an unsigned 32-bit integer, which has a minimum value of 0 and a maximum value of 2³²-1. Use the Integer class to use int data type as an unsigned integer. See the section The Number Classes for more information. Static methods like compareUnsigned, divideUnsigned etc have been added to the Integer class to support the arithmetic operations for unsigned integers.
long: The long data type is a 64-bit two's complement integer. The signed long has a minimum value of -2⁶³ and a maximum value of 2⁶³-1. In Java SE 8 and later, you can use the long data type to represent an unsigned 64-bit long, which has a minimum value of 0 and a maximum value of 2⁶⁴-1. Use this data type when you need a range of values wider than those provided by int. The Long class also contains methods like compareUnsigned, divideUnsigned etc to support arithmetic operations for unsigned long.
I am not necessarily recommending this approach. I’m merely making you aware of the option.
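A brief sketch of the Integer helpers mentioned above, applied to the all-ones bit pattern. Only the class name is invented; the static methods are the standard Java 8 ones.

```java
final class UnsignedHelpers {
    public static void main(String[] args) {
        int allOnes = 0xFFFFFFFF; // -1 when read as signed, 4294967295 when read as unsigned

        System.out.println(Integer.toUnsignedString(allOnes));        // 4294967295
        System.out.println(Integer.toUnsignedLong(allOnes));          // 4294967295
        System.out.println(Integer.compareUnsigned(allOnes, 1) > 0);  // true
        System.out.println(Integer.divideUnsigned(allOnes, 2));       // 2147483647
        System.out.println(Integer.parseUnsignedInt("4294967295"));   // -1
    }
}
```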

Short answer: Java has no unsigned integer types, apart from the 16-bit char. long in C is 32-bit on 32-bit systems (and on 64-bit Windows), but Java's long is always 64-bit, so you can use that as a replacement (at least it would solve the overflow problem). If you need even wider integers, use the BigInteger class.

Look over Java's Primitive Data Types. If you need something bigger than a long, try a BigInteger.
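For the three functions in the question, plain int may already be enough. The sketch below assumes WORD is meant to be a 32-bit quantity; the class and method names are adapted for illustration. Addition, multiplication, left shift, and the bitwise operators produce the same 32 bits whether the value is read as signed or unsigned. What differs is right shift (use >>>), comparison, division, and printing. In particular, the flag > 0 test in lfsr fails for int, because the masked top bit makes the value negative.

```java
final class WordOps {
    // Bit interleave from the question; int holds the same 32 bits as the C WORD.
    static int shuffling(int x) {
        x = (x & 0x0000FF00) << 8 | (x >>> 8) & 0x0000FF00 | x & 0xFF0000FF;
        x = (x & 0x00F000F0) << 4 | (x >>> 4) & 0x00F000F0 | x & 0xF00FF00F;
        x = (x & 0x0C0C0C0C) << 2 | (x >>> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
        x = (x & 0x22222222) << 1 | (x >>> 1) & 0x22222222 | x & 0x99999999;
        return x;
    }

    // 2*n*n + n; int arithmetic already wraps modulo 2^32.
    static int tFunction(int n) {
        return 2 * (n * n) + n;
    }

    static int lfsr(int value) {
        boolean msbSet = (value & 0x80000000) != 0; // "> 0" would fail: the masked value is negative
        int result = value << 1;
        return msbSet ? result ^ 0x04C11DB7 : result;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(shuffling(0xFFFF0000)));      // aaaaaaaa
        System.out.println(Integer.toHexString(lfsr(0x80000000)));           // 4c11db7
        System.out.println(Integer.toUnsignedString(tFunction(0xFFFFFFFF))); // 1
    }
}
```

When comparing against the C output, print with Integer.toUnsignedString or Integer.toHexString so that both sides show the same representation.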

Converting Little Endian to Big Endian

All,
I have been practicing coding problems online. Currently I am working on a problem statement where we need to convert between big endian and little endian. But I am not able to jot down the steps considering the example given as:
123456789 converts to 365779719
The logic I am considering is :
1 > Get the integer value (Since I am on Windows x86, the input is Little endian)
2 > Generate the hex representation of the same.
3 > Reverse the representation and generate the big endian integer value
But I am obviously missing something here.
Can anyone please guide me. I am coding in Java 1.5

Since a great part of writing software is about reusing existing solutions, the first thing should always be a look into the documentation for your language/library.
reverse = Integer.reverseBytes(x);
I don't know how efficient this function is, but for toggling lots of numbers, a ByteBuffer should offer decent performance.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
...
int[] myArray = aFountOfIntegers();
ByteBuffer buffer = ByteBuffer.allocate(myArray.length*Integer.BYTES);
buffer.order(ByteOrder.LITTLE_ENDIAN);
for (int x:myArray) buffer.putInt(x);
buffer.order(ByteOrder.BIG_ENDIAN);
buffer.rewind();
for (int i = 0; i < myArray.length; i++) myArray[i] = buffer.getInt(); // relative get reads the next int
As eversor pointed out in the comments, ByteBuffer.putInt() is documented as an optional operation: it throws ReadOnlyBufferException on a read-only buffer, although a buffer obtained from ByteBuffer.allocate() is writable.
The DIY Approach
Stacker's answer is pretty neat, but it is possible to improve upon it.
reversed = (i&0xff)<<24 | (i&0xff00)<<8 | (i&0xff0000)>>8 | (i>>24)&0xff;
We can get rid of the parentheses by adapting the bitmasks. E. g., (a & 0xFF)<<8 is equivalent to a<<8 & 0xFF00. The rightmost parentheses were not necessary anyway.
reversed = i<<24 & 0xff000000 | i<<8 & 0xff0000 | i>>8 & 0xff00 | i>>24 & 0xff;
Since the left shift shifts in zero bits, the first mask is redundant. We can get rid of the rightmost mask by using the logical shift operator, which shifts in only zero bits.
reversed = i<<24 | i>>8 & 0xff00 | i<<8 & 0xff0000 | i>>>24;
See the Java Language Specification for operator precedence and the gritty details on shift operators.
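One way to convince yourself that the three expressions are equivalent is to compare them with Integer.reverseBytes on a few values, including ones with the top bit set. A sketch follows; the class and method names are made up.

```java
final class SwapVariants {
    static int original(int i) {
        return (i&0xff)<<24 | (i&0xff00)<<8 | (i&0xff0000)>>8 | (i>>24)&0xff;
    }

    static int shiftedMasks(int i) {
        return i<<24 & 0xff000000 | i<<8 & 0xff0000 | i>>8 & 0xff00 | i>>24 & 0xff;
    }

    static int fewestMasks(int i) {
        return i<<24 | i>>8 & 0xff00 | i<<8 & 0xff0000 | i>>>24;
    }

    // True when all three variants match the library method for this input.
    static boolean allAgree(int i) {
        int expected = Integer.reverseBytes(i);
        return original(i) == expected && shiftedMasks(i) == expected && fewestMasks(i) == expected;
    }

    public static void main(String[] args) {
        int[] samples = { 0x12345678, 0x80000001, 0xFFFFFFFF, 123456789, 0xCAFEBABE };
        for (int s : samples) {
            System.out.println(Integer.toHexString(s) + " -> " + allAgree(s));
        }
    }
}
```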

Check this out
int little2big(int i) {
return (i&0xff)<<24 | (i&0xff00)<<8 | (i&0xff0000)>>8 | (i>>24)&0xff;
}

The thing you need to realize is that endian swaps deal with the bytes that represent the integer. So the 4 byte number 27 looks like 0x0000001B. To convert that number, it needs to go to 0x1B000000... With your example, the hex representation of 123456789 is 0x075BCD15 which needs to go to 0x15CD5B07 or in decimal form 365779719.
The function Stacker posted is moving those bytes around by bit shifting them; more specifically, the statement i&0xff takes the lowest byte from i, the << 24 then moves it up 24 bits, so from positions 1-8 to 25-32. So on through each part of the expression.
For example code, take a look at this utility.
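The worked example above can be reproduced with a few lines. This is a sketch; the class name and the hex helper are illustrative.

```java
final class EndianExample {
    // Format an int as eight hex digits so the individual bytes are easy to read.
    static String hex(int v) {
        return String.format("0x%08X", v);
    }

    public static void main(String[] args) {
        int x = 123456789;
        int swapped = Integer.reverseBytes(x);

        System.out.println(hex(x) + " -> " + hex(swapped)); // 0x075BCD15 -> 0x15CD5B07
        System.out.println(swapped);                        // 365779719

        // The first term of the shift expression: lowest byte moved to the top.
        System.out.println(hex((x & 0xff) << 24));          // 0x15000000
    }
}
```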

Java primitive wrapper classes support byte reversing since 1.5 using reverseBytes method.
Short.reverseBytes(short i)
Integer.reverseBytes(int i)
Long.reverseBytes(long i)
Just a contribution for those who are looking for this answer in 2018.

I think this can also help:
int littleToBig(int i)
{
int b0,b1,b2,b3;
b0 = (i&0x000000ff)>>0;
b1 = (i&0x0000ff00)>>8;
b2 = (i&0x00ff0000)>>16;
b3 = (i&0xff000000)>>>24; // logical shift, so a set top bit is not sign-extended
return ((b0<<24)|(b1<<16)|(b2<<8)|(b3<<0));
}

Just use the static method reverseBytes(int i) from the Integer wrapper class:
Integer i=Integer.reverseBytes(123456789);
System.out.println(i);
output:
365779719

The following method reverses the order of bits in a byte value:
public static byte reverseBitOrder(byte b) {
int converted = 0x00;
converted ^= (b & 0b1000_0000) >> 7;
converted ^= (b & 0b0100_0000) >> 5;
converted ^= (b & 0b0010_0000) >> 3;
converted ^= (b & 0b0001_0000) >> 1;
converted ^= (b & 0b0000_1000) << 1;
converted ^= (b & 0b0000_0100) << 3;
converted ^= (b & 0b0000_0010) << 5;
converted ^= (b & 0b0000_0001) << 7;
return (byte) (converted & 0xFF);
}
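As a cross-check, the same result can be obtained from the standard Integer.reverse method, which reverses all 32 bits. The sketch below repeats the method above and compares the two over every byte value; viaInteger and agreeOnAllBytes are names invented for this example.

```java
final class BitReverseCheck {
    static byte reverseBitOrder(byte b) {
        int converted = 0x00;
        converted ^= (b & 0b1000_0000) >> 7;
        converted ^= (b & 0b0100_0000) >> 5;
        converted ^= (b & 0b0010_0000) >> 3;
        converted ^= (b & 0b0001_0000) >> 1;
        converted ^= (b & 0b0000_1000) << 1;
        converted ^= (b & 0b0000_0100) << 3;
        converted ^= (b & 0b0000_0010) << 5;
        converted ^= (b & 0b0000_0001) << 7;
        return (byte) (converted & 0xFF);
    }

    // Reverse all 32 bits, then keep the top 8, which hold the reversed low byte.
    static byte viaInteger(byte b) {
        return (byte) (Integer.reverse(b) >>> 24);
    }

    static boolean agreeOnAllBytes() {
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            if (reverseBitOrder((byte) v) != viaInteger((byte) v)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(reverseBitOrder((byte) 0b0000_0001)); // -128, i.e. bit pattern 1000 0000
        System.out.println(agreeOnAllBytes());                   // true
    }
}
```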
