I understand that using a short in Java we can store a minimum value of -32,768 and a maximum value of 32,767 (inclusive).
And using an int we can store a minimum value of -2^31 and a maximum value of 2^31 - 1.
Question: if I have an int[] numbers and the values I need to store are positive and at most 10 million, is it possible to somehow store these numbers without having to use 4 bytes for each? I am wondering if, for a specific "small" range, there might be some "hack/trick" so that I could use less memory than numbers.length * 4.
You could attempt to use a smaller number of bits by using masking or bit-operations to represent each number, and then perform a sign-extension later on if you wish to get the full number of bits. This kind of operation is done on a system-architecture level in nearly all computer systems today.
It may help you to research 2's Complement, which seems to be what you are going for... And possibly Sign Extension for good measure.
Typically, in high-level languages an int is represented by the basic size of the processor register, e.g. 8, 16, 32, or 64 bits.
If you use a 2's-Complement method, you could easily account for the full spectrum of positive and negative numbers if needed. This is also very easy on the hardware, because you only have to invert all the bits and then add 1, which may prove to give you a big performance increase over other possible methods.
How 2's complement works:
To get -N, invert all bits of N and then add 1. That is, take the 1's complement of N and add 1 to it.
For example, with 8-bit words:
9 = 00001001
-9 = 11110111 (11110110 + 1)
This is easy and efficient to do in hardware (invert, then +1).
An n-bit word can be used to represent numbers from -2^(n-1) to +(2^(n-1) - 1).
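To see this in Java (a small check of my own, not part of the original answer), inverting the bits of a value and adding 1 really does negate it:
int n = 9;
int negated = ~n + 1;               // invert all bits, then add 1
System.out.println(negated);        // prints -9
System.out.println(negated == -n);  // prints true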
UPDATE: Bit-operations to represent larger numbers.
If you are trying to represent a larger number, say 1,000,000 as in your comment, you can use a bitwise left-shift operation to scale your current number up by the appropriate power of 2.
9 (base 10): 00000000000000000000000000001001 (base 2)
--------------------------------
9 << 2 (base 10): 00000000000000000000000000100100 (base 2) = 36 (base 10)
You could also try the zero-fill right shift operator (>>>):
This operator shifts the first operand the specified number of bits to the right. Excess bits shifted off to the right are discarded. Zero bits are shifted in from the left. The sign bit becomes 0, so the result is always non-negative.
For non-negative numbers, zero-fill right shift and sign-propagating right shift yield the same result. For example, 9 >>> 2 yields 2, the same as 9 >> 2:
9 (base 10): 00000000000000000000000000001001 (base 2)
--------------------------------
9 >>> 2 (base 10): 00000000000000000000000000000010 (base 2) = 2 (base 10)
However, this is not the case for negative numbers. For example, -9 >>> 2 yields 1073741821, which is different than -9 >> 2 (which yields -3):
-9 (base 10): 11111111111111111111111111110111 (base 2)
--------------------------------
-9 >>> 2 (base 10): 00111111111111111111111111111101 (base 2) = 1073741821 (base 10)
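As a quick sanity check (my own snippet, not part of the original answer), the two right-shift operators really do differ on -9 exactly as shown above:
System.out.println(-9 >> 2);   // -3          (sign-propagating shift)
System.out.println(-9 >>> 2);  // 1073741821  (zero-fill shift)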
As others have stated in the comments, you could actually hamper your overall performance in the long-run if you are attempting to manipulate data that is not specifically word/double/etc-aligned. This is because your hardware will have to work a bit harder to try and piece together what you truly need.
Just another thought. One parameter is the range of numbers you have, but other properties can also help save storage. For example, when you know that each number is divisible by 8, you need not store the lower 3 bits, since you know they are always 0. (This is how the JVM stores "compressed" references.)
Or, to take another possible scenario: when you store prime numbers, all of them (except 2) will be odd. So there is no need to store the lowest bit, as it is always 1; of course you need to handle 2 separately. A similar trick is used in floating-point representations: since the first bit of the mantissa of a non-zero number is always 1, it is not stored at all, thus gaining 1 bit of precision.
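A hedged sketch of the "divisible by 8" trick (the method names are my own): since the low 3 bits are guaranteed to be zero, you can store the value shifted right by 3 and restore it by shifting left again.
static int compressMultipleOf8(int value) {
    return value >>> 3;   // drop the three bits that are always zero
}

static int decompressMultipleOf8(int stored) {
    return stored << 3;   // put the zero bits back
}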
One solution is to use bit manipulation and use a number of bits of your choosing to store a single number. Say you select 5 bits per number: you can then pack six such numbers into 4 bytes (32 bits). You need to pack and unpack the bits into an integer when operations need to be done.
You need to decide if you want to deal with negative numbers in which case you need to store a sign bit.
To make it easier to use, you need to create a class that will conceal the nitty-gritty details via get and store operations.
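For illustration, here is a minimal sketch of such a class (the name PackedIntArray and all details are my own, not a standard API). Values are stored in a fixed number of bits each inside a long[]; with 24 bits per value you can hold any positive number up to about 16.7 million, i.e. roughly 3 bytes per number instead of 4.
class PackedIntArray {
    private final long[] words;
    private final int bitsPerValue;
    private final long mask;

    PackedIntArray(int size, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        this.words = new long[(int) (((long) size * bitsPerValue + 63) / 64)];
    }

    void set(int index, int value) {
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6);        // which long the value starts in
        int offset = (int) (bitPos & 63);       // bit offset within that long
        words[word] = (words[word] & ~(mask << offset)) | ((value & mask) << offset);
        if (offset + bitsPerValue > 64) {       // value straddles two longs
            int spill = offset + bitsPerValue - 64;
            long high = (value & mask) >>> (bitsPerValue - spill);
            words[word + 1] = (words[word + 1] & ~((1L << spill) - 1)) | high;
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int word = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long result = words[word] >>> offset;
        if (offset + bitsPerValue > 64) {       // pull the remaining bits from the next long
            result |= words[word + 1] << (64 - offset);
        }
        return (int) (result & mask);
    }
}
A caller would then just do something like packed.set(i, 9_999_999) and packed.get(i), never touching the bit twiddling directly.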
In light of the questions about performance: as is often the case, we are trading space for performance or vice versa. Depending on the situation, various optimization techniques can be used to minimize the number of CPU cycles.
That said, is there a need for such optimization in the first place? If so, is it at the memory level or storage level? Could we use a generic mechanism such as compression to take care of this instead of using special techniques?
I saw this efficiently written code at Leetcode.com
public static boolean isPowerOfTwo(int n) {
return n>0 && ((n&(n-1))==0);
}
This works awesomely fine but I am not able to figure out the working of single '&' in the code.
Can someone take the effort to explain how the code works?
And by the same logic, what would be the code to determine if an integer is a power of 3?
The single & is a bitwise 'and' operator (as opposed to &&, which acts on booleans).
So when you use & on two integers, the result is the logical 'and' of their binary representations.
This code works because any power of 2 will be a 1 followed by some number of 0s in binary (eg, 4 is 100, 8 is 1000, etc). Any power of 2, less one, will just be all 1s (eg. 3 is 11, 7 is 111).
So, if you take a power of 2, and bitwise and it with itself minus 1, you should just have 0. However, anything other than a power of 2 would give a non-zero answer.
Example:
1000 = 8
0111 = 7 (8-1), and '&'ing these gives
0000 = 0
However, if you had something like 6 (which isn't a power of 2):
110 = 6
101 = 5 (6-1), and '&'ing these gives
100 = 4 (which isn't equal to 0, so the code would return false).
I hope that makes it clear!
The & in Java is the bitwise and operator. It takes two integers and performs an and operation on each pair of bits, producing an int where each bit is set to '1' if and only if that bit was '1' in both operands. The code uses the fact that any power of two in binary is a single '1' followed by some number of '0's, so subtracting one flips every one of those bits: the leading '1' becomes '0' and all the zeros below it become '1's. That means n and n-1 share no set bits, and their AND is 0. For any number that is not a power of two, there is at least one other '1' bit after the leading one, so the leading '1' survives the subtraction and n & (n-1) is non-zero. Because this is a trick specific to binary representation, it doesn't carry over to finding powers of other bases.
To understand how this function works you need to understand how binary numbers are represented. If you don't I suggest reading a tutorial such as Learn Binary (the easy way).
So say we have a number, 8, and we want to find out if it's a power of two. Let's convert it to binary first: 1000. Now let's look at 8-1 = 7's binary form: 0111. The & operator is for binary AND. When we apply the AND operator to 8 and 7 we get:
1000
0111
&----
=0000
Every integer which is a power of 2 is a 1 followed by a non-negative amount of zeroes. When you subtract 1 from that number you will always get a 0 followed by a sequence of 1s. Since applying the AND operation to those two numbers will always give you 0, you can always verify if it's a power of 2. If the number is not a power of 2, when you subtract 1 from it it won't invert all of its digits and the AND test will produce a positive number (fail).
It's a bitwise operator:
If we take 2 to the power 3, it equals 8, i.e. 2³ = 2×2×2 = 8.
now to calculate if 8 is a power of 2, it works like this:
n&(n-1) --> 8 AND (8-1) --> 1000 AND 0111 = 0000
thus it satisfies the condition --> (n&(n-1))==0
The single "&" performs a bitwise AND operation, meaning that in the result of A & B with A and B being integers only those bits will be set to 1 where both A and B have a 1.
For example, let's look at the number 16:
16 & (16 - 1) =
00010000 &
00001111 =
00000000
This works for powers of two because any power of two minus one will have all lower bits set. In other words, n bits can express 2^n different values including zero, so (2^n) - 1 is the highest value that can be expressed in n bits when they're all set.
I hope this helps.
Powers of three are a bit more problematic as our computers don't use ternary numbers. Basically a power of three is any ternary number that only has one digit different from zero and where that digit is a "1" just like in any other number system.
Off the top of my head, I can't come up with anything more elegant than repeatedly dividing by 3 until you either reach 1 as the division result (in which case you have a power of three) or hit a non-zero remainder (which means it's not a power of three).
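A hedged sketch of that repeated-division loop (the method name is my own choice):
public static boolean isPowerOfThree(int n) {
    if (n < 1) {
        return false;       // 0 and negative numbers are never powers of three
    }
    while (n % 3 == 0) {    // keep dividing while the remainder is zero
        n /= 3;
    }
    return n == 1;          // only a pure power of three reduces all the way to 1
}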
Maybe this can help as well: http://www.tutorialspoint.com/computer_logical_organization/number_system_conversion.htm
Any idea why shift distance for int in java is restricted to 31 bits (5 lower bits of the right hand operand)?
http://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19
x >>> n
I found a similar question, java Bit operations >>> shift, but nobody there pointed out the actual reason.
The shift distance is restricted to 31 bits because a Java int has 32 bits. Shifting an int by 32 or more bits would always produce the same value (either 0 or 0xFFFFFFFF, depending on the initial value and the shift operator you use).
It's a design decision, but it seems a bit unfortunate at least for some use cases. First some terminology: let's call the approach of defining as zero all shifts amounts larger than the number of bits in the shifted word the saturating approach, and the Java approach of using only the bottom 5 (or 6 for long) bits to define the shift amount as the mod approach.
You can look at the problem by listing the useful shift values - shift amounts that result in unique output values [1]. If you take >>>, the interesting values are 0 through 32 inclusive. 0 results in an unchanged value, and 32 results in 0. Shifting by more than 32 would again produce the same result as 32, sure - but Java doesn't even let you shift by 32: it stops at 31! A shift by 32 will, perhaps unexpectedly, leave your value unchanged.
In many uses of >>> a shift by 32 is not possible, or the Java behavior works. In other cases, however, the natural result is 32, and you must special case zero.
As to why they would choose that design? Well, it probably helped that the common PC hardware at the time (x86, just like today) implements shifts in exactly that way (using only the last 5 bits for 32-bit shifts, and the last 6 for 64-bit shifts). So the shifts can be directly mapped to hardware without any special cases, conditional moves or branches [2].
Furthermore, for hardware that doesn't implement those semantics by default, it is easy to get the Java semantics by a simple mask: shiftAmount & 0x1F. That's going to be fast on all hardware. The reverse mapping - implementing saturating shifts on hardware that doesn't support it is more complex: you may need a costly compare and branch, some bit twiddling hacks or predicated moves to handle the > 31 case.
Finally, the mod approach is quite natural for many algorithms. For example, if you are implementing a bitmap structure, addressable per-bit, a good implementation may be to have an array of integers, with each integer representing 32 bits. Internally to index into the Nth bit, you would break N down into two parts - the high 27 bits would find the word in the array the bit is in, and the low 5 bits would pick the bit out of the word. To pick the bit out of the word (e.g., to move it to the LSB), you might do:
int val = (word >>> (index & 0x1F)) & 1;
That sets val to 1 if the bit was set, 0 otherwise. However, because of the way the Java >>> operator was specified, you don't need the & 0x1F part at all, because it is already implied in the mod definition! So you can omit it, and indeed the JDK's BitSet uses exactly that trick.
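Here is a minimal sketch of that bitmap idea (my own illustration, not the actual JDK BitSet code). Neither shift by index needs an explicit & 0x1F, because Java's 32-bit shift operators already take the distance mod 32:
class SimpleBitmap {
    private final int[] words;

    SimpleBitmap(int nBits) {
        words = new int[(nBits + 31) >>> 5];    // 32 bits per word
    }

    void set(int index) {
        words[index >>> 5] |= 1 << index;       // low 5 bits of index pick the bit within the word
    }

    boolean get(int index) {
        return ((words[index >>> 5] >>> index) & 1) != 0;
    }
}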
[1] Granted, any value without a 1 in the MSB may not produce unique values under >>>, once all the 1s get shifted off, so let's just talk about any value with a leading one.
[2] For what it's worth, I checked ARM and the semantics are even weirder: for variable shifts, the bottom eight bits of the shift amount are used. So the shift is a weird hybrid - it is effectively a saturating shift once you exceed 31, but only up to 255, at which point it loops around and suddenly has non-zero values for the next 31 values, etc.
Why is the maximum capacity of a Java HashMap 1<<30 and not 1<<31, even though the max value of an int is 2^31 - 1? The maximum capacity is initialized as static final int MAXIMUM_CAPACITY = 1 << 30;
Java uses signed integers which means the first bit is used to store the sign of the number (positive/negative).
A four-byte integer has 32 bits, of which only 31 are available for the magnitude because of the sign bit. This limits the range of the number to -(2^31) through 2^31 - 1 (the positive side is one smaller due to the inclusion of 0).
While it would be possible for a hash map to handle quantities of items between 2^30 and 2^31-1 without having to use larger integer types, writing code which works correctly even near the upper limits of a language's integer types is difficult. Further, in a language which treats integers as an abstract algebraic ring that "wraps" on overflow, rather than as numbers which should either yield numerically-correct results or throw exceptions when they cannot do so, it may be hard to ensure that there aren't any cases where overflows would cause invalid operations to go undetected.
Specifying an upper limit of 2^30 or even 2^29, and ensuring correct behavior on things no larger than that, is often much easier than trying to ensure correct behavior all the way up to 2^31-1. Absent a particular reason to squeeze out every last bit of range, it's generally better to use the simpler approach.
By default, the int data type is a 32-bit signed two's-complement integer, which has a minimum value of -2^31 and a maximum value of (2^31) - 1; that is, it ranges from -2,147,483,648 to 2,147,483,647.
The first bit is reserved for the sign bit — it is 1 if the number is negative and 0 if it is positive.
1 << 30 is equal to 1,073,741,824
Its two's-complement binary representation is 01000000-00000000-00000000-00000000.
1 << 31 is equal to -2,147,483,648.
Its two's-complement binary representation is 10000000-00000000-00000000-00000000.
So the maximum size to which a HashMap can expand is 1,073,741,824 = 2^30.
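A quick demonstration of the difference (a snippet of my own, not from the original answer):
System.out.println(1 << 30);   // 1073741824  - a valid (positive) capacity
System.out.println(1 << 31);   // -2147483648 - overflows into the sign bit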
You are thinking of unsigned; with signed, the upper range is (2^31) - 1.
There are special formats (base-128) designed for transmitting integers, used in protobufs and elsewhere. They're advantageous when most of the integers are small (they need a single byte for the smallest numbers and may waste one byte for others).
I wonder if there's something similar for floating point numbers under the assumption that most of them are actually small integers?
To address the answer by Alice: I was thinking about something like
void putCompressedDouble(double x) {
    int n = (int) x;                  // truncate to int
    boolean fits = (n == x);          // true only if x is an integer value that fits in an int
    putBoolean(fits);                 // flag telling the reader which encoding follows
    if (fits) {
        putCompressedInt(n);          // small-integer (varint-style) encoding
    } else {
        putUncompressedLong(Double.doubleToLongBits(x));  // raw 8-byte IEEE 754 bits
    }
}
This works (except for the negative zero, which I really don't care about), but it's wasteful in case of fits == true.
It depends on the distribution of your numbers. Magnitude doesn't really matter that much, since it's expressed through the exponent field of a float. It's usually the mantissa that contributes the most "weight" in terms of storage.
If your floats are mainly integers, you may gain something by converting to int (via Float.floatToIntBits()) and checking how many trailing zeros there are (for small int values there should be up to 23 trailing zeros). When using a simple scheme to encode small ints, you may implement encoding floats simply as:
int raw = Float.floatToIntBits(f);   // raw IEEE 754 bit pattern
raw = Integer.reverse(raw);          // move the mantissa's trailing zeros to the top
encodeAsInt(raw);                    // feed into the small-integer encoder
(Decoding is simply reversing the process).
What this does is simply move the trailing zeros in the mantissa to the most significant bits of the int representation, which is friendly to encoding schemes devised for small integers.
Same can be applied to double<->long.
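Decoding would then look roughly like this (a sketch; decodeAsInt is the assumed counterpart of the encodeAsInt used above):
int raw = decodeAsInt();
raw = Integer.reverse(raw);           // undo the bit reversal
float f = Float.intBitsToFloat(raw);  // reinterpret the bits as the original float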
Probably not, and this is almost certainly not something you want.
As noted at this stack overflow post, floating point numbers are not stored in a platform independent way in protocol buffers; they are essentially bit for bit representations that are then cast using a union. This means float will take 4 bytes and double 8 bytes. This is almost certainly what you want.
Why? Floating-point numbers are not integers. The integers are well behaved: every bit pattern represents a valid number, and each one represents its integer exactly. Floating-point types cannot represent many important numbers exactly: most floats can't represent 0.1 exactly, for example. The problems of infinities, NaNs, etc. all make a compressed format a non-trivial task.
If you have small integers in a float, then convert them to small integers or some fixed-point format. For example, if you know you only have, say, 4 significant figures, you can convert from floating point to a fixed-point short, saving 2 bytes. Just make sure each end knows how to deal with this type, and you'll be golden.
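A hedged sketch of that fixed-point idea (SCALE and the method names are my own, not from the answer): with two decimal digits of precision, values up to roughly ±327 can be carried in a short.
static final int SCALE = 100;               // 2 decimal digits of precision

static short toFixedPoint(float f) {
    return (short) Math.round(f * SCALE);   // e.g. 12.34f -> 1234; caller must ensure it fits in a short
}

static float fromFixedPoint(short s) {
    return s / (float) SCALE;               // 1234 -> 12.34f
}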
But any operation that google could do to try and save space in this instance would be both reinventing the wheel and potentially dangerous. Which is probably why they try not to mess with floats at all.
I really like Durandal's solution. Despite its simplicity, it performs pretty well, at least for floats. For doubles, whose exponent is longer than one byte, some additional bit rearrangement might help. The following table gives the encoding length for numbers with up to D digits; negative numbers are also considered. In each column the first number gives the maximum bytes needed, while the parenthesized number is the average.
D AS_INT REV_FLOAT REV_DOUBLE BEST
1: 1 (1.0) 2 (1.8) 3 (2.2) 1 (1.0)
2: 2 (1.4) 3 (2.4) 3 (2.8) 2 (1.7)
3: 2 (1.9) 3 (2.9) 4 (3.2) 2 (2.0)
4: 3 (2.2) 4 (3.3) 4 (3.8) 3 (2.6)
5: 3 (2.9) 4 (3.9) 5 (4.1) 3 (3.0)
6: 3 (3.0) 5 (4.2) 5 (4.8) 4 (3.5)
7: 4 (3.9) 5 (4.8) 6 (5.1) 4 (3.9)
8: 4 (4.0) 5 (4.9) 6 (5.8) 5 (4.3)
9: 5 (4.9) 5 (4.9) 6 (6.0) 5 (4.9)
Four different methods were tested:
AS_INT: Simply convert the number to int. This is unusable but gives us a lower bound.
REV_FLOAT: The method by Durandal applied to floats.
REV_DOUBLE: The method by Durandal applied to doubles.
BEST: An improvement of my own method as described in the question. Rather complicated.
I was looking into 32-bit and 64-bit. I noticed that the range of integer values that can be stored in 32 bits is ±4,294,967,295, but the Java int is also 32-bit (if I am not mistaken) and it stores values only up to ±2,147,483,648. Same thing for long: it stores values from 0 to ±2^63, but 64-bit stores ±2^64 values. How come these values are different?
Integers in Java are signed, so one bit is reserved to represent whether the number is positive or negative. The representation is called "two's complement notation." With this approach, the maximum positive value represented by n bits is given by
(2 ^ (n - 1)) - 1
and the corresponding minimum negative value is given by
-(2 ^ (n - 1))
The "off-by-one" aspect to the positive and negative bounds is due to zero. Zero takes up a slot, leaving an even number of negative numbers and an odd number of positive numbers. If you picture the represented values as marks on a circle—like hours on a clock face—you'll see that zero belongs more to the positive range than the negative range. In other words, if you count zero as sort of positive, you'll find more symmetry in the positive and negative value ranges.
To learn this representation, start small. Take, say, three bits and write out all the numbers that can be represented:
0
1
2
3
-4
-3
-2
-1
Can you write the three-bit sequence that defines each of those numbers? Once you understand how to do that, try it with one more bit. From there, you can imagine how it extends up to 32 or 64 bits.
That sequence forms a "wheel," where each value is formed by adding one to the previous, with the noted wraparound from 3 to -4. That wraparound effect (which can also occur with subtraction) is called "modular arithmetic."
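Here is a small snippet (my own, not from the answer) that prints the three-bit patterns behind the list above, interpreting each pattern as a two's-complement value:
for (int bits = 0; bits < 8; bits++) {
    // patterns 100..111 have the sign bit set, so they stand for -4..-1
    int value = (bits & 0b100) != 0 ? bits - 8 : bits;
    String pattern = Integer.toBinaryString(bits | 0b1000).substring(1);  // zero-padded to 3 digits
    System.out.println(pattern + " -> " + value);
}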
In 32 bits you can store 2^32 values. Whether you call these values 0 to 4,294,967,295 or -2,147,483,648 to +2,147,483,647 is up to you. This difference is called "signed type" versus "unsigned type". The Java language supports only signed types for int. Other languages have distinct types for an unsigned 32-bit integer.
No language will have a 32-bit type for ±4,294,967,295, because the "-" part would require another bit.
That's because Java ints are signed, so you need one bit for the sign.