I am working on implementing some Bloom filter variants, and a very useful data structure for this would be a compact multi-bit array; that is, an array where each element is a compact integer of around 4 bits.
Space efficiency is of the utmost importance here, so while a plain integer array would give me the functionality I want, it would be bulkier than necessary.
Before I try to implement this functionality myself with bit arithmetic, I was wondering if anyone knows of a library out there that already provides such a data structure.
Edit: Static size is fine.
The ideal case would be an implementation that is flexible with regard to the number of bits per cell. That might be a bit much to hope for though (no pun intended?).
If you aren't modifying the array after creation, java.util.BitSet does all the bit masking for you, but it is slow to access, since you have to fetch each bit individually and reassemble the int from its 4 bits yourself.
Having said that, writing it yourself might be the best way to go. Doing the bit arithmetic yourself isn't that difficult, since there are only 2 values per byte: the high bits are (array[i] & 0xF0) >> 4 and the low bits are array[i] & 0x0F.
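A minimal sketch of the hand-rolled approach, assuming a fixed size and unsigned 4-bit values. The class name NibbleArray and its methods are made up for illustration, not taken from any library; even indexes use the low nibble and odd indexes the high nibble.

```java
/** Fixed-size array of unsigned 4-bit values (0..15), two per byte. Hypothetical helper. */
final class NibbleArray {
    private final byte[] data;
    private final int length;

    NibbleArray(int length) {
        this.length = length;
        this.data = new byte[(length + 1) / 2]; // two cells per byte, rounded up
    }

    int get(int index) {
        checkIndex(index);
        int b = data[index >> 1] & 0xFF;            // widen the byte to an unsigned int
        return (index & 1) == 0 ? (b & 0x0F)        // even index: low nibble
                                : (b & 0xF0) >> 4;  // odd index: high nibble
    }

    void set(int index, int value) {
        checkIndex(index);
        if (value < 0 || value > 15) {
            throw new IllegalArgumentException("value out of 4-bit range: " + value);
        }
        int b = data[index >> 1] & 0xFF;
        b = (index & 1) == 0 ? (b & 0xF0) | value          // keep high nibble, replace low
                             : (b & 0x0F) | (value << 4);  // keep low nibble, replace high
        data[index >> 1] = (byte) b;
    }

    int length() {
        return length;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

This costs half a byte per cell plus the array header, at the price of one shift and one mask per access.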
Take a look at the compressed BitSet provided by http://code.google.com/p/javaewah/. It allows you to set bits freely and uses compression to keep memory usage low.
I.e. something like
EWAHCompressedBitmap32 set = new EWAHCompressedBitmap32();
set.set(0);
set.set(1000000);
will still only occupy a few bytes, not roughly 125 KB (one bit for each of the 1,000,000 positions) as with the Java BitSet...
You should be able to map the 4-bit integer to the BitSet by multiplying the index into the BitSet accordingly.
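The index mapping could look roughly like this. To keep the sketch self-contained it uses java.util.BitSet as a stand-in rather than the compressed bitmap; the class name BitSetCells is made up, and the number of bits per cell is a constructor parameter, so it is not tied to 4 bits.

```java
import java.util.BitSet;

/** Cells of bitsPerCell bits each, stored in a plain BitSet. Hypothetical helper. */
final class BitSetCells {
    private final int bitsPerCell;
    private final BitSet bits = new BitSet();

    BitSetCells(int bitsPerCell) {
        if (bitsPerCell < 1 || bitsPerCell > 31) {
            throw new IllegalArgumentException("bitsPerCell must be in 1..31");
        }
        this.bitsPerCell = bitsPerCell;
    }

    int get(int index) {
        int base = index * bitsPerCell;      // first bit of this cell (may overflow for huge indexes)
        int value = 0;
        for (int i = 0; i < bitsPerCell; i++) {
            if (bits.get(base + i)) {
                value |= 1 << i;             // bit i of the cell, least significant first
            }
        }
        return value;
    }

    void set(int index, int value) {
        if (value < 0 || value >= (1 << bitsPerCell)) {
            throw new IllegalArgumentException("value does not fit in " + bitsPerCell + " bits");
        }
        int base = index * bitsPerCell;
        for (int i = 0; i < bitsPerCell; i++) {
            bits.set(base + i, (value & (1 << i)) != 0);
        }
    }
}
```

Every read touches bitsPerCell individual bits, which is the access cost mentioned earlier in the thread.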
Related
How would one efficiently compare the bits of two equally sized byte[] arrays in Java? BitSet might be used by constructing them from the given arrays; however, this approach is not as efficient as, for example, shifting through the arrays using bit manipulation and bitmasks. What would such an implementation look like?
BitSet might be used by constructing them from the given arrays; however, this approach is not as efficient as, for example, shifting through the arrays using bit manipulation and bitmasks.
BitSet actually uses bit manipulation and bit masks under the hood.
If you still want to use only byte[] (e.g. to avoid the overhead of constructing a BitSet from a byte[]), take a look at the BitSet#get(int) implementation and compare the arrays in chunks:
Create a long value by reading 64 bit chunk from first array;
Create another long value by reading 64 bit chunk from second array;
If XOR of both longs is not 0, these chunks had some value that was different;
If you are done with entire length of array and all XORs were 0, the arrays were same.
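The chunked comparison could be sketched as follows. The class and method names are made up; reading the 64-bit chunks through java.nio.ByteBuffer is one possible way to do it, and the trailing bytes that do not fill a whole long are compared one by one.

```java
import java.nio.ByteBuffer;

final class ByteArrayCompare {
    /** Returns true if both arrays have the same length and identical bits. */
    static boolean sameBits(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        // Compare 8 bytes at a time: a non-zero XOR means some bit differs.
        while (x.remaining() >= Long.BYTES) {
            if ((x.getLong() ^ y.getLong()) != 0L) {
                return false;
            }
        }
        // Compare the remaining 0..7 bytes individually.
        while (x.hasRemaining()) {
            if ((x.get() ^ y.get()) != 0) {
                return false;
            }
        }
        return true;
    }
}
```

If only equality is needed, java.util.Arrays.equals(a, b) gives the same answer; the XOR form is mainly useful when you also want to know which bits differ.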
Actually, my question is very similar to this one, but that post focuses on C# only. Recently I read an article that said Java will 'promote' some small types (like short) to 4 bytes in memory even if some bits are not used, so using them can't reduce memory usage. (Is that true?)
So my question is how languages, especially C, C++ and Java (as Manish said in this post, which talked about Java), handle memory allocation of small data types. References, or any approaches to figure it out, are preferred. Thanks.
C/C++ uses only the specified amount of memory but aligns the data (by default) to an address that is a multiple of some value, typically 4 bytes for 32 bit applications or 8 bytes for 64 bit.
So, for example, if the data is aligned on a 4 or 8 byte boundary, a "char" still uses only one byte, and an array of 5 chars will use 5 bytes. But a data item with 4-byte alignment that is allocated after the 5-byte char array is placed at an address that skips 3 bytes to keep it correctly aligned.
This is for performance on most processors. There are usually pragmas like "pack" and "align" that can be used to change the alignment or disable it.
In C and C++, different approaches may be taken depending on how you've requested the memory.
For T* p = (T*)malloc(n * sizeof(T)); or T* p = new T[n]; then the data will occupy sizeof(T)*n bytes of memory, so if sizeof(T) is reduced (e.g. to int16_t instead of int32_t) then that space is reduced accordingly. That said, heap allocations tend to have some overheads, so few large allocations are better than a great many allocations for individual data items or very small arrays, where the overheads may be much more significant than small differences in sizeof(T).
For structures, static and stack usage, padding is more significant than for large arrays, as the following data item might be of a different type with different alignment requirements, resulting in more padding.
At the other extreme, you can apply bitfields to effectively pack values into the minimum number of bits they need. This is very dense packing indeed, though you need to rely on compiler pragmas/attributes if you want explicit control: the Standard leaves it unspecified when a bitfield might start in a new memory "word" (e.g. a 32-bit memory word for a 32-bit process, 64 for 64) or wrap across separate words, where in the words the bits hold data vs. padding, etc. Data types like C++ bitsets and vector<bool> may be more efficient than arrays of bool (which may well use an int for each element, but it's unspecified in the C++03 Standard).
I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
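The poor man's solution could be sketched like this. RadixPackedArray is a made-up name; the sketch keeps each long non-negative, so it uses 63 bits, which still gives 27 base-5 digits per long, and it is slightly conservative for some radices (62 digits instead of 63 for radix 2).

```java
/** Fixed-size sequence of digits in a given radix, packed into longs. Hypothetical helper. */
final class RadixPackedArray {
    private final int radix;
    private final int length;
    private final long[] powers; // powers[i] == radix^i, one entry per digit in a long
    private final long[] words;

    RadixPackedArray(int radix, int length) {
        if (radix < 2 || length < 0) {
            throw new IllegalArgumentException("radix must be >= 2 and length >= 0");
        }
        this.radix = radix;
        this.length = length;
        // Count how many digits fit while radix^digits stays within a positive long.
        int digits = 0;
        long p = 1;
        while (p <= Long.MAX_VALUE / radix) {
            p *= radix;
            digits++;
        }
        powers = new long[digits];
        powers[0] = 1;
        for (int i = 1; i < digits; i++) {
            powers[i] = powers[i - 1] * radix;
        }
        words = new long[(length + digits - 1) / digits];
    }

    int digitsPerLong() {
        return powers.length;
    }

    int get(int index) {
        checkIndex(index);
        long scale = powers[index % powers.length];
        return (int) ((words[index / powers.length] / scale) % radix);
    }

    void set(int index, int value) {
        checkIndex(index);
        if (value < 0 || value >= radix) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        int w = index / powers.length;
        long scale = powers[index % powers.length];
        long old = (words[w] / scale) % radix;
        words[w] += (value - old) * scale; // swap the old digit for the new one
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

A List<MyEnum> wrapper would then store ordinal() values in such an array and map them back through MyEnum.values(). Each access costs a division and a modulo, so this trades speed for space.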
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be: why do you want to save a few bytes in enum references? Other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.
One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
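A minimal sketch of the first option, one byte per element. The Color enum and the EnumBytes class are invented for the example, and the sketch assumes the enum has at most 256 constants.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example enum, used only for illustration. */
enum Color { RED, GREEN, BLUE }

final class EnumBytes {
    /** Stores each element's ordinal() in one byte. */
    static <E extends Enum<E>> byte[] encode(List<E> values) {
        byte[] out = new byte[values.size()];
        for (int i = 0; i < out.length; i++) {
            int ordinal = values.get(i).ordinal();
            if (ordinal > 0xFF) {
                throw new IllegalArgumentException("enum has more than 256 constants");
            }
            out[i] = (byte) ordinal;
        }
        return out;
    }

    /** Maps the stored ordinals back to enum constants. */
    static <E extends Enum<E>> List<E> decode(byte[] data, Class<E> type) {
        E[] constants = type.getEnumConstants();
        List<E> out = new ArrayList<>(data.length);
        for (byte b : data) {
            out.add(constants[b & 0xFF]); // treat the byte as unsigned
        }
        return out;
    }
}
```

Note that the encoding depends on the declaration order of the constants, so reordering the enum invalidates previously stored data.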
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits.
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.
We have an interesting challenge. We have to control access to data that reside in "bins". There will be, potentially, hundreds of thousands of "bins". Access to each bin is controlled individually but the restrictions can, and probably will, overlap. We are thinking of assigning each bin a position in a bitmask (1,2,3,4, etc..).
Then when a user logs into the system, we look at his security attributes and determine which bins he's allowed to see. With that info we construct a bitmask for this user where the "set" bits correspond to the identifier of the bins he's allowed to see. So if he can see bins 1, 3 and 4, his bit mask would be 1101.
So when a user searches the data, we can look at the bin index of the returned row and see if that bit is set on his bitmask. If his bitmask has that bit set we let him see that row. We are planning for the bitmask to be stored as a BigInteger in Java.
My question is: Assuming the index number doesn't get bigger than Integer.MAX_VALUE, is a BigInteger bitmask going to scale for hundreds of thousands of bit positions? Would it take forever to run BigInteger.testBit(n) where n could be huge (e.g. 874,837)? Would it take forever to create such a BigInteger?
And secondly: If you have an alternative approach, I'd love to hear it.
BigInteger should be fast if you don't change it often.
A more obvious choice would be BitSet which is designed for this sort of thing. For looking up bits, I suspect the performance is similar. For creating/modifying it would be more efficient to use a BitSet.
Note: PaulG has commented the difference is "impressive" and BitSet is faster.
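For comparison, here is a rough sketch of building and querying a user mask with both classes. The class and method names are made up, and the bin number is used directly as the bit index. The sketch makes no timing claims; it only shows why modification differs: BitSet is updated in place, while every BigInteger.setBit call returns a new object.

```java
import java.math.BigInteger;
import java.util.BitSet;

final class BinAccess {
    /** Builds a mutable mask; each set() modifies the same BitSet. */
    static BitSet maskOf(int... allowedBins) {
        BitSet mask = new BitSet();
        for (int bin : allowedBins) {
            mask.set(bin);
        }
        return mask;
    }

    /** Builds an immutable mask; each setBit() allocates a new BigInteger. */
    static BigInteger bigMaskOf(int... allowedBins) {
        BigInteger mask = BigInteger.ZERO;
        for (int bin : allowedBins) {
            mask = mask.setBit(bin);
        }
        return mask;
    }
}
```

A lookup is then mask.get(874837) for the BitSet and bigMask.testBit(874837) for the BigInteger.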
Java has a more convenient class for this, called BitSet.
You do not need to check if the bit is set in a loop: you can make a mask, use a bitwise and, and see if the result is non-empty to decide on whether to grant or deny the access:
BitSet resourceAccessMask = ...
BitSet userAllowedAccessMask = ...
BitSet test = (BitSet)resourceAccessMask.clone();
test.and(userAllowedAccessMask);
if (!test.isEmpty()) {
    System.out.println("access granted");
} else {
    System.out.println("access denied");
}
We used this class in a similar situation in my prior company, and the performance was acceptable for our purposes.
You could define your own Java interface for this, initially using a Java BitSet to implement that interface.
If you run into performance issues, or if you require the use of long later on, you can always provide a different implementation (e.g. one that uses caching or similar improvements) without changing the rest of the code. Think carefully about the interface you require, and choose a long index just to be sure; you can always check whether it is out of bounds in the implementation later on (or simply return "no access" initially) for any index > Integer.MAX_VALUE.
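Such an interface might look roughly like this. AccessMask, BitSetAccessMask and their method names are invented for the sketch; the first implementation simply answers "no access" for anything outside the int range.

```java
import java.util.BitSet;

/** Hypothetical interface: callers use a long index and never see the backing store. */
interface AccessMask {
    boolean isAllowed(long binIndex);

    void allow(long binIndex);
}

/** First implementation, backed by a BitSet and therefore limited to int indexes. */
final class BitSetAccessMask implements AccessMask {
    private final BitSet bits = new BitSet();

    @Override
    public boolean isAllowed(long binIndex) {
        if (binIndex < 0 || binIndex > Integer.MAX_VALUE) {
            return false; // out of range: "no access"
        }
        return bits.get((int) binIndex);
    }

    @Override
    public void allow(long binIndex) {
        if (binIndex < 0 || binIndex > Integer.MAX_VALUE) {
            throw new IndexOutOfBoundsException("unsupported bin index: " + binIndex);
        }
        bits.set((int) binIndex);
    }
}
```

A later implementation (paged, cached, or compressed) can replace BitSetAccessMask without touching the callers.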
Using BigInteger is not such a good idea, as the class was not written for that particular purpose, and the only way of changing it is to create a completely new copy. BitSet is efficient regarding memory use; it uses an array of 64-bit longs internally (at the moment; this could of course change).
One thing worth considering (besides using BitSet) is using a different granularity: use a shorter bit set where each bit 'guards' multiple real bits. This way you would not need to have millions of bits per user in RAM.
A simple way to achieve this is to keep a smaller bit set, say with n/32 bits, and do something like this:
boolean isSet(int n) {
    // BitSet reads a bit with get(int); it has no isSet method
    return guardingBits.get(n / 32) && realBits.get(n);
}
This gives you a good chance of avoiding loading the real bits if those bits are mostly zero. You can modify this approach to match the expected bit set. If you expect almost all bits to be set, you can instead use the guarding bits to store a one when all the bits they guard are set, so you only need to check the real bits for blocks that might contain a zero.
Also, this might be only the beginning. Depending on the usage and requirements, you might want to use a B-tree or a paginated version where you only hold a fraction of the big bit field in memory.
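A slightly fuller sketch of the guarding idea for the mostly-zero case. GuardedBitSet is a made-up name, and both levels live in memory here; in a real system the real bits would presumably be loaded on demand, which the sketch does not show.

```java
import java.util.BitSet;

/** Two-level bit set: one guarding bit per block of 32 real bits. Hypothetical helper. */
final class GuardedBitSet {
    private static final int BLOCK = 32;
    private final BitSet guardingBits = new BitSet();
    private final BitSet realBits = new BitSet();

    void set(int n) {
        realBits.set(n);
        guardingBits.set(n / BLOCK); // the block now contains at least one set bit
    }

    void clear(int n) {
        realBits.clear(n);
        int start = (n / BLOCK) * BLOCK;
        int next = realBits.nextSetBit(start);
        if (next < 0 || next >= start + BLOCK) {
            guardingBits.clear(n / BLOCK); // no set bits remain in this block
        }
    }

    boolean isSet(int n) {
        // A cleared guarding bit answers the query without touching the real bits.
        return guardingBits.get(n / BLOCK) && realBits.get(n);
    }
}
```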
Setting aside the heap's capacity, are there ways to go beyond Integer.MAX_VALUE constraints in Java?
Examples are:
Collections limit themselves to Integer.MAX_VALUE.
StringBuilder / StringBuffer limit themselves to Integer.MAX_VALUE.
If you have a huge Collection you're going to hit all sorts of practical limits before you ever have 2^31 - 1 items in it. A Collection with a million items in it is going to be pretty unwieldy, let alone one with more than a thousand times as many.
Similarly, a StringBuilder can build a String of about two billion characters before it hits the MAX_VALUE limit, which is more than adequate for any practical purpose.
If you truly think that you might be hitting these limits your application should be storing your data in a different way, probably in a database.
With a long? Works for me.
Edit: Ah, clarification of the question. Cool. My new and improved answer:
With a paging algorithm.
Coincidentally, somewhat recently for another question (Binary search in a sorted (memory-mapped ?) file in java), I whipped up a paging algorithm to get around the int parameters in the java.nio.MappedByteBuffer API.
You can create your own collections which have a long size(), based on the source code for those collections. To have larger arrays of Objects, for example, you can have an array of arrays (and stitch these together).
This approach will allow almost 2^62 elements.
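A rough sketch of the array-of-arrays idea, using long elements instead of Objects to keep it short. BigLongArray and the chunk size of 2^16 are arbitrary choices for the example, so this particular layout tops out well below 2^62 elements.

```java
/** Array of longs addressed by a long index, stitched together from chunks. Hypothetical helper. */
final class BigLongArray {
    private static final int CHUNK_BITS = 16;              // assumed chunk size: 65,536 elements
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;
    private final long size;

    BigLongArray(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("negative size");
        }
        this.size = size;
        int chunkCount = (int) ((size + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new long[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            long remaining = size - ((long) i << CHUNK_BITS);
            chunks[i] = new long[(int) Math.min(CHUNK_SIZE, remaining)]; // last chunk may be short
        }
    }

    long get(long index) {
        checkIndex(index);
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, long value) {
        checkIndex(index);
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    long size() {
        return size;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}
```

The high bits of the index select the chunk and the low bits select the slot inside it, so a lookup costs one extra array dereference.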
Array indexes are limited by Integer.MAX_VALUE, not the physical size of the array.
Therefore the maximum size of an array is linked to the size of the array-type.
byte = 1 byte => max 2 GB data
char = 2 bytes => max 4 GB data
int = 4 bytes => max 8 GB data
long = 8 bytes => max 16 GB data
Dictionaries are a different story because they often use techniques like buckets or an internal data layout as a tree. Therefore these "limits" usually don't apply, or you will need even more data to reach the limit.
Short: Integer.MAX_VALUE is not really a limit because you need lots of memory to actually reach the limit. If you should ever reach this limit you might want to think about improving your algorithm and/or data-layout :)
Yes, with BigInteger class.
A memory upgrade is necessary.. :)