Efficient growable java byte array that allows access to bytes


Does anyone know of a Java class to store bytes that satisfies the following conditions?
1. Stores bytes efficiently (i.e. not one object per byte).
2. Grows automatically, like a StringBuilder.
3. Allows indexed access to all of its bytes (without copying everything to a byte[]).
Nothing I've found so far satisfies these. Specifically:
byte[] : Doesn't satisfy 2.
ByteBuffer : Doesn't satisfy 2.
ByteArrayOutputStream : Doesn't satisfy 3.
ArrayList<Byte> : Doesn't satisfy 1 (AFAIK, unless there's some special-case optimisation).
If I can efficiently remove bytes from the beginning of the array that would be nice. If I were writing it from scratch I would implement it as something like
{ ArrayList<byte[256]> data; int startOffset; int size; }
and then the obvious functions. Does something like this exist?

The most straightforward approach would be to subclass ByteArrayOutputStream and add functionality to access the underlying byte[].
Removal of bytes from the beginning can be implemented in different ways depending on your requirements. If you need to remove a chunk, System.arraycopy should work fine. If you need to remove single bytes, I would add a headIndex that keeps track of the beginning of the data (performing an arraycopy after enough data has been "removed").
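A minimal sketch of that idea, assuming the protected buf and count fields that ByteArrayOutputStream exposes to subclasses. The class name IndexedByteStream, the get and removeFirst methods, and the "compact once more than half the buffer is dead" rule are illustrative choices, not part of any library:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical subclass: adds indexed reads and cheap removal from the front.
// writeTo() and toString() are not overridden here and would still see the
// bytes that were removed but not yet compacted.
final class IndexedByteStream extends ByteArrayOutputStream {
    private int head; // index in buf of the first live byte

    /** Number of live bytes. */
    @Override
    public synchronized int size() {
        return count - head;
    }

    /** Returns the byte at the given logical index without copying the array. */
    public synchronized byte get(int index) {
        if (index < 0 || index >= count - head) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[head + index];
    }

    /** Drops n bytes from the front; compacts once more than half of buf is dead. */
    public synchronized void removeFirst(int n) {
        if (n < 0 || n > count - head) {
            throw new IllegalArgumentException("n " + n);
        }
        head += n;
        if (head > count / 2) {
            System.arraycopy(buf, head, buf, 0, count - head);
            count -= head;
            head = 0;
        }
    }

    @Override
    public synchronized byte[] toByteArray() {
        return Arrays.copyOfRange(buf, head, count);
    }

    @Override
    public synchronized void reset() {
        super.reset();
        head = 0;
    }
}
```

Appending still goes through the inherited write methods, so growth is handled by ByteArrayOutputStream itself.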

There are some high-performance primitive collection libraries, such as HPPC or Koloboke.

You'd have to write one. Off the top of my head, what I would do is create an ArrayList internally and store the bytes four to each int, with appropriate functions for masking off the bytes. Performance will be suboptimal for removing and adding individual bytes. However, it will store the object in the minimal size if that is a real consideration, wasting no more than 3 bytes of storage (on top of the overhead for the ArrayList).

The laziest method will be ArrayList<Byte>. It's not as inefficient as you seem to believe, since Byte instances can and will be shared, meaning there will be only 256 Byte objects in the entire VM unless you yourself do a "new Byte()" somewhere.

Related

What is the purpose of a Buffer in Java?

Buffer is an abstract class having concrete subclasses such as ByteBuffer, IntBuffer, etc. It seems to be a container of data of a specific primitive type. What are the benefits of a Buffer? Why wouldn't I just use an array or a list?

A buffer can be defined, in its simplest form, as a contiguous block of memory of some type. Hence a byte buffer of size 4K (4096 bytes) may occupy memory locations 0xf000 through 0xffff inclusive.
As to why a buffer type may be used instead of an array or list, neither of those two alternatives has the in-built features of limit, position or mark.
On the first item, a buffer separates the capacity from the limit in that you can have a capacity of 1000 with a current limit of 10. In other words, it enforces the ability to have a variable size up to and including the capacity.
For the other two features, the current position provides an in-built way to read or write the next element, easing sequential processing, and the mark provides a way to save the current position for later reset.
All these features would require extra variables if you needed them in conjunction with an array or list.
Of course, if you don't need any of these features then, by all means, use an array.
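As a small illustration of capacity, limit, position and mark, here is a sketch using java.nio.ByteBuffer. The class and method names BufferDemo and demo are made up for the example:

```java
import java.nio.ByteBuffer;

final class BufferDemo {
    static String demo() {
        ByteBuffer buf = ByteBuffer.allocate(1000);   // capacity is 1000
        buf.put((byte) 'a').put((byte) 'b').put((byte) 'c');
        buf.flip();                                   // limit = 3, position = 0

        StringBuilder sb = new StringBuilder();
        sb.append((char) buf.get());                  // reads 'a', position = 1
        buf.mark();                                   // remember position 1
        sb.append((char) buf.get());                  // reads 'b'
        sb.append((char) buf.get());                  // reads 'c', position = limit
        buf.reset();                                  // back to the mark
        sb.append((char) buf.get());                  // reads 'b' again

        return sb + " capacity=" + buf.capacity() + " limit=" + buf.limit();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // abcb capacity=1000 limit=3
    }
}
```

With a plain array, the read index, the end of valid data and the saved index would each be a separate variable that the caller has to keep consistent.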

Why is the Minimum granularity defined as 8192 in Java8 in order to switch from Parallel Sort to Arrays.sort regardless of type of data

I was going through the concepts of parallel sort introduced in Java 8.
As per the doc:
If the length of the specified array is less than the minimum granularity, then it is sorted using the appropriate Arrays.sort method.
The spec however doesn't specify this minimum limit.
When I looked up the Code in java.util.Arrays it was defined as
private static final int MIN_ARRAY_SORT_GRAN = 1 << 13;
i.e., 8192 values in the array
As per the explanation provided here.
I understand why the value was Hard-coded as 8192.
It was designed keeping the current CPU architecture in mind.
With the -XX:+UseCompressedOops option being enabled by default, any system with less than 32GB RAM would be using 32bit(4bytes) pointers. Now, with a L1 Cache size of 32KB for data portion, we can pass 32KB/4Bytes = 8KB of data at once to CPU for computation. That's equivalent to 8192 bytes of data being processed at once.
So for a function which is working on sorting a byte array parallelSort(byte[]) this makes sense. You can keep minimum parallel sort limit as 8192 values (each value = 1 byte for byte array).
But if you consider public static void parallelSort(int[] a):
An int variable is 4 bytes (32 bits). So ideally, of the 8192 bytes, we can store 8192/4 = 2048 numbers in the CPU cache at once.
So the minimum granularity in this case is supposed to be 2048.
Why are all parallelSort functions in Java (be it byte[], int[], long[], etc.) using 8192 as the default min. number of values needed in order to perform parallel sorting?
Shouldn't it vary according to the types passed to the parallelSort function?

First, it seems that you've misread the linked explanation. The L1 data cache is 32 KB, so for int[] the threshold fits exactly: 32768 / 4 = 8192 ints can be held in the L1 cache at once.
Second, I don't think the given explanation is correct. It concentrates on pointers, so it is mainly about sorting object arrays, but when you compare the elements of an object array, you always need to dereference these pointers to access the real data. And if your objects have non-primitive fields, you'll have to dereference even further. For example, if you sort an array of strings, you have to access not only the array itself, but also the String objects and the char[] arrays stored inside them. All of these would require many additional cache lines.
I did not find any explicit explanation of this particular value in the review thread for this change. Previously it was 256, then it was changed to 8192 as part of the JDK-8014076 update. I think it simply showed the best performance on some reasonable test suite. Keeping separate thresholds for different cases would add more complexity, and the tests probably showed that it does not pay off. Note that an ideal threshold is impossible for Object[] arrays, as the compare function is user-specified and could have arbitrary complexity. For a sufficiently complex compare function it's probably reasonable to parallelize even very small arrays.
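The cache arithmetic from the first point can be checked with a few lines. This is only a sketch of the "how many elements fit" calculation, assuming a 32 KB L1 data cache as in the linked explanation; CacheFit and elementsThatFit are made-up names, and nothing here reflects how the JDK actually picked the constant:

```java
final class CacheFit {
    // Assumption taken from the discussion above: 32 KB of L1 data cache.
    static final int L1_DATA_BYTES = 32 * 1024;

    /** Number of elements of the given width that fit into the assumed cache. */
    static int elementsThatFit(int bytesPerElement) {
        return L1_DATA_BYTES / bytesPerElement;
    }

    public static void main(String[] args) {
        System.out.println("byte: " + elementsThatFit(Byte.BYTES));    // byte: 32768
        System.out.println("int:  " + elementsThatFit(Integer.BYTES)); // int:  8192
        System.out.println("long: " + elementsThatFit(Long.BYTES));    // long: 4096
    }
}
```

Only the int case lands on 8192, which fits the view that the constant is an empirically chosen threshold and not a per-type cache calculation.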

Is there a way to efficiently store a sequence of enum values in Java?

I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
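The "poor man's solution" of radix-encoding elements into a long might look roughly like the sketch below. RadixPacker, pack and get are hypothetical names; the caller is assumed to pass at most as many digits as fit into a non-negative long (27 for radix 5), which the code does not check:

```java
final class RadixPacker {
    /**
     * Packs digits (each in [0, radix)) into one long.
     * digits[0] ends up as the least significant base-radix digit.
     * Assumption: radix^digits.length fits into 63 bits (e.g. 27 digits for radix 5).
     */
    static long pack(int[] digits, int radix) {
        long packed = 0;
        for (int i = digits.length - 1; i >= 0; i--) {
            if (digits[i] < 0 || digits[i] >= radix) {
                throw new IllegalArgumentException("digit out of range: " + digits[i]);
            }
            packed = packed * radix + digits[i];
        }
        return packed;
    }

    /** Extracts the digit at the given index from a packed long. */
    static int get(long packed, int radix, int index) {
        for (int i = 0; i < index; i++) {
            packed /= radix;
        }
        return (int) (packed % radix);
    }
}
```

For an enum, the digits would be the values of ordinal() and the radix would be MyEnum.values().length; decoding maps the digit back through MyEnum.values()[digit].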

You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be: why do you want to save a few bytes in enum references? Other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.

One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
"In principle it should be possible to encode each element using log2(MyEnum.values().length) bits."
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.
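The "cast to a byte" option could be sketched as follows. EnumByteSequence is a made-up class, not an existing library type, and it assumes the enum has at most 256 constants so that every ordinal fits into one byte:

```java
import java.util.Arrays;

// Hypothetical growable sequence that stores one byte per enum value.
final class EnumByteSequence<E extends Enum<E>> {
    private final E[] universe;          // all constants, indexed by ordinal
    private byte[] ordinals = new byte[16];
    private int size;

    EnumByteSequence(Class<E> type) {
        universe = type.getEnumConstants();
        if (universe.length > 256) {
            throw new IllegalArgumentException("too many constants for one byte");
        }
    }

    void add(E value) {
        if (size == ordinals.length) {
            ordinals = Arrays.copyOf(ordinals, size * 2); // grow like ArrayList
        }
        ordinals[size++] = (byte) value.ordinal();
    }

    E get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return universe[ordinals[index] & 0xFF]; // mask to undo sign extension
    }

    int size() {
        return size;
    }
}
```

This uses 8 bits per element, which is more than log2 of the number of constants for small enums but already well below the 32 or 64 bits of an object reference.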

HashSet of Strings taking up too much memory, suggestions...?

I am currently storing a list of words (around 120,000) in a HashSet, to check entered words against and see whether they are spelt correctly, just returning yes or no.
I was wondering if there is a way to do this which takes up less memory. Currently 120,000 words take around 12 MB; the actual file the words are read from is around 900 KB.
Any suggestions?
Thanks in advance

You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie

Check out Bloom filters or cuckoo hashing (see the related question "Bloom filter or cuckoo hashing?").
I am not sure if this answers your question, but these alternatives are worth looking into. Bloom filters are commonly used for spell-checker-style use cases.
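A Bloom filter for a word list can be sketched with java.util.BitSet. Everything below (the class name WordBloomFilter, the FNV-1a-style second hash, the double-hashing scheme) is an illustrative choice and not a tuned implementation. Keep in mind that a Bloom filter can report false positives, so a small fraction of misspelt words would be accepted:

```java
import java.util.BitSet;

final class WordBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    WordBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Second hash, FNV-1a style over the UTF-16 chars (an arbitrary choice). */
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    /** i-th bit position for a word, derived from two base hashes. */
    private int position(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String word) {
        int h1 = word.hashCode();
        int h2 = fnv1a(word);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(h1, h2, i));
        }
    }

    /** False means definitely absent; true means probably present. */
    boolean mightContain(String word) {
        int h1 = word.hashCode();
        int h2 = fnv1a(word);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

As a rough sizing guide, about 10 bits per word with 7 hash functions gives a false-positive rate in the region of 1%, so 120,000 words would need roughly 150 KB.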

HashSet is probably not the right structure for this. Use Trie instead.

This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.
http://www.pathcom.com/~vadco/dawg.html
TWL06 - 178,691 words - fits into 494,676 Bytes
The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.
If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.
http://www.pathcom.com/~vadco/adtdawg.html
All the very best,
JohnPaul Adamovsky

12MB to store 120,000 words is about 100 bytes per word. Probably at least 32 bytes of that is String overhead. If words average 10 letters and they are stored as 2-byte chars, that accounts for another 20 bytes. Then there is the reference to each String in your HashSet, which is probably another 4 bytes. The remaining 44 bytes is probably the HashSet entry and indexing overhead, or something I haven't considered above.
The easiest thing to go after is the overhead of the String objects themselves, which can take far more memory than is required to store the actual character data. So your main approach would be to develop a custom representation that avoids storing a separate object for each string. In the course of doing this, you can also get rid of the HashSet overhead, since all you really need is a simple word lookup, which can be done by a straightforward binary search on an array that will be part of your custom implementation.
You could create your custom implementation as an array of type int with one element for each word. Each of these int elements would be broken into sub-fields that contain a length and an offset that points into a separate backing array of type char. Put both of these into a class that manages them, and that supports public methods allowing you to retrieve and/or convert your data and individual characters given a string index and an optional character index, and to perform the simple searches on the list of words that are needed for your spell check feature.
If you have no more than 16777216 characters of underlying string data (e.g., 120,000 strings times an average length of 10 characters = 1.2 million chars), you can take the low-order 24 bits of each int and store the starting offset of each string into your backing array of char data, and take the high-order 8 bits of each int and store the size of the corresponding string there.
Your char data will have your erstwhile strings crammed together without any delimiters, relying entirely upon the int array to know where each string starts and ends.
Taking the above approach, your 120,000 words (at an average of 10 letters each) would require about 2,400,000 bytes of backing array data and 480,000 bytes of integer index data (120,000 x 4 bytes), for a total of 2,880,000 bytes, which is about a 75 percent savings over the present 12MB amount you have reported above.
The words in the arrays would be sorted alphabetically, and your lookup process could be a simple binary search on the int array (retrieving the corresponding words from the char array for each test), which should be very efficient.
If your words happen to be entirely ASCII data, you could save an additional 1,200,000 bytes by storing the backing data as bytes instead of as chars.
This could get more difficult if you needed to alter these strings. Apparently, in your case (spell checker), you don't need to (unless you want to support user additions to the list, which would be infrequent anyway, and so re-writing the char data and indexes to add or delete words might be acceptable).
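The layout described above (an int index with 8 bits of length and 24 bits of offset, pointing into one shared char array, searched by binary search) could be sketched like this. PackedWordList and its methods are hypothetical names, and the sketch covers only the read-only case:

```java
import java.util.Arrays;

final class PackedWordList {
    private final char[] chars; // all words concatenated, no delimiters
    private final int[] index;  // high 8 bits: length, low 24 bits: offset

    PackedWordList(String[] words) {
        String[] sorted = words.clone();
        Arrays.sort(sorted); // same ordering as String.compareTo
        int total = 0;
        for (String w : sorted) {
            if (w.length() > 255) {
                throw new IllegalArgumentException("word longer than 255 chars");
            }
            total += w.length();
        }
        if (total > 0xFFFFFF) {
            throw new IllegalArgumentException("more text than 24-bit offsets allow");
        }
        chars = new char[total];
        index = new int[sorted.length];
        int offset = 0;
        for (int i = 0; i < sorted.length; i++) {
            String w = sorted[i];
            w.getChars(0, w.length(), chars, offset);
            index[i] = (w.length() << 24) | offset;
            offset += w.length();
        }
    }

    /** Binary search over the sorted index. */
    boolean contains(String word) {
        int lo = 0;
        int hi = index.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compare(mid, word);
            if (cmp == 0) {
                return true;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return false;
    }

    /** Compares stored word i with target, char by char, then by length. */
    private int compare(int i, String target) {
        int offset = index[i] & 0xFFFFFF;
        int length = index[i] >>> 24;
        int n = Math.min(length, target.length());
        for (int k = 0; k < n; k++) {
            int diff = chars[offset + k] - target.charAt(k);
            if (diff != 0) {
                return diff;
            }
        }
        return length - target.length();
    }
}
```

No String objects are created during lookup; the comparison reads directly from the backing array.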

One way to save memory is to use a radix tree. This is more compact than a plain trie, because chains of single-child nodes are merged into a single edge.
As your dictionary is fixed, another way is to build a perfect hash function for it. Your hash set does not need buckets (and the associated overhead) as there cannot be collisions. Every implementation of a hash table/hash set that uses open addressing can be used for this (like Google Collections' ImmutableSet).

The problem is by design: storing such a huge number of words in a HashSet for spell-checking isn't a good idea.
You can either use a spell-checker (example: http://softcorporation.com/products/spellcheck/ ), or you can build an "auto word completion" with a prefix tree (description: http://en.wikipedia.org/wiki/Trie ).
There is no way to reduce memory-usage in this design.

You can also try a radix tree. It is somewhat like a trie but more memory-efficient.

Custom java serialization of message

While writing a message on wire, I want to write down the number of bytes in the data followed by the data.
Message format:
{num of bytes in data}{data}
I can do this by writing the data to a temporary ByteArrayOutputStream, obtaining the byte array size from it, and then writing the size followed by the byte array. This approach involves a lot of overhead, viz. unnecessary creation of temporary byte arrays, creation of temporary streams, etc.
Do we have a better (considering both CPU and garbage creation) way of achieving this?

A typical approach would be to introduce a re-useable ByteBuffer. For example:
ByteBuffer out = ...
int oldPos = out.position(); // Remember current position.
out.position(oldPos + 2); // Leave space for message length (unsigned short)
out.putInt(...); // Write out data.
// Finally prepend buffer with number of bytes.
out.putShort(oldPos, (short)(out.position() - (oldPos + 2)));
Once the buffer is populated you could then send the data over the wire using SocketChannel.write(ByteBuffer) (assuming you are using NIO).
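A self-contained version of the same reserve-then-backfill pattern is sketched below. LengthPrefixedWriter, writeMessage and the int plus long payload are invented for the example; the length field is an unsigned short as in the snippet above:

```java
import java.nio.ByteBuffer;

final class LengthPrefixedWriter {
    /**
     * Appends {unsigned short length}{payload} to out.
     * The payload here is a hypothetical int followed by a long.
     */
    static void writeMessage(ByteBuffer out, int a, long b) {
        int lengthPos = out.position();          // where the length will go
        out.position(lengthPos + Short.BYTES);   // leave room for it
        out.putInt(a);                           // write the payload
        out.putLong(b);
        int bodyLength = out.position() - (lengthPos + Short.BYTES);
        if (bodyLength > 0xFFFF) {
            throw new IllegalStateException("payload too large for unsigned short");
        }
        // Absolute put: fills in the length without moving the position.
        out.putShort(lengthPos, (short) bodyLength);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        writeMessage(buf, 7, 9L);
        System.out.println(buf.getShort(0) & 0xFFFF); // 12
    }
}
```

The reader has to mask the short with 0xFFFF, as in main, to interpret it as unsigned.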

Here’s what I would do, in order of preference.
1. Don’t bother about memory consumption and stuff. Most likely this already is the optimal solution, unless it takes so much time to create the byte representation of your data that creating it twice has a noticeable impact.
2. (Actually this would be more like #37 on my list, with #2 to #36 being empty.) Include a method in all your data objects that can calculate the size of the byte representation and takes fewer resources than it would to create the byte representation.
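The second option could look roughly like the sketch below. Greeting, encodedSize and writeTo are hypothetical names, and the message layout (an int id followed by UTF-8 bytes) is invented for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical data object that knows its encoded size up front.
final class Greeting {
    private final int id;
    private final byte[] nameUtf8; // encoded once, at construction time

    Greeting(int id, String name) {
        this.id = id;
        this.nameUtf8 = name.getBytes(StandardCharsets.UTF_8);
    }

    /** Size of the data part, computed without building a temporary array. */
    int encodedSize() {
        return Integer.BYTES + nameUtf8.length;
    }

    /** Writes {num of bytes in data}{data} straight into the output buffer. */
    void writeTo(ByteBuffer out) {
        out.putInt(encodedSize());
        out.putInt(id);
        out.put(nameUtf8);
    }
}
```

Because the size is known before any data is written, no temporary stream or intermediate byte array is needed for the message as a whole.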
