Java's limitation of MappedByteBuffer to 2 GiB makes it tricky to use for mapping big files. The usual recommended approach is to use an array of MappedByteBuffer and index it through:
private static final long PAGE_SIZE = Integer.MAX_VALUE;
MappedByteBuffer[] buffers;

private int getPage(long offset) {
    return (int) (offset / PAGE_SIZE);
}

private int getIndex(long offset) {
    return (int) (offset % PAGE_SIZE);
}

public byte get(long offset) {
    return buffers[getPage(offset)].get(getIndex(offset));
}
This can work for single bytes, but it requires rewriting a lot of code if you want to handle reads/writes that are bigger and cross page boundaries (getLong() or get(byte[])).
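For example, a cross-boundary getLong ends up being assembled byte by byte on top of the get(long) above (a hypothetical sketch; big-endian, and noticeably slower than a plain ByteBuffer.getLong()):

public long getLong(long offset) {
    long result = 0;
    // One byte at a time, so the read may freely cross page boundaries.
    for (int i = 0; i < Long.BYTES; i++) {
        result = (result << 8) | (get(offset + i) & 0xFF);
    }
    return result;
}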
The question: what is your best practice for this kind of scenario? Do you know any working solution/code that can be reused without reinventing the wheel?
Have you checked out dsiutil's ByteBufferInputStream?
Javadoc
The main usefulness of this class is that it makes it possible to create input streams that are really based on a MappedByteBuffer.
In particular, the factory method map(FileChannel, FileChannel.MapMode) will memory-map an entire file into an array of ByteBuffer and expose the array as a ByteBufferInputStream. This makes it possible to access easily mapped files larger than 2GiB.
long length()
long position()
void position(long newPosition)
Is that something you were thinking of? It's LGPL too.
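A minimal usage sketch (assuming dsiutils is on the classpath and the class lives in it.unimi.dsi.io; check the current Javadoc for the exact signatures):

import it.unimi.dsi.io.ByteBufferInputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;

public class MapLargeFile {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]))) {
            // map() memory-maps the whole file into an array of ByteBuffers
            ByteBufferInputStream in =
                ByteBufferInputStream.map(channel, FileChannel.MapMode.READ_ONLY);
            in.position(3_000_000_000L); // seek well past the 2 GiB limit
            System.out.println("byte at 3e9: " + in.read());
        }
    }
}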
I am using a Java ByteBuffer to save some basic data into streams. In one situation I must transfer a "boolean list" from one machine to another over the internet, so I want the buffer to be as small as possible.
I know the normal way of doing this is to use the buffer like this:
public final void writeBool(boolean b) throws IOException {
    writeByte(b ? 1 : 0);
}

public final void writeByte(int b) throws IOException {
    if (buffer.remaining() < Byte.BYTES) {
        flush();
    }
    buffer.put((byte) b);
}

public boolean readBool(long pos) {
    return readByte(pos) == 1;
}

public int readByte(long pos) {
    return buffer.get((int) pos) & 0xff;
}
This converts a boolean into a byte and stores it in the buffer.
But I'm wondering: why not just put a bit into the buffer, so that a byte can represent eight booleans?
The code might look like this, but Java doesn't have a writeBit function:
public final void writeBool(boolean b) throws IOException {
    // Java doesn't have this.
    buffer.writeBit(b ? 0x1 : 0x0);
}

public final boolean readBool(long pos) throws IOException {
    // Java doesn't have this either.
    return buffer.getBit(pos) == 0x01;
}
So I think the only way of doing that is to store eight booleans in a byte and write that, e.g. ((0x1f >>> 4) & 0x01) == 1 to check whether the fifth boolean is true. But if I can get a byte, why not just let me get a bit?
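For illustration, the manual packing amounts to something like this (a sketch):

boolean[] flags = { true, false, true, true, false, false, true, false };
byte packed = 0;
for (int i = 0; i < 8; i++) {
    if (flags[i]) {
        packed |= (byte) (1 << (7 - i)); // most significant bit first
    }
}
// Unpack the fifth boolean (index 4, i.e. bit 3):
boolean fifth = ((packed >>> 3) & 0x01) == 1;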
Is there some other reason that Java doesn't let us operate on bits?
So what I really mean is: why not create a BitBuffer?
That would be a question for the Java / OpenJDK development team, if you want a definitive answer. However, I expect they would make these points:
Such a class would have extremely limited utility in real applications.
Such a class is unnecessary, given that an application doing (notionally) bit-oriented I/O can be implemented using ByteBuffer and a small amount of "bit-twiddling".
There is the technical issue that mainstream operating systems and mainstream network protocols only support I/O down to the granularity of a byte [1]. So, for example, file lengths are recorded in bytes, and creating a file containing precisely 42 bits of data (for example) is problematic.
Anyway, there is nothing stopping you from designing and writing your own BitBuffer class, e.g. as a wrapper for ByteBuffer, and sharing it with other people who need such a thing (a sketch follows below).
Or looking on (say) Github for a Java class called BitBuffer.
[1] Indeed, I don't know of any operating system, file system or network protocol that has a smaller granularity than this.
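For what it's worth, a minimal write-side sketch of such a wrapper (a hypothetical class, not a JDK or library API):

import java.nio.ByteBuffer;

public class BitBuffer {
    private final ByteBuffer buffer;
    private int current;  // bits accumulated so far, most significant first
    private int bitCount; // how many bits of 'current' are in use

    public BitBuffer(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    public void writeBit(boolean bit) {
        current = (current << 1) | (bit ? 1 : 0);
        if (++bitCount == 8) {
            buffer.put((byte) current); // flush a complete byte to the ByteBuffer
            current = 0;
            bitCount = 0;
        }
    }

    // Pad the last partial byte with zero bits and write it out.
    public void flush() {
        if (bitCount > 0) {
            buffer.put((byte) (current << (8 - bitCount)));
            current = 0;
            bitCount = 0;
        }
    }
}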
I solved this problem using a graph, but unfortunately now I'm stuck with having to use a 2D array, and I have questions about the best way to go about this:
public class Data {
    int[][] structure;

    public Data(int x, int y) {
        structure = new int[x][y];
    }

    public <<TBD>> generateRandom() {
        // This is what my question is about
    }
}
I have a controller/event handler class:
public class Handler implements EventHandler {
#Override
public void onEvent(Event<T> e) {
this.dataInstance.generateRandom();
// ... other stuff
}
}
Here is what each method will do:
Data.generateRandom() will generate a random value at a random location in the 2D int array if there exists a value in the structure that is not initialized, or a value exists that is equal to zero
If there is no available spot in the structure, the structure's state is final (i.e. in the literal sense, not the Java declaration)
This is what I'm wondering:
What is the most efficient way to check if the board is full? Using a graph, I was able to check whether the board was full in O(1) and get an available yet random location in worst-case O(n^2 - 1), best-case O(1). Obviously, with an array, improving on n^2 is tough, so I'm now just focusing on execution speed and LOC. Would the fastest way now be to check the entire 2D array with streams, something like:

Arrays.stream(structure)
      .flatMapToInt(Arrays::stream)
      .filter(x -> x != 0)
      .count() == (long) structure.length * structure[0].length
(1) You can definitely use a parallel stream to safely perform read-only operations on the array. You can also use an anyMatch call, since for the isFull check you only care whether there exists any one space that hasn't been initialized. That could look like this:
Arrays.stream(structure)
      .parallel()
      .flatMapToInt(Arrays::stream)
      .anyMatch(i -> i == 0)
However, that is still an n^2 solution. What you could do instead is keep a counter of the number of uninitialized spaces, which you decrement when you initialize a space for the first time. Then the isFull check is always constant time (you're just comparing an int to 0).
public class Data {
    private int numUninitialized;
    private int[][] structure;

    public Data(int x, int y) {
        if (x <= 0 || y <= 0) {
            throw new IllegalArgumentException("You can't create a Data object with an argument that isn't a positive integer.");
        }
        structure = new int[x][y];
        numUninitialized = x * y; // assign the field; don't redeclare it here, or it gets shadowed
    }

    public void generateRandom() {
        if (isFull()) {
            // do whatever you want when the array is full
        } else {
            // Calculate the random space you want to set a value for
            int x = ThreadLocalRandom.current().nextInt(structure.length);
            int y = ThreadLocalRandom.current().nextInt(structure[0].length);
            if (structure[x][y] == 0) {
                // A new, uninitialized space
                numUninitialized--;
            }
            // Populate the space with a random value
            // (note: nextInt can return 0, which this scheme treats as "uninitialized")
            structure[x][y] = ThreadLocalRandom.current().nextInt(Integer.MIN_VALUE, Integer.MAX_VALUE);
        }
    }

    public boolean isFull() {
        return 0 == numUninitialized;
    }
}
Now, this is with my understanding that each time you call generateRandom you take a random space (including ones already initialized). If you are supposed to ONLY choose a random uninitialized space each time it's called, then you'd do best to hold an auxiliary data structure of all the open grid locations, so that you can easily find the next random open space and tell if the structure is full; see the sketch below.
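One way to do that (a sketch, untested; assumes java.util.List/ArrayList are imported): give Data a list of open cells, filled with every (i, j) pair in the constructor, and swap-remove a random entry, which makes both the random pick and the isFull check O(1):

private final List<int[]> openCells = new ArrayList<>(); // fill with every (i, j) in the constructor

// Pick and remove a random open cell in O(1) by swapping it with the last entry:
private int[] takeRandomOpenCell() {
    int idx = ThreadLocalRandom.current().nextInt(openCells.size());
    int[] cell = openCells.set(idx, openCells.get(openCells.size() - 1));
    openCells.remove(openCells.size() - 1);
    return cell;
}

// isFull() then simply becomes openCells.isEmpty()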
(2) What notification method is appropriate for letting other classes know the array is now immutable? It's kind of hard to say as it depends on the use case and the architecture of the rest of the system this is being used in. If this is an MVC application with a heavy use of notifications between the data model and a controller, then an observer/observable pattern makes a lot of sense. But if your application doesn't use that anywhere else, then perhaps just having the classes that care check the isFull method would make more sense.
(3) Java is efficient at creating and freeing short-lived objects. However, since the arrays can be quite large, I'd say that allocating a new array object (and copying the data) each time you alter the array seems... inefficient at best. Java has the ability to do some functional styles of programming (especially with the inclusion of lambdas in Java 8), but only using immutable objects and a purely functional style is kind of like the round hole to Java's square peg.
I am declaring a byte array whose size is unknown to me, as it keeps on being updated. How can I declare a byte array of infinite/variable size?
You cannot declare an array of infinite size, as that would require infinite memory. Additionally, all the allocation calls deal with numbers, not infinite quantities.
You can allocate a byte buffer that resizes on demand. I believe the easiest choice would be a ByteArrayOutputStream.
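For instance (a minimal sketch using java.io.ByteArrayOutputStream):

ByteArrayOutputStream out = new ByteArrayOutputStream();
for (int i = 0; i < 100_000; i++) {
    out.write(i & 0xFF); // the stream grows on demand as bytes arrive
}
byte[] result = out.toByteArray(); // exactly as long as what was written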
ByteBuffer has an API which makes manipulation of the buffer easier, but you would have to build the resize functionality yourself. The easiest way will be to allocate a new, larger array, copy the old contents in, and swap the new buffer for the old.
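A minimal grow-and-swap helper for that route might look like this (a sketch; the JDK provides no resize on ByteBuffer itself):

static ByteBuffer grow(ByteBuffer old, int minCapacity) {
    int newCapacity = Math.max(old.capacity() * 2, minCapacity);
    ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
    old.flip();      // switch the old buffer from writing to reading
    bigger.put(old); // copy the existing contents into the new buffer
    return bigger;
}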
Other answers have mentioned using a List<Byte> of some sort. It is worth noting that if you create a bunch of new Byte() objects, you can dramatically increase memory consumption. Byte.valueOf sidesteps this problem, but you have to ensure that it is consistently used throughout your code. If you intend to use this list in many places, I might consider writing a simple List decorator which interns all the elements. For example:
public class InterningList extends AbstractList<Byte>
{
    ...

    @Override
    public boolean add(Byte b) {
        return super.add(Byte.valueOf(b));
    }

    ...
}
This is not a complete (or even tested) example, just something to start with...
Arrays in Java are not dynamic. You can use a List instead.
List<Byte> list = new ArrayList<Byte>();
Thanks to autoboxing, you can freely add either Byte objects or primitive bytes to this list.
To define a byte array of varying length, use Apache Commons IO's IOUtils instead of assigning a manual length like

byte[] b = new byte[50];

You can pass your input stream to an IOUtils function, which will read from the stream so that the byte array ends up with exactly as many bytes as required. For example:

byte[] b = IOUtils.toByteArray(inputstream);
ByteArrayOutputStream will allow for writing to a dynamic byte array. However, methods such as remove, replace and insert are not available; one has to extract the byte array and then manipulate it directly.
Your best bet is to use an ArrayList, as it resizes as you fill it.
List<Byte> array = new ArrayList<Byte>();
The obvious solution would be to use an ArrayList.
But this is a bad solution if you need performance or are constrained in memory, as it doesn't really store bytes but Bytes (that is, objects).
For any real application, the answer is simple: you have to manage the byte array yourself, with methods that make it grow as necessary. You may wrap it in a specific class if needed:
public class AlmostInfiniteByteArray {
    private byte[] array;
    private int size;

    public AlmostInfiniteByteArray(int cap) {
        array = new byte[cap];
        size = 0;
    }

    public int get(int pos) {
        if (pos >= size) throw new ArrayIndexOutOfBoundsException();
        return array[pos];
    }

    public void set(int pos, byte val) {
        if (pos >= size) {
            if (pos >= array.length) {
                // grow by 25% past the requested position
                byte[] newarray = new byte[(pos + 1) * 5 / 4];
                System.arraycopy(array, 0, newarray, 0, size);
                array = newarray;
            }
            size = pos + 1;
        }
        array[pos] = val;
    }
}
Use an ArrayList, or any other implementation of List.
The different implementations of List let you do different things with the list (e.g. different traversal strategies, different performance characteristics, etc.).
The initial capacity of an ArrayList is 10; you can change that with new ArrayList<Byte>(5000).
An ArrayList grows by about half its current size when needed (it creates a new array and copies the old one into the new one).
I would slightly tweak the other answers.
Create a LargeByteArray class to manage your array. It will have get and set methods, etc, whatever you will need.
Behind the scenes that class will use a long to hold the current length and use an ArrayList to store the contents of the array.
I would store byte[8192] or byte[16384] arrays in the ArrayList. That gives a reasonable trade-off between wasted space and the frequency of resizing.
You can even make the array 'sparse', i.e. only allocate the list.get(index/8192) entry if there is a non-zero value stored in that box.
Such a structure can give you significantly more storage in some cases.
Another strategy is to compress the byte[] boxes after writes and uncompress them before reads (with an LRU cache for reading), which can allow storing twice or more the available RAM, though that depends on the compression strategy.
After that you can look at paging some boxes out to disk...
That's as close to an infinite array as I can get you ;-)
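A rough sketch of that chunked approach (untested; the chunk size and the class name LargeByteArray are just illustrative choices):

import java.util.ArrayList;

public class LargeByteArray {
    private static final int CHUNK = 8192;
    private final ArrayList<byte[]> chunks = new ArrayList<>();
    private long length;

    public byte get(long pos) {
        int box = (int) (pos / CHUNK);
        byte[] chunk = box < chunks.size() ? chunks.get(box) : null;
        return chunk == null ? 0 : chunk[(int) (pos % CHUNK)]; // sparse: a missing box reads as zero
    }

    public void set(long pos, byte val) {
        int box = (int) (pos / CHUNK);
        while (chunks.size() <= box) {
            chunks.add(null); // boxes are allocated lazily
        }
        if (chunks.get(box) == null) {
            if (val == 0) return; // a zero write to an unallocated box is a no-op
            chunks.set(box, new byte[CHUNK]);
        }
        chunks.get(box)[(int) (pos % CHUNK)] = val;
        length = Math.max(length, pos + 1);
    }

    public long length() {
        return length;
    }
}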
You can make use of IOUtils, as Prashant already mentioned.
Here's a little piece from it which can solve the task (you will need IOUtils.toByteArray):
public class IOUtils {
    private static final int DEFAULT_BUFFER_SIZE = 1024 * 4;

    public static byte[] toByteArray(InputStream input) throws IOException {
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        copy(input, output);
        return output.toByteArray();
    }

    public static int copy(InputStream input, OutputStream output)
            throws IOException {
        long count = copyLarge(input, output);
        if (count > Integer.MAX_VALUE) {
            return -1;
        }
        return (int) count;
    }

    public static long copyLarge(InputStream input, OutputStream output)
            throws IOException {
        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
        long count = 0;
        int n = 0;
        while (-1 != (n = input.read(buffer))) {
            output.write(buffer, 0, n);
            count += n;
        }
        return count;
    }
}
I'm writing GC friendly code to read and return to the user a series of byte[] messages. Internally I reuse the same ByteBuffer which means I'll repeatedly return the same byte[] instance most of the time.
I'm considering writing cautionary javadoc and exposing this to the user as a Iterator<byte[]>. AFAIK it won't violate the Iterator contract, but the user certainly could be surprised if they do Lists.newArrayList(myIterator) and get back a List populated with the same byte[] in each position!
The question: is it bad practice for a class that may mutate and return the same object to implement the Iterator interface?
If so, what is the best alternative? "Don't mutate/reuse your objects" is an easy answer. But it doesn't address the cases when reuse is very desirable.
If not, how do you justify violating the principle of least astonishment?
Two minor notes:
I'm using Guava's AbstractIterator so remove() isn't really of concern.
In my use case the user is me and the visibility of this class will be limited, but I've tried to ask this generally enough to apply more broadly.
Update: I'm accepting Louis' answer because it has 3x more votes than Keith's, but note that in my use case I'm planning to take the code that I left in a comment on Keith's answer to production.
EnumMap did essentially exactly this in its entrySet() iterator, which causes confusing, crazy, depressing bugs to this day.
If I were you, I just wouldn't use an Iterator -- I'd write a different API (possibly quite dissimilar from Iterator, even) and implement that. For example, you might write a new API that takes as input the ByteBuffer to write the message into, so users of the API could control whether or not the buffer gets reused. That seems reasonably intuitive (the user can write code that obviously and cleanly reuses the ByteBuffer), without creating unnecessarily cluttered code.
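For example (a sketch; the interface and method names are illustrative, not from any real library):

import java.nio.ByteBuffer;

// The caller supplies the destination buffer, so buffer reuse becomes the
// caller's explicit, visible decision rather than a hidden surprise.
public interface MessageReader {
    // Reads the next message into dst and returns true,
    // or returns false at end of stream.
    boolean nextMessage(ByteBuffer dst);
}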
I would define an intermediate object which you can invalidate. So your function would return an Iterator<ByteArray>, where ByteArray is something like this:
class ByteArray {
    private byte[] data;

    ByteArray(byte[] d) { data = d; }

    byte[] getData() {
        if (data == null) throw new BadUseOfIteratorException();
        return data;
    }

    void invalidate() { data = null; }
}
Then your iterator can invalidate the previously returned ByteArray so that any future access (via getData, or any other accessor you provide) will fail. That way, if someone does something like Lists.newArrayList(myIterator), they will at least get an error (when the first invalid ByteArray is accessed) instead of silently getting the wrong data.
Of course, this won't catch all possible bad uses, but probably the common ones. If you're happy with never returning the raw byte[] and providing accessors like byte get(int idx) instead, then it should catch all cases.
You will have to allocate a new ByteArray for each iterator return, but hopefully that's a lot less expensive than copying your byte[] for each iterator return.
Just like Keith Randall, I'd also create an Iterator<ByteArray>, but working quite differently (the annotations below come from Lombok):
@RequiredArgsConstructor
public class ByteArray {
    @Getter private final byte[] data;
    private final ByteArrayIterable source;

    void allowReuse() {
        source.allowReuse();
    }
}

public class ByteArrayIterable implements Iterable<ByteArray> {
    private boolean allowReuse;

    public void allowReuse() {
        allowReuse = true;
    }

    public Iterator<ByteArray> iterator() {
        return new AbstractIterator<ByteArray>() {
            private ByteArray nextElement;

            public ByteArray computeNext() {
                if (noMoreElements()) return endOfData();
                if (!allowReuse) nextElement =
                    new ByteArray(new byte[length], ByteArrayIterable.this);
                allowReuse = false;
                fillWithNewData(nextElement.getData());
                return nextElement;
            }
        };
    }
}
Now in calls like Lists.newArrayList(myIterator), a new byte array always gets allocated, so everything works. In loops like
for (ByteArray a : myByteArrayIterable) {
    a.allowReuse();
    process(a.getData());
}
the buffer gets reused. No harm can result unless you call allowReuse() by mistake. If you forget to call it, you get worse performance but correct behavior.
Now I see it could work without ByteArray; the important thing is that myByteArrayIterable.allowReuse() gets called, which could be done directly.
It is often argued that avoiding object creation (especially in loops) is good practice.
So what is most efficient regarding StringBuffer?
StringBuffer sb = new StringBuffer();
ObjectInputStream ois = ...;

for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 10; j++) {
        sb.append(ois.readUTF());
    }
    ...

    // Which option is the most efficient?
    sb = new StringBuffer();   // a new StringBuffer instance?
    sb.delete(0, sb.length()); // or deleting the content?
}
I mean, one could argue that creating an object is faster than looping through an array.
First, StringBuffer is thread-safe, which makes it slower than StringBuilder. StringBuilder is not thread-safe but is faster as a result. Finally, I prefer just setting the length to 0 using setLength:
sb.setLength(0)
This is similar to .delete(...) except that you don't really care about the length, and it's probably a little faster since it doesn't need to 'delete' anything. Creating a new StringBuilder (or StringBuffer) would be less efficient: any time you see new, Java is creating a new object and placing it on the heap.
Note: after looking at the implementations of .delete and .setLength: .delete sets length = 0, and .setLength sets everything to '\0', so you may get a little win with .delete.
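For example, the loop from the question could reuse a single builder like this (a sketch; assumes the same ObjectInputStream setup as above):

StringBuilder sb = new StringBuilder(1024); // pre-size to limit regrowth
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 10; j++) {
        sb.append(ois.readUTF());
    }
    // ... use sb ...
    sb.setLength(0); // clear for the next iteration without reallocating
}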
Just to amplify the previous comments:
From looking at the source, delete() always calls System.arraycopy(), but if the arguments are (0, count) it will call arraycopy() with a length of zero, which will presumably have no effect. IMHO this should be optimized out, since I bet it's the most common case, but no matter.
With setLength(), on the other hand, the call will increase the StringBuilder's capacity if necessary via a call to ensureCapacityInternal() (another very common case that should have been optimized out, IMHO) and then truncates the length as delete() would have done.
Ultimately, both methods just wind up setting count to zero.
Neither call does any iterating in this case. Both make an unnecessary function call. However, ensureCapacityInternal() is a very simple private method, which invites the compiler to optimize it nearly out of existence, so it's likely that setLength() is slightly more efficient.
I'm extremely skeptical that creating a new instance of StringBuilder could ever be as efficient as simply setting count to zero, but I suppose that the compiler might recognize the pattern involved and convert the repeated instantiations into repeated calls to setLength(0). But at the very best, it would be a wash. And you're depending on the compiler to recognize the case.
Executive summary: setLength(0) is the most efficient. For maximum efficiency, pre-allocate the buffer space in StringBuilder when you create it.
The delete method is implemented this way:
public AbstractStringBuilder delete(int start, int end) {
if (start < 0)
throw new StringIndexOutOfBoundsException(start);
if (end > count)
end = count;
if (start > end)
throw new StringIndexOutOfBoundsException();
int len = end - start;
if (len > 0) {
System.arraycopy(value, start+len, value, start, count-end);
count -= len;
}
return this;
}
As you can see, it doesn't iterate through the array.