Memory mapped files in Java: too many questions?

Memory mapped files are (according to the spec) largely dependent on the actual implementation of the OS, and a number of these unknown aspects are already explained in the javadoc. However, I have some additional questions and am not sure where to turn for answers.
Suppose application A maps a file to memory from position=0 to size=10.
I would assume the OS needs a contiguous piece of memory to map it? Or is this implementation dependent?
Now suppose we have an application B that maps from position=0 to size=11.
Are the first 10 bytes shared, or is it an entirely different mapping? This relates back to the contiguous memory question.
If we want to use mapped files for IPC, we need to know how the data is reflected in other applications: if B writes to the memory, does A see this?
However, as I read the spec, this depends on the OS, which makes it dangerous to use for general-purpose IPC, as it destroys portability, right?
Additionally suppose the OS does support it, so B writes to memory, A sees the change, what happens if we do this:
B.write("something");
A.write("stuff");
A.read();
What exactly will A read?
Or put otherwise:
How are the file pointers managed?
How does it work with concurrency, is there cross application locking?

You can assume that every operating system will perform the memory mapping in terms of blocks, which usually have a size that is either a power of two or a multiple of a power of two, and significantly larger than 11 bytes.
So regardless of whether you map from 0 to 10 or from 1 to 11, the underlying system will likely establish a mapping from 0 to the block size at some logical address X, which will be perfectly hidden from the Java programmer, as the returned ByteBuffer has its own address pointer and capacity and can always be adjusted so that, e.g., position 0 yields address X + 1. But whether the underlying system or Java’s MappedByteBuffer performs the necessary translation is not important.
Usually, operating systems will end up using the same physical memory block for a mapping of the same region of the same file, so it is a reasonable way of establishing IPC, but as you have already guessed, that is indeed OS dependent and not portable. Still, it might be useful if you make it optional, so that a user who knows their system supports it can enable it.
Regarding your question about the two writes: of course, if two applications write to the same location concurrently, the result is entirely unpredictable.
Mapping a file region is independent of locking, but you may use the file channel API to lock the region you are mapping to gain exclusive access.
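Putting that together, here is a minimal sketch of mapping and locking the same region, assuming a hypothetical shared file at /tmp/ipc.dat; the class name is made up for illustration:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedIpcSketch {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("/tmp/ipc.dat"); // hypothetical file shared between processes

        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map bytes 0..10 of the file. The OS rounds the mapping to page
            // granularity internally; the MappedByteBuffer hides that from you.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 11);

            // Locking is independent of mapping: take an exclusive lock on the
            // same region to coordinate with other processes using the same protocol.
            try (FileLock lock = channel.lock(0, 11, false)) {
                buffer.put(0, (byte) 42); // absolute write at position 0 of the region
                buffer.force();           // ask the OS to flush the change to the file
            }
        }
    }
}

Note that each MappedByteBuffer keeps its own position and limit; there is no shared "file pointer" between applications, which is why any cross-process protocol has to be designed explicitly.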

Related

Why does an array's memory in Java need to be divisible by 8 bytes?

While researching arrays in Java, I read that the memory usage consists of a 12-byte object header plus the storage for the elements of the particular data type, and that if the total is not divisible by 8 bytes, padding needs to be added. Why is this? I tried to search for this but did not find an answer.
https://study.com/academy/lesson/java-arrays-memory-use-performance.html#:~:text=The%20memory%20allocation%20for%20an,a%20multiple%20of%208%20bytes.
This was the website where I read about this.
The details you’re specifying here are specific to particular implementations of the JVM and are not generally required to be true. For example, as a project many years back I put together an implementation of a JVM written purely in JavaScript, where the implementation relied on underlying JS primitives and therefore couldn’t control the exact sizes of the objects being created. I have no idea how big the underlying array objects I allocated in JavaScript to represent a single Java array were, and I certainly didn’t manually pad them. :-)
As @cpvmrd mentioned, a specific implementation of the JVM might decide to pad things this way either for performance reasons (to align objects to a particular boundary to make 64-bit loads fast) or for reasons of processor alignment (some operations either lead to performance degradation or to bus errors if data of size N is loaded from an address that isn’t a multiple of N). But that’s up to the implementation to sort out.
It is for performance.
The processor reads memory in "chunks" of a definite size (a word). On a 64-bit CPU, memory is generally configured to return one 64-bit word per address access. Intel CPUs can perform accesses on non-word boundaries, but there is a performance penalty, as internally the CPU performs two memory accesses and a math operation to load one word.
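To make the padding rule concrete, here is a small sketch that computes the padded size from the figures given in the question (a 12-byte header and 8-byte alignment); real header sizes vary by JVM and flags, so treat the numbers as assumptions:

public class ArrayPaddingSketch {

    // Round the raw size (header + elements) up to the next multiple of
    // the alignment. Header size and alignment are assumptions here.
    static long paddedSize(int headerBytes, int elementBytes, int length, int alignment) {
        long raw = headerBytes + (long) elementBytes * length;
        return (raw + alignment - 1) / alignment * alignment;
    }

    public static void main(String[] args) {
        System.out.println(paddedSize(12, 4, 3, 8)); // int[3]:  12 + 12 = 24, already aligned
        System.out.println(paddedSize(12, 1, 3, 8)); // byte[3]: 12 + 3  = 15, padded to 16
    }
}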

Most secure way to load sensitive information into protocol buffers

My application uses Google protocol buffers to send sensitive data between client and server instances. The network link is encrypted with SSL, so I'm not worried about eavesdroppers on the network. I am worried about the actual loading of sensitive data into the protobuf because of memory concerns explained in this SO question.
For example:
Login login = Login.newBuilder()
        .setPassword(password) // problem
        .build();
Is there no way to do this securely since protocol buffers are immutable?
Protobuf does not provide any option to use char[] instead of String. On the contrary, Protobuf messages are intentionally designed to be fully immutable, which provides a different kind of security: you can share a single message instance between multiple sandboxed components of a program without worrying that one may modify the data in order to interfere with another.
In my personal opinion as a security engineer -- though others will disagree -- the "security" described in the SO question to which you link is security theater, not actually worth pursuing, for a number of reasons:
If an attacker can read your process's memory, you've already lost. Even if you overwrite the secret's memory before discarding it, if the attacker reads your memory at the right time, they'll find the password. But, worse, if an attacker is in a position to read your process's memory, they're probably in a position to do much worse things than extract temporary passwords: they can probably extract long-lived secrets (e.g. your server's TLS private key), overwrite parts of memory to change your app's behavior, access any and all resources to which your app has access, etc. This simply isn't a problem that can be meaningfully addressed by zeroing certain fields after use.
Realistically, there are too many ways that your secrets may be copied anyway, over which you have no control, making the whole exercise moot:
Even if you are careful, the garbage collector could have made copies of the secret while moving memory around, defeating the purpose. To avoid this you probably need to use a ByteBuffer backed by non-managed memory.
When reading the data into your process, it almost certainly passes through library code that doesn't overwrite its data in this way. For example, an InputStream may do internal buffering, and probably doesn't zero out its buffer afterwards.
The operating system may page your data out to swap space on disk at any time, and is not obliged to zero that data afterwards. So even if you zero out the memory, it may persist in swap. (Encrypting swap ensures that these secrets are effectively gone when the system shuts down, but doesn't necessarily protect against an attacker present on the local machine who is able to extract the swap encryption key out of the kernel.)
Etc.
So, in my opinion, using mutable objects in Java specifically to be able to overwrite secrets in this way is not a useful strategy. These threats need to be addressed elsewhere.
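For completeness, if you still wanted to attempt the overwrite approach despite these caveats, the usual technique is a direct ByteBuffer that you zero after use, since it lives outside the managed heap; a minimal sketch (the literal value is for demonstration only):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class WipeSketch {
    public static void main(String[] args) {
        // Direct buffers are not moved around by the garbage collector,
        // so no hidden copies are created during heap compaction.
        ByteBuffer secret = ByteBuffer.allocateDirect(64);
        secret.put("hunter2".getBytes(StandardCharsets.UTF_8)); // demo value only

        // ... hand the buffer to whatever needs the secret ...

        // Best-effort wipe: overwrite every byte before the buffer is released.
        // As argued above, copies may already exist elsewhere (I/O buffers, swap).
        for (int i = 0; i < secret.capacity(); i++) {
            secret.put(i, (byte) 0);
        }
    }
}

Note that the String literal itself still lands on the managed heap, which rather illustrates the point: the copies are very hard to chase down.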

Processing a large (GB) file, quickly and multiple times (Java)

What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - too large an expansion (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, whereas FileChannel#map() cannot map more than 2^31-1 bytes, since a MappedByteBuffer is indexed by an int.
However, what you can do is wrap a FileChannel in a class and create several "map ranges" covering the whole file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I had to handle the decoding process to boot, and the decoded text must be held in memory, unlike your case, where you read raw bytes. Less complicated because I had a defined JDK interface to implement and you don't.
You can, however, use nearly the same technique with Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in the project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be defining that interface; I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day; largetext is quite an exciting project, and this looks like the same kind of thing, except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory would create such mappings with only a small part of the file in memory and the rest written to a file; such an implementation would also use the principle of locality: the most recently queried parts of the file would be kept in memory for faster access.
EDIT: I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of ByteBuffer.allocateDirect(), except that it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
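A minimal sketch of the "map ranges" idea, splitting a file larger than Integer.MAX_VALUE bytes across several read-only mappings (the chunk size and class name are made up for illustration):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LargeByteMappingSketch {
    private static final long CHUNK = 1L << 30; // 1 GiB per mapping, an arbitrary choice
    private final MappedByteBuffer[] chunks;

    public LargeByteMappingSketch(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            chunks = new MappedByteBuffer[(int) ((size + CHUNK - 1) / CHUNK)];
            for (int i = 0; i < chunks.length; i++) {
                long pos = i * CHUNK;
                long len = Math.min(CHUNK, size - pos);
                // The mappings stay valid even after the channel is closed.
                chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
            }
        }
    }

    // Read one byte at an arbitrary long offset; multi-byte reads crossing
    // a chunk boundary would need extra care.
    public byte get(long offset) {
        return chunks[(int) (offset / CHUNK)].get((int) (offset % CHUNK));
    }
}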
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data, but could also mangle the data using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program would call the dictionary, and the dictionary would refer you to the relevant part of the big fat file.

Optimising Java objects for CPU cache line efficiency

I'm writing a library where:
It will need to run on a wide range of different platforms / Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64 bit machines with Windows or Linux)
Achieving high performance is a priority, to the extent that I care about CPU cache line efficiency in object access
In some areas, quite large graphs of small objects will be traversed / processed (let's say around 1GB scale)
The main workload is almost exclusively reads
Reads will be scattered across the object graph, but not totally randomly (i.e. there will be significant hotspots, with occasional reads to less frequently accessed areas)
The object graph will be accessed concurrently (but not modified) by multiple threads. There is no locking, on the assumption that concurrent modification will not occur.
Are there some rules of thumb / guidelines for designing small objects so that they utilise CPU cache lines effectively in this kind of environment?
I'm particularly interested in sizing and structuring the objects correctly, so that e.g. the most commonly accessed fields fit in the first cache line etc.
Note: I am fully aware that this is implementation dependent, that I will need to benchmark, and of the general risks of premature optimization. No need to waste any further bandwidth pointing this out. :-)
A first step towards cache line efficiency is to provide for referential locality (i.e. keeping your data close together). This is hard to do in Java, where almost everything is heap allocated and accessed by reference.
To avoid references, the following might be obvious:
have non-reference types (i.e. int, char, etc.) as fields in your objects
keep your objects in arrays
keep your objects small
These rules will at least ensure some referential locality when working on a single object and when traversing the object references in your object graph.
Another approach might be to not use objects for your data at all, but to have global non-reference-typed arrays (of the same size) for each item that would normally be a field in your class; each instance would then be identified by a common index into these arrays.
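A minimal sketch of that parallel-arrays layout; the field names are made up for illustration:

public class ParticleStore {
    // One primitive array per logical field; an instance is just an index.
    private final float[] x;
    private final float[] y;

    public ParticleStore(int capacity) {
        x = new float[capacity];
        y = new float[capacity];
    }

    // A hot read walks contiguous memory instead of chasing references,
    // so each cache line fetched carries more useful data.
    public float distanceSquared(int i, int j) {
        float dx = x[i] - x[j];
        float dy = y[i] - y[j];
        return dx * dx + dy * dy;
    }
}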
Then, for optimizing the size of the arrays or chunks thereof, you have to know the MMU characteristics (page/cache size, number of cache lines, etc.). I don't know if Java provides this in the System or Runtime classes, but you could pass this information as system properties on startup.
Of course this is totally orthogonal to what you should normally be doing in Java :)
Best regards
You may need information about the various caches of your CPU; you can access it from Java using Cachesize (currently supporting Intel CPUs). This can help in developing cache-aware algorithms.
Disclaimer: author of the lib.

Clojure: Java FileSystem Primitives for Implementing Database Persistence

Context
This is purely for education purposes. I want to write a primitive database. Focus is NOT on performance; but just the principles behind databases. I have material already on locking / mutexes / transactions. What I know nothing about is writing to disk / guaranteeing persistence in unexpected hardware (say power) failures.
In order to have proper recovery / persistence, I need certain guarantees when writing files to disk.
Question:
For the above purposes, what kinds of file primitives do I need (guarantees that a file is written to disk? keeping a file open and appending to a log?), and what does the JVM offer?
Thanks!
It's a huge area to talk about because of the many layers of abstraction surrounding disks these days. From the JVM's perspective, though, you pretty much depend on fsync to actually write your bits to disk: once fsync returns, you depend on those bits being on the disk. The rest is built on top of this.
To force data to be written to disk before a write call returns, you must use a FileChannel and call force().
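A minimal sketch of a durable append, assuming a hypothetical log file name:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurableAppendSketch {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("wal.log"); // hypothetical write-ahead log

        try (FileChannel channel = FileChannel.open(log,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer record = ByteBuffer.wrap("record-1\n".getBytes(StandardCharsets.UTF_8));
            while (record.hasRemaining()) {
                channel.write(record); // may return before the bytes reach the device
            }
            // force(true) flushes content and metadata to the storage device
            // (the fsync analogue); only after it returns may the database
            // consider the record durably committed.
            channel.force(true);
        }
    }
}

Alternatively, opening the channel with StandardOpenOption.SYNC (or DSYNC for content only) makes every write synchronous, at a corresponding cost in throughput.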
