I need to share data between two Java applications running on the same machine (two different JVMs). To be clear, the data to be shared is large (about 7 GB). The applications must access the data very quickly because they have to answer incoming queries at a very high rate, and I don't want each application to hold its own copy of the data.
I've seen that one option is to use memory-mapped files. Application A gets the data from somewhere (let's say a database) and stores it in files. Then application B can access these files using java.nio. I don't know exactly how memory-mapped files work; I only know that the data is stored in a file and that this file (or part of it) is mapped to a region of memory (virtual memory?). So the two applications can read and write the data in memory, and the changes are (I assume?) automatically committed to the file. I also don't know whether there is a maximum size for a file to be mapped entirely into memory.
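For concreteness, here is roughly what I understand the java.nio API to look like (an untested sketch; the file name and sizes are made up):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedSketch {
        public static void main(String[] args) throws Exception {
            // Writer side (application A): map a region of the file and write into it.
            try (RandomAccessFile file = new RandomAccessFile("shared.dat", "rw");
                 FileChannel channel = file.getChannel()) {
                // A single map() call is limited to Integer.MAX_VALUE bytes (~2 GB),
                // so a 7 GB data set would have to be split across several mapped regions.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
                buffer.putLong(0, 42L);  // visible to other processes mapping the same file
                buffer.force();          // ask the OS to flush dirty pages back to the file
            }

            // Reader side (application B): map the same file and read directly.
            try (RandomAccessFile file = new RandomAccessFile("shared.dat", "r");
                 FileChannel channel = file.getChannel()) {
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024);
                System.out.println(buffer.getLong(0));
            }
        }
    }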
My first question is: what are the different possibilities for two applications to share data in this scenario, taking into account that the amount of data is very large and that access to it must be very fast? To be clear, this question is not limited to memory-mapped I/O; I just want to know what other ways there are to solve the same problem.
My second question is what are the pros and cons of using memory-mapped files?
Thanks
My first question is what are the different possibilities for two applications to share data?
As S.Lott points out, there are a lot of mechanisms:
OS-level message queues
OS-level POSIX shared memory segments (persist after process death)
OS-level memory mappings (could be anonymous or file-backed)
OS-level anonymous pipes (unidirectional)
OS-level named pipes (unidirectional)
OS-level sockets (bidirectional) -- whether AF_UNIX or AF_INET or AF_INET6
OS-level shared global memory -- suitable for multi-threaded programs
Storing data in files
Application-level message queues
Application-level blackboard-style tuplespaces
Application-level key/value stores
Application-level remote procedure call frameworks -- many are available
Application-level web-based frameworks
My second question is what are the pros and cons of using memory-mapped files?
Pros:
very fast -- depending upon how you access the data, potentially zero-copy mechanisms can be used to operate directly on the data with no speed penalties. Care must be taken to update objects in a consistent manner.
should be very portable -- available on Unix systems for probably 25 years (give or take), and apparently Windows has mechanisms too.
Cons:
Single-system sharing. If you want to distribute your application over multiple machines, shared memory isn't a great option. Distributed shared memory systems are available, but they feel very much like the wrong interface to my way of thinking.
Even on a single system, if the memory is located on a single NUMA node but needs to be accessed by processors on other nodes, the inter-node requests may significantly slow processing compared to giving each node its own segment of the memory.
You can't just store pointers -- everything must be stored as offsets from base addresses, because the memory may be mapped at different locations in different processes. I have no idea what this means for Java objects, though presumably someone smart did their best to make it transparent to Java programmers. If you're not using their provided mechanisms, then you probably have to do the work yourself. (Without actual pointers in Java, perhaps this is not very onerous; a small sketch of such an offset-based layout follows this list.)
Updating objects consistently has proven to be very difficult. Passing immutable objects in message-passing systems instead generally results in programs with fewer concurrency bugs. (Concurrent programming in Erlang feels very natural and straightforward; concurrent programming in more imperative languages tends to introduce a huge pile of new concurrency controls: semaphores, mutexes, spinlocks, monitors.)
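For what it's worth, here is a rough sketch of the offset-based layout mentioned above, done over a Java MappedByteBuffer (the record format is purely illustrative):

    import java.nio.MappedByteBuffer;

    // Illustrative layout: a "node" is stored as [long value][int nextOffset], where
    // nextOffset is a byte offset from the start of the mapping (-1 = end of chain),
    // never an absolute address, so every process can follow the chain no matter
    // where the OS happened to map the file in its own address space.
    final class OffsetList {
        static final int NODE_SIZE = Long.BYTES + Integer.BYTES;

        static void writeNode(MappedByteBuffer buf, int offset, long value, int nextOffset) {
            buf.putLong(offset, value);
            buf.putInt(offset + Long.BYTES, nextOffset);
        }

        static long sum(MappedByteBuffer buf, int firstOffset) {
            long total = 0;
            for (int off = firstOffset; off != -1; off = buf.getInt(off + Long.BYTES)) {
                total += buf.getLong(off);
            }
            return total;
        }
    }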
Memory-mapped files sound like a headache. A simpler and less error-prone option would be to use a shared database with a cluster-aware cache. That way only writes go down to the database and reads can be served from the cache.
As an example of how to do this in Hibernate, see http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html#performance-cache
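Very roughly, the entity side of that setup could look like the sketch below. The entity, cache strategy and properties are illustrative assumptions; the actual cache provider (EhCache, Infinispan, etc.) is configured separately in hibernate.cfg.xml or persistence.xml:

    // Assumed provider settings (in hibernate.cfg.xml / persistence.xml):
    //   hibernate.cache.use_second_level_cache = true
    //   hibernate.cache.use_query_cache        = true
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;
    import javax.persistence.Entity;
    import javax.persistence.Id;

    @Entity
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY) // data is write-once, read-many
    public class CachedRecord {
        @Id
        private Long id;
        private String payload;
        // getters and setters omitted
    }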
Related
I suppose this is not possible, but I am looking for the best way to separate the different layers of my service while still being able to access the layers quickly, without the overhead of IPC/RMI.
The main programming language I am using is Java, but I can use C++ if required.
What we have right now is a server that hosts the database and access control, and we use RMI for consumers to request data. This is slow and doesn't scale very well.
We need performance and scalability, which we don't have at the moment.
What we are thinking of is a layered architecture with the database at the base and access control on top of it, along with a notification bus to notify clients of changes in the database.
The main problem is the overhead of communication, which we want to avoid or minimize.
Is there any magic thread that can run in two contexts (switching context) and share information that way? I know the short answer is probably no, but what are the options?
Update
We are currently using Java RMI.
Our base layer will provide an API that can be used to create plugins that run on top of it, so it is not a fixed set of collectors/consumers that we have. We can have 5-6 collectors running and the same number of consumers.
We can have up to 1000 consumers.
My first suggestion is that you should buy a book (or find an online tutorial) on building scalable applications, because you seem to be pretty lost.
Sharing a thread between processes doesn't make sense at any level, but you can share the data that the thread accesses, which is probably what you want.
The fastest method will be C-based IPC (e.g. shared memory, semaphores, etc.; see shmget). You say you want to avoid the overhead of IPC, but really, it isn't going to get any faster than that.
But why do you want multiple processes? If you are worried about the overhead of communicating between processes, just put your threads in one process. There is no reason your different layers have to be in different processes.
But anyway, I am not convinced that your original statement that RMI is slow and doesn't scale is correct. If it is not scaling, you are probably not using the right framework. Maybe the issue is that you have only one RMI endpoint on the server. Have you considered a J2EE system with stateless session beans?
Without knowing about your requirements, it is hard to say.
It is not possible in general to share a thread between two processes, due to OS design. The problem of sharing data between two or more processes is usually solved by sharing files, sharing a database, or sharing messages (which in turn can be synchronous or asynchronous), by having processes communicate via pipes (say, on Linux), or even by sharing memory. Your scenario description is not very precise; you need to describe all the processes and how information is supposed to flow, what triggers the information flow, etc.
Most likely you need a high-performance messaging library; https://github.com/real-logic/Aeron/ is one. But to get a precise answer you would need to describe better exactly what overhead you want to minimize.
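Assuming Aeron fits, a publisher and subscriber over its IPC channel could look roughly like this (an untested sketch; the stream id, message and embedded media driver are just for illustration):

    import io.aeron.Aeron;
    import io.aeron.Publication;
    import io.aeron.Subscription;
    import io.aeron.driver.MediaDriver;
    import org.agrona.concurrent.UnsafeBuffer;
    import java.nio.ByteBuffer;

    public class AeronIpcSketch {
        public static void main(String[] args) throws Exception {
            // Embedded media driver for the sketch; in production it usually runs as its own process.
            try (MediaDriver driver = MediaDriver.launchEmbedded();
                 Aeron aeron = Aeron.connect(
                         new Aeron.Context().aeronDirectoryName(driver.aeronDirectoryName()));
                 Publication pub = aeron.addPublication("aeron:ipc", 10);
                 Subscription sub = aeron.addSubscription("aeron:ipc", 10)) {

                UnsafeBuffer msg = new UnsafeBuffer(ByteBuffer.allocateDirect(64));
                int length = msg.putStringAscii(0, "notification");
                while (pub.offer(msg, 0, length) < 0) {
                    Thread.yield();  // back-pressure or not yet connected: retry
                }

                // Poll until the single fragment arrives.
                while (sub.poll((buffer, offset, len, header) ->
                        System.out.println(buffer.getStringAscii(offset)), 1) == 0) {
                    Thread.yield();
                }
            }
        }
    }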
If your goal is to notify users, you should consider publish/subscribe messaging (pub/sub). There are many middleware vendors out there that provide this architecture though most are expensive in production scenarios. For open source, check out http://redis.io/topics/pubsub. (No affiliation.)
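A rough sketch of what that could look like from Java, assuming the Jedis client (channel name and host are made up; in a real system the subscriber would be running before anything is published):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPubSub;

    public class RedisNotifySketch {
        public static void main(String[] args) throws Exception {
            // Subscriber (runs in each consumer process); subscribe() blocks,
            // so it lives on its own thread here.
            new Thread(() -> {
                Jedis jedis = new Jedis("localhost", 6379);
                jedis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        System.out.println("change notification: " + message);
                    }
                }, "db-changes");
            }).start();

            Thread.sleep(500); // crude: give the subscriber time to register

            // Publisher (runs wherever the database is updated).
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.publish("db-changes", "row 42 updated");
            }
        }
    }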
We are looking for a shared memory mechanism to transfer large amounts of data between processes in Java, without copying. It has to be portable (including Windows). Is there such a thing? We were thinking about using mmap-ed files, as they are portable; however, their contents are written to disk, which is not desirable. Are there alternatives?
Otherwise, Windows has page-file-backed sections; is there an easy way to use these from Java? We are probably OK with using some other shared memory mechanism on *nix and those on Windows.
There are a couple of solutions in OpenHFT: Chronicle, a rolling queue which can be read from and written to concurrently, and SharedHashMap, which is entirely off heap. On Linux you can "write" to a tmpfs filesystem, and on Windows you can use a RAM drive.
These libraries support replication between machines over TCP. (The replication for Chronicle has been used for a while, but for SharedHashMap, replication is still in beta.)
While this is portable across OSes, it uses some features internal to OpenJDK/HotSpot JVMs and so it doesn't work on the IBM JVM AFAIK.
Note: the libraries support reading and writing data using a form of serialization which doesn't create garbage, or using in-place, off-heap data structures, i.e. you don't need to deserialize a whole object to access a portion of it.
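As an illustration only, the sketch below uses the newer ChronicleMap builder API (the successor to SharedHashMap; the exact builder calls may differ between versions), with the backing file placed on tmpfs so nothing is written to disk. The file path, sizes and names are made up:

    import net.openhft.chronicle.map.ChronicleMap;
    import java.io.File;

    public class OffHeapSharedMapSketch {
        public static void main(String[] args) throws Exception {
            // Both JVMs run this same code against the same backing file; the map
            // lives in the shared mapping, entirely off the Java heap.
            ChronicleMap<String, String> shared = ChronicleMap
                    .of(String.class, String.class)
                    .name("shared-cache")
                    .averageKey("some-key")
                    .averageValue("some fairly typical value")
                    .entries(1_000_000)
                    .createPersistedTo(new File("/dev/shm/shared-cache.dat")); // tmpfs: no disk I/O

            shared.put("some-key", "visible to the other JVM immediately");
            System.out.println(shared.get("some-key"));
        }
    }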
I need to implement a disk-backed queue which can accept real-time profiling data from multiple threads and then upload that data over potentially faulty transports. Initially targeted at Java, but long-term we will need to use the same mechanism in Objective-C, Flash, and JavaScript. Targeted at Android Java as well as desktop.
This will be contained within a single process, so an MQ solution is probably out. Performance is a significant consideration, meaning we'd trade some reliability for performance.
I'm curious about two things:
Given the above architecture, is there any available technology that'll completely or partially solve this problem?
Given the goal of eventually re-implementing or ideally re-using this mechanism in different platforms, is there any way to build this in a way that can be easily used in say both Objective-C & Android Java?
How's this architecture look?
If you want to keep a limited amount of data (a circular log) and can reserve a fixed amount of persistent memory for it, the most effective solution is memory-mapped buffers. The persister is simply a cache of several buffers, serving both the profiling queue and the uploader.
When reimplementing this on other platforms, chances are the platform has no mapping facility. In that case, the buffers can be read and written directly. This can be less efficient than mapping into memory, but still no less efficient than other solutions (e.g. an embedded database).
As for the architecture, the picture does not reflect the case where data is read back from the persister (otherwise, what is the persister for?). The profiling queue then actually embraces the whole data set (including the persistent part), and what is labelled the profiling queue is really the buffers in main memory; they need not be contiguous, so a better name would be buffer cache rather than queue.
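To make the circular-log idea concrete, here is a minimal, illustrative sketch of fixed-size records in a memory-mapped file (no crash-consistency handling; the record format and header layout are made up). On a platform without a mapping facility, append() would simply do a positioned write to the file instead:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Circular log of fixed-size records in a memory-mapped file. The first 8 bytes
    // hold the running write count; records follow. Oldest records are silently
    // overwritten once the buffer is full.
    public class MappedRingLog implements AutoCloseable {
        private static final int HEADER = Long.BYTES;
        private final FileChannel channel;
        private final MappedByteBuffer buf;
        private final int slotSize;
        private final int slotCount;

        public MappedRingLog(String path, int slotSize, int slotCount) throws Exception {
            this.slotSize = slotSize;
            this.slotCount = slotCount;
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            this.channel = file.getChannel();
            this.buf = channel.map(FileChannel.MapMode.READ_WRITE, 0,
                    HEADER + (long) slotSize * slotCount);
        }

        public synchronized void append(byte[] record) {
            if (record.length > slotSize) throw new IllegalArgumentException("record too large");
            long written = buf.getLong(0);
            int slot = (int) (written % slotCount);
            buf.position(HEADER + slot * slotSize);
            buf.put(record);
            buf.putLong(0, written + 1); // publish the new write count last
        }

        @Override
        public void close() throws Exception {
            buf.force();     // flush dirty pages before shutdown
            channel.close();
        }
    }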
I have some maps that contain cached data from a DB. Currently 5 instances of the same server are running on the same machine in different JVMs. How can I share the maps between the JVMs? The cache is write-once, read-many. The problem is that, because of this cache, the JVM footprint is very big, so storing the map in every JVM consumes a lot of memory. I need a solution that does not consume much CPU time. Is there a way to do this the same way class sharing is done between JVMs?
Thanks
Nikesh PL
Basically, you can't: those are two different address spaces.
You could serialize one and read it from the other, but that wouldn't be like sharing them.
How about a process to manage the cache, and a quick, low-bandwidth interface that your application programs can use to access the data?
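As a rough illustration of that idea (the protocol, port and key are made up, and a production version would need connection pooling, timeouts, and so on):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Holds the cache once; the other JVMs send "GET <key>" over a local socket.
    public class CacheServerSketch {
        private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

        public static void main(String[] args) throws Exception {
            CACHE.put("answer", "42"); // loaded from the DB in reality
            try (ServerSocket server = new ServerSocket(9999)) {
                while (true) {
                    Socket client = server.accept();
                    new Thread(() -> handle(client)).start();
                }
            }
        }

        private static void handle(Socket client) {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("GET ")) {
                        out.println(CACHE.getOrDefault(line.substring(4), "NOT_FOUND"));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }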
Why don't you look at Coherence, a project from Oracle? It's not free, but you can download and test it for free on a development system. It does precisely what you are looking for: it is used as a cache for storing database data, but is ultimately a map of keys and values. It's pretty simple to set up and use. Here's a link to get you started:
http://download.oracle.com/docs/cd/E13924_01/coh.340/e14135.pdf
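Basic usage is roughly as below (the cache name is made up; the cluster and cache topology are configured in Coherence's XML files, not in code):

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    public class CoherenceSketch {
        public static void main(String[] args) {
            // Joins (or starts) the cluster according to the Coherence configuration on
            // the classpath; every JVM asking for the same cache name sees the same data.
            NamedCache cache = CacheFactory.getCache("db-cache");
            cache.put("customer:42", "cached row loaded from the database");
            System.out.println(cache.get("customer:42"));
            CacheFactory.shutdown();
        }
    }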
I am looking at possible technology choices for queues (or perhaps streams are a better description) in a JVM-based system.
Some requirements:
Must be accessible from the JVM / Java.
Queues must support sizes larger than the JVM heap, possibly bigger than all available RAM. Thus, support for utilizing the disk (or network) for storage is implied.
Queues do not currently need to be durable past the process lifetime.
Most uses of the queue will have a single producer and a single consumer. Concurrency for any particular queue is thus not an issue. (Obviously concurrency is important across queues.)
Queues are ad-hoc and temporary. They pop into existence, are filled, are drained, and go away.
Small queues should preferably stay in memory, then shift to slower storage based on resource availability. This requirement could be met by a layer above the queuing technology.
I am examining several options, but I am curious which options I am missing.
Use one of the available JMS implementations, for example ActiveMQ or Qpid from Apache.
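For example, with ActiveMQ the JMS side could look roughly like this (an untested sketch; the embedded vm:// broker and queue name are just for illustration -- the broker's persistence store is what lets queues grow beyond the heap):

    import javax.jms.Connection;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JmsQueueSketch {
        public static void main(String[] args) throws Exception {
            // vm:// runs an embedded broker inside this JVM; messages that don't fit
            // in memory are spilled to the broker's persistent store.
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("vm://localhost?broker.persistent=true");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("adhoc-queue");

            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage("first element"));

            MessageConsumer consumer = session.createConsumer(queue);
            TextMessage received = (TextMessage) consumer.receive(1000);
            System.out.println(received.getText());

            connection.close();
        }
    }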
I ran across this FIFO queue with spill to disk which is kind of interesting and has some of the properties I'm looking for:
http://code.google.com/p/ashes-queue/
I have considered using Terracotta's BigMemory as a tool for pushing queue data into direct memory and off-heap.
How about using Redis as a messaging queue? It operates in memory and can also persist its data to disk.
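A rough sketch using a Redis list as the queue, assuming the Jedis client (the key name and host are made up):

    import redis.clients.jedis.Jedis;
    import java.util.List;

    public class RedisQueueSketch {
        public static void main(String[] args) throws Exception {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Producer side: push onto the left of a Redis list.
                jedis.lpush("work-queue", "task-1");

                // Consumer side: blocking pop from the right (FIFO overall),
                // waiting up to 5 seconds for an element to arrive.
                List<String> item = jedis.brpop(5, "work-queue");
                if (item != null) {
                    System.out.println(item.get(1)); // item = [key, value]
                }
            }
        }
    }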
HSQLDB provides an in-process database engine where you can use RAM, the local disk or a network server to store the database. That might float your boat, especially if you want to seamlessly move to a network solution rather than the local disk later on. Transitioning from small to large queues would then involve moving data from one database to another. There are standard ways to do this, but they might be pretty slow.
The more I think about it, the more I think this is not a good match. For what it's worth, the in-memory DB is very fast in my experience.