Existing file-based implementation of java.util.Map - java

I'm working on a project that uses a custom Map<String, Entry> implementation (where Entry is a pair of ints) based on a B-tree to store between 10 and 100 million records; the code for this class is slow and dirty. I need an efficient implementation of the Map that uses a file for storage and a small amount of memory.
I searched and found that the Java Edition of Berkeley DB has a java.util.Collection API (including Map), but it seems superfluous to use a fully-fledged database for this purpose (it uses a directory with many files and has several additional threads for management). Is there a simpler solution?

I had this very same problem recently and looked at everything under the sun, including NoSQL and caches. What you want is a disk/file-backed hashmap.
Berkeley DB Java Edition is by far the best. It's fast, scalable, and complete, but you can't distribute it to clients without distributing your source code or buying the commercial version from Oracle.
The only other choice, besides reinventing the wheel, is JDBM2. It also has a hash map and a tree map. You are responsible for regularly flushing to disk to prevent an OutOfMemoryError, and it isn't nearly as fast as Berkeley DB, but it is a very good second choice.
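For a feel of how little code that involves, here is a minimal sketch in the JDBM2 style - class and method names are from memory and may differ slightly between JDBM2 versions, so treat it as illustrative rather than exact:

    import java.io.IOException;
    import jdbm.PrimaryHashMap;
    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;

    public class Jdbm2Sketch {
        public static void main(String[] args) throws IOException {
            // One set of backing files named "mapstore.*" holds the whole store.
            RecordManager recMan = RecordManagerFactory.createRecordManager("mapstore");
            PrimaryHashMap<String, Integer> map = recMan.hashMap("entries");

            map.put("some-key", 42);

            // Flush regularly; unflushed changes accumulate in memory.
            recMan.commit();
            recMan.close();
        }
    }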

Take a look at Kyoto Cabinet, a disk-backed DBM implementation. I've used the previous version, Tokyo Cabinet - it was dead easy to use, basically just like a native Map, and very fast.
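For reference, the Kyoto Cabinet Java binding looks roughly like the sketch below - this is from memory and the exact class/flag names may differ, so check the examples bundled with the binding:

    import kyotocabinet.DB;

    public class KyotoCabinetSketch {
        public static void main(String[] args) {
            DB db = new DB();
            // Single-file hash database, created if it does not exist.
            if (!db.open("records.kch", DB.OWRITER | DB.OCREATE)) {
                System.err.println("open error: " + db.error());
                return;
            }
            db.set("some-key", "some-value");
            String value = db.get("some-key");
            System.out.println(value);
            db.close();
        }
    }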

JDBM is a lightweight, pure Java B-Tree implementation.

Related

What difficulties should I expect if I write a NoSQL db using golang but want to run Hadoop mapreduce on it?

I would like to build a distributed NoSQL database or key-value store using golang, to learn golang and to practice the distributed-systems knowledge I've learned at school. The target use case I can think of is running MapReduce on top of it, and implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph and Amazon S3.
My question is, what difficulties should I expect when integrating such a NoSQL database with Hadoop? Or when integrating with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs) if I use golang to build the system?
Ok, I'm not much of a Hadoop user so I'll give you some more general lessons learned about the issues you'll face:
Protocol. If you're going with REST, Go will be fine, but expect to find some gotchas in the default HTTP library's defaults (not expiring idle keepalive connections, not necessarily knowing when a reader has closed a stream). But if you want something more compact, know that: (a) the Thrift implementation for Go, last I checked, was lacking and relatively slow; (b) Go has great support for RPC, but it might not play well with other languages. So you might want to check out protobuf, or work on top of the Redis protocol or something like that.
GC. Go's GC is very simplistic (stop-the-world, not generational, etc.). If you plan on heavy memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
Besides slices, lists and maps, you won't have a lot of built-in data structures to work with, like a Set type. There are tons of good implementations of them out there, but you'll have to do some digging.
Take the time to learn concurrency patterns and interface patterns in Go. It's a bit different than other languages, and as a rule of thumb, if you find yourself struggling with a pattern from other languages, you're probably doing it wrong. A good talk about Go concurrency is this one IMHO http://www.youtube.com/watch?v=QDDwwePbDtw
A few projects you might want to have a look at:
Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go. https://github.com/golang/groupcache and check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
InfluxDB, which includes a Go-based implementation of the Raft algorithm: https://github.com/influxdb/influxdb
My own humble (pretty dead) project, a redis compliant database that's based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the redis protocol. https://bitbucket.org/dvirsky/boilerdb

Is there any existing Java library that allows you to do fast, in-memory lookups of zipcodes (bonus, state and city) from latitude/longitude?

I have seen many so-called "reverse geocoding" libraries in various languages; all depend on calling an external provider via REST or some similar method. However, you cannot call a REST provider if you must handle thousands of requests per second.
On the other hand, the problem should be simple to solve - CSV-based databases are available for free with this information. The issue is the time and cost of writing an efficient and well-tested in-memory search implementation, versus downloading or buying an existing one.
I can't find one after a lot of looking, but I can't believe it doesn't exist. Is there any pre-written library that does this?
This question:
Fastest way to find the location(zip, city, state) given latitude/longitude
came the closest, but essentially indicates how to write the solution, not that there is anything available off the shelf. But there must be some library everyone uses for this. A dozen people a day must have this problem.
Spatial databases (e.g. PostgreSQL with PostGIS) use algorithms that make lookups by latitude/longitude fast. As you want a Java library that runs in memory, you could look at the H2 Spatial database. I have never used it, so I can't comment on its performance.
Edit: Hmm, looking closer at the link I provided, it turns out this is a planned feature... Personally I'd simply use PostgreSQL/PostGIS (with or without Java as a server frontend) and be done with it. If your server has enough memory it will still fit the "in-memory" requirement; naturally, it does not fit the Java library requirement. There is, however, JSI, which can be used in memory and from Java.
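If nothing off the shelf turns up, the in-memory part is not huge either. Below is a naive, hand-rolled sketch of a reverse-geocode lookup over a free zip-centroid CSV (the CSV layout and class names are made up for illustration); a spatial index such as JSI's R-tree would replace the linear scan for real throughput:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class NaiveReverseGeocoder {
        private static class ZipCentroid {
            final String zip; final double lat; final double lon;
            ZipCentroid(String zip, double lat, double lon) {
                this.zip = zip; this.lat = lat; this.lon = lon;
            }
        }

        private final List<ZipCentroid> centroids = new ArrayList<>();

        // Expects lines of the form "zip,lat,lon" (hypothetical layout).
        public NaiveReverseGeocoder(String csvPath) throws IOException {
            for (String line : Files.readAllLines(Paths.get(csvPath))) {
                String[] p = line.split(",");
                centroids.add(new ZipCentroid(p[0],
                        Double.parseDouble(p[1]), Double.parseDouble(p[2])));
            }
        }

        // Linear scan over the centroids; an R-tree makes this a log-time query.
        public String nearestZip(double lat, double lon) {
            ZipCentroid best = null;
            double bestDist = Double.MAX_VALUE;
            for (ZipCentroid c : centroids) {
                double dLat = c.lat - lat;
                double dLon = (c.lon - lon) * Math.cos(Math.toRadians(lat));
                double d = dLat * dLat + dLon * dLon; // approximate, good enough for ranking
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best == null ? null : best.zip;
        }
    }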

Comparison of NoSQL Databases for Java

I want to find out more about NoSQL databases/data-stores available for use from Java, and so far I have tried out Project Voldemort. Except for the awfully chosen name, it seems fine so far.
I'd like to find out more about other such database systems. The Wikipedia article has a list of some of them, and there is some documentation on their project pages.
However, instead of comparing technical specs and tutorials provided by authors, what I would like to know is:
What are your experiences with working with these libraries on real projects? Which one would you recommend for use based on that experience, which one you wouldn't and why?
I know that the only people able to answer this question are those who have actually used more than one such database, but I hope that someone has done so.
EDIT:
By "real project" I primarily mean a project in production (but in absence of these anything larger than a homework or finished tutorial applies).
I worked with a relational database that had enormous amount of data in it, most of it concentrated in a single table, which was denormalized for performance anyway. But, because of the entire mess with constraints etc, creating a usable cluster had shown horrible results in both stability and performance.
Now, I'm quite sure that most likely any of these NoSQL systems would be a better choice then what I had at disposal. But, there has to be a difference between them, too. Whether it is in documentation, stability between versions, community, ease of use, whatever... And there are many giants. Which ones shoulders to choose? :D
We have been working with HBase for our projects. Our experience is -
The community is very dynamic and extremely helpful
The installation procedure for developers is quite easy in either pseudo distributed or standalone mode
We have been using it for integration tests, like unit tests
Installing a cluster is also easy, but compared to some other NoSQL stores it has more components to install
Administering it is still ongoing, so I can't say much about it yet
Do not use it for SQL-like SELECT queries; for that we are using Apache Solr
To make development and testing easier we have come up with a simple object mapper - https://github.com/smart-it/smart-dao
The reason I chose HBase is that, like other NoSQL stores, it solves sharding and scaling by design, making things easier in the long run, and that seems to hold well.
Maybe the most prominent of the Java NoSQL solutions is Cassandra. It has some features beyond Voldemort (an order-preserving partitioner which allows range queries; a BigTable-style structure for values) and is missing others (no alternate storage backends or version clocks for versioning).
Its performance is optimized more for fast writes, but its biggest strength is probably the ease with which it can be horizontally scaled by adding new nodes (something where Voldemort is a bit more static).
Compared to, say, MongoDB, its data model is quite simple and often there's no point in using much more than key/value abstraction (that is, handle data mapping on client side, store serialized objects).
It has full replication and distribution, unlike some k/v stores (couchdb, from what I understand).
It's pretty difficult to nail down a good choice without knowing exactly what your use case is. Much of it depends on what kind of data model you are comfortable with and what fits your needs. You have key-value stores, document-oriented, column-oriented, etc. Another huge factor is each product's take on scaling and how it chooses to deal with availability/consistency trade-offs.
I like MongoDB. I like how it supports queries and I like the document oriented data models. It fits many problems that I seem to run into. There is a Great (with capital G) community as seen at the recent MongoSV event.
Your best bet is to pick 3 different products and evaluate them. I would also see if you can find some companies who have presented at conferences and told their stories of how they were successful. Videos from MongoSV will be available soon.

Choosing between Berkeley DB Core and Berkeley DB JE

I'm designing a Java-based web app and I need a key-value store. Berkeley DB seems fitting enough for me, but there appear to be TWO Berkeley DBs to choose from: Berkeley DB Core, which is implemented in C, and Berkeley DB Java Edition, which is implemented in pure Java.
The question is how to choose which one to use. With web apps, scalability and performance are quite important (who knows, maybe my idea will become the next YouTube), and I couldn't easily find any meaningful benchmarks between the two. I have yet to familiarize myself with Core's Java API, but I find it hard to believe it could be much worse than Java Edition's, which seems quite nice.
If some other key-value store would be much better, feel free to recommend that too. I'm storing smallish binary blobs, and keys probably will be hashes of the data, or some other unique id.
I have quite a bit of experience using both BDB-JE and BDB-core with Java. Deciding which one to use is quite simple: If you want concurrency, use BDB-JE. If you want scalability, use BDB-core.
BDB-JE breaks down performance-wise with large databases due to its file format and its reliance on Java garbage collection to clean up evicted cache entries. Expect long garbage collection pauses or spend a lot of time tuning magic GC settings. The file format has issues too, because the background cleaner threads have to spend a lot of time cleaning up garbage created by early cache evictions. If your database fits in RAM, BDB-JE works quite well.
BDB-core relies on a page-locking strategy, and highly concurrent applications experience a lot of deadlocks. If you can randomly order operations it reduces the deadlock potential, but it never eliminates it. Because BDB-core stores data in a more traditional way, it scales to super large sizes with predictable and expected performance degradation. Because its cache is not managed by a garbage collector, it can be quite large and not cause any pauses.
If you derive a common interface to these, and have a suitable set of unit tests, you should be able to swap between the two trivially at a later date (perhaps when you really need to make a decision based on hard facts that are not available right now).
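A sketch of what such a seam could look like (the interface is mine, not from either product); each Berkeley DB flavour then gets a small adapter implementing it, and the unit tests run against the interface:

    // Minimal key-value abstraction so BDB-JE, BDB-core or anything else
    // can be swapped behind one seam. Names are illustrative only.
    public interface BlobStore extends AutoCloseable {
        void put(byte[] key, byte[] value);
        byte[] get(byte[] key);      // null if the key is absent
        void delete(byte[] key);
        @Override
        void close();
    }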
I faced the same problem and decided to go with the Java edition, mainly because of its portability (I need something that would run even on mobile devices). There is also the Direct Persistence Layer (DPL) API, and the fact that the whole DB is a single jar makes its deployment fairly simple.
The recent version 4 brought in high availability and performance improvements. There is also the fact that long-running Java applications can be optimized to the point where they surpass the performance of native C applications in some scenarios.
It's a natural fit for any Java application - desktop or web.
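To give a flavour of the DPL API mentioned above, here is a minimal sketch (the environment directory and entity fields are my own placeholders, and the exact config setters may vary a little between JE versions):

    import java.io.File;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.persist.EntityStore;
    import com.sleepycat.persist.PrimaryIndex;
    import com.sleepycat.persist.StoreConfig;
    import com.sleepycat.persist.model.Entity;
    import com.sleepycat.persist.model.PrimaryKey;

    public class DplSketch {
        @Entity
        static class Blob {
            @PrimaryKey
            String key;
            byte[] data;
        }

        public static void main(String[] args) throws Exception {
            File dir = new File("bdb-env");
            dir.mkdirs();                      // JE expects the environment directory to exist

            EnvironmentConfig envCfg = new EnvironmentConfig();
            envCfg.setAllowCreate(true);
            Environment env = new Environment(dir, envCfg);

            StoreConfig storeCfg = new StoreConfig();
            storeCfg.setAllowCreate(true);
            EntityStore store = new EntityStore(env, "blobs", storeCfg);

            // Type-safe index over the @PrimaryKey field.
            PrimaryIndex<String, Blob> byKey =
                    store.getPrimaryIndex(String.class, Blob.class);

            Blob b = new Blob();
            b.key = "abc";
            b.data = new byte[] {1, 2, 3};
            byKey.put(b);
            Blob loaded = byKey.get("abc");

            store.close();
            env.close();
        }
    }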
A while ago I had the same question; after doing some benchmarks I found that hash mode in the native edition is much faster and more storage-efficient than anything the Java edition has to offer, so I decided to go with the native implementation.
I suggest you do your own benchmarks for the storage capacities you expect and decide if the Java edition is fast enough.
If it is, or if performance is not a big issue for you (it's critical for me), just go with the Java edition. Otherwise go with the native one (assuming you see the same performance boost for your own use case).
By the way:
My benchmark tested the speed of querying random keys out of 20,000,000 records, where the key is a string and the value is an int (4 bytes).
I saw that inserts (populating the benchmark) were much faster with the native version, and queries were twice as fast.
(This is not due to a Java shortcoming but because the Java edition is not at the same version as the native one - 4.0 vs 4.8 IIRC.)
I decided to go with the Java Edition, simply because it's possible to embed the database runtime within the same deployable. This was an important feature for my setup. I haven't benchmarked core against JE, but I have seen great performance compared with other key-value stores that I tested when first evaluating database stores.
If you're creating a web-application though, then concurrency might be very important to you in the long run.

Java Fast Data Storage & Retrieval

I need to store records in persistent storage and retrieve them on demand. The requirements are as follows:
Extremely fast retrieval and insertion
Each record will have a unique key. This key will be used to retrieve the record
The data stored should be persistent i.e. should be available upon JVM restart
A separate process would move stale records to RDBMS once a day
What do you guys think? I cannot use a standard database because of latency issues. In-memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not really call for SQL. I am thinking of some kind of flat-file-based solution. Any ideas? Any open source project? I am sure there must be someone who has solved this problem before.
There are lots of diverse tools and methods, but I think none of them shines at all of these requirements.
For low latency, you can only rely on in-memory data access - disks are physically too slow (and SSDs too). If the data does not fit in the memory of a single machine, we have to distribute the data across more nodes that together have enough memory.
For persistence, we have to write our data to disk after all. With an optimal organization this can be done as a background activity, without affecting latency.
However, for reliability (failover, HA or whatever), disk operations cannot be totally independent of the access methods: we have to wait for the disks when modifying data to make sure our operation will not disappear. Concurrency also adds some complexity and latency.
Data model is not restricting here: most of the methods support access based on a unique key.
We have to decide:
whether the data fits in the memory of one machine, or we have to find distributed solutions,
whether concurrency is an issue, or there are no parallel operations,
whether reliability is strict and we cannot lose modifications, or we can live with the fact that an unplanned crash would result in data loss.
Solutions might be
Self-implemented data structures using the standard Java library, files, etc. may not be the best solution, because reliability and low latency require clever implementations and lots of testing.
Traditional RDBMSs have a flexible data model and durable, atomic, isolated operations, caching, etc. - they actually know too much and are mostly hard to distribute. That's why they are too slow if you cannot turn off the unwanted features, which is usually the case.
NoSQL and key-value stores are good alternatives. These terms are quite vague, and cover lots of tools. Examples are
BerkeleyDB or Kyoto Cabinet as one-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine.
Project Voldemort as a distributed key-value store: uses BerkeleyDB java edition inside, simple and distributed,
ScalienDB as a distributed key-value store: reliable, but not too slow for writes either.
MemcacheDB, Redis, and other caching databases with persistence,
popular NoSQL systems like Cassandra, CouchDB, HBase etc: used mainly for big data.
A list of NoSQL tools can be found eg. here.
Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily, however we have to be careful with the hardware too (like the network properties mentioned above).
Have a look at LinkedIn's Voldemort.
If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle storing itself to disk for you.
What about something like CouchDB?
I would use a BlockingQueue for that. Simple, and built into Java.
I do something similar using realtime data from the Chicago Mercantile Exchange. The data is sent to one place for realtime use... and to another place (via TCP), using a BlockingQueue (producer/consumer) to persist the data to a database (Oracle, H2). The consumer uses a time-delayed commit to avoid disk-sync issues in the database (H2-type databases use asynchronous commit by default and avoid that issue). I log the persisting in the consumer to keep track of the queue size, to be sure it is able to keep up with the producer. Works pretty well for me.
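A stripped-down sketch of that producer/consumer pattern (the table name, queue capacity and one-second batch interval are my own placeholders, and the target table is assumed to exist already):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PersistingConsumer implements Runnable {
        private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100_000);

        // Producer side: called from the realtime feed handler.
        public void offer(String record) {
            queue.offer(record);
        }

        // Consumer side: drains the queue and commits in timed batches.
        @Override
        public void run() {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./ticks");
                 PreparedStatement ins =
                         conn.prepareStatement("INSERT INTO ticks(payload) VALUES (?)")) {
                conn.setAutoCommit(false);
                long lastCommit = System.currentTimeMillis();
                while (!Thread.currentThread().isInterrupted()) {
                    String rec = queue.take();           // blocks until data arrives
                    ins.setString(1, rec);
                    ins.executeUpdate();
                    if (System.currentTimeMillis() - lastCommit > 1000) {
                        conn.commit();                   // time-delayed commit
                        System.out.println("queue depth: " + queue.size());
                        lastCommit = System.currentTimeMillis();
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }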
MySQL with shards may be a good idea. However, it depends on the data volume, transactions per second, and latency you need.
In memory databases are also a good idea. In fact MySQL provides memory-based tables as well.
Would a Tuple space / JavaSpace work? Also check out other enterprise data fabrics like Oracle Coherence and Gemstone.
MapDB provides highly performant HashMaps/TreeMaps that are persisted to disk. It's a single library that you can embed in your Java program.
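A minimal MapDB sketch (this follows the newer DBMaker-style API; method names changed between MapDB 1.x and 3.x, so adjust to the version you pick):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.HTreeMap;
    import org.mapdb.Serializer;

    public class MapDbSketch {
        public static void main(String[] args) {
            // Single embedded jar, single backing file.
            DB db = DBMaker.fileDB("records.db").transactionEnable().make();
            HTreeMap<String, String> map = db
                    .hashMap("records", Serializer.STRING, Serializer.STRING)
                    .createOrOpen();

            map.put("key-1", "value-1");
            db.commit();   // durable point; the map survives a JVM restart
            db.close();
        }
    }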
Have you actually proved that using an out-of-process SQL database like MySQL or SQL Server is too slow, or is this an assumption?
You could use a SQL database approach in conjunction with an in-memory cache to ensure that retrievals do not hit the database at all. Despite the fact that the records are plaintext I would still advise using SQL over a flat file solution (e.g. using a text column in your table schema) as the RDBMS will perform optimisations that a file system cannot (e.g. caching recently accessed pages, etc).
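For example, a simple read-through LRU cache in front of JDBC might look like the sketch below (table and column names are placeholders); the cache absorbs the hot keys so only misses hit the database:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ReadThroughCache {
        private static final int MAX_ENTRIES = 1_000_000; // cap the in-memory footprint

        private final Connection conn;
        private final Map<String, String> cache =
                new LinkedHashMap<String, String>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                        return size() > MAX_ENTRIES;   // evict least-recently-used entries
                    }
                };

        public ReadThroughCache(Connection conn) {
            this.conn = conn;
        }

        public synchronized String get(String key) throws Exception {
            String value = cache.get(key);
            if (value == null) {
                try (PreparedStatement ps =
                             conn.prepareStatement("SELECT payload FROM records WHERE id = ?")) {
                    ps.setString(1, key);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            value = rs.getString(1);
                            cache.put(key, value);
                        }
                    }
                }
            }
            return value;
        }
    }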
However, without more information about your access patterns, expected throughput, etc. I can't provide much more in the way of suggestions.
If you are looking for a simple key-value store and don't need complex sql querying, Berkeley DB might be worth a look.
Another alternative is Tokyo Cabinet, a modern DBM implementation.
How bad would it be if you lose a couple of entries in case of a crash?
If it isn't that bad, the following approach might work for you (a minimal sketch follows below):
Create a flat file for each entry, with the file name equal to the id. Possibly one file for a smallish number of consecutive entries.
Make sure your controller has a good cache and/or use one of the existing caches implemented in Java.
Talk to a file-system expert about how to make this really fast.
It is simple and it might be fast.
Of course you lose transactions, including the ACID properties.
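A minimal sketch of the file-per-entry idea (no locking, no atomic renames, and keys must be safe file names - it really is just this):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FilePerKeyStore {
        private final Path dir;

        public FilePerKeyStore(String directory) throws IOException {
            this.dir = Files.createDirectories(Paths.get(directory));
        }

        public void put(String id, String record) throws IOException {
            // The OS page cache does most of the caching work for us.
            Files.write(dir.resolve(id), record.getBytes(StandardCharsets.UTF_8));
        }

        public String get(String id) throws IOException {
            Path file = dir.resolve(id);
            return Files.exists(file)
                    ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                    : null;
        }
    }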
Sub-millisecond reads/writes mean you cannot depend on disk, and you have to be careful about network latency. Just forget about standard SQL-based solutions, main-memory or not. In a millisecond, you cannot get much more than 100 KByte over a GBit network. Ask a telecom engineer; they are used to solving these kinds of problems.
How much does it matter if you lose a record or two? Where are they coming from? Do you have a transactional relationship with the source?
If you have serious reliability requirements then I think you may need to be prepared to pay some DB Overhead.
Perhaps you could separate the persistence problem from the in-memory problem. Use a pub-sub approach: one subscriber looks after the in-memory copy, the other persists the data, ready for a subsequent startup?
Distributed caching products such as WebSphere eXtreme Scale (no Java EE dependency) might be relevant if you can buy rather than build.
Chronicle Map is a ConcurrentMap implementation which stores keys and values off-heap, in a memory-mapped file. So you have persistence on JVM restart.
ChronicleMap.get() is consistently faster than 1 us, sometimes as fast as 100 ns per operation. It's the fastest solution in its class.
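A minimal Chronicle Map sketch (builder method names follow Chronicle Map 3.x as I remember them; the entry count and size hints are placeholder values you must size for your own data):

    import java.io.File;
    import net.openhft.chronicle.map.ChronicleMap;

    public class ChronicleMapSketch {
        public static void main(String[] args) throws Exception {
            // Off-heap, memory-mapped map persisted to a single file.
            ChronicleMap<CharSequence, CharSequence> map = ChronicleMap
                    .of(CharSequence.class, CharSequence.class)
                    .name("records")
                    .entries(50_000_000)               // expected number of entries
                    .averageKey("key-0000000000")      // size hints for variable-length types
                    .averageValue("value-0000000000")
                    .createPersistedTo(new File("records.dat"));

            map.put("key-1", "value-1");
            CharSequence value = map.get("key-1");
            map.close();
        }
    }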
Will all the records and keys you need fit in memory at once? If so, you could just use a HashMap<String,String>, since it's Serializable.
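If that fits, persistence can be as crude as serializing the whole map on shutdown and reading it back on startup - a sketch, with the obvious caveat that every save rewrites the entire file:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.HashMap;

    public class SerializedMapStore {
        static void save(HashMap<String, String> map, String file) throws Exception {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(map);
            }
        }

        @SuppressWarnings("unchecked")
        static HashMap<String, String> load(String file) throws Exception {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (HashMap<String, String>) in.readObject();
            }
        }
    }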
