Is Bitcask OK for a simple, high-performance file store? - Java

I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of xml-files in a batch-process. XML files may be up to a few megs large, most in the 100KB-range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I have had a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem like overkill for our needs:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there seems to be no way to access a Bitcask repository from Java (or is there?)
So my question boils down to:
Is the following assumption right: is the Bitcask model (append-only writes, in-memory key management) the right way to store/retrieve millions of documents?
Are there any viable alternatives to Bitcask available from Java? (BerkeleyDB comes to mind...)
(For Riak specialists) Is Riak much overhead, implementation-, management- and resource-wise, compared to "naked" Bitcask?

I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100KB each, that is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.

Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.
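For illustration, here is a minimal sketch of the Bitcask idea in plain Java: one append-only data file plus an in-memory "keydir" mapping each key to the offset and length of its latest value. The class name is made up, and a real store would also need CRCs, hint files and merging; this only shows how little is involved in the core model.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal Bitcask-style store: values are appended to a single data file,
    // and an in-memory "keydir" maps each key to the offset and length of its
    // latest value. Overwriting a key simply appends again and updates the keydir.
    public class MiniBitcask implements AutoCloseable {
        private final FileChannel data;
        private final Map<String, long[]> keydir = new HashMap<>(); // key -> {offset, length}

        public MiniBitcask(Path file) throws IOException {
            data = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
        }

        public void put(String key, byte[] value) throws IOException {
            long offset = data.size();
            data.write(ByteBuffer.wrap(value), offset); // append-only write
            keydir.put(key, new long[] { offset, value.length });
        }

        public byte[] get(String key) throws IOException {
            long[] entry = keydir.get(key);
            if (entry == null) return null;
            ByteBuffer buf = ByteBuffer.allocate((int) entry[1]);
            data.read(buf, entry[0]); // one positioned read per lookup
            return buf.array();
        }

        @Override
        public void close() throws IOException {
            data.close();
        }
    }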

Related

'Big dictionary' implementation in Java

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers for your task out of the box (a small usage sketch follows this list):
Persistence to disk via memory-mapped files (see comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than the available memory, the operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (or 8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need it or are going to parallelize the text processing in the future.
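For reference, a rough usage sketch assuming ChronicleMap's builder API; the exact builder methods (such as averageKey and createPersistedTo) may differ slightly between versions, and the file name and sizing numbers are illustrative:

    import java.io.File;
    import java.io.IOException;
    import net.openhft.chronicle.map.ChronicleMap;

    public class DictionaryStore {
        public static void main(String[] args) throws IOException {
            // Off-heap, persisted map of word -> int id, backed by a memory-mapped file.
            try (ChronicleMap<CharSequence, Integer> dict = ChronicleMap
                    .of(CharSequence.class, Integer.class)
                    .entries(10_000_000)            // expected number of words (illustrative)
                    .averageKey("representative")   // sample key so the map can size its storage
                    .createPersistedTo(new File("dictionary.dat"))) {

                dict.put("example", 42);
                Integer id = dict.get("example");   // lookup hits mapped memory, not the Java heap
                System.out.println(id);
            }
        }
    }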
Once your data structure is a few hundred MB, or on the order of the available RAM, you're better off not building it at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways to ensure fast retrieval of text once your file gets so large that you're running up against the -Xmx setting of your JVM. If your file is as large as, or larger than, your maximum heap setting, the JVM will inevitably run out of memory.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can parallelize this process if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there is a way, if you want to go low-level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good tries implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
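To make the "1-5 bytes for the child pointer ... base 128 encoding" concrete, here is a small sketch of writing and reading such a variable-length relative pointer to and from a ByteBuffer; the class is made up, and in a packed trie these bytes would sit next to the edge-label characters:

    import java.nio.ByteBuffer;

    // "Base 128" (varint) encoding of a relative child pointer: 7 payload bits per
    // byte, with the high bit set on every byte except the last. Small offsets
    // (< 128) therefore fit into a single byte.
    public final class VarInt {
        public static void write(ByteBuffer buf, int value) {
            while ((value & ~0x7F) != 0) {
                buf.put((byte) ((value & 0x7F) | 0x80)); // more bytes follow
                value >>>= 7;
            }
            buf.put((byte) value); // last byte, high bit clear
        }

        public static int read(ByteBuffer buf) {
            int result = 0, shift = 0;
            byte b;
            do {
                b = buf.get();
                result |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            return result;
        }

        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocate(16);
            write(buf, 100);       // 1 byte
            write(buf, 1_000_000); // 3 bytes
            buf.flip();
            System.out.println(read(buf) + " " + read(buf)); // prints: 100 1000000
        }
    }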
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are looked up far more frequently than others, consider using a limited-size cache to hold the most recently used (say) 10,000 lookups.
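A simple way to get such a limited-size cache in plain Java is a LinkedHashMap in access order; a minimal sketch (the capacity of 10,000 mirrors the number suggested above):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LRU cache: a LinkedHashMap in access order drops the eldest entry
    // once the configured capacity is exceeded.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    // Usage: LruCache<String, Integer> cache = new LruCache<>(10_000);
    // Check the cache first; on a miss, query the database/Solr and cache.put(...) the result.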

Java: Optimal approach for storing and reading 1 billion data records

I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then being able to read it multiple times in a non-sequential order.
The source records are being generated into simple java value objects and I would like to read them back in the same format.
For now my best guess is to store these objects, using a fast serialization library such as Kryo, in a flat file and then to use a Java FileChannel for direct random-access reads of records at specific positions in the file (when storing the data, I will keep, in a hashmap that is also saved to disk, the position of each record in the file, so that I know where to read it).
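To illustrate the read path of that design, here is a small sketch of non-sequential reads via positioned FileChannel reads, given an already-built id -> {offset, length} index; the class name is made up and the Kryo deserialization step is left out:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;

    public class RecordReader {
        private final FileChannel channel;
        private final Map<Long, long[]> index; // record id -> {offset in file, length in bytes}

        public RecordReader(Path dataFile, Map<Long, long[]> index) throws IOException {
            this.channel = FileChannel.open(dataFile, StandardOpenOption.READ);
            this.index = index;
        }

        // One positioned read per record; positioned reads do not touch a shared
        // file pointer, so several reader threads can share the channel.
        public byte[] read(long id) throws IOException {
            long[] entry = index.get(id);
            ByteBuffer buf = ByteBuffer.allocate((int) entry[1]);
            channel.read(buf, entry[0]); // a real implementation would loop until the buffer is full
            return buf.array();          // feed these bytes to Kryo to rebuild the value object
        }
    }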
Also, there is no need to optimize disk space. My key concern is to optimize read performance, while having a reasonable write performance (that, again, will happen only once).
Last precision: while the records are all of the same type (same Java value object), their size (in bytes) is variable (e.g. it contains strings).
Is there any better approach than what I mentioned above? Any hint or suggestion would be greatly appreciated !
Many thanks,
Thomas
You can use Apache Lucene; it will take care of everything you have mentioned above :)
It is super fast, and you can search results more quickly than ever.
Apache Lucene persists documents in files and indexes them. We have used it in a couple of apps and it is super fast.
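A rough sketch of how the store/lookup could look with Lucene, assuming a reasonably recent version (exact constructors and field classes vary between releases, and the field names "id" and "payload" are illustrative):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class LuceneRecordStore {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("records-index"));

            // Store once: one Lucene document per record, keyed by an "id" field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("id", "42", Field.Store.YES));       // indexed, exact-match key
                doc.add(new StoredField("payload", new byte[] { 1, 2, 3 })); // the ~250-byte record body
                writer.addDocument(doc);
            }

            // Read many times, in any order: a TermQuery on the id field.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("id", "42")), 1);
                Document hit = searcher.doc(hits.scoreDocs[0].doc);
                BytesRef payload = hit.getBinaryValue("payload"); // bytes to deserialize back into the value object
                System.out.println(payload.length + " bytes");
            }
        }
    }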
You could just use an embedded Derby database. It's written in Java and you can run it embedded within your process, so there is no overhead of inter-process or networked communication. It will store the data and allow you to query it, handling all the complexity and indexing for you.

Replacing a huge dump file with an efficient lookup Java key-value text store

I have a huge dump file - 12GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that will provide an efficient look-up. That is, given an id, it would return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign language dependencies.
Read and writes to the disk, not in-memory - I don't have 12GB of RAM.
Does not blow up too much - I don't want to turn a 12GB file into a 200GB index. I don't need full text search, sorting, or anything fancy - Just key-value lookup.
Efficient - It's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it) because it is designed to access large amounts of data (larger than your machine's memory) somewhat randomly.
It can store any number of fields in text or binary formats (or a combination if you wish). It adds 8 bytes per record you want to be able to randomly access. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by any number of threads on the same machine (even in different processes).
It doesn't support batching, but it can read/write millions of entries per second with typical sub-microsecond latency (except for random reads/writes which are not in memory).
It uses next to no heap (<1 MB for TBs of data).
It uses sequential ids, but you can build a table to translate your own ids to them.
BTW: You can buy 32 GB for less than $200. Perhaps its time to get more memory ;)
Why not use JavaDB - the DB that comes with Java?
It'll store the info on disk, and be efficient in terms of lookups, provided you index properly. It'll run in-JVM, so you don't need a separate server/service. You talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it is Apache Derby, originally contributed by IBM) and will have had a lot of effort expended on it in terms of robustness and efficiency.
You'll obviously need to do an initial onboarding of the data to create the database, but that's a one-off task.
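A minimal sketch of what that looks like with embedded Derby/JavaDB over plain JDBC; the table and column names are illustrative, and the ;create=true URL attribute creates the database directory on first use:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DumpStore {
        public static void main(String[] args) throws Exception {
            // Embedded mode: the database lives in a local directory, no server process needed.
            try (Connection conn = DriverManager.getConnection("jdbc:derby:dumpdb;create=true")) {
                try (Statement st = conn.createStatement()) {
                    st.executeUpdate("CREATE TABLE entries (id BIGINT PRIMARY KEY, body CLOB)");
                }

                // One-off onboarding: batch inserts of (id, text) pairs from the dump file.
                try (PreparedStatement insert =
                             conn.prepareStatement("INSERT INTO entries (id, body) VALUES (?, ?)")) {
                    insert.setLong(1, 12345L);
                    insert.setString(2, "...entry text from the dump...");
                    insert.addBatch();
                    insert.executeBatch();
                }

                // Lookups by id use the primary-key index.
                try (PreparedStatement query =
                             conn.prepareStatement("SELECT body FROM entries WHERE id = ?")) {
                    query.setLong(1, 12345L);
                    try (ResultSet rs = query.executeQuery()) {
                        if (rs.next()) System.out.println(rs.getString("body"));
                    }
                }
            }
        }
    }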

Inserting to and searching a large amount of data in Java

I am writing a program in Java which tracks data about baseball cards. I am trying to decide how to store the data persistently. I have been leaning towards storing the data in an XML file, but I am unfamiliar with XML APIs. (I have read some online tutorials and started experimenting with the classes in the javax.xml hierarchy.)
The software has two major use cases: the user will be able to add cards and search for cards.
When the user adds a card, I would like to immediately commit the data to persistent storage. Does the standard API allow me to insert data in a random-access way (or even appending would be okay)?
When the user searches for cards (for example, by a player's name), I would like to load a list from the storage without necessarily loading the whole file.
My biggest concern is that I need to store data for a large number of unique cards (in the neighborhood of thousands, possibly more). I don't want to store a list of all the cards in memory while the program is open. I haven't run any tests, but I believe that I could easily hit memory constraints.
XML might not be the best solution. However, I want to make it as simple as possible to install, so I am trying to avoid a full-blown database with JDBC or any third-party libraries.
So I guess I'm asking if I'm heading in the right direction and if so, where can I look to learn more about using XML in the way I want. If not, does anyone have suggestions about what other types of storage I could use to accomplish this task?
While I would certainly not discourage the use of XML, it does have some drawbacks in your context.
"Does the standard API allow me to insert data in a random-access way"
Yes, in memory. You will have to save the entire model back to file though.
"When the user searches for cards (for example, by a player's name), I would like to load a list from the storage without necessarily loading the whole file"
Unless you're expecting multiple users to be reading/writing the file, I'd probably pull the entire file/model into memory at load and keep it there until you want to save (doing periodic writes in the background is still a good idea).
"I don't want to store a list of all the cards in memory while the program is open. I haven't run any tests, but I believe that I could easily hit memory constraints"
That would be my concern too. However, you could use a SAX parser to read the file into a custom model. This would reduce the memory overhead (as DOM parsers can be a little greedy with memory).
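For reference, a bare-bones SAX sketch of streaming such a file into a custom model; the element and attribute names (card, player) are assumed for illustration, and javax.xml.parsers is part of the standard library:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class CardLoader {
        // Streams the XML with SAX and keeps only a lightweight custom model
        // (here just the player names), instead of a full DOM tree.
        public static List<String> loadPlayers(File xmlFile) throws Exception {
            List<String> players = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if ("card".equals(qName)) {
                        players.add(attrs.getValue("player")); // e.g. <card player="..." year="..."/>
                    }
                }
            };
            SAXParserFactory.newInstance().newSAXParser().parse(xmlFile, handler);
            return players;
        }
    }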
"However, I want to make it as simple as possible to install, so I am trying to avoid a full-blown database with JDBC"
I'd do some more research in this area. I (personally) use H2 and HSQLDB a lot for storage of large amount of data. These are small, personal database systems that don't require any additional installation (a Jar file linked to the program) or special server/services.
They make it really easy to build complex searches across the datastore that you would otherwise need to create yourself.
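A short sketch of the H2 flavour of that idea, focused on the search-by-player use case; the jdbc:h2 URL and the table/column names are illustrative, and HSQLDB would look almost identical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CardDb {
        public static void main(String[] args) throws Exception {
            // Embedded H2: the whole database is a file next to the application, no server needed.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./cards")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS card (player VARCHAR(100), season INT)");
                    st.execute("CREATE INDEX IF NOT EXISTS idx_player ON card(player)");
                }

                // Adding a card is committed to disk immediately (auto-commit is on by default).
                try (PreparedStatement ins =
                             conn.prepareStatement("INSERT INTO card (player, season) VALUES (?, ?)")) {
                    ins.setString(1, "Babe Ruth");
                    ins.setInt(2, 1933);
                    ins.executeUpdate();
                }

                // Searching by player name only pulls the matching rows, not the whole data set.
                try (PreparedStatement q =
                             conn.prepareStatement("SELECT player, season FROM card WHERE player = ?")) {
                    q.setString(1, "Babe Ruth");
                    try (ResultSet rs = q.executeQuery()) {
                        while (rs.next()) System.out.println(rs.getString(1) + " " + rs.getInt(2));
                    }
                }
            }
        }
    }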
If you were to use XML, I would probably do one of three things:
1 - If you're going to maintain the XML document in memory, I'd get familiar with XPath (simple tutorial & Java's API) for searching; see the sketch after this list.
2 - I'd create a "model" of the data using objects to represent the various nodes, reading it in using a SAX parser. Writing may be a little more tricky.
3 - Use a simple SQL DB (and object model) - it will simplify the overall process (IMHO)
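A tiny sketch of option 1 using the standard javax.xml.xpath API; the document layout (cards/card/player) is assumed for illustration:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class CardSearch {
        public static void main(String[] args) throws Exception {
            // Parse the whole file into an in-memory DOM document once...
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("cards.xml"));

            // ...then query it with XPath, e.g. all cards for a given player.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    "/cards/card[player='Babe Ruth']", doc, XPathConstants.NODESET);

            for (int i = 0; i < hits.getLength(); i++) {
                System.out.println(hits.item(i).getTextContent());
            }
        }
    }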
Additional
As if I hadn't dumped enough on you ;)
If you really want to use XML (and again, I wouldn't discourage you from it), you might consider having a look at an XML-database-style solution:
Apache Xindice (apparently retired)
Or you could have a look at what some other people think:
Use XML as database in Java
Java: XML into a Database, whats the simplest way?
For example ;)

Java: Advice on handling large data volumes. (Part Deux)

Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to institute a good design for the data access. Typically what will happen is such:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then, let the OS worry about the details of caching, reading, flushing, etc.
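A short sketch of that approach, mapping each file once and slicing frames out of the mapping on demand; the class name and frame sizes are illustrative, and note that a single MappedByteBuffer is limited to 2 GB, which is fine for files of a few megabytes each:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class FrameReader {
        private final MappedByteBuffer mapped;

        public FrameReader(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // Map the whole file; the OS pages data in and out as it is touched.
                mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            }
        }

        // "Give me a frame of data from this file starting at such and such byte."
        public ByteBuffer frame(long offset, int length) {
            ByteBuffer view = mapped.duplicate(); // a view over the same mapped memory, no copying
            view.position((int) offset);
            view.limit((int) (offset + length));
            return view.slice();
        }
    }

    // Usage: new FrameReader(Path.of("data.bin")).frame(0, 10 * 1024) // the first 10 KB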
#Will
Pretty good results. A quick comparison of reading a large binary file:
Test 1 - Basic sequential read with RandomAccessFile: 2656 ms
Test 2 - Basic sequential read with buffering: 47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization: 16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems already do a better job of implementing file system caching than you are likely to do without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and reading ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've done a number of contributions to the project, and it would be worth a review of the source code if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your input streams in a BufferedInputStream. Aside from that, your operating system will handle other optimizations, like keeping recently read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.
