Key -> value store with binary attachments - Java

An extra requirement is that the attachments can be stored as a stream, as there might be potentially very large binaries (videos, etc.) that have to be saved.
I have looked at Voldemort and other key value stores, but they all seem to expect byte arrays, which is completely out of the question.
This should preferably be written in Java and be embeddable.
The use case is:
I have written a HTTP Cache library which has multiple backends.
I have a memory-based one (using a HashMap and byte arrays), a Derby database, a persistent HashMap with file attachments, and EhCache with file attachments.
I was hoping there was something out there which didn't use the file system, or, if it does, that it is transparent from the API.
I am storing the headers with some more meta information in a datastore, but I also need to store the payload of the HTTP response.
The HTTP response payload might be VERY big; that's why I need to use streaming.

Why is a byte[] value out of the question? Any object graph can be serialized into a byte array!
Have you looked at Sleepycat's Berkeley DB (it's free)? A minimal sketch follows after the list below.
EDIT - having seen jhedding's comment, it seems like you need to store data which is too big to fit into a single JVM in one go. Have you:
Checked that it won't fit into a 64-bit JVM?
Tried using a network file system? (NAS or whatever)
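
For illustration, here is a minimal sketch of the byte[] approach with Berkeley DB Java Edition; the environment path, database name, and keys are invented for the example:

// Minimal Berkeley DB JE sketch: embedded key/value storage with byte[] values.
// Paths and names are illustrative, not from the question.
import java.io.File;
import com.sleepycat.je.*;

public class BdbSketch {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        // The environment directory must already exist.
        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "httpCache", dbConfig);

        // Store one entry and read it back.
        DatabaseEntry key = new DatabaseEntry("response:42".getBytes());
        db.put(null, key, new DatabaseEntry(new byte[] {1, 2, 3}));

        DatabaseEntry value = new DatabaseEntry();
        if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(value.getData().length + " bytes read");
        }

        db.close();
        env.close();
    }
}

Note that DatabaseEntry values are byte arrays, so truly huge payloads would still need chunking across multiple entries or a file-backed scheme.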

Related

How to persist large strings in a POJO?

If I have a property of an object which is a large String (say the contents of a file ~ 50KB to 1 MB, maybe larger), what is the practice around declaring such a property in a POJO? All I need to do is to be able to set a value from one layer of my application and transfer it to another without making the object itself "heavy".
I was considering if it makes sense to associate an InputStream or OutputStream to get / set the value, rather than reference the String itself - which means when I attempt to read the value of the contents, I read it as a stream of bytes, rather than a whole huge string loaded into memory... thoughts?
What you're describing depends largely on your anticipated use of the data. If you're delivering the contents in raw form, then there may be more efficient ways to manage it.
For example, if your app has a web interface, your app may just provide a URL for a web server to stream the contents to the requester. If it's a CLI-based app, you may be able to get away with a simple file copy. If your app is processing the file, however, then perhaps your POJO could retain only the results of that processing rather than the raw data itself.
If you wish to provide a general pattern along the lines of using POJOs with references to external streams, I would suggest storing in your POJO something akin to a URI that tells where to find the stream (like a row ID in a database, a filename, or a URL) rather than storing an instance of the stream itself. In doing so, you'll reduce the number of open file handles, prevent potential concurrency issues, and be able to serialize those objects locally if needed without having to duplicate the raw data persisted elsewhere.
You could have an object that supplies a stream or an iterator every time you access it. Note that the content has to live on some storage, like a file; i.e., your object will store a pointer (e.g. a file path) to the storage, and every time someone accesses it, you open a stream or create an iterator and let that party read. Note also that in order to save memory, whoever consumes it has to make sure not to store the whole content in memory.
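A minimal sketch of that pattern (class and field names are invented for the example):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// The POJO stays light: it stores only a pointer to the content
// and hands out a fresh stream on every access.
public class LargeContent {
    private final Path location;

    public LargeContent(Path location) {
        this.location = location;
    }

    // Each caller gets its own stream; nothing big is held in memory.
    public InputStream openStream() throws IOException {
        return Files.newInputStream(location);
    }
}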
However, 50 KB or even 1 MB is really tiny. Unless you have gigabytes (or maybe hundreds of megabytes), I wouldn't try to do something like that.
Also, even if you do have large data, it's often simpler to just work directly with files or whatever storage you're using.
tl;dr: Just use String.

My JSON files are too big to fit into memory, what can I do?

In my program, I am reading a series of text files from the disk. With each text file, I process out some data and store the results as JSON on the disk. In this design, each file has its own JSON file. In addition to this, I also store some of the data in a separate JSON file, which stores relevant data from multiple files. My problem is that the shared JSON grows larger and larger with every file parsed, and eventually uses too much memory. I am on a 32-bit machine and have 4 GB of RAM, and cannot increase the memory size of the Java VM anymore.
Another constraint to consider is that I often refer back to the old JSON. For instance, say I pull out ObjX from FileY. In pseudo code, the following happens (using Jackson for JSON serialization/deserialization):
// In the main method (mapper is a Jackson ObjectMapper;
// imports: com.fasterxml.jackson.databind.*, java.io.File, java.util.*).
JsonNode fileYJson = mapper.readTree(new File("FileY"));
JsonNode objX = fileYJson.get("some_key");
sharedJson.add(objX);

// In the sharedJSON object
private final List<JsonNode> objList = new ArrayList<>();

public void add(JsonNode obj) {
    if (!objList.contains(obj)) {
        objList.add(obj);
    }
}
The only thing I can think to do is use streaming JSON, but the problem is that I frequently need to access the JSON that came before, so I don't know that streaming will work. Also, my data types are not only strings, which prevents me from using Jackson's streaming capabilities (I believe). Does anyone know of a good solution?
If you're getting to the point where your data structures are so large that you're running out of memory, you'll have to start using something else. I would recommend that you use a database, which will significantly speed up data retrieval and storage. It will also make the limit of your data structure the size of your hard drive, instead of the size of your RAM.
Try this page for an introduction to Java and Databases.
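As a hedged illustration of that suggestion (HSQLDB is just one embeddable choice, and the JDBC URL, table, and column names are invented for the example), the shared list could become a keyed table, so the duplicate check happens in the database instead of in memory:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SharedJsonDb {
    public static void main(String[] args) throws Exception {
        // File-backed database, so the data no longer has to fit in RAM.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:shareddb", "SA", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS shared ("
                        + "obj_key VARCHAR(256) PRIMARY KEY, "
                        + "obj_json LONGVARCHAR)");
            }
            // The primary key now enforces what List.contains() did before.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO shared (obj_key, obj_json) VALUES (?, ?)")) {
                insert.setString(1, "ObjX");
                insert.setString(2, "{\"some_key\": \"value\"}");
                insert.executeUpdate();
            }
        }
    }
}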
I can't believe that you really need nearly 4 GB of RAM just for text files and JSON.
I see three possible solutions:
1. Switch to plain text if possible. That is not as memory-hungry.
2. Just open and close the files as you need them. You can order the files by a specific naming convention, like the first two/three/... digits of their hashes, and open them as needed (see the sketch after this list).
3. If you have that much data, you could switch to a database. That would save a lot of resources.
I would prefer option 3 if it's possible for you.
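A small sketch of the naming convention in option 2 (the digest and formatting choices are only for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class FileBuckets {
    // Map a key to a bucket file by the first two hex digits of its hash,
    // e.g. "someKey" -> "data/3f.json"; open only that file when needed.
    static String bucketFor(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return String.format("data/%02x.json", digest[0]);
    }
}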
You could make an API and read the response body from it.

Need to send multiple objects through an HTTP output stream

I am trying to send some very large files (>200 MB) through an HTTP output stream from a Java client to a servlet running in Tomcat.
My protocol currently packages the file contents in a byte[], and that is placed in a Map<String, Object> along with some metadata (filename, etc.), each part under a "standard" key ("FILENAME" -> "Foo", "CONTENTS" -> byte[], "USERID" -> 1234, etc.). The Map is written to the URL connection output stream (urlConnection.getOutputStream()). This works well when the file contents are small (<25 MB), but I am running into Tomcat memory issues (OutOfMemoryError) when the file size is very large.
I thought of sending the metadata Map first, followed by the file contents, and finally by a checksum on the file data. The receiver servlet can then read the metadata from its input stream, then read bytes until the entire file is finished, finally followed by reading the checksum.
Would it be better to send the metadata in connection headers? If so, how? If I send the metadata down the socket first, followed by the file contents, is there some kind of standard protocol for doing this?
You will almost certainly want to use a multipart POST to send the data to the server. Then on the server you can use something like commons-fileupload to process the upload.
The good thing about commons-fileupload is that it understands that the server may not have enough memory to buffer large files and will automatically stream the uploaded data to disk once it exceeds a certain size, which is quite helpful in avoiding OutOfMemoryError type problems.
Otherwise you are going to have to implement something comparable yourself. It doesn't really make much difference how you package and send your data, so long as the server can 1) parse the upload and 2) redirect data to a file so that it doesn't ever have to buffer the entire request in memory at once. As mentioned both of these come free if you use commons-fileupload, so that's definitely what I'd recommend.
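For illustration, a sketch using Commons FileUpload's streaming API inside a servlet; the servlet class and target directory are invented for the example:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.apache.commons.fileupload.util.Streams;

public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response) {
        try {
            ServletFileUpload upload = new ServletFileUpload();
            FileItemIterator iter = upload.getItemIterator(request);
            while (iter.hasNext()) {
                FileItemStream item = iter.next();
                try (InputStream in = item.openStream()) {
                    if (item.isFormField()) {
                        // Metadata (FILENAME, USERID, ...) arrives as form fields.
                        String value = Streams.asString(in); // use as needed
                    } else {
                        // File content goes straight to disk, never fully buffered.
                        try (OutputStream out =
                                new FileOutputStream(new File("/tmp", item.getName()))) {
                            Streams.copy(in, out, false);
                        }
                    }
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}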
I don't have a direct answer for you but you might consider using FTP instead. Apache Mina provides FTPLets, essentially servlets that respond to FTP events (see http://mina.apache.org/ftpserver/ftplet.html for details).
This would allow you to push your data in any format without requiring the receiving end to accommodate the entire data in memory.

Protocol Buffer better than serialization?

I have a large data structure which I'm serializing. At certain times I need to edit values in the data structure, but just for changing a small value I have to re-serialize it all over again instead of updating the changed value in the file. I've heard of Google Protocol Buffers. Will using them solve my problem of rewriting the file? Are they a better option for me than Java serialization?
Protocol buffers are themselves a serialization format, so they won't fundamentally change the picture (you'll still need to re-serialize after you change a value).
Google's docs claim that protocol buffers are more compact and faster to parse than XML (which seems plausible); don't know how they compare to native Java serialization.
Advantages of protocol buffers might be portability (if programs written in other languages need to read the file) and upgradability (you can add new fields to the data structure without breaking the file format).
A couple of points:
There is an editor for the Protocol Buffers binary format (http://code.google.com/p/protobufeditor/).
Protocol Buffers has a text format that looks like:
# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "jdoe@example.com"
}
See:
Discussion: http://groups.google.com/group/protobuf/browse_thread/thread/04fc478088137bf3
Class: http://code.google.com/apis/protocolbuffers/docs/reference/java/com/google/protobuf/TextFormat
Having said that, I would use a technology (JSON, XML, etc.) that is already in use, unless one of the following applies:
You need the performance of protocol buffers
You already use, or plan to use, protocol buffers
If you care about performance, don't use a text format for your data. If you want to modify the data without deserializing everything, you'll want a fixed-record data format, which you'll probably have to invent yourself: seek to the correct position in the file and rewrite just the changed field (see the sketch below). You might look at DataOutputStream to get started, or instead use a database such as HSQLDB to store and edit your data.
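A hedged sketch of what such a fixed-record update could look like with RandomAccessFile (the record layout is invented for the example):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedRecordStore {
    static final int NAME_BYTES = 32;               // fixed-width name field
    static final int RECORD_BYTES = NAME_BYTES + 4; // name + 4-byte int score

    // Overwrite only the score of record `index`, leaving the rest of
    // the file untouched -- no full re-serialization needed.
    static void updateScore(String path, long index, int newScore) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek(index * RECORD_BYTES + NAME_BYTES); // jump to the score field
            raf.writeInt(newScore);
        }
    }
}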
Thinking about this more: unless your objects are very simple, I think a database would be a better way to go.
More info on DataOutputStream:
http://download.oracle.com/javase/tutorial/essential/io/datastreams.html
Java Databases:
http://java-source.net/open-source/database-engines
You need a serialization format that can be modified directly, for example XML or JSON. Google Protocol Buffers is a binary format, as is Java serialization, and thus cannot be modified directly.

Db4o, Java: Storing images using blobs

I want to store images in Db4o using Blobs. How can I store them and how do I get them out again?
Take a look at the answers to this question: How to store pictures in Db4o?
I repost my answer here, updated a bit with links to the Java documentation:
There are two basic ways to handle blobs: either you store a blob as a byte array in the database, or you use the special db4o blob type. Both have their advantages.
Advantages/Disadvantages with byte array:
The blobs are in the db4o database file, so there's only a single file to copy around.
Byte arrays are part of the normal db4o transaction and behave as expected.
When storing large blobs, you might run into the database-size limitation of db4o (256 GB).
Advantages/Disadvantages with db4o blobs:
The blobs are stored as regular files outside the database. This keeps the database itself small, and you can access all stored blobs with a regular application.
You always need to copy the blob directory along with the database.
The db4o blobs work outside the db4o transaction. This means that a db4o blob behaves differently from any other stored object (and the API is a little strange), but it also means a db4o blob can be retrieved without blocking the current transaction.
For your case I would store a byte[] array with the picture in the Person class. Or you create a special Image class, which then contains a byte array with the picture and a few methods to convert this byte array to and from an image object (a sketch follows below).
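A hedged sketch of the byte[] approach (the Image class, file names, and query-by-example lookup are illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;
import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;

public class ImageStore {
    // Simple entity holding the picture plus a lookup key.
    static class Image {
        String name;
        byte[] data;
        Image(String name, byte[] data) { this.name = name; this.data = data; }
    }

    public static void main(String[] args) throws Exception {
        ObjectContainer db = Db4oEmbedded.openFile(
                Db4oEmbedded.newConfiguration(), "images.db4o");
        try {
            // Store: read the file fully into a byte[] (fine for pictures).
            byte[] bytes = Files.readAllBytes(Paths.get("photo.jpg"));
            db.store(new Image("photo.jpg", bytes));
            db.commit();

            // Retrieve by example: null fields act as wildcards in db4o QBE.
            ObjectSet result = db.queryByExample(new Image("photo.jpg", null));
            if (result.hasNext()) {
                Image stored = (Image) result.next();
                Files.write(Paths.get("photo-copy.jpg"), stored.data);
            }
        } finally {
            db.close();
        }
    }
}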
