Purpose of the Lazy* classes in the MongoDB Java API

The MongoDB Java driver documentation for the package org.bson mentions various "Lazy" versions of other classes. Unfortunately, the Javadocs of these classes can barely be called documentation.
What is their purpose and how does their behavior differ from the normal versions?

Under normal operation, the driver creates and consumes documents through the DBObject Map-like interface. When inserting documents, it iterates over the map to convert it to the corresponding BSON representation. When querying, it creates new documents by putting key-value pairs into the map.
But there are times when you want to work with raw BSON and not pay the cost of all this serialization and deserialization. That's what the lazy DBObject implementations are for. Instead of treating them as maps, a custom encoder writes their bytes directly to the BSON stream. Similarly, a custom decoder copies the raw bytes directly into the lazy DBObject.
In this context, the meaning of the term lazy is that, since the lazy equivalents still have to implement the DBObject interface, they do so by "lazily" interpreting the raw BSON byte array that they contain.
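For illustration, here is a minimal sketch of opting into lazy decoding with the 2.x driver; the LazyDBDecoder.FACTORY setup shown here is the assumed way to enable it on a collection:
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.LazyDBDecoder;
import com.mongodb.MongoClient;

MongoClient client = new MongoClient();
DB db = client.getDB("test");
DBCollection coll = db.getCollection("events");
// Ask the driver to return lazy documents backed by raw BSON bytes
coll.setDBDecoderFactory(LazyDBDecoder.FACTORY);
// The returned document is a lazy DBObject; the field is only
// decoded from the underlying byte array at access time
DBObject doc = coll.findOne();
Object name = doc.get("name");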
One last note: the lazy DBObject classes are very likely not going to be included in the upcoming 3.0 release of the driver, as the entire serialization layer is changing in a way that is not compatible with lazy DBObjects. There will be equivalent functionality, but it will not be API-compatible.

Related

Java objects to HBase

I'm currently using the Kite API + Avro to persist Java objects to HBase. But due to various problems I'm looking for an alternative.
I've been reading about:
Phoenix
The native HBase API.
But are there other alternatives?
The idea is to save and load Java objects to HBase and use them in a Java application.
If you're storing your objects in the value portion of the KeyValue pair, then it's really just an array of bytes (i.e. the KeyValue class has a getValue method which returns a byte array).
At this point, you're down to object serialization, and there are a host of libraries you can use with varying ease of use, performance characteristics, and implementation details. Avro is one serialization library which stores the schema with each record, but you could in theory use:
Standard Java serialization (implement Serializable)
Kryo
Protobuf
Just to name a few. You may want to investigate each library's strengths and trade-offs and balance them against the kinds of objects you plan to store (i.e. are they all effectively the same type of object, or do they vary widely in type? Will they be long-lived, i.e. years, with an expectation of schema evolution and backwards compatibility, etc.?).
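As a concrete baseline, here is a minimal sketch that uses standard Java serialization to produce the value bytes and store them with the HBase 1.x client (the row key, family, and qualifier names are placeholders, and 'table' is assumed to be an open org.apache.hadoop.hbase.client.Table):
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Serialize a Serializable object to a byte array
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(bos);
oos.writeObject(myObject); // myObject implements java.io.Serializable
oos.close();
byte[] value = bos.toByteArray();

// Store the bytes as the value of a KeyValue pair
Put put = new Put(Bytes.toBytes("row-1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("obj"), value);
table.put(put);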
Phoenix is a JDBC API for HBase. It handles most SQL types (except intervals), and you can store arbitrary Java objects using the binary data type. But if you are only storing binary data, you could just as easily stick with plain HBase. If you can coerce your data into standard SQL types, Phoenix may be a good option.
If you want to stick with the Hadoop/HBase code, you can have your complex class implement org.apache.hadoop.io.Writable.
import org.apache.hadoop.io.WritableUtils;

// Some complex Java object that implements
// org.apache.hadoop.io.Writable
SomeObject myObject = new SomeObject();
// Serialize the object to a byte array
// for storage in HBase
byte[] byteArr = WritableUtils.toByteArray(myObject);

Does Jackson use stable binary serialization?

I have an external key-value storage, which contains a set of binary values per key. These binary values are Scala objects serialized using Jackson 2.5 with a ScalaPlugin.
To preserve set semantics in the storage, binary representation of an object should be stable, i.e. serializing it two times should result in the same sequence of bytes. Does Jackson have this guarantee? If it uses JSON serialization as an intermediate step for example, there is no node ordering, so binary serialization will not be stable.
This may get even more tricky if an object contains unordered structures internally (like sets or maps). Does Jackson handle that?
If not, are there sound alternatives?
Update:
I have found com.fasterxml.jackson.databind.SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS which resolves a part of the question.
You can use the option SORT_PROPERTIES_ALPHABETICALLY, as mentioned in the answers to this question, to achieve stable serialization (together with the option ORDER_MAP_ENTRIES_BY_KEYS that you already mentioned).
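A minimal sketch of enabling both features on an ObjectMapper (both are standard Jackson 2.x features, in MapperFeature and SerializationFeature respectively):
import com.fasterxml.jackson.databind.MapperFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

ObjectMapper mapper = new ObjectMapper();
// Sort bean properties alphabetically instead of relying on reflection order
mapper.enable(MapperFeature.SORT_PROPERTIES_ALPHABETICALLY);
// Sort map entries by key before writing them out
mapper.enable(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS);
// Serializing the same value twice now yields the same bytes
byte[] bytes = mapper.writeValueAsBytes(value);
Note that this covers bean properties and map keys; an unordered Set would still need a deterministic iteration order (e.g. a sorted collection) to serialize stably.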
Some more background is explained in these emails:
"... in absence of explicit declaration (via #JsonPropertyOrder), and general mechanisms (sort-alphabetically; Object Id and Type Id preceding other properties), there is no defined ordering."
and
"One thing to note is that Oracle did change behavior of JDK 7, such that previously stable ordering of methods and fields returned by Introspection become arbitrary." ... "Existing default ordering is based on simple traversal and ordering that JDK provides; ...".
(both quoted from emails by Tatu Saloranta).

How to read schema-less documents using Java from MongoDB

MongoDB gives the ability to write documents of any structure, i.e. any number and any types of key/value pairs can be written. Assuming that I use this feature and my documents are indeed schema-less, how do I manage reads? Basically, how does the application code (I'm using Java) manage reads from the database?
The Java driver reads and writes documents as BasicBSONObjects, which implement and are used as Map<String, Object>. Your application code is then responsible for reading this map and casting the values to the appropriate types.
A mapping framework like Morphia or Spring Data MongoDB can help you convert BSONObjects to your own classes and vice versa.
When you want to do this yourself, you could use a factory method which takes a BasicBSONObject, checks which keys and values it has, uses this information to create an object of the appropriate class, and returns it.
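A minimal sketch of such a factory; the "type" discriminator field and the User and Event classes are hypothetical:
import org.bson.BasicBSONObject;

public static Object fromBson(BasicBSONObject doc) {
    // Decide which class to build from a (hypothetical) discriminator field
    String type = (String) doc.get("type");
    if ("user".equals(type)) {
        return new User((String) doc.get("name"), (Integer) doc.get("age"));
    }
    if ("event".equals(type)) {
        return new Event((String) doc.get("name"), (java.util.Date) doc.get("at"));
    }
    throw new IllegalArgumentException("Unknown document type: " + type);
}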

Serializing objects with changing class source code

Note: Due to the lack of questions like this on SO, I've decided to put one up myself as a Q&A.
Serializing objects (using an ObjectOutputStream and an ObjectInputStream) is a method for storing an instance of a Java Object as data that can be later deserialized for use. This can cause problems and frustration when the Class used to deserialize the data does not remain the same (source-code changes; program updates).
So how can an Object be serialized and deserialized with an updated / downgraded version of a Class?
Here are a few common ways of serializing an object that can be deserialized in a backwards-compatible way.
1. Store the data in the JSON format using import and export methods designed to save all the fields needed to recreate the instance. This can be made backwards-compatible by including a version key that allows an upgrade algorithm to be called if the version is too low. A common library for this is Google's Gson, which can represent Java objects in JSON as well as read and write JSON files directly.
2. Use the built-in Java Properties class in a way similar to the method described above. Properties objects can later be stored using a stream (store()) as a regular Java properties file, or saved as XML (storeToXML()).
3. Sometimes simple objects can be easily represented with key-value pairs in a place where storing them in a JSON, XML, or Properties file is either too complicated or not necessary (overkill, one could say). In this case, an effective way of serializing the object is to use ObjectOutputStream to serialize a HashMap<String, Object> containing the object's fields keyed by String names. This allows all of the object's fields to be stored, along with a version key, while providing much versatility (see the sketch after this list).
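A minimal sketch of option 3; the field names, VERSION constant, and file name are illustrative:
import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

final int VERSION = 2; // bump whenever the class's fields change

HashMap<String, Object> map = new HashMap<String, Object>();
map.put("version", VERSION);
map.put("name", user.getName()); // 'user' is the instance being saved
map.put("score", user.getScore());
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("user.ser"))) {
    out.writeObject(map);
}
// On load, read the map back, check map.get("version"), and run an
// upgrade step (e.g. fill in defaults for missing keys) if it is too low.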
Note: Although serializing an object using ObjectOutputStream for persistent storage is normally considered bad practice, it can be used either way as long as the class's source code remains the same.
Also note, regarding versioning: changes to a class can be made safely without disrupting deserialization as long as they are compatible changes. As mentioned in the Versioning of Serializable Objects chapter of the Object Serialization Specification:
A compatible change is a change that does not affect the contract
between the class and its callers.

Storing Serializable Objects in the Database

I'm writing an application which needs to write an object into database.
For simplicity, I want to serialize the object.
But ObjectOutputStream, which is needed for this purpose, has only one constructor, which takes any subclass of OutputStream as its parameter.
What parameter should be passed to it?
You can pass a ByteArrayOutputStream and then store the resulting stream.toByteArray() in the database as a blob.
Make sure you specify a serialVersionUID for the class, because otherwise you'll have a hard time when you add or remove a field.
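Putting those two tips together, a minimal sketch; the class, table, and column names are placeholders, and 'conn' is assumed to be an open java.sql.Connection:
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;

class Settings implements Serializable {
    // Keeps deserialization stable when the class is edited compatibly
    private static final long serialVersionUID = 1L;
    String theme;
}

// Serialize the object to a byte array
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(bos);
oos.writeObject(settings);
oos.close();

// Store the bytes in a BLOB column
PreparedStatement ps = conn.prepareStatement("INSERT INTO objects (data) VALUES (?)");
ps.setBytes(1, bos.toByteArray());
ps.executeUpdate();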
Also consider the XML flavor of object serialization, XMLEncoder, if you need slightly more human-readable data.
And ultimately, you may want to translate your object model to the relational model via an ORM framework. JPA implementations (Hibernate/EclipseLink/OpenJPA) provide object-relational mapping so that you work with objects, while their fields and relations are persisted in an RDBMS.
Using ByteArrayOutputStream should be a simple enough way to convert to a byte[] (call toByteArray after you've flushed). Alternatively there is Blob.setBinaryStream (which actually returns an OutputStream).
You might also want to reconsider using the database as a database...
e.g. create a ByteArrayOutputStream and pass it to the ObjectOutputStream constructor.
One thing to add to this: Java serialization is a good, general-purpose tool; however, it can be a bit verbose. You might want to try gzipping the serialized data. You can do this by putting a GZIP stream between the object stream and the byte stream. This uses a small amount of extra CPU, but that is often a worthwhile trade-off against shipping the extra bytes over the network and shoving them into a DB.
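A minimal sketch of that stream sandwich:
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPOutputStream;

ByteArrayOutputStream bos = new ByteArrayOutputStream();
// Compress the object stream on its way into the byte stream
ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos));
oos.writeObject(myObject);
oos.close(); // closing flushes the object stream and finishes the GZIP stream
byte[] compressed = bos.toByteArray();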
