I'm currently using the Kite API + Avro to map Java objects to HBase, but due to various problems I'm looking for an alternative.
I've been reading about:
Phoenix
The native HBase API.
But are there other alternatives?
The idea is to save and load the Java objects to HBase and use them in a Java application.
If you're storing your objects in the Value portion of the KeyValue pair, then it's really just an array/sequence of bytes (i.e. the KeyValue class has a getValue method which returns a byte array).
At this point, you're down to object serialization, and there is a host of libraries you can use, with varying ease of use, performance characteristics, and implementation details. Avro is one type of serialization library, which stores the schema with each record, but in theory you could use:
Standard Java serialization (implement Serializable)
Kryo
Protobuf
Just to name a few. You may want to investigate each library's strengths and tradeoffs and balance them against the kind of objects you plan to store (are they all effectively the same type of object, or do they vary widely in type? Will they be long-lived, i.e. years, with an expectation of schema evolution and backwards compatibility?).
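As a baseline, here is a minimal sketch of the first option, standard Java serialization, producing a byte[] suitable for an HBase cell value (exception handling omitted):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

// myObject is any instance of a class that implements java.io.Serializable
ByteArrayOutputStream bos = new ByteArrayOutputStream();
try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
    oos.writeObject(myObject);
}
byte[] value = bos.toByteArray(); // use as the HBase cell value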
Phoenix is a JDBC API on top of HBase. It handles most SQL types (except intervals), and you can store arbitrary Java objects using the binary data type. But if you are only storing binary data, you could easily stick with plain HBase. If you can coerce your data into standard SQL types, Phoenix may be a good option.
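For illustration, a minimal sketch of the Phoenix route via plain JDBC; the table, columns, and ZooKeeper quorum here are hypothetical, and exception handling is omitted:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// connect through the Phoenix JDBC driver (URL format: jdbc:phoenix:<zk quorum>)
Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
PreparedStatement ps = conn.prepareStatement(
        "UPSERT INTO PERSON (ID, NAME, AGE) VALUES (?, ?, ?)");
ps.setLong(1, 1L);
ps.setString(2, "Alice");
ps.setInt(3, 30);
ps.executeUpdate();
conn.commit(); // Phoenix connections do not auto-commit by default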
If you want to stick with the Hadoop/HBase code, you can have your complex class implement org.apache.hadoop.io.Writable:
import org.apache.hadoop.io.WritableUtils;

// Some complex Java object that implements
// org.apache.hadoop.io.Writable
SomeObject myObject = new SomeObject();

// serialize the object to a byte array
// for storage in HBase
byte[] byteArr = WritableUtils.toByteArray(myObject);
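Reading the object back is the mirror image; a sketch, assuming SomeObject has a no-arg constructor:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

// recreate the object and let it read its
// fields back out of the byte array
SomeObject restored = new SomeObject();
restored.readFields(new DataInputStream(new ByteArrayInputStream(byteArr)));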
I am using Redis as a centralized cache for a distributed system. Currently I am using Jedis to connect to the Redis cluster, and I am storing values as byte[] instead of String. My question is: does storing a plain String versus a byte[] have an impact on reading the data back? In my application I serialize my Java POJO, convert it to a byte[] and store that, whereas I could convert it to JSON and store that instead, so that when reading it from Redis I could readily use the object without deserializing. I have tried both, but the only difference I can see is the extra deserialization step.
In Redis, everything is a byte[]. What Redis calls strings are actually byte[] in programming languages.
When you store JSON, you still need to serialize it to a byte[] before saving it to Redis, and do the reverse when you read it back. This is no different from serializing a Java object. In other words, you always pay the cost of serialization and deserialization.
That said, different libraries have different serialization costs. Java serialization is known to be slow and inefficient. JSON is likely to be better than Java serialization, but wastes memory in Redis because it is text-based. You can choose a better serialization library.
Kryo is a faster replacement for the Java serializer. MessagePack is like JSON but faster. Protocol Buffers / FlatBuffers are even better, but require you to declare a schema upfront. There are other serialization formats as well, each with its own tradeoffs.
The general recommendation: try to use the hash data type. It is efficient, and it lets you request specific fields instead of the whole object. Only if a hash does not work for you should you pick something else based on your needs.
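For illustration, a minimal sketch of the hash approach with Jedis; the key and field names are made up:

import java.util.Map;
import redis.clients.jedis.Jedis;

try (Jedis jedis = new Jedis("localhost", 6379)) {
    // store the object as one hash, one field per attribute
    jedis.hset("person:1", "name", "Alice");
    jedis.hset("person:1", "age", "30");

    // fetch a single field without pulling the whole object
    String name = jedis.hget("person:1", "name");

    // or fetch everything at once
    Map<String, String> all = jedis.hgetAll("person:1");
}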
P.S. If you are into benchmarks, this wiki has several: https://github.com/eishay/jvm-serializers/wiki
I have a class like below:
public class Person {
    public String name;
    public String age;
}
I am a bit confused over the approach to saving a Map of Persons into Redis:
Should I go for the Java serialize/deserialize approach, or should I try converting to JSON before storing, and back again when reading?
Any thoughts on the points below:
Cost of serialization and deserialization vs. the cost of mapping to Java and to JSON
Memory requirements of JSON vs. a serialized object in Redis
Compression: stream vs. data
Which compression should we go for?
Though data compression seems a bit difficult (not very beneficial), as we are using a Redis hash
Some of the assumptions are:
The POJO contains many instance variables
A Redis hash will be used to store the object
You should consider using MessagePack, as it is fully compatible with Redis and Lua, and it offers great compression compared to JSON: http://msgpack.org/
It requires some Lua code to pack and unpack, but that cost should be small. Here is an example: http://gists.fritzy.io/2013/11/06/store-json-as-msgpack
There is a small benchmark, which unfortunately lacks data: https://gist.github.com/muga/1119814
Still, it should be a great option for you, as you can use it from different languages, it is fully supported in Redis, and it is based on JSON.
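On the Java side, a minimal sketch with the msgpack-java library (org.msgpack:msgpack-core); the field names are illustrative and exception handling is omitted:

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;
import org.msgpack.core.MessageUnpacker;

// pack a two-field object into a compact binary map
MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
packer.packMapHeader(2);
packer.packString("name");
packer.packString("Alice");
packer.packString("age");
packer.packString("30");
byte[] bytes = packer.toByteArray(); // store this in Redis

// unpack it again after reading the bytes back
MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(bytes);
int fields = unpacker.unpackMapHeader();
for (int i = 0; i < fields; i++) {
    String key = unpacker.unpackString();
    String value = unpacker.unpackString();
}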
The answer is: you should measure it for your use cases and environment. I would first try JSON, as it's more versatile and less problematic, i.e. easier to debug and to restore corrupted data.
Performance. JSON serialization is fast, so in many scenarios it won't be your bottleneck; most probably that is disk or network IO (see: java serialization benchmarking). Avoid default Java serialization, as it is slow. Kryo is an option for binary output. If you need multiple platforms for a binary format, consider the DB's internal format or, for example, Google Protocol Buffers.
Compression. At Google they use Snappy for less-CPU-demanding compression. Snappy is also used in Cassandra, Hadoop and Hypertable. Some benchmarks for JVM compressors: Compression test using the Calgary corpus data set.
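For example, with the snappy-java library (org.xerial:snappy-java), compression is a one-liner either way; serializedBytes stands for the object already serialized to a byte[], and exception handling is omitted:

import org.xerial.snappy.Snappy;

// compress a serialized object before writing it out,
// and decompress after reading it back
byte[] compressed = Snappy.compress(serializedBytes);
byte[] restored = Snappy.uncompress(compressed);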
MongoDB gives the ability to write documents of any structure, i.e. any number and types of key/value pairs can be written. Assuming that I use this feature and my documents are indeed schema-less, how do I manage reads? Basically, how does the application code (I'm using Java) manage reads from the database?
The Java driver reads and writes documents as BasicBSONObjects, which implement and are used as Map<String, Object>. Your application code is then responsible for reading this map and casting the values to the appropriate types.
A mapping framework like Morphia or Spring MongoDB can help you to convert BSONObject to your classes and vice versa.
When you want to do this yourself, you could use a factory method which takes a BasicBSONObject, checks which keys and values it has, and uses this information to create an object of the appropriate class and return it.
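A minimal sketch of such a factory method, reusing the Person class from the earlier question; the keys checked here are illustrative:

import org.bson.BasicBSONObject;

public static Object fromBson(BasicBSONObject doc) {
    // decide which class to instantiate from the keys present
    if (doc.containsField("name") && doc.containsField("age")) {
        Person p = new Person();
        p.name = doc.getString("name");
        p.age = doc.getString("age");
        return p;
    }
    throw new IllegalArgumentException("Unrecognized document: " + doc.keySet());
}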
Note: Due to the lack of questions like this on SO, I've decided to put one up myself as a Q&A
Serializing objects (using an ObjectOutputStream and an ObjectInputStream) is a method of storing an instance of a Java object as data that can later be deserialized for use. This can cause problems and frustration when the class used to deserialize the data does not remain the same (source-code changes; program updates).
So how can an Object be serialized and deserialized with an updated / downgraded version of a Class?
Here are a few common ways of serializing an object that can be deserialized in a backwards-compatible way.
1. Store the data in the JSON format using import and export methods designed to save all fields needed to recreate the instance. This can be made backwards-compatible by including a version key that allows an update algorithm to be called if the version is too old. A common library for this is Google's Gson, which can represent Java objects in JSON as well as edit JSON files directly.
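A minimal sketch of the version-key idea with Gson; savedJson is the stored JSON string, and CURRENT_VERSION and upgrade() are hypothetical names for your own version constant and migration routine:

import com.google.gson.Gson;
import com.google.gson.JsonObject;

// parse the stored JSON generically first, so the version can be inspected
Gson gson = new Gson();
JsonObject json = gson.fromJson(savedJson, JsonObject.class);

// run a migration step if the stored version is older than the code
int version = json.get("version").getAsInt();
if (version < CURRENT_VERSION) {
    upgrade(json, version); // hypothetical migration routine
}
Person person = gson.fromJson(json, Person.class);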
2. Use the built-in Java Properties class in a way similar to the method described above. Properties objects can later be stored using a stream (store()), written as a regular Java properties file, or saved in XML (storeToXML()).
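A sketch of the same idea with Properties, again with an explicit version key; person is the instance being saved, and exception handling is omitted:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

// write the object's fields plus a version key
Properties props = new Properties();
props.setProperty("version", "2");
props.setProperty("name", person.name);
props.setProperty("age", person.age);
try (FileOutputStream out = new FileOutputStream("person.properties")) {
    props.store(out, "Person state");
}

// later: read it back and branch on the version before reconstructing
Properties loaded = new Properties();
try (FileInputStream in = new FileInputStream("person.properties")) {
    loaded.load(in);
}
int version = Integer.parseInt(loaded.getProperty("version", "1"));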
3. Sometimes simple objects can easily be represented by key-value pairs in a place where storing them in a JSON, XML, or Properties file is either too complicated or simply not necessary (overkill, one could say). In this case, an effective way of serializing the object is to use the ObjectOutputStream class to serialize a HashMap<String, Object> of the object's fields, keyed by String. This allows all of the object's fields to be stored, along with a version key, while providing a lot of versatility.
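A sketch of the HashMap variant; the keys are illustrative, person is the instance being saved, and exception handling is omitted:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// write the fields into a map and serialize the map itself
HashMap<String, Object> state = new HashMap<>();
state.put("version", 2);
state.put("name", person.name);
state.put("age", person.age);
try (ObjectOutputStream oos =
        new ObjectOutputStream(new FileOutputStream("person.ser"))) {
    oos.writeObject(state);
}

// later: read the map back and rebuild the object from it
try (ObjectInputStream ois =
        new ObjectInputStream(new FileInputStream("person.ser"))) {
    @SuppressWarnings("unchecked")
    HashMap<String, Object> loaded = (HashMap<String, Object>) ois.readObject();
    int version = (Integer) loaded.get("version");
}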
Note: Although serializing an object using the ObjectOutputStream for persistent storage is normally considered bad practice, it can be used either way as long as the class' source code remains the same.
Also note, about versioning: changes to a class can safely be made without disrupting deserialization with an ObjectInputStream, as long as they are compatible changes. As mentioned in the Versioning of Serializable Objects chapter of the Object Serialization Specification:
A compatible change is a change that does not affect the contract
between the class and its callers.
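For instance, adding a field is a compatible change, as long as the class pins its serialVersionUID; a sketch:

import java.io.Serializable;

public class Person implements Serializable {
    // fixed ID: old serialized data stays readable after compatible changes
    private static final long serialVersionUID = 1L;

    public String name;
    public String age;

    // added in a later release; deserializing old data
    // simply leaves this field null (a compatible change)
    public String email;
}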
The MongoDB Java driver documentation for the package org.bson mentions various "Lazy" versions of other classes. Unfortunately, the Javadocs of these classes can barely be called documentation.
What is their purpose and how does their behavior differ from the normal versions?
Under normal operation, the driver creates and consumes documents through the DBObject Map-like interface. When inserting documents, it iterates over the map to convert it to the corresponding BSON representation. When querying, it creates new documents by putting key-value pairs into the map.
But there are times when you want to work with raw BSON and not pay the cost of all this serialization and deserialization. That's what the lazy DBObject implementations are for. Instead of treating them as a map, a custom encoder instead writes the bytes directly to the BSON stream. Similarly, a custom decoder writes the raw bytes directly into the lazy DBObject.
In this context, the meaning of the term lazy is that, since the lazy equivalents still have to implement the DBObject interface, they do so by "lazily" interpreting the raw BSON byte array that they contain.
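A minimal sketch with the 2.x-era classes; readRawBsonBytes() is a hypothetical source of raw BSON bytes:

import org.bson.LazyBSONCallback;
import org.bson.LazyBSONObject;

// wrap the raw bytes without eagerly decoding the whole document
byte[] rawBson = readRawBsonBytes();
LazyBSONObject doc = new LazyBSONObject(rawBson, new LazyBSONCallback());

// this field is decoded on demand, straight from the byte array
Object name = doc.get("name");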
One last note: the lazy DBObject classes are very likely not going to be included in the upcoming 3.0 release of the driver, as the entire serialization infrastructure is changing in a way that is not compatible with lazy DBObjects. There will be equivalent functionality, but it will not be API-compatible.