Avro vs Protobuf Performance - java

I wrote a JMH benchmark to compare the serialization performance of Avro (1.8.2) and Protobuf (3.5.0) on Java 1.8. According to JMH, Protobuf can serialize my test data about 4.7 million times per second, whereas Avro can only manage about 800k per second.
The serialized test data is around 200 bytes, and I generated schemas for both Avro and Protobuf.
Here is my Avro serialization code; can someone familiar with Avro check that I haven't made some cardinal mistake?
The method called serialize is what JMH benchmarks. I have also posted this at https://groups.google.com/forum/#!topic/protobuf/skmE78F-XbE
Many Thanks
public final class AvroSerialization {

    private BinaryEncoder encoder;
    private final SpecificDatumWriter<AvroGeneratedClass> writer;

    public AvroSerialization() {
        this.writer = new SpecificDatumWriter<>( AvroGeneratedClass.class );
    }

    // MyDataObject = a POJO that contains the data to be serialized
    public final byte[] serialize( MyDataObject data ) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream( 1024 );
        encoder = EncoderFactory.get().binaryEncoder( out, encoder );
        AvroGeneratedClass avroData = createAvro( data );
        writer.write( avroData, encoder );
        encoder.flush();
        return out.toByteArray();
    }

    // AvroGeneratedClass = class generated from the Avro schema
    public static AvroGeneratedClass createAvro( MyDataObject data ) {
        AvroGeneratedClass avroData = AvroGeneratedClass.newBuilder()
            .setXXX( data.getXXX() )
            .setXXX( data.getXXX() )
            ...
            .build();
        return avroData;
    }
}

Avro always serializes data together with its schema.
With Protobuf the server assumes the client already knows the schema, so it just serializes the data to the binary format.
For transactional workloads Protobuf is usually the better fit.
Avro is usually better for analytical workloads where you need to serialize a huge number of records; in that case the cost of serializing the schema is often negligible, and the Avro encoding is slightly more compact.
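As a rough illustration of the analytical case: when many records go into a single Avro object container file, the schema is written once in the file header and only the binary record data repeats, so its cost is amortized over the whole batch. A minimal sketch, reusing the AvroGeneratedClass placeholder name from the question (not code from the question or the benchmark):

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class AvroBatchWrite {
    // Write many records into one Avro container file: the schema is stored
    // once in the file header, and each record after that is pure binary.
    public static void writeAll(Iterable<AvroGeneratedClass> records, File target) throws IOException {
        SpecificDatumWriter<AvroGeneratedClass> datumWriter =
                new SpecificDatumWriter<>(AvroGeneratedClass.class);
        try (DataFileWriter<AvroGeneratedClass> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(AvroGeneratedClass.getClassSchema(), target);
            for (AvroGeneratedClass record : records) {
                fileWriter.append(record);
            }
        }
    }
}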

Related

Creating test data from Confluent Control Center JSON representation

I'm trying to write some unit tests for Kafka Streams and have a number of quite complex schemas that I need to incorporate into my tests.
Instead of just creating objects from scratch each time, I would ideally like to instantiate them from some real data and run the tests on that. We use Confluent with records in Avro format, and we can extract both the schema and a text JSON-like representation from the Control Center application. The JSON is valid JSON, but it's not really in the form you'd write if you were just writing JSON representations of the data, so I assume it's some text representation of the underlying Avro.
I've already used the schema to create a Java SpecificRecord class (price_assessment) and would like to use the JSON string copied from the Control Center message to populate a new instance of that class to feed into my unit test InputTopic.
The code I've tried so far is
var testAvroString = "{JSON copied from Control Center topic}";

Schema schema = price_assessment.getClassSchema();
DecoderFactory decoderFactory = new DecoderFactory();
Decoder decoder = null;
try {
    DatumReader<price_assessment> reader = new SpecificDatumReader<price_assessment>();
    decoder = decoderFactory.get().jsonDecoder(schema, testAvroString);
    return reader.read(null, decoder);
} catch (Exception e) {
    return null;
}
which is adapted from another SO answer that was using GenericRecords. When I try running this though I get the exception Cannot invoke "org.apache.avro.Schema.equals(Object)" because "writer" is null on the reader.read(...) step.
I'm not massively familiar with Streams testing or Java, and I'm not sure what exactly I've done wrong. This is written in Java 17 with Kafka Streams 3.1.0, though I'm flexible on versions.
The solution that I've managed to come up with is the following, which seems to work:
private static <T> T avroStringToInstance(Schema classSchema, String testAvroString) {
    DecoderFactory decoderFactory = new DecoderFactory();
    GenericRecord genericRecord = null;
    try {
        Decoder decoder = decoderFactory.jsonDecoder(classSchema, testAvroString);
        DatumReader<GenericData.Record> reader = new GenericDatumReader<>(classSchema);
        genericRecord = reader.read(null, decoder);
    } catch (Exception e) {
        return null;
    }

    var specific = (T) SpecificData.get().deepCopy(genericRecord.getSchema(), genericRecord);
    return specific;
}
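For reference, a hypothetical call site for that helper (testInputTopic stands in for the TestInputTopic mentioned in the question, and the JSON string is the placeholder from the question):

// Rebuild a specific record from the Control Center JSON...
var testAvroString = "{JSON copied from Control Center topic}";
price_assessment record = avroStringToInstance(price_assessment.getClassSchema(), testAvroString);

// ...and feed it into the Streams test topology.
testInputTopic.pipeInput(record);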

Alternative to the ProtoBuff encoding for REST?

I'm looping through Java POJOs (fetched from the database) and using the protostuff library to convert each one into a byte[], then decoding that again into Protobuf classes so that I can send the response with content-type application/x-protobuf.
My biggest concern is the following block, where I need to encode each row:
for (InstrumentHistory instrumentHistory : instrumentHistoryRepository.getAllInstrumentHistorys()) {
    Schema<InstrumentHistory> schema = RuntimeSchema.getSchema(InstrumentHistory.class);
    LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
    final byte[] protostuff;
    try {
        protostuff = ProtostuffIOUtil.toByteArray(instrumentHistory, schema, buffer);
        instrumentHistoryProtos.add(InstrumentHistoryProto.InstrumentHistory.parseFrom(protostuff));
    } finally {
        buffer.clear();
    }
}
return InstrumentHistoryProto.InstrumentHistorys.newBuilder().addAllInstrumentHistory(instrumentHistoryProtos).build();
Is there any workaround or simple way to do this?
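A minimal sketch of one simplification, assuming the protostuff calls behave as in the snippet above: the runtime schema and the LinkedBuffer do not depend on the row, so they can be created once and the buffer reused per iteration (identifiers are taken from the question; untested against the original repository code):

// Create the schema and the buffer once; only the per-row bytes change.
Schema<InstrumentHistory> schema = RuntimeSchema.getSchema(InstrumentHistory.class);
LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
List<InstrumentHistoryProto.InstrumentHistory> instrumentHistoryProtos = new ArrayList<>();

for (InstrumentHistory instrumentHistory : instrumentHistoryRepository.getAllInstrumentHistorys()) {
    final byte[] protostuff;
    try {
        protostuff = ProtostuffIOUtil.toByteArray(instrumentHistory, schema, buffer);
    } finally {
        buffer.clear(); // the LinkedBuffer must be cleared before it is reused
    }
    instrumentHistoryProtos.add(InstrumentHistoryProto.InstrumentHistory.parseFrom(protostuff));
}
return InstrumentHistoryProto.InstrumentHistorys.newBuilder()
        .addAllInstrumentHistory(instrumentHistoryProtos)
        .build();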

Fastest way to read a large XML file in Java

I'm working on a Java project to optimize existing code. Currently I'm using BufferedReader/FileInputStream to read the content of an XML file into a String in Java.
But my question is: is there any faster way to read XML content? Are SAX/DOM faster than BufferedReader/FileInputStream?
Need help regarding the above issue.
Thanks in advance.
I think the code you showed in your other question is faster than DOM-like parsers, which would definitely require more memory and likely some extra computation to reconstruct the document in full. You may want to profile the code, though.
I also think your code could be tidied up for streaming processing by using the javax.xml.stream XMLStreamReader, which I have found quite helpful for many tasks. According to Oracle, that class "is designed to be the lowest level and most efficient way to read XML data".
Here is the excerpt from my code where I parse StackOverflow users XML file distributed as a public data dump:
// the input file location
private static final String fileLocation = "/media/My Book/Stack/users.xml";

// the target elements
private static final String USERS_ELEMENT = "users";
private static final String ROW_ELEMENT = "row";

// get the XML file handler
//
FileInputStream fileInputStream = new FileInputStream(fileLocation);
XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(
        fileInputStream);

// reading the data
//
while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();

    // this triggers _users records_ logic
    //
    if ((XMLStreamConstants.START_ELEMENT == eventCode)
            && xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {

        // read and parse the user data rows
        //
        while (xmlStreamReader.hasNext()) {
            eventCode = xmlStreamReader.next();

            // this breaks _users record_ reading logic
            //
            if ((XMLStreamConstants.END_ELEMENT == eventCode)
                    && xmlStreamReader.getLocalName().equalsIgnoreCase(USERS_ELEMENT)) {
                break;
            } else {
                if ((XMLStreamConstants.START_ELEMENT == eventCode)
                        && xmlStreamReader.getLocalName().equalsIgnoreCase(ROW_ELEMENT)) {

                    // extract the user data
                    //
                    User user = new User();
                    int attributesCount = xmlStreamReader.getAttributeCount();
                    for (int i = 0; i < attributesCount; i++) {
                        user.setAttribute(xmlStreamReader.getAttributeLocalName(i),
                                xmlStreamReader.getAttributeValue(i));
                    }

                    // all other user record-related logic
                    //
                }
            }
        }
    }
}
That users file format is quite simple and similar to your Bank.xml file:
<users>
<row Id="1567200" Reputation="1" CreationDate="2012-07-31T23:57:57.770" DisplayName="XXX" EmailHash="XXX" LastAccessDate="2012-08-01T00:55:12.953" Views="0" UpVotes="0" DownVotes="0" />
...
</users>
There are different parser options available.
Consider using a streaming parser, because a DOM may become quite big; that is, use either a push parser or a pull parser.
It's not as if XML parsers are necessarily slow. Consider your web browser: it does XML parsing all the time and tries really hard to be robust to syntax errors. Usually, memory is the bigger issue.
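For contrast with the pull-style XMLStreamReader excerpt above, here is a minimal push-style (SAX) sketch against the same users.xml layout; the file path and attribute names are taken from the excerpt, everything else is illustrative:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class UsersSaxExample {
    public static void main(String[] args) throws Exception {
        // The parser "pushes" events into the handler callbacks as it reads the file.
        SAXParserFactory.newInstance().newSAXParser().parse(
                new java.io.File("/media/My Book/Stack/users.xml"),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        if ("row".equalsIgnoreCase(qName)) {
                            // read the attributes of one <row .../> record
                            String id = attributes.getValue("Id");
                            String displayName = attributes.getValue("DisplayName");
                            // ... all other row-related logic
                        }
                    }
                });
    }
}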

Save my Objects through serialization or make an XML file

I want to build an application; it's my first time, so I have run into a dilemma. In my application I have Persons and Projects, and each one has its own attributes. A project is done by several persons, and each Project has a Coordinator.
public class Person
{
    private String firstName;
    private String lastName;
    private String mailAddress;
    private String ID;
    // more
}

also I have a Coordinator person:

public class Coordinator extends Person
{
    private String type;
    // more code
}

and then I have projects:

public class Project
{
    private String projectInfo;
    private String nameOfProject;
    private int projectID;
    // more code
}
My dilemma is this: should I store all the objects in a list or HashMap and write them to disk through object serialization, or should I make an XML representation (like below) and then read the XML back with a DOM parser? With the XML approach, every time I run the application I will have to recreate my objects, right? With serialization, on the other hand, I will just read my objects back from disk.
<project>
    <active></active>
    <complete></complete>
    <name></name>
    <info></info>
    <coordinator></coordinator>
    <level1> <!-- each project is distributed to different levels -->
        <cordinator> </cordinator>
        <budget></budget>
        <startDate></startDate>
        <endDate></endDate>
        <totalTasks> </totalTasks>
        <complete></complete>
        <task1>
            <cordinator> </cordinator>
            <personInvolved></personInvolved>
            <personInvolved></personInvolved>
            <personInvolved></personInvolved>
            <personInvolved></personInvolved>
            <budget></budget>
            <startDate></startDate>
            <endDate></endDate>
            <complete></complete>
        </task1>
        <task2>
            <!-- same as task1 -->
        </task2>
    </level1>
    <level2>
        <!-- same as level1 -->
    </level2>
</project>
The choice between serializing with Java or with XML doesn't depend on where you'd like to serialize it to; both can be saved to a file. Java serialization depends on the fact that every process reading the file is a Java program (nothing besides a Java program can read that format). XML, however, is about interoperability: any type of program can read an XML file and load that data through some sort of library (JAXB or non-Java XML serialization libraries).
Maintaining compatibility of serialized Java objects as the class changes can be troublesome, but it's not insurmountable; I don't try it, though. If that is a factor for you, you might want to consider XML even if only Java programs are going to read it.
So the issue is: who needs to read the data you're writing, and what are your needs for writing it in the first place?
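To make the XML option concrete, here is a minimal JAXB sketch built around the Project fields from the question. It assumes the javax.xml.bind API is available (bundled with Java 8, a separate dependency on newer JDKs), and the file name is just an example:

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD)
class Project {
    private String projectInfo;
    private String nameOfProject;
    private int projectID;

    // a no-arg constructor is required by JAXB
    Project() { }

    Project(String info, String name, int id) {
        this.projectInfo = info;
        this.nameOfProject = name;
        this.projectID = id;
    }
}

public class ProjectXmlDemo {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Project.class);
        File file = new File("project.xml");

        // write the object as XML that any program can read
        context.createMarshaller().marshal(new Project("info", "MyProject", 1), file);

        // read it back into a new object
        Project restored = (Project) context.createUnmarshaller().unmarshal(file);
    }
}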
I think the best way is serialization; you can write out the objects you want and easily read them back.
Here is a useful example
Here are the official docs from Oracle
Serialize Person p:
try {
    FileOutputStream fileOut = new FileOutputStream("/tmp/your_filename.ser");
    ObjectOutputStream out = new ObjectOutputStream(fileOut);
    out.writeObject(p);
    out.close();
    fileOut.close();
} catch (IOException i) {
    // error management
}

Deserialize Person p:

try {
    FileInputStream fileIn = new FileInputStream("/tmp/your_filename.ser");
    ObjectInputStream in = new ObjectInputStream(fileIn);
    p = (Person) in.readObject();
    in.close();
    fileIn.close();
} catch (IOException i) {
    // error management
} catch (ClassNotFoundException c) {
    // error management
}
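One assumption worth making explicit: this only works if Person implements java.io.Serializable, and ideally declares a serialVersionUID so that later class changes don't break files written earlier. Applied to the Person class from the question:

import java.io.Serializable;

public class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    private String firstName;
    private String lastName;
    private String mailAddress;
    private String ID;
    // more
}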

TSerializer serializer = new TSerializer() in C#

Is there any equivalent of TSerializer in the Thrift C# API?
I am trying to use Thrift serialization and then push the serialized object onto a message queue, rather than using the Thrift transport mechanism. On the other end I'll deserialize it back into the actual message.
I can do it in Java but not in C#.
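For reference, the Java side the question alludes to looks roughly like this with the Apache Thrift Java library (LogMessage is just a placeholder name for whatever generated struct is being queued):

import org.apache.thrift.TSerializer;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TBinaryProtocol;

// Serialize a generated Thrift struct to a byte[] for the message queue...
TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
byte[] payload = serializer.serialize(logMessage);

// ...and turn the bytes back into a struct on the consuming side.
TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
LogMessage received = new LogMessage();
deserializer.deserialize(received, payload);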
The Apache Thrift C# library doesn't have a TSerializer presently. However it does have a TMemoryBuffer (essentially a transport that reads/writes memory) which works perfectly for this kind of thing. Create a TMemoryBuffer, construct a protocol (like TBinaryProtocol) and then serialize your messages and send them as blobs from the TMemoryBuffer.
For example:
TMemoryBuffer trans = new TMemoryBuffer(); //Transport
TProtocol proto = new TCompactProtocol(trans); //Protocol
PNWF.Trade trade = new PNWF.Trade(initStuff); //Message type (thrift struct)
trade.Write(proto); //Serialize the message to memory
byte[] bytes = trans.GetBuffer(); //Get the serialized message bytes
//SendAMQPMsg(bytes); //Send them!
To receive the message you just do the reverse. TMemoryBuffer has a constructor you can use to set the received bytes to read from.
public TMemoryBuffer(byte[] buf);
Then you just call your struct Read() method on the read side I/O stack.
This isn't much more code (maybe less) than using the Java TSerializer helper and it is a bit more universal across Apache Thrift language libraries. You may find TMemoryBuffer is the way to go everywhere!
Credit due to the other answer on this page, and from here:
http://www.markhneedham.com/blog/2008/08/29/c-thrift-examples/
Rather than expecting everyone to take the explanations and write their own functions, here are two functions to serialize and deserialize generalized thrift objects in C#:
public static byte[] serialize(TBase obj)
{
    var stream = new MemoryStream();
    TProtocol tProtocol = new TBinaryProtocol(new TStreamTransport(stream, stream));
    obj.Write(tProtocol);
    return stream.ToArray();
}

public static T deserialize<T>(byte[] data) where T : TBase, new()
{
    T result = new T();
    var buffer = new TMemoryBuffer(data);
    TProtocol tProtocol = new TBinaryProtocol(buffer);
    result.Read(tProtocol);
    return result;
}
There is an RPC framework named "Thrifty" that uses the standard Thrift protocol. It has the same effect as defining the service with Thrift IDL, i.e. Thrifty can interoperate with code that uses Thrift IDL, and it includes a serializer:
[ThriftStruct]
public class LogEntry
{
    [ThriftConstructor]
    public LogEntry([ThriftField(1)] String category, [ThriftField(2)] String message)
    {
        this.Category = category;
        this.Message = message;
    }

    [ThriftField(1)]
    public String Category { get; }

    [ThriftField(2)]
    public String Message { get; }
}
ThriftSerializer serializer = new ThriftSerializer(ThriftSerializer.SerializeProtocol.Binary);
byte[] bytes = serializer.Serialize(new LogEntry("category", "message"));
LogEntry logEntry = serializer.Deserialize<LogEntry>(bytes);
more detail: https://github.com/endink/Thrifty
