Dynamic recognition and handling of protocol data units in bytestream


I implemented a protocol for a little multiplayer game. It was based on bytes, so to deserialize the received messages I had to iterate over the byte stream and parse it bit by bit. After I had all the bytes and knew the message type, I threw the bytes in a reverse constructor that constructed the protocol data unit from the raw bytes.
This whole process was very ugly, not really OO and had unreadable if/else code.
I had to implement the reverseConstructor(byte[] bytes) for every protocol data unit (pdu) I added. An approach where some kind of schema is defined per pdu (e. g. schema = [1 byte int (id = x), x bytes ascii string, 4 bytes double]), and where the handling of the bytes is done with that schema, would be more elegant.
I got a hint here on SO to use google's protobufs (Apparently they are not fitting my needs, since I would have to change the protocol to adhere to protobuf standards).
INFO
I can't change the protocol. There are two different scenarios (I don't want to support them at the same time or even in the same program):
The protocol data units have a length field encoded in the header
The protocol data units have no length field, but one can derive from the message type when/where the message ends.
I personally am a fan of length fields. But sometimes you have to adhere to a protocol that somebody else designed. So the protocols are fixed. They all have a header which contains the protocol id, the unique message id, and, in the first scenario, a length field.
QUESTION
Can anyone give me a very small example with two simple protocol data units that are parsed by an efficient, generic receive method? The only examples I found in the protobuf tutorials were of the type: user A sends message X, user B expects message X and can deserialize it without a problem.
But what if user B has to be prepared for messages X, Y and Z? How can one handle this situation intelligently, without much code duplication?
I would also appreciate hints on design principles that would let me write better code here without using an external library.
EDIT
I think something like this is the way to go. You can find more of the code here.
The bytes are read dynamically till an object is found, and then the position of the buffer is reset.
while (true) {
if (buffer.remaining() < frameLength) {
buffer.reset();
break;
}
if (frameLength > 0) {
Object resultObj = prototype.newBuilderForType().mergeFrom(buffer.array(), buffer.arrayOffset() + buffer.position(), frameLength).build();
client.fireMessageReceived(resultObj);
buffer.position(buffer.position() + frameLength);
buffer.mark();
}
if (buffer.remaining() > fieldSize) {
frameLength = getFrameLength(buffer);
} else {
break;
}
}
JavaDoc - mergeFrom
Parse data as a message of this type and merge it with the message being built. This is just a small wrapper around MessageLite.Builder.mergeFrom(CodedInputStream).
https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/Message.Builder#mergeFrom(byte[])
The problem is the part message of this type, but it should be possible to address this issue with a generic approach.
SAMPLE
Here is a sample protocol data unit. It has a length field. There is another scenario where the pdus have no length field. This pdu is of variable size. There are also pdus of fixed size.
For completeness' sake. Here the representation of strings in the protocol data units.

1. Protocol design
Frankly, it is a common mistake to create a first protocol implementation without any further changes in mind. Just as an exercise, let's try to design a flexible protocol.
Basically, the idea is to have several frames encapsulated in each other. Please note that a Payload ID is available, so it is easy to identify the next frame in the sequence.
You can use Wireshark to see that real-life protocols usually follow the same principle.
Such an approach simplifies packet dissection a lot, but it is still possible to deal with other protocols.
2. Protocol decoding(dissection)
I spent quite a lot of time developing a next-generation network analyzer for my previous company.
I can't expose all the details, but one of the key features was a flexible protocol stack capable of identifying protocol frames. RTP is a good example, because there is no hint on the lower layer (usually UDP) that the next frame is an RTP frame. A special VM was developed to execute dissectors and control the process.
The good news is that I have smaller personal projects with Java-based dissectors (I'll skip some javadocs in order to save several lines).
/**
* High-level dissector contract definition. Dissector is meant to be a simple
* protocol decoder, which analyzes protocol binary image and produces number
* of fields.
*
* @author Renat.Gilmanov
*/
public interface Dissector {
/**
* Returns dissector type.
*/
DissectorType getType();
/**
* Verifies packet data belongs to the protocol represented by this dissector.
*/
boolean isProtocol(DataInput input, Dissection dissection);
/**
* Performs the dissection.
*/
Dissection dissect(DataInput input, Dissection dissection);
/**
* Returns a protocol which corresponds to the current dissector.
*
* @return a protocol instance
*/
Protocol getProtocol();
}
A Protocol itself knows its upper-layer protocols, so when no direct hint is available it is possible to iterate through the known protocols and use the isProtocol method to identify the next frame.
public interface Protocol {
// ...
List<Protocol> getUpperProtocols(); }
As I said RTP protocol is a bit tricky to handle:
So let's check implementation details. Verification is based on several known facts about the protocol:
/**
* Verifies current frame belongs to RTP protocol.
*
* @param input data input
* @param dissection initial dissection
* @return true if protocol frame is RTP
*/
@Override
public final boolean isProtocol(final DataInput input, final Dissection dissection) {
int available = input.available();
byte octet = input.getByte();
byte version = getVersion(octet);
byte octet2 = input.getByte(1);
byte pt = (byte) (octet2 & 0x7F);
return ((pt < 0x47) & (RTP_VERSION == version));
}
Dissection is just a set of basic operations:
public final Dissection dissect(DataInput input, Dissection d) {
// --- protocol header --------------------------------
final byte octet1 = input.getByte(0);
final byte version = getVersion(octet1);
final byte p = (byte) ((octet1 & 0x20) >> 5);
final byte x = (byte) ((octet1 & 0x10) >> 4);
final byte cc = (byte) ((octet1 & 0x0F));
//...
// --- seq --------------------------------------------
final int seq = (input.getInt() & 0x0000FFFF);
final int timestamp = input.getInt();
final int ssrc = input.getInt();
Finally you can define a Protocol stack:
public interface ProtocolStack {
String getName();
Protocol getRootProtocol();
Dissection dissect(DataInput input, Dissection dissection, DissectOptions options);
}
Under the hood it handles all the complexity and decodes a packet, frame by frame. The biggest challenge is to make the dissection process bullet-proof and stable. Using this or a similar approach you'll be able to organize your protocol decoding code. It is likely that a proper implementation of isProtocol will let you handle different versions and so on. Anyway, I would not say this approach is simple, but it provides a lot of flexibility and control.
3. Is there any universal solution?
Yes, there is ASN.1:
Abstract Syntax Notation One (ASN.1) is a standard and notation that
describes rules and structures for representing, encoding,
transmitting, and decoding data in telecommunications and computer
networking. The formal rules enable representation of objects that are
independent of machine-specific encoding techniques. Formal notation
makes it possible to automate the task of validating whether a
specific instance of data representation abides by the specifications.
In other words, software tools can be used for the validation.
Here is an example of a protocol defined using ASN.1:
FooProtocol DEFINITIONS ::= BEGIN
FooQuestion ::= SEQUENCE {
trackingNumber INTEGER,
question IA5String
}
FooAnswer ::= SEQUENCE {
questionNumber INTEGER,
answer BOOLEAN
}
END
BTW, there is Java Asn.1 Compiler available:
JAC (Java Asn1 Compiler) is a tool for you if you want to (1)parse
your asn1 file (2)create .java classes and (3)encode/decode instances
of your classes. Just forget all asn1 byte streams, and take the
advantage of OOP! BER, CER and DER are all supported.
Finally
I usually recommend doing several simple PoCs in order to find the best possible solution. I decided not to use ASN.1 in order to reduce complexity and have some room for optimization, but it might help you.
Anyway, try everything you can and let us know about the results :)
You can also check the following topic: Efficient decoding of binary and text structures (packets)
4. Update: bidirectional approach
I'm sorry for quite a long answer. I just want you to have enough options to find the best possible solution. Answering the question regarding a bidirectional approach:
Option 1: you can use a symmetrical serialization approach: define DataOutput, write the serialization logic, and you are done. I would recommend looking through the BerkeleyDB API and TupleBinding. It solves the same problem while providing full control over the store/restore procedure.
This class takes care of converting the entries to/from TupleInput and
TupleOutput objects. Its two abstract methods must be implemented by a
concrete subclass to convert between tuples and key or data objects.
entryToObject(TupleInput)
objectToEntry(Object,TupleOutput)
Option 2: The most universal way is to define your structure containing a set of fields. Every field requires the following information:
name
type
size (bits)
For example, for RTP it will look like the following:
Version: byte (2 bits)
Padding: bool (1 bit)
Extension: bool (1 bit)
CSRC Count: byte (4 bits)
Marker: bool (1 bit)
Payload Type: byte (7 bits)
Sequence Number: int (16 bits)
With that you can define a generic way of reading/writing such structures. The closest working example I know is Javolution Struct. Please look through it; they have really good examples:
class Clock extends Struct { // Hardware clock mapped to memory.
Unsigned16 seconds = new Unsigned16(5); // unsigned short seconds:5 bits
Unsigned16 minutes = new Unsigned16(5); // unsigned short minutes:5 bits
Unsigned16 hours = new Unsigned16(4); // unsigned short hours:4 bits
...
}
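To make Option 2 concrete without pulling in Javolution, here is a minimal sketch of a schema as an ordered list of (name, size in bits) pairs, plus a generic decoder that walks it. The class and method names (BitSchema, field, decode, rtpHeaderStart) are made up for illustration and are not part of any library; bits are assumed to be read most significant bit first, and the type column is omitted (every field is returned as an unsigned long).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical schema: an ordered list of (field name, size in bits) pairs. */
final class BitSchema {
    private final Map<String, Integer> fields = new LinkedHashMap<>();

    /** Appends a field; declaration order is the order on the wire. */
    BitSchema field(String name, int bits) {
        fields.put(name, bits);
        return this;
    }

    /**
     * Decodes all fields in declaration order, most significant bit first.
     * Throws ArrayIndexOutOfBoundsException if data is shorter than the schema.
     */
    Map<String, Long> decode(byte[] data) {
        Map<String, Long> values = new LinkedHashMap<>();
        int bitPos = 0;
        for (Map.Entry<String, Integer> f : fields.entrySet()) {
            long value = 0;
            for (int i = 0; i < f.getValue(); i++, bitPos++) {
                // pick bit number bitPos out of the byte that contains it
                int bit = (data[bitPos >>> 3] >>> (7 - (bitPos & 7))) & 1;
                value = (value << 1) | bit;
            }
            values.put(f.getKey(), value);
        }
        return values;
    }

    /** The first 32 bits of an RTP header, using the field sizes listed above. */
    static BitSchema rtpHeaderStart() {
        return new BitSchema()
                .field("version", 2)
                .field("padding", 1)
                .field("extension", 1)
                .field("csrcCount", 4)
                .field("marker", 1)
                .field("payloadType", 7)
                .field("sequenceNumber", 16);
    }
}
```

Writing would be the mirror image: walk the same field list and shift each value into an output array. Decoding bit by bit is simple rather than fast, so a real implementation would probably read whole bytes where fields are byte-aligned.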

(Note: it's been a while since I've used Java so I wrote this in C#, but you should get the general idea)
The general idea is:
Each of your parsers should be basically represented as an interface, or a delegate (or a method, or a function pointer) with a signature of something like:
interface IParser<T>
{
IParserResult<T> Parse(IIndexable<byte> input);
}
The result of the parsing operation is an instance of the IParserResult<T> interface, which should tell you the following:
Whether the parsing succeeded,
If it failed, why it failed (not enough data to finish parsing, not the right parser, or CRC error, or exception while parsing),
If it succeeded, the actual parsed message value,
If it succeeded, the next parser offset.
In other words, something like:
interface IParserResult<T>
{
bool Success { get; }
ErrorType Error { get; } // in case it failed
T Result { get; } // null if failed
int BytesToSkip { get; } // if success, number of bytes to advance
}
Your parser thread should iterate through a list of parsers and check results. It should look more or less like this:
// presuming inputFifo is a Queue<byte>
while (inputFifo.ContainsData)
{
foreach (IParser parser in ListOfParsers)
{
var result = parser.Parse(inputFifo);
if (result.Success)
{
FireMessageReceived(result.Result);
inputFifo.Skip(result.BytesToSkip);
break;
}
// wrong parser? try the next one
if (result.Error == ErrorType.UnsupportedData)
{
continue;
}
// otherwise handle errors
switch (result.Error)
{
...
}
}
}
The IIndexable<byte> interface is not a part of .NET, but it's rather important for avoiding numerous array allocations (this is the CodeProject article).
The good thing about this approach is that the Parse method can do a whole lot of checks to determine if it "supports" a certain message (check the cookie, length, crc, whatever). We use this approach when parsing data which is constantly being received on a separate thread from unreliable connections, so each parser also returns a "NotEnoughData" error if the length is too short to tell if the message is valid or not (in which case the loop breaks and waits for more data).
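Since the question is about Java, here is a rough Java rendering of the same loop. It is only a sketch: the names (ParseStatus, ParseResult, PduParser, ParserLoop) and the two demo wire formats are invented for illustration, and dropping one byte when no parser recognises the data is an assumption about how you might want to resynchronise, not something the approach above prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

enum ParseStatus { SUCCESS, NOT_ENOUGH_DATA, UNSUPPORTED_DATA }

final class ParseResult {
    final ParseStatus status;
    final Object value;      // null unless status == SUCCESS
    final int bytesConsumed; // 0 unless status == SUCCESS

    private ParseResult(ParseStatus status, Object value, int bytesConsumed) {
        this.status = status;
        this.value = value;
        this.bytesConsumed = bytesConsumed;
    }
    static ParseResult success(Object value, int bytesConsumed) {
        return new ParseResult(ParseStatus.SUCCESS, value, bytesConsumed);
    }
    static ParseResult failure(ParseStatus status) {
        return new ParseResult(status, null, 0);
    }
}

interface PduParser {
    /** Parses from the buffer's current position; the caller passes a duplicate. */
    ParseResult parse(ByteBuffer input);
}

final class ParserLoop {
    /** Extracts as many complete messages as possible; leaves a partial frame in place. */
    static List<Object> drain(ByteBuffer input, List<PduParser> parsers) {
        List<Object> messages = new ArrayList<>();
        while (input.hasRemaining()) {
            ParseResult hit = null;
            for (PduParser parser : parsers) {
                // duplicate() shares the bytes but has its own position
                ParseResult r = parser.parse(input.duplicate());
                if (r.status == ParseStatus.NOT_ENOUGH_DATA) {
                    return messages; // wait for more bytes
                }
                if (r.status == ParseStatus.SUCCESS) {
                    hit = r;
                    break;
                }
                // UNSUPPORTED_DATA: wrong parser, try the next one
            }
            if (hit == null) {
                input.get(); // assumption: unknown byte, drop it and try to resync
            } else {
                messages.add(hit.value);
                input.position(input.position() + hit.bytesConsumed);
            }
        }
        return messages;
    }

    /** Demo format (made up): 0x01, one length byte, then ASCII text. */
    static ParseResult parseText(ByteBuffer in) {
        if (in.get() != 0x01) return ParseResult.failure(ParseStatus.UNSUPPORTED_DATA);
        if (!in.hasRemaining()) return ParseResult.failure(ParseStatus.NOT_ENOUGH_DATA);
        int length = in.get() & 0xFF;
        if (in.remaining() < length) return ParseResult.failure(ParseStatus.NOT_ENOUGH_DATA);
        byte[] raw = new byte[length];
        in.get(raw);
        return ParseResult.success(new String(raw, StandardCharsets.US_ASCII), 2 + length);
    }

    /** Demo format (made up): 0x02, then a 4-byte big-endian int (fixed size, no length field). */
    static ParseResult parseInt(ByteBuffer in) {
        if (in.get() != 0x02) return ParseResult.failure(ParseStatus.UNSUPPORTED_DATA);
        if (in.remaining() < 4) return ParseResult.failure(ParseStatus.NOT_ENOUGH_DATA);
        return ParseResult.success(in.getInt(), 5);
    }

    static List<PduParser> demoParsers() {
        return Arrays.<PduParser>asList(ParserLoop::parseText, ParserLoop::parseInt);
    }
}
```

The two demo parsers cover both scenarios from the question: one PDU carries a length field, the other has a size implied by its type.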
[Edit]
Additionally (if this also helps you), we use a list (or a dictionary to be precise) of "message consumers" which are strongly typed and tied to a certain parser/message type. This way only the interested parties are notified when a certain message is parsed. It's basically a simple messaging system where you need to create a list of parsers, and a dictionary of mappings (message type -> message consumer).

At the 10,000 ft. level, this is a classic case where the Factory Pattern is useful. Your code will be much cleaner (and therefore easier to optimize) if you think of this problem in terms of the Factory Pattern. I have written it the other way, so unfortunately I know: what had been several days of work was reduced to a few hours once I applied the Factory Pattern.
[edit...]
For the bytes -> object case, you will need to read enough bytes to determine unambiguously which type of object has been passed down the wire, then proceed with parsing that object serialization.
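As a minimal sketch of that idea for the length-field scenario: a factory keeps a map from message id to a decoder function, reads just enough of the header to pick the decoder, and hands it the body. The header layout used here (1-byte message id followed by a 2-byte big-endian body length) is an assumption made for the example, as are all class names and the two demo PDU types; the real protocol's header would replace it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

interface Pdu {}

final class ChatPdu implements Pdu {
    final String text;
    ChatPdu(String text) { this.text = text; }
}

final class MovePdu implements Pdu {
    final int x;
    final int y;
    MovePdu(int x, int y) { this.x = x; this.y = y; }
}

final class PduFactory {
    // Assumed header: 1-byte message id + 2-byte big-endian body length.
    private static final int HEADER_SIZE = 3;
    private final Map<Integer, Function<ByteBuffer, Pdu>> decoders = new HashMap<>();

    /** Adding a new PDU type means one registration, not another if/else branch. */
    PduFactory register(int messageId, Function<ByteBuffer, Pdu> decoder) {
        decoders.put(messageId, decoder);
        return this;
    }

    /** Returns the next PDU, or null if the buffer does not yet hold a complete frame. */
    Pdu decodeNext(ByteBuffer in) {
        if (in.remaining() < HEADER_SIZE) return null;
        int start = in.position();
        int messageId = in.get() & 0xFF;
        int bodyLength = in.getShort() & 0xFFFF;
        if (in.remaining() < bodyLength) {
            in.position(start); // rewind so the header is re-read when more data arrives
            return null;
        }
        Function<ByteBuffer, Pdu> decoder = decoders.get(messageId);
        if (decoder == null) {
            throw new IllegalArgumentException("unknown message id " + messageId);
        }
        ByteBuffer body = in.slice(); // view limited to this frame's body
        body.limit(bodyLength);
        in.position(in.position() + bodyLength);
        return decoder.apply(body);
    }

    /** Two made-up PDU types, only to show the registration. */
    static PduFactory withDemoTypes() {
        return new PduFactory()
                .register(1, body -> {
                    byte[] raw = new byte[body.remaining()];
                    body.get(raw);
                    return new ChatPdu(new String(raw, StandardCharsets.US_ASCII));
                })
                .register(2, body -> new MovePdu(body.getInt(), body.getInt()));
    }
}
```

For the scenario without a length field, the map value could instead be a decoder that also reports how many bytes it needs, since only the message type knows where the frame ends.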

You create a message (say myMessage) with optional messages for x, y, z. This is discussed here. For example, here is a message with Foo, Bar and Baz from the techniques document:
message OneMessage {
// One of the following will be filled in.
optional Foo foo = 1;
optional Bar bar = 2;
optional Baz baz = 3;
}

Protocol Buffers define their own wire protocol based on binary tag/value pairs. If you have a pre-existing protocol that you cannot change, you cannot use Protocol Buffers to parse it.

I have an idea: the use of annotations like JAXB does to automate the process of converting the message objects to their defined byte representation. It should also be able to recreate/unmarshal the raw bytes to message objects (but only in the scenario with the length field I guess).
But using annotations would involve some light reflection, which could eventually decrease performance(?).
Here's an example (...as I wrote the example, I came to the conclusion that I could probably use the JAXB annotations, since they also have @XmlJavaTypeAdapter to support maps etc. Anyway, here is the example):
/**
* Annotation that helps identifying data elements for encoding/decoding
* byte packets.
*
* Annotate public getter methods in message classes.
*
* NOTE: just primitive types and strings supported.
*
*/
@Retention (value = RetentionPolicy.RUNTIME)
@Target (value = ElementType.METHOD)
public @interface MessageAttribute
{
/* relative position of attribute in byte packet
* 0 = first, 1 = second, etc.
*/
int position();
/*
* type of attribute
*/
Class<?> type();
}
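To illustrate how such an annotation could drive encoding, here is a small self-contained sketch. It uses its own hypothetical annotation, WireField, which carries only a position (the getter's return type supplies the type), and a made-up ScoreMessage class; numbers are assumed to be big-endian and strings plain ASCII without a length prefix. It is meant to show the reflection involved, not to be a finished marshaller.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical annotation: marks a getter and gives its position on the wire. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface WireField {
    int position();
}

/** Made-up message type used only for this example. */
final class ScoreMessage {
    private final byte id;
    private final int score;

    ScoreMessage(byte id, int score) {
        this.id = id;
        this.score = score;
    }

    @WireField(position = 0)
    public byte getId() { return id; }

    @WireField(position = 1)
    public int getScore() { return score; }
}

final class AnnotationEncoder {
    /** Writes all annotated getters of the message, ordered by position, big-endian. */
    static byte[] encode(Object message) {
        List<Method> getters = new ArrayList<>();
        for (Method m : message.getClass().getDeclaredMethods()) {
            if (m.isAnnotationPresent(WireField.class)) {
                getters.add(m);
            }
        }
        getters.sort(Comparator.comparingInt(m -> m.getAnnotation(WireField.class).position()));

        ByteBuffer out = ByteBuffer.allocate(1024); // assumption: messages are small
        try {
            for (Method m : getters) {
                m.setAccessible(true);
                Object v = m.invoke(message);
                if (v instanceof Byte) {
                    out.put((Byte) v);
                } else if (v instanceof Short) {
                    out.putShort((Short) v);
                } else if (v instanceof Integer) {
                    out.putInt((Integer) v);
                } else if (v instanceof String) {
                    out.put(((String) v).getBytes(StandardCharsets.US_ASCII));
                } else {
                    throw new IllegalArgumentException("unsupported type: " + m.getReturnType());
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not read message fields", e);
        }
        return Arrays.copyOf(out.array(), out.position());
    }
}
```

If reflection cost turns out to matter, the sorted getter list can be computed once per class and cached, so only the invoke calls remain on the hot path.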

Well, maybe this isn't helpful, but: in Quake, which uses a very similar protocol, the algorithm is something like this (the receiver/server knows the player ids already).
ByteBuffer frame;
int first = frame.getInt(), magic = 0x4C444732;
if( first != magic ) {
if( !player_list.containsKey(first) ) /* must be a "string" pdu (a chat) */
x = new StringPDU( frame );
else /* not a chat? must be a player_id */
x = new PlayerNamePDU( frame );
} else { /* starts with magic... a game startup pdu + playername */
x = new GamePDU( frame ); /* maybe that's the player host, or must have at least one player */
}
Each PDU has a readFrame method or a constructor that reads the bytes from the ByteBuffer. It looks ugly but, short of using reflection, it is necessary.
class GamePDU extends PDU {
byte command;
short length;
byte min_players;
short time_to_start;
byte num_players; /// after this same as a player_name packet
GamePDU( ByteBuffer b ) {
command = b.get();
length = b.getShort();
min_players = b.get();
time_to_start = b.getShort();
num_players = b.get();
// the rest of the frame is player name
/// players_for_game.add( new PlayerPDU(b) );
/// this player is in the game_start pdu to ensure that the
//// player_list[ num_players ] has been allocated. and has a list head. ;) Whips!!
}
/** if the same code is reading/writing both ends, you don't have to worry about
endianess or signededness. ;)
In C, in parallel, some of the game code just whips!!!
*/
}
class PDU {}
class GamePDU extends PDU {}
class PlayerNamePDU extends PDU {}
class StringPDU extends PDU {}

Related

Java: fastest way to serialize to a byte buffer

I'm required to work on a serialization library in Java which must be as fast as possible. The idea is to create various methods which serialize the specified value and its associated key and put them in a byte buffer. Several objects that wrap this buffer must be created, since there are potentially a lot of objects that need to be serialized.
Considerations:
I know the Unsafe class may not be implemented in every JVM, but it's not a problem.
Premature optimization: this library has to be fast and this serialization is the only thing it has to do.
The objects once serialized are typically small (less than 10k), but there are a lot of them and they can be up to 2 GB big.
The underlying buffer can be expanded / reduced but I'll skip implementation details, the method is similar to the one used in the ArrayList implementation.
To clarify my situation: I have various methods like
public void putByte(short key, byte value);
public void putInt(short key, int value);
public void putFloat(short key, float value);
... and so on...
these methods append the key and the value in a byte stream, so if i call putInt(-1, 1234567890) my buffer would look like: (the stream is big endian)
key the integer value
[0xFF, 0xFF, 0x49, 0x96, 0x02, 0xD2]
In the end a method like toBytes() must be called to return a byte array which is a trimmed (if needed) version of the underlying buffer.
Now, my question is: what is the fastest way to do this in java?
I googled and stumbled upon various pages (some of these were on SO) and I also did some benchmarks (but i'm not really experienced in benchmarks and that's one of the reasons I'm asking for the help of more experienced programmers about this topic).
I came up with the following solutions:
1- The most immediate: a byte array
If I have to serialize an int it would look like this:
public void putInt(short key, int value)
{
array[index] = (byte)(key >> 8);
array[index+1] = (byte) key;
array[index+2] = (byte)(value >> 24);
array[index+3] = (byte)(value >> 16);
array[index+4] = (byte)(value >> 8);
array[index+5] = (byte) value;
}
2- A ByteBuffer (be it direct or a byte array wrapper)
The putInt method would look like the following
public void putInt(short key, int value)
{
byteBuff.putShort(key).putInt(value);
}
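As a quick check of the ByteBuffer variant against the example bytes given earlier, here is a minimal sketch. The wrapper class KeyValueBuffer is hypothetical and has a fixed capacity (no growing logic). Note that ByteBuffer.put only accepts a single byte, a byte[] or another buffer, so the typed putters putShort and putInt are used; a freshly allocated ByteBuffer is big-endian by default.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical fixed-capacity wrapper, only to show the ByteBuffer calls. */
final class KeyValueBuffer {
    private final ByteBuffer buf;

    KeyValueBuffer(int capacity) {
        buf = ByteBuffer.allocate(capacity); // heap buffer, big-endian by default
    }

    /** Appends the 2-byte key followed by the 4-byte value. */
    KeyValueBuffer putInt(short key, int value) {
        buf.putShort(key).putInt(value);
        return this;
    }

    /** Returns a copy trimmed to the bytes written so far. */
    byte[] toBytes() {
        return Arrays.copyOf(buf.array(), buf.position());
    }
}
```

Calling new KeyValueBuffer(16).putInt((short) -1, 1234567890).toBytes() yields the six bytes 0xFF, 0xFF, 0x49, 0x96, 0x02, 0xD2 from the example above.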
3- Allocation on native memory through Unsafe
Using the Unsafe class I would allocate the buffer on native memory and so the putInt would look like:
public void putInt(short key, int value)
{
Unsafe.putShort(address, key);
Unsafe.putInt(address+2, value);
}
4- allocation through new byte[], access through Unsafe
I saw this method in the lz4 compression library written in java. Basically once a byte array is instantiated i write bytes the following way:
public void putInt(short key, int value)
{
Unsafe.putShort(byteArray, BYTE_ARRAY_OFFSET + 0, key);
Unsafe.putInt(byteArray, BYTE_ARRAY_OFFSET + 2, value);
}
The methods here are simplified, but the basic idea is the one shown; I also have to implement the getter methods. Since I started working on this I have learnt the following things:
1- The JVM can remove array bounds checks if it's safe (in a for loop, for example, where the counter has to be less than the length of the array)
2- Crossing the JVM memory boundaries (reading/writing from/to native memory) has a cost.
3- Calling a native method may have a cost.
4- Unsafe putters and getters don't make boundary checks in native memory, nor on a regular array.
5- ByteBuffers wrap a byte array (non direct) or a plain native memory area (direct) so case 2 internally would look like case 1 or 3.
I ran some benchmarks (but as I said I would like the opinion / experience of other developers) and it seems that case 4 is almost equal to case 1 in reading and about 3 times faster in writing. It also seems that a for loop with Unsafe read and write (case 4) to copy an array to another (copying 8 bytes at a time) is faster than System.arraycopy.
Long story made short (sorry for the long post):
case 1 seems to be fast, but that way I have to write a single byte each time + masking operations, which makes me think that maybe Unsafe, even if it's a call to native code may be faster.
case 2 is similar to case 1 and 3, so I could skip it (correct me if I'm missing something)
case 3 seems to be the slowest (at least from my benchmarks); also, I would need to copy from native memory to a byte array because that must be the output. But here this programmer claims it's the fastest way by far. If I understood correctly, what am I missing?
case 4 (as supported here) seems to be the fastest.
The number of choices and some contradictory information confuse me a bit, so can anyone clear up these doubts for me?
I hope I wrote every needed information, otherwise just ask for clarifications.
Thanks in advance.

Case 5: DataOutputStream writing to a ByteArrayOutputStream.
Pro: it's already done; it's as fast as anything else you've mentioned here; all primitives are already implemented. The converse is DataInputStream reading from a ByteArrayInputStream.
Con: nothing I can think of.
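A minimal sketch of what case 5 might look like with the same putInt/toBytes shape as in the question; the class name StreamSerializer and the helper readIntEntry are made up for the example. DataOutputStream writes multi-byte values high byte first, which matches the big-endian stream described in the question. Whether it is actually as fast as the other cases is something to measure rather than assume.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical serializer built on the standard stream classes. */
final class StreamSerializer {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(bytes);

    /** Appends the 2-byte key followed by the 4-byte value, both big-endian. */
    StreamSerializer putInt(short key, int value) {
        try {
            out.writeShort(key);
            out.writeInt(value);
        } catch (IOException e) {
            // not expected for an in-memory stream
            throw new UncheckedIOException(e);
        }
        return this;
    }

    /** ByteArrayOutputStream grows on demand and returns a trimmed copy here. */
    byte[] toBytes() {
        return bytes.toByteArray();
    }

    /** The converse: reads one (key, value) pair back and returns {key, value}. */
    static int[] readIntEntry(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            short key = in.readShort();
            int value = in.readInt();
            return new int[] {key, value};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```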

Update data only by difference between files (delta for java)

UPDATE: I solved the problem with a great external library - https://code.google.com/p/xdeltaencoder/. The way I did it is posted below as the accepted answer
Imagine I have two separate pcs who both have an identical byte[] A.
One of the pcs creates byte[] B, which is almost identical to byte[] A but is a 'newer' version.
For the second pc to update his copy of byte[] A into the latest version (byte[] B), I need to transmit the whole byte[] B to the second pc. If byte[] B is many GB's in size, this will take too long.
Is it possible to create a byte[] C that is the 'difference between' byte[] A and byte[] B? The requirements for byte[] C is that knowing byte[] A, it is possible to create byte[] B.
That way, I will only need to transmit byte[] C to the second PC, which in theory would be only a fraction of the size of byte[] B.
I am looking for a solution to this problem in Java.
Thank you very much for any help you can provide :)
EDIT: The nature of the updates to the data in most circumstances is extra bytes being inserted into parts of the array. Of course it is possible that some bytes will be changed or some bytes deleted. The byte[] itself represents a tree of the names of all the files/folders on a target pc. The byte[] is originally created by creating a tree of custom objects, marshalling them with JSON, and then compressing that data with a zip algorithm. I am struggling to create an algorithm that can intelligently create byte[] C.
EDIT 2: Thank you so much for all the help everyone here has given, and I am sorry for not being active for such a long time. I'm most probably going to try to get an external library to do the delta-encoding for me. A great part about this thread is that I now know what the thing I want to achieve is called! I believe that when I find an appropriate solution I will post it and accept it so others can see how I solved my problem. Once again, thank you very much for all your help.

Using a collection of "change events" rather than sending the whole array
A solution to this would be to send a serialized object describing the change rather than the actual array all over again.
public class ChangePair implements Serializable{
//glorified struct
public final int index;
public final byte newValue;
public ChangePair(int index, byte newValue) {
this.index = index;
this.newValue = newValue;
}
public static void main(String[] args){
Collection<ChangePair> changes=new HashSet<ChangePair>();
changes.add(new ChangePair(12,(byte)2));
changes.add(new ChangePair(1206,(byte)3));
}
}
Generating the "change events"
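The most efficient method for achieving this would be to track changes as you go, but assuming that's not possible you can just brute-force your way through, finding which values are different. Note that this only covers values changed in place: both arrays must have the same length, so insertions and deletions are not handled.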
public static Collection<ChangePair> generateChangeCollection(byte[] oldValues, byte[] newValues){
//validation
if (oldValues.length!=newValues.length){
throw new RuntimeException("new and old arrays are differing lengths");
}
Collection<ChangePair> changes=new HashSet<ChangePair>();
for(int i=0;i<oldValues.length;i++){
if (oldValues[i]!=newValues[i]){
//generate a change event
changes.add(new ChangePair(i,newValues[i]));
}
}
return changes;
}
Sending and receiving those change events
As per this answer regarding sending serialized objects over the internet you could then send your object using the following code
Collection<ChangePair> changes=generateChangeCollection(oldValues,newValues);
Socket s = new Socket("yourhostname", 1234);
ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
out.writeObject(changes);
out.flush();
On the other end you would receive the object
ServerSocket server = new ServerSocket(1234);
Socket s = server.accept();
ObjectInputStream in = new ObjectInputStream(s.getInputStream());
Collection<ChangePair> objectReceived = (Collection<ChangePair>) in.readObject();
//use Collection<ChangePair> to apply changes
Using those change events
This collection can then simply be used to modify the array of bytes on the other end
public static void useChangeCollection(byte[] oldValues, Collection<ChangePair> changeEvents){
for(ChangePair changePair:changeEvents){
oldValues[changePair.index]=changePair.newValue;
}
}

Locally log the changes to the byte array, like a little version control system. In fact you could use a VCS to create patch files, send them to the other side and apply them to get an up-to-date file;
If you cannot log changes, you would need to double the array locally, or (not so 100% safe) use an array of checksums on blocks.

The main problem here is data compression.
Kamikaze offers you good compression algorithms for data arrays. It uses Simple16 and PForDelta coding. Simple16 is a good and (as the name says) simple list compression option. Or you can use Run-Length Encoding. Or you can experiment with any compression algorithm you have available in Java...
Anyway, any method you use will be optimized if you first preprocess the data.
You can reduce the data by calculating differences or, as @RichardTingle pointed out, by creating pairs of changed data locations.
You can calculate C as B - A. C will have to be an int (or short) array, since the difference between two byte values does not always fit in a byte. You can then restore B as A + C.
The advantage of combining at least two methods here is that you get much better results.
E.g. if you use the difference method with A = { 1, 2, 3, 4, 5, 6, 7 } and B = { 1, 2, 3, 5, 6, 7, 7 }. The difference array C will be { 0, 0, 0, 1, 1, 1, 0 }. RLE can compress C in a very effective way, since it is good for compressing data when you have many repeated numbers in sequence.
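The example above can be worked through in code. This is a minimal sketch with made-up names (DeltaRle, difference, runLengthEncode, apply); runs are represented as {value, count} pairs, and both arrays are assumed to have the same length, as in the difference method described here.

```java
import java.util.ArrayList;
import java.util.List;

final class DeltaRle {
    /** C = B - A, element by element; int so the difference cannot overflow. */
    static int[] difference(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = b[i] - a[i];
        }
        return c;
    }

    /** Collapses repeated values into {value, count} pairs. */
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int v : values) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == v) {
                last[1]++; // extend the current run
            } else {
                runs.add(new int[] {v, 1}); // start a new run
            }
        }
        return runs;
    }

    /** Restores B = A + C from A and the run-length encoded C. */
    static byte[] apply(byte[] a, List<int[]> runs) {
        byte[] b = new byte[a.length];
        int i = 0;
        for (int[] run : runs) {
            for (int n = 0; n < run[1]; n++, i++) {
                b[i] = (byte) (a[i] + run[0]);
            }
        }
        return b;
    }
}
```

For the A and B above, the seven differences collapse into three runs: {0, 3}, {1, 3} and {0, 1}.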
Using the difference method with Simple16 will be good if your data changes in almost every position, but the difference between values is small. It can compress an array of 28 single-bit values (0 or 1) or an array of 14 two-bit values to a single 32-bit integer.
Experiment, it all will depend on how your data behaves. And compare the data compression ratios for each experiment.
EDIT: You will have to preprocess the data before JSON and zip compressing.
Create two sets old and now. The latter contains all files that exists now. For the former, the old files, you have at least two options:
Should contain all files that existed before you sent them to the other PC. You will need to keep a set of what the other PC knows to calculate what has changed since the last synchronization, and send only the new data.
Contains all files since you last checked for changes. You can keep a local history of changes and give each version an "id". Then, when you sync, you send the "version id" together with the changed data to the other PC. Next time, the other PC first sends its "version id" (or you keep the "version id" of each PC locally), then you can send the other PC all the new changes (all the versions that come after the one that PC had).
The changes can be represented by two other sets: newFiles, and deleted files. (What about files that changed in content? Don't you need to sync these too?) The newFiles contains the ones that only exist in set now (and do not exist in old). The deleted set contains the files that only exist in set old (and do not exist in now).
If you represent each file as a String with the full pathname, you will safely have a unique representation of each file. Or you can use java.io.File.
After you reduced your changes to newFiles and deleted files set, you can convert them to JSON, zip and do anything else to serialize and compress the data.
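The two change sets are plain set differences. A minimal sketch, assuming each file is represented by its full pathname as a String (the class and method names are made up for the example):

```java
import java.util.Set;
import java.util.TreeSet;

final class FileSetDelta {
    /** Paths that exist now but did not exist in the old set. */
    static Set<String> newFiles(Set<String> old, Set<String> now) {
        Set<String> result = new TreeSet<>(now); // copy, so the inputs stay untouched
        result.removeAll(old);
        return result;
    }

    /** Paths that existed in the old set but are gone now. */
    static Set<String> deleted(Set<String> old, Set<String> now) {
        Set<String> result = new TreeSet<>(old);
        result.removeAll(now);
        return result;
    }
}
```

The receiving side would apply the update as old minus deleted plus newFiles. Files whose content changed but whose path did not are not detected by this comparison.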

So, what I ended up doing was using this:
https://code.google.com/p/xdeltaencoder/
From my test it works really really well. However, you will need to make sure to checksum the source (in my case fileAJson), as it does not do it automatically for you!
Anyways, code below:
//Create delta
String[] deltaArgs = new String[]{fileAJson.getAbsolutePath(), fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
//Apply delta
deltaArgs = new String[]{"-d", fileAJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
//Trivia: surprisingly, this also works
deltaArgs = new String[]{"-d", fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
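For the checksum of the source mentioned above, one possible approach is to hash the source bytes with the JDK's MessageDigest and send the hash along with the delta, so the receiver can confirm it holds the same source before applying it. This is a sketch of that idea only; the class SourceChecksum is made up, and it is independent of the xdeltaencoder API.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SourceChecksum {
    /** Returns the SHA-256 digest of the data as a lowercase hex string. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** True if the local source matches the checksum sent with the delta. */
    static boolean matches(byte[] localSource, String expectedHex) {
        return sha256Hex(localSource).equalsIgnoreCase(expectedHex);
    }
}
```

For files too large to hold in memory, the same MessageDigest can be fed in chunks with update(...) while reading the file.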

Is there a way to preserve or modify in-code radix information?

Let's say I have the following code:
int two = 2;
String twoInBinary = Integer.toString(two, 2);
The twoInBinary String will now hold the value 10. But it seems like the radix information is completely lost in this transformation. So, if I send twoInBinary as part of an XML file over a network and want to deserialize it into its integer format, like this...
int deserializedTwo = Integer.parseInt(twoInBinary);
... then deserializedTwo will equal 10 rather than 2 (in decimal).
I know there is the Integer.parseInt(String s, int radix), but in a complex system using many different radixes for many different strings, is it possible to preserve the radix information without having to keep a separate, synchronized log with your values?

Short answer: No, not in standard Java. It is, however, trivial to write a Serializable class that can transfer the value and radix information over the wire.
class ValueWithRadix implements Serializable
{
int radix;
String value;
}
int deserializedTwo = Integer.parseInt( valueWithRadix.value , valueWithRadix.radix );
Edit: To clarify yet more, the XML on the wire might then look like
<ValueWithRadix>
<value>10</value>
<radix>2</radix>
</ValueWithRadix>
rather than just
<value>10</value>
which of course doesn't preserve radix information.
Cheers,

If you're sending it as part of an XML file, you have to use the correct datatype definition. XML Schema supports a lot of different built-in types to describe your types accurately.
So <value>10</value> might currently be describing an integer, which is defined as base 10. You could quite easily describe a new simple type which expresses digits as base 2.

A C structure accessed in Java

I have a C structure that is sent over some intermediate networks and is received over a serial link by Java code. The Java code gives me a byte array that I now want to repackage as the original structure. If the receiving code were in C, this would be simple. Is there any simple way to repackage a byte[] in Java as a C struct? I have minimal experience in Java, but this doesn't appear to be a common problem or one solved in any FAQ that I could find.
FYI the C struct is
struct data {
uint8_t moteID;
uint8_t status; //block or not
uint16_t tc_1;
uint16_t tc_2;
uint16_t panelTemp; //board temp
uint16_t epoch; //epoch number
uint16_t count; //pkt seq since the start of epoch
uint16_t TEG_v;
int16_t TEG_c;
}data;

I would recommend that you send the numbers across the wire in network byte order all the time. This eliminates the problems of:
Compiler specific word boundary generation for your structure.
Byte order specific to your hardware (both sending and receiving).
Also, Java's standard binary I/O classes (DataInputStream, DataOutputStream, and ByteBuffer by default) read and write multi-byte numbers in network byte order (big-endian), no matter which platform you run Java on.
A very good class for extracting values from a byte array is java.nio.ByteBuffer, which can wrap arbitrary byte arrays, not just those coming from an I/O class in java.nio. You really should not hand-code your own extraction of primitive values (i.e. bit shifting and so forth) if you can avoid it: it is easy to get wrong, the code is the same for every instance of the same type, and there are plenty of standard classes that provide this for you.
For example:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Data {
    private byte moteId;
    private byte status;
    private short tc_1;
    private short tc_2;
    //...etc...
    private int tc_2_as_int;
    private Data() {
        // empty
    }
    public static Data createFromBytes(byte[] bytes) {
        final Data data = new Data();
        // Big-endian (network order) by default
        final ByteBuffer buf = ByteBuffer.wrap(bytes);
        // If needed...
        //buf.order(ByteOrder.LITTLE_ENDIAN);
        data.moteId = buf.get();
        data.status = buf.get();
        data.tc_1 = buf.getShort();
        data.tc_2 = buf.getShort();
        // ...extract other fields here
        // Example to convert unsigned short to a positive int
        // (reuses the value already read, so no extra bytes are consumed)
        data.tc_2_as_int = data.tc_2 & 0xffff;
        return data;
    }
}
Now, to create one, just call Data.createFromBytes(byteArray).
Note that Java does not have unsigned integer types, but the values are retrieved with exactly the same bit pattern. Any value whose high-order bit is not set reads the same either way. If the high-order bit can be set in your unsigned numbers, you need to handle it yourself, which usually means storing the value in the next larger integer type and masking it (byte -> short; short -> int; int -> long).
Edit: Updated the example to show how to convert a short (16-bit signed) to an int (32-bit signed) with the unsigned value with tc_2_as_int.
Note also that if you cannot change the byte-order and it is not in network order, then java.nio.ByteBuffer can still serve you here with buf.order(ByteOrder.LITTLE_ENDIAN); before retrieving the values.
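To make the ByteBuffer advice concrete for the struct in the question, here is a sketch that reads all nine fields. It assumes the sender's compiler added no padding (so the struct is exactly 16 bytes) and that the caller knows the sender's byte order; the class name MoteData and its field names are illustrative, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: Java-side view of the C struct, assuming a packed 16-byte layout.
final class MoteData {
    static final int SIZE = 16; // 2 x uint8_t + 7 x 16-bit fields, no padding assumed

    int moteId, status;                           // uint8_t, widened to int
    int tc1, tc2, panelTemp, epoch, count, tegV;  // uint16_t, widened to int
    short tegC;                                   // int16_t, already signed

    static MoteData parse(byte[] bytes, ByteOrder order) {
        if (bytes.length < SIZE) {
            throw new IllegalArgumentException("need " + SIZE + " bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        MoteData d = new MoteData();
        d.moteId = buf.get() & 0xff;          // mask to undo sign extension
        d.status = buf.get() & 0xff;
        d.tc1 = buf.getShort() & 0xffff;
        d.tc2 = buf.getShort() & 0xffff;
        d.panelTemp = buf.getShort() & 0xffff;
        d.epoch = buf.getShort() & 0xffff;
        d.count = buf.getShort() & 0xffff;
        d.tegV = buf.getShort() & 0xffff;
        d.tegC = buf.getShort();              // signed, so no mask
        return d;
    }
}
```

Calling MoteData.parse(bytes, ByteOrder.LITTLE_ENDIAN) would suit a little-endian sender; pass ByteOrder.BIG_ENDIAN if the C side converts to network order first.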

This can be difficult to do even when sending from C to C.
If you take a data struct, cast it to an array of bytes/chars and then blindly send it, you can sometimes end up with big problems decoding it on the other end.
This is because the compiler may insert padding between fields to align them, so the raw bytes may not look exactly how you would expect from the way you declared the struct.
It really depends on the compiler!
There are compiler pragmas you can use to turn this padding off. See C/C++ Preprocessor Reference - pack
The other problem is the 32/64-bit problem if you just use "int" and "long" without specifying the number of bytes... but you have already taken care of that :-)
Unfortunately, Java doesn't really have structs... but it represents the same information in classes.
What I recommend is that you make a class that consists of your variables, and write a custom unpacking function that pulls the bytes out of the received packet (after you have checked that it arrived intact) and loads them into the class.
e.g. You have a data class like
class Data
{
public int moteID;
public int status; //block or not
public int tc_1;
public int tc_2;
}
Then when you receive a byte array, you can do something like this
Data convertBytesToData(byte[] dataToConvert)
{
    Data d = new Data();
    // Java bytes are signed, so mask with 0xff to get the unsigned value
    d.moteID = dataToConvert[0] & 0xff;
    d.status = dataToConvert[1] & 0xff;
    d.tc_1 = ((dataToConvert[2] & 0xff) << 8) | (dataToConvert[3] & 0xff); // unpacking 16 bits, high byte first
    d.tc_2 = ((dataToConvert[4] & 0xff) << 8) | (dataToConvert[5] & 0xff); // unpacking 16 bits, high byte first
    return d;
}
I might have the 16-bit unpacking the wrong way around; it depends on the endianness of your C system, but you'll be able to play around and see whether it's right.
I haven't played with Java for some time, but hopefully there are byte[] to int functions built in these days.
I know there are for C# anyway.
With all this in mind, if you are not doing high data rate transfers, definitely look at JSON and Protocol Buffers!

Assuming you have control over both ends of the link, rather than sending raw data you might be better off going for an encoding that C and Java can both use. Look at either JSON or Protocol Buffers.

What you are trying to do is problematic for a couple of reasons:
Different C implementations will represent uint16_t (and int16_t) values in different ways. In some cases, the most significant byte will be first when the struct is laid out in memory. In other cases, the least significant byte will.
Different C compilers may pack the fields of the struct differently. So it is possible (for example) that the fields have been reordered or padding may have been added.
So what this all means is that you have to figure out exactly how the struct is laid out ... and just hope that this doesn't change when / if you change C compilers or C target platform.
Having said that, I could not find a Java library for decoding arbitrary binary data streams that allows you to select "endian-ness". The DataInputStream and DataOutputStream classes may be the answer, but they are explicitly defined to send/expect the high order byte first. If your data comes the other way around you will need to do some Java bit bashing to fix it.
EDIT: actually (as @Kevin Brock points out) java.nio.ByteBuffer allows you to specify the endian-ness when fetching various data types from a binary buffer.
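If you do stay with DataInputStream, the "bit bashing" for little-endian data can lean on the standard Short.reverseBytes and Integer.reverseBytes methods rather than hand-written shifts. The sketch below is one way to do that; the class and method names are made up for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Sketch: little-endian reads on top of DataInputStream, which itself is always big-endian.
final class LittleEndianReads {
    private LittleEndianReads() {}

    // Reads a little-endian uint16_t and returns it as a non-negative int.
    static int readUnsignedShortLE(DataInputStream in) throws IOException {
        return Short.reverseBytes(in.readShort()) & 0xffff;
    }

    // Reads a little-endian 32-bit signed integer.
    static int readIntLE(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt());
    }
}
```

For anything beyond a couple of fields, ByteBuffer with an explicit byte order is probably the tidier choice.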

How to get data out of network packet data in Java

In C, if you have a certain type of packet, what you generally do is define some struct and cast the char * into a pointer to the struct. After this you have direct programmatic access to all data fields in the network packet. Like so:
struct rdp_header {
int version;
char serverId[20];
};
When you get a network packet you can do the following quickly:
char * packet;
// receive packet
struct rdp_header * pckt = (struct rdp_header *) packet;
printf("Servername : %20.20s\n", pckt->serverId);
This technique works really well for UDP-based protocols, and allows for very quick and very efficient packet parsing and sending using very little code, and trivial error handling (just check the length of the packet). Is there an equivalent, just as quick way in Java to do the same? Or are you forced to use stream-based techniques?

Read your packet into a byte array, and then extract the bits and bytes you want from that.
Here's a sample, sans exception handling:
DatagramSocket s = new DatagramSocket(port);
DatagramPacket p;
byte buffer[] = new byte[4096];
while (true) {
p = new DatagramPacket(buffer, buffer.length);
s.receive(p);
// your packet is now in buffer[];
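// mask each byte (Java bytes are signed) and parenthesize: + binds tighter than <<
int version = ((buffer[0] & 0xff) << 24) | ((buffer[1] & 0xff) << 16) | ((buffer[2] & 0xff) << 8) | (buffer[3] & 0xff);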
byte[] serverId = new byte[20];
System.arraycopy(buffer, 4, serverId, 0, 20);
// and process the rest
}
In practice you'll probably end up with helper functions to extract data fields in network order from the byte array, or as Tom points out in the comments, you can use a ByteArrayInputStream, from which you can construct a DataInputStream, which has methods to read structured data from the stream:
...
while (true) {
p = new DatagramPacket(buffer, buffer.length);
s.receive(p);
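// only expose the bytes actually received, not the whole 4096-byte buffer
ByteArrayInputStream bais = new ByteArrayInputStream(p.getData(), p.getOffset(), p.getLength());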
DataInput di = new DataInputStream(bais);
int version = di.readInt();
byte[] serverId = new byte[20];
di.readFully(serverId);
...
}
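Pulled out of the receive loop, the DataInputStream approach above might look like the following sketch for the rdp_header from the question. It assumes the sender writes version in network byte order and that the header is exactly 4 + 20 bytes with no padding; RdpHeader is an illustrative name, not an existing class.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch: Java counterpart of struct rdp_header { int version; char serverId[20]; }.
final class RdpHeader {
    static final int SIZE = 4 + 20;

    final int version;
    final byte[] serverId;

    private RdpHeader(int version, byte[] serverId) {
        this.version = version;
        this.serverId = serverId;
    }

    // data, offset and length would typically come from
    // DatagramPacket.getData(), getOffset() and getLength().
    static RdpHeader parse(byte[] data, int offset, int length) throws IOException {
        if (length < SIZE) {
            throw new IOException("packet too short: " + length + " bytes");
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data, offset, length));
        int version = in.readInt();        // big-endian, as DataInputStream always is
        byte[] serverId = new byte[20];
        in.readFully(serverId);            // fails with EOFException rather than returning a short read
        return new RdpHeader(version, serverId);
    }
}
```

The length check plays the same role as the "just check the length of the packet" step in the C version.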

I don't believe this technique can be done in Java, short of using JNI and actually writing the protocol handler in C. The other way to do the technique you describe is variant records and unions, which Java doesn't have either.
If you had control of the protocol (it's your server and client) you could use serialized objects (inc. xml), to get the automagic (but not so runtime efficient) parsing of the data, but that's about it.
Otherwise you're stuck with parsing Streams or byte arrays (which can be treated as Streams).
Mind you the technique you describe is tremendously error prone and a source of security vulnerabilities for any protocol that is reasonably interesting, so it's not that great a loss.

I wrote something to simplify this kind of work. Like most tasks, it was much easier to write a tool than to try to do everything by hand.
It consisted of two classes. Here's an example of how it was used:
// Resulting byte array is 9 bytes long.
byte[] ba = new ByteArrayBuilder()
.writeInt(0xaaaa5555) // 4 bytes
.writeByte(0x55) // 1 byte
.writeShort(0x5A5A) // 2 bytes
.write( (new BitBuilder()) // 2 bytes---0xBA12
.write(3, 5) // 101 (3 bits value of 5)
.write(2, 3) // 11 (2 bits value of 3)
.write(3, 2) // 010 (...)
.write(2, 0) // 00
.write(2, 1) // 01
.write(4, 2) // 0010
).getBytes();
I wrote the ByteArrayBuilder to simply accumulate bits. I used a method chaining pattern (Just returning "this" from all methods) to make it easier to write a bunch of statements together.
All the methods in the ByteArrayBuilder were trivial, just like 1 or 2 lines of code (I just wrote everything to a data output stream)
This is to build a packet, but tearing one apart shouldn't be any harder.
The only interesting method in BitBuilder is this one:
public BitBuilder write(int bitCount, int value) {
int bitMask=0xffffffff;
bitMask <<= bitCount; // If bitCount is 4, bitMask is now fffffff0
bitMask = ~bitMask; // and now it's 0000000f, a great mask
bitRegister <<= bitCount; // make room
bitRegister |= (value & bitMask); // or in the value (masked for safety)
bitsWritten += bitCount;
return this;
}
Again, the logic could be inverted very easily to read a packet instead of build one.
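As a rough sketch of that inverted logic, the class below reads fields of arbitrary bit width from a byte array, most significant bit first, which matches the order in which the builder example packs them. BitReader is a hypothetical name; the original answer does not show its reading code.

```java
// Sketch: reads bit fields from a byte array, most significant bit first.
final class BitReader {
    private final byte[] data;
    private int bitPos; // index of the next unread bit, counted from the MSB of data[0]

    BitReader(byte[] data) {
        this.data = data;
    }

    // Returns the next bitCount bits (1..32) as an int; for fewer than 32 bits the result is non-negative.
    int read(int bitCount) {
        if (bitCount < 1 || bitCount > 32) {
            throw new IllegalArgumentException("bitCount must be 1..32");
        }
        if (bitPos + bitCount > data.length * 8) {
            throw new IllegalStateException("not enough bits left");
        }
        int value = 0;
        for (int i = 0; i < bitCount; i++) {
            int currentByte = data[bitPos >>> 3] & 0xff;
            int bit = (currentByte >>> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;   // shift in one bit at a time
            bitPos++;
        }
        return value;
    }
}
```

Fed the two bytes 0xBA 0x12 from the builder example, successive calls read(3), read(2), read(3), read(2), read(2), read(4) return 5, 3, 2, 0, 1 and 2.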
edit: I had proposed a different approach in this answer, I'm going to post it as a separate answer because it's completely different.

Look at the Javolution library and its Struct classes; they do just what you are asking for. In fact, the author has this exact example, using the Javolution Struct classes to manipulate UDP packets.

This is an alternate proposal for an answer I left above. I suggest you consider implementing it because it would act pretty much the same as a C solution where you could pick fields out of a packet by name.
You might start it out with an external text file something like this:
OneByte, 1
OneBit, .1
TenBits, .10
AlsoTenBits, 1.2
SignedInt, +4
It could specify the entire structure of a packet, including fields that may repeat. The language could be as simple or complicated as you need.
You'd create an object like this:
PacketReader packetReader = new PacketReader("PacketStructure.txt", packet);
Your constructor would iterate over the PacketStructure.txt file and store each string as the key of a hashtable, and the exact location of its data (both bit offset and size) as the data.
Once you created an object, passing in the bitStructure and a packet, you could randomly access the data with statements as straight-forward as:
int x=packetReader.getInt("AlsoTenBits");
Also note, this stuff would be much less efficient than a C struct, but not as much as you might think--it's still probably many times more efficient than you'll need. If done right, the specification file would only be parsed once, so you would only take the minor hit of a single hash lookup and a few binary operations for each value you read from the packet--not bad at all.
The exception is if you are parsing packets from a high-speed continuous stream, and even then I doubt a fast network could flood even a slowish CPU.
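A heavily simplified sketch of this idea follows. It does not implement the ".10" / "1.2" notation above; instead it assumes a made-up schema format of one "name, bitCount" pair per line, with fields laid out back to back and most significant bit first. SchemaPacketReader and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: looks up packet fields by name using a tiny text schema ("name, bitCount" per line).
final class SchemaPacketReader {
    private static final class Field {
        final int bitOffset;
        final int bitCount;
        Field(int bitOffset, int bitCount) {
            this.bitOffset = bitOffset;
            this.bitCount = bitCount;
        }
    }

    private final Map<String, Field> fields = new HashMap<>();
    private final byte[] packet;

    SchemaPacketReader(String schema, byte[] packet) {
        this.packet = packet;
        int offset = 0;
        for (String line : schema.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;
            String[] parts = line.split(",");
            int bits = Integer.parseInt(parts[1].trim());
            fields.put(parts[0].trim(), new Field(offset, bits)); // remember where the field lives
            offset += bits;
        }
    }

    // Returns the named field as an unsigned value (fields up to 31 bits).
    int getInt(String name) {
        Field f = fields.get(name);
        if (f == null) throw new IllegalArgumentException("unknown field: " + name);
        int value = 0;
        for (int pos = f.bitOffset; pos < f.bitOffset + f.bitCount; pos++) {
            int bit = (packet[pos >>> 3] >>> (7 - (pos & 7))) & 1;
            value = (value << 1) | bit;
        }
        return value;
    }
}
```

For brevity this parses the schema once per reader; a fuller version would parse it once into a shared, immutable field table and reuse that for every packet, which is what keeps the per-read cost down to a hash lookup and a few bit operations.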

Short answer, no you can't do it that easily.
Longer answer: if you can use Serializable objects, you can hook your InputStream up to an ObjectInputStream and use that to deserialize your objects. However, this requires that you have some control over the protocol. It is also easier with a TCP Socket. If you use a UDP DatagramSocket, you will need to get the data from the packet and then feed that into a ByteArrayInputStream.
If you don't have control over the protocol, you may be able to still use the above deserialization method, but you're probably going to have to implement the readObject() and writeObject() methods rather than using the default implementation given to you. If you need to use someone else's protocol (say because you need to interop with a native program), this is likely the easiest solution you are going to find.
Also, remember that Java uses UTF-16 internally for strings, while standard Java serialization writes them in a modified UTF-8 encoding. Either way, you need to be very careful when passing strings back and forth to non-Java programs.
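For a fixed-width C field such as char serverId[20], one cautious way to handle the string is to convert explicitly with a named charset instead of relying on any default. The sketch below assumes the field holds NUL-padded ASCII, which the question does not actually state; CStrings is a hypothetical helper name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: converts between Java Strings and fixed-width, NUL-padded ASCII fields.
final class CStrings {
    private CStrings() {}

    // Decodes up to the first NUL byte, or the whole field if there is none.
    static String decode(byte[] field) {
        int len = 0;
        while (len < field.length && field[len] != 0) {
            len++;
        }
        return new String(field, 0, len, StandardCharsets.US_ASCII);
    }

    // Encodes into exactly width bytes, padding with NULs; rejects strings that do not fit.
    static byte[] encode(String s, int width) {
        byte[] raw = s.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > width) {
            throw new IllegalArgumentException("string longer than " + width + " bytes");
        }
        return Arrays.copyOf(raw, width); // copyOf pads with zero bytes
    }
}
```

If the native side uses a different encoding, swapping the charset constant is the only change needed.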
