I'm required to work on a serialization library in Java which must be as fast as possible. The idea is to create various methods which will serialize the specified value and its associated key and puts them in a byte buffer. Several objects that wrap this buffer must be created since the objects that need to be serialized are potentially alot.
Considerations:
I know the Unsafe class may not be implemented in every JVM, but it's not a problem.
Premature optimization: this library has to be fast and this serialization is the only thing it has to do.
The objects once serialized are tipically small (less than 10k) but they are alot and they can be up to 2Gb big.
The underlying buffer can be expanded / reduced but I'll skip implementation details, the method is similar to the one used in the ArrayList implementation.
To clarify my situation: I have various methods like
public void putByte(short key, byte value);
public void putInt(short key, int value);
public void putFloat(short key, float value);
... and so on...
these methods append the key and the value in a byte stream, so if i call putInt(-1, 1234567890) my buffer would look like: (the stream is big endian)
key the integer value
[0xFF, 0xFF, 0x49, 0x96, 0x02, 0xD2]
In the end a method like toBytes() must be called to return a byte array which is a trimmed (if needed) version of the underlying buffer.
Now, my question is: what is the fastest way to do this in java?
I googled and stumbled upon various pages (some of these were on SO) and I also did some benchmarks (but i'm not really experienced in benchmarks and that's one of the reasons I'm asking for the help of more experienced programmers about this topic).
I came up with the following solutions:
1- The most immediate: a byte array
If I have to serialize an int it would look like this:
public void putInt(short key, int value)
{
array[index] = (byte)(key >> 8);
array[index+1] = (byte) key;
array[index+2] = (byte)(value >> 24);
array[index+3] = (byte)(value >> 16);
array[index+4] = (byte)(value >> 8);
array[index+5] = (byte) value;
}
2- A ByteBuffer (be it direct or a byte array wrapper)
The putInt method would look like the following
public void putInt(short key, int value)
{
byteBuff.put(key).put(value);
}
3- Allocation on native memory through Unsafe
Using the Unsafe class I would allocate the buffer on native memory and so the putInt would look like:
public void putInt(short key, int value)
{
Unsafe.putShort(address, key);
Unsafe.putInt(address+2, value);
}
4- allocation through new byte[], access through Unsafe
I saw this method in the lz4 compression library written in java. Basically once a byte array is instantiated i write bytes the following way:
public void putInt(short key, int value)
{
Unsafe.putShort(byteArray, BYTE_ARRAY_OFFSET + 0, key);
Unsafe.putInt(byteArray, BYTE_ARRAY_OFFSET + 2, value);
}
The methods here are simplified, but the basic idea is the one shown, I also have to implement the getter methods . Now, since i started to work in this i learnt the following things:
1- The JVM can remove array boundary checks if it's safe (in a for loop for example where the counter has to be less to the length of the array)
2- Crossing the JVM memory boundaries (reading/writing from/to native memory) has a cost.
3- Calling a native method may have a cost.
4- Unsafe putters and getters don't make boundary checks in native memory, nor on a regular array.
5- ByteBuffers wrap a byte array (non direct) or a plain native memory area (direct) so case 2 internally would look like case 1 or 3.
I run some benchmarks (but as I said I would like the opinion / experience of other developers) and it seems that case 4 is slightly (almost equals) to case 1 in reading and about 3 times faster in writing. It also seems that a for loop with Unsafe read and write (case 4) to copy an array to another (copying 8 bytes at time) is faster than System.arraycopy.
Long story made short (sorry for the long post):
case 1 seems to be fast, but that way I have to write a single byte each time + masking operations, which makes me think that maybe Unsafe, even if it's a call to native code may be faster.
case 2 is similar to case 1 and 3, so I could skip it (correct me if I'm missing something)
case 3 seems to be the slowest (at least from my benchmarks), also, I would need to copy from a native memory to a byte array because that's must be the output. But here this programmer claims it's the fastest way by far. If I understood correctly, what am I missing?
case 4 (as supported here) seems to be the fastest.
The number of choices and some contradictory information confuse me a bit, so can anyone clarify me these doubts?
I hope I wrote every needed information, otherwise just ask for clarifications.
Thanks in advance.
Case 5: DataOutputStream writing to a ByteArrayOutputStream.
Pro: it's already done; it's as fast as anything else you've mentioned here; all primitives are already implemented. The converse is DataInputStream reading from a ByteArrayInputStream.
Con: nothing I can think of.
Related
Java's String and StringBuilder are limited to a length of Integer.MAX_VALUE. In most use cases this is more than adequate, but I have just encountered a use case in which I need to handle and return a String greater than 2,684,354,560 characters.
This is required for capturing an incoming stream of characters, in which I do not have control over the size of the stream, nor do I have the option of re-architecting the solution. What I can do at most is replace a method in an existing module, or introduce a new class that replaces String and StringBuilder in that method.
As a temporary workaround, to prevent the OutOfMemory exception thrown when the StringBuilder length exceeds Integer.MAX_VALUE, I implemented the follow safeAppend():
private void safeAppend(StringBuilder ret, String current) {
if ((long)ret.length() + current.length() > Integer.MAX_VALUE) {
String truncateLeadingPart;
if (current.length() < ret.length()) {
truncateLeadingPart = ret.substring(current.length());
}
else {
int startIndex = (int)((long)ret.length()+current.length()-Integer.MAX_VALUE);
truncateLeadingPart = ret.substring(Math.min(ret.length(), startIndex));
}
ret.setLength(0);
ret.append(truncateLeadingPart);
}
ret.append(current);
}
This methods truncates the leading part and always keeps the trailing 2,147,483,647 characters part. However, this workaround/safeguard proved to be inadequate for the task at hand because we cannot afford losing any data captured from the stream.
What is a recommended approach to implementing a String and StringBuilder that are NOT limited by an int max size?
A limit of a long max size could be sufficient. Also, a single LimitlessString class that can be appended efficiently like StringBuilder is also adequate.
You wont be able to String or StringBuffer as the 32-bit length is baked into the interface. That's also true of arrays and NIO buffers, unfortunately (there have been proposals to fix this, but nothing at the time of writing).
Obviously streaming or using random file access would be a good solution if that is possible.
You are left with implementing something else. Ropes use a binary tree to represent composition of string parts. More common is to use an array of arrays, or for better GC an array of directly allocated (or memory-mapped file) NIO buffers. Someone remarked a few years ago that this area of Computer Science still has scope for more PhDs.
Well, if you Really-Really need to extend String/StringBuilder classes in such way you have to either create new class, that won't extend String/StringBuilder, because thay are marked as final, or you can change JRE binaries to make String/StringBuilder non-final. Anyway, both solutions sucks and will lead to huge support effort and will generate a lot of WTFs in future.
String and StringBuilder are final classes and cannot be patched. StringWriter would have been better.
Nice would have been:
not using two-byte chars, but bytes (CharBuffer upon ByteBuffer);
compressing (GzipOutputStream);
(as you did) periodically remove a huge chunk to a file or such;
[An aside] in the newer java there is support for single byte encodings which would not allow more characters but would use half the memory.
You'll meet resizing on appending, so the system will slow down.
I know this sounds like a broad question but I can narrow it down with an example. I am VERY new at Java. For one of my "learning" projects, I wanted to create an in-house MD5 file hasher for us to use. I started off very simple by attempting to hash a string and then moving on to a file later. I created a file called MD5Hasher.java and wrote the following:
import java.security.*;
import java.io.*;
public class MD5Hasher{
public static void main(String[] args){
String myString = "Hello, World!";
byte[] myBA = myString.getBytes();
MessageDigest myMD;
try{
myMD = MessageDigest.getInstance("MD5");
myMD.update(myBA);
byte[] newBA = myMD.digest();
String output = newBA.toString();
System.out.println("The Answer Is: " + output);
} catch(NoSuchAlgorithmException nsae){
// print error here
}
}
}
I visited java.sun.com to view the javadocs for java.security to find out how to use MessageDigest class. After reading I knew that I had to use a "getInstance" method to get a usable MessageDigest object I could use. The Javadoc went on to say "The data is processed through it using the update methods." So I looked at the update methods and determined that I needed to use the one where I fed it a byte array of my string, so I added that part. The Javadoc went on to say "Once all the data to be updated has been updated, one of the digest methods should be called to complete the hash computation." I, again, looked at the methods and saw that digest returned a byte array, so I added that part. Then I used the "toString" method on the new byte array to get a string I could print. However, when I compiled and ran the code all that printed out was this:
The Answer Is: [B#4cb162d5
I have done some looking around here on StackOverflow and found some information here:
How can I generate an MD5 hash?
that gave the following example:
String plaintext = 'your text here';
MessageDigest m = MessageDigest.getInstance("MD5");
m.reset();
m.update(plaintext.getBytes());
byte[] digest = m.digest();
BigInteger bigInt = new BigInteger(1,digest);
String hashtext = bigInt.toString(16);
// Now we need to zero pad it if you actually want the full 32 chars.
while(hashtext.length() < 32 ){
hashtext = "0"+hashtext;
}
It seems the only part I MAY be missing is the "BigInteger" part, but I'm not sure.
So, after all of this, I guess what I am asking is, how do you know to use the "BigInteger" part? I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java? I have a background in C so this Java thing seems pretty weird. Any advice on how I can get better without having to "cheat" by Googling how to do something all the time?
Thank you all for taking the time to read. :-)
The key in this particular case is that you need to realize that bytes are not "human readable", but characters are. So you need to convert bytes to characters in a certain format. For arbitrary bytes like hashes, usually hexadecimal is been used as "human readable" format. Every byte is then to be converted to a 2-character hexadecimal string which you in turn concatenate together.
This is unrelated to the language you use. You just have to understand/realize how it works "under the hoods" in a language agnostic way. You have to understand what you have (a byte array) and what you want (a hexstring). The programming language is just a tool to achieve the desired result. You just google the "functional requirement" along with the programming language you'd like to use to achieve the requirement. E.g. "convert byte array to hex string in java".
That said, the code example you found is wrong. You should actually determine each byte inside a loop and test if it is less than 0x10 and then pad it with zero instead of only padding the zero depending on the length of the resulting string (which may not necessarily be caused by the first byte being less than 0x10!).
StringBuilder hex = new StringBuilder(bytes.length * 2);
for (byte b : bytes) {
if ((b & 0xff) < 0x10) hex.append("0");
hex.append(Integer.toHexString(b & 0xff));
}
String hexString = hex.toString();
Update as per the comments on the answer of #extraneon, using new BigInteger(byte[]) is also the wrong solution. This doesn't unsign the bytes. Bytes (as all primitive numbers) in Java are signed. They have a negative range. The byte in Java ranges from -128 to 127 while you want to have a range of 0 to 255 to get a proper hexstring. You basically just need to remove the sign to make them unsigned. The & 0xff in the above example does exactly that.
The hexstring as obtained from new BigInteger(bytes).toString(16) is NOT compatible with the result of all other hexstring producing MD5 generators the world is aware of. They will differ whenever you've a negative byte in the MD5 digest.
You have actually successfully digested the message. You just don't know how to present the found digest value properly. What you have is a byte array. That's a bit difficult to read, and a toString of a byte array yields [B#somewhere which is not useful at all.
The BigInteger comes into it as a tool to format the byte array to a single number.
What you do is:
construct a BigInteger with the proper value (in this case that value happens to be encoded in the form of a byte array - your digest
Instruct the BigInteger object to return a String representation (e.g. plain, readable text) of that number, base 16 (e.g. hex)
And the while loop prefixes that value with 0-characters to get a width of 32. I'd probably use String.format for that, but whatever floats your boat :)
MessageDigests compute a byte array of something, the string that you usually see (such as 1f3870be274f6c49b3e31a0c6728957f) is actually just a conversion of the byte array to a hexadecimal string.
When you call MessageDigest.toString(), it calls MessageDigest.digest().toString(), and in Java, the toString method for a byte[] (returned by MessageDigest.digest()) returns a sort of reference to the bytes, not the actual bytes.
In the code you posted, the byte array is changed to an integer (in this case a BigInteger because it would be extremely large), and then converted to hexadecimal to be printed to a String.
The byte array computed by the digest represents a number (a 128-bit number according to http://en.wikipedia.org/wiki/MD5), and that number can be converted to any other base, so the result of the MD5 could be represented as a base-10 number, a base-2 number (as in a byte array), or, most commonly, a base-16 number.
It is OK to google for answers as long as you (eventually) understand what you copy-pasted into your app :-)
In general, I recommend starting with a good Java introductory book, or web tutorial. See these threads for more tips:
https://stackoverflow.com/questions/77839/what-are-the-best-resources-for-learning-java-books-websites-etc
Learning Java
https://stackoverflow.com/questions/78293/good-book-to-learn-to-program-well-in-java-engineering-or-architecture-wise-not
Though I'm afraid that I have no experience whatsoever using Java to play with MD5 hashes, I can recommend Sun's Java Tutorials as a fantastic resource for learning Java. They go through most of the language, and helped me out a ton when I was learing Java.
Also look around for other posts asking the same thing and see what suggestions popped up there.
The reason BigInteger is used is because the byte array is very long, too big too fit into an int or long. However, if you do want to see everything in the byte array, there's an alternate approach. You could just replace the line:
String output = newBA.toString();
with:
String output = Arrays.toString(newBA);
This will print out the contents of the array, not the reference address.
Use an IDE that shows you where the "toString()" method is coming from. In most cases it's just from the Object class and won't be very useful. It's generally recommended to overwrite the toString-method to provide some clean output, but many classes don't do this.
I'm also a newbie to development. For the current problem, I suggest the Book "Introduction To Cryptography With Java Applets" by David Bishop. It demonstrates what you need and so forth...
Any advice on how I can get better
without having to "cheat" by Googling
how to do something all the time?
By by not starting out with an MD5 hasher! Seriously, work your way up little by little on programs that you can complete without worrying about domain-specific stuff like MD5.
If you're dumping everything into main, you're not programming Java.
In a program of this scale, your main() should do one thing: create an MD5Hasher object and then call some methods on it. You should have a constructor that takes an initial string, a method to "do the work" (update, digest), and a method to print the result.
Get some tutorials and spend time on simple, traditional exercises (a Fibonacci generator, a program to solve some logic puzzle), so you understand the language basics before bothering with the libraries, which is what you are struggling with now. Then you can start doing useful stuff.
I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java?
You could replace here Java with the language of your choice that you don't know/haven't mastered yet. Even if you worked 10 years in a specific language, you will still get those "Aha! This is the way it's working!"-effects, though not that often as in the beginning.
The point you need to learn here is that toString() is not returning the representation you want/expect, but any the implementer has chosen. The default implementation of toString() is like this (javadoc):
Returns a string representation of the object. In general, the toString method returns a string that "textually represents" this object. The result should be a concise but informative representation that is easy for a person to read. It is recommended that all subclasses override this method.
The toString method for class Object returns a string consisting of the name of the class of which the object is an instance, the at-sign character `#', and the unsigned hexadecimal representation of the hash code of the object. In other words, this method returns a string equal to the value of:
getClass().getName() + '#' + Integer.toHexString(hashCode())
How is a person supposed to know which
way to go in Java? I have a background
in C so this Java thing seems pretty
weird. Any advice on how I can get
better without having to "cheat" by
Googling how to do something all the
time?
Obvious answers are 1- google when you have questions (and it's not considered cheating imo) and 2- read books on the subject matter.
Apart from these two, I would recommend trying to find a mentor for yourself. If you do not have experienced Java developers at work, then try to join a local Java developer user group. You can find more experienced developers there and perhaps pick their brains to get answers to your questions.
I am trying to pass a byte[] containing ASCII characters to log4j, to be logged into a file using the obvious representation. When I simply pass in the byt[] it is of course treated as an object and the logs are pretty useless. When I try to convert them to strings using new String(byte[] data), the performance of my application is halved.
How can I efficiently pass them in, without incurring the approximately 30us time penalty of converting them to strings.
Also, why does it take so long to convert them?
Thanks.
Edit
I should add that I am optmising for latency here - and yes, 30us does make a difference! Also, these arrays vary from ~100 all the way up to a few thousand bytes.
ASCII is one of the few encodings that can be converted to/from UTF16 with no arithmetic or table lookups so it's possible to convert manually:
String convert(byte[] data) {
StringBuilder sb = new StringBuilder(data.length);
for (int i = 0; i < data.length; ++ i) {
if (data[i] < 0) throw new IllegalArgumentException();
sb.append((char) data[i]);
}
return sb.toString();
}
But make sure it really is ASCII, or you'll end up with garbage.
What you want to do is delay processing of the byte[] array until log4j decides that it actually wants to log the message. This way you could log it at DEBUG level, for example, while testing and then disable it during production. For example, you could:
final byte[] myArray = ...;
Logger.getLogger(MyClass.class).debug(new Object() {
#Override public String toString() {
return new String(myArray);
}
});
Now you don't pay the speed penalty unless you actually log the data, because the toString method isn't called until log4j decides it'll actually log the message!
Now I'm not sure what you mean by "the obvious representation" so I've assumed that you mean convert to a String by reinterpreting the bytes as the default character encoding. Now if you are dealing with binary data, this is obviously worthless. In that case I'd suggest using Arrays.toString(byte[]) to create a formatted string along the lines of
[54, 23, 65, ...]
If your data is in fact ASCII (i.e. 7-bit data), then you should be using new String(data, "US-ASCII") instead of depending on the platform default encoding. This may be faster than trying to interpret it as your platform default encoding (which could be UTF-8, which requires more introspection).
You could also speed this up by avoiding the Charset-Lookup hit each time, by caching the Charset instance and calling new String(data, charset) instead.
Having said that: it's been a very, very long time since I've seen real ASCII data in production environment
Halved performance? How large is this byte array? If it's for example 1MB, then there are certainly more factors to take into account than just "converting" from bytes to chars (which is supposed to be fast enough though). Writing 1MB of data instead of "just" 100bytes (which the byte[].toString() may generate) to a log file is obviously going to take some time. The disk file system is not as fast as RAM memory.
You'll need to change the string representation of the byte array. Maybe with some more sensitive information, e.g. the name associated with it (filename?), its length and so on. After all, what does that byte array actually represent?
Edit: I can't remember to have seen the "approximately 30us" phrase in your question, maybe you edited it in within 5 minutes after asking, but this is actually microoptimization and it should certainly not cause "halved performance" in general. Unless you write them a million times per second (still then, why would you want to do that? aren't you overusing the phenomenon "logging"?).
Take a look here: Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)
I have a C structure that is sent over some intermediate networks and gets received over a serial link by a java code. The Java code gives me a byte array that I now want to repackage it as the original structure. Now if the receive code was in C, this was simple. Is there any simple way to repackage a byte[] in java to a C struct. I have minimal experience in java but this doesnt appear to be a common problem or solved in any FAQ that I could find.
FYI the C struct is
struct data {
uint8_t moteID;
uint8_t status; //block or not
uint16_t tc_1;
uint16_t tc_2;
uint16_t panelTemp; //board temp
uint16_t epoch#;
uint16_t count; //pkt seq since the start of epoch
uint16_t TEG_v;
int16_t TEG_c;
}data;
I would recommend that you send the numbers across the wire in network byte order all the time. This eliminates the problems of:
Compiler specific word boundary generation for your structure.
Byte order specific to your hardware (both sending and receiving).
Also, Java's numbers are always stored in network-byte-order no matter the platform that you run Java upon (the JVM spec requires a specific byte order).
A very good class for extracting bits from a stream is java.nio.ByteBuffer, which can wrap arbitrary byte arrays; not just those coming from a I/O class in java.nio. You really should not hand code your own extraction of primitive values if at all possible (i.e. bit shifting and so forth) since it is easy to get this wrong, the code is the same for every instance of the same type, and there are plenty of standard classes that provide this for you.
For example:
public class Data {
private byte moteId;
private byte status;
private short tc_1;
private short tc_2;
//...etc...
private int tc_2_as_int;
private Data() {
// empty
}
public static Data createFromBytes(byte[] bytes) throws IOException {
final Data data = new Data();
final ByteBuffer buf = ByteBuffer.wrap(bytes);
// If needed...
//buf.order(ByteOrder.LITTLE_ENDIAN);
data.moteId = buf.get();
data.status = buf.get();
data.tc_1 = buf.getShort();
data.tc_2 = buf.getShort();
// ...extract other fields here
// Example to convert unsigned short to a positive int
data.tc_2_as_int = buf.getShort() & 0xffff;
return data;
}
}
Now, to create one, just call Data.createFromBytes(byteArray).
Note that Java does not have unsigned integer variables, but these will be retrieved with the exact same bit pattern. So anything where the high-order bit is not set will be exactly the same when used. You will need to deal with the high-order bit if you expected that in your unsigned numbers. Sometimes this means storing the value in the next larger integer type (byte -> short; short -> int; int -> long).
Edit: Updated the example to show how to convert a short (16-bit signed) to an int (32-bit signed) with the unsigned value with tc_2_as_int.
Note also that if you cannot change the byte-order and it is not in network order, then java.nio.ByteBuffer can still serve you here with buf.order(ByteOrder.LITTLE_ENDIAN); before retrieving the values.
This can be difficult to do when sending from C to C.
If you have a data struct, cast it so that you end up with an array of bytes/chars and then you just blindly send it you can sometimes end up with big problems decoding it on the other end.
This is because sometimes the compiler has decided to optimize the way that the data is packed in the struct, so in raw bytes it may not look exactly how you expect it would look based on how you code it.
It really depends on the compiler!
There are compiler pragma's you can use to make packing unoptimized. See C/C++ Preprocessor Reference - pack
The other problem is the 32/64-bit bit problem if you just use "int", and "long" without specifying the number of bytes... but you have done that :-)
Unfortunately, Java doesnt really have structs... but it represents the same information in classes.
What I recommend is that you make a class that consists of your variables, and just make a custom unpacking function that will pull the bytes out from the received packet (after you have checked its correctness after transfer) and then load them in to the class.
e.g. You have a data class like
class Data
{
public int moteID;
public int status; //block or not
public int tc_1;
public int tc_2;
}
Then when you receive a byte array, you can do something like this
Data convertBytesToData(byte[] dataToConvert)
{
Data d = Data();
d.moteId = (int)dataToConvert[0];
d.status = (int)dataToConvert[1];
d.tc_1 = ((int)dataToConvert[2] << 8) + dataTocConvert[3]; // unpacking 16-bits
d.tc_2 = ((int)dataToConvert[4] << 8) + dataTocConvert[5]; // unpacking 16-bits
}
I might have the 16-bit unpacking the wrong way around, it depends on the endian of your C system, but you'll be able to play around and see if its right or not.
I havent played with Java for sometime, but hopefully there might be byte[] to int functions built in these days.
I know there are for C# anyway.
With all this in mind, if you are not doing high data rate transfers, definately look at JSON and Protocol Buffers!
Assuming you have control over both ends of the link, rather than sending raw data you might be better off going for an encoding that C and Java can both use. Look at either JSON or Protocol Buffers.
What you are trying to do is problematic for a couple of reasons:
Different C implementations will represent uint16_t (and int16_t) values in different ways. In some cases, the most significant byte will be first when the struct is laid out in memory. In other cases, the least significant byte will.
Different C compilers may pack the fields of the struct differently. So it is possible (for example) that the fields have been reordered or padding may have been added.
So what this all means is that you have to figure out exactly the struct is laid out ... and just hope that this doesn't change when / if you change C compilers or C target platform.
Having said that, I could not find a Java library for decoding arbitrary binary data streams that allows you to select "endian-ness". The DataInputStream and DataOutputStream classes may be the answer, but they are explicitly defined to send/expect the high order byte first. If your data comes the other way around you will need to do some Java bit bashing to fix it.
EDIT : actually (as #Kevin Brock points out) java.nio.ByteBuffer allows you to specify the endian-ness when fetching various data types from a binary buffer.
In C if you have a certain type of packet, what you generally do is define some struct and cast the char * into a pointer to the struct. After this you have direct programmatic access to all data fields in the network packet. Like so :
struct rdp_header {
int version;
char serverId[20];
};
When you get a network packet you can do the following quickly :
char * packet;
// receive packet
rdp_header * pckt = (rdp_header * packet);
printf("Servername : %20.20s\n", pckt.serverId);
This technique works really great for UDP based protocols, and allows for very quick and very efficient packet parsing and sending using very little code, and trivial error handling (just check the length of the packet). Is there an equivalent, just as quick way in java to do the same ? Or are you forced to use stream based techniques ?
Read your packet into a byte array, and then extract the bits and bytes you want from that.
Here's a sample, sans exception handling:
DatagramSocket s = new DatagramSocket(port);
DatagramPacket p;
byte buffer[] = new byte[4096];
while (true) {
p = new DatagramPacket(buffer, buffer.length);
s.receive(p);
// your packet is now in buffer[];
int version = buffer[0] << 24 + buffer[1] << 16 + buffer[2] < 8 + buffer[3];
byte[] serverId = new byte[20];
System.arraycopy(buffer, 4, serverId, 0, 20);
// and process the rest
}
In practise you'll probably end up with helper functions to extract data fields in network order from the byte array, or as Tom points out in the comments, you can use a ByteArrayInputStream(), from which you can construct a DataInputStream() which has methods to read structured data from the stream:
...
while (true) {
p = new DatagramPacket(buffer, buffer.length);
s.receive(p);
ByteArrayInputStream bais = new ByteArrayInputStream(buffer);
DataInput di = new DataInputStream(bais);
int version = di.readInt();
byte[] serverId = new byte[20];
di.readFully(serverId);
...
}
I don't believe this technique can be done in Java, short of using JNI and actually writing the protocol handler in C. The other way to do the technique you describe is variant records and unions, which Java doesn't have either.
If you had control of the protocol (it's your server and client) you could use serialized objects (inc. xml), to get the automagic (but not so runtime efficient) parsing of the data, but that's about it.
Otherwise you're stuck with parsing Streams or byte arrays (which can be treated as Streams).
Mind you the technique you describe is tremendously error prone and a source of security vulnerabilities for any protocol that is reasonably interesting, so it's not that great a loss.
I wrote something to simplify this kind of work. Like most tasks, it was much easier to write a tool than to try to do everything by hand.
It consisted of two classes, Here's an example of how it was used:
// Resulting byte array is 9 bytes long.
byte[] ba = new ByteArrayBuilder()
.writeInt(0xaaaa5555) // 4 bytes
.writeByte(0x55) // 1 byte
.writeShort(0x5A5A) // 2 bytes
.write( (new BitBuilder()) // 2 bytes---0xBA12
.write(3, 5) // 101 (3 bits value of 5)
.write(2, 3) // 11 (2 bits value of 3)
.write(3, 2) // 010 (...)
.write(2, 0) // 00
.write(2, 1) // 01
.write(4, 2) // 0002
).getBytes();
I wrote the ByteArrayBuilder to simply accumulate bits. I used a method chaining pattern (Just returning "this" from all methods) to make it easier to write a bunch of statements together.
All the methods in the ByteArrayBuilder were trivial, just like 1 or 2 lines of code (I just wrote everything to a data output stream)
This is to build a packet, but tearing one apart shouldn't be any harder.
The only interesting method in BitBuilder is this one:
public BitBuilder write(int bitCount, int value) {
int bitMask=0xffffffff;
bitMask <<= bitCount; // If bitcount is 4, bitmask is now ffffff00
bitMask = ~bitMask; // and now it's 000000ff, a great mask
bitRegister <<= bitCount; // make room
bitRegister |= (value & bitMask); // or in the value (masked for safety)
bitsWritten += bitCount;
return this;
}
Again, the logic could be inverted very easily to read a packet instead of build one.
edit: I had proposed a different approach in this answer, I'm going to post it as a separate answer because it's completely different.
Look at the Javolution library and its struct classes, they will do just what you are asking for. In fact, the author has this exact example, using the Javolution Struct classes to manipulate UDP packets.
This is an alternate proposal for an answer I left above. I suggest you consider implementing it because it would act pretty much the same as a C solution where you could pick fields out of a packet by name.
You might start it out with an external text file something like this:
OneByte, 1
OneBit, .1
TenBits, .10
AlsoTenBits, 1.2
SignedInt, +4
It could specify the entire structure of a packet, including fields that may repeat. The language could be as simple or complicated as you need--
You'd create an object like this:
new PacketReader packetReader("PacketStructure.txt", byte[] packet);
Your constructor would iterate over the PacketStructure.txt file and store each string as the key of a hashtable, and the exact location of it's data (both bit offset and size) as the data.
Once you created an object, passing in the bitStructure and a packet, you could randomly access the data with statements as straight-forward as:
int x=packetReader.getInt("AlsoTenBits");
Also note, this stuff would be much less efficient than a C struct, but not as much as you might think--it's still probably many times more efficient than you'll need. If done right, the specification file would only be parsed once, so you would only take the minor hit of a single hash lookup and a few binary operations for each value you read from the packet--not bad at all.
The exception is if you are parsing packets from a high-speed continuous stream, and even then I doubt a fast network could flood even a slowish CPU.
Short answer, no you can't do it that easily.
Longer answer, if you can use Serializable objects, you can hook your InputStream up to an ObjectInputStream and use that to deserialize your objects. However, this requires you have some control over the protocol. It also works easier if you use a TCP Socket. If you use a UDP DatagramSocket, you will need to get the data from the packet and then feed that into a ByteArrayInputStream.
If you don't have control over the protocol, you may be able to still use the above deserialization method, but you're probably going to have to implement the readObject() and writeObject() methods rather than using the default implementation given to you. If you need to use someone else's protocol (say because you need to interop with a native program), this is likely the easiest solution you are going to find.
Also, remember that Java uses UTF-16 internally for strings, but I'm not certain that it serializes them that way. Either way, you need to be very careful when passing strings back and forth to non-Java programs.