Library for serializing java objects to fixed-width byte arrays - java

I would like to store a very simple pojo object in binary format:
public class SampleDataClass {
private long field1;
private long field2;
private long field3;
}
To do this, I have written a simple serialize/deserialize pair of methods:
public class SampleDataClass {
// ... Fields as above
public static void deserialize(ByteBuffer buffer, SampleDataClass into) {
into.field1 = buffer.getLong();
into.field2 = buffer.getLong();
into.field3 = buffer.getLong();
}
public static void serialize(ByteBuffer buffer, SampleDataClass from) {
buffer.putLong(from.field1);
buffer.putLong(from.field2);
buffer.putLong(from.field3);
}
}
Simple and efficient, and most importantly the size of the objects in binary format is fixed. I know the size of each record serialized will be 3 x long, i.e. 3 x 8bytes = 24 bytes.
This is crucial, as I will be recording these sequentially and I need to be able to find them by index later on, i.e. "Find me the 127th record".
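The fixed record size is what makes lookup by index cheap: record n starts at byte offset n * 24. A minimal sketch of that arithmetic, assuming zero-based indices and a ByteBuffer that already holds the records back to back; the class name RecordIndex and its methods are hypothetical, made up for this illustration:

```java
import java.nio.ByteBuffer;

final class RecordIndex {
    // Three long fields, 8 bytes each (hypothetical constant for this sketch).
    static final int RECORD_SIZE = 3 * Long.BYTES;

    // Byte offset of the record with the given zero-based index.
    static int offsetOf(int index) {
        return Math.multiplyExact(index, RECORD_SIZE);
    }

    // Reads the three fields of one record using absolute reads,
    // so the buffer's current position is left untouched.
    static long[] readRecord(ByteBuffer buffer, int index) {
        int base = offsetOf(index);
        return new long[] {
            buffer.getLong(base),
            buffer.getLong(base + Long.BYTES),
            buffer.getLong(base + 2 * Long.BYTES)
        };
    }
}
```

The same offset could be passed to a file-based API instead of an in-memory buffer; that part is left out here.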
This is working fine for me, but I hate the boilerplate, and the fact that at some point I'm going to make a mistake and end up writing a load of data that can't be read back because there's an inconsistency between my serialize and deserialize methods.
Is there a library that generates something like this for me?
Ideally I'm looking for something like protobuf, with a fixed-length encoding scheme. Later-on, I'd like to encode strings too. These will also have a fixed length. If a string exceeds the length it's truncated to n bytes. If a string is too short, I'll null-terminate it (or similar).
Finally, protobuf supports different versions of the protocol. It is inevitable I'll need to do that eventually.
I was hoping someone had a suggestion, before I start rolling-my-own

Make your class implement the java.io.Serializable interface. Then you can use java.io.ObjectOutputStream and java.io.ObjectInputStream to serialize / deserialize objects to / from streams. The write and read methods take byte arrays as arguments.
To make it fixed length, standardize the size of the byte[] arrays used.

The most difficult part here is capping your strings or collections. You can do this with Kryo for Strings by overriding default serializers. Placing strings into a custom buffer class (e.g. FixedSerializableBuffer) which stores or is annotated with a length to cut also makes sense.
public class KryoDemo {
static class Foo{
String s;
long v;
Foo() {
}
Foo(String s, long v) {
this.s = s;
this.v = v;
}
@Override
public String toString() {
final StringBuilder sb = new StringBuilder("Foo{");
sb.append("s='").append(s).append('\'');
sb.append(", v=").append(v);
sb.append('}');
return sb.toString();
}
}
public static void main(String[] args) {
Kryo kryo = new Kryo();
Foo foo = new Foo("test string", 1);
kryo.register(String.class, new Serializer<String>() {
{
setImmutable(true);
setAcceptsNull(true);
}
public void write(Kryo kryo, Output output, String s) {
// setAcceptsNull(true) means s may be null here
if (s != null && s.length() > 4) {
s = s.substring(0, 4);
}
output.writeString(s);
}
public String read(Kryo kryo, Input input, Class<String> type) {
return input.readString();
}
});
// serialization part, data is binary inside this output
ByteBufferOutput output = new ByteBufferOutput(100);
kryo.writeObject(output, foo);
System.out.println("before: " + foo);
System.out.println("after: " + kryo.readObject(new Input(output.toBytes()), Foo.class));
}
}
This prints:
before: Foo{s='test string', v=1}
after: Foo{s='test', v=1}
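Note that the Kryo serializer above caps the string but does not pad short strings, so the encoded size still varies. The truncate-or-pad scheme described in the question can be sketched with a plain ByteBuffer. This is only an illustration under stated assumptions: UTF-8 encoding, zero bytes as padding, and a hypothetical helper class FixedString that is not part of any library:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FixedString {
    // Writes exactly `width` bytes: the encoded string, truncated or zero-padded.
    static void put(ByteBuffer buffer, String s, int width) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf truncates or pads with zero bytes as needed.
        // Caveat: truncation can cut a multi-byte UTF-8 character in half.
        buffer.put(Arrays.copyOf(encoded, width));
    }

    // Reads exactly `width` bytes and drops everything from the first zero byte on.
    static String get(ByteBuffer buffer, int width) {
        byte[] raw = new byte[width];
        buffer.get(raw);
        int end = 0;
        while (end < width && raw[end] != 0) {
            end++;
        }
        return new String(raw, 0, end, StandardCharsets.UTF_8);
    }
}
```

Because both methods always consume `width` bytes, a record made of longs and such string fields keeps a fixed size.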

If the only additional requirement over standard serialization is efficient random access to the n-th entry, there are alternatives to fixed-size entries, and that you will be storing variable length entries (such as strings) makes me think that these alternatives deserve consideration.
One such alternative is to have a "directory" with fixed length entries, each of which points to the variable length content. Random access to an entry is then implemented by reading the corresponding pointer from the directory (which can be done with random access, as the directory entries are fixed size), and then reading the block it points to. This approach has the disadvantage that an additional I/O access is required to access the data, but permits a more compact representation of the data, as you don't have to pad variable length content, which in turn speeds up sequential reading. Of course, neither the problem nor the above solution is novel - file systems have been around for a long time ...
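The directory idea can be sketched in memory as follows. It assumes each directory entry is a 4-byte offset plus a 4-byte length and that both parts live in heap ByteBuffers; the class DirectoryStore and its methods are hypothetical names for this sketch, and a real implementation would put the two parts in files:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DirectoryStore {
    // Each directory entry: 4-byte offset + 4-byte length.
    private static final int ENTRY_SIZE = 2 * Integer.BYTES;

    private final ByteBuffer directory; // fixed-size entries
    private final ByteBuffer data;      // variable-length payloads, back to back

    DirectoryStore(int maxEntries, int dataCapacity) {
        directory = ByteBuffer.allocate(maxEntries * ENTRY_SIZE);
        data = ByteBuffer.allocate(dataCapacity);
    }

    // Appends a payload and records where it starts and how long it is.
    void append(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        directory.putInt(data.position()).putInt(bytes.length);
        data.put(bytes);
    }

    // Random access: one lookup in the directory, then one read in the data block.
    String get(int index) {
        int offset = directory.getInt(index * ENTRY_SIZE);
        int length = directory.getInt(index * ENTRY_SIZE + Integer.BYTES);
        // allocate() returns a heap buffer, so array() is available.
        return new String(data.array(), data.arrayOffset() + offset, length,
                StandardCharsets.UTF_8);
    }
}
```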

Related

Java structures for storing chess moves

I have an Integer[64] of numbers 0 - 6 which say what type of chess piece is there. I have a Boolean[64] of what color each place is. I need to be able to save them as (Strings?) and save them for later use, but I need a fast and efficient way. As of now I am looping through both arrays and creating a 64char String, but I make a few million of them because my chess AI looks deep into the game. Thoughts?
First of all you should redefine your data structure.
Instead of two arrays with integer and booleans you can define one array
byte[64] field;
Then add two methods that retrieve the information about the type and the color:
public int getType(int fieldNo) {
// the lowest three bits hold the piece type (int 0-6)
return field[fieldNo] & 0x07;
}
public boolean getColor(int fieldNo) {
// the fourth bit holds the color
return (field[fieldNo] & 0x08) > 0;
}
You can now save the complete chess field just by writing/reading the fields array:
public byte[] readField(String file) throws IOException {
byte[] field = new byte[64];
try (DataInputStream stream = new DataInputStream(new FileInputStream(file))) {
stream.readFully(field, 0, 64);
}
return field;
}
public void writeField(String file, byte[] field) throws IOException {
try (DataOutputStream stream = new DataOutputStream(new FileOutputStream(file))) {
stream.write(field, 0, 64);
}
}
This saves a complete field in 64 bytes.
More improvements:
Compress the 64-byte field when saving more than one field to one file. Compression should be good because most of your bytes have value 0.
Instead of using byte[64] you can use byte[32] only and map the information to the first / last 4 bits of one byte.
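The byte[32] variant could look roughly like this. It is a sketch, assuming even squares go into the low four bits and odd squares into the high four bits of each byte; the class name PackedBoard and its methods are hypothetical:

```java
final class PackedBoard {
    // 64 squares, 4 bits each: bits 0-2 piece type (0-6), bit 3 color.
    private final byte[] squares = new byte[32];

    void set(int square, int type, boolean color) {
        int nibble = (type & 0x07) | (color ? 0x08 : 0);
        int shift = (square & 1) * 4;   // even squares: low nibble, odd: high nibble
        int index = square >> 1;        // two squares share one byte
        int cleared = squares[index] & ~(0x0F << shift);
        squares[index] = (byte) (cleared | (nibble << shift));
    }

    int getType(int square) {
        return (squares[square >> 1] >> ((square & 1) * 4)) & 0x07;
    }

    boolean getColor(int square) {
        return ((squares[square >> 1] >> ((square & 1) * 4)) & 0x08) != 0;
    }

    // Copy of the 32 bytes, ready to be written to a stream.
    byte[] toBytes() {
        return squares.clone();
    }
}
```

The masks make the sign extension of the byte irrelevant, since only the four bits of interest survive.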

Java Changing variable from another class and the depending values

I have a problem with my class variables, as always ^^
So I'm constructing a class named Prng, with variables
private int randListSize = 10;
private byte randList[] = new byte[randListSize];
private byte[] seed = new byte[]{ 34, -70, -4, 117, 98 };
the getters/setters associated
and the method
public void prng() {
SecureRandom random;
try {
random = SecureRandom.getInstance("SHA1PRNG");
random.setSeed(seed);
random.nextBytes(randList);
} catch (NoSuchAlgorithmException e) {
e.printStackTrace();
}
}
in another class named Test.java, I want to :
1) set randListSize to the number of random bytes I want
2) have the randList of this size, and not from original 10 size
whenever I try, my randList is always of size 10. Can you help me please ?
in my class Test I've written :
Prng prng = new Prng();
System.out.println(prng.getRandListSize() + " " + prng.getRandList().length);
prng.setRandListSize(11);
System.out.println(prng.getRandListSize()+ " " + prng.getRandList().length);
which returns me "10 10 ; 11 10" and I want "11 11" at the end.
EDIT : here's my getters/setters :
public int getRandListSize() {
return randListSize;
}
public void setRandListSize(int randListSize) {
this.randListSize = randListSize;
}
public byte[] getSeed() {
return seed;
}
public void setSeed(byte[] seed) {
this.seed = seed;
}
public byte[] getRandList() {
return randList;
}
public void setRandList(byte[] randList) {
this.randList = randList;
}
First, randListSize, in my opinion, is a useless field, as that property can be retrieved directly from the array, and as the operation isn't expensive the value doesn't need to be cached. Thus, you really don't need getters/setters for that field either. I see you're using it as an initial size variable, but in that case I think it'd be better for it to be a parameter for a constructor/factory method instead, as it really doesn't need to be used anywhere else.
Second, setRandListSize() doesn't actually change randList's size, as arrays, once created, cannot be structurally modified (i.e. you can't make arrays longer/shorter after creating them). You're just changing an unrelated variable, which leads to some confusion once randListSize stops matching randList.length. This is the reason you're seeing 11 10 instead of 11 11 -- randListSize is only used at the moment of array creation, and later changes to randListSize don't affect the array.
In order to get the result you want, you're going to have to allocate an entirely new array and set randList to point to it instead of your old one, which you can do using your setRandList() method. Alternatively, you can write a method, perhaps called createNewRandList(int newLength), to do all the work at once.
Your setRandListSize method will need to recreate the randList array. If you need to keep the data in it, your method should copy whatever data can fit into the new array.
public void setRandListSize(int randListSize) {
this.randListSize = randListSize;
this.randList = new byte[randListSize];
}
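If the existing bytes should survive the resize, Arrays.copyOf does the allocation and the copy in one call. A minimal sketch, using a stripped-down stand-in class (ResizableRandList is a hypothetical name, not the asker's Prng) that also drops the separate size field:

```java
import java.util.Arrays;

final class ResizableRandList {
    private byte[] randList = new byte[10];

    // The size is read from the array itself, so it can never get out of sync.
    int getRandListSize() {
        return randList.length;
    }

    byte[] getRandList() {
        return randList;
    }

    // Arrays.copyOf allocates a new array and copies the old contents,
    // truncating them or padding with zeros to the requested length.
    void setRandListSize(int newSize) {
        randList = Arrays.copyOf(randList, newSize);
    }
}
```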

Java how common is extending/wrapping built-in classes

I'm new to the Java language and I've tried to write my first relatively complex program. After I wrote a few classes I've realized that I barely use built-in classes (like BigInteger, MessageDigest, ByteBuffer) directly because they don't totally fit my needs. Instead I write my own class and inside the class I use the built-in class as an attribute.
Example:
public class SHA1 {
public static final int SHA_DIGEST_LENGTH = 20;
private MessageDigest md;
public SHA1() {
try {
md = MessageDigest.getInstance("SHA-1");
} catch (NoSuchAlgorithmException e) {
e.printStackTrace();
}
}
public void update(byte[] data) {
md.update(data);
}
public void update(BigNumber bn) {
md.update(bn.asByteArray());
}
public void update(String data) {
md.update(data.getBytes());
}
public byte[] digest() {
return md.digest();
}
}
With this simple class I don't have to use try/catch when using SHA1, I can pass my custom BigNumber class as a parameter, and I can also pass a String as a parameter to the update function.
The following BigNumber class contains all of the functions what I need and exactly how I need them.
public class BigNumber {
private BigInteger m_bn;
public BigNumber() {
m_bn = new BigInteger("0");
}
public BigNumber(BigInteger bn) {
m_bn = bn;
}
public BigNumber(String hex) {
setHexStr(hex);
}
//reversed no minsize
public byte[] asByteArray() {
return asByteArray(0, true);
}
//reversed with minsize
public byte[] asByteArray(int minSize) {
return asByteArray(minSize, true);
}
public byte[] asByteArray(int minSize, boolean rev) {
byte[] mag = m_bn.toByteArray();
//delete sign bit
//there is always a sign bit! so if bitNum % 8 is zero then
//the sign bit created a new byte (0th)
if(getNumBits() % 8 == 0) {
byte[] tmp = new byte[mag.length-1];
System.arraycopy(mag, 1, tmp, 0, mag.length-1);
mag = tmp;
}
//extend the byte array if needed
int byteSize = (minSize >= getNumBytes()) ? minSize : getNumBytes();
byte[] tmp = new byte[byteSize];
//if mag is shorter than byteSize, the leading bytes of tmp stay 0x00
System.arraycopy(mag, 0, tmp, byteSize-mag.length, mag.length);
if(rev) ByteManip.reverse(tmp);
return tmp;
}
public String asHexStr() {
return ByteManip.byteArrayToHexStr(asByteArray(0, false));
}
public void setHexStr(String hex) {
m_bn = new BigInteger(hex, 16);
}
public void setBinary(byte[] data) {
//reverse = true
ByteManip.reverse(data);
//set as hex (binary set has some bug with the sign bit...)
m_bn = new BigInteger(ByteManip.byteArrayToHexStr(data), 16);
}
public void setRand(int byteSize) {
byte[] tmp = new byte[byteSize];
new Random().nextBytes(tmp);
//reversing byte order, but it doesn't really matter since it is a random
//number
setBinary(tmp);
}
public int getNumBytes() {
return (m_bn.bitLength() % 8 == 0) ? (m_bn.bitLength() / 8) : (m_bn.bitLength() / 8 + 1);
}
public int getNumBits() {
return m_bn.bitLength();
}
public boolean isZero() {
return m_bn.equals(BigInteger.ZERO);
}
//operations
public BigNumber modExp(BigNumber exp, BigNumber mod) {
return new BigNumber(m_bn.modPow(exp.m_bn, mod.m_bn));
}
public BigNumber mod(BigNumber m) {
return new BigNumber(m_bn.mod(m.m_bn));
}
public BigNumber add(BigNumber bn) {
return new BigNumber(m_bn.add(bn.m_bn));
}
public BigNumber subtract(BigNumber bn) {
return new BigNumber(m_bn.subtract(bn.m_bn));
}
public BigNumber multiply(BigNumber bn) {
return new BigNumber(m_bn.multiply(bn.m_bn));
}
}
My question is: how common is it in Java to use these kinds of classes instead of the built-in classes? Does it make my code unreadable for other programmers (compared to implementing everything with built-in classes)?
I've read that new C++ programmers desperately try to write the code they used to write in C, so the benefits of C++ remain hidden from them.
I'm afraid I'm doing something like that in Java: trying to implement everything on my own instead of using the built-in classes directly.
Is this happening (for example in the BigNumber class)?
Thank you for your opinions!
I normally write a utility class to hold this kind of helper logic, such as
public class CommonUtil{
public byte[] asByteArray(int minSize)
{
return "something".getBytes();
}
// add more utility methods
}
Wrapping a class makes sense when you add some value by doing so. If you are adding small functionality it can be worth using a Utility class instead of wrapping an existing one.
I think that unless you have a very good reason for implementing the same functionality again, you probably should not do it. Here are several reasons:
Built-in classes are used by a lot of people around the world, so they are likely to have fewer bugs than your code.
Users who are experienced in Java will be better at using standard classes than your classes, and they will need less time to understand your code and write something new in your project.
Built-in classes have good documentation, so they are much easier to use.
You are wasting your time by implementing something that was already implemented and tested by Java professionals. It is better to concentrate on your own project.
If you are writing a long-term project you will need to support all your classes. Oracle is already supporting the built-in classes. Let them do their job!
Last but not least: are you sure that you know more about the problem than the author of a built-in class? Only if the answer is yes, consider writing your own implementation. Even implementing everyday classes, such as collections or time-related classes, can be tricky.
You're not gaining anything by making a class that does this stuff for you. If you're going to be doing certain operations a lot, then you might want to create a new class with static methods that do these important things for you.
Let's assume that you want a sorted array at all times. You could make a new class, let's call it SortedArray. You could sort it whenever you add something in, but why would you do that when you can just add in everything and then call the (utility) method Arrays.sort?
For common operations, take a look at Java's Arrays class - if you are doing something often, that's something you make a method for, like searching and sorting. In your case, you might make a utility method that turns the BigInteger into a byte array for you. You shouldn't be just making your own, 'better' version that does what you want it. When other people look at your code, when you use standard objects it's much better, instead of having custom objects that don't really do anything.
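Such a utility method for the BigInteger case might look like the sketch below. It assumes non-negative values and big-endian output, and it leaves out the byte reversal that the asker's BigNumber applies by default; BigIntegerUtil and toUnsignedBytes are hypothetical names:

```java
import java.math.BigInteger;

final class BigIntegerUtil {
    private BigIntegerUtil() {
        // static helpers only
    }

    // Big-endian magnitude of a non-negative value,
    // left-padded with zero bytes to at least minSize bytes.
    static byte[] toUnsignedBytes(BigInteger value, int minSize) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("negative value");
        }
        byte[] raw = value.toByteArray();
        // toByteArray() may prepend a zero byte that only carries the sign; skip it.
        int start = (raw.length > 1 && raw[0] == 0) ? 1 : 0;
        int length = raw.length - start;
        byte[] result = new byte[Math.max(length, minSize)];
        System.arraycopy(raw, start, result, result.length - length, length);
        return result;
    }
}
```

Callers keep working with plain BigInteger and only reach for the helper at the point where bytes are needed.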
As @Shark commented, there's no point in creating your own solutions, because:
They take time to create
They become not as flexible
However, you can extend classes (it's recommended) or use 3rd party frameworks that might suit you better.

Are there any tricks to reduce memory usage when storing String data type in hashmap?

I need to store value pair (word and number) in the Map.
I am trying to use TObjectIntHashMap from the Trove library with char[] as the key, because I need to minimize memory usage. But with this approach I cannot retrieve the value with the get() method.
I guess I cannot use a primitive char array as a Map key because of hashCode issues.
I also tried TCharArrayList, but that takes a lot of memory as well.
I read in another Stack Overflow question with a similar goal a suggestion to use TLongIntHashMap and store the words encoded as long values. In my case the words may contain Latin characters or various other characters that appear in Wikipedia collections, so I do not know whether a long is enough to encode them.
I have also tried a trie data structure, but I need to consider performance as well and choose what is best for both memory usage and performance.
Do you have any idea or suggestion for this issue?
It sounds like the most compact way to store the data is to use a byte[] encoded in UTF-8 or similar. You can wrap this in your own class or write your own HashMap which allows byte[] as a key.
I would reconsider how much time it is worth spending to save some memory. If you are talking about a PC or server, at minimum wage you need to save 1 GB for an hour's work, so if you are only looking to save 100 MB that's about 6 minutes including testing.
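The reason a raw char[] or byte[] fails as a key is that arrays inherit identity-based equals() and hashCode() from Object. A wrapper class with content-based versions fixes the lookup. The sketch below assumes UTF-8 encoding and uses a hypothetical class name ByteKey; any map that relies on equals()/hashCode() should accept it as a key, though the wrapper object itself adds some per-entry overhead:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteKey {
    private final byte[] bytes;

    private ByteKey(byte[] bytes) {
        this.bytes = bytes;
    }

    // Encodes the word once; the String itself does not need to be kept.
    static ByteKey of(String word) {
        return new ByteKey(word.getBytes(StandardCharsets.UTF_8));
    }

    // Content-based equality, unlike the identity-based default of arrays.
    @Override
    public boolean equals(Object other) {
        return other instanceof ByteKey && Arrays.equals(bytes, ((ByteKey) other).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```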
Write your own class that implements CharSequence, and write your own implementation of equals() and hashCode(). The implementation would also pre-allocate large shared char[] storage, and use bits of it at a time. (You can definitely incorporate @Peter Lawrey's excellent suggestion into this, too, and use byte[] storage.)
There's also an opportunity to do a 'soft intern()' using an LRU cache. I've noted where the cache would go.
Here's a simple demonstration of what I mean. Note that if you need heavily concurrent writes, you can try to improve the locking scheme below...
public final class CompactString implements CharSequence {
private final char[] _data;
private final int _offset;
private final int _length;
private final int _hashCode;
private static final Object _lock = new Object();
private static char[] _storage;
private static int _nextIndex;
private static final int LENGTH_THRESHOLD = 128;
private CompactString(char[] data, int offset, int length, int hashCode) {
_data = data; _offset = offset; _length = length; _hashCode = hashCode;
}
private static final CompactString EMPTY = new CompactString(new char[0], 0, 0, "".hashCode());
private static void allocateStorage() {
synchronized (_lock) {
_storage = new char[1024];
_nextIndex = 0;
}
}
private static CompactString storeInShared(String value) {
synchronized (_lock) {
if (_nextIndex + value.length() > _storage.length) {
allocateStorage();
}
int start = _nextIndex;
// You would need to change this loop and length to do UTF encoding.
for (int i = 0; i < value.length(); ++i) {
_storage[_nextIndex++] = value.charAt(i);
}
return new CompactString(_storage, start, value.length(), value.hashCode());
}
}
static {
allocateStorage();
}
public static CompactString valueOf(String value) {
// You can implement a soft .intern-like solution here.
if (value == null) {
return null;
} else if (value.length() == 0) {
return EMPTY;
} else if (value.length() > LENGTH_THRESHOLD) {
// You would need to change .toCharArray() and length to do UTF encoding.
return new CompactString(value.toCharArray(), 0, value.length(), value.hashCode());
} else {
return storeInShared(value);
}
}
// left to reader: implement equals etc.
}

Is There a More Efficient Way to Convert Between ArrayList and Array

Using Java, I have a class which retrieves a webpage as a byte array. I then need to strip out some content if it exists. (The application monitors web pages for changes, but needs to remove session Ids from the html which are created by php, and would mean changes were detected each visit to the page).
Some of the resulting byte arrays could be tens of thousands of bytes long. They're not stored like this - a 16 byte MD5 of the page is stored. However, it is the original full size byte array which needs to be processed.
(UPDATE - the code does not work. See comment from A.H. below)
A test showing my code:
public void testSessionIDGetsRemovedFromData() throws IOException
{
byte[] forumContent = "<li class=\"icon-logout\">Logout [ barry ]</li>".getBytes();
byte[] sidPattern = "&sid=".getBytes();
int sidIndex = ArrayCleaner.getPatternIndex(forumContent, sidPattern);
assertEquals(54, sidIndex);
// start of cleaning code
ArrayList<Byte> forumContentList = new ArrayList<Byte>();
forumContentList.addAll(forumContent);
forumContentList.removeAll(Arrays.asList(sidPattern));
byte[] forumContentCleaned = new byte[forumContentList.size()];
for (int i = 0; i < forumContentCleaned.length; i++)
{
forumContentCleaned[i] = (byte)forumContentList.get(i);
}
//end of cleaning code
sidIndex = ArrayCleaner.getPatternIndex(forumContentCleaned, sidPattern);
assertEquals(-1, sidIndex);
}
This all works fine, but I'm worried about the efficiency of the cleaning section. I had hoped to operate solely on arrays, but the ArrayList has nice built-in functions to remove a collection from the ArrayList, etc., which is just what I need. So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?), and convert the pattern to remove to another ArrayList (I suppose this could be an ArrayList all along) to pass to removeAll(). I then need to create another byte[] and cast each element of the ArrayList of Bytes to a byte and add it to the byte[].
Is there a more efficient way of doing all this?
Can it be performed using arrays?
UPDATE
This is the same functionality using strings:
public void testSessionIDGetsRemovedFromDataUsingStrings() throws IOException
{
String forumContent = "<li class=\"icon-logout\">Logout [ barry ]</li>";
String sidPattern = "&sid=";
int sidIndex = forumContent.indexOf(sidPattern);
assertEquals(54, sidIndex);
forumContent = forumContent.replaceAll(sidPattern, "");
sidIndex = forumContent.indexOf(sidPattern);
assertEquals(-1, sidIndex);
}
Is this as efficient as the array/arrayList method?
Thanks,
Barry
You can use List#toArray() to convert any list to an array.
Things are a bit more complicated in this specific use case because there is no elegant way to auto-unbox (from Byte to byte) when converting the list. Good ol' Java generics. Which is a nice segue into...
So I have had to create an ArrayList of Byte, as I can't have an ArrayList of the primitive byte (can anyone tell me why?)
Because, in Java, generic type parameters cannot be primitives. See Why can Java Collections not directly store Primitives types?
Side note: as a matter of style, you should almost always declare ArrayList types as List:
List<Byte> forumContentList = new ArrayList<Byte>();
See Java - declaring from Interface type instead of Class and Type List vs type ArrayList in Java.
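The Byte-to-byte conversion mentioned above has to be written out by hand, because List#toArray() can only produce a Byte[], not a byte[]. A minimal sketch of both directions; ByteLists and its methods are hypothetical helper names for this illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class ByteLists {
    private ByteLists() {
        // static helpers only
    }

    // Copies a List<Byte> into a primitive array; each element is auto-unboxed.
    // A null element would cause a NullPointerException here.
    static byte[] toPrimitive(List<Byte> list) {
        byte[] out = new byte[list.size()];
        int i = 0;
        for (Byte b : list) {
            out[i++] = b;
        }
        return out;
    }

    // Copies a primitive array into a List<Byte>; each element is auto-boxed.
    static List<Byte> toList(byte[] array) {
        List<Byte> out = new ArrayList<>(array.length);
        for (byte b : array) {
            out.add(b);
        }
        return out;
    }
}
```

Every element gets boxed on the way in, which is part of why a List<Byte> is a heavy representation for page content.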
This all works fine, I'm worried about the efficiency of the cleaning section...
Really? Did you inspect the resulting "string"? On my machine the data in forumContentCleaned still contains the &sid=... data.
That's because
forumContentList.removeAll(Arrays.asList(sidPattern));
tries to remove a List<byte[]> from a List<Byte>. This will do nothing. And even if you replace the argument of removeAll with a real List<Byte> containing the bytes of "&sid=", then you will remove ALL occurrences of each a, each m, each p and so forth. The resulting data will look like this:
<l cl"con-logout">< href"./uc.h?oelogout34043284674572e35881e022c68fc8" ttle....
Well, strictly speaking, the &sid= part is gone, but I'm quite sure this is not what you wanted.
Therefore take a step back and think: you are doing string manipulation here, so use a StringBuilder, feed it with new String(forumContent) and do your manipulation there.
Edit
Looking at the given example input string, I guess that the value of sid should also be removed, not only the key. This code should do it efficiently without regular expressions:
String removeSecrets(String input){
StringBuilder sb = new StringBuilder(input);
String sidStart = "&sid=";
String sidEnd = "\"";
int posStart = 0;
while ((posStart = sb.indexOf(sidStart, posStart)) >= 0) {
int posEnd = sb.indexOf(sidEnd, posStart);
if (posEnd < 0) // delete as far as possible - YMMV
posEnd = sb.length();
sb.delete(posStart, posEnd);
}
return sb.toString();
}
Edit 2
Here is a small benchmark between StringBuilder and String.replaceAll:
public class ReplaceAllBenchmark {
public static void main(String[] args) throws Throwable {
final int N = 1000000;
String input = "<li class=\"icon-logout\">Logout [ barry ]&sid=3a4043284674572e35881e022c68fcd8\"</li>";
stringBuilderBench(input, N);
regularExpressionBench(input, N);
}
static void stringBuilderBench(String input, final int N) throws Throwable{
for(int run=0; run<5; ++run){
long t1 = System.nanoTime();
for(int i=0; i<N; ++i)
removeSecrets(input);
long t2 = System.nanoTime();
System.out.println("sb: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
Thread.sleep(1000);
}
}
static void regularExpressionBench(String input, final int N) throws Throwable{
for(int run=0; run<5; ++run){
long t1 = System.nanoTime();
for(int i=0; i<N; ++i)
removeSecrets2(input);
long t2 = System.nanoTime();
System.out.println("regexp: "+(t2-t1)+"ns, "+(t2-t1)/N+"ns/call");
Thread.sleep(1000);
}
}
static String removeSecrets2(String input){
return input.replaceAll("&sid=[^\"]*\"", "\"");
}
}
Results:
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.9) (6b20-1.9.9-0ubuntu1~10.04.2)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
sb: 538735438ns, 538ns/call
sb: 457107726ns, 457ns/call
sb: 443282145ns, 443ns/call
sb: 453978805ns, 453ns/call
sb: 458895308ns, 458ns/call
regexp: 2404818405ns, 2404ns/call
regexp: 2196834572ns, 2196ns/call
regexp: 2239056178ns, 2239ns/call
regexp: 2164337638ns, 2164ns/call
regexp: 2177091893ns, 2177ns/call
I don't think the two pieces of code do the same thing.
The first removes every character that occurs in sidPattern from forumContent.
The second removes the sidPattern string from forumContent, though it may not behave as intended, because replaceAll() treats its argument as a regular expression pattern.
Are you sure you want to remove "&sid=" rather than "&sid=3a4043284674572e35881e022c68fcd8"?
Anyway, I think String is fine; List is a little heavy.
