Java binary file operations

I have a bunch of different objects (and object types) that I want to write to a binary file. First of all, I need the file to be structured like this:
Object type1
obj1, obj2 ...
Object type2
obj1, obj2 ...
...
Being a binary file, this doesn't help a user read it, but I want to have a structure so I can search for, delete, or add an object by its type without parsing the entire file. And this is something I don't know how to do. Is this even possible?

You will have to maintain a header at the beginning of the file (or somewhere else) to mark the position and length of each of your objects.
The kind and layout of the header depend a lot on how you plan to read and write the file. For example, if you plan to retrieve the objects by name, you could have something like this in your file:
object1 500 1050
object2 1550 800
object3 2350 2000
<some padding to cover 500 bytes>
<the 1050 bytes of object1><the 800 bytes of object2><the 2000 bytes of object3>
From this you know that object1 starts at offset 500 in the file and has a length of 1050 bytes.
Since it seems that you have different types of objects that you want to store, you will probably need to add some additional data to your header.
Take care of the following:
Each time you add, delete, or modify an object, you will have to update the header offsets for all objects that follow (for example, if I remove object2, the offset for object3 becomes 1550).
If you store the header in the same file as the data, then you must take the size of the header into account when computing offsets (this will make things much harder; I suggest you keep the header and the binary data separate).
You will have to read and parse the header each time you want to access an object. Consider using a standardized format for your header to avoid problems (YAML or XML).
I'm not aware of any library that will help you implement such a feature but I'm pretty sure there are some. Maybe someone will be able to suggest one.
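For illustration, here is a minimal sketch of such an index kept in memory while appending objects to a data file. The class name and record layout are hypothetical, and persisting the index itself (e.g. as XML) is left out:

import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: an in-memory index mapping object names to { offset, length }
// in a separate data file. New objects are appended at the end.
public class IndexedStore {
    private final Map<String, long[]> index = new LinkedHashMap<>();

    public void add(File dataFile, String name, byte[] bytes) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(dataFile, "rw")) {
            long offset = raf.length();      // append at the end of the file
            raf.seek(offset);
            raf.write(bytes);
            index.put(name, new long[] { offset, bytes.length });
        }
    }

    public byte[] read(File dataFile, String name) throws IOException {
        long[] entry = index.get(name);      // { offset, length }
        byte[] bytes = new byte[(int) entry[1]];
        try (RandomAccessFile raf = new RandomAccessFile(dataFile, "r")) {
            raf.seek(entry[0]);
            raf.readFully(bytes);
        }
        return bytes;
    }
}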
--
Another solution would be to use something like a ZipFile (which is natively supported by Java) and write each of your objects as a different ZipEntry. This way you won't have to manage object separation yourself, and will only need to worry about knowing the exact ZipEntry you want.
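A rough sketch of that approach with the standard java.util.zip classes (the entry name and payload here are made up, and InputStream.readAllBytes requires Java 9+):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class ZipStore {
    public static void main(String[] args) throws IOException {
        byte[] serializedObj1 = "example payload".getBytes(StandardCharsets.UTF_8);

        // Write each serialized object as its own entry.
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("objects.zip"))) {
            zos.putNextEntry(new ZipEntry("type1/obj1"));
            zos.write(serializedObj1);
            zos.closeEntry();
        }

        // Read one entry back without touching the others.
        try (ZipFile zip = new ZipFile("objects.zip")) {
            ZipEntry entry = zip.getEntry("type1/obj1");
            byte[] bytes = zip.getInputStream(entry).readAllBytes();
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
    }
}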

Related

How can I update a serialized HashMap contained in a file?

I have a file that contains a serialized HashMap containing an element of type MyObject:
�� sr java.util.HashMap���`� F
loadFactorI thresholdxp?# w  t (a54d88e06612d820bc3be72877c74f257b561b19sr com.myproject.MyObject C�m�I�/ I partitionL hashcodet Ljava/lang/String;L idt Ljava/lang/Long;L offsetq ~ L timestampq ~ L topicq ~ xp q ~ ppppx
Now, I also have some other MyObject objects that I would like to add to that map. However, I don't want to first read and deserialize the map back into memory, then update it, and then write the whole updated map back to file. How would one update the serialization in the file in a more efficient way?
How would one update the serialization in the file in a more efficient way?
Basically by reverse engineering the binary protocol that Java uses when serializing objects into their binary representation. That would enable you to understand which elements in that binary blob would need to be updated in which way.
Other people have already done that, see here for example.
Anything else is just work. You sitting down and writing code.
Or you write the few lines of code that read in the existing file, and write out a new file with that map plus the other objects you need in there.
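Those few lines might look like this (a minimal sketch; MyObject stands in for the question's com.myproject.MyObject, which must implement Serializable, and the file name is a placeholder):

import java.io.*;
import java.util.HashMap;

public class MapUpdater {
    // Read the serialized map, add the new entry, write the map back out.
    @SuppressWarnings("unchecked")
    static void addToSerializedMap(File file, String key, MyObject value)
            throws IOException, ClassNotFoundException {
        HashMap<String, MyObject> map;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            map = (HashMap<String, MyObject>) in.readObject();
        }
        map.put(key, value);
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(map);
        }
    }
}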
You see, efficiency depends on the point of view:
do you think the single update of a file with binary serialized objects is so time-critical that it needs to be done by manually "patching" that binary file?
do you think it is more efficient to spend hours and hours learning the underlying binary format in order to correctly update its content?
The only valid reason (that I can think of) to do that: to learn exactly such things: binary data formats, and how to patch content. But even then there might be "better" assignments that give you more insight (of real value in the real world) than ... spending your time re-implementing Java binary serialization.

How to use ANTLR4 with binary data?

From the homepage:
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for [...] or binary files.
I have read through the docs now for some hours and think that I have some basic understanding of ANTLR, but I have a hard time finding any references to processing binary files. And it seems I'm not the only one.
I need to create a parser for some binary data and would like to decide if ANTLR is of any help or not.
Binary data structure
That binary data is structured in logical fields like field1, which is followed by field2, which is followed by field3, etc., and all those fields have a special purpose. The lengths of those fields may differ AND may not be known at the time the parser is generated: e.g. I do know that field1 is always 4 bytes and field2 might simply be 1 byte, but field3 might be 1 to 10 bytes and might be followed by additional field3s of n bytes, depending on the actual value of the data. That is the second problem: I know the fields are there, and e.g. with field1 I know it's 4 bytes, but I don't know the actual value, and that is what I'm interested in. The same goes for the other fields; I need the values from all of them.
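For concreteness, a baseline hand-rolled reader for the structure just described might look like this in plain Java (the field names and especially the field3 length rule are assumptions, since the real rules come from the standard; this is the logic ANTLR would have to replicate):

import java.nio.ByteBuffer;

public class DatagramReader {
    public static void parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int field1 = buf.getInt();   // field1: always 4 bytes
        byte field2 = buf.get();     // field2: 1 byte
        System.out.println("field1=" + field1 + ", field2=" + field2);

        // field3: the first byte determines how many further bytes belong to it,
        // and field3 blocks may repeat until the datagram is exhausted.
        while (buf.hasRemaining()) {
            int header = buf.get() & 0xFF;
            int extra = Math.min(header & 0x0F, buf.remaining()); // assumed length rule
            byte[] field3 = new byte[extra];
            buf.get(field3);
            System.out.printf("field3: header=%02x, %d extra bytes%n", header, extra);
        }
    }
}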
What I need in ANTLR
This sounds like a common structure and use case for some arbitrary binary data to me, but I don't see any special handling of such data in ANTLR. All the examples use some kind of text, and I don't see any value-extraction callbacks or such. Additionally, I think I would need some callbacks influencing the parsing process itself, so that e.g. one callback is called on the first byte of field3; I check that, decide that one to N additional bytes need to be consumed and are logically part of field3, and tell the parser so, so it's able to proceed "somehow".
In the end, I would get some higher level "field" objects and ANTLR would provide the underlying parse logic with callbacks and listener infrastructure, walking abilities etc.
Did anyone ever do something like that and can provide some hints to examples or the concrete documentation I seem to have missed? Thanks!
EN 13757-3:2012
I don't think it makes my question much easier to understand, but the binary data I'm referring to is defined in the standard EN 13757-3:2012:
Communication systems for meters and remote reading of meters - Part 3: Dedicated application layer
The standard is not freely available on the net (anymore?), but the following PDF might give you an overview of what example data looks like; see page 4. Note especially that the bytes of the mentioned fields are not constant; only the overall structure of the datagram is defined.
http://fastforward.ag/downloads/docu/FAST_EnergyCam-Protocol-wirelessMBUS.pdf
The tokens for the grammar would be the fields, each made up of a varying number of bytes but carrying a value, etc. Given ANTLR's self-description, I would expect such things to work somehow...
Alternative: Kaitai.io
Whoever is in a comparable position to mine, have a look at Kaitai.io, which looks very promising:
https://stackoverflow.com/a/40527106/2055163

Java NIO: Writing File Header - Using SeekableByteChannel

I am manually serializing data objects to a file, using a ByteBuffer and its operations such as putInt(), putDouble(), etc.
One of the fields I'd like to write out is a String. For the sake of example, let's say this contains a currency. Each currency has a three-letter ISO currency code, e.g. GBP for British Pounds Sterling.
Assuming each object I'm serializing just has a double and a currency, you could consider the serialized data to look something like:
100.00|GBP
200.00|USD
300.00|EUR
Obviously, in reality I'm not delimiting the data (neither the pipe between fields nor the line feeds); it's stored in binary. I'm just using the above as an illustration.
Encoding the currency with each entry is a bit inefficient, as I keep storing the same three characters. Instead, I'd like to have a header which stores a mapping for currencies. The file would look something like:
100
GBP
USD
EUR
~~~
~~~
100.00|1
200.00|2
300.00|3
The first 2 bytes in the file form a short containing the decimal value 100. This tells me there are 100 slots for currencies in the file. Following this, there are 3-byte chunks, which are the currencies in order (ASCII-only characters).
When I read the file back in, all I have to do is build up a 100-element array with the currency codes, and I can cheaply / efficiently look up the relevant currency for each line.
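Reading that header back in might look something like this (a sketch assuming the exact layout above; class and method names are hypothetical):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class CurrencyHeader {
    // Read the 2-byte slot count, then the 3-byte ASCII currency codes.
    public static String[] readHeader(Path file) throws IOException {
        try (SeekableByteChannel ch = Files.newByteChannel(file, StandardOpenOption.READ)) {
            ByteBuffer countBuf = ByteBuffer.allocate(2);
            ch.read(countBuf);
            countBuf.flip();
            int slots = countBuf.getShort();

            ByteBuffer codes = ByteBuffer.allocate(slots * 3);
            ch.read(codes);
            codes.flip();

            String[] table = new String[slots];
            byte[] code = new byte[3];
            for (int i = 0; i < slots; i++) {
                codes.get(code);
                table[i] = new String(code, StandardCharsets.US_ASCII);
            }
            return table;
        }
    }
}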
Reading the file back in seems simple, but I'm interested to hear thoughts on writing out the data.
I don't know all the currencies up front, and I'm actually supporting any three-character code, even if it's invalid. Thus I have to build up the table converting currencies to indexes on the fly.
I intend to use a SeekableByteChannel to address my file, seeking back to the header every time I find a new currency I haven't indexed before.
This has the obvious I/O overhead of moving around the file. But I expect to see all the different currencies within the first few data objects written, so it'll probably only seek during the first few seconds of execution, and then not need another seek for hours.
The alternative is to wait for the stream of data to finish, and then write the header once. However, if my application crashes before I've written out the header, the data in the file cannot be recovered.
Seeking seems like the right thing to do, but I've not attempted it before, and was hoping to hear the horror stories up front rather than through trial and error on my end.
The problem with your approach is that you say you do not want to limit the number of currency codes, which implies that you don't know how much space you have to reserve for the header. Seeking in a plain local file might be cheap if not performed too often, but shifting the entire file contents to make more room for the header is expensive.
The other question is how you define efficiency. If you don't limit the number of currency codes, you have to be aware that a single byte may not be sufficient for your index, so you need either a dynamic, possibly multi-byte encoding, which is more complicated to parse, or a fixed multi-byte encoding, which ends up taking the same number of bytes as the currency code itself.
So if space-efficiency for the typical case is not more important to you than decoding efficiency, you can use the fact that these codes are all made up of ASCII characters only. You can encode each currency code in three bytes, and if you accept one padding byte, you can use a single putInt/getInt for storing/retrieving a currency code without the need for any header lookup.
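A minimal sketch of that packing (assuming three ASCII letters per code; the int's high byte is the padding):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CurrencyAsInt {
    // Pack e.g. "GBP" into one int: one padding byte + three ASCII bytes.
    static int encode(String code) {
        byte[] b = code.getBytes(StandardCharsets.US_ASCII);
        return (b[0] << 16) | (b[1] << 8) | b[2];
    }

    static String decode(int packed) {
        byte[] b = { (byte) (packed >> 16), (byte) (packed >> 8), (byte) packed };
        return new String(b, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.putDouble(100.00).putInt(encode("GBP")); // one record: amount + currency
        buf.flip();
        System.out.println(buf.getDouble() + " " + decode(buf.getInt()));
    }
}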
I don't believe that optimizing these codes further would improve your storage significantly. The file does not consist of currency codes only; it's very likely the other data will take much more space.

Optimized way of doing String.endsWith()'s work

I need to inspect all web requests received by the application server to check whether the URL has an extension like .css, .gif, etc.
I looked at how Tomcat listens for every request and picks the right configured Servlet to serve it: CharChunk, MessageBytes, Mapper.
Here is my idea for implementing it:
Load all the extensions we want to compare against and get their byte representations. // e.g. "css".getBytes()
Get a unique value for each extension by summing up the bytes in the byte array.
Add the resulting value to a sorted list.
Whenever we receive a request, get the byte representation of the URL. // e.g. "flipkart.com/eshopping/images/theme.css".getBytes()
Start summing the bytes from the byte array's last index and stop when we encounter the "." dot byte value.
Search for the summed value in the sorted list. // use binary search here
Kindly give your feedback about the implementation and any issues.
-With thanks, Krishna
This sounds way more complicated than it needs to be.
Use String.lastIndexOf to find the last dot in the URL
Use String.substring to get the extension based on that
Have a HashSet<String> for the set of supported extensions, or a HashMap<String, Whatever> if you want to map the extension to something else (see the sketch below)
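A minimal sketch of that approach (the extension set here is hypothetical, and Set.of requires Java 9+):

import java.util.Set;

public class ExtensionMatcher {
    // Hypothetical set of static-resource extensions to match against.
    private static final Set<String> STATIC_EXTENSIONS = Set.of("css", "gif", "js", "png");

    static boolean isStaticResource(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) {
            return false; // no extension present
        }
        // A dot earlier in the host/path simply fails the set lookup.
        return STATIC_EXTENSIONS.contains(url.substring(dot + 1).toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isStaticResource("flipkart.com/eshopping/images/theme.css")); // true
        System.out.println(isStaticResource("flipkart.com/eshopping/cart"));             // false
    }
}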
I would be absolutely shocked to discover that this simple approach turned out to be a performance bottleneck - and indeed I suspect it would be more efficient than the approach you suggested, given that it doesn't require the entire URL to be converted into a byte array... (It's not clear why your approach uses byte arrays anyway instead of forming the hash from char values.)
Fundamentally, my preferred approach to performance is:
Do up-front design and testing around things which are hard to change later, architecturally
For everything else:
Determine the performance criteria first so you know when you can stop
Write the simplest code that works
Test it with realistic data
If it doesn't perform well enough, use profilers (etc.) to work out where the bottleneck is, and optimize that, making sure that you can prove the benefits using your existing tests

Developing a (file) exchange format for java

I want to come up with a binary format for passing data between application instances in the form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually; the POJO stores their names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF-8.
Now I need to extend this to support what's described above.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
why invent a new file format?
why try to prevent only stupid users from changing the file?
why use a binary format (hard to compress)?
why use a format that cannot be parsed while being received? (The receiver has to receive the entire file before being able to act on it.)
XML is already a serialization format that is compressible. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather than roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed jar? There are already standard Java libraries and tools to do this, and you get compression and verification for free.
2) What would you use for a delimiter and how would you determine it?
Rather than a delimiter I'd explicitly store the length of each block before the block. It's just as easy, and prevents you having to escape the delimiter if it comes up on its own.
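For example, a sketch of that length-prefixed layout using DataOutputStream (assuming each block is available as a byte[]; the 4-byte length prefix replaces the delimiter, so no escaping is needed):

import java.io.*;

public class BlockIO {
    // Write each block as: 4-byte length, then the bytes themselves.
    static void writeBlock(DataOutputStream out, byte[] block) throws IOException {
        out.writeInt(block.length);
        out.write(block);
    }

    static byte[] readBlock(DataInputStream in) throws IOException {
        byte[] block = new byte[in.readInt()];
        in.readFully(block);
        return block;
    }
}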
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
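One standard-library way is to wrap your output stream in a DigestOutputStream so the MD5 is computed over everything you write (a minimal sketch with placeholder file name and payload, not tied to the exact format above):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.security.*;

public class Md5Writer {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestOutputStream out = new DigestOutputStream(
                new FileOutputStream("data.bin"), md5)) {
            // Everything written through 'out' is fed into the digest.
            out.write("payload bytes".getBytes(StandardCharsets.UTF_8));
        }
        byte[] hash = md5.digest(); // append this to the end of the file
        System.out.printf("MD5 = %032x%n", new java.math.BigInteger(1, hash));
    }
}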
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human-readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions back in the days when every byte mattered, and how these have been extended. For example: the WAV file format.
Let's see, this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually; the POJO stores their names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well, guess what, you pretty much have it already; it's built into the platform: Object Serialization.
If you need to reduce the amount of data sent over the wire and provide a custom serialization (for instance, you can send only 1, 2, 3 for a given object without the attribute names or anything similar, and read them back in the same sequence), you can use this somewhat "hidden feature".
If you really need it in plain text, you can also encode it; it takes almost the same number of bytes.
For instance this bean:
import java.io.*;

public class SimpleBean implements Serializable {
    private String website = "http://stackoverflow.com";

    public String toString() {
        return website;
    }
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
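That string is just the standard serialized form run through Base64. A minimal sketch of producing it (uses java.util.Base64, available since Java 8, and the SimpleBean defined above):

import java.io.*;
import java.util.Base64;

public class Encode {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new SimpleBean()); // the bean defined above
        }
        System.out.println(Base64.getEncoder().encodeToString(bos.toByteArray()));
    }
}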
Additionally, if you need a sound protocol, you can also check out Protobuf, Google's data interchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exist, most are well tested, and it'll likely save you some time.
Possibly not as much fun though.
I agree that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
Google Protocol Buffers
But besides that, many textual formats should work just fine (or perhaps better) too; easier to debug, extensive tool support, and they compress to about the same size as binary (binary compresses poorly, and information theory suggests that for the same effective information the same compressed size is achieved -- and this has been true in my testing).
So perhaps also consider:
JSON works well; binary support via Base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard file formats of this type just use a CRC, as it's faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, the bencode spec doesn't support UTF-8, which is a showstopper for me.
I might come back to this later, but currently XML seems like a better choice (with blobs serialized as a Map).
