Java NIO: Writing File Header - Using SeekableByteChannel - java

I am manually serializing data objects to a file, using a ByteBuffer and its operations such as putInt(), putDouble(), etc.
One of the fields I'd like to write out is a String. For the sake of example, let's say this contains a currency. Each currency has a three-letter ISO currency code, e.g. GBP for British Pounds Sterling.
Assuming each object I'm serializing just has a double and a currency, you could consider the serialized data to look something like:
100.00|GBP
200.00|USD
300.00|EUR
Obviously, in reality I'm not delimiting the data (neither the pipe between fields nor the line feeds); it's stored in binary. I'm just using the above as an illustration.
Encoding the currency with each entry is a bit inefficient, as I keep storing the same three characters. Instead, I'd like to have a header which stores a mapping for currencies. The file would look something like:
100
GBP
USD
EUR
~~~
~~~
100.00|1
200.00|2
300.00|3
The first 2 bytes in the file are a short, filled with the decimal value 100. This informs me that there are 100 slots for currencies in the file. Following this are 3-byte chunks which are the currencies in order (ASCII-only characters).
When I read the file back in, all I have to do is build up a 100-element array with the currency codes, and I can cheaply / efficiently look up the relevant currency for each line.
Reading the file back in seems simple. But I'm interested to hear thoughts on writing out the data.
I don't know all the currencies up front, and I'm actually supporting any three-character code, even if it's invalid. Thus I have to build up the table converting currencies to indexes on the fly.
I am intending to use a SeekableByteChannel to address my file, seeking back to the header every time I find a new currency I've not indexed before.
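For what it's worth, a minimal sketch of what that seek-back might look like (the header layout matches the description above; the helper name and slot arithmetic are my own assumptions):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: write a newly seen currency code into its header slot,
// then restore the channel position so record writing can continue where it left off.
// Assumed header layout: a 2-byte short (slot count) followed by 3-byte ASCII slots.
static void addCurrencyToHeader(SeekableByteChannel channel, String code, int slot)
        throws IOException {
    long dataPosition = channel.position();        // remember where records were being written
    channel.position(2 + (long) slot * 3);         // skip the 2-byte count, then 3 bytes per slot
    channel.write(ByteBuffer.wrap(code.getBytes(StandardCharsets.US_ASCII)));
    channel.position(dataPosition);                // seek back and resume appending records
}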
This has the obvious I/O overhead of moving around the file. But I am expecting to see all the different currencies within the first few data objects written, so it'll probably only seek for the first few seconds of execution and then not have to perform an additional seek for hours.
The alternative is to wait for the stream of data to finish and then write the header once. However, if my application crashes before I've written the header, the data in the file cannot be recovered back to its original content.
Seeking seems like the right thing to do, but I've not attempted it before, and I was hoping to hear the horror stories up front rather than through trial and error on my end.

The problem with your approach is that you say you do not want to limit the number of currency codes, which implies that you don't know how much space to reserve for the header. Seeking in a plain local file might be cheap if not performed too often, but shifting the entire file contents to reserve more room for the header is expensive.
The other question is how you define efficiency. If you don't limit the number of currency codes, you have to be aware that a single byte may not be sufficient for your index, so you need either a dynamic, possibly multi-byte encoding, which is more complicated to parse, or a fixed multi-byte encoding, which ends up taking the same number of bytes as the currency code itself.
So if space efficiency for the typical case is not more important to you than decoding efficiency, you can use the fact that these codes are all made up of ASCII characters only. You can encode each currency code in three bytes, and if you accept one padding byte you can use a single putInt/getInt for storing/retrieving a currency code without the need for any header lookup.
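For illustration, a rough sketch of that packing (my own example code, assuming US-ASCII codes and a zero padding byte):

// Pack a three-letter ASCII currency code into one int; the low byte is left as zero padding.
static int packCurrency(String code) {
    return (code.charAt(0) << 24) | (code.charAt(1) << 16) | (code.charAt(2) << 8);
}

// Unpack it again without any header lookup.
static String unpackCurrency(int packed) {
    return "" + (char) ((packed >>> 24) & 0xFF)
              + (char) ((packed >>> 16) & 0xFF)
              + (char) ((packed >>> 8) & 0xFF);
}

// Usage with a ByteBuffer: buffer.putInt(packCurrency("GBP")) when writing,
// unpackCurrency(buffer.getInt()) when reading back.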
I don't believe that optimizing these codes further would improve your storage significantly. The file does not consist of currency codes only; it's very likely the other data will take much more space.

Related

From InputStream to List<String>: why is Java allocating space twice in the JVM?

I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
    List<String> data = reader.lines()
                              .collect(Collectors.toList());
}
The problem is, the file itself is less than 2GB, but when I look at the memory, the JVM is allocating twice the size of the file:
Also, here are the heaviest objects in memory:
So what I understand is that Java is allocating twice the memory needed for the operation: one allocation to put the content of the file in a byte array, and another one to instantiate the String list.
My question is: can we optimize that? Avoid having twice the memory size needed?
tl;dr String objects can take 2 bytes per character.
The long answer: conceptually, a String is a sequence of char. Each char will represent one code point (or half of one, but we can ignore that detail for now).
Each code point tends to represent a character (sometimes multiple code points make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there are a good number of caveats to this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String, if and only if the characters used allow it (effectively, if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data actually represents some structure, and if you parse it into a dedicated format, you might get away with far less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead, process and write away the data as you read it, thus never holding more than a handful of records in memory at once (see the sketch after this list).
Build your own string-like data type for your specific use case. While building a full String replacement is a massive undertaking, if you know what subset of features you need, it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
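As a rough illustration of the "don't keep the whole thing in memory" option above, a sketch; the method name and per-line handling are placeholders of my own:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Stream the lines instead of collecting them into a List<String>,
// so only a handful of lines are in memory at any moment.
static void processStream(InputStream zipInputStream) throws IOException {
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(zipInputStream, StandardCharsets.UTF_8))) {
        reader.lines().forEach(line -> {
            // handle(line);  // placeholder: parse, aggregate, or write out each record here
        });
    }
}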

Does BSON require more space than JSON when sending decimal data? [duplicate]

I am comparing JSON and BSON for serializing objects. These objects contain several arrays of a large number of integers. In my test, the object I am serializing contains about 12,000 integers in total. I am only interested in how the sizes of the serialized results compare. I am using JSON.NET as the library which does the serialization. I am using JSON because I also want to be able to work with it in Javascript.
The size of the JSON string is about 43kb and the size of the BSON result is 161kb, a difference factor of about 4. This is not what I expected; I looked at BSON because I thought it would be more efficient at storing data.
So my question is: why is BSON not efficient here, and can it be made more efficient? Or is there another way of serializing data with arrays containing large numbers of integers which can be easily handled in Javascript?
Below you find the code to test the JSON/BSON serialization.
// Read the file which contains the JSON string
string _jsonString = ReadFile();
object _object = Newtonsoft.Json.JsonConvert.DeserializeObject(_jsonString);

FileStream _fs = File.OpenWrite("BsonFileName");
using (Newtonsoft.Json.Bson.BsonWriter _bsonWriter =
           new BsonWriter(_fs) { CloseOutput = false })
{
    Newtonsoft.Json.JsonSerializer _jsonSerializer = new JsonSerializer();
    _jsonSerializer.Serialize(_bsonWriter, _object);
    _bsonWriter.Flush();
}
Edit:
Here are the resulting files
https://skydrive.live.com/redir?resid=9A6F31F60861DD2C!362&authkey=!AKU-ZZp8C_0gcR0
The efficiency of JSON vs BSON depends on the size of the integers you're storing. There's an interesting point where ASCII takes fewer bytes than actually storing integer types. 64-bit integers, which appears to be how your BSON document stores them, take up 8 bytes each. Your numbers are all less than 10,000, which means you could store each one in ASCII in at most 4 bytes (one byte for each character up through 9999). In fact, most of your data looks like it's less than 1,000, meaning it can be stored in 3 or fewer bytes. Of course, that deserialization takes time and isn't cheap, but it saves space. Furthermore, Javascript uses 64-bit values to represent all numbers, so if each integer is written to BSON as a 64-bit value instead of a more appropriate data format, your BSON file can end up much larger.
According to the spec, BSON contains a lot of metadata that JSON doesn't. This metadata is mostly length prefixes so that you can skip through data you aren't interested in. For example, take the following data:
["hello there, this is an necessarily long string. It's especially long, but you don't care about it. You're just trying to get to the next element. But I keep going on and on.",
"oh man. here's another string you still don't care about. You really just want the third element in the array. How long are the first two elements? JSON won't tell you",
"data_you_care_about"]
Now, if you're using JSON, you have to parse the entirety of the first two strings to find out where the third one is. If you use BSON, you'll get markup more like (but not actually, because I'm making this markup up for the sake of example):
[175 "hello there, this is an necessarily long string. It's especially long, but you don't care about it. You're just trying to get to the next element. But I keep going on and on.",
169 "oh man. here's another string you still don't care about. You really just want the third element in the array. How long are the first two elements? JSON won't tell you",
19 "data_you_care_about"]
So now, you can read '175', know to skip forward 175 bytes, then read '169', skip forward 169 bytes, and then read '19' and copy the next 19 bytes to your string. That way you don't even have to parse the strings for delimiters.
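A rough Java sketch of reading such length-prefixed fields (this illustrates the general idea only, not the actual BSON wire format):

import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Skip the first two length-prefixed entries and read only the third.
static String readThirdEntry(DataInputStream in) throws IOException {
    for (int i = 0; i < 2; i++) {
        int length = in.readInt();  // read the prefix...
        in.skipBytes(length);       // ...and jump straight past the payload, no parsing needed
    }
    byte[] wanted = new byte[in.readInt()];
    in.readFully(wanted);
    return new String(wanted, StandardCharsets.UTF_8);
}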
Using one versus the other is very dependent on what your needs are. If you're going to be storing enormous documents that you've got all the time in the world to parse, but your disk space is limited, use JSON because it's more compact and space efficient.
If you're going to be storing documents, but reducing wait time (perhaps in a server context) is more important to you than saving some disk space, use BSON.
Another thing to consider in your choice is human readability. If you need to debug a crash report that contains BSON, you'll probably need a utility to decipher it, whereas you can just read JSON.

Performance of HashMap

I have to process 450 unique strings about 500 million times. Each string has a unique integer identifier. There are two options for me to use.
1. I can append the identifier to the string, and on arrival of the string I can split it to get the identifier and use it.
2. I can store the 450 strings in a HashMap<String, Integer>, and on arrival of the string, I can query the HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?
It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters rather than the entire string; that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or first two characters, if they're unique, as a "perfect index" into a 256- or 65K-entry array that points to the identifier.
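A minimal sketch of that "perfect index" idea, assuming the first character of each string is unique and fits in a single byte (all names here are my own):

// Build the table once: the first character of each string maps straight to its identifier.
static int[] buildIndex(String[] strings, int[] identifiers) {
    int[] table = new int[256];              // one slot per possible byte value
    for (int i = 0; i < strings.length; i++) {
        table[strings[i].charAt(0)] = identifiers[i];
    }
    return table;
}

// Lookup on arrival is a single array access; no hashing, no splitting.
static int identify(int[] table, String incoming) {
    return table[incoming.charAt(0)];
}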
Also, if your identifier is numeric, it's better to pre-calculate it rather than convert it on the fly all the time. Text -> binary is actually rather expensive (binary -> text is worse). So it's probably best to avoid that if possible.
But it behooves you to work the problem: 1 million of anything at 1ms each is roughly 17 minutes of processing. At 500 million, every microsecond wasted adds up to 8+ minutes of extra processing. You may well not care, but this just demonstrates that at these scales "every little bit helps".
So don't take our word for it; test different things to find what gives you the best result for your work set, and then go with that. Also consider excessive object creation, and avoid it. Normally I don't give it a second thought; object creation is fast, but a nanosecond is a nanosecond.
If you're working in Java, and you don't REALLY need Unicode (i.e. you're working with single characters in the 0-255 range), I wouldn't use strings at all. I'd work with raw bytes. Strings are based on Java characters, which are UTF-16. Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few microseconds each, and 8 microseconds per string adds over an hour to your processing.
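As a sketch of that raw-byte approach (fixed-size records and the helper name are my own assumptions; readNBytes requires Java 9+):

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Read a fixed-size record as raw bytes and compare it directly against a known byte
// pattern, skipping the byte -> char (UTF-8 -> UTF-16) conversion entirely.
static boolean nextRecordMatches(InputStream in, byte[] expected) throws IOException {
    byte[] record = new byte[expected.length];
    int read = in.readNBytes(record, 0, record.length);
    return read == expected.length && Arrays.equals(record, expected);
}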
So, again, look in all the corners.
Or don't: write it the easy way, fire it up, run it over the weekend, and be done with it.
If each String has a unique identifier, then retrieval is O(1) only in the case of a HashMap.
I wouldn't suggest the first method, because you would be splitting strings for 450*500m arrivals, unless your order is one string repeated 500m times and then on to the next. As Will said, appending a numeric identifier to the strings and then retrieving it might seem straightforward, but it is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.
Use a HashMap<String, Integer>. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.
I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
Splitting the string should work faster if you write your code well enough. In fact, if you already have the int id, I see no reason to send only the string and maintain a mapping.
Putting it into a HashMap would require hashing the incoming string every time. So you are basically comparing the performance of the hashing function vs the code you write to append (prepending might be a bit more tricky) on the sending end and to parse on the receiving end.
OTOH, only 450 strings isn't a big deal, and if you're into it, writing your own hashing algorithm/function would actually be the most elegant and performant option.

Map a set of strings with similarities to shorter strings

I have a set of strings, each of the same length (10 chars), with the following properties.
The size of the set is around 5,000 - 10,000 strings. The data set can change frequently.
Although each string is unique, a substring of a particular pattern appears in most of these strings, though not necessarily at the same position.
Some examples are
123abc7gh0
t123abcmla
wp12123abc
123abc being the substring which appears in most of the strings
The problem is to map each string to a shorter string, and such mapping should be deterministic.
I could use a simple enumeration algorithm which maps each string encountered to an incremented counter value (on the set of sorted strings). But since the set is bound to change frequently, I cannot use this algorithm to compute the map in a deterministic way across runs.
I could also use a data compression algorithm like Huffman encoding to compress each individual string. But I do not believe that would be effective, as each string in itself has very few duplicate characters.
What approach should I adopt to solve this problem, taking advantage of the properties of the data set? Note that I do not want to compress the whole data set, but would like to map each string in the set to a shortened string.
1. Replace the 'common string' with a character not appearing elsewhere in any string.
2. Do a probabilistic analysis of all strings.
3. Create a Huffman tree based on the analysis, i.e. the most frequent characters are at the top of the tree, resulting in short codes.
4. Replace the sample strings with their Huffman encoding based on the tree from #3 and compare the resulting size with the original. If most of the characters are spread uniformly, even between the strings, then the Huffman coding will not reduce but increase the size.
If Huffman does not yield any improvement, you might try LZW or any other dictionary-based compression method. However, this only works if the structure of the strings (i.e. the distribution of characters/substrings) does not completely change over time. For example, if the strings consisted of English words, substring dictionary compression (LZW) might be a good candidate.
But if the distribution changes, or the character distribution is essentially uniform across all characters, I am afraid there is no compression method capable of reducing the string size.
But the last question remains: What for? Why bother compressing 10000 strings?
Edit: The answer is: The strings are used to create folder names (paths). As there is a restriction on the total length, it should be as compact as possible.
You might try to create a database (i.e. a dictionary) and use the index (coded e.g. as Base64) as the compressed string. This gives you a maximum of 6 chars when assuming a maximum dictionary size of 2^32 - 1.
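A quick sketch of encoding a dictionary index that way (using the URL-safe Base64 alphabet, my own choice, so the result stays filesystem-friendly for folder names):

import java.nio.ByteBuffer;
import java.util.Base64;

// Encode a 32-bit dictionary index as a short, filesystem-safe string.
static String encodeIndex(int index) {
    byte[] bytes = ByteBuffer.allocate(4).putInt(index).array();
    return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
}

// Example: encodeIndex(10000) yields the 6-character string "AAAnEA".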
If you can pre-process the set of strings and you know the pattern which occurs in each of the strings, you could treat that pattern as a single character (using some encoding), which would shorten each string.
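For example, a small sketch of that substitution (the pattern and the sentinel character are placeholders of my own):

// Replace the known common pattern with a single sentinel character before any further
// encoding, and reverse the substitution when decoding.
static String shorten(String s) {
    return s.replace("123abc", "\u0001");   // the 10-char example strings shrink to 5 chars
}

static String restore(String s) {
    return s.replace("\u0001", "123abc");
}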
I'm confronted with the same kind of task and wonder whether it is possible to achieve the mapping without making use of persistence.
If persisting the mappings in (previous) use is allowed, then the solution is simple:
You can just assign a number to each of the strings (using a representation in a sufficiently high base so that the numbers' string representations stay within the required maximum length). For each new source string you would assign the next number, using the persisted mappings to make sure the same number is never used twice.
This policy gives you consistent results even if you go through the procedure multiple times with a changing set of data: a string occurring for the first time receives its private number, and this private number stays reserved to it forever; numbers that are no longer in use are never reused.
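A minimal sketch of that assignment policy, assuming the map itself is loaded from and saved to persistent storage between runs (persistence details omitted):

import java.util.HashMap;
import java.util.Map;

// Each new string gets the next free number; strings seen before keep their number forever.
static final Map<String, Integer> assigned = new HashMap<>();  // restored from storage at startup

static int idFor(String s) {
    // size() only grows because entries are never removed, so issued numbers are never reused
    return assigned.computeIfAbsent(s, k -> assigned.size());
}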
The more challenging question is: is it possible to guarantee uniqueness without the aid of a persisted mapping? I'm afraid it is not, since size reduction is always prone to lead to collisions.

Is it possible to search a file with compressed objects in Java?

I read the following bit from Oracle:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea: say I have a char array somewhere around 4M long, is it possible for me to compress it to several hundred bytes using GZIPOutputStream, then map the whole file into memory and do a random search on it by comparing bits? Say I am looking for the char sequence "abcd": could I somehow get the bit sequence of the compressed version of "abcd" and then just search the file for it? Thanks.
You cannot use GZIP or similar to do this, as the encoding of each byte changes as the stream is processed; i.e. the only way to determine what a byte means is to read all the bytes before it.
If you want to access the data randomly, you can break the String into smaller sections. That way you only need to decompress a relatively short section of data.
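A rough sketch of that idea: compress fixed-size sections independently so a lookup only has to inflate the sections it touches (the section size and storage format are my own assumptions):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compress the text in independent sections so each one can be decompressed on its own.
static List<byte[]> compressInSections(String text, int sectionLength) throws IOException {
    List<byte[]> sections = new ArrayList<>();
    for (int start = 0; start < text.length(); start += sectionLength) {
        String section = text.substring(start, Math.min(start + sectionLength, text.length()));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(section.getBytes(StandardCharsets.UTF_8));
        }
        sections.add(bos.toByteArray());
    }
    return sections;
}

// Decompress only the single section a search or lookup needs.
static String readSection(byte[] compressed) throws IOException {
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
        return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
    }
}

Note that a match spanning a section boundary would still be missed, so sections would need to overlap by at least the pattern length, or boundaries would need separate handling.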
