I read from ORACLE of the following bit:
Can I execute methods on compressed versions of my objects, for example isempty(zip(serial(x)))?
This is not really viable for arbitrary objects because of the encoding of objects. For a particular object (such as String) you can compare the resulting bit streams. The encoding is stable, in that every time the same object is encoded it is encoded to the same set of bits.
So I got this idea, say if I have a char array of 4M something long, is it possible for me to compress it to several hundreds of bytes using GZIPOutputStream, and then map the whole file into memory, and do random search on it by comparing bits? Say if I am looking for a char sequence of "abcd", could I somehow get the bit sequence of compressed version of "abcd", and then just search the file for it? Thanks.
You cannot use GZIP or similar to do this as the encoding of each byte change as the stream is processed. i.e. the only way to determine what a byte means is to read all the bytes previous.
If you want to access the data randomly, you can break the String into smaller sections. That way you only need to decompress a relative short section of data.
Related
I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet :
try(BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
List<String> data = reader.lines()
.collect(Collectors.toList());
}
The problem is, the file itsef is less than 2GB, but when I look at the memory, the JVM is allocating twice the size of the file :
Also, here are the heaviest objects in memory :
So what I Understand is that Java is allocating twice the memory needed for the operation, one to put the content of the file in a byte array and another one to instanciate the string list.
My question is : can we optimize that ? avoid having twice the memory size needed ?
tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there's a good amount on caveats on this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often then not, you don't actually need everything in memory at once. Instead process and write away the data as you read it, thus never having more than a hand-full of records in memory at once.
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep in to the details of your JVM).
From https://docs.oracle.com/javase/tutorial/jndi/objects/serial.html
To serialize an object means to convert its state to a byte stream so
that the byte stream can be reverted back into a copy of the object.
Since everything is stored in memory as 0s and 1s, why is there an additional need to deconstruct an object to a form that can be transmitted over a stream? Why is the existing state not good enough for transmission?
If you find 01000001, how do you know if it is the number 65 or ASCII for A? Or perhaps a frequency in a wav file? It might be color information in a bitmap.
There are so many ways information can be interpreted. You must give the receiver a way of interpreting the information. My first example is perhaps a little silly. Instead, compare a CSV file with a JSON file. Try recreating a JSON structure in a CSV. It won't be pretty.
The same logic lies behind Java's serialization. How do you know what the class definition looks like?
Main reason: the data representing one instance isn't contiguous.
In
To serialize an object means to convert its state to a byte stream so
that the byte stream can be reverted back into a copy of the object.
the main key word is "byte stream", meaning a sequence of bytes.
Your approach matches the concept of languages like Pascal and C, where most data structures finally get represented as a contiguous block of bytes.
With languages like Java, instance fields very often don't contain the field values, but reference other instances containing the values or referencing further instances, and so on. Even a simple class having just one String field ends up in (at least) three different instances:
the main instance, holding a reference to the String.
a String instance, having a reference to the array of characters,
a char[] array holding the contents of the string.
And there's no guarantee whatsoever where these instances get located in memory, most probably not in consecutive locations. So, the in-memory representation indeed consists of bytes, but not in a sequential arrangement.
That's why we talk about "serialization", this process creates a serial stream of bytes out of something that more resembles a web of interlinked elements.
I am comparing JSON and BSON for serializing objects. These objects contain several arrays of a large number of integers. In my test the object I am serializing contains a total number of about 12,000 integers. I am only interested in how the sizes compare of the serialized results. I am using JSON.NET as the library which does the serialization. I am using JSON because I also want to be able to work with it in Javascript.
The size of the JSON string is about 43kb and the size of the BSON result is 161kb. So a difference factor of about 4. This is not what I expected because I looked at BSON because I thought BSON is more efficient in storing data.
So my question is why is BSON not efficient, can it be made more efficient? Or is there another way of serializing data with arrays containing large number of integers, which can be easily handled in Javascript?
Below you find the code to test the JSON/BSON serialization.
// Read file which contain json string
string _jsonString = ReadFile();
object _object = Newtonsoft.Json.JsonConvert.DeserializeObject(_jsonString);
FileStream _fs = File.OpenWrite("BsonFileName");
using (Newtonsoft.Json.Bson.BsonWriter _bsonWriter = new BsonWriter(_fs)
{ CloseOutput = false })
{
Newtonsoft.Json.JsonSerializer _jsonSerializer = new JsonSerializer();
_jsonSerializer.Serialize(_bsonWriter, _object);
_bsonWriter.Flush();
}
Edit:
Here are the resulting files
https://skydrive.live.com/redir?resid=9A6F31F60861DD2C!362&authkey=!AKU-ZZp8C_0gcR0
The efficiency of JSON vs BSON depends on the size of the integers you're storing. There's an interesting point where ASCII takes fewer bytes than actually storing integer types. 64-bit integers, which is how it appears your BSON document, take up 8 bytes. Your numbers are all less than 10,000, which means you could store each one in ASCII in 4 bytes (one byte for each character up through 9999). In fact, most of your data look like it's less than 1000, meaning it can be stored in 3 or fewer bytes. Of course, that deserialization takes time and isn't cheap, but it saves space. Furthermore, Javascript uses 64-bit values to represent all numbers, so if you wrote it to BSON after converting each integer to a more appropriate dataformat, your BSON file could be much larger.
According to the spec, BSON contains a lot of metadata that JSON doesn't. This metadata is mostly length prefixes so that you can skip through data you aren't interested in. For example, take the following data:
["hello there, this is an necessarily long string. It's especially long, but you don't care about it. You're just trying to get to the next element. But I keep going on and on.",
"oh man. here's another string you still don't care about. You really just want the third element in the array. How long are the first two elements? JSON won't tell you",
"data_you_care_about"]
Now, if you're using JSON, you have to parse the entirety of the first two strings to find out where the third one is. If you use BSON, you'll get markup more like (but not actually, because I'm making this markup up for the sake of example):
[175 "hello there, this is an necessarily long string. It's especially long, but you don't care about it. You're just trying to get to the next element. But I keep going on and on.",
169 "oh man. here's another string you still don't care about. You really just want the third element in the array. How long are the first two elements? JSON won't tell you",
19 "data_you_care_about"]
So now, you can read '175', know to skip forward 175 bytes, then read '169', skip forward 169 bytes, and then read '19' and copy the next 19 bytes to your string. That way you don't even have to parse the strings for delimiters.
Using one versus the other is very dependent on what your needs are. If you're going to be storing enormous documents that you've got all the time in the world to parse, but your disk space is limited, use JSON because it's more compact and space efficient.
If you're going to be storing documents, but reducing wait time (perhaps in a server context) is more important to you than saving some disk space, use BSON.
Another thing to consider in your choice is human readability. If you need to debug a crash report that contains BSON, you'll probably need a utility to decipher it. You probably don't just know BSON, but you can just read JSON.
FAQ
I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file. Since Java is the language I'm the most familiar with, I've decided to use it even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.
One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to a byte or byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails "knowledge* of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, and the latter by Reader and Write. The InputStreamReader class is a Reader, that adapts an InputStream. You should not be considering using it for an application that wants to read stuff byte-wise.
We shouldn't use byte Stream as Sun Doc says -
actually it represents a kind of low-level I/O that you should avoid.
What is actually low-level I/O and what is exact problem using byte stream.
So the Java docs say:
CopyBytes seems like a normal program, but it actually represents a
kind of low-level I/O that you should avoid. Since xanadu.txt contains
character data, the best approach is to use character streams, as
discussed in the next section. There are also streams for more
complicated data types. Byte streams should only be used for the most
primitive I/O.
The byte streams give you access to the file as it is. Just the bytes. No interpration of any kind. That means no character set conversion, no handling of ints or floats in binary or ascii representation, no dealing with byte orders, or any of that. The higher level streams provide some of these.
Of course a program that copies a file is actually a pretty good example of something that needs a raw byte stream, because it doesn't need or want to do any kind of intepretation of the data; it just wants to copy it verbatim.
So what the really mean is, use byte streams if you think you need them, but be sure you know what you are doing :)
The suggestion is in the context of reading a text file that is discussed in the tutorial. For that purpose it is better to use character streams to handle character set translation properly:
The Java platform stores character values using Unicode conventions.
Character stream I/O automatically translates this internal format to
and from the local character set.
A program that uses character streams in place of byte streams
automatically adapts to the local character set and is ready for
internationalization — all without extra effort by the programmer.