Compressing a byte array in Java and decompressing in C

I currently have the following array in a Java program,
byte[] data = new byte[800];
and I'd like to compress it before sending it to a microcontroller over serial (115200 baud). I would then like to decompress the array on the microcontroller in C. However, I'm not quite sure what the best way to do this is. Performance is an issue since the microcontroller is just an Arduino, so it can't be too memory/CPU intensive. The data is more or less random, I'd say (edit: I guess it's not really that random; see the edit below), since it represents an RGB color value for every 16 bits.
What would be the best way to compress this data? Any idea how much compression I could possibly get?
edit
Sorry about the lack of info. I need the compression to be lossless and I do only intend to send 800 bytes at a time. My issue is that 800 bytes won't transfer fast enough at the rate of 115200 baud that I am using. I was hoping I could shrink the size a little bit to improve speed.
Every two bytes looks like:
0RRRRRGGGGGBBBBB
Where R G and B bits represent the values for color channels red, green, and blue respectively. Every two bytes is then an individual LED on a 20x20 grid. I would imagine that many sets of two bytes would be identical since I frequently assign the same color codes to multiple LEDs. It may also be the case that RGB values are often > 15 since I typically use bright colors when I can (However, this might be a moot point since they are not all typically > 15 at once).
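For concreteness, packing and unpacking that 0RRRRRGGGGGBBBBB layout in Java might look like this (class and method names are just illustrative):

```java
public class Rgb555 {
    // Pack three 5-bit channels into 0RRRRRGGGGGBBBBB (bit 15 stays 0)
    static int pack(int r, int g, int b) {
        return ((r & 0x1F) << 10) | ((g & 0x1F) << 5) | (b & 0x1F);
    }
    static int red(int word)   { return (word >> 10) & 0x1F; }
    static int green(int word) { return (word >> 5)  & 0x1F; }
    static int blue(int word)  { return word & 0x1F; }

    public static void main(String[] args) {
        int w = pack(31, 16, 1);
        System.out.printf("0x%04X r=%d g=%d b=%d%n", w, red(w), green(w), blue(w));
        // prints 0x7E01 r=31 g=16 b=1
    }
}
```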

If the data is "more or less random" then you won't have much luck compressing it, I'm afraid.
UPDATE
Given the new information, I bet you don't need 32k colours on your LED display. I'd imagine that a 1024- or 256-colour palette might be sufficient. Hence you could get away with a trivial compression scheme (simply map each word through a lookup table, or possibly just discard the LSBs of each component), one that would work even for completely uncorrelated pixel values.
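As a sketch of the "discard LSBs" variant (the RGB332 target format and all names here are my own illustrative choices): each 16-bit word shrinks to one byte, halving the data no matter how uncorrelated the pixels are.

```java
public class PaletteSqueeze {
    // Quantize a 0RRRRRGGGGGBBBBB word to one byte: RRRGGGBB
    // (keep the 3 MSBs of R and G, 2 MSBs of B) -- a lossy but
    // trivial 2:1 reduction that works on any pixel data.
    static int quantize(int word) {
        int r = (word >> 12) & 0x07;  // top 3 of 5 red bits
        int g = (word >> 7)  & 0x07;  // top 3 of 5 green bits
        int b = (word >> 3)  & 0x03;  // top 2 of 5 blue bits
        return (r << 5) | (g << 2) | b;
    }
    // Expand back on the receiver (the discarded low bits stay lost)
    static int expand(int rgb332) {
        int r = ((rgb332 >> 5) & 0x07) << 2;
        int g = ((rgb332 >> 2) & 0x07) << 2;
        int b = (rgb332 & 0x03) << 3;
        return (r << 10) | (g << 5) | b;
    }
    public static void main(String[] args) {
        int word = (31 << 10) | (16 << 5) | 1;  // a bright pixel
        System.out.printf("0x%04X -> 0x%02X -> 0x%04X%n",
                word, quantize(word), expand(quantize(word)));
        // prints 0x7E01 -> 0xF0 -> 0x7200
    }
}
```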

Use miniLZO compression. There is both a Java version and a C version.

A really simple compression/decompression algorithm that is practical in tiny embedded environments and is easy to "roll your own" is run length encoding. Basically this means replacing a run of duplicate values with a (count, value) pair. Of course you need a sentinel (magic) value to introduce the pair, and then a mechanism to allow the magic value to appear in normal data (typically an escape sequence can be used for both jobs). In your example it might be best to use 16 bit values (2 bytes).
But naturally it all depends on the data. Data that is sufficiently random is incompressible by definition. You would do best to collect some example data first, then evaluate your compression options.
Edit after extra information posted
Just for fun and to show how easy run length encoding is I have coded up something. I'm afraid I've used C for compression as well, since I'm not a Java guy. To keep things simple I've worked entirely with 16 bit data. An optimization would be to use an 8 bit count in the (count,value) pair. I haven't tried to compile or test this code. See also my comment to your question about the possible benefits of mangling the LED addresses.
#include <stdint.h>

#define NBR_16BIT_WORDS 400

// Return number of words written to dst (always
// less than or equal to NBR_16BIT_WORDS)
uint16_t compress( uint16_t *src, uint16_t *dst )
{
    uint16_t *end = src + NBR_16BIT_WORDS;
    uint16_t *dst_begin = dst;
    while( src < end )
    {
        uint16_t *temp;
        uint16_t count = 1;
        for( temp = src+1; temp < end; temp++ )
        {
            if( *src == *temp )
                count++;
            else
                break;
        }
        if( count < 3 )
            *dst++ = *src++;
        else
        {
            *dst++ = (*src) | 0x8000;  /* bit 15 is unused in the pixel format */
            *dst++ = count;
            src += count;              /* advance the pointer, not the value */
        }
    }
    return (uint16_t)(dst - dst_begin);
}

void decompress( uint16_t *src, uint16_t *dst )
{
    uint16_t *end_src = src + NBR_16BIT_WORDS;
    uint16_t *end_dst = dst + NBR_16BIT_WORDS;
    while( src < end_src && dst < end_dst )
    {
        uint16_t data = *src++;
        if( (data & 0x8000) == 0 )
            *dst++ = data;
        else
        {
            data &= 0x7fff;
            uint16_t count = *src++;
            while( dst < end_dst && count-- )
                *dst++ = data;
        }
    }
}
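For the Java (sending) side, an equivalent compressor could be sketched as follows. This is my own companion to the C code above, not part of the original answer, using the same wire format: a word with bit 15 set introduces a (value, count) pair, and runs shorter than 3 are sent literally.

```java
public class RunLength {
    // Compress 16-bit words; a run of 3+ identical words becomes
    // (word | 0x8000, count). Relies on bit 15 of pixel data being 0.
    // Returns the number of words written to dst (never more than src.length).
    static int compress(short[] src, short[] dst) {
        int out = 0;
        for (int i = 0; i < src.length; ) {
            int count = 1;
            while (i + count < src.length && src[i + count] == src[i]) count++;
            if (count < 3) {
                dst[out++] = src[i++];       // short run: emit literally
            } else {
                dst[out++] = (short) (src[i] | 0x8000);
                dst[out++] = (short) count;
                i += count;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        short[] frame = {1, 2, 2, 3, 3, 3, 3, 4};
        short[] packed = new short[frame.length];
        int n = compress(frame, packed);
        System.out.println(n);  // 6 words: literals 1, 2, 2, then (3|0x8000, 4), then 4
    }
}
```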

One of the first things to do would be to convert from RGB to YUV, or YCrCb, or something on that order. Having done that, you can usually get away with sub-sampling the U and V (or Cr/Cb) channels to half resolution. This is quite common in most types of images (e.g., JPEG, and MPEG both do it, and so do the sensors in most digital cameras).
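A minimal sketch of the colour-space step, using the common BT.601 full-range coefficients (my choice; other variants exist). Once pixels are in YCbCr, the Cb/Cr planes can be averaged over 2x1 or 2x2 blocks while Y keeps full resolution:

```java
public class YCbCrConvert {
    static int clamp(int v) { return Math.max(0, Math.min(255, v)); }

    // BT.601 full-range RGB (0..255) -> YCbCr (0..255)
    static int[] toYCbCr(int r, int g, int b) {
        int y  = clamp((int) Math.round( 0.299 * r + 0.587 * g + 0.114 * b));
        int cb = clamp((int) Math.round(-0.169 * r - 0.331 * g + 0.500 * b) + 128);
        int cr = clamp((int) Math.round( 0.500 * r - 0.419 * g - 0.081 * b) + 128);
        return new int[]{y, cb, cr};
    }

    public static void main(String[] args) {
        int[] ycc = toYCbCr(200, 30, 60);
        System.out.println(ycc[0] + " " + ycc[1] + " " + ycc[2]);  // 84 114 211
    }
}
```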
Realistically, starting with only 800 bytes of data, most other forms of compression are going to be a waste of time and effort. You're going to have to put in quite a bit of work before you accomplish much (and keeping it reasonably fast on an Arduino won't be trivial either).
Edit: okay, if you're absolutely certain you can't modify the data at all, things get more difficult very quickly. The real question at that point is what kind of input you're dealing with. Others have already mentioned the possibility of something on the order of predictive delta compression -- e.g., based on preceding pixels, predict what the next one is likely to be, and then encode only the difference between the prediction and the actual value. Getting the most out of that, however, generally requires running the result through some sort of entropy-based algorithm like Shannon-Fano or Huffman coding. Those, unfortunately, aren't usually the fastest to decompress.
If your data is mostly things like charts or graphs, where you can expect to have large areas of identical pixels, run-length (or run-end) encoding can work pretty well. This has the advantage of being really trivial to decompress as well.
I doubt that LZ-based compression is going to work so well though. LZ-based compression works (in general) by building a dictionary of strings of bytes that have been seen, and when/if the same string of bytes is seen again, transmitting the code assigned to the previous instance instead of re-transmitting the entire string. The problem is that you can't transmit uncompressed bytes -- you start out by sending the code word that represents that byte in the dictionary. In your case, you could use (for example) a 10-bit code word. This means the first time you send any particular character, you need to send it as 10 bits, not just 8. You only start to get some compression when you can build up some longer (two-byte, three-byte, etc.) strings in your dictionary, and find a matching string later in the input.
This means LZ-based compression usually gets fairly poor compression for the first couple hundred bytes or so, then about breaks even for a while, and only after it's been running across some input for a while does it really start to compress well. Dealing with only 800 bytes at a time, I'm not at all sure you're ever going to see much compression -- in fact, working in such small blocks, it wouldn't be particularly surprising to see the data expand on a fairly regular basis (especially if it's very random).

The data is more or less random I'd say since it represents a rgb color value for every 16 bits.
What would be the best way to compress this data? Any idea how much compression I could possibly get?
Ideally you can compress 800 bytes of colour data to one byte if the whole image is the same colour. As Oli Charlesworth mentions however, the more random the data, the less you can compress it. If your images looks like static on a TV, then indeed, good luck getting any compression out of it.

Definitely consider Oli Charlesworth's answer. On a 20x20 grid, I don't know if you need a full 32k color palette.
Also, in your earlier question, you said you were trying to run this on a 20ms period (50 Hz). Do you really need that much speed for this display? At 115200 bps, you can transmit ~11520 bytes/sec - call it 10KBps for a margin of safety (e.g. your micro might have a delay between bytes, you should do some experiments to see what the 'real' bandwidth is). At 50 Hz, this only allows you about 200 bytes per packet - you're looking for a compression ratio over 75%, which may not be attainable under any circumstances. You seem pretty married to your requirements, but it may be time for an awkward chat.
If you do want to go the compression route, you will probably just have to try several different algorithms with 'real' data, as others have said, and try different encodings. I bet you can find some extra processing time by doing matrix math, etc. in between receiving bytes over the serial link (you'll have about 80 microseconds between bytes) - if you use interrupts to read the serial data instead of polling, you can probably do pretty well by using a double buffer and processing/displaying the previous buffer while reading into the current buffer.
EDIT:
Is it possible to increase the serial port speed beyond 115200? This USB-serial adapter at Amazon says it goes up to 1 Mbps (probably actually 921600 bps). Depending on your hardware and environment, you may have to worry about bad data, but if you increase the speed enough, you could probably add a checksum, and maybe even limited error correction.
I'm not familiar with the Arduino, but I've got an 8-bit FreeScale HCS08 I drive at 1.25 Mbps, although the bus is actually running RS-485, not RS-232 (485 uses differential signaling for better noise performance), and I don't have any problems with noise errors. You might even consider a USB RS-485 adapter, if you can wire that to your Arduino (you'd need conversion hardware to change the 485 signals to the Arduino's levels).
EDIT 2:
You might also consider this USB-SPI/I2C adapter, if you have an available I2C or SPI interface, and you can handle the wiring. It says it can go to 400 kHz I2C or 200 kHz SPI, which is still not quite enough by itself, but you could split the data between the SPI/I2C and the serial link you already have.

LZ77/78 are relatively easy to write: http://en.wikipedia.org/wiki/LZ77_and_LZ78
However, given the small amount of data you're transferring, it's probably not worth compressing it at all.

Related

RandomAccessFile with support beyond Long?

I'm currently using an instance of RandomAccessFile to manage some in-memory data, but the size of my RandomAccessFile instance is beyond 2^64 bytes, so I cannot use methods such as seek() and write() because they use long and cannot manage an address space bigger than 2^64. So what do I do? Is there something else I can use which supports an address space beyond 2^64?
EDIT: Reason for asking this question:
I have a Tree data structure which in theory can have upto 2^128 nodes, and I want to store this tree onto a file. Each node has data that's roughly 6 bytes. So I'm wondering how will I store this tree to file.
Not a proper answer, but are you sure your file is actually this large?
From the docs for Long.MAX_VALUE:
A constant holding the maximum value a long can have, 2^63-1.
From the docs for RandomAccessFile.length():
the length of this file, measured in bytes.
Do you know how many bytes 2^63-1 is? Rather, 9,223,372,036,854,775,807 bytes?
9,223,372,036,854,775,807 B
9,223,372,036,854,775 KB
9,223,372,036,854 MB
9,223,372,036 GB
9,223,372 TB
9,223 PB
9 EB
If I math'd correctly, you would need a constant write speed of about 272 GiB/s for 1 year.
While this is an excellent question I would like to see an answer to, I highly doubt that you have a single file that will be 9EB in size, if the OS will even support this.
edit
Here are some File System Limits, and much to my own surprise, NTFS will actually support single files up to 16EiB; however, it is one of only a few on the list that support files that large.
If you ABSOLUTELY need to access a file larger than 9EiB, it looks like you might need to roll your own version of RandomAccessFile, using BigInteger where the other uses long. This could get you up to (2 ^ 32) ^ Integer.MAX_VALUE bytes.
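To make that idea concrete, here is a rough sketch of how such a class could split a BigInteger address into a backing-file index plus an in-file offset. The class name, the part-N.bin naming, and the 2^62 partition size are all illustrative assumptions, not a real API:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.math.BigInteger;

// Sketch of a "big" random-access store: a BigInteger address is split
// into (file index, offset), each backing file holding up to 2^62 bytes.
public class BigRandomAccess {
    static final BigInteger FILE_SIZE = BigInteger.ONE.shiftLeft(62);

    // Split an address into { file index, offset within that file }
    static long[] locate(BigInteger address) {
        BigInteger[] qr = address.divideAndRemainder(FILE_SIZE);
        return new long[]{ qr[0].longValueExact(), qr[1].longValueExact() };
    }

    static void write(BigInteger address, byte[] node) throws IOException {
        long[] loc = locate(address);
        try (RandomAccessFile f = new RandomAccessFile("part-" + loc[0] + ".bin", "rw")) {
            f.seek(loc[1]);
            f.write(node);
        }
    }

    public static void main(String[] args) {
        BigInteger addr = BigInteger.ONE.shiftLeft(70).add(BigInteger.valueOf(42));
        long[] loc = locate(addr);
        System.out.println(loc[0] + " " + loc[1]);  // file 256, offset 42
    }
}
```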
I suppose that your question arises from this requirement: "Is there something else I can use which supports an address space beyond 2^64?"
In other words, you want to access memory by address, and your address could be big.
Of course, you should not allocate a 2^128 * 6 byte file, even if it were possible nowadays; it would be too expensive. The typical approach here is to split your storage into smaller parts and address it accordingly.
For instance
write(partition, address, node);
node = read(partition, address);
As you said, you want to store IPv6 addresses. To store IPv6 addresses and search fast over them, it is enough to have a table with 8 columns and indexes for each part of an IPv6 address. Or you can store information in a tree hierarchy like:
0000
0000
0000
etc
0001
0000
etc
Which you should allocate on demand. So the real question should be how to organize your storage effectively.
UPDATE
I want to note that in reality there is a private API in Java (Oracle JDK, not OpenJDK) which can give you an opportunity to handle files of more than 2 GB, but it is private and not a part of the public API at all, so I won't describe it here unless requested. You can find it directly in sun.nio.ch.FileChannelImpl (the private map0, unmap0 methods).
Even if you had the software to do such things, it would be unusable at the scale you suggest since there doesn't exist a single machine with that much disk space.
So, since the main issue is the hardware limitations of a single machine, the solution would be to use a distributed computing framework that will allow you to scale out as much as needed. I suggest using https://ignite.apache.org/ as it's incredibly flexible and has pretty decent support here on Stack Overflow.
Coming at this from another perspective, you want to store IPv6 IP addresses. At the theoretical level, sure, you would need 2^128 addresses. At the practical level, even if you attempted to index every IP out there today, you wouldn't significantly pass 2^32, since that is the number of IPv4 addresses and we are just passing that limit.
Yeah, this is 18.4467441 Exabytes which is a lot. You cannot store this in memory as there is no computer or even cluster with such memory (RAM).
Of course you can write to files. But these should definitely be multiple files. I don't think it is possible to have one such large file, and even if it were, it would take hours or days to seek through it. So there are 2 approaches:
Split in multiple smaller files
Use "streams" - read a bit, process, write and read next.
Maybe it is a silly observation, but did you think of serializing your data structure? There are many examples online; looking around, I found this simple example that you could adapt to your tree, and then you can do the conversion to store the data.

Can I get sound data as array?

I'm making a program for Active Noise Control (also written with Adaptive instead of Active, or Cancellation instead of Control).
The system is pretty simple:
get sound via a mic
turn the sound into data I can read (something like an integer array)
make the antiphase of the sound
turn the data back into a sound file
Following are my questions:
Can I read sound as an integer array?
If I can use an integer array, how can I make the antiphase? Just multiply every data point by -1?
Any useful thoughts about my project?
Is there any recommended language other than Java?
I heard that Stack Overflow has many top-class programmers, so I expect a critical answer :D
Answering your questions:
(1) When you read sound, a byte array is returned. The bytes can readily be decoded into integers, shorts, floats, whatever. Java supports many common formats, and probably has one that matches your microphone input and speaker output. For example, Java supports 16-bit encoding, stereo, 44100 fps, which is considered the standard for CD-quality. There are several questions already at StackOverflow that show the coding for the decoding and recoding back to bytes.
(2) Yes, just multiply every element of your PCM array by -1. When you add the negative to the correctly lined-up counterpart, 0 will result.
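A minimal sketch of that inversion on decoded 16-bit samples. One edge case worth guarding: in two's complement, negating Short.MIN_VALUE (-32768) overflows back to itself.

```java
public class Antiphase {
    // Invert the phase of 16-bit PCM samples in place:
    // negating each sample flips the waveform around zero.
    static void invert(short[] samples) {
        for (int i = 0; i < samples.length; i++) {
            // Guard: -(-32768) overflows back to -32768 in a short
            samples[i] = samples[i] == Short.MIN_VALUE
                    ? Short.MAX_VALUE : (short) -samples[i];
        }
    }

    public static void main(String[] args) {
        short[] s = {1000, -2000, Short.MIN_VALUE};
        invert(s);
        System.out.println(s[0] + " " + s[1] + " " + s[2]);  // -1000 2000 32767
    }
}
```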
(3 & 4) I don't know what the tolerances are for lag time! I think if you simply take the input, decode, multiply by -1, recode, and output, it might be possible to get a very small amount of processing time. I don't know what Java is capable of here, but I bet it will be on the scale of a dozen millis, give or take. How much is enough for cancellation? How far does the sound travel from mike to speaker location? How much time does that allow? (Or am I missing something about how this works? I haven't done this sort of thing before.)
Java is pretty darn fast, and you will be relatively close to the native code level with the reading and writing and simple numeric conversions. The core code (for testing) could probably be written in an afternoon, using the following tutorial examples as a template: Reading/Writing sound files, see code snippets. I'd pay particular attention to the spot where the comment reads "Here do something useful with the audio data that is in the bytes array..." At this point, you would put the code to convert the bytes to DSP data, multiply by -1, then convert back to bytes.
If Java doesn't prove fast enough, I assume the next thing to try would be some flavor of C.

Is the output of a compression algorithm (for example gzip, zip, or snappy) definitely smaller than the input?

I have this question since I need to allocate the output buffer for the compressed data. I need to know how large the buffer should be.
For lossy compression algorithms it is possible for this to be the case, though not guaranteed. For lossless compression algorithms this is not the case - a lossless compression will always generate outputs that are larger than the input for some inputs. See this Wikipedia page for reasoning why.
There is always a fixed size associated with the "header", but for any real-life data (e.g. the length of this comment), compression will usually help.
That said, it is not "safe" to declare a post-compression buffer to be the same size as the input buffer. It might be bigger.
Compression libraries, such as zlib (for the inflate/deflate used in gzip and pkzip), are typically designed to process at most N bytes of input and write at most M bytes to a user-allocated buffer, signalling the caller when the library expects either new input data or a new/cleared output buffer. Only rarely do those libraries expect the complete input and output to reside in memory; they work on blocks.
Also, the 'search windows' of many common algorithms are relatively small. This also limits the amount of required memory. Counter-examples exist, e.g. the BWT used in tar.bz2.
And as other people have pointed out, the output of any lossless compression algorithm can be larger than the input, in which case most well-designed compression libraries automatically implement a fallback mechanism, which just wraps an uncompressed block in a container with size information.
To summarize: many compression libraries just require a buffer of a few kilobytes to a few megabytes and can process an input of any length with it. (Such constraints are, by the way, included in MPEG: in addition to the expected frame size (e.g. 128 kbps in mp3), they specify the maximum required buffer size.)
If you're using zlib (for gzip), you might find the following interface useful: (from zlib.h)
ZEXTERN uLong ZEXPORT compressBound OF((uLong sourceLen));
/*
compressBound() returns an upper bound on the compressed size after
compress() or compress2() on sourceLen bytes. It would be used before
a compress() or compress2() call to allocate the destination buffer.
*/
I believe that bzip has a similar interface. The value returned will be slightly larger than sourceLen, and should only be used if the data you are compressing is small enough that you can do the compression in memory. For such applications, though, it's very useful.
Note that most of the time, you won't use most of the memory allocated, so you would also want to be able to return the unused memory if you are planning to keep the compressed version in memory for any length of time.
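On the Java side there is no direct compressBound() equivalent in java.util.zip, so a common workaround is to over-allocate and trim. A sketch (the 64-byte slack is an assumption that is ample for inputs this small, not a formally guaranteed bound):

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;

public class DeflateToBuffer {
    // Compress into a buffer deliberately larger than the input,
    // since incompressible data grows slightly under deflate.
    static byte[] deflate(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];  // slack for header/expansion
        int n = 0;
        while (!d.finished()) {
            n += d.deflate(buf, n, buf.length - n);
        }
        d.end();
        return Arrays.copyOf(buf, n);  // keep only what was written
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[800];        // highly repetitive: shrinks a lot
        byte[] noise = new byte[800];
        new Random(42).nextBytes(noise);     // essentially incompressible: grows
        System.out.println(deflate(zeros).length + " vs " + deflate(noise).length);
    }
}
```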
No, it is not.
A quick example: data with uniformly distributed, non-repeating values cannot be compressed without loss, so you end up with the original data plus the attached metadata.

Java: multithreaded character stream decoding

I am maintaining a high-performance CSV parser and am trying to get the most out of the latest technology to improve the throughput. For this particular task this means:
Flash memory (We own a relatively inexpensive PCI-Express card, 1 TB of storage that reaches 1 GB/s sustained read performance)
Multiple cores (We own a cheap Nehalem server with 16 hardware threads)
The first implementation of the CSV parser was single threaded. File reading, character decoding, field splitting, text parsing, all within the same thread. The result was a throughput of about 50MB/s. Not bad but well below the storage limit...
The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (I mean parsing delimited text fields into doubles, integers, dates...). This works much faster, close to 400MB/s on our box.
But still well below the performance of our storage. And those SSDs will improve again in the future; we are not getting the most out of them in Java. It is clear that the current limitation is the character decoding ( CharsetDecoder.decode(...) ). That is the bottleneck: on a powerful Nehalem processor it transforms bytes into chars at 400MB/s, pretty good, but this has to be single threaded. The CharsetDecoder is somewhat stateful, depending on the charset used, and does not support multithreaded decoding.
So my question to the community is (and thank you for reading the post so far): does anyone know how to parallelize the charset decoding operation in Java?
does anyone know how to parallelize the charset decoding operation in Java?
You might be able to open multiple input streams to do this (I'm not sure how you'd go about this with NIO, but it must be possible).
How difficult this would be depends on the encoding you're decoding from. You will need a bespoke solution for the target encoding. If the encoding has a fixed width (e.g. Windows-1252), then one byte == one character and decoding is easy.
Modern variable-width encodings (like UTF-8 and UTF-16) contain rules for identifying the first byte of a character sequence, so it is possible to jump to the middle of a file and start decoding (you'll have to note the end of the previous chunk, so it is wise to start decoding the end of the file first).
Some legacy variable-width encodings might not be this well designed, so you'll have no option but to decode from the start of the data and read it sequentially.
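For UTF-8 in particular, the "identify the first byte of a character" rule is simple: continuation bytes always match the bit pattern 10xxxxxx. A sketch of finding a safe split point for a worker thread:

```java
public class Utf8Split {
    // Return the index of the first byte at or after 'pos' that starts a
    // UTF-8 character, i.e. is not a 10xxxxxx continuation byte. A worker
    // thread handed an arbitrary chunk start can skip forward to here.
    static int nextBoundary(byte[] data, int pos) {
        while (pos < data.length && (data[pos] & 0xC0) == 0x80) pos++;
        return pos;
    }

    public static void main(String[] args) {
        // "hé" in UTF-8: 0x68, then the two-byte sequence 0xC3 0xA9
        byte[] utf8 = {0x68, (byte) 0xC3, (byte) 0xA9, 0x6C, 0x6F};
        System.out.println(nextBoundary(utf8, 2));  // byte 2 is a continuation byte -> 3
    }
}
```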
If it is an option, generate your data as UTF-16BE. Then you can cut out decoding and read two bytes straight to a char.
If the file is Unicode, watch out for BOM handling, but I'm guessing you're already familiar with many of the low-level details.
It is clear that the current limitation is the character decoding ( CharsetDecoder.decode(...) )
How do you know that? Does your monitoring / profiling show conclusively that the decoder thread is using 100% of one of your cores?
Another possibility is that the OS is not capable of driving the SSD at its theoretical maximum speed.
If UTF-8 decoding is definitely the bottleneck then it should be possible to do the task in parallel. But you will certainly need to implement your own decoders to do this.
Another (crazy) alternative would be to just separate the input into chunks of some arbitrary size, ignore the decoding issues and then decode each of the chunks in parallel. However, you want to ensure that the chunks overlap (with a parametrized size). If the overlapping region of the two chunks is decoded the same way by the two threads (and your overlap was big enough for the specified encoding) it should be safe to join the results. The bigger the overlap, the more processing required, and the smaller the probability of error. Furthermore, if you are in a situation where you know the encoding is UTF-8, or a similarly simple encoding, you could set the overlap quite low (for that client) and still be guaranteed correct operation.
If the second chunk turns out to be wrong, you will have to redo it, so it is important not to process chunks that are too big in parallel. If you do more than two chunks in parallel, it is important to 'repair' from beginning to end, so that one misaligned block does not invalidate the next block (which might be correctly aligned).
If you know the encoding, and it is either fixed size, or does not contain overlapping byte sequences, you could scan for a special sequence. In CSV, a sequence for newlines might make sense. Even if you dynamically detect the encoding, you could run a pass of the first few bytes to determine encoding, and then move on to parallel decoding.

Handling large datasets in Java/Clojure: littleBig data

I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.
Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so is still orders of magnitude less than what I should be able to load.
I understand that Java places artificial restrictions on the size of the heap, and that those can be changed, in part, with options you can specify as the JVM starts. This leads me to my first questions:
Can I change the maximum heap space the JVM has on startup if I am using Swank-Clojure (via Leiningen)?
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interested in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g, n lines at a time? If so, how?
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.
Can I "sample" from the file; e.g. read only every z lines, effectively downsampling my data?
Right now, assuming there are answers to the above (I'll keep searching!) or insights offered that lead to equivalent solutions, I plan to read in a chunk of data at a time, graph it to the timeline (see the screenshot; the timeline is green), and allow the user to interact with just that bit until she clicks the next chunk (or something); then I'd save the changes made to a file and load the next "chunk" of data and display it.
Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).
Most of all, though, is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g, click a point or near one to add a "marker" to that point: that marker is drawn as a vertical rule over that point).
I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound on my question in any way you'd like.
This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.
EDIT Downsampling is possible only when graphing, and not always then, depending on the parts being graphed. I need access to all the data to perform analysis on. (Just clearing that up!) Though I should definitely look into downsampling, I don't think that will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.
Can I change the maximum heap space the JVM has on startup if I am using Swank-Clojure (via Leiningen)?
You can change the Java heap size by supplying the -Xms (min heap) and -Xmx (max heap) options at startup, see the docs.
So something like java -Xms256m -Xmx1024m ... would give 256MB initial heap with the option to grow to 1GB.
I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere where you can change the arguments.
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments.
Can I "sample" from the file; e.g. read only every z lines?
java.io.RandomAccessFile gives random file access by byte index, which you can build on to sample the contents.
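One way to build sampling on top of it: seek to an arbitrary byte offset, throw away the (probably partial) line you landed in, and keep the next full line. A sketch (the temp-file setup is just for the demo; RandomAccessFile.readLine decodes bytes as Latin-1):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;

public class LineSampler {
    // Seek to 'offset', discard the partial line there, return the next full line.
    static String sampleAt(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        if (offset > 0) file.readLine();  // rest of the line we landed inside
        return file.readLine();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("sample", ".txt");
        tmp.deleteOnExit();
        try (FileWriter w = new FileWriter(tmp)) {
            w.write("line1\nline2\nline3\n");
        }
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
            System.out.println(sampleAt(raf, 8));  // offset 8 is inside line2 -> line3
        }
    }
}
```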
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g. n lines at a time? If so, how?
line-seq returns a lazy sequence of each line in a file, so you can process as much at a time as you wish.
Alternatively, use the Java mechanisms in java.io - BufferedReader.readLine() or FileInputStream.read(byte[] buffer)
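A sketch of the chunked variant with BufferedReader, reading n lines at a time so that only one chunk is ever in memory (method names are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkedLines {
    // Read up to n lines at a time so the whole file never sits in memory.
    static List<String> readChunk(BufferedReader reader, int n) throws IOException {
        List<String> chunk = new ArrayList<>(n);
        String line;
        while (chunk.size() < n && (line = reader.readLine()) != null) {
            chunk.add(line);
        }
        return chunk;  // an empty list signals end of input
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for a FileReader on real data
        BufferedReader r = new BufferedReader(new StringReader("a\nb\nc\nd\ne"));
        System.out.println(readChunk(r, 2));  // [a, b]
    }
}
```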
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time?
Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time.
To make the most out of the memory you have, keep the data as primitive as possible.
For some actual numbers, let's assume you want to graph the contents of a music CD:
A CD has two channels, each with 44,100 samples per second
60 min. of music is then ~300 million data points
Represented as 16 bits (2 bytes, a short) per datapoint: 600MB
Represented as primitive int array (4 bytes per datapoint): 1.2GB
Represented as Integer array (32 bytes per datapoint): 10GB
Using the numbers from this blog for object size (16 byte overhead per object, 4 bytes for primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint).
Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will all by itself reduce the number of datapoints you can hold in memory by an order of magnitude.
If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). This is clearly way too little to show any level of detail, you would want some form of subsampling or summary data there.
So the solution you describe - process the full dataset one chunk at a time for a summary display in the 'overview' timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable.
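If you do add summary data for the overview timeline, a common form is a per-bucket (min, max) pair, which preserves the visual envelope of the signal. A quick sketch:

```java
public class MinMaxDownsample {
    // Reduce 'samples' to 'buckets' (min, max) pairs for an overview plot.
    static int[][] downsample(short[] samples, int buckets) {
        int[][] out = new int[buckets][2];
        int per = samples.length / buckets;  // assumes length divides evenly
        for (int b = 0; b < buckets; b++) {
            int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
            for (int i = b * per; i < (b + 1) * per; i++) {
                min = Math.min(min, samples[i]);
                max = Math.max(max, samples[i]);
            }
            out[b][0] = min;
            out[b][1] = max;
        }
        return out;
    }

    public static void main(String[] args) {
        short[] s = {1, 9, -3, 4, 7, 2};
        int[][] r = downsample(s, 2);  // two buckets of 3 samples each
        System.out.println(r[0][0] + ".." + r[0][1] + " " + r[1][0] + ".." + r[1][1]);
        // prints -3..9 2..7
    }
}
```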
Update:
On fast file reads: This article times the file reading speed for 13 different ways to read a 100MB file in Java - the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time.
The article also has a comparison to C in case anyone is interested. (Spoiler: The fastest Java reads are within a factor 2 of a memory-mapped file in C.)
Tossing out a couple ideas from left field...
You might find something useful in the Colt library... http://acs.lbl.gov/software/colt/
Or perhaps memory-mapped I/O.
A couple of thoughts:
Best way to handle large in-memory data sets in Java/Clojure is to use large primitive arrays. If you do this, you are basically using only a little more memory than the size of the underlying data. You handle these arrays in Clojure just fine with the aget/aset functionality
I'd be tempted to downsample, but maintain a way to lazily access the detailed points "on demand" if you need to, e.g. in the user interaction case. Kind of like the way that Google maps lets you see the whole world, and only loads the detail when you zoom in....
If you only care about the output image from the x-y plot, then you can construct it by loading in a few thousand points at a time (e.g. loading into your primitive arrays), plotting them, then discarding them. In this way you won't need to hold the full data set in memory.
