I am maintaining a high-performance CSV parser and trying to get the most out of the latest technology to improve throughput. For this particular task that means:
Flash memory (We own a relatively inexpensive PCI-Express card, 1 TB of storage that reaches 1 GB/s sustained read performance)
Multiple cores (We own a cheap Nehalem server with 16 hardware threads)
The first implementation of the CSV parser was single-threaded: file reading, character decoding, field splitting, and text parsing all happened in the same thread. The result was a throughput of about 50 MB/s. Not bad, but well below the storage limit...
The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (I mean parsing delimited text fields into doubles, integers, dates...). This works much faster, close to 400 MB/s on our box.
But still well below the performance of our storage. And those SSDs will improve again in the future; we are not getting the most out of them in Java. It is clear that the current limitation is the character decoding (CharsetDecoder.decode(...)). That is the bottleneck: on a powerful Nehalem processor it transforms bytes into chars at 400 MB/s, pretty good, but this has to be single-threaded. The CharsetDecoder is somewhat stateful, depending on the charset used, and does not support multithreaded decoding.
So my question to the community is (and thank you for reading the post so far): does anyone know how to parallelize the charset decoding operation in Java?
does anyone know how to parallelize the charset decoding operation in Java?
You might be able to open multiple input streams to do this (I'm not sure how you'd go about this with NIO, but it must be possible).
How difficult this would be depends on the encoding you're decoding from. You will need a bespoke solution for the target encoding. If the encoding has a fixed width (e.g. Windows-1252), then one byte == one character and decoding is easy.
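As a sketch of how trivially parallel the fixed-width case is (the direct cast below is exact for ISO-8859-1; Windows-1252 would additionally need a small lookup table for the 0x80-0x9F range):

// Each worker thread can decode its own slice independently.
static void decodeLatin1(byte[] bytes, int from, int to, char[] out) {
    for (int i = from; i < to; i++) {
        out[i] = (char) (bytes[i] & 0xFF);   // one byte == one char
    }
}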
Modern variable-width encodings (like UTF-8 and UTF-16) contain rules for identifying the first byte of a character sequence, so it is possible to jump to the middle of a file and start decoding (you'll have to note the end of the previous chunk, so it is wise to start decoding the end of the file first).
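For UTF-8 specifically, the boundary search is only a few lines; here is a minimal sketch (a UTF-8 continuation byte always matches the bit pattern 10xxxxxx):

// Given a chunk boundary that may fall mid-character, advance to the
// start of the next complete UTF-8 character. Continuation bytes
// satisfy (b & 0xC0) == 0x80.
static int alignToUtf8CharStart(byte[] buf, int pos) {
    while (pos < buf.length && (buf[pos] & 0xC0) == 0x80) {
        pos++;
    }
    return pos;
}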
Some legacy variable-width encodings might not be this well designed, so you'll have no option but to decode from the start of the data and read it sequentially.
If it is an option, generate your data as UTF-16BE. Then you can cut out decoding and read two bytes straight to a char.
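As a minimal sketch of that shortcut: ByteBuffer's default byte order is already big-endian, so a char view over the raw bytes replaces the decode step entirely (surrogate pairs pass through unchanged, since Java chars are UTF-16 code units anyway):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// 'raw' is assumed to hold UTF-16BE bytes with no BOM.
static CharBuffer decodeUtf16be(byte[] raw) {
    return ByteBuffer.wrap(raw).asCharBuffer();   // zero-copy big-endian view
}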
If the file is Unicode, watch out for BOM handling, but I'm guessing you're already familiar with many of the low-level details.
It is clear that the current limitation is the character decoding (CharsetDecoder.decode(...))
How do you know that? Does your monitoring / profiling show conclusively that the decoder thread is using 100% of one of your cores?
Another possibility is that the OS is not capable of driving the SSD at its theoretical maximum speed.
If UTF-8 decoding is definitely the bottleneck then it should be possible to do the task in parallel. But you will certainly need to implement your own decoders to do this.
Another (crazy) alternative would be to just separate the input into chunks of some arbitrary size, ignore the decoding issues and then decode each of the chunks in parallel. However, you want to ensure that the chunks overlap (with a parametrized size). If the overlapping region of the two chunks is decoded the same way by the two threads (and your overlap was big enough for the specified encoding) it should be safe to join the results. The bigger the overlap, the more processing required, and the smaller the probability of error. Furthermore, if you are in a situation where you know the encoding is UTF-8, or a similarly simple encoding, you could set the overlap quite low (for that client) and still be guaranteed correct operation.
If the second chunk turns out to be wrong, you will have to redo it, so it is important not to make the chunks too big. If you do more than two chunks in parallel, it is important to 'repair' from beginning to end, so that one misaligned block does not invalidate the next block (which might be correctly aligned).
If you know the encoding, and it is either fixed-size or does not contain overlapping byte sequences, you could scan for a special sequence. In CSV, scanning for newlines might make sense. Even if you detect the encoding dynamically, you could run a pass over the first few bytes to determine the encoding, and then move on to parallel decoding.
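A minimal sketch of that newline scan, assuming an encoding (ASCII, UTF-8, Windows-1252) in which the byte 0x0A can only mean '\n' (in UTF-8 it never occurs inside a multi-byte sequence):

// Advance a rough split point to just past the next newline, so the
// chunk handed to each worker thread starts on a record boundary.
static int nextChunkStart(byte[] buf, int roughSplit) {
    int i = roughSplit;
    while (i < buf.length && buf[i] != '\n') {
        i++;
    }
    return Math.min(i + 1, buf.length);
}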
Related
I am working on using Java to read (potentially) large amounts of data from (potentially) large files - the scenario is uncompressed imagery from a file format like HEIF. Larger than 2 GB is likely. Writing is a future need, but this question is scoped to reading.
The HEIF format (which is derived from the ISO Base Media File Format, ISO/IEC 14496-12) consists of variable-size "boxes" - you read the length and kind of each box, and do some parsing appropriate to that box. In my design, I'll parse out the small-ish boxes and keep references to the bulk storage (mdat) offsets so I can pull the data out for rendering / processing as requested.
I'm considering two options - multiple MappedByteBuffers (since that is 2 GB limited), and a single MemorySegment (from a memory-mapped file). It's not clear to me which is likely to be more efficient. The MappedByteBuffer has all the nice ByteBuffer API, but I need to manage multiple entities. The MemorySegment will be a single entity, but it looks like I'll need to create slice views to get anything I can read from (e.g. a byte array or ByteBuffer), which looks like a different version of the same problem. A secondary benefit of the MemorySegment is that it may lead to a nicer design when I need to use some other non-Java API (like feeding the imagery into a hardware encoder for compression). I also have the skeleton of the MemorySegment version implemented and reading (just with some gross assumptions that I can turn it into a single ByteBuffer).
Are there emerging patterns for efficient reading from a MemorySegment? Failing that, is there something I'm missing in the MemorySegment API?
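Not a full answer, but a minimal sketch of the slicing pattern in question (JDK 22+ FFM API; the file name and offsets are placeholders). asSlice() and asByteBuffer() create views rather than copies, so the per-box cost is small; the constraint is that any single ByteBuffer view must stay under 2 GB, which suits box-at-a-time access:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class HeifScan {
    // A box's payload as a cheap ByteBuffer view (each view must be < 2 GB).
    static ByteBuffer boxView(MemorySegment file, long offset, long size) {
        return file.asSlice(offset, size).asByteBuffer();
    }

    public static void main(String[] args) throws Exception {
        try (Arena arena = Arena.ofConfined();
             FileChannel ch = FileChannel.open(Path.of("image.heif"),
                                               StandardOpenOption.READ)) {
            // One mapping covers the whole file, even past 2 GB
            MemorySegment file = ch.map(FileChannel.MapMode.READ_ONLY,
                                        0, ch.size(), arena);
            ByteBuffer header = boxView(file, 0, 8);   // size + type of first box
            long boxSize = Integer.toUnsignedLong(header.getInt());
            System.out.println("first box size: " + boxSize);
        }
    }
}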
I have this question since I need to allocate the output buffer for the compressed data, and I need to know how large that buffer must be. Is the output of a compression algorithm (for example gzip, zip, or snappy) definitely smaller than the input?
For lossy compression algorithms it is possible for this to be the case, though not guaranteed. For lossless compression algorithms it is not - a lossless compressor must produce output larger than the input for some inputs, since it cannot map all 2^n possible n-bit inputs onto distinct outputs shorter than n bits (the pigeonhole principle). See this Wikipedia page for the full reasoning.
There is always a fixed size associated with the "header", but for any real-life data (e.g. the length of this comment), compression will usually help.
That said, it is not "safe" to declare a post-compression buffer to be the same size as the input buffer. It might be bigger.
Compression libraries such as zlib (for the inflate/deflate used in gzip & pkzip) are typically designed to consume at most N bytes of input and write at most M bytes to a user-allocated output buffer - signalling the caller whenever the library needs either new input data or a new/cleared output buffer. Only rarely do such libraries expect the complete input and output to reside in memory; they work on blocks.
Also, the 'search windows' of many common algorithms are relatively small, which further limits the amount of required memory. Counterexamples exist, e.g. the BWT used in tar.bz2.
And as other people have pointed out, the output of any lossless compression algorithm can be larger than the input, in which case most well-designed compression libraries automatically implement a fallback mechanism, which just wraps an uncompressed block in a container with size information.
To summarize: many compression libraries require only a buffer of a few kilobytes to a few megabytes and can process input of any length with it. (Such constraints are, by the way, included in MPEG - in addition to the expected frame size (e.g. 128 kbps in mp3), the standards specify the maximum required buffer size.)
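In Java, java.util.zip.Deflater exposes exactly this block-oriented style; here is a minimal sketch (the 64 KB buffer sizes are arbitrary):

import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.Deflater;

public class BlockDeflate {
    // Compress a stream of any length using two fixed-size buffers.
    static void deflateStream(InputStream in, OutputStream out) throws Exception {
        Deflater deflater = new Deflater();
        byte[] inBuf = new byte[64 * 1024];
        byte[] outBuf = new byte[64 * 1024];
        int read;
        while ((read = in.read(inBuf)) != -1) {
            deflater.setInput(inBuf, 0, read);
            while (!deflater.needsInput()) {       // drain pending output
                int n = deflater.deflate(outBuf);
                out.write(outBuf, 0, n);
            }
        }
        deflater.finish();
        while (!deflater.finished()) {             // flush the tail
            int n = deflater.deflate(outBuf);
            out.write(outBuf, 0, n);
        }
        deflater.end();
    }
}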
If you're using zlib (for gzip), you might find the following interface useful: (from zlib.h)
ZEXTERN uLong ZEXPORT compressBound OF((uLong sourceLen));
/*
compressBound() returns an upper bound on the compressed size after
compress() or compress2() on sourceLen bytes. It would be used before
a compress() or compress2() call to allocate the destination buffer.
*/
I believe that bzip2 has a similar interface. The value returned will be slightly larger than sourceLen, and should only be used if the data you are compressing is small enough that you can do the compression in memory. For such applications, though, it's very useful.
Note that most of the time, you won't use most of the memory allocated, so you would also want to be able to return the unused memory if you are planning to keep the compressed version in memory for any length of time.
No, it is not.
A quick example: data with uniformly distributed, non-repeating values cannot be compressed without loss, so you end up with the original data plus the attached metadata.
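This is easy to check empirically; a quick sketch that deflates 800 random bytes (the exact figure varies, but for random input the output lands slightly above 800):

import java.util.Random;
import java.util.zip.Deflater;

public class ExpansionDemo {
    public static void main(String[] args) {
        byte[] input = new byte[800];
        new Random(42).nextBytes(input);       // effectively incompressible
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] output = new byte[input.length * 2];
        int n = d.deflate(output);             // bytes actually produced
        d.end();
        System.out.println(input.length + " in -> " + n + " out");
    }
}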
I need to parse (and transform and write) a large binary file (larger than memory) in Java. I also need to do so as efficiently as possible in a single thread. And, finally, the format being read is very structured, so it would be good to have some kind of parser library (so that the code is close to the complex specification).
The amount of lookahead needed for parsing should be small, if that matters.
So my questions are:
How important is NIO vs IO for a single-threaded, high-volume application?
Are there any good parser libraries for binary data?
How well do parsers support streaming transformations (I want to be able to stream the data being parsed to some output during parsing - I don't want to have to construct an entire parse tree in memory before writing things out)?
On the NIO front, my suspicion is that NIO isn't going to help much, as I am likely disk-limited (and since it's a single thread, there's no loss in simply blocking). Also, I suspect IO-based parsers are more common.
Let me try to explain if and how Preon addresses all of the concerns you mention:
I need to parse (and transform and write) a large binary file (larger than memory) in Java.
That's exactly why Preon was created. You want to be able to process the entire file without loading it into memory entirely. Still, the programming model gives you a pointer to a data structure that appears to be entirely in memory. However, Preon will try to load data as lazily as it can.
To explain what that means, imagine that somewhere in your data structure, you have a collection of things that are encoded in a binary representation with a constant size; say that every element will be encoded in 20 bytes. Then Preon will first of all not load that collection in memory at all, and if you're grabbing data beyond that collection, it will never touch that region of your encoded representation at all. However, if you would pick the 300th element of that collection, it would (instead of decoding all elements up to the 300th element), calculate the offset for that element, and jump there immediately.
From the outside, it is as though you have a reference to a list that is fully populated. From the inside, Preon only goes out to grab an element of the list if you ask for it. (And it will forget about it immediately afterward, unless you instruct Preon to do things differently.)
I also need to do so as efficiently as possible in a single thread.
I'm not sure what you mean by efficiently. It could mean efficiently in terms of memory consumption, or efficiently in terms of disk IO, or perhaps you mean it should be really fast. I think it's fair to say that Preon aims to strike a balance between an easy programming model, memory use and a number of other concerns. If you really need to traverse all data in a sequential way, then perhaps there are ways that are more efficient in terms of computational resources, but I think that would come at the cost of "ease of programming".
And, finally, the format being read is very structured, so it would be good to have some kind of parser library (so that the code is close to the complex specification).
The way I implemented support for Java bytecode is to just read the bytecode specification, and then map all of the structures mentioned there directly to Java classes with annotations. I think Preon comes pretty close to what you're looking for.
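For flavour, a rough, untested sketch of that style as I understand it from Preon's documentation (package and annotation names may differ between versions; the class-file fields are just an illustration):

import java.io.File;
import org.codehaus.preon.Codec;
import org.codehaus.preon.Codecs;
import org.codehaus.preon.annotation.BoundNumber;

// Hypothetical mapping of the first fields of the class file format;
// the field order mirrors the specification.
class ClassFileHeader {
    @BoundNumber(size = "32") long magic;       // 0xCAFEBABE
    @BoundNumber(size = "16") int minorVersion;
    @BoundNumber(size = "16") int majorVersion;
}

class Demo {
    public static void main(String[] args) throws Exception {
        Codec<ClassFileHeader> codec = Codecs.create(ClassFileHeader.class);
        ClassFileHeader header = Codecs.decode(codec, new File("Foo.class"));
        System.out.printf("version %d.%d%n", header.majorVersion, header.minorVersion);
    }
}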
You might also want to check out preon-emitter, since it allows you to generate annotated hexdumps (such as in this example of the hexdump of a Java class file) of your data, a capability that I haven't seen in any other library. (Hint: make sure you hover with your mouse over the hex numbers.)
The same goes for the documentation it generates. The aim has always been to make sure it creates documentation that could be posted to Wikipedia, just like that. It may not be perfect yet, but I'm not unhappy with what it's currently capable of doing. (For an example: this is the documentation generated for Java's class file specification.)
The amount of lookahead needed for parsing should be small, if that matters.
Okay, that's good. In fact, that's even vital for Preon. Preon doesn't support lookahead. It does support looking back, though. (That is, sometimes part of the encoding mechanism is driven by data that was read before. Preon allows you to declare dependencies that point back to data read earlier.)
Are there any good parser libraries for binary data?
Preon! ;-)
How well do parsers support streaming transformations (I want to be able to stream the data being parsed to some output during parsing - I don't want to have to construct an entire parse tree in memory before writing things out)?
As I outlined above, Preon does not construct the entire data structure in memory before you can start processing it. So, in that sense, you're good. However, there is nothing in Preon supporting transformations as first-class citizens, and its support for encoding is limited.
On the NIO front, my suspicion is that NIO isn't going to help much, as I am likely disk-limited (and since it's a single thread, there's no loss in simply blocking). Also, I suspect IO-based parsers are more common.
Preon uses NIO, but only its support for memory-mapped files.
On NIO vs IO you are right; going with IO should be the right choice - less complexity, stream-oriented, etc.
For a binary parsing library - check out Preon.
Using a memory-mapped file, you can read through the data without worrying about your memory, and it's fast.
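A minimal sketch of that approach (window size and file name are placeholders; since each MappedByteBuffer is limited to 2 GB, a larger file is handled by remapping a sliding window as parsing advances):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("big.bin", "r");
             FileChannel ch = raf.getChannel()) {
            long windowSize = Math.min(ch.size(), 1L << 30);   // 1 GB window
            MappedByteBuffer window =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, windowSize);
            int magic = window.getInt();    // read structured fields directly
            System.out.printf("magic: 0x%08x%n", magic);
        }
    }
}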
I think you are correct re NIO vs IO, unless you have little-endian data, as NIO can read little-endian natively.
I am not aware of any fast binary parsers, generally you want to call the NIO or IO directly.
Memory mapped files can help with writing from a single thread as you don't have to flush it as you write. (But it can be more cumbersome to use)
You can stream the data how you like; I don't foresee any problems.
I have tested different Base64 encoders: mig64, iHarder, Sun, etc. It seems these all need the whole data to be in memory for conversion.
If I want to encode a large file (stream) > 1 GB in a multi-threaded fashion, which codec implementation can be used without corrupting the file? Commons Codec seems to have the Base64OutputStream wrapper. Any other solutions?
To make it clear: I have a 1 TB file, and this file needs to be Base64-encoded. The machine has 2 GB of RAM. What is the fastest way to do it in Java?
I'm not sure which encoder is faster offhand; you'll have to measure each to determine that. However, you can avoid the memory problem and achieve the concurrency by splitting the file into chunks. Just make sure you split them on some 6-byte boundary (since 6 input bytes turn evenly into 8 Base64 characters).
I'd recommend picking a reasonable chunk size and using an ExecutorService to manage a fixed number of threads to do the processing. You can share a RandomAccessFile between them and write to the appropriate places. You'll of course have to calculate the output chunk offsets (just multiply by 8 and divide by 6).
Honestly, though, you might not realize much performance gain here with concurrency; it could just overwhelm the hard drive with random access. I'd start by chunking the file up using a single thread and seeing how fast that is. You can probably crunch a 1 GB file faster than you think. As a rough guess, I'd say 1 minute on modern hardware, even writing to the same drive you're reading from.
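For what it's worth, a sketch of that chunking scheme (chunk size and paths are placeholders; I've given each task its own file handles instead of sharing one RandomAccessFile, which keeps the sketch simpler):

import java.io.RandomAccessFile;
import java.util.Base64;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedBase64 {
    // A multiple of 3 bytes, so every chunk's Base64 output is a whole
    // number of 4-character groups (only the final chunk may be padded).
    static final int CHUNK = 6 * 1024 * 1024;

    public static void encode(String inPath, String outPath, int threads)
            throws Exception {
        long inputLen;
        try (RandomAccessFile f = new RandomAccessFile(inPath, "r")) {
            inputLen = f.length();
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (long off = 0; off < inputLen; off += CHUNK) {
            final long inOff = off;
            final int len = (int) Math.min(CHUNK, inputLen - off);
            pool.submit(() -> {
                try (RandomAccessFile in = new RandomAccessFile(inPath, "r");
                     RandomAccessFile out = new RandomAccessFile(outPath, "rw")) {
                    byte[] raw = new byte[len];
                    in.seek(inOff);
                    in.readFully(raw);
                    byte[] encoded = Base64.getEncoder().encode(raw);
                    out.seek(inOff / 3 * 4);    // output offset: input x 4/3
                    out.write(encoded);
                    return null;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}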
I currently have the following array in a Java program,
byte[] data = new byte[800];
and I'd like to compress it before sending it to a microcontroller over serial (115200 baud). I would then like to decompress the array on the microcontroller in C. However, I'm not quite sure of the best way to do this. Performance is an issue since the microcontroller is just an Arduino, so it can't be too memory/CPU intensive. The data is more or less random (edit: I guess it's not really that random, see the edit below), I'd say, since it represents an RGB color value for every 16 bits.
What would be the best way to compress this data? Any idea how much compression I could possibly get?
edit
Sorry about the lack of info. I need the compression to be lossless, and I only intend to send 800 bytes at a time. My issue is that 800 bytes won't transfer fast enough at the 115200 baud rate I am using. I was hoping I could shrink the size a little to improve speed.
Every two bytes looks like:
0RRRRRGGGGGBBBBB
Where the R, G, and B bits represent the values for the red, green, and blue color channels respectively. Every two bytes is then an individual LED on a 20x20 grid. I would imagine that many sets of two bytes would be identical, since I frequently assign the same color code to multiple LEDs. It may also be the case that the RGB values are often > 15, since I typically use bright colors when I can (however, this might be a moot point since they are not all typically > 15 at once).
If the data is "more or less random" then you won't have much luck compressing it, I'm afraid.
UPDATE
Given the new information, I bet you don't need 32k colours on your LED display. I'd imagine that a 1024- or 256-colour palette might be sufficient. Hence you could get away with a trivial compression scheme (simply map each word through a lookup table, or just discard the LSBs of each component) that would work even for completely uncorrelated pixel values.
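As a sketch of the discard-the-LSBs variant: keeping the top 3/3/2 bits of each channel halves the 800 bytes to 400, and the receiving side only needs a 256-entry table to map each byte back to a display colour.

// Pack one 0RRRRRGGGGGBBBBB word into an RGB332 byte by keeping the
// high bits of each 5-bit channel (3 bits R, 3 bits G, 2 bits B).
static byte toRgb332(int word) {
    int r = (word >> 10) & 0x1F;
    int g = (word >> 5)  & 0x1F;
    int b =  word        & 0x1F;
    return (byte) (((r >> 2) << 5) | ((g >> 2) << 2) | (b >> 3));
}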
Use miniLZO compression. Java version C version
A really simple compression/decompression algorithm that is practical in tiny embedded environments, and easy to "roll your own", is run-length encoding. Basically this means replacing a run of duplicate values with a (count, value) pair. Of course you need a sentinel (magic) value to introduce the pair, and then a mechanism to allow the magic value to appear in normal data (typically an escape sequence can be used for both jobs). In your example it might be best to work with 16-bit values (2 bytes).
But naturally it all depends on the data. Data that is sufficiently random is incompressible by definition. You would do best to collect some example data first, then evaluate your compression options.
Edit after extra information posted
Just for fun, and to show how easy run-length encoding is, I have coded something up. I'm afraid I've used C for the compression side as well, since I'm not a Java guy. To keep things simple I've worked entirely with 16-bit data. An optimization would be to use an 8-bit count in the (count, value) pair. I haven't tried to compile or test this code. See also my comment on your question about the possible benefits of mangling the LED addresses.
#define NBR_16BIT_WORDS 400

typedef unsigned short uint16_t;

// Compress NBR_16BIT_WORDS words from src into dst.
// Returns the number of words written to dst (always
// less than or equal to NBR_16BIT_WORDS).
uint16_t compress( uint16_t *src, uint16_t *dst )
{
    uint16_t *end = src + NBR_16BIT_WORDS;
    uint16_t *dst_begin = dst;
    while( src < end )
    {
        // Count how many times the current word repeats
        uint16_t *temp;
        uint16_t count = 1;
        for( temp = src+1; temp < end; temp++ )
        {
            if( *src == *temp )
                count++;
            else
                break;
        }
        if( count < 3 )
            *dst++ = *src++;        // short runs are stored verbatim
        else
        {
            // The high bit marks a run; this is safe because data words
            // have the form 0RRRRRGGGGGBBBBB, so their top bit is 0
            *dst++ = (*src) | 0x8000;
            *dst++ = count;
            src += count;           // skip over the whole run
        }
    }
    return dst - dst_begin;
}

void decompress( uint16_t *src, uint16_t *dst )
{
    uint16_t *end_src = src + NBR_16BIT_WORDS;
    uint16_t *end_dst = dst + NBR_16BIT_WORDS;
    while( src < end_src && dst < end_dst )
    {
        uint16_t data = *src++;
        if( (data & 0x8000) == 0 )
            *dst++ = data;          // literal word
        else
        {
            data &= 0x7fff;         // clear the run marker
            uint16_t count = *src++;
            while( dst < end_dst && count-- )
                *dst++ = data;
        }
    }
}
One of the first things to do would be to convert from RGB to YUV, or YCrCb, or something on that order. Having done that, you can usually get away with sub-sampling the U and V (or Cr/Cb) channels to half resolution. This is quite common in most types of images (e.g., JPEG, and MPEG both do it, and so do the sensors in most digital cameras).
Realistically, starting with only 800 bytes of data, most other forms of compression are going to be a waste of time and effort. You're going to have to put in quite a bit of work before you accomplish much (and keeping it reasonably fast on an Arduino won't be trivial either).
Edit: okay, if you're absolutely certain you can't modify the data at all, things get more difficult very quickly. The real question at that point is what kind of input you're dealing with. Others have already mentioned the possibility of something on the order of predictive delta compression - e.g., based on preceding pixels, predict what the next one is likely to be, and then encode only the difference between the prediction and the actual value. Getting the most out of that, however, generally requires running the result through some sort of entropy-based algorithm like Shannon-Fano or Huffman coding. Those, unfortunately, aren't usually the fastest to decompress, though.
If your data is most things like charts or graphs, where you can expect to have large areas of identical pixels, run-length (or run-end) encoding can work pretty well. This does have the advantage of being really trivial to decompress as well.
I doubt that LZ-based compression is going to work so well, though. LZ-based compression works (in general) by building a dictionary of strings of bytes that have been seen, and when/if the same string of bytes is seen again, transmitting the code assigned to the previous instance instead of re-transmitting the entire string. The problem is that you can't transmit uncompressed bytes - you start out by sending the code word that represents each byte in the dictionary. In your case, you could use (for example) a 10-bit code word. This means the first time you send any particular character, you need to send it as 10 bits, not just 8. You only start to get some compression when you can build up some longer (two-byte, three-byte, etc.) strings in your dictionary and find a matching string later in the input.
This means LZ-based compression usually gets fairly poor compression for the first couple hundred bytes or so, then about breaks even for a while, and only after it's been running across some input for a while does it really start to compress well. Dealing with only 800 bytes at a time, I'm not at all sure you're ever going to see much compression -- in fact, working in such small blocks, it wouldn't be particularly surprising to see the data expand on a fairly regular basis (especially if it's very random).
The data is more or less random I'd say since it represents an RGB color value for every 16 bits.
What would be the best way to compress this data? Any idea how much compression I could possibly get?
Ideally you can compress 800 bytes of colour data down to one byte if the whole image is the same colour. As Oli Charlesworth mentions, however, the more random the data, the less you can compress it. If your image looks like static on a TV, then indeed, good luck getting any compression out of it.
Definitely consider Oli Charlesworth's answer. On a 20x20 grid, I don't know if you need a full 32k color palette.
Also, in your earlier question, you said you were trying to run this on a 20ms period (50 Hz). Do you really need that much speed for this display? At 115200 bps, you can transmit ~11520 bytes/sec - call it 10KBps for a margin of safety (e.g. your micro might have a delay between bytes, you should do some experiments to see what the 'real' bandwidth is). At 50 Hz, this only allows you about 200 bytes per packet - you're looking for a compression ratio over 75%, which may not be attainable under any circumstances. You seem pretty married to your requirements, but it may be time for an awkward chat.
If you do want to go the compression route, you will probably just have to try several different algorithms with 'real' data, as others have said, and try different encodings. I bet you can find some extra processing time by doing matrix math, etc. in between receiving bytes over the serial link (you'll have about 80 microseconds between bytes) - if you use interrupts to read the serial data instead of polling, you can probably do pretty well by using a double buffer and processing/displaying the previous buffer while reading into the current buffer.
EDIT:
Is it possible to increase the serial port speed beyond 115200? This USB-serial adapter at Amazon says it goes up to 1 Mbps (probably actually 921600 bps). Depending on your hardware and environment, you may have to worry about bad data, but if you increase the speed enough, you could probably add a checksum, and maybe even limited error correction.
I'm not familiar with the Arduino, but I've got an 8-bit FreeScale HCS08 I drive at 1.25 Mbps, although the bus is actually running RS-485, not RS-232 (485 uses differential signaling for better noise performance), and I don't have any problems with noise errors. You might even consider a USB RS-485 adapter, if you can wire that to your Arduino (you'd need conversion hardware to change the 485 signals to the Arduino's levels).
EDIT 2:
You might also consider this USB-SPI/I2C adapter, if you have an available I2C or SPI interface, and you can handle the wiring. It says it can go to 400 kHz I2C or 200 kHz SPI, which is still not quite enough by itself, but you could split the data between the SPI/I2C and the serial link you already have.
LZ77/78 are relatively easy to write: http://en.wikipedia.org/wiki/LZ77_and_LZ78
However, given the small amount of data you're transferring, it's probably not worth compressing it at all.