What is the best multi-part Base64 encoder in Java?

I have tested different Base64 encoders (MiGBase64, iHarder, the sun.misc one, etc.). All of them seem to require the whole input to be in memory for conversion.
If I want to encode a large file (stream) > 1 GB in a multi-threaded fashion, which codec implementation can be used without corrupting the file? Commons Codec has the Base64OutputStream wrapper; are there other solutions?
To make it clear: I have a 1 TB file that needs to be Base64 encoded, on a machine with 2 GB of RAM. What is the fastest way to do it in Java?

I'm not sure which encoder is fastest offhand; you'll have to measure each to determine that. However, you can avoid the memory problem and get concurrency by splitting the file into chunks. Just make sure you split on a multiple of 3 bytes, since each 3-byte input group encodes to exactly 4 Base64 characters.
I'd recommend picking a reasonable chunk size and using an ExecutorService to manage a fixed number of threads to do the processing. You can share a RandomAccessFile between them and write to the appropriate places. You'll of course have to calculate the output chunk offsets (just divide by 3 and multiply by 4).
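To see why the 3-byte boundary matters: chunks cut at multiples of 3 encode independently, and their encoded results can simply be concatenated. A minimal sketch (class and method names are just illustrative):

```java
import java.util.Arrays;
import java.util.Base64;

public class ChunkedBase64 {
    // Output offset for an input chunk starting at inOffset,
    // which must be a multiple of 3.
    static long outputOffset(long inOffset) {
        return inOffset / 3 * 4;
    }

    // Encode one chunk independently; safe as long as chunk
    // boundaries fall on multiples of 3 bytes.
    static String encodeChunk(byte[] data, int from, int to) {
        return Base64.getEncoder().encodeToString(Arrays.copyOfRange(data, from, to));
    }

    public static void main(String[] args) {
        byte[] data = "Hello, chunked Base64 world!".getBytes(); // 28 bytes
        // Split at a 3-byte boundary (offset 12): each part encodes independently.
        String whole = Base64.getEncoder().encodeToString(data);
        String joined = encodeChunk(data, 0, 12) + encodeChunk(data, 12, data.length);
        System.out.println(whole.equals(joined)); // true: concatenation matches
    }
}
```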
Honestly, though, you might not see much performance gain from concurrency here; it could just overwhelm the hard drive with random access. I'd start by chunking the file with a single thread and seeing how fast that is first. You can probably crunch a 1 GB file faster than you think; as a rough guess, about a minute on modern hardware, even writing to the same drive you're reading from.
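For that single-threaded baseline, the JDK itself can encode with constant memory: Base64.getEncoder().wrap() is the standard-library counterpart of Commons Codec's Base64OutputStream. A sketch (encodeFile is an illustrative name):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class StreamingBase64 {
    // Encode a file of any size using a fixed-size buffer; memory use
    // is independent of file size.
    static void encodeFile(Path in, Path out) throws IOException {
        try (InputStream src = Files.newInputStream(in);
             OutputStream dst = Base64.getEncoder().wrap(
                     new BufferedOutputStream(Files.newOutputStream(out)))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = src.read(buf)) != -1) {
                dst.write(buf, 0, n);
            }
        } // closing the wrapping stream flushes any final padding
    }
}
```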

Related

Patterns for efficient reading from Java MemorySegment

I am working on using Java to read (potentially) large amounts of data from (potentially) large files; the scenario is uncompressed imagery from a file format like HEIF. Files larger than 2 GB are likely. Writing is a future need, but this question is scoped to reading.
The HEIF format (which is derived from ISO Base Media File Format, ISO/IEC 14496-12) is a series of variable-size "boxes": you read the length and kind of box, then do whatever parsing is appropriate for that box. In my design, I'll parse out the small-ish boxes and keep references to the bulk storage (mdat) offsets, to be able to pull the data out for rendering / processing as requested.
I'm considering two options: multiple MappedByteBuffers (since each is 2 GB limited), or a single MemorySegment (from a memory-mapped file). It's not clear to me which is likely to be more efficient. The MappedByteBuffer has all the nice ByteBuffer API, but I need to manage multiple entities. The MemorySegment will be a single entity, but it looks like I'll need to create slice views to get anything I can read from (e.g. a byte array or ByteBuffer), which looks like a different version of the same problem. A secondary benefit of the MemorySegment is that it may lead to a nicer design when I need to use some other non-Java API (like feeding the imagery into a hardware encoder for compression). I also have the skeleton of the MemorySegment version implemented and reading (just with some gross assumptions that I can turn it into a single ByteBuffer).
Are there emerging patterns for efficient reading from a MemorySegment? Failing that, is there something I'm missing in the MemorySegment API?
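Not an established pattern as far as I know, but one workable approach is to read box headers directly off the segment with ValueLayout accessors and only create slice views where a ByteBuffer is genuinely needed; slices are cheap views, not copies. A rough sketch, assuming Java 22+ (where java.lang.foreign is final); BoxReader and its methods are made-up names:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class BoxReader {
    // ISO BMFF box headers are a big-endian 32-bit size followed by a
    // 4-character type code; read both straight from the segment.
    static final ValueLayout.OfInt BE_INT =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.BIG_ENDIAN);

    static long boxSize(MemorySegment seg, long offset) {
        return Integer.toUnsignedLong(seg.get(BE_INT, offset));
    }

    static String boxType(MemorySegment seg, long offset) {
        byte[] type = seg.asSlice(offset + 4, 4).toArray(ValueLayout.JAVA_BYTE);
        return new String(type, StandardCharsets.US_ASCII);
    }

    // Bulk payload (e.g. part of mdat) handed to ByteBuffer-based APIs
    // via a slice view; no copy is made.
    static ByteBuffer payload(MemorySegment seg, long offset, long length) {
        return seg.asSlice(offset, length).asByteBuffer();
    }
}
```

The same accessors work whether the segment wraps a heap array or a memory-mapped file (FileChannel.map with an Arena), so the parsing code stays independent of how the bytes got into memory.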

Processing a large (GB) file, quickly and multiple times (Java)

What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory; the expansion is too large (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, but FileChannel#map() cannot reliably map more than 2^31-1 bytes, since a MappedByteBuffer is indexed by int.
However, what you can do is wrap a FileChannel in a class and create several "map ranges" covering the whole file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I also have to do character decoding, and the decoded text must be loaded into memory, unlike you, who reads raw bytes. Less complicated because I have a defined JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in this project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or, well, whatever suits you. Your main problem will be to define that interface. I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day, largetext is quite exciting a project and this looks like the same kind of thing; except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory creates such mappings with only a small part in memory and the rest backed by the file; such an implementation could also exploit locality: the most recently queried part of the file would be kept in memory for faster access.
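To make the "map ranges" idea concrete, here is a minimal sketch of such a wrapper without Guava, covering an arbitrarily long file with fixed-size mapped windows (LargeByteMapping is the hypothetical name from above, fleshed out with a single get method; a real interface would offer bulk reads too):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LargeByteMapping implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer[] windows;
    private final long windowSize;

    public LargeByteMapping(Path file, long windowSize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.windowSize = windowSize;
        long size = channel.size();
        int n = (int) ((size + windowSize - 1) / windowSize);
        this.windows = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long off = i * windowSize;
            windows[i] = channel.map(FileChannel.MapMode.READ_ONLY,
                    off, Math.min(windowSize, size - off));
        }
    }

    public byte get(long pos) {
        // Each window is int-indexed, so windowSize must stay below 2^31-1.
        return windows[(int) (pos / windowSize)].get((int) (pos % windowSize));
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```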
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only get sophisticated query methods on your data, you could also mangle it using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary, and the dictionary will refer it to the big fat file.

Improving performance of protocol buffers

I'm writing an application that needs to quickly deserialize millions of messages from a single file.
What the application does is essentially get one message from the file, do some work, and then throw the message away. Each message is composed of ~100 fields (not all of them are always parsed, but I need them all, because the user of the application can decide which fields to work on).
At the moment the application consists of a loop that reads one message per iteration with a readDelimitedFrom() call.
Is there a way to optimize for this access pattern (splitting into multiple files, etc.)? In addition, because of the number of messages and the size of each one, I need to gzip the file (which is fairly effective at reducing the size, since the field values are quite repetitive), but this reduces performance.
If CPU time is your bottleneck (which is unlikely if you are loading directly from HDD with cold cache, but could be the case in other scenarios), then here are some ways you can improve throughput:
If possible, use C++ rather than Java, and reuse the same message object for each iteration of the loop. This reduces the amount of time spent on memory management, as the same memory will be reused each time.
Instead of using readDelimitedFrom(), construct a single CodedInputStream and use it to read multiple messages like so:
// Do this once:
CodedInputStream cis = CodedInputStream.newInstance(input);
// Then read each message like so:
int limit = cis.pushLimit(cis.readRawVarint32()); // read the length prefix and bound the stream to it
builder.mergeFrom(cis);                           // parse exactly one message
cis.popLimit(limit);                              // restore the previous limit
cis.resetSizeCounter();                           // reset so the total-size limit isn't tripped
(A similar approach works in C++.)
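For context on what that pushLimit/popLimit dance is doing: delimited protobuf framing is just a base-128 varint length prefix before each message. A dependency-free sketch of that framing (names are illustrative, not protobuf API; error handling at EOF is omitted):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class VarintFraming {
    static void writeVarint(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {        // while more than 7 bits remain
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVarint(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        while (((b = in.read()) & 0x80) != 0) { // continuation bit set
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result | (b << shift);
    }

    // One frame = varint length + payload: the delimited format.
    static void writeFrame(OutputStream out, byte[] msg) throws IOException {
        writeVarint(out, msg.length);
        out.write(msg);
    }

    static byte[] readFrame(InputStream in) throws IOException {
        byte[] msg = new byte[readVarint(in)];
        new DataInputStream(in).readFully(msg);
        return msg;
    }
}
```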
Use Snappy or LZ4 compression rather than gzip. These algorithms still get reasonable compression ratios but are optimized for speed. (LZ4 is probably better, though Snappy was developed by Google with Protobufs in mind, so you might want to test both on your data set.)
Consider using Cap'n Proto rather than Protocol Buffers. EDIT: There is now capnproto-java, as well as implementations in many other languages. In the languages it supports, it has been shown to be quite a bit faster. (Disclosure: I am the author of Cap'n Proto. I am also the author of Protocol Buffers v2, which is the version Google released as open source.)
I expect that the majority of the time spent by your CPU is in garbage collection. I would look to replace the default garbage collector with one better suited for your use case of short lived objects.
If you do decide to write this in C++ - use an Arena to create the first message before parsing: https://developers.google.com/protocol-buffers/docs/reference/arenas

How should I manage memory in mobile audio mixing software?

I'm toying around with creating a pure Java audio mixing library, preferably one that can be used with Android, not entirely practical but definitely an interesting thing to have. I'm sure it's been done already, but just for my own learning experience I am trying to do this with wav files since there are usually no compression models to work around.
java.io defines many InputStream-type classes. Each implements operations that are primarily for reading data from some underlying resource. What you do with the data afterward (dump it, aggregate it in your own address space, etc.) is up to you. I want this to be purely Java, i.e. works on anything (no JNI necessary), optimized for low-memory configurations, and simple to extend.
I understand the nature of the RIFF format and how to assemble the PCM sample data, but I'm at a loss for the best way of managing the memory required for inflating the files into memory. Using a FileInputStream, only so much of the data is read at a time, depending on the underlying file system and how the read operations are invoked. FileInputStream doesn't furnish a way of indexing where in the file you are, so retrieving streams for mixing later is not possible. My goal would be to inflate the RIFF document into Java objects that allow reading and writing of the appropriate regions of the underlying chunk.
If I allocate space for the entire thing, e.g. all the PCM sample data, that's about 50 MB per average song. On a typical smartphone or tablet, how likely is it that this will affect overall performance? Would I be better off coming up with my own InputStream type that keeps track of where the chunks are in the stream? For files, this will result in lots of blocking when fetching PCM samples, but it will still cut down on the overall memory footprint.
I'm not sure I understand all of your question, but I'll answer what I can. Feel free to clarify in the comments, and I'll edit.
Don't keep all the file data in memory for a DAW-type app, or for any file/video player that expects to play large files. This might work on some devices depending on the memory model, but you are asking for trouble.
Instead, read the required section of the file as needed (i.e. on demand). It's actually a bit more complex than that, because you don't want to read the file in the audio playback thread (you don't want audio playback, which is low-latency, to depend on file I/O, which is high-latency). To get around that, you may have to buffer some of the file in advance. (It depends on whether you are using a callback or blocking model.)
Using FileInputStream works fine; you'll just have to keep track of where everything is in the file yourself (this involves converting milliseconds or whatever to samples to bytes, and taking into account the size of the header[1]). A slightly better option is RandomAccessFile, because it allows you to jump around.
My slides from a talk on programming audio software might help, especially if you are confused by callback vs. blocking: http://blog.bjornroche.com/2011/11/slides-from-fundamentals-of-audio.html
[1] or, more correctly, knowing the offset of the audio data in the file.
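The milliseconds-to-samples-to-bytes conversion mentioned above can be sketched like this. The CD-quality parameters and the 44-byte data offset are assumptions for a canonical minimal PCM WAV; as the footnote says, real files should be parsed for the actual offset of the data chunk:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PcmSeeker {
    // Convert a time position into a byte offset within the PCM data:
    // sample frame index first, then frames -> bytes.
    static long byteOffset(long ms, int sampleRate, int channels,
                           int bytesPerSample, long dataOffset) {
        long frame = ms * sampleRate / 1000;
        return dataOffset + frame * channels * bytesPerSample;
    }

    // Jump straight to the computed position and fill dest;
    // assumes 44.1 kHz, stereo, 16-bit, 44-byte header.
    static void readAt(RandomAccessFile raf, long ms, byte[] dest)
            throws IOException {
        raf.seek(byteOffset(ms, 44_100, 2, 2, 44));
        raf.readFully(dest);
    }
}
```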

Java: multithreaded character stream decoding

I am maintaining a high-performance CSV parser and am trying to get the most out of the latest technology to improve throughput. For this particular task, that means:
Flash memory (We own a relatively inexpensive PCI-Express card, 1 TB of storage that reaches 1 GB/s sustained read performance)
Multiple cores (We own a cheap Nehalem server with 16 hardware threads)
The first implementation of the CSV parser was single threaded. File reading, character decoding, field splitting, text parsing, all within the same thread. The result was a throughput of about 50MB/s. Not bad but well below the storage limit...
The second implementation uses one thread to read the file (at the byte level), one thread to decode the characters (from ByteBuffer to CharBuffer), and multiple threads to parse the fields (I mean parsing delimited text fields into doubles, integers, dates...). This works much faster, close to 400 MB/s on our box.
But that is still well below the performance of our storage, and those SSDs will keep improving; we are not getting the most out of them in Java. It is clear that the current limitation is the character decoding (CharsetDecoder.decode(...)). That is the bottleneck: on a powerful Nehalem processor it transforms bytes into chars at 400 MB/s, which is pretty good, but this has to be single-threaded. The CharsetDecoder is somewhat stateful, depending on the charset used, and does not support multithreaded decoding.
So my question to the community is (and thank you for reading the post so far): does anyone know how to parallelize the charset decoding operation in Java?
does anyone know how to parallelize the charset decoding operation in Java?
You might be able to open multiple input streams to do this (I'm not sure how you'd go about this with NIO, but it must be possible).
How difficult this would be depends on the encoding you're decoding from. You will need a bespoke solution for the target encoding. If the encoding has a fixed width (e.g. Windows-1252), then one byte == one character and decoding is easy.
Modern variable-width encodings (like UTF-8 and UTF-16) contain rules for identifying the first byte of a character sequence, so it is possible to jump to the middle of a file and start decoding (you'll have to note the end of the previous chunk, so it is wise to start decoding the end of the file first).
Some legacy variable-width encodings might not be this well designed, so you'll have no option but to decode from the start of the data and read it sequentially.
If it is an option, generate your data as UTF-16BE. Then you can cut out decoding and read two bytes straight to a char.
If the file is Unicode, watch out for BOM handling, but I'm guessing you're already familiar with many of the low-level details.
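The UTF-16BE shortcut is easy to make concrete: with no decoder state, any even split point is safe, so chunks can be turned into chars independently and joined (even surrogate halves land in the right char slots, since each is its own 2-byte unit). A small sketch:

```java
import java.nio.charset.StandardCharsets;

public class Utf16beDecode {
    // Decode a byte range as UTF-16BE by hand: every char is exactly
    // two bytes, high byte first. `from` and `to` must be even offsets.
    static char[] decode(byte[] bytes, int from, int to) {
        char[] out = new char[(to - from) / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = bytes[from + 2 * i] & 0xFF;
            int lo = bytes[from + 2 * i + 1] & 0xFF;
            out[i] = (char) ((hi << 8) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "parallel".getBytes(StandardCharsets.UTF_16BE);
        // Two halves decoded independently, then joined:
        String s = new String(decode(data, 0, 8)) + new String(decode(data, 8, data.length));
        System.out.println(s); // parallel
    }
}
```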
It is clear that the current limitation is the character decoding (CharsetDecoder.decode(...))
How do you know that? Does your monitoring / profiling show conclusively that the decoder thread is using 100% of one of your cores?
Another possibility is that the OS is not capable of driving the SSD at its theoretical maximum speed.
If UTF-8 decoding is definitely the bottleneck then it should be possible to do the task in parallel. But you will certainly need to implement your own decoders to do this.
Another (crazy) alternative would be to just separate the input into chunks of some arbitrary size, ignore the decoding issues and then decode each of the chunks in parallel. However, you want to ensure that the chunks overlap (with a parametrized size). If the overlapping region of the two chunks is decoded the same way by the two threads (and your overlap was big enough for the specified encoding) it should be safe to join the results. The bigger the overlap, the more processing required, and the smaller the probability of error. Furthermore, if you are in a situation where you know the encoding is UTF-8, or a similarly simple encoding, you could set the overlap quite low (for that client) and still be guaranteed correct operation.
If the second chunk turns out to be wrong, you will have to redo it, so it is important not to make the chunks too big. If you do more than two chunks in parallel, it is important to "repair" from beginning to end, so that one misaligned block does not invalidate the next block (which might be correctly aligned).
If you know the encoding, and it is either fixed size, or does not contain overlapping byte sequences, you could scan for a special sequence. In CSV, a sequence for newlines might make sense. Even if you dynamically detect the encoding, you could run a pass of the first few bytes to determine encoding, and then move on to parallel decoding.
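For UTF-8 specifically, the split-then-decode idea needs no overlap at all, because UTF-8 is self-synchronizing: a byte is a continuation byte iff its top two bits are 10, so any proposed split point can be nudged back to the start of a character sequence and each chunk decoded independently. A sketch (ParallelUtf8 is an illustrative name):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelUtf8 {
    // Move a proposed split point left until it lands on a byte that
    // starts a character sequence (i.e. is not 10xxxxxx).
    static int alignToCharStart(byte[] data, int pos) {
        if (pos >= data.length) return data.length;
        while (pos > 0 && (data[pos] & 0xC0) == 0x80) pos--;
        return pos;
    }

    static String decode(byte[] data, int chunkSize) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        while (start < data.length) {
            int end = alignToCharStart(data, Math.min(start + chunkSize, data.length));
            if (end == start) end = data.length; // chunk smaller than one char
            ranges.add(new int[]{start, end});
            start = end;
        }
        // Decode each aligned chunk on its own thread; joining preserves order.
        return ranges.parallelStream()
                .map(r -> new String(data, r[0], r[1] - r[0], StandardCharsets.UTF_8))
                .collect(Collectors.joining());
    }
}
```

The same realignment trick combines naturally with the newline scan suggested above: align to a character start first, then (for CSV) advance to the next record separator so that chunks also respect row boundaries.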
