GZIPOutputStream that does its compression in a separate thread - java

Is there an implementation of GZIPOutputStream that would do the heavy lifting (compressing + writing to disk) in a separate thread?
We are continuously writing huge amounts of GZIP-compressed data. I am looking for a drop-in replacement that could be used instead of GZIPOutputStream.

You can write to a PipedOutputStream and have a thread which reads the PipedInputStream and copies it to any stream you like.
This is a generic implementation. You give it an OutputStream to write to and it returns an OutputStream for you to write to.
public static OutputStream asyncOutputStream(final OutputStream out) throws IOException {
PipedOutputStream pos = new PipedOutputStream();
final PipedInputStream pis = new PipedInputStream(pos);
new Thread(new Runnable() {
@Override
public void run() {
try {
byte[] bytes = new byte[8192];
for(int len; (len = pis.read(bytes)) > 0;)
out.write(bytes, 0, len);
} catch(IOException ioe) {
ioe.printStackTrace();
} finally {
close(pis);
close(out);
}
}
}, "async-output-stream").start();
return pos;
}
static void close(Closeable closeable) {
if (closeable != null) try {
closeable.close();
} catch (IOException ignored) {
}
}
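Note that the pipe only decides which thread runs the stream you hand over. To move the compression itself off the caller's thread, the GZIPOutputStream has to sit on the worker's side of the pipe. Here is a minimal sketch of that idea; the class name AsyncGzipOutputStream, the helper roundTrip and the 64 KiB pipe size are assumptions made for illustration, not part of the answer above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical wrapper: callers write raw bytes, a worker thread gzips them into 'target'. */
final class AsyncGzipOutputStream extends FilterOutputStream {
    private final Thread worker;
    private volatile IOException failure; // set by the worker if compressing or writing fails

    AsyncGzipOutputStream(OutputStream target) throws IOException {
        super(new PipedOutputStream());
        // The 64 KiB pipe buffer is an arbitrary choice for this sketch.
        PipedInputStream pipeIn = new PipedInputStream((PipedOutputStream) out, 64 * 1024);
        worker = new Thread(() -> {
            // Closing 'gz' finishes the gzip trailer and also closes 'target'.
            try (InputStream in = pipeIn; OutputStream gz = new GZIPOutputStream(target)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    gz.write(buf, 0, n); // compression and the write to 'target' happen here
                }
            } catch (IOException e) {
                failure = e;
            }
        }, "async-gzip");
        worker.start();
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bypass FilterOutputStream's byte-at-a-time default
    }

    @Override
    public void close() throws IOException {
        out.close(); // signals end of data to the worker
        try {
            worker.join(); // wait until everything is compressed and written
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new InterruptedIOException("interrupted while waiting for the gzip worker");
        }
        if (failure != null) {
            throw failure;
        }
    }

    /** Demo helper: compresses via the worker thread, then decompresses again. */
    static byte[] roundTrip(byte[] data) {
        try {
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (OutputStream os = new AsyncGzipOutputStream(compressed)) {
                os.write(data);
            }
            try (InputStream in = new GZIPInputStream(
                    new ByteArrayInputStream(compressed.toByteArray()))) {
                return in.readAllBytes(); // Java 9+
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because close() joins the worker, the target is complete once close() returns, which keeps the behaviour close to a plain GZIPOutputStream. All writes are assumed to come from a single thread, since the piped streams track their writer thread.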

I published some code that does exactly what you are looking for. It has always frustrated me that Java doesn't automatically pipeline calls like this across multiple threads, in order to overlap computation, compression, and disk I/O:
https://github.com/lukehutch/PipelinedOutputStream
This class splits writing to an OutputStream into separate producer and consumer threads (actually, starts a new thread for the consumer), and inserts a blocking bounded buffer between them. There is some data copying between buffers, but this is done as efficiently as possible.
You can even layer this twice to do the disk writing in a separate thread from the gzip compression, as shown in README.md.

Related

Read from ByteArrayOutputStream while it's being written to

I have a class that is constantly producing data and writing it to a ByteArrayOutputStream on its own thread. I have a 2nd thread that gets a reference to this ByteArrayOutputStream. I want the 2nd thread to read any data (and empty) the ByteArrayOutputStream and then stop when it doesn't get any bytes and sleep. After the sleep, I want it to try to get more data and empty it again.
The examples I see online say to use PipedOutputStream. If my first thread is making the ByteArrayOutputStream available to the outside world from a separate reusable library, I don't see how to hook up the inputStream to it.
How would one set up the PipedInputStream to connect it to the ByteArrayOutputStream to read from it as above? Also, when reading the last block from the ByteArrayOutputStream, will I see bytesRead == -1, indicating when the outputStream is closed from the first thread?
Many thanks,
Mike
Write to the PipedOutputStream directly (that is, don't use a ByteArrayOutputStream at all). They both extend OutputStream and so have the same interface.
There are connect methods in both PipedOutputStream and PipedInputStream that are used to wire two pipes together, or you can use one of the constructors to create a pair.
Writes to the PipedOutputStream will block when the buffer in the PipedInputStream fills up, and reads from the PipedInputStream will block when the buffer is empty, so the producer thread will sleep (block) if it gets "ahead" of the consumer and vice versa.
After blocking, a thread waits for up to 1000ms before rechecking the buffer, so it's good practice to flush the output after writes complete (this will wake the reader if it is sleeping).
Your input stream will see the EOF (bytesRead == -1) when you close the output stream in the producer thread.
import java.io.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class PipeTest {
public static void main(String[] args) throws IOException {
PipedOutputStream out = new PipedOutputStream();
// Wire an input stream to the output stream, and use a buffer of 2048 bytes
PipedInputStream in = new PipedInputStream(out, 2048);
ExecutorService executor = Executors.newCachedThreadPool();
// Producer thread.
executor.execute(() -> {
try {
for (int i = 0; i < 10240; i++) {
out.write(0);
// flush to wake the reader
out.flush();
}
out.close();
} catch (IOException e) {
throw new UncheckedIOException(e);
}
});
// Consumer thread.
executor.execute(() -> {
try {
int b, read = 0;
while ((b = in.read()) != -1) {
read++;
}
System.out.println("Read " + read + " bytes.");
} catch (IOException e) {
throw new UncheckedIOException(e);
}
});
executor.shutdown();
}
}

Bounded Stream like implementation in java

Goal:
I am looking for a bounded byte stream like implementation in java to have "write" and "read" functionality.
The write should be non-blocking and should drop if the capacity has been reached.
Ideally, the max size of the stream should be given in the beginning so that dynamic memory allocation overhead is not present.
What I have tried:
I have a shared data structure, the following Buffer class, shared between the producer and consumer:
public class Buffer<T> {
private ArrayBlockingQueue<T> buffer;
public Buffer(int maxSize) {
buffer = new ArrayBlockingQueue<T>(maxSize);
}
public boolean offer(T o) {
return buffer.offer(o);
}
public T take(){
try {
return buffer.take();
} catch (InterruptedException e) {
e.printStackTrace();
}
return null;
}
public static void main(String[] args){
Buffer<String> buffer = new Buffer<String>(10);
buffer.offer("hello");
System.out.println(buffer.take());
}
}
A Producer in Thread 1 which reads from a socket and sends data as follows:
private InputStream in;
while ( (bytesRead = in.read(request)) != -1) {
this.register.buffer.offer(new ByteObject(request,bytesRead));
}
A Consumer in Thread 2 which reads from the buffer and writes to a socket:
private OutputStream out;
while(true){
ByteObject poll = register.buffer.take();
bytesRead = poll.len;
if(bytesRead>0){
out.write(poll.arr,0,bytesRead);
out.flush();
}else
break;
}
Problem with current implementation:
Since it is not a stream implementation, I am unable to detect EOF in the consumer. Can anyone give an example of bounded stream operations in Java that offer a non-blocking write option (like offer())?
Alternatively, how would I get EOF in the current code?
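One common way to get EOF out of a queue-based design like the one above is a sentinel ("poison pill") element. The following is a minimal sketch of that technique, not a full stream implementation; the class BoundedChunkQueueSketch and its methods are hypothetical names. Data chunks are offered without blocking and dropped when the queue is full, while the EOF marker is enqueued with the blocking put() so that it can never be dropped. The sketch copies each chunk, so it does not meet the "no dynamic allocation" wish from the question.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded queue of byte chunks with an explicit end-of-stream marker. */
final class BoundedChunkQueueSketch {
    // Sentinel compared by identity; data chunks are always fresh copies, so they never match it.
    private static final byte[] EOF = new byte[0];
    private final BlockingQueue<byte[]> queue;

    BoundedChunkQueueSketch(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking write: returns false (the chunk is dropped) when the queue is full. */
    boolean offer(byte[] src, int len) {
        return queue.offer(Arrays.copyOf(src, len));
    }

    /** Blocking on purpose: the EOF marker must not be dropped. */
    void finish() throws InterruptedException {
        queue.put(EOF);
    }

    /** Blocks for the next chunk; returns null once the producer has called finish(). */
    byte[] take() throws InterruptedException {
        byte[] chunk = queue.take();
        return chunk == EOF ? null : chunk;
    }

    /** Demo: one producer thread, the calling thread as consumer. */
    static String demo() {
        BoundedChunkQueueSketch q = new BoundedChunkQueueSketch(16);
        Thread producer = new Thread(() -> {
            try {
                for (String s : new String[] {"a", "bc", "def"}) {
                    byte[] b = s.getBytes(StandardCharsets.UTF_8);
                    q.offer(b, b.length); // capacity 16 is never reached here
                }
                q.finish();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            for (byte[] chunk; (chunk = q.take()) != null; ) {
                sink.write(chunk, 0, chunk.length);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The sketch assumes a single consumer: once take() has returned null, a further call would block because the marker has been removed from the queue.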

writing to OutputStream having capacity restriction

Following the question I asked before: I am implementing a ByteArrayOutputStream with a capacity restriction. My main limitation is the amount of available memory. So, given such a stream os:
1) When I write more than, say, 1MB to the output stream, I need to "stop". I prefer not to throw an exception but to write the complete contents of os to another output stream given as an argument:
OutputStream out;
os.writeTo(out);
After that, I continue writing to os from its beginning.
2) In order to prevent the situation described in 1), I prefer to drain os as frequently as possible, that is, copy the data from it to out in chunks of 512KB.
Is it feasible? If yes, any advice on how it can be done? Or maybe there is a built-in class which meets my requirements?
Edit: The number of bytes written to out is also limited. I can write up to 1GB there. If I have more, I need to create another output stream to drain os into.
The process of writing to os can go like this: 500MB is written there, and I transfer it immediately to out. After several seconds, 700MB is written there; I need to drain only 500MB to out and the other 200MB to another output stream (out2), which I'll need to create in that situation.
What you are describing is a BufferedOutputStream, which you can construct like this:
new BufferedOutputStream(out, 512000)
The first arg is the other outputstream you have and the second one is the size of the BufferedOutputStream internal buffer
EDIT:
OK, I did not fully understand your need at first. You will indeed need to extend OutputStream to achieve that. Here is some sample code, starting with how to use the class:
public static void main(String[] args) throws IOException {
AtomicLong idx = new AtomicLong(0);
try (
OutputStream out = new OutputStreamMultiVolume(10, () -> new FileOutputStream(getNextFilename(idx)));
) {
out.write("01234567890123456789012345678901234567890123456789".getBytes("UTF-8"));
}
}
private static File getNextFilename(AtomicLong idx) {
return new File("sample.file." + idx.incrementAndGet() + ".txt");
}
The first constructor arg of OutputStreamMultiVolume is the max size of a volume. If we reach this size, we will close the current outputStream, and call the OutputStreamSupplier to get the next one.
The example code here will write the String 01234567890123456789012345678901234567890123456789 (5 times 0123456789) to files named 'sample.file.idx.txt' where idx is increased each time we reach the outstream max size (so you'll get 5 files).
And the class itself:
public class OutputStreamMultiVolume extends OutputStream {
private final long maxBytePerVolume;
private long bytesInCurrentVolume = 0;
private OutputStream out;
private OutputStreamSupplier outputStreamSupplier;
static interface OutputStreamSupplier {
OutputStream get() throws IOException;
}
public OutputStreamMultiVolume(long maxBytePerOutput, OutputStreamSupplier outputStreamSupplier) throws IOException {
this.outputStreamSupplier = outputStreamSupplier;
this.maxBytePerVolume = maxBytePerOutput;
this.out = outputStreamSupplier.get();
}
@Override
public synchronized void write(byte[] bytes) throws IOException {
// Delegate so that the volume-splitting logic lives in one place.
write(bytes, 0, bytes.length);
}
@Override
public synchronized void write(int b) throws IOException {
if (bytesInCurrentVolume + 1 <= maxBytePerVolume) {
out.write(b);
bytesInCurrentVolume += 1;
return;
}
switchOutput();
out.write(b);
bytesInCurrentVolume += 1;
}
@Override
public synchronized void write(byte[] b, int off, int len) throws IOException {
// Clamp before narrowing so that very large volume sizes do not overflow the int cast.
final int remainingBytesInVol = (int) Math.min(maxBytePerVolume - bytesInCurrentVolume, Integer.MAX_VALUE);
if (remainingBytesInVol >= len) {
out.write(b, off, len);
bytesInCurrentVolume += len;
return;
}
out.write(b, off, remainingBytesInVol);
switchOutput();
// The recursive call updates bytesInCurrentVolume itself; adding to it again here
// would double-count and corrupt the counter for later writes.
this.write(b, off + remainingBytesInVol, len - remainingBytesInVol);
}
private void switchOutput() throws IOException {
out.flush();
out.close();
out = outputStreamSupplier.get();
bytesInCurrentVolume = 0;
}
@Override
public synchronized void close() throws IOException {
out.close();
}
@Override
public synchronized void flush() throws IOException {
out.flush();
}
}
I'm afraid that your original question was not fully explained, and neither were the answers you got.
You should neither use nor extend ByteArrayOutputStream for flushing, because its main feature is to "write data into a byte array", i.e. all the data is held in memory so that you can retrieve it later through toByteArray.
If you want to flush your excess data, you need a buffered approach. This construction is enough:
// Built in one expression so that 'out' stays effectively final
// and can be used inside the TimerTask.
OutputStream out=new BufferedOutputStream(new FileOutputStream(...), 1024*1024);
In order to flush the data periodically, you can schedule a TimerTask to invoke flush:
Timer timer=new Timer(true);
TimerTask timerTask=new TimerTask(){
public void run()
{
try
{
out.flush();
}
catch (IOException e)
{
...
}
}
};
timer.schedule(timerTask, delay, period);
I guess you could try using a java.nio.ByteBuffer in combination with java.nio.channels.Channels, which has a method newChannel(OutputStream).
Like so:
ByteBuffer buffer = ByteBuffer.allocate(1024 * 1024);
//... use buffer
OutputStream out = ...
drainBuffer(buffer, out);
and
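public void drainBuffer(ByteBuffer buffer, OutputStream stream) throws IOException {
WritableByteChannel channel = Channels.newChannel(stream);
buffer.flip(); // switch the buffer from being filled to being drained
while (buffer.hasRemaining()) {
channel.write(buffer);
}
buffer.clear(); // make the buffer ready to be filled again
}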

Java example of using ExecutorService and PipedReader/PipedWriter (or PipedInputStream/PipedOutputStream) for consumer-producer

I'm looking for a simple producer-consumer implementation in Java and don't want to reinvent the wheel.
I couldn't find an example that uses both the new concurrency package and either of the Piped classes.
Is there an example for using both PipedInputStream and the new Java concurrency package for this?
Is there a better way without using the Piped classes for such a task?
For your task it might be sufficient to just use a single thread and write to the file using a BufferedOutputStream as you are reading from the database.
If you want more control over the buffer size and the size of chunks being written to the file, you can do something like this:
class Producer implements Runnable {
private final OutputStream out;
private final SomeDBClass db;
public Producer( OutputStream out, SomeDBClass db ){
this.out = out;
this.db = db;
}
public void run(){
// If you're writing to a text file you might want to wrap
// out in a Writer instead of using `write` directly.
while( db has more data ){
out.write( the data );
}
out.flush();
out.close();
}
}
class Consumer implements Runnable {
private final InputStream in;
private final OutputStream out;
public static final int CHUNKSIZE=512;
public Consumer( InputStream in, OutputStream out ){
this.out = out;
this.in = in;
}
public void run(){
byte[] chunk = new byte[CHUNKSIZE];
for( int bytesRead; -1 != (bytesRead = in.read(chunk,0,CHUNKSIZE)); ){
out.write(chunk, 0, bytesRead);
}
out.close();
}
}
And in the calling code:
FileOutputStream toFile = // Open the stream to a file
SomeDBClass db = // Set up the db connection
PipedInputStream pi = new PipedInputStream(); // Optionally specify a size
PipedOutputStream po = new PipedOutputStream( pi );
ExecutorService exec = Executors.newFixedThreadPool(2);
exec.submit( new Producer( po, db ) );
exec.submit( new Consumer( pi, toFile ) );
exec.shutdown();
Also catch any exceptions that might be thrown.
Note that if this is all you're doing, there isn't any advantage to using an ExecutorService. Executors are useful when you have many tasks (too many to launch all of them in threads at the same time). Here you have only two threads that have to run at the same time, so calling Thread#start directly will have less overhead.

Flaws with PipedInputStream/PipedOutputStream

I've seen two answers on SO that claim that the PipedInputStream and PipedOutputStream classes provided by Java are flawed. But they did not elaborate on what was wrong with them. Are they really flawed, and if so in what way? I'm currently writing some code that uses them, so I'd like to know whether I'm taking a wrong turn.
One answer said:
PipedInputStream and PipedOutputStream are broken (with regards to threading). They assume each instance is bound to a particular thread. This is bizarre.
To me that seems neither bizarre nor broken. Perhaps the author also had some other flaws in mind?
Another answer said:
In practice they are best avoided. I've used them once in 13 years and I wish I hadn't.
But that author could not recall what the problem was.
As with all classes, and especially classes used in multiple threads, you will have problems if you misuse them. So I do not consider the unpredictable "write end dead" IOException that PipedInputStream can throw to be a flaw (failing to close() the connected PipedOutputStream is a bug; see the article Whats this? IOException: Write end dead, by Daniel Ferbers, for more information). What other claimed flaws are there?
They are not flawed.
As with all classes, and especially classes used in multiple threads, you will have problems if you misuse them. The unpredictable "write end dead" IOException that PipedInputStream can throw is not a flaw (failing to close() the connected PipedOutputStream is a bug; see the article Whats this? IOException: Write end dead, by Daniel Ferbers, for more information).
I have used them successfully in my project, and they are invaluable for modifying streams on the fly and passing them around. The only drawback seemed to be that PipedInputStream had a small buffer (around 1024 bytes) while my output stream was pumping in around 8 KB.
There is no defect with it and it works perfectly well.
-------- Example in groovy
public class Runner{
final PipedOutputStream source = new PipedOutputStream();
PipedInputStream sink = new PipedInputStream();
public static void main(String[] args) {
new Runner().doit()
println "Finished main thread"
}
public void doit() {
sink.connect(source)
(new Producer(source)).start()
BufferedInputStream buffer = new BufferedInputStream(sink)
(new Consumer(buffer)).start()
}
}
class Producer extends Thread {
OutputStream source
Producer(OutputStream source) {
this.source=source
}
@Override
public void run() {
byte[] data = new byte[1024];
println "Running the Producer..."
FileInputStream fout = new FileInputStream("/Users/ganesh/temp/www/README")
int amount=0
while((amount=fout.read(data))>0)
{
String s = new String(data, 0, amount);
source.write(s.getBytes())
synchronized (this) {
wait(5);
}
}
source.close()
}
}
class Consumer extends Thread{
InputStream ins
Consumer(InputStream ins)
{
this.ins = ins
}
public void run()
{
println "Consumer running"
int amount;
byte[] data = new byte[1024];
while ((amount = ins.read(data)) >= 0) {
String s = new String(data, 0, amount);
println "< $s"
synchronized (this) {
wait(5);
}
}
}
}
One flaw might be that there is no clear way for the writer to indicate to the reader that it encountered a problem:
PipedOutputStream out = new PipedOutputStream();
PipedInputStream in = new PipedInputStream(out);
new Thread(() -> {
try {
writeToOut(out);
out.close();
}
catch (SomeDataProviderException e) {
// Have to notify the reading side, but how?
}
}).start();
readFromIn(in);
The writer could close out, but maybe the reader misinterprets that as end of data. To handle this correctly additional logic is needed. It would be easier if functionality to manually break the pipe was provided.
There is now JDK-8222924 which requests a way to manually break the pipe.
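Until something like that exists, the "additional logic" can be as small as a shared error slot next to the pipe. The following is a minimal sketch under that assumption; the class FailingPipeSketch, the method readAll and the AtomicReference side channel are hypothetical and not part of the JDK pipe API. The writer records its failure before it closes the pipe, and the reader checks the slot after it has seen EOF.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

final class FailingPipeSketch {

    /** Reads the pipe to EOF, then rethrows any failure the writer recorded. */
    static byte[] readAll(PipedInputStream in, AtomicReference<Exception> writerError)
            throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        for (int n; (n = in.read(buf)) != -1; ) {
            result.write(buf, 0, n);
        }
        Exception e = writerError.get();
        if (e != null) {
            throw new IOException("writer failed; the data read so far is incomplete", e);
        }
        return result.toByteArray();
    }

    /** Demo: returns "ok:<data>" on success or "error:<writer message>" on failure. */
    static String demo(boolean fail) {
        try {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);
            AtomicReference<Exception> writerError = new AtomicReference<>();
            new Thread(() -> {
                try {
                    out.write("partial".getBytes(StandardCharsets.UTF_8));
                    if (fail) {
                        throw new IllegalStateException("data provider failed");
                    }
                } catch (Exception e) {
                    writerError.set(e); // record the failure BEFORE closing the pipe
                } finally {
                    try {
                        out.close(); // the reader sees EOF only after the error is recorded
                    } catch (IOException ignored) {
                    }
                }
            }).start();
            return "ok:" + new String(readAll(in, writerError), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return "error:" + e.getCause().getMessage();
        }
    }
}
```

The ordering matters: a try-with-resources block would close the pipe before its catch clause runs, so the reader could observe EOF before the error is visible. That is why the sketch closes the stream in an explicit finally block.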
From my point of view there is a flaw. More precisely, there is a high risk of a deadlock if the thread which should pump data into the PipedOutputStream dies prematurely before it actually writes a single byte into the stream. The problem in such a situation is that the implementation of the piped streams is not able to detect the broken pipe. Consequently, the thread reading from the PipedInputStream will wait forever (i.e. deadlock) in its first call to read().
Broken pipe detection actually relies on the first call to write(), as the implementation then lazily initializes the write-side thread, and only from that point on does broken pipe detection work.
The following code reproduces the situation:
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import org.junit.Test;
public class PipeTest
{
@Test
public void test() throws IOException
{
final PipedOutputStream pout = new PipedOutputStream();
PipedInputStream pin = new PipedInputStream();
pout.connect(pin);
Thread t = new Thread(new Runnable()
{
public void run()
{
try
{
if(true)
{
throw new IOException("asd");
}
pout.write(0); // first byte, which never gets written
pout.close();
}
catch(IOException e)
{
throw new RuntimeException(e);
}
}
});
t.start();
pin.read(); // waits forever, i.e. deadlocks
}
}
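A possible way to avoid this hang, offered here as a sketch rather than as part of the answer above, is to close the PipedOutputStream in a finally block of the writer thread. Closing marks the pipe as closed by the writer even if nothing was ever written, so the blocked read() returns -1. The class and method names below are hypothetical.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

final class PipeCloseInFinallySketch {

    /** Returns the result of the reader's first read() when the writer fails before writing. */
    static int firstRead() {
        try {
            PipedOutputStream pout = new PipedOutputStream();
            PipedInputStream pin = new PipedInputStream(pout);
            Thread writer = new Thread(() -> {
                try {
                    // Simulates the writer dying before its first write().
                    throw new IOException("simulated failure before the first write");
                } catch (IOException e) {
                    // A real program would log or report the failure here.
                } finally {
                    try {
                        pout.close(); // always close, so the reader is released
                    } catch (IOException ignored) {
                    }
                }
            });
            writer.start();
            return pin.read(); // returns -1 instead of blocking forever
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader cannot tell this early EOF apart from a normal end of data, so a real program would still need some way to report the writer's failure.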
The flaws that I see with the JDK implementation are:
1) No timeouts, reader or writer can block infinitely.
2) Suboptimal control over when data is transferred (should be done only with flush, or when circular buffer is full)
So I created my own to address the above, (timeout value passed via a ThreadLocal):
PipedOutputStream
How to use:
PiedOutputStreamTest
Hope it helps...
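For the first point, a timeout can also be layered on top of the stock JDK classes without replacing them. The sketch below is only an illustration and is unrelated to the implementation linked above; the class TimedPipeReadSketch and its methods are hypothetical names. It runs the blocking read() on a helper thread and waits for the result with Future.get. It relies on PipedInputStream's wait-based read loop responding to interruption; streams whose reads ignore interrupts would stay blocked on the helper thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedPipeReadSketch {
    // A single daemon helper thread, so that a stuck read cannot keep the JVM alive.
    private static final ExecutorService READER = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "timed-pipe-reader");
        t.setDaemon(true);
        return t;
    });

    /** Reads one byte, or throws TimeoutException if none arrives within the timeout. */
    static int read(InputStream in, long timeoutMillis) throws IOException, TimeoutException {
        Callable<Integer> task = () -> in.read();
        Future<Integer> pending = READER.submit(task);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupts the helper thread's blocked read
            throw e;
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        } catch (InterruptedException e) {
            pending.cancel(true);
            Thread.currentThread().interrupt();
            throw new InterruptedIOException("interrupted while waiting for data");
        }
    }

    /** Demo: the first read succeeds, the second times out because nothing more is written. */
    static String demo() {
        try {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);
            out.write(42);
            int first = read(in, 1000);
            try {
                read(in, 200);
                return "unexpected";
            } catch (TimeoutException expected) {
                return first + ",timeout";
            }
        } catch (IOException | TimeoutException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Handing every byte to another thread is slow, so a practical version would read whole buffers per call. The sketch also covers only the reader side; a writer blocked on a full pipe would need the same treatment.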
