Where is the content of a Java piped stream 'stored'? - java

I am running my application under a profiler. The 'class' with the highest memory consumption is char[], which accounts for about 10 kB in my application.
I then created an InputStream (a PipedInputStream, to be exact) which holds 300 MB of byte array data.
Then I took another look at my profiler, and I don't see any significant change (nothing anywhere is eating up 300 MB).
The question is, if that 300 MB of byte array is not in memory, where is Java keeping it?
[Update]
Additional info on how I got the 300 MB to my PipedInputStream:
I am developing a web app that has a file upload mechanism.
And in one of the processes in my file upload, I create an input stream (PipedInputStream). Basically,
I read the multipart file's input stream (a few KB of byte[] at a time),
Created a PipedOutputStream
Created a PipedInputStream (passing the recently created output stream to the constructor)
Wrote the multipart file's input stream to my PipedOutputStream (running on a separate thread, which flushes and closes that output stream before exiting). At this point, I now have a copy of the multipart file's bytes in my own input stream
Then (accidentally) stored that input stream in my http session (discussion/debate on whether that is a good idea would be on a different question)
So the question then again is, where is Java keeping my InputStream's content (I don't see it anywhere in my profiler)?
[Update#2]
I have a thread that reads from the PipedInputStream and writes the data out to a file through a FileOutputStream.

A PipedInputStream just makes data available when it's written by the output stream that it's connected to. So long as you keep reading from your input stream as fast as it receives data from the output stream, there won't be much data to buffer.
If that doesn't help, you'll need to give more information about what you're doing with the piped input stream - what output stream is it connected to, and what's reading from it?
EDIT: You still haven't said what's reading from your PipedInputStream. Something must be, as otherwise the PipedOutputStream would block - a PipedInputStream only has a fairly small buffer by default (1024 bytes).
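To make that concrete, here is a minimal sketch (the class name, chunk sizes and iteration count are made up for illustration) of a producer thread writing into a PipedOutputStream while another thread drains the connected PipedInputStream. If the reader stops draining, the writer blocks as soon as the pipe's small internal buffer fills up - which is why you never see the whole payload sitting in the pipe itself.

import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out); // internal buffer is only 1024 bytes by default

        Thread writer = new Thread(() -> {
            byte[] chunk = new byte[4096];
            try {
                for (int i = 0; i < 1000; i++) {
                    out.write(chunk); // blocks whenever the pipe's buffer is full
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { out.close(); } catch (IOException ignored) { } // signals end-of-stream to the reader
            }
        });
        writer.start();

        // Reader side: the data only ever passes through in small pieces.
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        writer.join();
        System.out.println("Read " + total + " bytes through the pipe");
    }
}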

A PipedInputStream does not store the stream's data itself; it only holds a small internal buffer. Also, where do you get that 300 MB byte array from?

Related

Okio vs java.io performance

I read the following blog:
https://medium.com/@jerzy.chalupski/a-closer-look-at-the-okio-library-90336e37261
It says:
"the Sinks and Sources are often connected into a pipe. Smart folks at Square realized that there’s no need to copy the data between such pipe components like the java.io buffered streams do. All Sources and Sinks use Buffers under the hood, and Buffers keep the data in Segments, so quite often you can just take an entire Segment from one Buffer and move it to another."
I just don't understand where the copying of data happens in java.io,
and in which case a Segment would be moved to another Buffer.
After reading the source code of Okio: if I write strings to a file with Okio like the following:
val sink = logFile.appendingSink().buffer()
sink.writeUtf8("xxxx")
there will be no "moving segment to another Buffer". Am I right?
Java's BufferedReader is just a Reader that buffers data into a buffer – the buffer being a char[], or something like that – so that every time you need a bunch of bytes/chars from it, it doesn't need to read bytes from a file/network/whatever your byte source is (as long as it has buffered enough bytes). A BufferedWriter does the opposite operation: whenever you write a bunch of bytes to the BufferedWriter, it doesn't actually write bytes to a file/socket/whatever, but it "parks" them into a buffer, so that it can flush the buffer only once it's full.
Overall, this minimises access to file/network/whatever, as it could be expensive.
When you pipe a BufferedReader to a BufferedWriter, you effectively have 2 buffers. How does Java move bytes from one buffer to the other? It copies them from the source to the sink using System.arraycopy (or something equivalent). Everything works well, except that copying a bunch of bytes requires an amount of time that grows linearly with the size of the buffer(s). Hence, copying 1 MB will take roughly 1000 times longer than copying 1 KB.
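For reference, the copying described here looks roughly like the following generic java.io loop (a sketch; the file names and the 8 KB buffer size are arbitrary). Every chunk is read from the source into an intermediate byte[] and then written out to the sink, so the bytes are physically copied on every hop, and the buffered wrappers each add yet another internal array the data passes through.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyLoop {
    // Copies everything from 'in' to 'out' through an intermediate buffer.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // the intermediate array the bytes are copied through
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) { // copy from the source into the buffer
            out.write(buffer, 0, n);          // copy from the buffer into the sink
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.bin"));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("output.bin"))) {
            System.out.println("Copied " + copy(in, out) + " bytes");
        }
    }
}

Okio avoids this duplication because its Buffers share Segment instances, so handing data from a Source to a Sink can often just relink a Segment instead of copying its bytes.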
Okio, on the other hand, doesn't actually copy bytes. Oversimplifying the way it works, Okio has a single byte[] with the actual bytes, and the only thing that gets moved from the source to the sink is the pointer (or reference) to that byte[], which requires the same amount of time regardless of its size.

java : Does using buffered input stream make the conversion of input streams to byte[] more efficient?

I want to convert an input stream to a byte[] and I'm using IOUtils.toByteArray(inputStream). Will it be more efficient to use a wrapper like BufferedInputStream for the inputStream? Does it save memory?
Will it be more efficient to use a wrapper like BufferedInputStream for the inputStream?
Not by any significance. IOUtils.toByteArray reads data into a buffer of 4096 bytes. BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really improves performance, because it reduces the number of operating system calls by a factor of roughly 8000. And operating system calls take a lot of time, comparatively.
Does it save memory ?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
In terms of final memory consumption it wouldn't help, as you will need to move the whole stream into a byte[] anyway; the size of the array would be the same, so memory consumption would be the same.
What a buffered stream does is wrap another stream. A BufferedOutputStream, instead of writing to the underlying stream directly, collects your output in an internal buffer and writes it to the underlying stream only when it is flushed/closed or when the internal buffer is full; a BufferedInputStream does the analogous thing on the reading side, fetching data in larger chunks. That can make the I/O operations themselves faster, because they happen in batches instead of one small call at a time, but it won't reduce the memory needed for the resulting byte[].
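To see why the wrapper changes so little, here is roughly what a toByteArray-style drain does (a simplified sketch, not the actual Commons IO source; the 4096-byte buffer mirrors the size mentioned above). Whether or not the stream underneath is buffered, the whole content still ends up in one byte[] at the end.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class StreamToBytes {
    // Roughly what IOUtils.toByteArray does: drain the stream through a small fixed buffer.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray(); // the full content is held in memory either way
    }
}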

Java read huge file ( ~100GB ) efficiently

I would like to read a huge binary file (~100 GB) efficiently in Java. I have to process each line of it. The line processing will be done in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What will be the optimum buffer size? Any formula for that?
If this is a binary file, then reading in "lines" does not make a lot of sense.
If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process.
And repeat.
Tips:
Use a bounded buffer in case you can read lines faster than you can process them.
Recycle the byte[] objects to reduce garbage generation.
If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.
If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated but potentially faster than read() or readLine().
Does reading in chunks work?
BufferedReader or BufferedInputStream both read in chunks, under the covers.
What will be the optimum buffer size?
The exact buffer size probably isn't that important. I'd make it a few KB or tens of KB.
Any formula for that?
No there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
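A hedged sketch of the BufferedInputStream-plus-worker-queue approach described above (the delimiter byte, queue capacity and file name are made-up values): one thread reads the file and splits it at the "line" delimiter, and a bounded queue stops the reader from running arbitrarily far ahead of the workers.

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkedReader {
    static final byte[] POISON = new byte[0]; // sentinel telling a worker to stop

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024); // bounded: the reader blocks if workers fall behind
        int workers = Runtime.getRuntime().availableProcessors();

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    byte[] line;
                    while ((line = queue.take()) != POISON) {
                        process(line); // the real per-line work goes here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("huge.bin"))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {                    // whatever byte marks your end of "line"
                    queue.put(line.toByteArray());
                    line.reset();
                } else {
                    line.write(b);
                }
            }
            if (line.size() > 0) queue.put(line.toByteArray());
        }
        for (int i = 0; i < workers; i++) queue.put(POISON);
    }

    static void process(byte[] line) {
        // placeholder for the actual per-line processing
    }
}

Recycling the byte[] chunks (for example via a second queue of spare buffers) would reduce garbage, as the tip above suggests.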
Java 8, streaming
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}

How do I dump my decrypted text file byte blocks into an InputStreamReader?

In the CBC decryption loop I'm dealing with small (< 32 bytes) byte chunks so I can't use a StringBuilder because the Heap blows up. I figure I should take the decrypted bytes and dump them into some kind of buffered array. At this point I'm confused about how to setup and populate an InputStreamReader from these bytes. If I can populate this InputStreamReader then I want to wrap a BufferedReader around it. I then plan to read from the BufferedReader one line at a time because my text processing just needs to operate on one line at a time. I don't want to write any data to disk during this process. I'm just super confused about what to do with bytes that I'm getting from my CBC decryption loop. They obviously need to buffer (since a line of my text file is probably 20 times the size of a decrypted chunk) but I'm confused about the buffer that will act as the middleman. I'm using BouncyCastle but that piece of the puzzle isn't really causing me issues at the moment. ~Thanks for the newbie help.
Take the bytes from your decryption block, and dump them in to a PipedOutputStream. Then create a PipedInputStream from that, wrap appropriately, and feed it to your other code.
This is best done in two separate threads. It may work in one, but you'd have to be careful to not block (notably the reading), or you'll get stuck.
Or you could write your own custom InputStream implementation over your decrypter.
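A minimal sketch of the piped approach (decryptedBlocks stands in for whatever small byte[] chunks your CBC loop produces, and UTF-8 is an assumed charset): the decryption side pushes each block into a PipedOutputStream on its own thread, and the consumer reads lines from a BufferedReader wrapped around the connected PipedInputStream.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class DecryptToLines {
    static BufferedReader asLineReader(List<byte[]> decryptedBlocks) throws IOException {
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut);

        new Thread(() -> {                        // producer: feed each decrypted chunk into the pipe
            try {
                for (byte[] block : decryptedBlocks) {
                    pipeOut.write(block);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { pipeOut.close(); } catch (IOException ignored) { } // end-of-stream for the reader
            }
        }).start();

        // consumer side: decode bytes to chars and hand back a line-oriented reader
        return new BufferedReader(new InputStreamReader(pipeIn, StandardCharsets.UTF_8));
    }
}

On the consumer side you then just call readLine() in a loop; because the pipe's buffer is small, the decrypting thread naturally slows down whenever line processing falls behind.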

flush() java file handling

What is the exact use of flush()? What is the difference between a stream and a buffer? Why do we need a buffer?
The advantage of buffering is efficiency. It is generally faster to write a block of 4096 bytes one time to a file than to write, say, one byte 4096 times.
The disadvantage of buffering is that you miss out on the feedback. Output to a handle can remain in memory until enough bytes are written to make it worthwhile to write to the file handle. One part of your program may write some data to a file, but a different part of the program or a different program can't access that data until the first part of your program copies the data from memory to disk. Depending on how quickly data is being written to that file, this can take an arbitrarily long time.
When you call flush(), you are asking the stream to immediately write whatever data is sitting in its buffer out to the file handle, even if the buffer is not full.
The data sometimes gets cached in a buffer before it's actually written to disk; flush() causes what's in the buffer to be written to disk.
flush tells an output stream to send all the data to the underlying stream. It's necessary because of internal buffering. The essential purpose of a buffer is to minimize calls to the underlying stream's APIs. If I'm storing a long byte array to a FileOutputStream, I don't want Java to call the operating system file API once per byte. Thus, buffers are used at various stages, both inside and outside Java. Even if you did call fputc once per byte, the OS wouldn't really write to disk each time, because it has its own buffering.
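As an illustration (the file name and message are invented), this is how a BufferedOutputStream defers the actual write until its buffer fills up, until you flush(), or until it is closed:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("log.txt"), 8192)) {
            out.write("important message\n".getBytes(StandardCharsets.UTF_8));
            // The bytes may still be sitting in the 8 KB buffer rather than in the file.
            out.flush(); // push them down to the FileOutputStream (and the OS) right now
            // close(), via try-with-resources, would also flush whatever remains.
        }
    }
}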
