I have a large byte array already in memory, received from a SOAP response, and I have to write this byte array to an OutputStream.
Is it OK just to use write:
byte[] largeByteArray = ...;
outputStream.write(largeByteArray);
...
outputStream.flush();
...
or is it better to split the byte array into small chunks and write those to the output stream?
If you've already got the large array, then just write it out - if the output stream implementation chooses to chunk it, it can make that decision. I can't see a benefit in you doing that for it - which may well make it less efficient, if it's able to handle large chunks.
If you want to make this more efficient, I would write the data as you get it rather than building a large byte[] (and waiting until the end to start writing). If this is an option, it can be faster and more efficient. However, if it is not an option, use one large write.
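If writing as you receive is an option, here is a minimal sketch, assuming the SOAP payload can be obtained as an InputStream rather than a fully built byte[] (InputStream.transferTo requires Java 9+):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class WriteAsYouGo {
    // Streams the payload straight through instead of materializing it as one big byte[].
    static void copy(InputStream soapBody, OutputStream out) throws IOException {
        soapBody.transferTo(out); // copies internally in fixed-size chunks (Java 9+)
        out.flush();
    }
}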
What type of output stream are you using?
There are output streams that can write the array in chunks.
In general, I believe that if you issue an I/O operation (a write) for each single byte, performance may be poor, because I/O operations are expensive.
I can think of no conceivable reason it would be better without getting truly bizarre and absurd. Generally, if you can pass data between layers in larger chunks without additional effort, then you should do so. Often, it's even worth additional effort to do things that way, so why would you want to put in extra effort to make more work?
If largeByteArray is really large, the write takes a long time, and memory is a real constraint:
Split the array into parts; after writing one part, set that part to null. This releases the reference to the part, so the JVM can GC it as soon as possible.
By splitting and releasing, you can run more write(largeByteArray) jobs at the same time before an OutOfMemoryError occurs.
Notice:
During the split stage the JVM needs roughly double the array size in memory, but after the split the original array eventually gets GC'd and you are back to using the same amount of memory as before.
Example: a server has 1 GB of memory. It can run at most 100 threads, each holding and sending 10 MB of data to a client at the same time.
If you use one big 10 MB array per thread, memory use is always 1 GB, with nothing to spare even when every thread has only 1 MB of data left to send.
My solution is to split the 10 MB into 10 × 1 MB parts. After a part has been sent, it can be GC'd, and each thread uses less memory on average over its lifetime, so the server may be able to run more tasks.
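A rough sketch of what this answer describes (the method and names are made up for illustration; whether it actually helps depends on how long the writes take and on GC behavior):

import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

public class ChunkedSend {
    // Splits the payload into ~1 MB parts and drops the reference to each part
    // after it has been written, so already-sent parts become eligible for GC
    // while later parts are still being sent.
    static void sendInParts(byte[] payload, OutputStream out) throws IOException {
        final int partSize = 1024 * 1024;
        byte[][] parts = new byte[(payload.length + partSize - 1) / partSize][];
        for (int i = 0, off = 0; off < payload.length; i++, off += partSize) {
            parts[i] = Arrays.copyOfRange(payload, off, Math.min(off + partSize, payload.length));
        }
        payload = null; // drop the local reference; the caller must not keep one either

        for (int i = 0; i < parts.length; i++) {
            out.write(parts[i]);
            parts[i] = null; // the sent part can now be GC'd
        }
        out.flush();
    }
}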
Background: I'm currently creating an application in which two Java programs communicate over a network using a DataInputStream and DataOutputStream.
Before every communication, I'd like to send an indication of what type of data is being sent, so the program knows how to handle it. I was thinking of sending an integer for this, but a byte has enough possible combinations.
So my question is, is Java's DataInputStream's readByte() faster than readInt()?
Also, on the other side, is Java's DataOutputStream's writeByte() faster than writeInt()?
If one byte is enough for your data then readByte and writeByte will indeed be faster (because they read/write less data). It won't be a noticeable difference, though, because the amount of data is very small in either case - 1 byte vs 4 bytes.
If you have lots of data coming from the stream then using readByte or readInt will not make a speed difference - for example, calling readByte 4 times instead of readInt once. Just use the one that matches the kind of data you expect and makes your code easier to understand. You will have to read the whole thing anyway :)
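For illustration, a minimal sketch of the one-byte type indicator idea (the tag values and handlers are hypothetical):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TypeTagExample {
    static final byte TYPE_TEXT = 1;   // hypothetical tag values
    static final byte TYPE_IMAGE = 2;

    static void sendText(DataOutputStream out, String text) throws IOException {
        out.writeByte(TYPE_TEXT);      // one byte is plenty for fewer than 256 message types
        out.writeUTF(text);
        out.flush();
    }

    static void receive(DataInputStream in) throws IOException {
        byte type = in.readByte();
        switch (type) {
            case TYPE_TEXT:
                System.out.println("text: " + in.readUTF());
                break;
            case TYPE_IMAGE:
                // ... read the image payload here
                break;
            default:
                throw new IOException("Unknown type tag: " + type);
        }
    }
}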
I have a file containing data that is meaningful only in chunks of a certain size, which is prepended to each chunk, e.g.
{chunk_1_size}
{chunk_1}
{chunk_2_size}
{chunk_2}
{chunk_3_size}
{chunk_3}
{chunk_4_size}
{chunk_4}
{chunk_5_size}
{chunk_5}
.
.
{chunk_n_size}
{chunk_n}
The file is really big, ~2 GB, and the chunk size is ~20 MB (which is the buffer size I want to have).
I would like to buffer-read this file to reduce the number of calls to the actual hard disk.
But I am not sure how big a buffer to use, because the chunk size may vary.
Pseudo code of what I have in mind:
while (!EOF) {
    /* the chunk size is an integer, i.e. 4 bytes */
    chunkSize = readChunkSize();
    /* according to the chunk size, read that many bytes from the file */
    readChunk(chunkSize);
}
If, say, I pick an arbitrary buffer size, then I might run into situations like:
The first buffer contains chunkSize_1 + chunk_1 + partialChunk_2 --- I have to keep track of the leftover, then get the remaining part of the chunk from the next buffer and concatenate it to the leftover to complete the chunk.
The first buffer contains chunkSize_1 + chunk_1 + partialChunkSize_2 (the chunk size is an integer, i.e. 4 bytes, so say I get only two of those bytes from the first buffer) --- I have to keep track of partialChunkSize_2 and then get the remaining bytes from the next buffer to form an integer that actually gives me the next chunkSize.
The buffer might not even be able to hold one whole chunk at a time --- I have to keep calling read until the first chunk is completely read into memory.
You don't have much control over the number of calls to the hard disk. There are several layers between you and the hard disk (OS, driver, hardware buffering) that you cannot control.
Set a reasonable buffer size in your Java code (say 1 MB) and forget about it unless and until you can prove there is a performance issue that is directly related to buffer sizes. In other words, do not fall into the trap of premature optimization.
See also https://stackoverflow.com/a/385529/18157
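As an aside, here is a minimal sketch, assuming the size prefix is a 4-byte int in DataOutputStream/big-endian format: wrapping a BufferedInputStream in a DataInputStream lets readFully do the partial-read bookkeeping described in the question.

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class ChunkReader {
    static void readChunks(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path), 1024 * 1024))) {
            while (true) {
                int chunkSize;
                try {
                    chunkSize = in.readInt();   // 4-byte size prefix
                } catch (EOFException eof) {
                    break;                      // clean end of file
                }
                byte[] chunk = new byte[chunkSize];
                in.readFully(chunk);            // blocks until the whole chunk has been read
                process(chunk);                 // hypothetical handler
            }
        }
    }

    private static void process(byte[] chunk) {
        // ... do something meaningful with the chunk
    }
}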
You might need to do some analysis to get an idea of the average chunk size before deciding how to read the data.
You are saying you want to keep a buffer and read data until a chunk is complete, so that you have some meaningful data.
Are you copying the file to some other place, or sending this data somewhere else?
For some tasks the Java NIO packages have better implementations to work with, rather than reading the data into JVM buffers.
The buffer size should be large enough to read most chunks of data in one go.
If you plan to hold the data in memory, reading it through buffers and keeping it in memory is still a memory-costly operation; buffers can be freed in several ways, e.g. with basic flush operations.
Please also check the Apache Commons IO FileUtils class to read/write data.
This is more a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server... For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, what should be the number of bytes to read? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when defining the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude of times faster than the network can. So unless you do something silly (like doing 1 byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller byte array. If that is not a concern, check how much memory your program uses. I think 1024 - 4096 bytes would be a good size for holding the data while you continue to receive.
If you just pump data you normally do not need to use any buffered streams. Just make sure you use a decently sized (8-64 KB) temporary byte[] buffer passed to the read method (or use a pump method that does this for you). The default buffer size is too small for most usages (and if you use a larger temp array, the stream's internal buffer will be bypassed anyway).
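A minimal sketch of such a pump method with an explicit temporary buffer (the 64 KB size is just one value from the range mentioned above):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class Pump {
    // Copies everything from in to out using one reusable temporary buffer.
    // Returns the number of bytes copied.
    static long pump(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[64 * 1024];   // 64 KB temp buffer
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }
}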
I have two scenarios in Netty where I am trying to minimize memory copies and optimize memory usage:
(1) Reading a very large frame (20 megabytes).
(2) Reading lots of very small frames (20 megabytes total at 50 bytes per frame) to rebuild into one message at a higher level in the pipeline.
For the first scenario, as I get a length at the beginning of the frame, I extended FrameDecoder. Unfortunately, as I don't see how to return the length to Netty (I can only indicate whether the frame is complete or not), I believe Netty is going through multiple fill-buffer, copy and realloc cycles, thus using far more memory than is required. Is there something I am missing here? Or should I be avoiding FrameDecoder entirely if I expect this scenario?
In the second scenario, I am currently creating a linked list of all the little frames which I wrap using ChannelBuffers.wrappedBuffer (which I can then wrap in a ChannelBufferInputStream), but I am again using far more memory than I expected to use (perhaps because the allocated ChannelBuffers have spare space?). Is this the right way to use Netty ChannelBuffers?
There is a specialized version of the frame decoder called LengthFieldBasedFrameDecoder. It's handy when you have a header with the message length. It can even extract the message length from the header given an offset.
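For reference, a sketch of plugging it into a pipeline, assuming the Netty 3.x API used in this thread and a frame layout with a 4-byte length prefix at offset 0 (the maximum frame size and the handler are placeholders):

import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
import org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder;

public class FramedPipelineFactory implements ChannelPipelineFactory {
    public ChannelPipeline getPipeline() {
        ChannelPipeline pipeline = Channels.pipeline();
        // maxFrameLength, lengthFieldOffset, lengthFieldLength, lengthAdjustment, initialBytesToStrip
        pipeline.addLast("frameDecoder",
                new LengthFieldBasedFrameDecoder(32 * 1024 * 1024, 0, 4, 0, 4));
        pipeline.addLast("handler", new FrameHandler());
        return pipeline;
    }

    // Hypothetical handler that receives each complete frame further up the pipeline.
    static class FrameHandler extends SimpleChannelUpstreamHandler { }
}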
Actually, ChannelBuffers.wrappedBuffer does not create copies of the received data; it creates a composite buffer from the given buffers, so your received frame data will not be copied. If you hold on to the composite buffers / your custom wrapper in the code and forget to nullify them, memory leaks can happen.
These are the practices I follow:
Allocate direct buffers for long-lived objects, and slice them on use.
When I want to join/encode multiple buffers into one big buffer, I use ChannelBuffers.wrappedBuffer.
If I have a buffer and want to do something with it (or a portion of it), I make a slice of it by calling slice or slice(0, ...) on the channel buffer instance.
If I have a channel buffer and know the position of the data, and the data is small, I always use the getXXX methods.
If I have a channel buffer which is used in many places to make something out of it, I always make it modifiable and slice it on use.
Note: channelBuffer.slice does not make a copy of the data; it creates a channel buffer with its own reader & writer index.
In the end, it appeared the best way to handle my FrameDecoder issue was to write my own handler on top of SimpleChannelUpstreamHandler. As soon as I determined the length from the header, I created a ChannelBuffer with a size exactly matching that length. This (along with other changes) significantly improved the memory performance of my application.
I am writing a program that has to copy a sizeable, but not huge, amount of data from folder to folder (in the range of several dozen photos at once). Originally I was using java.io.FileOutputStream to simply read into a buffer and write out, but then I heard about potential performance increases using java.nio.FileChannel.
I don't have the resources to run a serious, controlled test with the data I have, but there seems to be no consensus on what the advantages of each are (other than FileChannel being thread safe). Some users report FileChannel being great for smaller files, others report huge speed increases with larger files.
I am wondering if anyone knows exactly what the intent of creating FileChannel was in the first place: was it designed for better performance? In what cases? And is there a definitive performance increase for general kinds of data, or are the differences I should expect to see trivial because I am not working with data that is specialized enough?
EDIT: Assume my data does not need to be thread safe.
FileChannel.transferFrom/transferTo should be faster than IO streams for file copying.
Or you can simply use Java 7's java.nio.file.Files.copy(source, target). That should be as fast as it can get.
However, in the end, performance won't be noticeably different - hard disk speed is the bottleneck.
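To make those two options concrete, a short sketch (Java 7+; the paths are placeholders):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class CopyExamples {
    // Option 1: let the JDK do the copy (Java 7+).
    static void copyWithFiles(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Option 2: channel-to-channel transfer, which may use OS-level copy mechanisms.
    static void copyWithChannels(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE,
                     StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long size = in.size();
            while (position < size) {
                // transferTo may transfer fewer bytes than requested, so loop.
                position += in.transferTo(position, size - position, out);
            }
        }
    }
}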
FileChannel is not non-blocking, and it is not selectable. I'm not sure if they are going to add these features in the future. Java 7 has AsynchronousFileChannel, though.
Input and Output Streams assume a stream styled access to the file or resource. There are a few extra items which help (array reads) but the basic idea is that of a stream where you read in one or more characters at a time (possibly blocking until you have more characters available).
Channels are the means to copy information into Buffers. This provides a lower level of access to input and output routines. With thoughtful buffer sizing, the speed-ups can be impressive. Structuring your code around buffers can reduce the time spent in a read loop (also increasing performance). Finally, while it is possible to do pre-checking of input stream state in an attempt to avoid blocking, Channels and Buffers allow operations to perform in a non-blocking manner (even in the worst conditions).
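As a rough illustration of the channel-plus-buffer style described above (the buffer size here is arbitrary):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelRead {
    // Reads a file through a FileChannel into one reusable ByteBuffer.
    static long countBytes(Path file) throws IOException {
        long total = 0;
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // direct 64 KB buffer
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();               // switch to draining mode
                total += buffer.remaining(); // ... process the bytes here ...
                buffer.clear();              // ready for the next read
            }
        }
        return total;
    }
}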
Have you taken a look at commons-io?
FileUtils.copyFileToDirectory(srcFile, destDir);