I have a piece of code that reads a very large number (hundreds of thousands) of relatively small files (a couple of KB each) from the local file system in a loop. For each file a java.io.FileInputStream is created to read the content. The process is very slow and takes ages.
Do you think that wrapping the FIS into java.io.BufferedInputStream would make a significant difference?
If you aren't already using a byte[] buffer of a decent size in the read/write loop (the latest implementation of BufferedInputStream uses 8 KB), then it will certainly make a difference. Give it a try yourself. Don't forget to make any OutputStream a BufferedOutputStream as well.
But if you have already buffered it using a byte[], or it still makes only a little difference, then you've hit the hard disk and I/O controller speed as the bottleneck.
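For reference, a minimal sketch of such a read/write loop, assuming hypothetical src/dst files and the 8 KB size mentioned above:

import java.io.*;

public class BufferedCopy {
    // Sketch only: src/dst are placeholders; the 8 KB byte[] plus the
    // buffered wrappers keep the number of native read/write calls low.
    public static void copy(File src, File dst) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(src));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(dst))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}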
I very much doubt whether that will make any difference.
Your fundamental problem is the hundreds of thousands of tiny files. Reading those is going to make the disk thrash and take forever no matter how you do it; you'll spend 99.9% of the time waiting on mechanical movement inside the hard disk.
There are two ways to fix this:
Save your data on an SSD - they have much lower (as in five orders of magnitude less) latency.
Rearrange your data into a few large files and read those sequentially
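For the second option, a rough sketch (the pack method, paths and length-prefix layout are hypothetical, just to illustrate the idea of one big sequentially readable file):

import java.io.*;
import java.nio.file.*;

public class Repack {
    // Writes each small file's length and bytes into one large archive file
    // so it can later be read back sequentially through a single stream.
    public static void pack(Path dir, Path archive) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(Files.newOutputStream(archive)));
             DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path p : files) {
                byte[] data = Files.readAllBytes(p);
                out.writeInt(data.length);   // length prefix
                out.write(data);             // file content
            }
        }
    }
}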
That depends on how you're reading the data. If you're reading from the FileInputStream in a very inefficient way (for example, calling read() byte-by-byte), then using a BufferedInputStream could improve things dramatically. But if you're already using a reasonable-sized buffer with FileInputStream, switching to a BufferedInputStream won't matter.
Since you're talking about a large number of very small files, there's a strong possibility that a lot of the delay is due to directory operations (open, close), not the actual reading of bytes from the files.
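To illustrate the difference, a small sketch (the file parameter and buffer size are placeholders): the first method crosses into native code once per byte, the second amortizes that cost over an 8 KB array.

import java.io.*;

public class ReadStyles {
    // Inefficient: one native read call per byte.
    static long slowCount(File file) throws IOException {
        long total = 0;
        try (InputStream in = new FileInputStream(file)) {
            while (in.read() != -1) {
                total++;
            }
        }
        return total;
    }

    // Efficient: one call per 8 KB chunk, no BufferedInputStream needed.
    static long fastCount(File file) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }
}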
This is more like a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server... For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is, if I use a buffer to hold the bytes, what should be the number of bytes to read? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when defining the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1-byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient and unlikely to affect the performance of your application, regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
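A small sketch of that mark/reset point (the readlimit and sizes are illustrative, not recommendations):

import java.io.*;

public class MarkGrowth {
    static void demo(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw); // default 8 KB buffer
        in.mark(1 << 20);               // allow reset for up to 1 MB
        byte[] chunk = new byte[64 * 1024];
        int n = in.read(chunk);         // reading past 8 KB (if available) forces the internal buffer to grow
        in.reset();                     // back to the marked position
    }
}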
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller array of bytes. If that doesn't matter, you should check how much memory your program consumes. I think 1024 - 4096 bytes would be a good size to hold the data while you continue to receive.
If you are pumping data you normally do not need any Buffered streams. Just make sure you use a decently sized (8-64 KB) temporary byte[] buffer passed to the read method (or use a pump method which does this for you). The default buffer size is too small for most usages (and if you pass a larger temp array, the internal buffer is bypassed anyway).
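A minimal pump sketch along those lines (the 32 KB size is just an example):

import java.io.*;

public class Pump {
    // One decently sized temporary byte[] passed straight to read/write;
    // no Buffered wrappers involved.
    public static long pump(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[32 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}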
My application does continuous disk I/O through 10 threads. The CPU profile is coming in very high, around 100%, so I was planning to change it to 1 separate writer thread.
Also, I was thinking I could maintain a cache of BufferedWriters so I do not have to continuously open the streams. Does anyone see a problem with this?
But I am unsure of where to put the close of the writer. Secondly, if the writers are not closed, will there be a problem?
Thanks
If you are writing short strings to the writers very often, opening a new writer might indeed be a waste of time. One issue to be aware of is that if you keep the streams open, other threads trying to write to the same file may fail. From the FileOutputStream documentation:
Some platforms, in particular, allow a file to be opened for writing by only one FileOutputStream (or other file-writing object) at a time.
Also, make sure you put the closing statement in a finally clause.
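Something along these lines (the path and line contents are placeholders):

import java.io.*;

public class SafeClose {
    static void appendLine(String path, String line) throws IOException {
        BufferedWriter writer = null;
        try {
            writer = new BufferedWriter(new FileWriter(path, true)); // append mode
            writer.write(line);
            writer.newLine();
        } finally {
            if (writer != null) {
                writer.close();   // runs even if write() throws
            }
        }
    }
}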
PS: Björn is right, IO is unlikely to show up as CPU usage. Run a profiler to work out what the app is actually doing.
IO itself is unlikely to cause CPU load.
The character encoding conversions involved when converting strings from Java's internal representation to UTF-8, ISO-Latin-1, or whatever encoding you are using, however, do cause CPU load.
10000 lines per minute isn't that much, though. Only about 7000 characters per second (for 40 character lines). Or was that 10000 lines for each thread?
When I am using FileInputStream to read an object (say a few bytes), does the underlying operation involve:
1) Reading a whole block of disk, so that if I subsequently do another read operation, it wouldn't require a real disk read because that portion of the file was already fetched in the last read operation?
OR
2) A new disk access taking place because FileInputStream does not do any buffering, and BufferedInputStream should have been used instead to achieve the effect of (1)?
I think that since FileInputStream uses the read system call, and that reads a set of pages from the hard disk, some buffering must take place.
FileInputStream will make an underlying native system call. Most OSes will do their own buffering for this, so it does not need a real disk seek for each byte. But you still have the cost of making the native OS call, which is expensive. So a BufferedInputStream would be preferable. However, for reading small amounts of data (like you say, a few bytes or even KBs), either one should be fine as the number of OS calls won't be that different.
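For example, a sketch of that wrapping (the readHeader method is hypothetical): the BufferedInputStream fetches up to 8 KB per native call, so the small read below is served from memory.

import java.io.*;

public class FewBytes {
    static int readHeader(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(file)))) {
            return in.readInt();   // 4 bytes, served from the 8 KB buffer
        }
    }
}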
Native code for FileInputStream is here: it doesn't look like there is any buffering going on in there. The OS buffering may kick in, but there's no explicit indicator one way or another if/when that happens.
One thing to look out for is reading from a mounted network volume over a slow connection. I ran into a big performance issue using a non-buffered FileInputStream for this. Didn't catch it in development, because the file system was local.
I am writing a program that has to copy a sizeable, but not huge, amount of data from folder to folder (in the range of several dozen photos at once). Originally I was using java.io.FileOutputStream to simply read into a buffer and write out, but then I heard about potential performance increases using java.nio.FileChannel.
I don't have the resources to run a serious, controlled test with the data I have, but there seems to be no consensus on what the advantages of each are (other than FileChannel being thread safe). Some users report FileChannel being great for smaller files, others report huge speed increases with larger files.
I am wondering if anyone knows exactly what the intent of creating FileChannel was in the first place: was it designed for better performance? In what cases? And is there a definitive performance increase for general kinds of data, or are the differences I should expect to see trivial because I am not working with data that is specialized enough?
EDIT: Assume my data does not need to be thread safe.
FileChannel.transferFrom/To should be faster than IO streams for file copying.
Or you can simply use Java 7's java.nio.file.Files.copy(source, target). That should be as fast as it can get.
However, in the end, performance won't be noticeably different - hard disk speed is the bottleneck.
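Two sketches of those options (source/target names are placeholders):

import java.io.*;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class CopyWays {
    // Channel-based copy: transferTo lets the OS move the bytes.
    static void channelCopy(File source, File target) throws IOException {
        try (FileChannel in = new FileInputStream(source).getChannel();
             FileChannel out = new FileOutputStream(target).getChannel()) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }

    // Java 7+: let the library do it.
    static void filesCopy(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}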
FileChannel is not non-blocking, and it is not selectable. Not sure if they are going to add these features in future. Java 7 has AsynchronousFileChannel though.
Input and Output Streams assume a stream styled access to the file or resource. There are a few extra items which help (array reads) but the basic idea is that of a stream where you read in one or more characters at a time (possibly blocking until you have more characters available).
Channels are the means to copy information into Buffers. This provides a lower level of access to input and output routines. With thoughtful buffer sizing, the speed-ups can be impressive. Structuring your code around buffers can reduce the time spent in a read loop (also increasing performance). Finally, while it is possible to do pre-checking of input stream state in an attempt to avoid blocking, Channels and Buffers allow operations to perform in a non-blocking manner (even in the worst conditions).
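A sketch of that channel/buffer style (the path and the 64 KB direct buffer are illustrative choices, not recommendations):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class ChannelRead {
    static long count(Path path) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
            while (channel.read(buffer) != -1) {
                buffer.flip();              // switch to draining what was just filled
                total += buffer.remaining();
                buffer.clear();             // ready for the next read
            }
        }
        return total;
    }
}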
Have you taken a look at commons-io?
FileUtils.copyFileToDirectory(srcFile, destDir);
I am sequentially processing a large file and I'd like to keep a large chunk of it in memory; I have 16 GB of RAM available on a 64-bit system.
A quick and dirty way to do this is to simply wrap the input stream in a buffered input stream; unfortunately, this only gives me a 2 GB buffer. I'd like to have more of it in memory. What alternatives do I have?
How about letting the OS deal with the buffering of the file? Have you checked what the performance impact of not copying the whole file into the JVM's memory is?
EDIT: You could then use either RandomAccessFile or the FileChannel to efficiently read the necessary parts of the file into the JVM's memory.
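For example, something like this hypothetical helper, which reads only the slice currently needed:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PartialRead {
    static ByteBuffer readSlice(String path, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            ByteBuffer slice = ByteBuffer.allocate(length);
            channel.read(slice, offset);   // positional read, no explicit seek
            slice.flip();
            return slice;
        }
    }
}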
Have you considered the MappedByteBuffer in java.nio? It's over my head but maybe it is what you are looking for.
I doubt that buffering more than 2 GB at a time is going to be a huge win anyway. Depending on the amount of processing you're doing, you might be able to read in nearly as fast as you process. To speed it up, you might try using a two-threaded producer-consumer model (one thread reads the file and hands the data off to the other thread for processing).
The OS is going to cache as much of the file as it can, so trying to outsmart the cache manager probably isn't going to get you very much.
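A rough sketch of that two-thread model (chunk size, queue depth and the process() body are all placeholders):

import java.io.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWorker {
    private static final byte[] POISON = new byte[0];   // end-of-file marker

    public static void run(File file) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);

        Thread reader = new Thread(() -> {
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                byte[] buf = new byte[1 << 20];          // 1 MB chunks
                int n;
                while ((n = in.read(buf)) != -1) {
                    byte[] chunk = new byte[n];
                    System.arraycopy(buf, 0, chunk, 0, n);
                    queue.put(chunk);
                }
                queue.put(POISON);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        byte[] chunk;
        while ((chunk = queue.take()) != POISON) {
            process(chunk);                              // consumer work goes here
        }
        reader.join();
    }

    private static void process(byte[] chunk) {
        // placeholder for the real per-chunk processing
    }
}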
From a performance perspective, you will be much better served by keeping the bytes outside the JVM (transferring huge chunks of data between the OS and JVM is relatively slow). You can achieve this goal by using a MappedByteBuffer backed by a direct memory block.
Here's a pertinent how-to type of article: article
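A sketch of the mapping approach (the sum method and the 1 GB window are arbitrary examples; each map() call is limited to about 2 GB, so a larger file needs several windows):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class Mapped {
    static long sum(Path path) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            long window = 1L << 30;                      // 1 GB per mapping
            for (long pos = 0; pos < size; pos += window) {
                long len = Math.min(window, size - pos);
                MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (map.hasRemaining()) {
                    total += map.get() & 0xFF;           // placeholder processing
                }
            }
        }
        return total;
    }
}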
I think there are 64-bit JVMs that will support nonstandard limits.
You might try buffering chunks.