Can I use multiple threads to write to a RandomAccessFile in Java?
I know RandomAccessFile allows reading and writing at any position.
I want to split the file into n portions and let each thread write
its contents to a particular portion.
Will it improve I/O performance?
Eager to hear back soon.
You can open the file twice with the proper sharing specified, so that two RandomAccessFile objects point to the same file. The OS will manage this properly as long as you are careful that two threads never read and write the same location at the same time (the OS will handle that too, but you'll get unexpected results).
However, it will not improve your I/O performance - CPU is almost never the bottleneck when it comes to I/O. What is it you're trying to achieve?
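For reference, here is a minimal sketch of the "one portion per thread" idea from the question. It swaps the two RandomAccessFile objects for a single shared FileChannel with positional writes, which is an assumption of this sketch and not what the answer above describes. The class name PortionWriter, the method writePortions and the fixed portionSize are hypothetical, and each portion is assumed to fit inside its slot. As the answer says, do not expect this to be faster than a single writer.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PortionWriter {

    /**
     * Writes portions[i] at file offset i * portionSize, one task per portion.
     * Assumption: every portion is at most portionSize bytes, so regions never overlap.
     */
    static void writePortions(Path file, byte[][] portions, int portionSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, portions.length));
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < portions.length; i++) {
                final long offset = (long) i * portionSize;
                final ByteBuffer data = ByteBuffer.wrap(portions[i]);
                pending.add(pool.submit(() -> {
                    // Positional write: does not touch the channel's own position,
                    // so tasks do not interfere with each other.
                    long position = offset;
                    while (data.hasRemaining()) {
                        position += channel.write(data, position);
                    }
                    return null;
                }));
            }
            // Wait for every task and surface any I/O failure before closing the channel.
            for (Future<?> task : pending) {
                task.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The regions are kept disjoint on purpose, which matches the warning above about not touching the same location from two places.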
I am trying to read a huge file which contains one word (of varying length) per line.
I want to read it with multiple threads, split by string length.
For example, thread one reads the lines that hold one-letter words, thread two reads the two-letter words, and so on.
Is there any way to achieve this? If so, how would it affect performance?
I found these examples, but I can't put them together.
Reference 1 : Multithread file reading
Reference 2 : How to read files in multithreaded mode?
You can use multiple threads, but it won't be any faster. To find all the lines of a given length you still have to read all the other lines.
Is there any way to achieve this?
Read all the lines and ignore the ones you filter out.
What you can do is process different lines in different threads; whether that helps or makes things slower depends on how CPU-intensive the processing is.
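The "read all the lines and ignore the ones you filter out" advice could look roughly like the sketch below. It is single-threaded on purpose. The class name WordsByLength and the method wordsOfLength are hypothetical, and the file is assumed to be UTF-8 with one word per line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WordsByLength {

    /** Reads every line once and keeps only the words with the requested length. */
    static List<String> wordsOfLength(Path file, int length) throws IOException {
        // Files.lines reads lazily and decodes as UTF-8 by default.
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    .filter(word -> word.length() == length)
                    .collect(Collectors.toList());
        }
    }
}
```

Note that the result list is held in memory, so for a truly huge file you would probably process each matching word as it streams past instead of collecting them.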
Reading a file from multiple threads can only make things slower, since the disk drive has to move its heads between multiple read positions. Instead, move the computational work from the reading thread to worker thread(s).
I am wondering whether there is a way to optimize reading from disk in Java. For example, I want to print the contents of all text files in some directory, but after uppercasing them. I can create another thread to uppercase them, but can I optimize reading by adding other thread(s) to read files too? I mean 2, 3 or more threads to read different files from disk. Is there some optimization for doing this or not? I hope I have explained the problem clearly.
I want to print the contents of all text files
This is most likely your bottleneck. If not, you should focus on what your bottleneck is, as optimising anything else is likely to complicate your code for no benefit.
I can create another thread to uppercase them,
You can, though passing the work to another thread could be more expensive than the uppercasing itself, depending on how you do this.
can I optimize reading by adding other thread(s) to read files too?
Possibly. How many disks do you have? If you have one disk, it can usually only do one thing at a time.
I mean 2, 3 or more threads to read different files from disk.
Most desktop drives can only do one operation at a time.
Is there some optimization for doing this or not?
Yes, but as I said, until you know what your bottleneck is, it's hard to jump to a solution.
I can create another thread to uppercase them
That's actually going in the right direction, but simply making all letters uppercase doesn't take enough time to really matter unless you're processing really large chunks of the file.
That's because the standard single-threaded model of read-then-process means you're either reading data or processing it at any given moment, when you could be doing both at the same time.
For example, you could be creating a series of highly compressed (say, JPEG2000 because it's so CPU intensive) images from a large video stream file. You could have one thread reading frames from the stream, placing them into a queue to process, and then have N threads each processing a frame into an image.
You'd tune the number of threads reading data and the number of threads processing data to keep both your disks and CPUs maximally busy without excess contention.
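A rough sketch of that layout, with one reading thread feeding a bounded queue and N worker threads draining it, might look like this. The class name ReadProcessPipeline, the queue capacity of 1024 and the line-based input are assumptions made for illustration; the frame-to-image example above would put frames on the queue instead of lines.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

class ReadProcessPipeline {

    // Unique sentinel instance, compared by identity, that tells a worker to stop.
    private static final String POISON = new String("<end-of-input>");

    /** The calling thread reads lines; the worker threads apply the given function. */
    static List<String> run(Path file, int workers, Function<String, String> work) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // bounded, so the reader cannot run far ahead
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    for (String item = queue.take(); item != POISON; item = queue.take()) {
                        results.add(work.apply(item));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            threads.add(worker);
        }
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line); // blocks while the workers are behind
            }
        } finally {
            // One sentinel per worker so that every worker terminates.
            for (int i = 0; i < workers; i++) {
                queue.put(POISON);
            }
        }
        for (Thread worker : threads) {
            worker.join();
        }
        return new ArrayList<>(results); // completion order, not input order
    }
}
```

With cheap work such as uppercasing, the queue hand-off will likely cost more than it saves; the layout only pays off when the per-item processing is expensive.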
There are some cases where you can use multiple threads to read from a single file to get better performance. But you need a system designed from the ground up to do that. You need lots of disks (less so if they're SSDs), a pretty substantial IO infrastructure along with a system that has a lot of IO bandwidth, and then you need a file system that can handle multiple simultaneous access to a single file. Then the code you have to write to get better performance from reading using more than one thread has to match things like the physical layout of your files on disk.
That works best if you're doing lots of random reads from a file spread over multiple devices. Like a large, high-powered database server.
For example, let's say I have a huge data file spread over four or five disks (or even RAID arrays), with the file spread out over the disks in 64KB chunks. A handful of threads doing 64KB reads would be ideal to read or write such a file in a random-access mode. Let's say everything is really fast and you can read or write 1 GB/sec from such a file.
But if you turn around and just try to copy that data in a stream, you can still use multiple threads to get maximum performance - say 1 GB/sec - but if you just used a single thread to do read() calls in 1 MB chunks you'd probably get 950 MB/sec - or 95% of maximum multithreaded read performance.
I've actually benchmarked such systems and most of the time, multithreaded IO isn't worth the trouble unless you've invested a lot of money in your hardware and software (open-source file systems tend not to do this very well - you need to get into the realm of IBM's GPFS and Oracle's (née LSC's then Sun's) QFS) and you know exactly what you're doing when you set it up.
I wrote a program to read the contents of a 1 GB file using a simple buffered reader.
I recorded the time from start to finish in order to calculate the elapsed time.
An interesting observation is that on the first run the reading speed came out at about 80-90 MB/s, but when I ran it a second time it read considerably faster, at around 320 MB/s.
I guess this might be a result of memory caching, but I don't know how to fix it.
If caching is the problem, you should be able to use the method detailed here to clear your cache, assuming you're on a Linux system. This method requires super user access.
I guess what you want to do is to compare the speed difference between readline and some other methods of reading in a 1GB file and you are getting conflicting results from running readline a couple of times?
Perhaps randomize the file contents, or read different files.
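To see the effect for yourself, a small timing helper along these lines can be called twice on the same file. This is only a sketch: the class name RepeatedReadTimer and the 64 KB buffer are assumptions, and it reads raw bytes through a BufferedInputStream instead of the buffered reader from the question so that it works on any file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class RepeatedReadTimer {

    /** Reads the whole file and returns the number of bytes seen. */
    static long readAll(Path file) throws IOException {
        long total = 0;
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Times one full read and returns the throughput in MB/s (10^6 bytes per second). */
    static double megabytesPerSecond(Path file) throws IOException {
        long start = System.nanoTime();
        long bytes = readAll(file);
        double seconds = (System.nanoTime() - start) / 1e9;
        return bytes / 1e6 / seconds;
    }
}
```

Calling megabytesPerSecond(file) twice in a row on a large file will typically report a higher figure the second time if the operating system has kept the file in its cache, which is consistent with the numbers in the question.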
I have a big file, more than 1 GB, and I want to search it for occurrences of a certain word.
So I want to split the task over several threads, where each thread handles a portion of the file.
What is the best approach to do this? I thought about reading the file into several buffers of fixed size and passing each thread a buffer.
Is there a better way to do this?
[EDIT] I want to execute each thread on a different device.
A ByteBuffer, say on a RandomAccessFile, would be feasible for files < 2 GB (2^31 bytes).
The general solution would be to use FileChannel, with its MappedByteBuffer.
With several buffers one must take care to have overlapping buffers, so the word can be found on buffer boundaries.
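The overlap requirement can be sketched as follows: each mapped chunk is extended by word.length - 1 bytes, and a match is only counted in the chunk where it starts, so nothing is missed or counted twice. The class name ChunkedSearch, the byte-wise comparison and the sequential loop are assumptions of this sketch; each chunk could just as well be handed to a separate thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChunkedSearch {

    /** Counts occurrences of word in file, scanning it in overlapping mapped chunks. */
    static long countOccurrences(Path file, byte[] word, int chunkSize) throws IOException {
        if (word.length == 0 || chunkSize < 1) {
            throw new IllegalArgumentException("word must be non-empty and chunkSize positive");
        }
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long start = 0; start < size; start += chunkSize) {
                // Map word.length - 1 extra bytes so a match straddling the boundary is visible.
                long length = Math.min(size - start, (long) chunkSize + word.length - 1);
                MappedByteBuffer chunk = channel.map(FileChannel.MapMode.READ_ONLY, start, length);
                count += countIn(chunk, word, chunkSize);
            }
        }
        return count;
    }

    /** Counts matches that start within the first maxStarts bytes of the chunk. */
    static long countIn(ByteBuffer chunk, byte[] word, int maxStarts) {
        long count = 0;
        int lastStart = Math.min(maxStarts - 1, chunk.limit() - word.length);
        for (int i = 0; i <= lastStart; i++) {
            int j = 0;
            while (j < word.length && chunk.get(i + j) == word[j]) {
                j++;
            }
            if (j == word.length) {
                count++;
            }
        }
        return count;
    }
}
```

In practice the chunk size would be far larger than the word, for example tens of megabytes; tiny chunks are only useful for testing the boundary handling.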
Reading the file into the buffers will probably take just as long as just doing the search (the extra processing required to search is tiny compared to the time needed to read the file off the disk - and in fact it may well be able to do that processing in the time it would otherwise just be waiting for data).
Searching multiple locations in the file at once will be very slow on most storage systems.
The real question comes as to whether you are only searching each file once or if you frequently search them. If only once then you have no real choice but to scan the file and take the time. If you are doing it frequently then you could consider indexing the contents somehow.
Consider using Hadoop MapReduce.
If you want to execute threads (= divided tasks) on different devices, the input file should be on a distributed file system such as HDFS (Hadoop Distributed File System). MapReduce is a mechanism to divide one job into multiple tasks and run them on different machines in parallel.
I am writing a program that has to copy a sizeable, but not huge amount of data from folder to folder (in the range of several dozen photos at once). Originally I was using java.io.FileOutputStream to simply read to buffer and write out, but then I heard about potential performance increases using java.nio.FileChannel.
I don't have the resources to run a serious, controlled test with the data I have, but there seems to be no consensus on what the advantages of each are (other than FileChannel being thread safe). Some users report FileChannel being great for smaller files, others report huge speed increases with larger files.
I am wondering if anyone knows exactly what the intent of creating FileChannel was in the first place: was it designed for better performance? In what cases? And is there a definitive performance increase for general kinds of data, or are the differences I should expect to see trivial because I am not working with data that is specialized enough?
EDIT: Assume my data does not need to be thread safe.
FileChannel.transferFrom/To should be faster than IO stream for file copying.
Or you can simply use Java 7's java.nio.file.Files.copy(source, target). That should be as fast as it can get.
However, in the end, performance won't be noticeably different - hard disk speed is the bottleneck.
FileChannel is not non-blocking, and it is not selectable. Not sure if they are going to add these features in future. Java 7 has AsynchronousFileChannel though.
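For comparison, the two copy approaches mentioned in this answer can be sketched like this. The class name CopyExamples and both method names are hypothetical; the calls themselves (Files.copy and FileChannel.transferTo) are standard JDK APIs. The loop around transferTo is there because a single call is allowed to transfer fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

class CopyExamples {

    /** Java 7+ one-liner; overwrites the target if it already exists. */
    static void copyWithFiles(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Channel-to-channel copy using transferTo. */
    static void copyWithChannel(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long size = in.size();
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
        }
    }
}
```

For a few dozen photos either method should do; as noted above, the disk is likely to dominate the run time.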
Input and Output Streams assume a stream styled access to the file or resource. There are a few extra items which help (array reads) but the basic idea is that of a stream where you read in one or more characters at a time (possibly blocking until you have more characters available).
Channels are the means to copy information into Buffers. This provides a lower level of access to input and output routines. With thoughtful buffer sizing, the speed-ups can be impressive. Structuring your code around buffers can reduce the time spent in a read loop (also increasing performance). Finally, while it is possible to do pre-checking of input stream state in an attempt to avoid blocking, Channels and Buffers allow operations to perform in a non-blocking manner (even in the worst conditions).
Have you taken a look at commons-io?
FileUtils.copyFileToDirectory(srcFile, destDir);