Sorting file with multi threads - java

I am sorting a big file by reading it into chunks (ArrayList), sorting each ArrayList using Collections.sort with a custom comparator, writing the sorted results into files, and then applying a merge-sort algorithm over all the files.
Currently I do all of this in one thread.
Will I get any performance boost if I start a new thread for every Collections.sort()?
By this I mean the following:
I read from the file into a List; when the List is full I start a new thread that sorts this List and writes it to a temp file.
Meanwhile I continue to read from the file and start another thread when the list is full again...
Another question that I have:
What is better for sorting:
1) An ArrayList that I fill and, when it's full, sort with Collections.sort()
2) A TreeMap that I fill and don't need to sort (it sorts as I insert items)
NOTE: I use Java 1.5
UPDATE:
This is the code I want to use. The problems are that I am reusing the datalines ArrayList while it is still being used by the threads, and that I need to wait until all threads complete.
How do I fix this?
int MAX_THREADS = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);
List<String> datalines = new ArrayList<String>();
try {
    while (data != null) {
        long currentblocksize = 0;
        while ((currentblocksize <= blocksize) && ((data = getNext()) != null)) {
            datalines.add(data);
            currentblocksize += data.length();
        }
        executor.submit(new Runnable() {
            public void run() {
                Collections.sort(datalines, mycomparator);
                vector.add(datalines);
            }
        });
    }
} finally {
    // ...
}

I suggest you to implement the following scheme, known as a farm:
worker0
reader --> worker1 --> writer
...
workerN
Thus, one thread reads a chunk from the file and hands it to a worker thread (best practice is to manage the workers with an ExecutorService) to sort it, and then each worker sends its output to the writer thread to put in a temp file.
Edit: Ok, I've looked at your code. To fix the issue with the shared datalines, you can have a private member for each thread that stores the current datalines array that the thread needs to sort:
public class ThreadTask implements Runnable {
    private List<String> datalines = new ArrayList<String>();

    public ThreadTask(List<String> datalines) {
        // copy the lines so the caller can safely reuse its own list
        this.datalines.addAll(datalines);
    }

    public void run() {
        Collections.sort(datalines, mycomparator);
        synchronized (vector) {
            vector.add(datalines);
        }
    }
}
You also need to synchronize access to the shared vector collection.
Then, to wait for all tasks in the ExecutorService to finish, first shut the executor down (so it stops accepting new tasks) and then await termination:
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
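Putting the pieces together, here is a minimal, self-contained sketch of the whole farm. The class and method names (`SortFarm`, `sortChunks`) are illustrative, and natural string ordering stands in for `mycomparator`; anonymous `Runnable`s are used so it stays Java 1.5 compatible:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SortFarm {

    // Sorts each chunk on a worker thread. Each task captures its own list,
    // so nothing is shared between tasks except the synchronized result list.
    static List<List<String>> sortChunks(List<List<String>> chunks)
            throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        final List<List<String>> sorted =
                Collections.synchronizedList(new ArrayList<List<String>>());
        for (List<String> chunk : chunks) {
            final List<String> datalines = chunk; // per-task reference
            executor.submit(new Runnable() {
                public void run() {
                    Collections.sort(datalines);
                    sorted.add(datalines);
                }
            });
        }
        executor.shutdown();                      // stop accepting new tasks
        executor.awaitTermination(30, TimeUnit.SECONDS);
        return sorted;
    }

    public static void main(String[] args) throws InterruptedException {
        // In the real program each chunk would come from reading the big file.
        List<List<String>> chunks = Arrays.asList(
                new ArrayList<String>(Arrays.asList("pear", "apple")),
                new ArrayList<String>(Arrays.asList("fig", "banana")));
        System.out.println(sortChunks(chunks));
    }
}
```

Note that the order in which chunks land in `sorted` is not deterministic; that is fine, because the final merge pass doesn't care about chunk order.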

Whether using threads will speed things up depends on whether you're limited by disk I/O or by CPU speed. This depends how fast your disks are (SSD is much faster than spinning hard disks), and on how complex your comparison function is. If the limit is disk I/O, then there's no point in adding threads or worrying about data structures, because those won't help you read the data from disk any faster. If the limit is CPU speed, you should run a profiler first to make sure your comparison function isn't doing anything slow and silly.

The answer to the first question is - yes. You will gain performance boost if you implement a parallelised version of the Merge Sort. More about this in this Dr.Dobbs article: http://drdobbs.com/parallel/229400239 .

If your process is CPU bound (which I suspect it's not) you may see an improvement using multiple threads. If your process is IO bound, you need to improve your IO bandwidth and operation speed instead.

Parallelizing a sequential operation will improve performance in three cases:
You have a CPU-bound application, and have multiple cores that can do work without coordination. In this case, each core can do its work and you'll see linear speedup. If you don't have multiple cores, however, multi-threading will actually slow you down.
You have an IO-bound application, in which you're performing IO via independent channels. This is the case with an application server interacting with multiple sockets. The data on a given socket is relatively unimpeded by whatever's happening on other sockets. It is generally not the case with disk IO, unless you can ensure that your disk operations are going to separate spindles, and potentially separate controllers. You generally won't see much of a speedup here, because the application will still be spending much of its time waiting. However, it can lead to a much cleaner programming model.
You interleave IO and CPU. In this case one thread can be performing the CPU-intensive operation while the other thread waits on IO. The speedup, if any, depends on the balance between CPU and IO in the application; in many (most) cases, the CPU contribution is negligible compared to IO.
You describe case #3, and to determine the answer you'd need to measure your CPU versus IO. One way to do this is with a profiler: if 90% of your time is in FileInputStream.read(), then you're unlikely to get a speedup. However, if 50% of your time is there, and 50% is in Arrays.sort(), you will.
However, I saw one of your comments where you said that you're parsing the lines inside the comparator. If that's the case, and Arrays.sort() is taking a significant amount of time, then I'm willing to bet that you'd get more of a speed boost by parsing on read.


does multi threading improve performance? scenario java [duplicate]

This question already has answers here:
Does multi-threading improve performance? How?
(2 answers)
Closed 8 years ago.
I have a List<Object> objectsToProcess. Let's say it contains 1,000,000 items. You then process each item in the list like this:
for(Object : objectsToProcess){
Go to database retrieve data.
process
save data
}
My question is: would multi-threading improve performance? I would have thought that threads are allocated by the processor by default anyway?
In the described scenario, given that process is a time-consuming task, and given that the CPU has more than one core, multi-threading will indeed improve the performance.
The processor is not the one who allocates the threads. The processor is the one who provides the resources (virtual CPUs / virtual processors) that can be used by threads by providing more than one execution unit / execution context. Programs need to create multiple threads themselves in order to utilize multiple CPU cores at the same time.
The two major reasons for multi-threading are:
Making use of multiple CPU cores which would otherwise be unused or at least not contribute to reducing the time it takes to solve a given problem - if the problem can be divided into subproblems which can be processed independently of each other (parallelization possible).
Making the program act and react on multiple things at the same time (i.e. Event Thread vs. Swing Worker).
There are programming languages and execution environments in which threads are created automatically to process problems that can be parallelized. Java is not (yet) one of them, but since Java 8 it's well on the way, and Java 9 may bring even more.
Usually you do not want significantly more threads than the CPU provides cores, for the simple reason that thread-switching and thread-synchronization are overhead that slows things down.
The package java.util.concurrent provides many classes that help with typical problems of multithreading. What you want is an ExecutorService to which you assign the tasks that should be run and completed in parallel. The class Executors provides factory methods for creating popular types of ExecutorServices. If your problem just needs to be solved in parallel, you might want to go for Executors.newCachedThreadPool(). If your problem is urgent, you might want to go for Executors.newWorkStealingPool().
Your code thus could look like this:
final ExecutorService service = Executors.newWorkStealingPool();
for (final Object object : objectsToProcess) {
    service.submit(() -> {
        Go to database retrieve data.
        process
        save data
    });
}
Please note that the sequence in which the objects would be processed is no longer guaranteed if you go for this approach of multithreading.
If your objectsToProcess are something which can provide a parallel stream, you could also do this:
objectsToProcess.parallelStream().forEach(object -> {
Go to database retrieve data.
process
save data
});
This will leave the decisions about how to handle the threads to the VM, which often will be better than implementing the multi-threading ourselves.
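To make the parallel-stream sketch above concrete, here is a runnable version in which the database round-trip and the processing step are replaced by a trivial stand-in transformation (the class name and `process` method are invented for the example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelProcessing {
    // Stand-in for "retrieve, process, save": just upper-cases each item.
    static String process(String object) {
        return object.toUpperCase();
    }

    static List<String> processAll(List<String> objectsToProcess) {
        // The VM decides how to split the work across threads; collect()
        // restores the encounter order in the result even though the items
        // are processed in no particular order.
        return objectsToProcess.parallelStream()
                .map(ParallelProcessing::process)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a", "b", "c")));
        // [A, B, C]
    }
}
```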
Further reading:
http://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html#executing_streams_in_parallel
http://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html
Depends on where the time is spent.
If you have a load of calculations to do then allocating work to more threads can help, as you say each thread may execute on a separate CPU. In such a situation there is no value in having more threads than CPUs. As Corbin says you have to figure out how to split the work across the threads and have responsibility for starting the threads, waiting for completion and aggregating the results.
If, as in your case, you are waiting for a database, then there can be additional value in using threads. A database can serve several requests in parallel (the database server itself is multi-threaded), so instead of coding
for(Object : objectsToProcess){
Go to database retrieve data.
process
save data
}
Where you wait for each response before issuing the next, you want to have several worker threads each performing
Go to database retrieve data.
process
save data
Then you get better throughput. The trick, though, is not to have too many worker threads. There are several reasons for that:
Each thread uses some resources; it has its own stack and its own connection to the database. You would not want 10,000 such threads.
Each request uses resources on the server: each connection uses memory, and a database server will only serve so many requests in parallel. There is no benefit in submitting thousands of simultaneous requests if it can only serve tens of them in parallel. Also, if the database is shared, you probably don't want to saturate it with your requests; you need to be a "good citizen".
Net: you will almost certainly get benefit by having a number of worker threads. The number of threads that helps will be determined by factors such as the number of CPUs you have and the ratio between the amount of processing you do and the response time from the DB. You can only really determine that by experiment, so make the number of threads configurable and investigate. Start with say 5, then 10. Keep your eye on the load on the DB as you increase the number of threads.
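As a concrete sketch of the worker-pool idea, with a configurable thread count: the database retrieve step is faked with an in-memory stand-in, and names like `DbWorkers` are invented for the example. Collecting the `Future`s and calling `get()` in submission order keeps the results ordered even though the work runs in parallel:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DbWorkers {
    // Stand-in for "go to database retrieve data".
    static String retrieve(int id) {
        return "row-" + id;
    }

    static List<String> processAll(List<Integer> ids, int nThreads)
            throws InterruptedException, ExecutionException {
        // nThreads is the configurable knob discussed above: start small,
        // increase it, and watch the load on the DB.
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<String>> futures = new ArrayList<Future<String>>();
        for (final int id : ids) {
            futures.add(pool.submit(new Callable<String>() {
                public String call() {
                    String data = retrieve(id);   // go to database
                    return data.toUpperCase();    // "process" + "save"
                }
            }));
        }
        List<String> results = new ArrayList<String>();
        for (Future<String> f : futures) {
            results.add(f.get());                 // preserves submission order
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(Arrays.asList(1, 2, 3), 5));
        // [ROW-1, ROW-2, ROW-3]
    }
}
```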

Java: Asynchronous concurrent writes to disk

I have a function in my Main thread which will write some data to disk. I don't want my Main thread to block (disk I/O has high latency), and creating a new thread just for the write is overkill. I have decided to use ExecutorService.
ExecutorService executorService = Executors.newFixedThreadPool(3);
Future future = executorService.submit(new Callable<Boolean>() {
public Boolean call() throws Exception {
logger.log(Level.INFO, "Writing data to disk");
return writeToDisk();
}
});
writeToDisk is the function which will write to disk
Is it a nice way to do? Could somebody refer better approach if any?
UPDATE: Data size will be greater than 100 MB. Disk bandwidth is 40 MBps, so the write operation could take a couple of seconds. I don't want the calling function to block, as it has other jobs to do, so I am looking for a way to schedule the disk I/O asynchronously to the execution of the calling thread.
I need to delegate the task and forget about it!
Your code looks good. Anyway, I've used AsynchronousFileChannel from the newer non-blocking IO; the implementation uses a MappedByteBuffer through FileChannel. It might give you the performance that @Chris stated. Below is a simple example:
public static void main(String[] args) {
    String filePath = "D:\\tmp\\async_file_write.txt";
    Path file = Paths.get(filePath);
    try (AsynchronousFileChannel asyncFile = AsynchronousFileChannel.open(file,
            StandardOpenOption.WRITE,
            StandardOpenOption.CREATE)) {
        asyncFile.write(ByteBuffer.wrap("Some text to be written".getBytes()), 0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
There are two approaches that I am aware of, spin up a thread (or use a pool of threads as you have) or memory map a file and let the OS manage it. Both are good approaches, memory mapping a file can be as much as 10x faster than using Java writers/streams and it does not require a separate thread so I often bias towards that when performance is key.
Either way, as a few tips to optimize disk writing try to preallocate the file where possible. Resizing a file is expensive. Spinning disks do not like seeking, and SSDs do not like mutating data that has been previously written.
I wrote some benchmarks to help me explore this area awhile back, feel free to run the benchmarks yourself. Amongst them is an example of memory mapping a file.
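For reference, here is a small sketch combining the memory-mapping approach with the preallocation tip from above. The details (single mapping for the whole payload, a UTF-8 test string, the `MappedWrite` name) are assumptions for the example, not taken from the benchmarks mentioned:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedWrite {
    // Preallocates the file to its final size, then writes through a mapping.
    static void writeMapped(Path path, byte[] data) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw");
        try {
            raf.setLength(data.length);  // preallocate: avoids costly resizing
            MappedByteBuffer buf = raf.getChannel()
                    .map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            buf.put(data);  // bytes land in the page cache; the OS flushes them
            buf.force();    // optionally force the flush to disk now
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mapped", ".bin");
        writeMapped(path, "hello mapped world".getBytes("UTF-8"));
        System.out.println(Files.readAllBytes(path).length); // 18
    }
}
```

The `put` call returns as soon as the bytes are in the mapping, so the caller is not blocked on the physical write; drop the `force()` call if you are happy to let the OS flush lazily.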
I would agree with Õzbek to use a non-blocking approach. Yet, as pointed by Dean Hiller, we cannot close the AsynchronousFileChannel before-hand using a try with resources because it may close the channel before the write has completed.
Thus, I would add a CompletionHandler on asyncFile.write(…,new CompletionHandler< >(){…} to track completion and close the underlying AsynchronousFileChannel after conclusion of write operation. To simplify the use I turned the CompletionHandler into a CompletableFuture which we can easily chain with a continuation to do whatever we want after write completion. The final auxiliary method is CompletableFuture<Integer> write(ByteBuffer bytes) which returns a CompletableFuture of the final file index after the completion of the corresponding write operation.
I placed all this logic in an auxiliary class AsyncFiles that can be used like this:
Path path = Paths.get("output.txt");
AsyncFiles
    .write(path, bytes)
    .thenAccept(index -> { /* called on completion from a background thread */ });
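A minimal version of such an AsyncFiles helper might look like the following. The implementation details are an assumption based on the description above; a real version would also support appending at arbitrary positions and chaining multiple writes:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public class AsyncFiles {
    // Writes the bytes at position 0 and closes the channel only after the
    // write has completed, by bridging the CompletionHandler into a
    // CompletableFuture. Completes with the number of bytes written.
    public static CompletableFuture<Integer> write(Path path, byte[] bytes)
            throws IOException {
        final AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                path, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
        final CompletableFuture<Integer> result = new CompletableFuture<Integer>();
        channel.write(ByteBuffer.wrap(bytes), 0, null,
                new CompletionHandler<Integer, Void>() {
            public void completed(Integer written, Void att) {
                closeQuietly(channel);            // safe: the write is done
                result.complete(written);
            }
            public void failed(Throwable exc, Void att) {
                closeQuietly(channel);
                result.completeExceptionally(exc);
            }
        });
        return result;
    }

    private static void closeQuietly(AsynchronousFileChannel ch) {
        try { ch.close(); } catch (IOException ignored) { }
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("async", ".txt");
        int n = write(path, "hello".getBytes()).join();
        System.out.println(n); // 5
    }
}
```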

Java ExecutorService - sometimes slower than sequential processing?

I'm writing a simple utility which accepts a collection of Callable tasks, and runs them in parallel. The hope is that the total time taken is little over the time taken by the longest task. The utility also adds some error handling logic - if any task fails, and the failure is something that can be treated as "retry-able" (e.g. a timeout, or a user-specified exception), then we run the task directly.
I've implemented this utility around an ExecutorService. There are two parts:
submit() all the Callable tasks to the ExecutorService, storing the Future objects.
in a for-loop, get() the result of each Future. In case of exceptions, do the "retry-able" logic.
I wrote some unit tests to ensure that using this utility is faster than running the tasks in sequence. For each test, I'd generate a certain number of Callable's, each essentially performing a Thread.sleep() for a random amount of time within a bound. I experimented with different timeouts, different number of tasks, etc. and the utility seemed to outperform sequential execution.
But when I added it to the actual system which needs this kind of utility, I saw results that were very variable - sometimes the parallel execution was faster, sometimes it was slower, and sometimes it was faster, but still took a lot more time than the longest individual task.
Am I just doing it all wrong? I know ExecutorService has invokeAll() but that swallows the underlying exceptions. I also tried using a CompletionService to fetch task results in the order in which they completed, but it exhibited more or less the same behavior. I'm reading up now on latches and barriers - is this the right direction for solving this problem?
I wrote some unit tests to ensure that using this utility is faster than running the tasks in sequence. For each test, I'd generate a certain number of Callable's, each essentially performing a Thread.sleep() for a random amount of time within a bound
Yeah this is certainly not a fair test since it is using neither CPU nor IO. I certainly hope that parallel sleeps would run faster than serial. :-)
But when I added it to the actual system which needs this kind of utility, I saw results that were very variable
Right. Whether or not a threaded application runs faster than a serial one depends a lot on a number of factors. In particular, IO bound applications will not improve in performance since they are bound by the IO channel and really cannot do concurrent operations because of this. The more processing that is needed by the application, the larger the win is to convert it to be multi-threaded.
Am I just doing it all wrong?
Hard to know without more details. You might consider playing around with the number of threads running concurrently. If you have a ton of jobs to process, you should not be using Executors.newCachedThreadPool() and should instead tune Executors.newFixedThreadPool(...) depending on the number of CPUs your architecture has.
You also may want to see if you can isolate the IO operations in a few threads and the processing to other threads. Like one input thread reading from a file and one output thread (or a couple) writing to the database or something. So multiple sized pools may do better for different types of tasks instead of using a single thread-pool.
tried using a CompletionService to fetch task results in the order in which they completed
If you are retrying operations, using a CompletionService is exactly the way to go. As jobs finish and throw exceptions (or return failure), they can be restarted and put back into the thread-pool immediately. I don't see any reason why your performance problems would be because of this.
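To illustrate the resubmit-on-failure loop with an ExecutorCompletionService: the flaky task and the `RetryDemo` name are invented for the example, and a real version would cap the number of retries rather than loop until success:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class RetryDemo {
    // Runs one task via a CompletionService, resubmitting it each time it
    // fails (a real version would limit retries and distinguish retry-able
    // failures from fatal ones).
    static <T> T submitWithRetry(Callable<T> task) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<T> cs = new ExecutorCompletionService<T>(pool);
        cs.submit(task);
        try {
            while (true) {
                Future<T> done = cs.take();      // next finished task, any order
                try {
                    return done.get();
                } catch (ExecutionException e) { // retry-able: resubmit at once
                    cs.submit(task);
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fails on the first attempt, then succeeds: stands in for a timeout.
        final AtomicBoolean failedOnce = new AtomicBoolean(false);
        int result = submitWithRetry(new Callable<Integer>() {
            public Integer call() throws Exception {
                if (failedOnce.compareAndSet(false, true))
                    throw new TimeoutException("transient failure");
                return 42;
            }
        });
        System.out.println(result); // 42
    }
}
```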
Multi-threaded programming doesn't come for free. It has an overhead, and that overhead can easily exceed any performance gain; it usually makes your code more complex too.
Additional threads give access to more CPU power (assuming you have spare CPUs), but in general they won't make your HDD spin faster, give you more network bandwidth, or speed up anything that is not CPU bound.
Multiple threads can help give you a greater share of an external resource.

The best way to divide work between threads when processing a queue of documents

We have an application which processes a queue of documents (basically all the documents found in an input directory). The documents are read in one by one and are then processed. The application is an obvious candidate for threading since the results from processing one document are completely independent from the results of processing any other document. The question I have is how to divide the work.
One obvious way to split the work is to count the number of documents in the queue, divide by the number of available processors and split the work accordingly (example, the queue has 100 documents and I have 4 available processors, I create 4 threads and feed 25 documents from the queue to each thread).
However, a coworker suggests that I could just spawn a thread for each document in the queue and let the JVM sort it out. I don't understand how this could work. I do get that the second method results in cleaner code, but is it just as efficient (or even more efficient) than the first method?
Any thoughts would be appreciated.
Elliott
We have an application which processes a queue of documents ... how to divide the work?
You should use the great ExecutorService classes. Something like the following would work. You would submit each of your files to the thread-pool and they will be processed by the 10 working threads.
// create a pool with 10 threads
ExecutorService threadPool = Executors.newFixedThreadPool(10);
for (String file : files) {
    threadPool.submit(new MyFileProcessor(file));
}
// shutdown the pool once you've submitted your last job
threadPool.shutdown();
...
public class MyFileProcessor implements Runnable {
    private String file;

    public MyFileProcessor(String file) {
        this.file = file;
    }

    public void run() {
        // process the file
    }
}
In general, there are three ways to do work-splitting among threads.
First, static partitioning. This is where you count the documents and divide them statically (i.e., without taking into account how long it will take to process each document). This approach is very efficient (and often easy to code); however, it can result in poor performance if documents take different amounts of time to process. One thread can accidentally get stuck with all the long documents, which means it will run the longest and limit your parallelism.
Second, dynamic partitioning (you did not mention this). Spawn a fixed number of threads and let each thread work in a simple loop:
While not done:
Dequeue a document
Process document
In this manner you avoid load imbalance. You incur the overhead of accessing the queue after processing each document, but that will be negligible as long as each document's processing takes substantially longer than a queue access (which, I think, it will in your case).
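A sketch of that dynamic-partitioning loop, using a BlockingQueue and one poison pill per worker to mark the end of the queue. The names and the `toUpperCase` stand-in for "process document" are invented for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DynamicPartition {
    static List<String> processAll(List<String> docs, int nWorkers)
            throws InterruptedException {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(docs);
        final String POISON = "__DONE__";     // one per worker marks the end
        final List<String> processed =
                Collections.synchronizedList(new ArrayList<String>());
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int i = 0; i < nWorkers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            String doc = queue.take();        // dequeue a document
                            if (doc == POISON) return;        // no more work
                            processed.add(doc.toUpperCase()); // "process" it
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        for (int i = 0; i < nWorkers; i++) queue.put(POISON);
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = processAll(
                java.util.Arrays.asList("a", "b", "c", "d"), 2);
        System.out.println(out.size()); // 4
    }
}
```

Each idle worker simply grabs the next document, so a thread that drew short documents naturally picks up more of them, which is exactly how the load imbalance is avoided.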
Third, let the JVM do your work-scheduling. This is where you spawn N threads and let them fight it out. This approach is rather simple, but its downside is that you rely heavily on the JVM's thread scheduling, and it can be very slow if the JVM doesn't do a great job of it. Having too many threads that thrash each other can be very slow. I hope the JVM is better than that, so this may be worth a try.
Hope this helps.
Don't spawn a thread for each document; instead, schedule Runnable tasks on a thread pool that has, e.g., as many threads as processors.
You don't need to split the documents that way. Just create a fixed number of worker threads (i.e. create two worker threads using Executors.newFixedThreadPool(2)), and each can only process one document at a time. When it has finished processing one document, it grabs a new document from a shared list.

Multithreading approach to find text pattern in files

Consider simple Java application which should traverse files tree in a disc to find specific pattern in the body of the file.
I'm wondering whether it is possible to achieve better performance using multi-threading; for example, when we find a new folder we submit a new Runnable to a fixed ThreadPool, and that Runnable task traverses the folder to find new folders, and so on. In my opinion this operation should be IO bound, not CPU bound, so spawning a new Thread would not improve performance.
Does it depend on the hard drive type (HDD, SSD, etc.)?
Does it depend on the OS type?
IMHO the only thing that can be parallelized is spawning new threads for parsing file content to find the pattern in the file body.
What is the common pattern to solve this problem? Should it be multi-threaded or single-threaded?
I've performed some research in this area while working on a test project; you can look at the project on GitHub at http://github.com/4ndrew/filesearcher. Of course the main problem is disk I/O speed, but if you use an optimal number of threads to perform the reading/searching in parallel, you can generally get better results.
UPD: Also look at this article http://drdobbs.com/parallel/220300055
I did some experiments on just this question some time ago. In the end I concluded that I could achieve a far better improvement by changing the way I accessed the file.
Here's the file walker I eventually ended up using:
// 4k buffer size ... near-optimal for Windows.
static final int SIZE = 4 * 1024;
// Reusable transfer buffer (this declaration was missing from the original listing).
static final byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter h, FileInputStream f)
        throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    FileChannel ch = f.getChannel();
    // How much I've read.
    long red = 0L;
    do {
        // How much to read this time around.
        long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        // Map a byte buffer to the file.
        MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        // How much to get.
        int nGet;
        // Walk the buffer to the end or until the hunter has finished.
        while (mb.hasRemaining() && h.ok()) {
            // Get a max of 4k.
            nGet = Math.min(mb.remaining(), SIZE);
            // Get that much.
            mb.get(buffer, 0, nGet);
            // Offer each byte to the hunter.
            for (int i = 0; i < nGet && h.ok(); i++) {
                h.check(buffer[i]);
            }
        }
        // Keep track of how far we've got.
        red += read;
        // Stop at the end of the file.
    } while (red < ch.size() && h.ok());
    // Finish off.
    h.close();
    ch.close();
    f.close();
}
You stated it right that you need to determine whether your task is CPU or IO bound and then decide whether it could benefit from multithreading. Generally disk operations are pretty slow, so depending on the amount of data you need to parse and the parsing complexity, you might not benefit much from multithreading. I would just write a simple test - read the files without parsing in a single thread, measure it, then add parsing and see if it's much slower, and then decide.
Perhaps a good design would be to use two threads - one reader thread that reads the files and places data in a (bounded) queue, and another thread (or better, an ExecutorService) that parses the data. That would give you a nice separation of concerns, and you could always tweak the number of threads doing the parsing. I'm not sure it makes much sense to read the disk with multiple threads (unless you need to read from multiple physical disks, etc.).
What you could do is this: implement a single-producer multi-consumer pattern, where one thread searches the disk, retrieves files and then the consumer threads process them.
You are right that in this case using multiple threads to scan the disk would not be beneficial; in fact it would probably degrade performance, since the disk needs to seek to the next reading position every time, so you end up bouncing the disk head back and forth between the threads.
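A sketch of that single-producer, multi-consumer layout: one thread (here, the caller) walks the directory and feeds file paths to a queue, while worker threads scan the file bodies for the pattern. Names are invented, a sentinel object marks the end of the work, and it scans only one directory level for brevity:

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PatternSearch {
    // Single producer walks the directory; nWorkers consumers scan the files.
    static List<Path> search(final Path root, final String pattern, int nWorkers)
            throws Exception {
        final BlockingQueue<Path> files = new LinkedBlockingQueue<Path>();
        final Path POISON = root;            // sentinel: the root dir itself
        final List<Path> matches = new CopyOnWriteArrayList<Path>();

        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        for (int i = 0; i < nWorkers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            Path p = files.take();
                            if (p == POISON) return;
                            // CPU-ish part: scan the file body for the pattern.
                            if (new String(Files.readAllBytes(p)).contains(pattern))
                                matches.add(p);
                        }
                    } catch (Exception e) {
                        // a real version would log and continue
                    }
                }
            });
        }
        // Producer: a single thread reads the directory, avoiding seek thrash.
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(root)) {
            for (Path p : ds) {
                if (Files.isRegularFile(p)) files.put(p);
            }
        }
        for (int i = 0; i < nWorkers; i++) files.put(POISON);
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return matches;
    }

    public static void main(String[] args) throws Exception {
        Path root = Files.createTempDirectory("search");
        Files.write(root.resolve("a.txt"), "a needle in here".getBytes());
        Files.write(root.resolve("b.txt"), "nothing".getBytes());
        System.out.println(search(root, "needle", 2).size()); // 1
    }
}
```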
