I have a project about concurrency and I am having trouble with the behaviour of my code. I walk a file tree to find all files, and whenever I find a file whose name ends in .txt I submit a task to the executor. A thread opens the file and finds the biggest number in it. I then create an object that holds the file's path and its biggest number, and append that object to a synchronized ArrayList. But when I run the code, the list sometimes has 1 object in it, or 5, or 112, or 64. There should be 140 objects every time I run it. I hope someone can tell me what the problem is.
public static List< Result > AllFiles( Path dir ) throws InterruptedException{
final List<Result> resultlist = new ArrayList<Result>();
final List<Result> synclist;
synclist = Collections.synchronizedList(resultlist);
ExecutorService exec
= Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);
try {
Files.walk(dir).forEach(i -> {
String pathfile = i.getFileName().toString();
if (pathfile.contains(".txt")) {
exec.submit(() -> {
int high = findHighest(i);
ResultObj obj = new ResultObj(i, high);
synclist.add(obj);
});
}
});
exec.shutdown();
try {
exec.awaitTermination(1, TimeUnit.NANOSECONDS);
} catch (InterruptedException ex) {}
} catch (IOException ex) {}
System.out.println(synclist);
System.out.println(synclist.size());
return synclist;
}
You are only waiting 1 nanosecond in your awaitTermination call for your ExecutorService to shut down. As a result, you may be printing synclist before some of your files have been processed.
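A minimal sketch of the fix, assuming a generous upper bound on the wait is acceptable. The class name, the `runAll` helper, the pool size and the one-hour limit are illustrative assumptions, not taken from the question; the point is only that `awaitTermination` must be given enough time and its return value should be checked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AwaitAllTasks {
    // Hypothetical helper: submits taskCount tasks and returns only after all have run.
    static List<Integer> runAll(int taskCount) throws InterruptedException {
        List<Integer> results = Collections.synchronizedList(new ArrayList<>());
        ExecutorService exec = Executors.newFixedThreadPool(4);
        for (int i = 0; i < taskCount; i++) {
            final int n = i;
            exec.submit(() -> { results.add(n); });
        }
        exec.shutdown(); // stop accepting new tasks; already submitted tasks still run
        // Wait long enough for the work to finish; false means the timeout expired first.
        if (!exec.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(140).size());
    }
}
```

With the wait in place, the list is read only after every submitted task has added its result.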
So I have a large text file, in this case it's roughly 4.5 GB, and I need to process the entire file as fast as is possible. Right now I have multi-threaded this using 3 threads (not including the main thread). An input thread for reading the input file, a processing thread to process the data, and an output thread to output the processed data to a file.
Currently, the bottleneck is the processing section. Therefore, I'd like to add more processing threads into the mix. However, this creates a situation where I've got multiple threads accessing the same BlockingQueue, and their results are therefore not maintaining the order of the input file.
An example of the functionality I'm looking for would be something like this:
Input file: 1, 2, 3, 4, 5
Output file: ^ the same. Not 2, 1, 4, 3, 5 or any other combination.
I've written a dummy program that is identical in functionality to the actual program minus the processing part, (I can't give you the actual program due to the processing class containing info that is confidential). I should also mention, all of the classes (Input, Processing, and Output) are all Inner classes contained within a Main class that contains the initialise() method and the class level variables mentioned in the main thread code listed below.
Main thread:
static volatile boolean readerFinished = false; // class level variables
static volatile boolean writerFinished = false;
private void initialise() throws IOException {
BlockingQueue<String> inputQueue = new LinkedBlockingQueue<>(1_000_000);
BlockingQueue<String> outputQueue = new LinkedBlockingQueue<>(1_000_000); // capacity 1 million.
String inputFileName = "test.txt";
String outputFileName = "outputTest.txt";
BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
BufferedWriter writer = new BufferedWriter(new FileWriter(outputFileName));
Thread T1 = new Thread(new Input(reader, inputQueue));
Thread T2 = new Thread(new Processing(inputQueue, outputQueue));
Thread T3 = new Thread(new Output(writer, outputQueue));
T1.start();
T2.start();
T3.start();
while (!writerFinished) {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
reader.close();
writer.close();
System.out.println("Exited.");
}
Input thread: (Please forgive the commented debug code, was using it to ensure the reader thread was actually executing properly).
class Input implements Runnable {
BufferedReader reader;
BlockingQueue<String> inputQueue;
Input(BufferedReader reader, BlockingQueue<String> inputQueue) {
this.reader = reader;
this.inputQueue = inputQueue;
}
@Override
public void run() {
String poisonPill = "ChH92PU2KYkZUBR";
String line;
//int linesRead = 0;
try {
while ((line = reader.readLine()) != null) {
inputQueue.put(line);
//linesRead++;
/*
if (linesRead == 500_000) {
//batchesRead += 1;
//System.out.println("Batch read");
linesRead = 0;
}
*/
}
inputQueue.put(poisonPill);
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}
readerFinished = true;
}
}
Processing thread: (Normally this would actually be doing something to the line, but for purposes of the mockup I've just made it immediately push to the output thread). If necessary we can simulate it doing some work by making the thread sleep for a small amount of time for each line.
class Processing implements Runnable {
BlockingQueue<String> inputQueue;
BlockingQueue<String> outputQueue;
Processing(BlockingQueue<String> inputQueue, BlockingQueue<String> outputQueue) {
this.inputQueue = inputQueue;
this.outputQueue = outputQueue;
}
@Override
public void run() {
while (true) {
try {
if (inputQueue.isEmpty() && readerFinished) {
break;
}
String line = inputQueue.take();
outputQueue.put(line);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
Output thread:
class Output implements Runnable {
BufferedWriter writer;
BlockingQueue<String> outputQueue;
Output(BufferedWriter writer, BlockingQueue<String> outputQueue) {
this.writer = writer;
this.outputQueue = outputQueue;
}
@Override
public void run() {
String line;
ArrayList<String> outputList = new ArrayList<>();
while (true) {
try {
line = outputQueue.take();
if (line.equals("ChH92PU2KYkZUBR")) {
for (String outputLine : outputList) {
writer.write(outputLine);
}
System.out.println("Writer finished - executing termination");
writerFinished = true;
break;
}
line += "\n";
outputList.add(line);
if (outputList.size() == 500_000) {
for (String outputLine : outputList) {
writer.write(outputLine);
}
System.out.println("Writer wrote batch");
outputList = new ArrayList<>();
}
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}
}
}
}
So right now the general data flow is very linear, looking something like this:
Input > Processing > Output.
But what I'd like to have is something like this:
But the catch is, when the data gets to output, it either needs to be sorted into the correct order, or it needs to already be in the correct order.
Recommendations or examples on how to go about this would be greatly appreciated.
In the past I have used the Future and Callable interfaces to solve a task involving parallel data flows like this, but unfortunately that code was not reading from a single queue, and so is of minimal help here.
I should also add, for those of you that will notice this, batchSize and poisonPill are normally defined in the main thread and then passed around via variables, they are not usually hard coded as they are in the code for Input thread, and the output checks for the writer thread. I was just a wee bit lazy when writing the mockup for experimentation at ~1am.
Edit: I should also mention, this is required to use Java 8 at most. Java 9 features and above cannot be used due to these versions not being installed in the environments in which this program will be run.
What you could do:
Take X threads for processing, where X is the number of cores available for processing
Give each thread its own input queue.
The reader thread gives records to each thread's input queue round-robin in a predictable fashion.
Since the output files are too big for memory, you write X output files, one for each thread, and each file name has the index of the thread in it, so that you can reconstitute the original order from the file names.
After the process is complete, you merge the X output files. One line from the file for thread 1, one from the files for thread 2, etc. in a round-robin fashion again. This reconstitutes the original order.
As an added bonus, since you have an input queue per thread, you don't have lock contention on the queue between readers. (only between the reader and the writer) You could even optimize this by putting things in the input queues in batches larger than 1.
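The round-robin dealing and merging described above can be sketched as follows. This is an in-memory illustration only: the lists stand in for the per-thread queues and per-thread output files, and the class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class RoundRobinMerge {
    // Deal the records to `workers` buckets in a fixed rotation: 0, 1, ..., workers-1, 0, ...
    static List<List<String>> split(List<String> records, int workers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            buckets.get(i % workers).add(records.get(i));
        }
        return buckets;
    }

    // Take one record from each bucket in the same rotation to restore the input order.
    static List<String> merge(List<List<String>> buckets) {
        int longest = 0;
        for (List<String> bucket : buckets) {
            longest = Math.max(longest, bucket.size());
        }
        List<String> merged = new ArrayList<>();
        for (int row = 0; row < longest; row++) {
            for (List<String> bucket : buckets) {
                if (row < bucket.size()) {
                    merged.add(bucket.get(row));
                }
            }
        }
        return merged;
    }
}
```

The merge only works because the dealing is deterministic: record i always goes to bucket i % workers, so reading the buckets in the same rotation yields the original sequence.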
As was also proposed by Alexei, you can create OrderedTask:
class OrderedTask implements Comparable<OrderedTask> {
private final Integer index;
private final String line;
public OrderedTask(Integer index, String line) {
this.index = index;
this.line = line;
}
@Override
public int compareTo(OrderedTask o) {
// compare by value; == on two Integer objects would compare references
return index.compareTo(o.getIndex());
}
public Integer getIndex() {
return index;
}
public String getLine() {
return line;
}
}
As the output queue you can use your own class backed by a priority queue:
class OrderedTaskQueue {
private final ReentrantLock lock;
private final Condition waitForOrderedItem;
private final int maxQueuesize;
private final PriorityQueue<OrderedTask> backedQueue;
private int expectedIndex;
public OrderedTaskQueue(int maxQueueSize, int startIndex) {
this.maxQueuesize = maxQueueSize;
this.expectedIndex = startIndex;
this.backedQueue = new PriorityQueue<>(2 * this.maxQueuesize);
this.lock = new ReentrantLock();
this.waitForOrderedItem = this.lock.newCondition();
}
public boolean put(OrderedTask item) {
ReentrantLock lock = this.lock;
lock.lock();
try {
while (this.backedQueue.size() >= maxQueuesize && item.getIndex() != expectedIndex) {
this.waitForOrderedItem.await();
}
boolean result = this.backedQueue.add(item);
this.waitForOrderedItem.signalAll();
return result;
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // restore the interrupt flag for callers
throw new RuntimeException(e);
} finally {
lock.unlock();
}
}
public OrderedTask take() {
ReentrantLock lock = this.lock;
lock.lock();
try {
while (this.backedQueue.peek() == null || this.backedQueue.peek().getIndex() != expectedIndex) {
this.waitForOrderedItem.await();
}
OrderedTask result = this.backedQueue.poll();
expectedIndex++;
this.waitForOrderedItem.signalAll();
return result;
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // restore the interrupt flag for callers
throw new RuntimeException(e);
} finally {
lock.unlock();
}
}
}
startIndex is the index of the first ordered task, and
maxQueueSize is used to pause the processing of other tasks (so that memory does not fill up) while we wait for an earlier task to finish. It should be double or triple the number of processing threads, so that processing does not stall immediately and can still scale.
Then you should create your task :
int indexOrder =0;
while ((line = reader.readLine()) != null) {
inputQueue.put(new OrderedTask(indexOrder++, line));
}
Line-by-line tasks are used here only because your example works that way. You should change OrderedTask to carry a batch of lines.
Why not reverse the flow?
The output asks for X batches.
Generate X promises/tasks (promise pattern), each of which calls one of the processing cores at random (keep a batch number to pass through to the input core); collect the call handles in an ordered list.
Each processing core asks the input core for a batch.
Enjoy?
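One way to read the "ordered list of promises" idea, sketched with plain Java 8 `Future`s. This is an interpretation rather than the answerer's code: `process` is a hypothetical stand-in for the confidential processing step, and the whole input is held in a list only to keep the example short. For a multi-gigabyte file the futures would instead have to pass through a bounded queue to the writer thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OrderedFutures {
    // Hypothetical stand-in for the real processing step.
    static String process(String line) {
        return line.toUpperCase();
    }

    static List<String> processInOrder(List<String> lines, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Futures are stored in submission order, which is the input order.
            List<Future<String>> pending = new ArrayList<>();
            for (String line : lines) {
                Callable<String> task = () -> process(line);
                pending.add(pool.submit(task));
            }
            // Consuming the futures in list order restores the input order,
            // no matter which worker finishes first.
            List<String> out = new ArrayList<>();
            for (Future<String> future : pending) {
                out.add(future.get()); // blocks until this particular line is done
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

The workers run in parallel, but the consumer never looks at result n+1 before result n, so no sorting step is needed.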
I am using the UNO API library (soffice) from LibreOffice 6.0 to convert MS Office formats to PDF. The soffice process serves multiple simultaneous requests in server mode.
Usually the conversion is fast, but when converting some large files, e.g. xlsx or pptx, the soffice process uses 100% CPU and the conversion takes up to a few minutes.
This is unacceptable, because during this time concurrent requests are not served.
To handle this situation I tried to use java.util.concurrent to execute some subtasks as threads with timeout control via the Future interface. But this works well only if the hang occurs during the loading stage of the original MS Office document.
If the conversion itself has already started, then even though a timeout exception occurs, the soffice process does not drop from 100% load at once but continues to convert the document to PDF.
Program execution pauses while trying to dispose of the loaded document.
The soffice process is started under Linux via the command:
Runtime.getRuntime().exec("/usr/lib64/libreoffice/program/soffice, --nologo, --nodefault, --norestore, --nocrashreport, --nolockcheck, --accept=socket,host=localhost,port=8100;urp;");
The code for converting an MS Office file to PDF, in simplified form, is:
public void convertFile(){
xRemoteContext = BootstrapSocketConnector.bootstrap(oooPath);
xRemoteServiceManager = xRemoteContext.getServiceManager();
Object desktop = null;
desktop = xRemoteServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", xRemoteContext);
xComponentLoader = (XComponentLoader) UnoRuntime.queryInterface(XComponentLoader.class, desktop);
File mfile = new File(workingDir + myTemplate);
String sUrl = pathToURL(workingDir + myTemplate);
PropertyValue[] propertiesIn;
propertiesIn = new PropertyValue[2];
propertiesIn[0] = property("ReadOnly", Boolean.TRUE);
propertiesIn[1] = property("Hidden", Boolean.TRUE);
XComponent xComp = null;
try {
//xComp = xComponentLoader.loadComponentFromURL(sUrl, "_blank", 0, propertiesIn);
//The same via timeout control
xComp = callLibreLoad(sUrl, propertiesIn);
}
catch (TimeoutException ex) {
if(xComp!= null)
xComp.dispose();
...
}
// save as a PDF
XStorable xStorable = (XStorable) UnoRuntime.queryInterface(XStorable.class, xComp);
PropertyValue[] propertiesOut = new PropertyValue[2];
propertiesOut[0] = property("FilterName", formatfilter);
propertiesOut[1] = property("Overwrite", Boolean.TRUE);
String myResult = workingDir + fileNameOut;
try {
//xStorable.storeToURL(pathToURL(myResult), propertiesOut);
//The same via timeout control
callLibreStore(xStorable,pathToURL(myResult), propertiesOut);
}
catch (TimeoutException ex) {
if(xComp!= null)
xComp.dispose();
...
}
if(xComp!= null)
xComp.dispose();
}
The functions callLibreLoad and callLibreStore use the Future interface for timeout control:
private XComponent callLibreLoad(String sUrl, PropertyValue[] propertiesIn) throws Exception {
XComponent result = null;
ExecutorService executor = Executors.newCachedThreadPool();
Callable<Object> task = new Callable<Object>() {
public Object call() throws IllegalArgumentException, com.sun.star.io.IOException {
return xComponentLoader.loadComponentFromURL(sUrl, "_blank", 0, propertiesIn);
}
};
Future<Object> future = executor.submit(task);
try {
result = (XComponent) future.get(maxTimeout, TimeUnit.SECONDS);
}
finally
{ future.cancel(true);}
return result;
}
private void callLibreStore(XStorable xStorable, String sUrl, PropertyValue[] propertiesOut) throws Exception {
Integer result = null;
ExecutorService executor = Executors.newCachedThreadPool();
Runnable task = new Runnable() {
public void run() {
try {
xStorable.storeToURL(sUrl, propertiesOut);
} catch (com.sun.star.io.IOException e) {
log.error(e);
}
}
};
Future future = executor.submit(task);
try {
future.get(maxTimeout, TimeUnit.SECONDS);
}
finally {
future.cancel(true); // may or may not desire this
}
}
So, when the timeout exception occurs in callLibreLoad, the soffice process returns to a working state at once.
But when the timeout occurs later, in callLibreStore, then even after the timeout and the interruption of the conversion thread, the soffice process stays at 100% load for more than a minute while trying to dispose of the loaded document in xComp.dispose().
During this period the stack trace of the Java thread talking to the soffice process contains the following:
State: WAITING on com.sun.star.lib.uno.environments.remote.JobQueue@30af74b8
Total blocked: 455 Total waited: 1 967
Stack trace:
java.lang.Object.wait(Native Method)
com.sun.star.lib.uno.environments.remote.JobQueue.removeJob(JobQueue.java:207)
com.sun.star.lib.uno.environments.remote.JobQueue.enter(JobQueue.java:316)
com.sun.star.lib.uno.environments.remote.JobQueue.enter(JobQueue.java:289)
com.sun.star.lib.uno.environments.remote.JavaThreadPool.enter(JavaThreadPool.java:81)
com.sun.star.lib.uno.bridges.java_remote.java_remote_bridge.sendRequest(java_remote_bridge.java:618)
com.sun.star.lib.uno.bridges.java_remote.ProxyFactory$Handler.request(ProxyFactory.java:145)
com.sun.star.lib.uno.bridges.java_remote.ProxyFactory$Handler.invoke(ProxyFactory.java:129)
com.sun.proxy.$Proxy211.close(Unknown Source)
com.componentplus.prom.libreoffice.LibreOfficeStationary.closeDocument(LibreOfficeStationary.java:425)
com.componentplus.prom.libreoffice.LibreOfficeStationary.convertFile(LibreOfficeStationary.java:393)
...
How can I force soffice to cancel the conversion to PDF when it takes longer than the maximum permitted time?
One possibility might be to save the Process instance returned by Runtime.getRuntime().exec, e.g. in a variable myProc, and then call myProc.destroy() to kill the process when needed.
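A minimal sketch of that idea, assuming the service is started from the same JVM. The wrapper class `KillableProcess` and its method names are made up for illustration; `Process.destroy()`, `destroyForcibly()`, `waitFor(long, TimeUnit)` and `isAlive()` are standard in Java 8.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

class KillableProcess {
    private final Process process;

    KillableProcess(List<String> command) throws IOException {
        // Keep the Process handle so the child can be killed later.
        // inheritIO() avoids the child blocking on an unread output pipe.
        this.process = new ProcessBuilder(command).inheritIO().start();
    }

    // Ask the process to stop; force it if it has not exited within the grace period.
    // Returns true once the process is gone.
    boolean kill(long graceSeconds) throws InterruptedException {
        process.destroy();
        if (process.waitFor(graceSeconds, TimeUnit.SECONDS)) {
            return true;
        }
        process.destroyForcibly();
        return process.waitFor(graceSeconds, TimeUnit.SECONDS);
    }

    boolean isAlive() {
        return process.isAlive();
    }
}
```

Note that killing the process also aborts every other conversion it is serving at that moment, and the service has to be started and connected to again before the next request.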
This is actually a design question / problem. And I am not sure if writing and reading the file is an ideal solution here. Nonetheless, I will outline what I am trying to do below:
I have the following static method. Once the reqStreamingData method of the ClientSocket object is called, it starts retrieving data from the client's server continuously, once every 150 milliseconds.
public static void streamingDataOperations(ClientSocket cs) throws InterruptedException, IOException{
// call - retrieve streaming data constantly from client server,
// and write a line in the csv file at a rate of 150 milliseconds
// using bufferedWriter and printWriter (print method).
// Note that the flush method of bufferedWriter is never called,
// I would assume the data is in fact being written in buffered memory
// not the actual file.
cs.reqStreamingData(output_file); // <- this method comes from client's API.
// I would like to add another thread (aka the data processing thread) that repeats every 15 minutes.
// I am aware I can do that by creating a class that extends TimerTask and fixing a schedule
// Now when this thread runs, there are things I want to do.
// 1. flush last 15 minutes of data to the output_file (Note no synchronized statement method or statements are used here, hence no object is being locked.)
// 2. process the data in R
// 3. wait for the output in R to come back
// 4. clear file contents, so that it always store data that only occurs in the last 15 minutes
}
Now, I am not well versed in multithreading. My concern is that the request data thread and the data processing thread are reading and writing to the file simultaneously but at different rates. I am not sure whether the data processing thread would delay the request data thread by a significant amount, since the data processing thread has a more computationally heavy task to carry out than the request data thread. But given that they are 2 separate threads, would any error or exception occur here?
I am not too supportive of the idea of writing and reading the same file at the same time but because I have to use R to process and store the data in R's dataframe in real time, I really cannot think of other ways to approach this. Are there any better alternatives ?
Is there a better design to tackle this problem ?
I understand that this is a lengthy problem. Please let me know if you need more information.
The lines (CSV, or any other text) can be written to a temporary file. When processing is ready to pick up, the only synchronization needed occurs when the temporary file is getting replaced by the new one. This guarantees that the producer never writes to the file that is being processed by the consumer at the same time.
Once that is done, producer continues to add lines to the newer file. The consumer flushes and closes the old file, and then moves it to the file as expected by your R-application.
To further clarify the approach, here is a sample implementation:
public static void main(String[] args) throws IOException {
// in this sample these dirs are supposed to exist
final String workingDirectory = "./data/tmp";
final String outputDirectory = "./data/csv";
final String outputFilename = "r.out";
final int addIntervalSeconds = 1;
final int drainIntervalSeconds = 5;
final FileBasedTextBatch batch = new FileBasedTextBatch(Paths.get(workingDirectory));
final ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
final ScheduledFuture<?> producer = executor.scheduleAtFixedRate(
() -> batch.add(
// adding formatted date/time to imitate another CSV line
LocalDateTime.now().format(DateTimeFormatter.ISO_DATE_TIME)
),
0, addIntervalSeconds, TimeUnit.SECONDS);
final ScheduledFuture<?> consumer = executor.scheduleAtFixedRate(
() -> batch.drainTo(Paths.get(outputDirectory, outputFilename)),
0, drainIntervalSeconds, TimeUnit.SECONDS);
try {
// awaiting some limited time for demonstration
producer.get(30, TimeUnit.SECONDS);
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
catch (ExecutionException e) {
System.err.println("Producer failed: " + e);
}
catch (TimeoutException e) {
System.out.println("Finishing producer/consumer...");
producer.cancel(true);
consumer.cancel(true);
}
executor.shutdown();
}
static class FileBasedTextBatch {
private final Object lock = new Object();
private final Path workingDir;
private Output output;
public FileBasedTextBatch(Path workingDir) throws IOException {
this.workingDir = workingDir;
output = new Output(this.workingDir);
}
/**
* Adds another line of text to the batch.
*/
public void add(String textLine) {
synchronized (lock) {
output.writer.println(textLine);
}
}
/**
* Moves currently collected batch to the file at the specified path.
* The file will be overwritten if exists.
*/
public void drainTo(Path targetPath) {
try {
final long startNanos = System.nanoTime();
final Output output = getAndSwapOutput();
final long elapsedMillis =
TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
System.out.printf("Replaced the output in %d millis%n", elapsedMillis);
output.close();
Files.move(
output.file,
targetPath,
StandardCopyOption.ATOMIC_MOVE,
StandardCopyOption.REPLACE_EXISTING
);
}
catch (IOException e) {
System.err.println("Failed to drain: " + e);
throw new IllegalStateException(e);
}
}
/**
* Replaces the current output with the new one, returning the old one.
* The method is supposed to execute very quickly to avoid delaying the producer thread.
*/
private Output getAndSwapOutput() throws IOException {
synchronized (lock) {
final Output prev = this.output;
this.output = new Output(this.workingDir);
return prev;
}
}
}
static class Output {
final Path file;
final PrintWriter writer;
Output(Path workingDir) throws IOException {
// performs very well on local filesystems when working directory is empty;
// if too slow, it may be replaced with UUID-based name generation
this.file = Files.createTempFile(workingDir, "csv", ".tmp");
this.writer = new PrintWriter(Files.newBufferedWriter(this.file));
}
void close() {
// writer is final and never null; flush buffered lines, then close
this.writer.flush();
this.writer.close();
}
}
I am running a thread to traverse my local directory (no subdirectories), and as soon as I find a text file I start a new thread that searches for a word in that file.
What is wrong in the below code?
Searching and traversing work fine separately. But when I put them together something goes wrong: some files are skipped (or rather, because of the multithreading, object synchronization is not happening properly).
Please help me out.
Traverse.java
public void executeTraversing() {
Path dir = null;
if(dirPath.startsWith("file://")) {
dir = Paths.get(URI.create(dirPath));
} else {
dir = Paths.get(dirPath);
}
listFiles(dir);
}
private synchronized void listFiles(Path dir) {
ExecutorService executor = Executors.newFixedThreadPool(1);
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
for (Path file : stream) {
if (Files.isDirectory(file)) {
listFiles(file);
} else {
search.setFileNameToSearch(file);
executor.submit(search);
}
}
} catch (IOException | DirectoryIteratorException x) {
// IOException can never be thrown by the iteration.
// In this snippet, it can only be thrown by
// newDirectoryStream.
System.err.println(x);
}
}
Search.java
/**
* @param wordToSearch
*/
public Search(String wordToSearch) {
super();
this.wordToSearch = wordToSearch;
}
public void run() {
this.search();
}
private synchronized void search() {
counter = 0;
Charset charset = Charset.defaultCharset();
try (BufferedReader reader = Files.newBufferedReader(fileNameToSearch.toAbsolutePath(), charset)) {
// do you have permission to read this directory?
if (Files.isReadable(fileNameToSearch)) {
String line = null;
while ((line = reader.readLine()) != null) {
counter++;
//System.out.println(wordToSearch +" "+ fileNameToSearch);
if (line.contains(wordToSearch)) {
System.out.println("Word '" + wordToSearch
+ "' found at "
+ counter
+ " in "
+ fileNameToSearch);
}
}
} else {
System.out.println(fileNameToSearch
+ " is not readable.");
}
} catch (IOException x) {
System.err.format("IOException: %s%n", x);
}
}
The problem is this Search instance that you keep reusing here:
search.setFileNameToSearch(file);
executor.submit(search);
While its actual search() method is synchronized, it looks like by the time a task actually gets to searching, setFileNameToSearch() has already been called several more times, which would explain the skipping.
Create a new instance of Search each time; then you would not need to synchronize the search() method at all.
You are creating the ExecutorService inside your listFiles method. This is probably not a good idea: because the method is recursive, you create a new pool, and therefore new threads, for every directory.
On top of that, you are not monitoring the state of all these ExecutorServices; some of their tasks might not have started when your application stops.
Instead you should create the ExecutorService only once, before starting the recursion. When the recursion is over, call shutdown() on your ExecutorService and then awaitTermination() to wait for all tasks to complete.
Furthermore, you are reusing a single Search object and passing it to multiple tasks while modifying it; you should create a new Search for each file you process.
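Both points can be sketched together as below. This is not the asker's Search class: `SearchTask`, `searchAll`, the pool size and the one-hour timeout are assumptions chosen for the example, and `Files.walk` replaces the hand-written recursion for brevity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SearchPerFile {
    // Hypothetical task: one instance per file, all fields final.
    // The only shared object is a thread-safe counter.
    static final class SearchTask implements Runnable {
        private final Path file;
        private final String word;
        private final AtomicInteger hits;

        SearchTask(Path file, String word, AtomicInteger hits) {
            this.file = file;
            this.word = word;
            this.hits = hits;
        }

        @Override
        public void run() {
            try {
                int lineNo = 0;
                for (String line : Files.readAllLines(file)) {
                    lineNo++;
                    if (line.contains(word)) {
                        hits.incrementAndGet();
                        System.out.println("Word '" + word + "' found at " + lineNo + " in " + file);
                    }
                }
            } catch (IOException e) {
                System.err.println("Cannot read " + file + ": " + e);
            }
        }
    }

    // Returns the number of lines that contain the word.
    static int searchAll(Path dir, String word) throws IOException, InterruptedException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        AtomicInteger hits = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(4); // created once
        for (Path file : files) {
            executor.submit(new SearchTask(file, word, hits)); // new task per file
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS); // wait for all searches
        return hits.get();
    }
}
```

Because each task owns its file reference, nothing can overwrite it between submission and execution, and no synchronized keyword is needed.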
I am looking to read the contents of files in Java. I have about 8000 files whose contents I need to read into a HashMap of (path, contents). I think using threads would be an option to speed up the process.
From what I know, reading all 8000 files in separate threads is not possible (we may want to limit the number of threads). Any comments on that? Also, I am new to threading in Java; can anyone help me get started on this?
So far I have come up with this pseudocode:
public class ThreadingTest extends Thread {
public HashMap<String, String > contents = new HashMap<String, String>();
public ThreadingTest(ArrayList<String> paths)
{
for(String s : paths)
{
// paths is paths to files.
// Have threading here for each path going to get contents from a
// file
//Not sure how to limit and start threads here
readFile(s);
Thread t = new Thread();
t.start();
}
}
public String readFile(String path) throws IOException
{
FileReader reader = new FileReader(path);
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(reader);
String line;
while ( (line=br.readLine()) != null) {
sb.append(line);
}
return sb.toString();
}
}
Any help in completing the threading process. Thanks
Short answer: Read the files sequentially. Disk I/O doesn't parallelize well.
Long Answer: Threading might improve the read performance if the disks are good at random access (SSD disks are) or if the files are placed on several different disks, but if they're not you're just likely to end up with a lot of cache misses and waiting for the disks to seek the right read position. (You may still end up there even if your disks are good at random access.)
If you want to measure instead of guess, use Executors.newFixedThreadPool to create an ExecutorService that can read your files in parallel. Experiment with different thread counts, but don't be surprised if one reader thread per physical disk gives you the best performance.
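A rough sketch of such a measurement, assuming the file list is already known. The class name, the method names and the thread counts tried are illustrative choices. Keep in mind that the operating system caches files after the first pass, so later runs can look faster for reasons unrelated to the thread count.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ReadBenchmark {
    // Reads every file on a pool of the given size and returns the total bytes read.
    static long totalBytes(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> (long) Files.readAllBytes(file).length;
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : pending) {
                total += future.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Times the same read with several pool sizes and prints one line per size.
    static void compare(List<Path> files) throws Exception {
        for (int threads : new int[] {1, 2, 4, 8}) {
            long start = System.nanoTime();
            long bytes = totalBytes(files, threads);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + bytes + " bytes in " + millis + " ms");
        }
    }
}
```

Running `compare` a few times on the real file set, ideally on the target hardware, gives a better answer than any rule of thumb.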
This is a typical task for a thread pool. See the tutorial here: http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html
import java.io.*;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.*;
public class PooledFileProcessing {
private Map<String, String> contents = Collections.synchronizedMap(new HashMap<String, String>());
// Integer.MAX_VALUE items max
private LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<Runnable>();
private ExecutorService executor = new ThreadPoolExecutor(
5, // five core workers
20, // maximum of twenty; never reached with an unbounded queue, since extra threads start only when the queue is full
1, TimeUnit.MINUTES, // idle thread dies in one minute
workQueue
);
public void process(final String basePath) {
visit(new File(basePath));
System.out.println(workQueue.size() + " jobs still in queue");
executor.shutdown();
try {
executor.awaitTermination(5, TimeUnit.MINUTES);
} catch (InterruptedException e) {
System.out.println("interrupted while awaiting termination");
}
System.out.println(contents.size() + " files indexed");
}
public void visit(final File file) {
if (!file.exists()) {
return;
}
if (file.isFile()) { // skip the dirs
executor.submit(new RunnablePullFile(file));
}
// traverse children
if (file.isDirectory()) {
final File[] children = file.listFiles();
if (children != null && children.length > 0) {
for (File child : children) {
visit(child);
}
}
}
}
public static void main(String[] args) {
new PooledFileProcessing().process(args.length == 1 ? args[0] : System.getProperty("user.home"));
}
protected class RunnablePullFile implements Runnable {
private final File file;
public RunnablePullFile(File file) {
this.file = file;
}
public void run() {
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String line;
while (
(line=reader.readLine()) != null &&
sb.length() < 8192 /* remove this check for a nice OOME or swap thrashing */
) {
sb.append(line);
}
contents.put(file.getPath(), sb.toString());
} catch (IOException e) {
System.err.println("failed on file: '" + file.getPath() + "': " + e.getMessage());
} finally {
// close the reader on success as well as on failure
if (reader != null) {
try {
reader.close();
} catch (IOException e1) {
// ignore that one
}
}
}
}
}
}
From my experience, threading helps - use a thread pool and play with values around 1..2 threads per core.
Just take care with the hash map: consider putting data into the map via a synchronized method only. I remember I once had some ugly issues in a similar project, and they were related to concurrent modifications of a central hash map.
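As an alternative to a synchronized method, a `ConcurrentHashMap` handles concurrent puts by itself. The sketch below is an assumption-laden illustration: the class name, the pool size and the fake "contents" strings are invented, and no files are actually read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedMapSketch {
    // Fills a shared map from many pool threads; the map does its own locking.
    static Map<String, String> fill(int entries) throws InterruptedException {
        Map<String, String> contents = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < entries; i++) {
            final String key = "file-" + i; // stands in for a real path
            pool.submit(() -> { contents.put(key, "contents of " + key); });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return contents;
    }
}
```

A plain `HashMap` written from several threads without synchronization can lose entries or corrupt its internal structure, which matches the kind of "ugly issues" described above.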
Just some quick tips.
First of all, to get you started on threads, you should just look at the Runnable interface or the Thread class. To make a thread you either have to implement this interface with a class or extend this class with another class. You can also make anonymous threads, but I dislike the readability of those unless it's something SUPER simple.
Next, just some notes on processing text with multiple threads, because it just so happens I have some experience in exactly this! Keep in mind that if the files are large and take a noticeably long time to process a single file that you will want to monitor your CPU. In my experience I was doing lots of calculations and lookups when I was processing which added hugely to my load so in the end I found that I could only make as many threads as I had processors because each thread was so labor intensive. So keep that in mind, you want to monitor the effect each thread has on the processor.
I'm not sure having threads for this would really speed up the process if all the files are on the same physical disk. It could even slow things down because the disk would have to constantly switch from one location to the other.