Multiple threads reading from same file/odd behavior - java

I'm writing a program that needs to read lines from a very large file (400K+ lines) and send the data in each line on to a web service. I decided to try threading and am seeing some behavior I did not expect: my BufferedReader appears to start reusing lines it has already given me when I call readLine() on it.
My program is made up of two classes. A "Main" class that kicks off the threads, holds a static reference to the BufferedReader, and has a static synchronized "readNextLine()" method that the threads use to call readLine() on the BufferedReader. And the "Runnable" class, which calls readNextLine() and makes a web service call with the data from each readNextLine() call. I made the BufferedReader and readNextLine() static only because that was the only way I could think of for the threads to share the reader, aside from passing an instance of my main class into the threads; I wasn't sure which was better.
After about 5 minutes, I start seeing errors in my web service saying that it's processing a line it's already processed. I'm able to verify lines are indeed being sent multiple times, minutes apart.
Does anyone have any ideas as to why the BufferedReader seems to be giving the threads lines it already read? I was under the impression readLine() was sequential and all I needed to do was make sure the calls to readLine() were synchronized.
I'll show some of the Main class code below. The runnable is essentially a while loop that calls readNextLine() and processes each line until there are no more lines left.
Main class:
//showing reader and thread creation
inputStream = sftp.get(path to file);
reader = new BufferedReader(new InputStreamReader(inputStream));
ExecutorService executor = Executors.newFixedThreadPool(threads);
Collection<Future> futures = new ArrayList<Future>();
for(int i=0;i<threads;i++){
MyRunnable runnable = new MyRunnable(i);
futures.add(executor.submit(runnable));
}
LOGGER.debug("futures.get()");
for(Future f:futures){
f.get(); //use to wait until all threads are done
}
public synchronized static String readNextLine(){
String results = null;
try{
if(reader!=null){
results = reader.readLine();
}
}catch(Exception e){
LOGGER.error("Error reading from file");
}
return results;
}

I tested what you described, but I think there is a logic error in your readNextLine() method: how can reader.readLine() be invoked when results is null and the if condition requires it to be non-null?
I have now finished my demo and it seems to work well. Here it is; no line was read twice:
static BufferedReader reader;
public static void main(String[] args) throws FileNotFoundException, ExecutionException, InterruptedException {
reader = new BufferedReader(new FileReader("test.txt"));
ExecutorService service = Executors.newFixedThreadPool(3);
List<Future> results = new ArrayList<Future>();
for (int i = 0; i < 3; i++) {
results.add(service.submit(new Runnable() {
@Override
public void run() {
try {
String line = null;
while ((line = readNextLine()) != null) {
System.out.println(line);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}));
}
}
public synchronized static String readNextLine() throws IOException {
return reader.readLine();
}
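On the side question of static state versus passing an instance around: a minimal sketch of the second option follows. SharedLineSource and drain are hypothetical names, not from the original post, and adding to a list stands in for the web service call. It only shows that a synchronized readLine() hands each line out once; it does not explain where the duplicates in the question came from.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper: one instance is shared by all workers instead of a static field. */
final class SharedLineSource {
    private final BufferedReader reader;

    SharedLineSource(BufferedReader reader) {
        this.reader = reader;
    }

    /** Returns the next line, or null at end of input. Only one thread reads at a time. */
    synchronized String nextLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            // Propagate instead of returning null, so a read error is not mistaken for end of file.
            throw new UncheckedIOException(e);
        }
    }

    /** Drains the source with the given number of workers and returns every line handed out. */
    static List<String> drain(SharedLineSource source, int workers) {
        List<String> seen = Collections.synchronizedList(new ArrayList<String>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                String line;
                while ((line = source.nextLine()) != null) {
                    seen.add(line); // stand-in for the web service call
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

Throwing on IOException rather than logging and returning null keeps a read failure from looking like a normal end of file to the workers.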

Related

Process large text file concurrently

So I have a large text file, in this case it's roughly 4.5 GB, and I need to process the entire file as fast as is possible. Right now I have multi-threaded this using 3 threads (not including the main thread). An input thread for reading the input file, a processing thread to process the data, and an output thread to output the processed data to a file.
Currently, the bottleneck is the processing section. Therefore, I'd like to add more processing threads into the mix. However, this creates a situation where I've got multiple threads accessing the same BlockingQueue, and their results are therefore not maintaining the order of the input file.
An example of the functionality I'm looking for would be something like this:
Input file: 1, 2, 3, 4, 5
Output file: ^ the same. Not 2, 1, 4, 3, 5 or any other combination.
I've written a dummy program that is identical in functionality to the actual program minus the processing part, (I can't give you the actual program due to the processing class containing info that is confidential). I should also mention, all of the classes (Input, Processing, and Output) are all Inner classes contained within a Main class that contains the initialise() method and the class level variables mentioned in the main thread code listed below.
Main thread:
static volatile boolean readerFinished = false; // class level variables
static volatile boolean writerFinished = false;
private void initialise() throws IOException {
BlockingQueue<String> inputQueue = new LinkedBlockingQueue<>(1_000_000);
BlockingQueue<String> outputQueue = new LinkedBlockingQueue<>(1_000_000); // capacity 1 million.
String inputFileName = "test.txt";
String outputFileName = "outputTest.txt";
BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
BufferedWriter writer = new BufferedWriter(new FileWriter(outputFileName));
Thread T1 = new Thread(new Input(reader, inputQueue));
Thread T2 = new Thread(new Processing(inputQueue, outputQueue));
Thread T3 = new Thread(new Output(writer, outputQueue));
T1.start();
T2.start();
T3.start();
while (!writerFinished) {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
reader.close();
writer.close();
System.out.println("Exited.");
}
Input thread: (Please forgive the commented debug code, was using it to ensure the reader thread was actually executing properly).
class Input implements Runnable {
BufferedReader reader;
BlockingQueue<String> inputQueue;
Input(BufferedReader reader, BlockingQueue<String> inputQueue) {
this.reader = reader;
this.inputQueue = inputQueue;
}
@Override
public void run() {
String poisonPill = "ChH92PU2KYkZUBR";
String line;
//int linesRead = 0;
try {
while ((line = reader.readLine()) != null) {
inputQueue.put(line);
//linesRead++;
/*
if (linesRead == 500_000) {
//batchesRead += 1;
//System.out.println("Batch read");
linesRead = 0;
}
*/
}
inputQueue.put(poisonPill);
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}
readerFinished = true;
}
}
Processing thread: (Normally this would actually be doing something to the line, but for purposes of the mockup I've just made it immediately push to the output thread). If necessary we can simulate it doing some work by making the thread sleep for a small amount of time for each line.
class Processing implements Runnable {
BlockingQueue<String> inputQueue;
BlockingQueue<String> outputQueue;
Processing(BlockingQueue<String> inputQueue, BlockingQueue<String> outputQueue) {
this.inputQueue = inputQueue;
this.outputQueue = outputQueue;
}
@Override
public void run() {
while (true) {
try {
if (inputQueue.isEmpty() && readerFinished) {
break;
}
String line = inputQueue.take();
outputQueue.put(line);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
Output thread:
class Output implements Runnable {
BufferedWriter writer;
BlockingQueue<String> outputQueue;
Output(BufferedWriter writer, BlockingQueue<String> outputQueue) {
this.writer = writer;
this.outputQueue = outputQueue;
}
@Override
public void run() {
String line;
ArrayList<String> outputList = new ArrayList<>();
while (true) {
try {
line = outputQueue.take();
if (line.equals("ChH92PU2KYkZUBR")) {
for (String outputLine : outputList) {
writer.write(outputLine);
}
System.out.println("Writer finished - executing termination");
writerFinished = true;
break;
}
line += "\n";
outputList.add(line);
if (outputList.size() == 500_000) {
for (String outputLine : outputList) {
writer.write(outputLine);
}
System.out.println("Writer wrote batch");
outputList = new ArrayList<>();
}
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}
}
}
}
So right now the general data flow is very linear, looking something like this:
Input > Processing > Output.
But what I'd like to have is something like this:
But the catch is, when the data gets to output, it either needs to be sorted into the correct order, or it needs to already be in the correct order.
Recommendations or examples on how to go about this would be greatly appreciated.
In the past I have used the Future and Callable interfaces to solve a task involving parallel data flows like this, but unfortunately that code was not reading from a single queue, and so is of minimal help here.
I should also add, for those of you that will notice this, batchSize and poisonPill are normally defined in the main thread and then passed around via variables, they are not usually hard coded as they are in the code for Input thread, and the output checks for the writer thread. I was just a wee bit lazy when writing the mockup for experimentation at ~1am.
Edit: I should also mention, this is required to use Java 8 at most. Java 9 features and above cannot be used due to these versions not being installed in the environments in which this program will be run.
What you could do:
Take X threads for processing, where X is the number of cores available for processing
Give each thread its own input queue.
The reader thread gives records to each thread's input queue round-robin in a predictable fashion.
Since the output files are too big for memory, you write X output files, one for each thread, and each file name has the index of the thread in it, so that you can reconstitute the original order from the file names.
After the process is complete, you merge the X output files. One line from the file for thread 1, one from the files for thread 2, etc. in a round-robin fashion again. This reconstitutes the original order.
As an added bonus, since you have an input queue per thread, you don't have lock contention on the queue between readers. (only between the reader and the writer) You could even optimize this by putting things in the input queues in batches larger than 1.
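The split and merge steps above can be sketched in memory as follows. RoundRobin, split and merge are hypothetical names, and plain lists stand in for the per-thread queues and output files; this is an illustration of the ordering argument under those assumptions, not a full pipeline.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of the round-robin split and merge. */
final class RoundRobin {

    /** Deals records to `parts` lists: record i goes to list i % parts. */
    static List<List<String>> split(List<String> records, int parts) {
        List<List<String>> out = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            out.add(new ArrayList<String>());
        }
        for (int i = 0; i < records.size(); i++) {
            out.get(i % parts).add(records.get(i));
        }
        return out;
    }

    /** Takes one record from each list in turn, which restores the original order. */
    static List<String> merge(List<List<String>> parts) {
        List<String> merged = new ArrayList<>();
        for (int row = 0; ; row++) {
            boolean tookAny = false;
            for (List<String> part : parts) {
                if (row < part.size()) {
                    merged.add(part.get(row));
                    tookAny = true;
                }
            }
            if (!tookAny) {
                return merged; // every list is exhausted
            }
        }
    }
}
```

The merge works only because the split is deterministic: each worker must emit exactly one output record per input record, in the order it received them.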
As was also proposed by Alexei, you can create OrderedTask:
class OrderedTask implements Comparable<OrderedTask> {
private final Integer index;
private final String line;
public OrderedTask(Integer index, String line) {
this.index = index;
this.line = line;
}
@Override
public int compareTo(OrderedTask o) {
return index.compareTo(o.getIndex()); // compares values; == on two Integer objects would compare references
}
public Integer getIndex() {
return index;
}
public String getLine() {
return line;
}
}
As the output queue, you can use your own class backed by a priority queue:
class OrderedTaskQueue {
private final ReentrantLock lock;
private final Condition waitForOrderedItem;
private final int maxQueuesize;
private final PriorityQueue<OrderedTask> backedQueue;
private int expectedIndex;
public OrderedTaskQueue(int maxQueueSize, int startIndex) {
this.maxQueuesize = maxQueueSize;
this.expectedIndex = startIndex;
this.backedQueue = new PriorityQueue<>(2 * this.maxQueuesize);
this.lock = new ReentrantLock();
this.waitForOrderedItem = this.lock.newCondition();
}
public boolean put(OrderedTask item) {
ReentrantLock lock = this.lock;
lock.lock();
try {
while (this.backedQueue.size() >= maxQueuesize && item.getIndex() != expectedIndex) {
this.waitForOrderedItem.await();
}
boolean result = this.backedQueue.add(item);
this.waitForOrderedItem.signalAll();
return result;
} catch (InterruptedException e) {
throw new RuntimeException();
} finally {
lock.unlock();
}
}
public OrderedTask take() {
ReentrantLock lock = this.lock;
lock.lock();
try {
while (this.backedQueue.peek() == null || this.backedQueue.peek().getIndex() != expectedIndex) {
this.waitForOrderedItem.await();
}
OrderedTask result = this.backedQueue.poll();
expectedIndex++;
this.waitForOrderedItem.signalAll();
return result;
} catch (InterruptedException e) {
throw new RuntimeException();
} finally {
lock.unlock();
}
}
}
startIndex is the index of the first ordered task, and maxQueueSize is used to pause the processing of other tasks (so that memory does not fill up) while we wait for some earlier task to finish. It should be double or triple the number of processing threads, so that processing is not stopped immediately and the solution can still scale.
Then you should create your tasks:
int indexOrder = 0;
while ((line = reader.readLine()) != null) {
inputQueue.put(new OrderedTask(indexOrder++, line));
}
Line-by-line tasks are used here only because of your example. You should change OrderedTask to carry a batch of lines.
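A batch-carrying variant might look like the sketch below. OrderedBatch is a hypothetical name, not part of the answer above; it simply swaps the single line for a list of lines and compares by batch index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical batch version of OrderedTask: one index per batch of lines. */
final class OrderedBatch implements Comparable<OrderedBatch> {
    private final int index;
    private final List<String> lines;

    OrderedBatch(int index, List<String> lines) {
        this.index = index;
        // Defensive copy, so the reader can reuse its own list for the next batch.
        this.lines = Collections.unmodifiableList(new ArrayList<>(lines));
    }

    int getIndex() {
        return index;
    }

    List<String> getLines() {
        return lines;
    }

    @Override
    public int compareTo(OrderedBatch other) {
        return Integer.compare(index, other.index);
    }
}
```

Using a primitive int for the index avoids boxed comparisons altogether, and one index per batch keeps the ordering overhead small compared with one index per line.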
Why not reverse the flow?
The output stage asks for X batches.
It generates X promises/tasks (promise pattern), each of which calls one of the processing cores at random, carrying a batch number to pass through to the input core; the promises are kept in an ordered list.
Each processing core asks the input core for a batch.
Enjoy?
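One possible reading of the "ordered list of promises" idea, using only Java 8 classes: submit each batch to a pool, keep the returned Futures in submission order, and let the output side call get() on them in that order. OrderedFutures and processInOrder are hypothetical names. This is a sketch of the ordering idea only; for a multi-gigabyte file a real version would also have to cap the number of outstanding futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class OrderedFutures {

    /** Processes each batch on the pool but returns the results in submission order. */
    static List<String> processInOrder(List<List<String>> batches,
                                       Function<String, String> work,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submission order is input order; the list remembers it.
            List<Future<List<String>>> pending = new ArrayList<>();
            for (final List<String> batch : batches) {
                Callable<List<String>> task = () -> {
                    List<String> done = new ArrayList<>(batch.size());
                    for (String line : batch) {
                        done.add(work.apply(line));
                    }
                    return done;
                };
                pending.add(pool.submit(task));
            }
            List<String> output = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                // Blocks until this batch is done, even if later batches finished first.
                output.addAll(future.get());
            }
            return output;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The workers can finish in any order; the order of the output comes entirely from the order in which the futures are read.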

Using a threadpool to add in to a list

I am trying to read a file and add each line to a list.
Simple drawing explaining the goal
Main class -
public class SimpleTreadPoolMain {
public static void main(String[] args) {
ReadFile reader = new ReadFile();
File file = new File("C:\\myFile.csv");
try {
reader.readFile(file);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Reader class -
public class ReadFile {
ExecutorService executor = Executors.newFixedThreadPool(5);//creating a pool of 5 threads
List<String> list = new ArrayList<>();
void readFile(File file) throws IOException {
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != "") {
Runnable saver = new SaveToList(line,list);
executor.execute(saver);//calling execute method of ExecutorService
}
}
executor.shutdown();
while (!executor.isTerminated()) { }
}
}
Saver class -
public class SaveToList<E> implements Runnable{
List<E> myList;
E line;
public SaveToList(E line, List<E> list) {
this.line = line;
this.myList = list;
}
public void run() {
//modify the line
myList.add(line);
}
}
I tried to have many saver threads add to the same list instead of one saver adding to the list one by one. I want to use threads because I need to modify the data before adding it to the list, and I assume modifying the data would take some time. So parallelizing this part would reduce the time consumption, right?
But this doesn't work. I am unable to return a global list which includes all the values from the file. I want to have only one global list of values from the file. So the code definitely should change. If one can guide me it would be greatly appreciated.
Even though adding one by one in a single thread would work, using a thread pool would make it faster, right?
Using multiple threads won't speed anything up here.
You are:
Reading a line from a file, serially.
Creating a runnable and submitting it into a thread pool
The runnable then adds things into a list
Given that you're using an ArrayList, you need to synchronize access to it, because you're mutating it from multiple threads. So, you are adding things into the list serially.
But even without the synchronization, the time taken for the IO will far exceed the time taken to add the string into the list. And adding in multithreading is just going to slow it down more, because it's doing work to construct the runnable, submit it to the thread pool, schedule it, etc.
It's simpler just to miss out the whole middle step:
Read a line from a file, serially.
Add the line to the list, serially.
So:
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) { // readLine() returns null at end of file
list.add(line);
}
}
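If the pool is kept anyway, for instance because the per-line modification really is expensive, the shared list has to be thread-safe, as noted above. A minimal sketch, assuming a hypothetical SynchronizedCollect helper and using trim() as a stand-in for the real modification:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SynchronizedCollect {

    /** Modifies each line on a pool and collects the results; result order is not guaranteed. */
    static List<String> collect(List<String> lines, int threads) {
        List<String> shared = Collections.synchronizedList(new ArrayList<String>());
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (final String line : lines) {
            executor.execute(() -> shared.add(line.trim())); // trim() stands in for "modify the line"
        }
        executor.shutdown();
        try {
            // Block until the tasks finish instead of spinning on isTerminated().
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared;
    }
}
```

Every line ends up in the list, but in whatever order the tasks happened to run, so this only helps when order does not matter.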
You should first check whether multithreading is worth it in your application: compare the time it takes to read the whole file without doing any processing on the rows with the time it takes to process the whole file serially.
If your processing is not too complex, my guess is that multithreading is not worth it.
If you find that processing takes much more time, then you can think about using one or more threads to do the computations.
If so, you could use Futures to process batches of input strings, or you could use a thread-safe Queue to send strings to another process.
private static final int BATCH_SIZE = 1000;
public static void main(String[] args) throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("big_file.csv"), "utf-8"));
ExecutorService pool = Executors.newFixedThreadPool(8);
String line;
List<String> batch = new ArrayList<>(BATCH_SIZE);
List<Future> results = new LinkedList<>();
while((line=reader.readLine())!=null){
batch.add(line);
if(batch.size()>=BATCH_SIZE){
Future<List> f = noWaitExec(batch, pool); // matches the declared return type of noWaitExec
results.add(f);
batch = new ArrayList<>(BATCH_SIZE);
}
}
Future<List> f = noWaitExec(batch,pool);
results.add(f);
for (Future future : results) {
try {
Object object = future.get();
// Use your results here
} catch (Exception e) {
// Manage this....
}
}
}
private static Future<List> noWaitExec(final List<String> batch, ExecutorService pool) {
return pool.submit(new Callable<List>() {
public List call() throws Exception {
List result = new ArrayList<>(batch.size());
for (String string : batch) {
result.add(process(string));
}
return result;
}
});
}
private static Object process(String string) {
// Your process ....
return null;
};
There are many other possible solutions (Observables, parallel streams, pipes, CompletableFutures ... you name it). Still, I think most of the time is spent reading the file, so just using a BufferedInputStream with a big enough buffer could cut your times more than parallel computing.
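As one illustration of the parallel-stream option mentioned above, here is a minimal sketch that works on lines already in memory. ParallelLines and its process method are hypothetical stand-ins; toUpperCase() takes the place of the real per-line work.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ParallelLines {

    /** Applies a per-line transformation in parallel; the collected list keeps the input order. */
    static List<String> processAll(List<String> lines) {
        return lines.parallelStream()
                .map(ParallelLines::process)
                .collect(Collectors.toList());
    }

    /** Hypothetical stand-in for the real per-line work. */
    static String process(String line) {
        return line.toUpperCase();
    }
}
```

Because a List has a defined encounter order, the collected result comes back in input order even though the map step runs on several threads.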

How might I test that a CLI app waits for input

This is not the same as this question: JUnit: How to simulate System.in testing?, which is about mocking stdin.
What I want to know is how to test (as in TDD) that a simple Java class with a main method waits for input.
My test:
@Test
public void appRunShouldWaitForInput(){
long startMillis = System.currentTimeMillis();
// NB obviously you'd want to run this next line in a separate thread with some sort of timeout mechanism...
// that's an implementation detail I've omitted for the sake of avoiding clutter!
App.main( null );
long endMillis = System.currentTimeMillis();
assertThat( endMillis - startMillis ).isGreaterThan( 1000L );
}
My SUT main:
public static void main(String args[]) {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter something : ");
String input = br.readLine();
} catch (IOException e) {
e.printStackTrace();
}
}
... test fails. The code does not wait. But when you run the app at the command prompt it does indeed wait.
NB by the way, I also tried setting stdin to something else:
System.setIn(new ByteArrayInputStream( dummy.getBytes()));
scanner = new Scanner(System.in);
... this did not hold up the test.
As a much more general rule, static methods (such as main methods) are difficult to test. For this reason, you almost never call the main method (or any other static method) from your test code. A common pattern to work around this is to convert this:
public class App {
public static void main(String args[]) {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter something : ");
String input = br.readLine();
} catch (IOException e) {
e.printStackTrace();
}
}
}
to this:
public class App {
private final InputStream input;
private final PrintStream output; // PrintStream, because OutputStream has no print() method
public App(InputStream input, PrintStream output) {
this.input = input;
this.output = output;
}
public static void main(String[] args) {
new App(System.in, System.out).start();
}
public void start() {
BufferedReader br = null;
try {
br = new BufferedReader(new InputStreamReader(input));
output.print("Enter something : ");
String nextInput = br.readLine();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Now your test becomes this:
@Test
public void appRunShouldWaitForInput(){
ByteArrayOutputStream output = new ByteArrayOutputStream();
// As you have already noted, you would need to kick this off on another thread and use a blocking implementation of InputStream to test what you want to test.
new App(new ByteArrayInputStream(new byte[0]), new PrintStream(output)).start();
assertThat(output.toString(), is("Enter something : "));
}
The key idea is that, when you run the app "for real" i.e. via the main method, it will use the Standard input and output streams. However, when you run it from your tests, it uses a purely in memory input/output stream which you have full control over in your test. ByteArrayOutputStream is just one example, but you can see in my example that the test is able to inspect the actual bytes that have been written to the output stream.
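For the "blocking implementation of InputStream" mentioned in the test comment, one option is a PipedInputStream fed from the test thread. The sketch below is not the App class itself: WaitsForInputCheck, promptAndRead and waitsThenReads are hypothetical names, and the 300 ms pause is an arbitrary assumption about how long the reader needs to reach readLine().

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

final class WaitsForInputCheck {

    /** Hypothetical stand-in for App.start(): prompts, then blocks until a line arrives. */
    static String promptAndRead(InputStream in, PrintStream out) throws IOException {
        out.print("Enter something : ");
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)).readLine();
    }

    /** True if the reader was still blocked before input was sent and then finished with that input. */
    static boolean waitsThenReads(String input) {
        try {
            PipedOutputStream keyboard = new PipedOutputStream();
            PipedInputStream stdin = new PipedInputStream(keyboard);
            AtomicReference<String> received = new AtomicReference<>();

            Thread app = new Thread(() -> {
                try {
                    received.set(promptAndRead(stdin, new PrintStream(new ByteArrayOutputStream())));
                } catch (IOException e) {
                    received.set("error: " + e);
                }
            });
            app.setDaemon(true);
            app.start();

            Thread.sleep(300); // give the reader time to reach readLine()
            boolean blockedBeforeInput = app.isAlive();

            keyboard.write((input + "\n").getBytes(StandardCharsets.UTF_8));
            keyboard.flush(); // wakes the blocked reader
            app.join(5000);

            return blockedBeforeInput && !app.isAlive() && input.equals(received.get());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The check is still timing-based: it shows the reader had not returned after 300 ms, not that it would wait forever.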
In general: you can't test whether your program waits until something happens (because you can't test whether it waits forever).
What is usually done in such cases:
In your case: don't test it. readLine is provided by an external library that has been tested pretty intensively. The cost of testing this is much higher than the value of such a test. Refactor and test your business code (string/stream operations), not infrastructure (system input).
In the general case, this is testing concurrent programming. It's hard, so people usually try to find some useful simplification:
You can just test whether the output of your program is correct for the input provided in tests.
For some really hard problems, the above technique is combined with running the test thousands of times (and forcing thread switching if possible) to detect errors in concurrent programming (violated invariants).
Before your test provides input data, you can test whether your program is in the correct state (waiting thread, correct field values, etc.).
In the worst case, you can use a delay in tests before providing input, to be sure your program waits and uses the input. This technique is often used because it's simple, but it's smelly, and if you add more tests like that your whole suite gets slower.

Exec'ing multiple processes from Java: outputs mixed up?

I have a servlet which creates a new Action object inside doGet(), and this object uses exec() to run external processes. There may be several requests to the servlet at the same time, so I might have several Action objects where each one is running an external process at the same time. Occasionally when this happens I find that the output from one process gets mixed up with the output from one of the others.
Each Action creates a unique temporary directory and runs the process with this as its current directory. Separate threads then read the output streams from the process into a string buffer. The code that the Action object executes looks basically like this:
Process proc = null;
ReaderThread stdout = null;
ReaderThread stderr = null;
StringBuffer buff = new StringBuffer();
int exit = -1;
try {
//
// Run the process
//
proc = Runtime.getRuntime().exec(command,null,directory);
//
// Read the output from the process
//
stdout = new ReaderThread(proc.getInputStream(),buff);
stderr = new ReaderThread(proc.getErrorStream(),buff);
stdout.start();
stderr.start();
//
// Get the exit code
//
exit = proc.waitFor();
//
// Wait for all the output to be read
//
stdout.join();
stderr.join();
}
catch (InterruptedException e) {
if (proc != null) {
proc.destroy();
}
}
catch (Exception e) {
buff.append(e.getClass() + ": " + e.getMessage());
if (proc != null) {
proc.destroy();
}
}
So each request uses a separate Action object to run a process, and this has its own StringBuffer "buff" that the output of the process is accumulated into by the two ReaderThreads. But what I find is that, when two requests are running two processes at the same time, the output of one will sometimes end up in the StringBuffer of the thread that is running the other one, and one of the two servlet requests will see output intended for the other one. It basically behaves as if Runtime.exec() provides a single global pipe to which the output streams of all the processes are connected.
The ReaderThread looks like this:
public class ReaderThread extends Thread {
private BufferedReader reader;
private StringBuffer buffer;
public ReaderThread (InputStream stream, StringBuffer buffer) {
this.reader = new BufferedReader(new InputStreamReader(stream));
this.buffer = buffer;
}
@Override
public void run () {
try {
String line;
while ((line = reader.readLine()) != null) {
synchronized (buffer) {
buffer.append(line + "\n");
}
}
}
catch (IOException e) {
synchronized (buffer) {
buffer.append(e.getMessage() + "\n");
}
}
}
}
Can anyone suggest what I can do to fix this?
Use a ThreadLocal variable to store the output of each thread.
When and how should I use a ThreadLocal variable?
Here's the explanation, partly as a cautionary tale:
the output was being added to an ArrayList in the XML node that accessed it
the XML nodes were created by cloning prototypes
the ArrayList was initialised in the declaration, not in the initialise() method of the class, so every instance ended up referring to the same object.
Duh.
Another two days of my life down the drain!
Happy new year...
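The cautionary tale can be reproduced in a few lines. PrototypeNode is a hypothetical class, since the real XML node classes are not shown in the post; the sketch assumes the prototypes were copied with a shallow Object.clone().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node type standing in for the XML nodes in the post. */
class PrototypeNode implements Cloneable {
    // Created once per `new`, but Object.clone() copies only the reference.
    List<String> output = new ArrayList<>();

    /** Shallow copy: the clone shares `output` with its prototype. */
    PrototypeNode shallowCopy() {
        try {
            return (PrototypeNode) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }

    /** Copy with its own list, which is what per-instance initialisation would give. */
    PrototypeNode independentCopy() {
        PrototypeNode copy = shallowCopy();
        copy.output = new ArrayList<>(output);
        return copy;
    }
}
```

With shallowCopy(), a line added through one clone shows up in every other clone, which matches the "output of one process ends up in another request" symptom.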

Java read serial data for given time period.

I have a function that reads serial data from an embedded device. My program shows a picture and a title, and the device basically acts as a buzzer for a game. Is there a way to check for serial data for, let's say, 5 seconds and, if nothing was received, continue with the code (go to the next picture and title)? My current function looks like this:
public String getUARTLine(){
String inputLine = null;
try{
BufferedReader input = new BufferedReader(new InputStreamReader(serialPort.getInputStream()));
inputLine = input.readLine();
if (inputLine.length() == 0)
return null;
} catch (IOException e){
//System.out.println("IOException: " + e);
return null;
}
return inputLine;
}
You can start reading data from serialPort and start a timer in other thread. Something like this:
class ReadItWithTimeLimit implements Runnable {
int miliSeconds;
BufferedReader reader;
public ReadItWithTimeLimit (BufferedReader reader, int miliSeconds) {
this.miliSeconds = miliSeconds;
this.reader = reader;
}
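public void run() {
try {
Thread.sleep(miliSeconds);
this.reader.close(); // close the reader once the time limit has passed
} catch (InterruptedException | IOException e) {
e.printStackTrace();
}
}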
}
So you can call it from your code:
// ...
BufferedReader input = new BufferedReader(new InputStreamReader(serialPort.getInputStream()));
new Thread(new ReadItWithTimeLimit(input, 5000)).start();
inputLine = input.readLine();
// ...
This code leaves out most exception handling, so it needs some finishing work...
Drop the buffer. Start a read on the input stream yourself and in a different thread count 5 seconds. After that, close the stream (that will cause the read function to return -1).
Yes you can. You can use a separate timer thread that triggers the timeout that closes the input stream (that will cause input.readLine() to come back with an IOException). Or you can use java.nio. However I personally prefer the first method.
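A different technique from the close-on-timeout answers above is to run the blocking read on a worker thread and wait for it with Future.get(timeout). The sketch below assumes a hypothetical TimedLineReader helper. Two caveats apply: cancel(true) only interrupts the worker, and not every stream (a serial port driver, for example) reacts to interruption; and a read that completes after the timeout consumes a line that nobody receives.

```java
import java.io.BufferedReader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedLineReader {
    // Daemon threads, so a read that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "timed-line-reader");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns the next line, or null if none arrived within timeoutMillis. */
    static String readLine(BufferedReader reader, long timeoutMillis) {
        Callable<String> read = reader::readLine;
        Future<String> pending = POOL.submit(read);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupts the worker; the stream may or may not react
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("read failed", e.getCause());
        }
    }
}
```

In the buzzer game this would be called as readLine(input, 5000), moving on to the next picture when it returns null.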
