One Producer ten consumers file-processing with Executors.newSingleThreadExecutor() - java

I have a LinkedBlockingQueue with an arbitrarily chosen capacity of 10 and an input file with 1000 lines. The main method of the service class has a single ExecutorService variable. To my knowledge, it first uses Executors.newSingleThreadExecutor() to start one thread that calls buffer.readLine() until the line is null, and then, in a loop, uses Executors.newSingleThreadExecutor() again to start ten threads that process lines and write them to output files until queue.take() returns "Stop". However, after some lines have been written, I can see in debug mode that the queue eventually reaches its maximum size (10) and the processing threads no longer execute queue.take(). All threads are in the running state, but the process halts after queue.put(). What would cause this problem, and can it be solved with some combination of thread pooling or multiple ExecutorService handler variables instead of a single variable?
Outline for current state of main method in service:
//app settings to get values for keys within a properties file
AppSettings appSettings = new AppSettings();
BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10);
maxProdThreads = 1;
maxConsThreads = 10;
ExecutorService execSvc = null;
for (int i = 0; i < maxProdThreads; i++) {
execSvc = Executors.newSingleThreadExecutor();
execSvc.submit(new ReadJSONMessage(appSettings,queue));
}
for (int i = 0; i < maxConsThreads; i++) {
execSvc = Executors.newSingleThreadExecutor();
execSvc.submit(new ProcessJSONMessage(appSettings,queue));
}
Reading method code:
buffer = new BufferedReader(new FileReader(inputFilePath));
while((line = buffer.readLine()) != null){
line = line.trim();
queue.put(line);
}
Processing and Writing code:
while(!(line=queue.take()).equals("Stop")){
if(line.length() > 10)
{
try {
if(processMessage(line, outputFilePath) == true)
{
++count;
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
public boolean processMessage(String line, String outputFilePath){
CustomObject cO = new CustomObject();
cO.setText(line);
writeOutputAToFile(cO,...);
writeOutputBToFile(cO,...);
return true; // assumed: the outline omits how success is determined
}
public void writeOutputAToFile(CustomObject cO,...){
synchronized(cO){
...
org.apache.commons.io.FileUtils.writeStringToFile(...)
}
}
public void writeOutputBToFile(CustomObject cO,...){
synchronized(cO){
...
org.apache.commons.io.FileUtils.writeStringToFile(...)
}
}

In the processing and writing code, make sure that all resources are closed properly. If they are not, the threads probably keep running, and the ExecutorService cannot find an idle thread.
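For comparison, here is a minimal sketch of the one-producer, ten-consumer layout built on a single fixed-size pool instead of eleven single-thread executors. It is not a diagnosis of the hang described above. The class name PoisonPillDemo, the run method, and the counter that stands in for processMessage(...) are assumptions made for illustration. The producer queues one "Stop" marker per consumer, because each consumer removes the marker it reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoisonPillDemo {
    // Sentinel that tells a consumer to exit; matches the "Stop" check in the question.
    static final String STOP = "Stop";

    /** Pushes the lines through a bounded queue and returns how many were processed. */
    static int run(List<String> lines, int consumers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10);
        AtomicInteger processed = new AtomicInteger();
        // One pool sized for the producer plus all consumers.
        ExecutorService pool = Executors.newFixedThreadPool(consumers + 1);
        try {
            pool.submit(() -> {
                try {
                    for (String line : lines) {
                        queue.put(line.trim());
                    }
                    // One marker per consumer; otherwise the others block in take() forever.
                    for (int i = 0; i < consumers; i++) {
                        queue.put(STOP);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            for (int i = 0; i < consumers; i++) {
                pool.submit(() -> {
                    try {
                        String line;
                        while (!(line = queue.take()).equals(STOP)) {
                            processed.incrementAndGet(); // hypothetical stand-in for processMessage(...)
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.shutdown(); // no new tasks; the submitted ones keep running
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pipeline did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            lines.add("line number " + i);
        }
        System.out.println(run(lines, 10)); // prints 1000
    }
}
```

Note that an input line that happens to equal "Stop" would end a consumer early in this sketch; a marker object compared by identity avoids that.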

Related

Process large text file concurrently

So I have a large text file, in this case it's roughly 4.5 GB, and I need to process the entire file as fast as is possible. Right now I have multi-threaded this using 3 threads (not including the main thread). An input thread for reading the input file, a processing thread to process the data, and an output thread to output the processed data to a file.
Currently, the bottleneck is the processing section. Therefore, I'd like to add more processing threads into the mix. However, this creates a situation where I've got multiple threads accessing the same BlockingQueue, and their results are therefore not maintaining the order of the input file.
An example of the functionality I'm looking for would be something like this:
Input file: 1, 2, 3, 4, 5
Output file: ^ the same. Not 2, 1, 4, 3, 5 or any other combination.
I've written a dummy program that is identical in functionality to the actual program minus the processing part, (I can't give you the actual program due to the processing class containing info that is confidential). I should also mention, all of the classes (Input, Processing, and Output) are all Inner classes contained within a Main class that contains the initialise() method and the class level variables mentioned in the main thread code listed below.
Main thread:
static volatile boolean readerFinished = false; // class level variables
static volatile boolean writerFinished = false;
private void initialise() throws IOException {
BlockingQueue<String> inputQueue = new LinkedBlockingQueue<>(1_000_000);
BlockingQueue<String> outputQueue = new LinkedBlockingQueue<>(1_000_000); // capacity 1 million.
String inputFileName = "test.txt";
String outputFileName = "outputTest.txt";
BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
BufferedWriter writer = new BufferedWriter(new FileWriter(outputFileName));
Thread T1 = new Thread(new Input(reader, inputQueue));
Thread T2 = new Thread(new Processing(inputQueue, outputQueue));
Thread T3 = new Thread(new Output(writer, outputQueue));
T1.start();
T2.start();
T3.start();
while (!writerFinished) {
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
reader.close();
writer.close();
System.out.println("Exited.");
}
Input thread: (Please forgive the commented debug code, was using it to ensure the reader thread was actually executing properly).
class Input implements Runnable {
BufferedReader reader;
BlockingQueue<String> inputQueue;
Input(BufferedReader reader, BlockingQueue<String> inputQueue) {
this.reader = reader;
this.inputQueue = inputQueue;
}
@Override
public void run() {
String poisonPill = "ChH92PU2KYkZUBR";
String line;
//int linesRead = 0;
try {
while ((line = reader.readLine()) != null) {
inputQueue.put(line);
//linesRead++;
/*
if (linesRead == 500_000) {
//batchesRead += 1;
//System.out.println("Batch read");
linesRead = 0;
}
*/
}
inputQueue.put(poisonPill);
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}
readerFinished = true;
}
}
Processing thread: (Normally this would actually be doing something to the line, but for purposes of the mockup I've just made it immediately push to the output thread). If necessary we can simulate it doing some work by making the thread sleep for a small amount of time for each line.
class Processing implements Runnable {
BlockingQueue<String> inputQueue;
BlockingQueue<String> outputQueue;
Processing(BlockingQueue<String> inputQueue, BlockingQueue<String> outputQueue) {
this.inputQueue = inputQueue;
this.outputQueue = outputQueue;
}
@Override
public void run() {
while (true) {
try {
if (inputQueue.isEmpty() && readerFinished) {
break;
}
String line = inputQueue.take();
outputQueue.put(line);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
Output thread:
class Output implements Runnable {
BufferedWriter writer;
BlockingQueue<String> outputQueue;
Output(BufferedWriter writer, BlockingQueue<String> outputQueue) {
this.writer = writer;
this.outputQueue = outputQueue;
}
@Override
public void run() {
String line;
ArrayList<String> outputList = new ArrayList<>();
while (true) {
try {
line = outputQueue.take();
if (line.equals("ChH92PU2KYkZUBR")) {
for (String outputLine : outputList) {
writer.write(outputLine);
}
System.out.println("Writer finished - executing termination");
writerFinished = true;
break;
}
line += "\n";
outputList.add(line);
if (outputList.size() == 500_000) {
for (String outputLine : outputList) {
writer.write(outputLine);
}
System.out.println("Writer wrote batch");
outputList = new ArrayList<>();
}
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}
}
}
}
So right now the general data flow is very linear, looking something like this:
Input > Processing > Output.
But what I'd like to have is something like this:
But the catch is, when the data gets to output, it either needs to be sorted into the correct order, or it needs to already be in the correct order.
Recommendations or examples on how to go about this would be greatly appreciated.
In the past I have used the Future and Callable interfaces to solve a task involving parallel data flows like this, but unfortunately that code was not reading from a single queue, and so is of minimal help here.
I should also add, for those of you that will notice this, batchSize and poisonPill are normally defined in the main thread and then passed around via variables, they are not usually hard coded as they are in the code for Input thread, and the output checks for the writer thread. I was just a wee bit lazy when writing the mockup for experimentation at ~1am.
Edit: I should also mention, this is required to use Java 8 at most. Java 9 features and above cannot be used due to these versions not being installed in the environments in which this program will be run.
What you could do:
Take X threads for processing, where X is the number of cores available for processing
Give each thread its own input queue.
The reader thread gives records to each thread's input queue round-robin in a predictable fashion.
Since the output files are too big for memory, you write X output files, one for each thread, and each file name has the index of the thread in it, so that you can reconstitute the original order from the file names.
After the process is complete, you merge the X output files. One line from the file for thread 1, one from the files for thread 2, etc. in a round-robin fashion again. This reconstitutes the original order.
As an added bonus, since you have an input queue per thread, you don't have lock contention on the queue between readers. (only between the reader and the writer) You could even optimize this by putting things in the input queues in batches larger than 1.
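The round-robin scheme described above can be sketched in memory as follows. This is an illustration under assumptions, not the answerer's code: it swaps the per-thread output files for per-worker output queues (which only works while the merger keeps up), and the class name RoundRobinPipeline, the queue capacity of 64, and the END marker are made up for the example. The merger reads the output queues in the same round-robin order the reader used, which restores the input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class RoundRobinPipeline {
    // End marker, compared by identity so it cannot collide with a real line.
    private static final String END = new String("END");

    static List<String> process(List<String> lines, int workers, Function<String, String> work) {
        List<BlockingQueue<String>> in = new ArrayList<>();
        List<BlockingQueue<String>> out = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            in.add(new ArrayBlockingQueue<>(64));
            out.add(new ArrayBlockingQueue<>(64));
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers + 1);
        List<String> result = new ArrayList<>();
        try {
            // Reader: line i goes to worker i % workers, then every worker gets an end marker.
            pool.submit(() -> {
                try {
                    for (int i = 0; i < lines.size(); i++) {
                        in.get(i % workers).put(lines.get(i));
                    }
                    for (BlockingQueue<String> q : in) {
                        q.put(END);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Workers: each one owns exactly one input queue and one output queue.
            for (int w = 0; w < workers; w++) {
                BlockingQueue<String> myIn = in.get(w);
                BlockingQueue<String> myOut = out.get(w);
                pool.submit(() -> {
                    try {
                        String line;
                        while ((line = myIn.take()) != END) {
                            myOut.put(work.apply(line));
                        }
                        myOut.put(END);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Merger (calling thread): visit the output queues in the same round-robin order.
            for (int i = 0; ; i++) {
                String s = out.get(i % workers).take();
                if (s == END) {
                    break; // the first marker reached in order means every line was merged
                }
                result.add(s);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1", "2", "3", "4", "5");
        System.out.println(process(input, 3, s -> s + "!")); // prints [1!, 2!, 3!, 4!, 5!]
    }
}
```

The sketch does no error handling: if the work function throws, that worker dies and the merger waits forever, so real code would need to propagate failures.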
As was also proposed by Alexei, you can create OrderedTask:
class OrderedTask implements Comparable<OrderedTask> {
private final Integer index;
private final String line;
public OrderedTask(Integer index, String line) {
this.index = index;
this.line = line;
}
@Override
public int compareTo(OrderedTask o) {
return index.compareTo(o.getIndex()); // compare values; == on two Integer objects compares references
}
public Integer getIndex() {
return index;
}
public String getLine() {
return line;
}
}
As an output queue you can use your own backed by priority queue:
class OrderedTaskQueue {
private final ReentrantLock lock;
private final Condition waitForOrderedItem;
private final int maxQueuesize;
private final PriorityQueue<OrderedTask> backedQueue;
private int expectedIndex;
public OrderedTaskQueue(int maxQueueSize, int startIndex) {
this.maxQueuesize = maxQueueSize;
this.expectedIndex = startIndex;
this.backedQueue = new PriorityQueue<>(2 * this.maxQueuesize);
this.lock = new ReentrantLock();
this.waitForOrderedItem = this.lock.newCondition();
}
public boolean put(OrderedTask item) {
ReentrantLock lock = this.lock;
lock.lock();
try {
while (this.backedQueue.size() >= maxQueuesize && item.getIndex() != expectedIndex) {
this.waitForOrderedItem.await();
}
boolean result = this.backedQueue.add(item);
this.waitForOrderedItem.signalAll();
return result;
} catch (InterruptedException e) {
throw new RuntimeException();
} finally {
lock.unlock();
}
}
public OrderedTask take() {
ReentrantLock lock = this.lock;
lock.lock();
try {
while (this.backedQueue.peek() == null || this.backedQueue.peek().getIndex() != expectedIndex) {
this.waitForOrderedItem.await();
}
OrderedTask result = this.backedQueue.poll();
expectedIndex++;
this.waitForOrderedItem.signalAll();
return result;
} catch (InterruptedException e) {
throw new RuntimeException();
} finally {
lock.unlock();
}
}
}
startIndex is the index of the first ordered task, and maxQueueSize is used to pause the processing of other tasks (so that memory does not fill up) while we wait for some earlier task to finish. It should be double or triple the number of processing threads, so that processing does not stop immediately and the solution can scale.
Then you should create your task :
int indexOrder =0;
while ((line = reader.readLine()) != null) {
inputQueue.put(new OrderedTask(indexOrder++, line));
}
Line-by-line tasks are used here only because your example works that way. You should change OrderedTask to carry a batch of lines.
Why not reverse the flow?
The output stage asks for X batches.
Generate X promises/tasks (promise pattern), each of which calls one of the processing cores at random (keep a batch number and pass it through to the input core); collect the resulting handles in an ordered list.
Each processing core requests a batch from the input core.
Enjoy?
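One way to read the promise idea, sketched here with the Future and Callable interfaces the question mentions and using only Java 8 APIs: the reader submits each line to a pool and queues the resulting Future in submission order, and the writer takes the Futures in that order and waits for each result. The class name OrderedFutures, the queue capacity of 4 × threads, and the end marker are assumptions for illustration, and this is a simplification rather than the exact flow the answer describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class OrderedFutures {
    static List<String> process(List<String> lines, int threads, Function<String, String> work) {
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        // Bounded, so the reader cannot run arbitrarily far ahead of the writer.
        BlockingQueue<Future<String>> pending = new ArrayBlockingQueue<>(4 * threads);
        // Marker future, recognised by identity, that signals the end of input.
        Future<String> end = CompletableFuture.completedFuture(null);

        Thread reader = new Thread(() -> {
            try {
                for (String line : lines) {
                    Callable<String> task = () -> work.apply(line);
                    pending.put(workers.submit(task)); // queued in input order
                }
                pending.put(end);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        List<String> result = new ArrayList<>();
        try {
            Future<String> next;
            while ((next = pending.take()) != end) {
                // get() blocks until this particular line is done, so output order equals input order.
                result.add(next.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            reader.interrupt();
            workers.shutdownNow();
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(process(input, 4, String::toUpperCase)); // prints [A, B, C, D, E]
    }
}
```

A slow line delays the writer but not the other workers, up to the capacity of the pending queue. Submitting batches of lines instead of single lines would reduce the per-task overhead.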

Increasing Disk Read Throughput By Concurrency

I am trying to read a log file and parse it; the parsing is purely CPU-bound. My server can read a huge text file at 230 MB/second when it only reads the file and does not parse it. When I also parse the text with a single thread, I get around 50-70 MB/second.
I want to increase my throughput by doing the job concurrently. With the code below I reached 130 MB/second, with a peak of 190 MB/second. I tried BlockingQueue, Semaphore, ExecutorService, etc. Do you have any advice on how to reach 200 MB/second?
public static void fileReaderTestUsingSemaphore(String[] args) throws Exception {
CustomFileReader reader = new CustomFileReader(args[0]);
final int concurrency = Integer.parseInt(args[1]);
ExecutorService executorService = Executors.newFixedThreadPool(concurrency);
Semaphore semaphore = new Semaphore(concurrency,true);
System.out.println("Conccurrency in Semaphore: " + concurrency);
String line;
while ((line = reader.getLine()) != null)
{
semaphore.acquire();
try
{
final String p = line;
executorService.execute(new Runnable() {
@Override
public void run() {
reader.splitNginxLinewithIntern(p); // that is the method which parser string and convert to class.
semaphore.release();
}
});
}
catch (Exception ex)
{
ex.printStackTrace();
}
finally {
semaphore.release();
}
}
executorService.shutdown();
executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.MINUTES);
System.out.println("ReadByteCount: " + reader.getReadByteCount());
}
You might benefit from the Files.lines() method and the Stream paradigm introduced in Java 8. A parallel stream uses the system's common fork/join pool. Try this pattern:
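import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;
public class LineCounter
{
public static void main(String[] args) throws IOException
{
// try-with-resources closes the underlying file once the stream is done
try (Stream<String> lines = Files.lines(Paths.get("/your/file/here"))) {
lines.parallel()
.forEach(LineCounter::processLine);
}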
}
private static void processLine(String line) {
// do the processing
}
}
Assuming that you don't care about order of lines:
final String MARKER = new String("");
BlockingQueue<String> q = new LinkedBlockingDeque<>(1024);
for (int i = 0; i < concurrency; i++)
executorService.execute(() -> {
for (;;) {
try {
String s = q.take();
if(s == MARKER) {
q.put(s);
return;
}
reader.splitNginxLinewithIntern(s);
} catch (InterruptedException e) {
return;
}
}
});
String line;
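while ((line = reader.getLine()) != null) {
q.put(line);
}
q.put(MARKER);
// stop accepting new tasks; without this, awaitTermination only returns after the timeout
executorService.shutdown();
executorService.awaitTermination(10, TimeUnit.MINUTES);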
This starts a number of threads that each runs a specific task; that task is to read from the queue and run the split method. The reader just feeds the queue, notifies when it's complete and waits for termination.
If you were to use RxJava2 and rxjava2-extras that would simply be
Strings.from(reader)
.flatMap(str -> Flowable
.just(str)
.observeOn(Schedulers.computation())
.doOnNext(reader::splitNginxLinewithIntern)
)
.blockingSubscribe();
You need to go multi-thread, and you need to have the reader thread delegate the parsing to worker threads, that's clear. The point is how to do this delegating with as little overhead as possible.
@Tassos provided code that looks like a solid improvement.
One more thing you can try is to change the delegation granularity, not delegating every single line individually, but building chunks of e.g. 100 lines, thus reducing the delegating/synchronizing overhead by a factor of 100 (but then needing a String[] array or similar, which shouldn't hurt too much).
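The chunking idea can be sketched like this. It is a minimal illustration, not a tuned implementation: the class name ChunkedParsing, the parseInChunks method, and the Consumer used as a stand-in for the parsing method are assumptions, and the chunk size and thread count are parameters to experiment with.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class ChunkedParsing {
    /** Reads all lines, hands them to the pool in chunks, and returns the number of lines parsed. */
    static long parseInChunks(BufferedReader reader, int chunkSize, int threads, Consumer<String> parser) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong parsed = new AtomicLong();
        try {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    submitChunk(pool, chunk, parser, parsed);
                    chunk = new ArrayList<>(chunkSize); // the submitted list now belongs to the worker
                }
            }
            if (!chunk.isEmpty()) {
                submitChunk(pool, chunk, parser, parsed); // last, partially filled chunk
            }
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                throw new IllegalStateException("parsing did not finish in time");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return parsed.get();
    }

    // One task (and one synchronization point) per chunk instead of per line.
    private static void submitChunk(ExecutorService pool, List<String> chunk,
                                    Consumer<String> parser, AtomicLong parsed) {
        pool.execute(() -> {
            for (String s : chunk) {
                parser.accept(s);
            }
            parsed.addAndGet(chunk.size());
        });
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 250; i++) {
            text.append("log line ").append(i).append('\n');
        }
        BufferedReader reader = new BufferedReader(new StringReader(text.toString()));
        System.out.println(parseInChunks(reader, 100, 4, s -> s.split(" "))); // prints 250
    }
}
```

The executor's internal queue is unbounded here, so a reader that is much faster than the parsers can accumulate chunks in memory. A bounded queue or a semaphore, as in the question, would cap that.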

Multiple threads reading from same file/odd behavior

I'm writing a program that needs to read lines from a very large file (400K+ lines) and send the data in each line to a web service. I decided to try threading and am seeing behavior I did not expect: my BufferedReader appears to start reusing lines it has already given me when I call readLine() on it.
My program is made up of two classes: a "Main" class that kicks off the threads, holds a static reference to the BufferedReader, and has a static synchronized "readNextLine()" method that the threads use to call readLine() on the BufferedReader; and a "Runnable" class that calls readNextLine() and makes a web service call with the data from each line. I made the BufferedReader and readNextLine() static only because that was the only way I could think of for the threads to share the reader, aside from passing an instance of my main class into the threads. I wasn't sure which was better.
After about 5 minutes, I start seeing errors in my web service saying that it's processing a line it's already processed. I'm able to verify lines are indeed being sent multiple times, minutes apart.
Does anyone have any idea why the BufferedReader seems to be giving the threads lines it has already read? I was under the impression that readLine() was sequential and that all I needed to do was make sure the calls to readLine() were synchronized.
I'll show some of the Main class code below. The runnable is essentially a while loop that calls readNextLine() and processes each line until there are no more lines left.
Main class:
//showing reader and thread creation
inputStream = sftp.get(path to file);
reader = new BufferedReader(new InputStreamReader(inputStream));
ExecutorService executor = Executors.newFixedThreadPool(threads);
Collection<Future> futures = new ArrayList<Future>();
for(int i=0;i<threads;i++){
MyRunnable runnable = new MyRunnable(i);
futures.add(executor.submit(runnable));
}
LOGGER.debug("futures.get()");
for(Future f:futures){
f.get(); //use to wait until all threads are done
}
public synchronized static String readNextLine(){
String results = null;
try{
if(reader!=null){
results = reader.readLine();
}
}catch(Exception e){
LOGGER.error("Error reading from file");
}
return results;
}
I tested what you described, and I think there is a logic error in your readNextLine() method: how can reader.readLine() be invoked when results is null and the if condition requires it to be non-null?
I have now finished my demo, and it seems to work well. Here it is; no line was read twice:
static BufferedReader reader;
public static void main(String[] args) throws FileNotFoundException, ExecutionException, InterruptedException {
reader = new BufferedReader(new FileReader("test.txt"));
ExecutorService service = Executors.newFixedThreadPool(3);
List<Future> results = new ArrayList<Future>();
for (int i = 0; i < 3; i++) {
results.add(service.submit(new Runnable() {
@Override
public void run() {
try {
String line = null;
while ((line = readNextLine()) != null) {
System.out.println(line);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}));
}
}
public synchronized static String readNextLine() throws IOException {
return reader.readLine();
}

Issue with wait - notify implementation

I am working on Java multithreading: I start 4 threads, each assigned a different file to upload to the server.
My objective is that when one thread completes its upload, another thread starts with a new file.
After each file upload, I receive a notification from the server.
// The code for adding the first set of files
for (int count = 0; count < 4; count++) {
if (it.hasNext()) {
File current = new File((String) it.next());
try {
Thread t = new Thread(this, current );
t.start();
t.sleep(100);
} catch (Exception e) {
}
}
}
Now I assign a file to another thread and keep that thread in a wait state.
When a previous thread notifies, the current thread should start its upload.
if (tempThreadCounter == 4 ) {
if (it.hasNext()) {
File current = new File((String) it.next());
try {
Thread t = new Thread(this, current);
t.start();
synchronized (this) {
t.wait();
}
tempThreadCounter++;
} catch (Exception e) {
}
}
}
As the final statement of the run method, I added the following.
public void run (){
// Performing different operations
//Final statement of the run method below
synchronized (this) {
this.notifyAll();
}
}
Currently, all 5 threads start uploading at the same time.
Instead, the first 4 threads should start uploading, and the fifth should start only when it is notified by any thread that has completed its operation.
Any suggestions on what is wrong with this thread implementation?
You can use ExecutorService with newFixedThreadPool and specify a concurrency of 1. But really, then why do you need multiple threads? One thread doing all the uploads so the user interface remains responsive should be enough.
ExecutorService exec = Executors.newFixedThreadPool(1); //1 thread at a time
for (int count = 0; count < 4; count++) {
if (it.hasNext()) {
File current = new File((String) it.next());
exec.execute(new Runnable() {
@Override
public void run() {
upload(current);
}
});
}
}
exec.shutdown();
exec.awaitTermination(900, TimeUnit.SECONDS);
Throw it all away and use java.util.concurrent.Executor.
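As a rough sketch of what that could look like for the stated goal (at most four uploads at a time, with the next file starting as soon as one finishes): a fixed pool of four threads queues the remaining files by itself, so no wait/notify is needed. The class name UploadPool, the uploadAll method, and the Consumer standing in for the real upload call are assumptions for illustration.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class UploadPool {
    /** Uploads every file with at most maxConcurrent uploads running at once; returns the number completed. */
    static int uploadAll(List<String> paths, int maxConcurrent, Consumer<File> upload) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger done = new AtomicInteger();
        try {
            for (String path : paths) {
                File file = new File(path);
                // Tasks beyond maxConcurrent wait in the pool's queue until a thread is free.
                pool.execute(() -> {
                    upload.accept(file); // hypothetical stand-in for the real upload call
                    done.incrementAndGet();
                });
            }
            pool.shutdown(); // no new tasks; queued uploads still run
            if (!pool.awaitTermination(15, TimeUnit.MINUTES)) {
                throw new IllegalStateException("uploads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return done.get();
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        int count = uploadAll(files, 4, f -> System.out.println("uploading " + f.getName()));
        System.out.println(count + " uploads finished");
    }
}
```

With five files and a limit of four, the fifth upload starts only when one of the first four has finished, which is the behavior the question asks for.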
You can join on the thread instead of waiting on it:
try {
t.join();
}catch(InterruptedException e){
throw new RuntimeException(e);
}
tempThreadCounter++;

Thread stopping in synchronized block

I am facing a problem with a stopping thread which is in a synchronized block. I am using TCP socket. The problem is that I am waiting on a DataInputStream object and I want to create a new socket again but it doesn't allow me to do because of the synchronized block.
I have tried with Thread.interrupted(). I cannot avoid the synchronized block. Is there any other way to do the same?
dis = new DataInputStream(ReadWriteData.kkSocket.getInputStream());
int i = -1;
String aval = ""; //new String();
char c = (char)dis.read();
It is getting blocked on dis.read().
What I should do for escaping the dis.read when I want to create a new socket?
Thanks in advance.
You could always check if there is data available to read, by calling dis.available() to determine the number of bytes that can be read without blocking.
Using some additional logic could then allow for the creation of the new socket.
You could close the stream and catch it that way, but that may not always be the best option.
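The close-and-catch option can be sketched as follows. Socket.close() is documented to make any thread blocked in an I/O operation on that socket throw a SocketException, which the blocked reader can catch before a new socket is created. The class name CloseToUnblock, the loopback server used to obtain a connected socket, and the 200 ms pause are assumptions made so the example is self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class CloseToUnblock {
    /** Returns true if a read blocked on the socket was released by closing it from another thread. */
    static boolean demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            boolean[] released = {false};
            Thread readerThread = new Thread(() -> {
                try {
                    InputStream in = client.getInputStream();
                    in.read(); // blocks: the peer never sends anything
                } catch (IOException e) {
                    released[0] = true; // closing the socket ends the blocked read with an exception
                }
            });
            readerThread.start();
            Thread.sleep(200);       // give the reader time to block in read()
            client.close();          // called from this thread, not from the reader
            readerThread.join(5000); // join makes the reader's write to released[0] visible here
            return released[0] && !readerThread.isAlive();
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("read released by close: " + demo());
    }
}
```

The thread that closes the socket must not need the lock held by the blocked reader; otherwise it would block as well.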
If you know how many streams you want to have you can do something like:
private static final int N_STREAMS = 3;
...
InputStream[] streams = new InputStream[N_STREAMS];
List<StringBuilder> outputBuilders = new ArrayList<StringBuilder>();
for (int i=0; i < N_STREAMS; i++) {
// A StringBuilder for every stream
outputBuilders.add(new StringBuilder());
try {
streams[i] = new DataInputStream(ReadWriteData.kkSocket.getInputStream());
} catch (IOException e) {
// Propagate error or ignore?
}
}
// Read data
for (int i=0; i < N_STREAMS; i++) {
InputStream currentStream = streams[i];
StringBuilder currentBuilder = outputBuilders.get(i);
if (currentStream != null && currentStream.available() > 0) {
try {
currentBuilder.append((char) currentStream.read());
} catch (IOException e) {
// Do something
}
}
if (currentStream != null && currentStream??? == EOF) {
// I don't know how to detect EOF on this stream...
try {
streams[i] = null; // Mark it as closed
currentStream.close();
} catch (...) {
// Do something
}
}
}
// You now have a StringBuilder per socket; do with it what you want
As you might have noticed I don't know anything about Android. But this seems generic enough to fit your usage, or at least provide a hint.
