I just want to make sure I understand how to use for loops inside a SwingWorker's doInBackground() method.
For example, I have a list of files stored in Files ( File[] Files = ... ).
scanFiles = new SwingWorker<Object, Object>(){
@Override
public Object doInBackground(){
for( File f : Files ){
// process file f
}
return null;
}
};
....
scanFiles.execute();
In the above, is it all right to use a for loop inside the doInBackground() method to go through a list of files, or is it better to move the for loop outside doInBackground(), as in something like this:
for( File f : Files ){
processFile(f);
}
private void processFile(final File f){ // f must be (effectively) final for use in the anonymous class
scanFiles = new SwingWorker<Object, Object>(){
@Override
public Object doInBackground(){
// do something with f
return null;
}
};
}
The above is skeleton code, not actual working code; it is only meant to illustrate what I want to do. That is, I don't want my program to scan files one by one; I want something like parallel processing of the files...
Thanks.
As mentioned in some of the comments: The appropriate solution heavily depends on how many files you want to process, and what processFile actually does.
The main difference between your approaches is (as MadProgrammer already said)
The first one creates one background thread that processes all the files
The second one creates many background threads, each processing one file
The border cases where either of the approaches is not appropriate are analogously:
The first one may be better when there are many files and processFile is a simple operation
The second one may be better when there are few files and processFile is a complex operation
But this is only a rough classification, and which one is the "best" approach still depends on other factors.
However, I'd like to propose another solution that allows you to shift rather flexibly between the two extremes: you could create a List containing the File objects and split this list into a specified number of "chunks", each of which is processed by its own SwingWorker.
Sketched here, to show the basic idea: You create a method that processes a list of files with a SwingWorker:
private void processFiles(final List<File> files) {
SwingWorker<Object, Object> scanFiles = new SwingWorker<Object, Object>(){
@Override
public Object doInBackground(){
// do something with files
return null;
}
};
scanFiles.execute();
}
Then, at the call site, you can do the following:
// Obtain the list of files to process
File files[] = ...
List<File> fileList = Arrays.asList(files);
// Define the number of workers that should be used
int numWorkers = 10;
// Compute how many files each worker will process
int chunkSize = (int)Math.ceil((double)fileList.size() / numWorkers);
for (int i=0; i<numWorkers; i++) {
// Compute the part of the "fileList" that the worker will process
int minIndex = i * chunkSize;
int maxIndex = Math.min(minIndex + chunkSize, fileList.size());
List<File> chunk = fileList.subList(minIndex, maxIndex);
// Start the worker
processFiles(chunk);
}
(This is only a sketch; there may be some index hassle involved. If desired, I can post a more elaborate version; for now it only shows the basic idea.)
Then, you can define how many worker threads you would like to use (maybe even based on Runtime.getRuntime().availableProcessors()).
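If it helps, the index arithmetic from the call-site sketch above can be pulled into a small helper. This is only an illustration; the `Chunks` class and `partition` method are names I made up, and note that `subList` returns a view of the backing list, which is fine here since the chunks are only read:

```java
import java.util.ArrayList;
import java.util.List;

public class Chunks {
    // Splits "list" into consecutive chunks of (nearly) equal size,
    // using at most "numWorkers" chunks; the last chunk may be smaller.
    static <T> List<List<T>> partition(List<T> list, int numWorkers) {
        List<List<T>> chunks = new ArrayList<>();
        int chunkSize = (int) Math.ceil((double) list.size() / numWorkers);
        for (int min = 0; min < list.size(); min += chunkSize) {
            int max = Math.min(min + chunkSize, list.size());
            chunks.add(list.subList(min, max));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // One worker per core is a reasonable default
        int numWorkers = Runtime.getRuntime().availableProcessors();
        System.out.println("workers: " + numWorkers);

        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 10; i++) items.add(i);
        System.out.println(partition(items, 3)); // [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    }
}
```

Each chunk would then be handed to `processFiles(chunk)` as in the loop above.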
If you want to process files in parallel, you must spawn several worker threads, so the second sample should be your choice. You can inform the UI, or other components of your program, about the progress of file processing using the following methods: protected void process(List<V> chunks) and protected final void publish(V... chunks)
private void processFile(final File f){
scanFiles = new SwingWorker<Object, Object>(){
@Override
public Object doInBackground(){
// do the work on f, then hand intermediate results to process():
publish(f);
return null;
}
};
}
@Override
protected void process(List<Object> chunks) {
// do something with the intermediate data, for example show progress in the UI
}
Related
I'm making a program that gets live price information from an API. I then want to display the price information on a JavaFX chart that live updates. When I try to pass the information to the JavaFX Thread it doesn't always pass over correctly and the JavaFX thread doesn't get the price information.
The API call is done on a separate thread, and when it has the information it calls the updateScene method. This is where I get the issue: the API thread has all the information, but when I set the variable for the JavaFX thread to use, it has none of the information.
private CandleStickFeed candleStickFeed;
public void updateScene(CandleStickFeed candleStickFeed){
this.candleStickFeed = candleStickFeed;
System.out.println("Feed Size This Thread = " + candleStickFeed.getSize());
Platform.runLater(new Runnable(){
@Override
public void run(){
System.out.println("Feed Size JavaFX Thread = " + candleStickFeed.getSize());
updateChart();
}
});
}
The program will sometimes output
Feed Size This Thread = 5
Feed Size JavaFX Thread = 5
which is what I would expect. But it also sometimes outputs
Feed Size This Thread = 5
Feed Size JavaFX Thread = 0
Any help would be greatly appreciated. I'm new to using multiple threads, so I'm not really sure what I'm doing. I have looked for answers but couldn't find any. Thank you.
Try to extract the relevant information from the candleStickFeed, and pass that structure into a Runnable subclass.
public void updateScene(CandleStickFeed candleStickFeed) {
CandleStickData data = new CandleStickData(candleStickFeed);
Platform.runLater(new ChartUpdateRunnable (data));
}
private class CandleStickData {
private double[] numbers; // or whatever you need
CandleStickData (CandleStickFeed candleStickFeed) {
this.numbers = new double[candleStickFeed.getSize()];
// TODO: populate the data structure
}
}
private class ChartUpdateRunnable implements Runnable {
private CandleStickData data;
ChartUpdateRunnable(CandleStickData data) {
this.data = data;
}
@Override
public void run(){
System.out.println("Feed Size JavaFX Thread = " + data.numbers.length);
updateChart();
}
}
The principle is not to pass around a feed class whose state may change often, but to extract an immutable state object and pass that to a runnable class instance for the update.
This is a supplement to @Simon's answer.
Your problem is not about "passing a variable." Your runLater(...) task apparently is getting the value of the variable. The problem is that the value is a reference to a mutable object, and that object sometimes is modified in between the time when the task is created and the time when the task is executed.
Simon's suggestion boils down to this: Give the new task its own private copy of whatever information it will need from the CandleStickFeed instance, and let it work exclusively from that private copy.
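This snapshot idea can be demonstrated without JavaFX at all. The sketch below uses a single-thread ExecutorService to stand in for the FX application thread, and captures the value before queueing the task; all class and method names here are invented for the illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SnapshotDemo {
    // A mutable "feed" that another thread may change at any time.
    static class Feed {
        private final AtomicInteger size = new AtomicInteger();
        int getSize() { return size.get(); }
        void setSize(int s) { size.set(s); }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService uiThread = Executors.newSingleThreadExecutor(); // stands in for Platform.runLater
        Feed feed = new Feed();
        feed.setSize(5);

        // Snapshot the value NOW, while we know it is correct...
        final int snapshot = feed.getSize();
        uiThread.submit(() -> {
            // ...so the queued task sees 5 even if the feed is mutated meanwhile.
            System.out.println("Snapshot seen by task = " + snapshot);
        });

        feed.setSize(0); // a later mutation no longer affects the queued task
        uiThread.shutdown();
        uiThread.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The same pattern applied to the question's code is exactly what Simon's CandleStickData copy achieves: the Runnable works only from its private, already-captured state.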
Here is a brief description of what I want to do. I have a scenario where:
A number of text files are generated dynamically on a daily basis, 0 to 8 per day. The size of each file can range from small to big, depending on the day's data.
I need to run some checks (business checks) on them.
I plan to complete the task in minimum time, hence I am trying to write a parallel executor that performs the checks on these files.
My idea is:
Store the n files in a concurrent collection (ConcurrentLinkedQueue)
Remove a file and spawn a thread that runs all checks on that file
Since one file has no relation to another, I want to be able to process multiple files in parallel
Store results in another concurrent collection (ConcurrentLinkedQueue), which is later converted to various HTML/PDF reports
NOTE: the number of threads can differ from the number of files (I want the number of threads to be configurable; it is not the case that number of files = number of threads)
My understanding is that this way I should be able to complete the DAILY checks in minimum time.
My code is below. What confuses me is how to store all the threads' results in a single collection after each thread's completion; my gut feeling is that I am doing something funny (incorrect) in the way I am storing results.
Second question: I wanted to check whether anyone foresees any other issues in the code snippet below.
Third question: this seems (to me) like a common use case; any pointers to design patterns or code snippets solving it?
Note: I am using JDK 6.
public class CheckExecutor {
// to store all results of all threads here , then this will be converted to html/pdf files
static ConcurrentLinkedQueue<Result> fileWiseResult = new ConcurrentLinkedQueue<Result>();
public static void main(String[] args) {
int numberOfThreads = n; // needs to be configurable
Collection<ABCCheck> checksToExecute; // populated from business logic; ABCCheck is an interface with a check() method and different implementations
ConcurrentLinkedQueue<File> fileQueue = new ConcurrentLinkedQueue<File>(); // list of files for 1 day , may vary from 0 to 8
int maxNumOfFiles = fileQueue.size();
ThreadGroup tg = new ThreadGroup ("Group");
// If there are more threads than files (rare; can be considered a corner case)
if (maxNumOfFiles < numberOfThreads) numberOfThreads=maxNumOfFiles;
// loop and start number of threads
for(int var=0;var<numberOfThreads;var++)
{
File currentFile = fileQueue.remove();
// execute all checks on 1 file using checksToExecute
ExecuteAllchecks checksToRun = new ExecuteAllchecks(); // business logic to populate checks
checksToRun.setchecksToExecute(checksToExecute);
checksToRun.setcheckResult(fileWiseResult); // when each check finishes want to store result here
new Thread (tg , checksToRun , "Threads for "+currentFile.getName()).start();
}
// To complete the task ASAP, I want to start a new thread as soon as any current thread ends (diff files, diff sizes)
while(!fileQueue.isEmpty()) {
try {
Thread.sleep(10000); // Not sure If this will cause main thread to sleep (i think it will pause current thread ) i want to pause main thread
} catch (InterruptedException e) {
e.printStackTrace();
}
// check processing of how many files completed
if( (tg.activeCount()<numberOfThreads) && (fileQueue.size()>0) ) {
int numOfThreadsToStart = numberOfThreads - tg.activeCount();
for(int var1=0;var1<numOfThreadsToStart;var1++) {
File currentFile = fileQueue.remove();
ExecuteAllchecks checksToRun = new ExecuteAllchecks();
checksToRun.setchecksToExecute(checksToExecute);
checksToRun.setcheckResult(fileWiseResult); // when each check finishes want to store result here
new Thread (tg , checksToRun , "Threads for "+currentFile.getName()).start();
}
}
}
}
}
class ExecuteAllchecks implements Runnable {
private Collection<ABCCheck> checksToExecute;
private ConcurrentLinkedQueue<Result> checkResult; // not sure if this is correct; I want to store the results of all threads here
public ConcurrentLinkedQueue<Result> getcheckResult() {
return checkResult;
}
// plan to instantiate the result collection globally and store result here
public void setcheckResult(ConcurrentLinkedQueue<Result> checkResult) {
this.checkResult = checkResult;
}
public Collection<ABCCheck> getchecksToExecute() {
return checksToExecute;
}
public void setchecksToExecute(Collection<ABCCheck> checksToExecute) {
this.checksToExecute = checksToExecute;
}
@Override
public void run() {
Result currentFileResult = new Result();
// TODO Auto-generated method stub
System.out.println("Execute All checks for 1 file");
// each check runs and calls setters on currentFileResult
checkResult.add(currentFileResult);
}
}
The actual implementation heavily depends on the nature of the computation itself, but a somewhat general approach could be:
private final ExecutorService executor = Executors.newCachedThreadPool();
private final int taskCount = ...;
private void process() throws InterruptedException, ExecutionException {
Collection< Callable< Result > > tasks = new ArrayList<>( taskCount );
for( int i = 0; i < taskCount; i++ ) {
tasks.add( new Callable< Result >() {
@Override
public Result call() throws Exception {
// TODO implement your logic and return result
...
return result;
}
} );
}
List< Future< Result > > futures = executor.invokeAll( tasks );
List< Result > results = new ArrayList<>( taskCount );
for( Future< Result > future : futures ) {
results.add( future.get() );
}
}
I would also recommend using sensible timeouts on future.get() invocations so that the executing thread does not get stuck.
Still, I wouldn't recommend using a cached thread pool in production, as such a pool grows whenever the current pool does not have enough capacity for all tasks; rather use something like Executors.newFixedThreadPool( Runtime.getRuntime().availableProcessors() )
If your actual task can be split into several smaller ones that are later joined, consider checking how that can be done efficiently using the ForkJoin framework
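For completeness, a minimal ForkJoin sketch (note that fork/join requires Java 7+, so it won't run on the JDK 6 setup from the question). Summing an array stands in for the real per-file checks, and the threshold value is arbitrary:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this, compute directly
    private final long[] data;
    private final int from, to;

    public SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half here, then join
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // 49995000
    }
}
```

The recursive split-then-join structure is the point here; the per-chunk body would be replaced by whatever the business checks actually do.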
I want to do a task that I've already completed, except this time using multithreading. I have to read a lot of data from a file (line by line), grab some information from each line, and then add it to a Map. The file is over a million lines long, so I thought it might benefit from multithreading.
I'm not sure about my approach here since I have never used multithreading in Java before.
I want to have the main method do the reading, then give the line that has been read to another thread, which will format a String and then give it to yet another thread to put into a map.
public static void main(String[] args)
{
//Some information read from file
BufferedReader br = null;
String line = "";
try {
br = new BufferedReader(new FileReader("somefile.txt"));
while((line = br.readLine()) != null) {
// Pass line to another task
}
// Here I want to get a total from B, but I'm not sure how to go about doing that
} catch (IOException e) {
e.printStackTrace();
}
}
public class Parser extends Thread
{
private Mapper m1;
// Some reference to B
public Parser(Mapper m) {
m1 = m;
}
public void parse(String s, int i) {
// Do some work on S
key = DoSomethingWithString(s);
m1.add(key, i);
}
}
public class Mapper extends Thread
{
private SortedMap<String, Integer> sm;
private String key;
private int value;
boolean hasNewItem;
public Mapper() {
sm = new TreeMap<String, Integer>();
hasNewItem = false;
}
public void add(String s, int i) {
hasNewItem = true;
key = s;
value = i;
}
public void run() {
while (!Thread.currentThread().isInterrupted()) {
try {
if (hasNewItem) {
// Find if street name exists in map
sm.put(key, value);
hasNewItem = false;
}
Thread.sleep(1); // avoid busy-spinning; also makes InterruptedException reachable
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
// I'm not sure how to give the Map back to main.
}
}
I'm not sure if I am taking the right approach. I also do not know how to terminate the Mapper thread and retrieve the map in the main. I will have multiple Mapper threads but I have only instantiated one in the code above.
I also just realized that my Parse class is not a thread, but only another class, since it does not override the run() method; so I am thinking that the Parse class should be some sort of queue.
Any ideas? Thanks.
EDIT:
Thanks for all of the replies. It seems that since I/O will be the major bottleneck, there would be little efficiency benefit from parallelizing this. However, for demonstration purposes, am I on the right track? I'm still a bit bothered by not knowing how to use multithreading.
Why do you need multiple threads? You only have one disk, and it can only go so fast. Multithreading almost certainly won't help in this case, and if it does, the benefit will be minimal from a user's perspective. Multithreading isn't your problem; reading from a huge file is your bottleneck.
Frequently I/O will take much longer than the in-memory tasks. We refer to such work as I/O-bound. Parallelism may have a marginal improvement at best, and can actually make things worse.
You certainly don't need a different thread to put something into a map. Unless your parsing is unusually expensive, you don't need a different thread for it either.
If you had other threads for these tasks, they might spend most of their time sitting around waiting for the next line to be read.
Even parallelizing the I/O won't necessarily help, and may hurt. Even if your CPUs support parallel threads, your hard drive might not support parallel reads.
EDIT:
All of us who commented on this assumed the task was probably I/O-bound -- because that's frequently true. However, from the comments below, this case turned out to be an exception. A better answer would have included the fourth comment below:
Measure the time it takes to read all the lines in the file without processing them. Compare to the time it takes to both read and process them. That will give you a loose upper bound on how much time you could save. This may be decreased by a new cost for thread synchronization.
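A rough way to take that measurement is sketched below. The file name and line count are invented for the demo; for a real comparison you would run the same pass a second time with your parsing and map insertion enabled:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class IoTiming {
    // Times a single pass over the file with no per-line work at all.
    // Comparing this against a full read-and-process run gives a loose
    // upper bound on how much time parallelising the processing could save.
    static long timeReadOnly(String path) throws IOException {
        long start = System.nanoTime();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            while (br.readLine() != null) {
                // intentionally empty: measure I/O alone
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file, generated here just for illustration
        try (FileWriter fw = new FileWriter("sample.txt")) {
            for (int i = 0; i < 100_000; i++) fw.write("line " + i + "\n");
        }
        System.out.println("read-only pass: "
                + timeReadOnly("sample.txt") / 1_000_000 + " ms");
    }
}
```

If the read-only pass takes nearly as long as read-plus-process, the job is I/O-bound and extra threads cannot help much.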
You may wish to read about Amdahl's Law. Since the majority of your work is strictly serial (the I/O), you will get negligible improvement by multi-threading the remainder. It is certainly not worth the cost of creating watertight multi-threaded code.
Perhaps you should look for a new toy example to parallelise.
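Amdahl's Law itself is a one-liner; a quick sketch with an invented serial fraction shows why an I/O-dominated job barely benefits from more threads:

```java
public class Amdahl {
    // Amdahl's Law: with a fraction p of the work parallelisable over n threads,
    // overall speedup = 1 / ((1 - p) + p / n).
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // If 90% of the time is serial I/O (so only p = 0.1 is parallelisable),
        // even 8 threads give a speedup of barely 1.1x:
        System.out.println(speedup(0.1, 8));
        // Fully parallel work (p = 1.0) would scale linearly:
        System.out.println(speedup(1.0, 8)); // 8.0
    }
}
```

The p value here is made up; measuring the read-only pass as suggested above is how you would estimate it for the real workload.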
I need to process a large file (with columns and same format lines). Since I need to consider the cases that the program crashes during the processing, I need this processing program to be retryable, which means after it crashes and I start the program again, it can continue to process the file starting with the line it failed.
Is there any pattern I can follow or library I can use? Thank you!
Update:
About the crashing cases, it is not just about OOM or some internal issues. It also could be caused by the timeout with other parts or machine crashing. So try/catch can't handle this.
Another update:
About the chunking the file, it is feasible in my case but not that as simple as it sounds. As I said, the file is formatted with several columns and I can split it up into hundreds of files based on one of the column and then process the files one by one. But instead of doing this, I would like to learn more about the common solution about processing big file/data supporting retrying.
How I would do it (though I am not a pro):
Create a LineProcessor that is called on every line in the file
class Processor implements LineProcessor<List<String>> {
private List<String> lines = Lists.newLinkedList();
private int startFrom = 0;
private int lineNumber = 0;
public Processor(int startFrom) {
this.startFrom = startFrom;
}
@Override
public List<String> getResult() {
return lines;
}
@Override
public boolean processLine(String arg0) throws IOException {
lineNumber++;
if (lineNumber < startFrom) {
// do nothing
} else {
if (new Random().nextInt() % 50000 == 0) {
throw new IOException("Randomly thrown Exception " + lineNumber);
}
//Do the hardwork here
lines.add(arg0);
startFrom++;
}
return true;
}
}
Create a Callable for Reading Files that makes use of my LineProcessor
class Reader implements Callable<List<String>> {
private int startFrom;
public Reader(int startFrom) {
this.startFrom = startFrom;
}
@Override
public List<String> call() throws Exception {
return Files.readLines(new File("/etc/dictionaries-common/words"),
Charsets.UTF_8, new Processor(startFrom));
}
}
Wrap the Callable in a Retryer and call it using an Executor
public static void main(String[] args) throws InterruptedException, ExecutionException {
BasicConfigurator.configure();
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<List<String>> lines = executor.submit(RetryerBuilder
.<List<String>> newBuilder()
.retryIfExceptionOfType(IOException.class)
.withStopStrategy(StopStrategies.stopAfterAttempt(100)).build()
.wrap(new Reader(100)));
logger.debug(lines.get().size());
executor.shutdown();
logger.debug("Happily Ever After");
}
You could maintain checkpoint/commit-style logic in your code, so that when the program runs again it starts from the last checkpoint.
You can use RandomAccessFile to read the file and use getFilePointer() as your checkpoint, which you persist. When you execute the program again, you start from this checkpoint by calling seek(offset).
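A minimal sketch of that idea follows; the class and file names are invented. Two caveats: RandomAccessFile.readLine decodes bytes as Latin-1, so UTF-8 data needs different handling, and persisting the checkpoint after every single line is slow; in practice you would checkpoint every N lines:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckpointedReader {
    public static void processFrom(String dataFile, String checkpointFile) throws IOException {
        long offset = 0;
        Path cp = Paths.get(checkpointFile);
        if (Files.exists(cp)) { // a previous run crashed; resume from its checkpoint
            offset = Long.parseLong(
                    new String(Files.readAllBytes(cp), StandardCharsets.UTF_8).trim());
        }
        try (RandomAccessFile raf = new RandomAccessFile(dataFile, "r")) {
            raf.seek(offset); // jump past everything already processed
            String line;
            while ((line = raf.readLine()) != null) {
                processLine(line); // your business checks go here
                // Persist the checkpoint only AFTER the line was fully processed
                Files.write(cp, Long.toString(raf.getFilePointer())
                        .getBytes(StandardCharsets.UTF_8));
            }
        }
        Files.deleteIfExists(cp); // clean up once the whole file is done
    }

    static void processLine(String line) {
        System.out.println("processed: " + line);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input, generated here just so the sketch runs
        Files.write(Paths.get("data.txt"),
                "line1\nline2\nline3\n".getBytes(StandardCharsets.UTF_8));
        processFrom("data.txt", "data.txt.checkpoint");
    }
}
```

Because the checkpoint survives a machine crash or timeout kill, this also covers the failure modes that try/catch cannot.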
Try/catch won't save you from an OOM error. You should process the file in chunks and store the location after every successful chunk in the filesystem/database/whatever place remains persistent even if your program crashes. Then, when you restart your software, you can read the previous position from wherever you stored it. You must also clean up this information once the whole file has been processed.
I have an algorithm that goes through a large data set, reads some text files, and searches for specific terms in those lines. I have it implemented in Java, but I didn't want to post code so that it doesn't look like I am searching for someone to implement it for me, but it is true, I really need a lot of help! This was not planned for my project, but the data set turned out to be huge, so my teacher told me I have to do it like this.
EDIT (I did not clarify this in the previous version): The data set I have is on a Hadoop cluster, and I should make a MapReduce implementation of it.
I was reading about MapReduce and thought that I would first do the standard implementation, and then it would be more or less easy to do it with MapReduce. But that didn't happen, since the algorithm is quite stupid and nothing special, and MapReduce... I can't wrap my mind around it.
So here, shortly, is the pseudocode of my algorithm:
LIST termList (there is method that creates this list from lucene index)
FOLDER topFolder
INPUT topFolder
IF it is folder and not empty
list files (there are 30 sub folders inside)
FOR EACH sub folder
GET file "CheckedFile.txt"
analyze(CheckedFile)
ENDFOR
END IF
Method ANALYZE(CheckedFile)
read CheckedFile
WHILE CheckedFile has next line
GET line
FOR(loops through termList)
GET third word from line
IF third word = term from list
append whole line to string buffer
ENDIF
ENDFOR
END WHILE
OUTPUT string buffer to file
Also, as you can see, each time "analyze" is called a new file has to be created; I understood that it is difficult for MapReduce to write to many outputs?
I understand the MapReduce intuition, and my example seems perfectly suited for MapReduce, but when it comes to doing it, obviously I do not know enough and I am STUCK!
Please, please help.
You can just use an empty reducer, and partition your job to run a single mapper per file. Each mapper will create its own output file in your output folder.
MapReduce is easily implemented using some nice Java 6 concurrency features, especially Future, Callable and ExecutorService.
I created a Callable that will analyse a file in the way you specified:
public class FileAnalyser implements Callable<String> {
private Scanner scanner;
private List<String> termList;
public FileAnalyser(String filename, List<String> termList) throws FileNotFoundException {
this.termList = termList;
scanner = new Scanner(new File(filename));
}
@Override
public String call() throws Exception {
StringBuilder buffer = new StringBuilder();
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
String[] tokens = line.split(" ");
if ((tokens.length >= 3) && (inTermList(tokens[2])))
buffer.append(line).append('\n');
}
return buffer.toString();
}
private boolean inTermList(String term) {
return termList.contains(term);
}
}
We need to create a new callable for each file found and submit this to the executor service. The result of the submission is a Future which we can use later to obtain the result of the file parse.
public class Analayser {
private static final int THREAD_COUNT = 10;
public static void main(String[] args) throws Exception {
//All callables will be submitted to this executor service
//Play around with THREAD_COUNT for optimum performance
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
//Store all futures in this list so we can refer to them easily
List<Future<String>> futureList = new ArrayList<Future<String>>();
//Some random term list, I don't know what you're using.
List<String> termList = new ArrayList<String>();
termList.add("terma");
termList.add("termb");
//For each file you find, create a new FileAnalyser callable and submit
//this to the executor service. Add the future to the list
//so we can check back on the result later
for (String filename : allFilenames) { // pseudocode: iterate over the files you find
try {
Callable<String> worker = new FileAnalyser(filename, termList);
Future<String> future = executor.submit(worker);
futureList.add(future);
}
catch (FileNotFoundException fnfe) {
//If the file doesn't exist at this point we can probably ignore,
//but I'll leave that for you to decide.
System.err.println("Unable to create future for " + filename);
fnfe.printStackTrace(System.err);
}
}
//You may want to wait at this point, until all threads have finished
//You could maybe loop through each future until isDone() holds true
//for each of them.
//Loop over all finished futures and do something with the result
//from each
for (Future<String> current : futureList) {
String result = current.get();
//Do something with the result from this future
}
}
}
My example here is far from complete and far from efficient. I haven't considered the sample size; if it's really huge, you could keep looping over the futureList, removing elements that have finished, something similar to:
while (futureList.size() > 0) {
for (Future<String> current : futureList) {
if (current.isDone()) {
String result = current.get();
//Do something with result
futureList.remove(current);
break; //We have modified the list during iteration, best break out of for-loop
}
}
}
Alternatively, you could implement a producer-consumer setup, where the producer submits callables to the executor service and produces a future, and the consumer takes the result of the future and then discards the future.
This would maybe require the producer and consumer to be threads themselves, and a synchronized list for adding/removing futures.
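Instead of a hand-rolled synchronized list, the JDK's ExecutorCompletionService already implements this producer-consumer hand-off: take() blocks until the next task finishes, returning futures in completion order. A sketch with made-up task bodies standing in for the file analysis:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionDemo {
    // Collects results in the order tasks FINISH, not the order they were
    // submitted, so no polling loop over unfinished futures is needed.
    static List<String> runTasks(int taskCount) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        CompletionService<String> service = new ExecutorCompletionService<>(executor);
        for (int i = 0; i < taskCount; i++) {
            final int id = i;
            service.submit(new Callable<String>() { // producer side: one task per file
                @Override
                public String call() {
                    return "result-" + id; // stand-in for the real analysis result
                }
            });
        }
        List<String> results = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            results.add(service.take().get()); // consumer side: block for the NEXT finished task
        }
        executor.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runTasks(5));
    }
}
```

This is available since Java 5, so it fits the Java 6 constraint mentioned above.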
Any questions please ask.