Thread pool problems - collecting files - Java

I need to store in an ArrayList the absolute paths of all files under a location. I want to do that with a FixedThreadPool from ExecutorService.
Example location: c:/folder1; folder1 has more folders inside, all with files. Every time I find a folder, I want to search its files and add them to the ArrayList.
public class FilePoolThreads extends Thread {
    File fich;
    private ArrayList<String> al1;

    public FilePoolThreads(File fi, ArrayList<String> al) {
        this.fich = fi;
        this.al1 = al;
    }

    public void run() {
        FileColector fc = new FileColector();
        File[] listaFicheiros = fich.listFiles();
        for (int i = 0; i < listaFicheiros.length; i++) {
            if (listaFicheiros[i].isFile()) {
                al1.add(listaFicheiros[i].getAbsolutePath());
            }
        }
    }
}
The class in which I begin the collection of files:
public class FileColector {
    private ArrayList<String> list1 = new ArrayList<>();

    public static ArrayList<String> search(File fich, ArrayList<String> list1) {
        int n1 = 1;
        ExecutorService executor = Executors.newFixedThreadPool(n1);
        do {
            // FilePoolThreads[] threads = new FilePoolThreads[10];
            FilePoolThreads mt = new FilePoolThreads(fich, list1);
            executor.execute(mt);
        } while (fich.isDirectory());
        executor.shutdown();
        return list1;
    }
}
My code is not working well; I think I have some logic errors. I need help fixing them, and how can I return the ArrayList? Do I have to use getInputStream first and then getOutputStream?

Since this is apparently an academic exercise, I'll give an overview of how I would approach this problem given your requirement that you use an executor thread pool.
First, you need to analyze the problem and break it into repeatable units of work that can be done independently of each other. In this case, the basic unit of work is processing a single filesystem directory. Each time you process a directory, you will:
Examine each directory entry.
If the directory entry is a regular file, add it to your list.
If the directory entry is a sub-directory, submit it to be processed.
Next, you need to create an implementation of Runnable to encapsulate the processing of this basic unit of work. Each instance of the class that you create will need at least the following information:
The File representing the directory it is to process.
A list, shared between all workers, to add files to (and, as others have pointed out, ArrayList is not a suitable data structure for this).
A reference to the executor service, for submitting tasks for the sub-directories.
Finally, you would need to create a worker for the top-level directory to process; submit it to the executor service; and then wait until all workers have finished processing. This last part might be the trickiest - you might need to keep a running count, using an AtomicInteger that you pass to each worker, to keep track of how many workers are currently processing.
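A minimal sketch of that approach (the class name, the pool size, and the busy-wait at the end are my own illustrative choices, not part of the original post):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class DirectoryWorker implements Runnable {
    private final File dir;
    private final Queue<String> results;   // thread-safe, unlike ArrayList
    private final ExecutorService executor;
    private final AtomicInteger pending;   // how many directories are still being processed

    public DirectoryWorker(File dir, Queue<String> results,
                           ExecutorService executor, AtomicInteger pending) {
        this.dir = dir;
        this.results = results;
        this.executor = executor;
        this.pending = pending;
    }

    @Override
    public void run() {
        try {
            File[] entries = dir.listFiles();
            if (entries == null) return;   // not a directory, or an I/O error
            for (File entry : entries) {
                if (entry.isFile()) {
                    results.add(entry.getAbsolutePath());
                } else if (entry.isDirectory()) {
                    pending.incrementAndGet();   // count the new unit of work before submitting it
                    executor.execute(new DirectoryWorker(entry, results, executor, pending));
                }
            }
        } finally {
            pending.decrementAndGet();   // this directory is done
        }
    }

    public static List<String> collect(File root) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        Queue<String> results = new ConcurrentLinkedQueue<>();
        AtomicInteger pending = new AtomicInteger(1);   // the root task
        executor.execute(new DirectoryWorker(root, results, executor, pending));
        while (pending.get() > 0) {   // crude wait-for-completion; a Phaser or
            Thread.sleep(10);         // CountDownLatch-style scheme would be cleaner
        }
        executor.shutdown();
        return new ArrayList<>(results);
    }
}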

Don't extend Thread to pass your task to an executor. Implement Runnable instead!
Or, implement Callable which can return a result when it's finished executing.
Then you can pass your tasks to ExecutorService.submit() and get back a Future to get() the result of each task's computation when it is done.
Note that you will probably want to visit sub-directories recursively, so you need to find both files and directories, adding the files to your output and creating new tasks for the directories.
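For example, a minimal sketch (the task name is mine, and each task here only collects one directory's own files, leaving recursion to the caller):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// One Callable per directory: returns the absolute paths of the
// plain files directly inside it (no recursion in the task itself).
class ListFilesTask implements Callable<List<String>> {
    private final File dir;

    ListFilesTask(File dir) {
        this.dir = dir;
    }

    @Override
    public List<String> call() {
        List<String> paths = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (File entry : entries) {
                if (entry.isFile()) {
                    paths.add(entry.getAbsolutePath());
                }
            }
        }
        return paths;
    }
}

Usage: create an ExecutorService with Executors.newFixedThreadPool(4), then Future<List<String>> future = executor.submit(new ListFilesTask(someDir)); calling future.get() blocks until that task's result is available.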

There's no need for threading here, and you have several errors related to trying to use threads. My advice is to forget threading and just solve your real problem, which can be done very simply with something like commons-io FileUtils:
Iterator<File> files = FileUtils.iterateFiles(directoryToScan, FileFileFilter.FILE, TrueFileFilter.INSTANCE);
List<String> paths = new ArrayList<String>();
while (files.hasNext()) {
    paths.add(files.next().getAbsolutePath());
}
That's all.
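For completeness: if commons-io is not on the classpath, the plain JDK can do the same since Java 8 with java.nio.file.Files.walk. A sketch, assuming the c:/folder1 location from the question:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Recursively collect the absolute paths of all regular files under a root.
List<String> paths;
try (Stream<Path> walk = Files.walk(Paths.get("c:/folder1"))) {
    paths = walk.filter(Files::isRegularFile)
                .map(p -> p.toAbsolutePath().toString())
                .collect(Collectors.toList());
} catch (IOException e) {
    throw new RuntimeException(e);
}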

Related

How to manually read data from Flink's checkpoint file and keep it in Java memory

We need to read data from our checkpoints manually for various reasons (say we need to change our state object/class structure, so we want to read, restore, and copy the data to a new type of object).
Reading works fine, but when we try to keep/store the data in memory after deploying to the Flink cluster we get an empty list/map. In the logs we see that we are reading and adding all our data properly to the list/map, but as soon as our method completes its work we lose the data; the list/map is empty :(
val env = ExecutionEnvironment.getExecutionEnvironment();
val savepoint = Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());
private List<KeyedAssetTagWithConfig> keyedAssetsTagWithConfigs = new ArrayList<>();
val keyedStateReaderFunction = new KeyedStateReaderFunctionImpl();
savepoint.readKeyedState("my-uuid", keyedStateReaderFunction)
.setParallelism(1)
.output(new MyLocalCollectionOutputFormat<>(keyedAssetsTagWithConfigs));
env.execute("MyJobName");
private static class KeyedStateReaderFunctionImpl extends KeyedStateReaderFunction<String, KeyedAssetTagWithConfig> {
    private MapState<String, KeyedAssetTagWithConfig> liveTagsValues;
    private Map<String, KeyedAssetTagWithConfig> keyToValues = new ConcurrentHashMap<>();

    @Override
    public void open(final Configuration parameters) throws Exception {
        liveTagsValues = getRuntimeContext().getMapState(ExpressionsProcessor.liveTagsValuesStateDescriptor);
    }

    @Override
    public void readKey(final String key, final Context ctx, final Collector<KeyedAssetTagWithConfig> out) throws Exception {
        liveTagsValues.iterator().forEachRemaining(entry -> {
            keyToValues.put(entry.getKey(), entry.getValue());
            log.info("key {} -> {} val", entry.getKey(), entry.getValue());
            out.collect(entry.getValue());
        });
    }

    public Map<String, KeyedAssetTagWithConfig> getKeyToValues() {
        return keyToValues;
    }
}
As soon as this code executes I expect to have all the values inside the map returned from keyedStateReaderFunction.getKeyToValues(), but it returns an empty map, even though I see in the logs that we are reading all of the entries properly. The data is even empty inside the keyedAssetsTagWithConfigs list that we read the output into.
If anyone has any idea it would be very helpful, because I am lost; I have never before had the experience of putting data into a map and then losing it :) When I serialize my map or list, write it to a text file, and then deserialize it from there (using Jackson), my data exists, but that is not a solution, just a workaround.
Thanks in advance
The code you show creates and submits a Flink job to be executed in its own environment orchestrated by the Flink framework: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#flink-application-execution
The job runs independently of the code that builds and submits it, so when you call keyedStateReaderFunction.getKeyToValues(), you are calling the method of the object that was used to build the job, not the actual object that ran in the Flink execution environment.
Your workaround seems like a valid option to me. You can then submit the file with your savepoint contents to your new job to recreate its state as you'd like.
You have an instance of KeyedStateReaderFunctionImpl in the Flink client which gets serialized and sent to each task manager. Each task manager then deserializes a copy of that KeyedStateReaderFunctionImpl and calls its open and readKey methods, and gradually builds up a private Map containing its share of the data extracted from the savepoint/checkpoint.
Meanwhile the original KeyedStateReaderFunctionImpl back in the Flink client has never had its open or readKey methods called, and doesn't hold any data.
In your case the parallelism is one, so there is only one task manager, but in general you will need to collect the output from each task manager and assemble the complete results from these pieces. These results are not available in the Flink client process because the work hasn't been done there.
I found a solution: start the job in attached mode and collect the results in the main thread.
val env = ExecutionEnvironment.getExecutionEnvironment();
val configuration = env.getConfiguration();
configuration.setBoolean(DeploymentOptions.ATTACHED, true);
...
val myresults = dataSource.collect();
Hope this helps somebody else, because I wasted a couple of days trying to find a solution.

How to make org.apache.commons.io.monitor behave multi-threaded?

In my Java micro-service I am overriding the onFileCreate() function. The method void onFileCreate(final File file) comes from the built-in library org.apache.commons.io.monitor, class FileAlterationListenerAdaptor.
I noticed that even if multiple files are created, there is only a single thread listening for file creations. That means it processes files one by one (synchronously) instead of several at the same time. How can I achieve multi-threaded behavior here?
I don't know if it is relevant, but I noticed that some of the methods defined in this library are synchronized: in the class FileAlterationMonitor, the methods setThreadFactory(), start(), and stop(). Is that the reason? If yes, do I need to override all three of these methods, or only some of them?
setThreadFactory will not help you; it is just an alternative way to create the single thread which monitors the file system.
What you need to do is:
Create a thread pool which will do the work for new files. This way you can control how many files you process in parallel (trust me, you do not want unlimited concurrency).
Make sure your FileAlterationListenerAdaptor.onFileCreate does not process the file by itself. Instead, it should submit a task to the thread pool.
Roughly, the code should be something like this:
int numberOfThreads = ...;
ExecutorService pool = java.util.concurrent.Executors.newFixedThreadPool(numberOfThreads);
FileAlterationListenerAdaptor adaptor = new FileAlterationListenerAdaptor() {
    @Override
    public void onFileCreate(final File file) {
        pool.submit(new Runnable() {
            @Override
            public void run() {
                // here you do the file processing
                doSomethingWithFile(file);
            }
        });
    }
};
....
FileAlterationObserver observer = new FileAlterationObserver(directory);
observer.addListener(adaptor);
...
FileAlterationMonitor monitor = new FileAlterationMonitor(interval);
monitor.addObserver(observer);
monitor.start();

Encapsulating a multi-threaded operation in Java

I have a situation where I have a large number of classes that need to do (read-only) file access. This is part of a web app running on top of OSGi, so there will be a lot of concurrent access.
So I'm building an OSGi service to access the file system for all the other pieces that will need it, providing centralized access; this also simplifies configuration of file locations, etc.
It occurs to me that a multi-threaded approach makes the most sense along with a thread pool.
So the question is this:
If I do this and I have a service with an interface like:
FileService.getFileAsClass(class);
and the method getFileAsClass(class) looks kind of like this (this is a sketch; it may not be perfect Java code):
public <T> T getFileAsClass(Class<T> clazz) throws Exception {
    Future<InputStream> classFuture = threadpool.submit(new Callable<InputStream>() {
        /* initialization block */
        {
            // any setup from configs
        }
        /* implement Callable */
        public InputStream call() {
            InputStream stream = null; // new InputStream from file location
            boolean giveUp = false;
            while (null == stream && !giveUp) {
                // Code that tries to read in the file 4
                // times with a Thread.sleep() between attempts, then gives up;
                // this is here to make sure we aren't busy updating the file.
            }
            return stream;
        }
    });
    // once we have the file, convert it and return it
    return InputStreamToClassConverter.<T>convert(classFuture.get());
}
Will that correctly wait until the relevant operation is done before calling InputStreamToClassConverter.convert?
This is my first time writing multithreaded Java code, so I'm not sure what to expect from some of the behavior. I don't care about the order in which threads complete, only that the file handling is done asynchronously, and that once the file pull is done, then and only then is the Converter used.
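For the core question: Future.get() does block the calling thread until the Callable has finished (returning its result, or throwing an ExecutionException if the task failed), so the converter only runs once the stream is ready. A standalone demo of that blocking behavior (all names here are illustrative):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureGetDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<String> future = pool.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(2000);   // simulate a slow file read
                return "file contents";
            }
        });
        String result = future.get();   // blocks here for ~2 seconds
        System.out.println(result);     // prints only after the task completes
        pool.shutdown();
    }
}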

How to wait and notify between separate objects in Java?

General purpose of program
To read in a bash pattern and a specified location from the command line, and find all files in that location matching the pattern; I also have to make the program multi-threaded.
General structure of the program
Driver/Main Class which parses arguments and initiates other classes.
ProcessDirectories Class which adds all directory addresses found from the specified root directory to a string array for processing later
DirectoryData Class which holds the addresses found in the above class
ProcessMatches Class which examines each directory found, and adds any files inside that match the pattern to a string array for printing results later
Main/Driver once again takes over and prints the results :)
The Problem
I need to be processing matches even while the ProcessDirectories class is still working (for efficiency, so I don't unnecessarily wait for the list to populate before doing work). To do this I try to: a) make ProcessMatches threads wait() if DirectoryData is empty, and b) make ProcessDirectories notifyAll() when it adds a new entry.
The Question :)
Every tutorial I look at focuses on the producer and consumer being in the same object, or dealing with just one data structure. How can I do this when I am using more than one data structure and more than one class for producing and consuming?
How about something like:
class Driver
{
    public static void main(String[] args) throws InterruptedException
    {
        final ProcessDirectories pd = ...
        final BlockingQueue<DirectoryData> dirQueue = new LinkedBlockingQueue<DirectoryData>();
        new Thread(new Runnable(){public void run(){pd.addDirs(dirQueue);}}).start();
        final ProcessMatches pm = ...
        final BlockingQueue<File> fileQueue = new LinkedBlockingQueue<File>();
        new Thread(new Runnable()
        {
            public void run()
            {
                try
                {
                    for (DirectoryData dir = dirQueue.take(); dir != DIR_POISON; dir = dirQueue.take())
                    {
                        for (File file : dir.getFiles())
                        {
                            if (pm.matches(file))
                                fileQueue.add(file);
                        }
                    }
                    fileQueue.add(FILE_POISON);
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            }
        }).start();
        for (File file = fileQueue.take(); file != FILE_POISON; file = fileQueue.take())
        {
            output(file);
        }
    }
}
This is just a rough idea, of course. ProcessDirectories.addDirs() would just add DirectoryData objects to the queue. In production you'd want to name the threads, perhaps use an executor to manage them, and perhaps use some mechanism other than a poison message to indicate the end of processing. Also, you might want to limit the queue size.
Have one data structure that's associated with the data the two threads use to communicate with each other. This can be a queue with a "get data from the queue, waiting if empty" function and a "put data on the queue, waiting if full" function. Those functions should internally call wait and notify on the queue itself, and they should be synchronized on that queue.
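A minimal hand-rolled version of such a queue, for illustration (in practice java.util.concurrent.BlockingQueue already provides exactly this):

import java.util.ArrayDeque;
import java.util.Deque;

// A bounded queue using wait/notify, synchronized on the queue object itself.
public class SimpleBlockingQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    public SimpleBlockingQueue(int capacity) {
        this.capacity = capacity;
    }

    // Put data on the queue, waiting if full.
    public synchronized void put(T item) throws InterruptedException {
        while (items.size() == capacity) {
            wait();        // releases the lock while waiting
        }
        items.addLast(item);
        notifyAll();       // wake any consumers blocked in take()
    }

    // Get data from the queue, waiting if empty.
    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        T item = items.removeFirst();
        notifyAll();       // wake any producers blocked in put()
        return item;
    }
}

The producer class holds a reference to this queue and calls put(); the consumer calls take(). Neither needs to know about the other, which addresses the "separate objects" part of the question.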

Locking file across services

What is the best way to share a file between two "writer" services in the same application?
Edit:
Sorry, I should have given more details, I guess.
I have a service that saves entries into a buffer. When the buffer gets full, it writes all the entries to the file (and so on). Another running service will come along at some point and read the file (essentially copy/compress it) and then empty it.
Here is a general idea of what you can do:
public class FileManager
{
    private final FileWriter writer;
    private final Object sync = new Object();

    public FileManager() throws IOException
    {
        writer = new FileWriter("SomeFile.txt");
    }

    public void writeBuffer(String buffer) throws IOException
    {
        synchronized (sync)
        {
            writer.write(buffer);
        }
    }

    public void copyAndCompress()
    {
        synchronized (sync)
        {
            // copy and/or compress
        }
    }
}
You will have to do some extra work to get it all working safely, but this is just a basic example to give you an idea of how it looks.
A common method for locking is to create a second file in the same location as the main file. The second file may contain locking data or be blank. The benefit of having locking data (such as a process ID) is that you can easily detect a stale lock file, which is an inevitability you must plan for, although a PID might not be the best locking data in your case.
example:
Service1:
creates myfile.lock
creates/opens myfile
Service2:
Notices that myfile.lock is present and pauses/blocks/waits
When myfile.lock goes away, it creates it and then opens myfile.
It would also be advantageous to double-check that the file contains your locking information (identification specific to your service) right after creating it, just in case two or more services were waiting and created a lock at the exact same time. Only the last one succeeds, so all other services should notice that their locking data is no longer in the file. Also, pause a few milliseconds before checking its contents.
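A sketch of that scheme (class and field names are illustrative; note that File.createNewFile() is atomic, which already makes the contents re-check mostly a belt-and-braces measure, as suggested above):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class LockFile {
    private final File lock;
    private final String ownerId;   // identification specific to your service

    public LockFile(File lock, String ownerId) {
        this.lock = lock;
        this.ownerId = ownerId;
    }

    // Loop until we both create the lock file and confirm our id is in it.
    public void acquire() throws IOException, InterruptedException {
        while (true) {
            if (lock.createNewFile()) {   // atomic: returns false if the file already exists
                Files.write(lock.toPath(), ownerId.getBytes(StandardCharsets.UTF_8));
                Thread.sleep(50);         // pause a few milliseconds before re-checking
                String contents = new String(
                        Files.readAllBytes(lock.toPath()), StandardCharsets.UTF_8);
                if (ownerId.equals(contents)) {
                    return;               // lock confirmed; safe to open the main file
                }
            }
            Thread.sleep(100);            // lock held elsewhere: wait and retry
        }
    }

    public void release() throws IOException {
        Files.deleteIfExists(lock.toPath());
    }
}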
