Limiting a thread's workload & creating more threads on demand - java

I have a program where the user fetches some data from stored files. That can be 1, 2, 3, 10, 50, 100, 1000, etc. files. I want the files to be fetched via a separate thread, but when that thread reaches e.g. the 50th file, a new thread should be created. And if that's not enough, then a 3rd, 4th, 5th, etc. thread should be created, until all the data from the files has been retrieved.
So, the question is: how do I set a limit of 50 files for one thread to check, and how do I create new threads to process the next files? Maybe some ExecutorService, synchronizer, or something else that I still have some difficulty understanding...
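A minimal sketch of one way to do this with an ExecutorService: split the file list into batches of 50 and submit each batch as its own task, so the pool spins up one thread per in-flight batch. The readFile method and the choice of newCachedThreadPool are placeholder assumptions, not the asker's actual code:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchedFetcher {
    private static final int BATCH_SIZE = 50;

    public static void fetchAll(List<String> paths) {
        ExecutorService pool = Executors.newCachedThreadPool();
        for (int i = 0; i < paths.size(); i += BATCH_SIZE) {
            // Each task owns at most 50 files; the pool adds a thread
            // whenever no idle one is available.
            List<String> batch = paths.subList(i, Math.min(i + BATCH_SIZE, paths.size()));
            pool.submit(() -> batch.forEach(BatchedFetcher::readFile));
        }
        pool.shutdown(); // no new tasks; running batches finish normally
    }

    private static void readFile(String path) {
        // placeholder: read and process one file
    }
}

With this split, e.g. 1000 files yield 20 batches, so at most 20 threads run at once.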

Related

How to create a condition for wait and service block?

I'm learning about AnyLogic simulation.
My simulation is an assembly process. There are many workstations, and each of them performs different activities, so the times differ.
[Screenshot: Modeling Process]
I'm trying to write a condition for this process, something like the routine below.
For example:
if service block number 2 is occupied, all the processes before it must wait until it finishes.
And the same goes for service blocks 3, 4, 5... 10, 11, 12...
How should I do this?

window trigger doesn't return the most updated result

I'm trying to measure the latency of a Flink application which has a window operation, as shown below:
SingleOutputStreamOperator<String> branch = stream
        .getSideOutput(outputTag2)
        .keyBy(MetricObject::getRootAssetId)
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .trigger(ContinuousEventTimeTrigger.of(Time.seconds(15)))
        .aggregate(new CountDistinctAggregate(), new CountDistinctProcess())
        .name("windowed-count-distinct")
        .uid("windowed-count-distinct")
        .map((value) -> String.valueOf(value.getTimestamp().toEpochMilli()))
        .name("send-timestamp");
I'm using event time, and to extract timestamps I use this watermark strategy:
.<SingleRecord>forBoundedOutOfOrderness(Duration.ofSeconds(15))
.withTimestampAssigner((event, timestamp) -> event.getTimestamp().toEpochMilli())
The aggregation function saves a particular object as an accumulator, which also contains the extracted timestamp; these timestamps are written to a Kafka topic. The problem is that the timestamps returned are these:
1639651859988
1639651890163
1639651904900
1639651919728
1639651919728
1639651949973
1639651965085
1639651979870
The timestamps returned aren't equally spaced as I was expecting, and the fourth and fifth are equal even though they were emitted 15 seconds apart. That shouldn't be possible, because the application's input records are generated continuously, every second (10 per second). In other tests I got even worse situations, like this:
1639651979870
1639651992771
1639651992771
1639651992771
1639651992771
1639652189791
1639652205001
1639652219876
The curious fact is that when I use a simple tumbling window without a trigger:
.window(TumblingEventTimeWindows.of(Time.seconds(15)))
the timestamps returned are equally spaced as expected:
1639652429766
1639652444930
1639652459900
1639652474609
1639652489746
1639652504862
1639652519734
1639652534847
I really don't understand what the problem is; it seems like the accumulators in the aggregation function don't update properly.
I think you should check the input data to the Flink streaming job to verify whether the result is what you expect.
For the first job, the aggregate operation runs 4 times (every 15 sec) over the data set of a 60-second window. I'm not sure what your aggregation logic is, so as an example: assume we have a window of 3 sec and the trigger fires every 1 sec, the operator gets the max element in the window, and the input is generated every 1 sec. If the input is 1, 3, 2, ..., then we will see output like 1, 3, 3, ... from Flink, because the first window has [1, 3, 2] in its pane, and each trigger firing reports the max element so far: 1, 3, 3.
For the second streaming job, each window fires only once, so using the same input as an example, if the window is 1 sec long, we will get 1, 3, 2, ...
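A plain-Java sketch (not Flink code) that mirrors the running-max example above, to make the repeated outputs concrete:

public class PaneDemo {
    public static void main(String[] args) {
        // Simulates a 3-second window with a trigger firing every second:
        // each firing re-emits the aggregate (here, the max) of everything
        // accumulated in the pane so far, not just the newest element.
        int[] input = {1, 3, 2};          // one element arriving per second
        int max = Integer.MIN_VALUE;      // the window pane's accumulator
        for (int value : input) {
            max = Math.max(max, value);   // accumulate into the pane
            System.out.println(max);      // trigger firing emits the running max
        }
        // prints 1, 3, 3 — repeats appear whenever no larger element arrived
    }
}

Seen this way, the duplicated timestamps are consistent with ContinuousEventTimeTrigger re-emitting an unchanged accumulator, rather than the accumulator failing to update.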

Process a text file line by line using parallelism but preserving order

I need to process the content of a plain text file line by line.
Since processing every single line requires some time-consuming work (access to external resources), I'd like to execute it concurrently.
I could easily do that with a ThreadPoolExecutor, but the problem is that I need to write the output maintaining the input order (even though I know this is non-optimal from a CPU-usage standpoint).
Another constraint is that the input file could be huge, so keeping it all in memory in some sort of structure is not an option.
Any idea?
You could use the typical producer-consumer pattern.
1) A thread reads the input file and creates blocks of work. A block can hold one line from the file or, for efficiency (depending on the use case), more than one. Each block gets a monotonically increasing order id.
2) A thread pool works on the blocks created/submitted in the step above. The result of the processing is written to a priority queue (sorted on the order id).
3) A thread reads from this priority queue; this step also needs to maintain a counter of the last block it wrote. So if the head of the queue is 3 and the last block written had id 1, it needs to wait for block 2 to arrive.
The same can also be implemented in an event-driven way using callbacks. There is some memory requirement in step 3: for example, if results arrive for 1, then 3, 4 and then 2, blocks 3 and 4 need to be kept in memory until the result of block 2 arrives.
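One way to realize this with standard java.util.concurrent pieces is a single writer draining a bounded queue of Futures in submission order; the id bookkeeping then comes for free. A minimal sketch, with file names, pool size, queue capacity, and the process method as placeholder assumptions:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OrderedParallelProcessor {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Bounded queue: caps the number of in-flight lines, so the whole
        // file is never held in memory at once.
        BlockingQueue<Future<String>> results = new ArrayBlockingQueue<>(100);

        // Single writer: takes futures in submission order, so the output
        // order matches the input order no matter which task finishes first.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("out.txt"))) {
                while (true) {
                    String line = results.take().get(); // blocks until this line is done
                    if (line == null) break;            // poison pill: end of input
                    out.write(line);
                    out.newLine();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        try (BufferedReader in = Files.newBufferedReader(Paths.get("in.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                final String l = line;
                // put() blocks when the queue is full, throttling the reader.
                results.put(pool.submit(() -> process(l)));
            }
        }
        results.put(CompletableFuture.completedFuture(null)); // signal completion
        writer.join();
        pool.shutdown();
    }

    // Placeholder for the slow per-line work (external resource access).
    static String process(String line) {
        return line.toUpperCase();
    }
}

The trade-off matches the caveat above: a slow line at the head of the queue stalls the writer while later results wait, but at most queue-capacity results are ever buffered.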

Reading 30GB file using multithreading

I am trying to read a huge file of 30GB (25 million lines). I want to write code that creates a thread pool in which each thread reads 1000 lines in parallel (the first thread reads the first 1000 lines, the second thread reads the next 1000, and so on).
I have read the entire file and created the thread pool, but now I am stuck on how to ensure that each thread reads only 1000 lines, and how to keep track of the line numbers already read so that the next thread does not re-read those lines.
A. If it's acceptable for all threads to get approximately equal numbers of lines, you can:
Assume the thread pool size is N. The 1st thread seeks to file offset 0 and reads [0, 30GB/N), the 2nd thread seeks to offset 30GB/N and reads [30GB/N, 30GB/N*2), etc.
The 2nd thread may not land at the beginning of a line but in the middle of one. That's ok: just skip the partial line and read from the next complete line. The 1st thread may end on a partial line; that's also ok, just keep reading until you hit the '\n'. The remaining threads do the same thing.
B. If all threads must have exactly equal numbers of lines, say 1000 lines each, you can:
Have one thread read the whole file and build an index map. The map holds information like: lines 0-999 start at offset 0, lines 1000-1999 start at offset 13521, etc.
All the threads then seek to the corresponding offsets and read 1000 lines each.
Approach A reads the file once. Approach B reads the file twice.
With approach A or B, you can have all threads process the file (transforming, extracting, cleaning...) in parallel. But if the processing is very fast, the bound is disk speed and your application is I/O bound; in that case you should just have one thread read the file and do the processing serially.
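A minimal sketch of approach A, assuming the byte offsets are precomputed as fileLength/N per thread and the process method is a placeholder. Note that RandomAccessFile.readLine() decodes bytes as Latin-1, so a real implementation may need charset handling:

import java.io.IOException;
import java.io.RandomAccessFile;

// Handles every line whose first byte lies in [start, end), per approach A.
class ChunkReader implements Runnable {
    private final String path;
    private final long start, end;

    ChunkReader(String path, long start, long end) {
        this.path = path;
        this.start = start;
        this.end = end;
    }

    @Override
    public void run() {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (start == 0) {
                raf.seek(0);
            } else {
                // Back up one byte and skip to the next line start. If the
                // byte at start-1 is '\n', start already begins a line and
                // nothing of ours is skipped; otherwise we discard a partial
                // line that the previous chunk's reader owns and completes.
                raf.seek(start - 1);
                raf.readLine();
            }
            // Lines starting before `end` are ours, even if they run past it.
            while (raf.getFilePointer() < end) {
                String line = raf.readLine();
                if (line == null) break; // end of file
                process(line);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private void process(String line) { /* transform / extract / clean here */ }
}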

How to detect that I am reading from a file whose write is not yet completed?

We have a multithreaded program which does the following:
thread_1 listens to the hard disk to detect newly created files. We use the WatchService API from Java 7. When a new file is created by another program, thread_1 detects it and puts it into a PriorityBlockingQueue, e.g.:
priorityBlockingQueue.add(fileObject);
FileObjComparator is a custom Comparator implementation. It sorts by creation time, using the fileCreatedTime field in FileObject, which I take from the system time when the file is detected:
public int compare(FileObject o1, FileObject o2) {
    return o1.getFileCreatedTime().compareTo(o2.getFileCreatedTime());
}
priorityBlockingQueue is initialized as:
DataFileQueue.priorityBlockingQueue = new PriorityBlockingQueue<FileObject>(100000, new FileObjComparator());
and thread_2 processes files from this priorityBlockingQueue, always leaving the most recently created file in the queue:
if (priorityBlockingQueue.size() > 1)
    process(priorityBlockingQueue.poll());
The two threads run in parallel, but when I process a number of large files, thread_2 sometimes processes a file while it is still being written. I detect this by rechecking the file's content against the result of the processing.
This program runs on CentOS 6.2, and the hard-disk partition is mounted in async mode. Thanks for any help.
If you really are processing the 2nd-to-last file, then I'm surprised that its size is growing, unless multiple processes or threads are generating the input files. Make sure that the other process creating the files flushes and closes each file before writing the next one.
You could read the file in blocks and then, after some period of time, go back to see whether any additional data was appended, and process it at that point using a RandomAccessFile. If you are reading the file line by line you would, unfortunately, need to do your own pagination. If the file is line-based, you should make sure the file ends with a line-termination character.
Another thing you can try is to delay the processing of the file a bit, to let the file system flush its buffers. Ugly and unreliable, but maybe necessary.
If you can adjust the output process, then you could end the file with a magic string and not process the file until the magic string is seen.
You could have the process that writes the file also write the size of the file into a separate file with a ".size" extension (or something). The size file would let you verify that you are reading the correct number of characters.
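A hypothetical check for that ".size" sidecar idea (the sidecar name and format are assumptions): only hand the file to thread_2 once the sidecar exists and matches the data file's actual length on disk.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SizeSidecarCheck {
    // Returns true only when "<dataFile>.size" exists and its recorded
    // length equals the data file's current length on disk.
    static boolean isComplete(Path dataFile) throws IOException {
        Path sizeFile = Paths.get(dataFile.toString() + ".size");
        if (!Files.exists(sizeFile)) {
            return false; // writer hasn't published the size yet
        }
        long expected = Long.parseLong(
                new String(Files.readAllBytes(sizeFile)).trim());
        return Files.size(dataFile) == expected;
    }
}

This only works if the writer creates the sidecar after flushing and closing the data file.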
Another thing to try is Runtime.exec("/bin/sync"); before you start reading from a file, to synchronize the file system, if you are running on a ~unix system. The problem is that support for this is highly OS-dependent. It can also be a real performance killer. Here's the man page on my Mac:
The sync utility can be called to ensure that all disk writes have been completed
You can try using semaphores to organize access to each file, so that no file gets written by more than one thread at a time. I think each file object should have its own semaphore, and each thread should try to acquire the semaphore before writing to the file.
Your Comparator should order by last-modified time, not creation time. I don't see how you can know, for example, that two files opened in order A, B will be completely written in the same order, unless you positively know for a fact that file production is strictly sequential. You haven't said so.
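In code, that would change the Comparator to something like the following, assuming FileObject exposes its underlying java.io.File via a getFile() accessor (hypothetical, not shown in the question):

// Orders by the file's last-modified time instead of the detection timestamp.
public int compare(FileObject o1, FileObject o2) {
    return Long.compare(o1.getFile().lastModified(), o2.getFile().lastModified());
}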
EDIT: a more detailed answer.
The problem is...
You wrote:
It sorts by creation time, using the fileCreatedTime field in FileObject, which I take from the system time when the file is detected:
....
thread_1 listens to the hard disk to detect newly created files. We use the WatchService API from Java 7. When a new file is created by another program ... thread_1 detects it and puts it into a PriorityBlockingQueue.
The creation time and the "file writing finished" time can be very different (depending on the file size).
For example:
Open a file manager and start downloading a file of about 60 MB. Note the creation time. After about 3 minutes, look at the final modification time.
When you detect a new file, the creation time is the wrong moment to put it into the PriorityBlockingQueue.
thread_1 has to wait until the file writing has finished. Only then can it put the file into the PriorityBlockingQueue.
How can I detect that the write to a file is completed?
Three not-too-complicated options:
a.) Compare the file-created time and the file-ready time.
or
b.) Observe that the size of the file is growing steadily. When the file is finished, it stops growing.
or
c.) Try to move it to a temp folder.
Which would you prefer?
I would prefer solution c.
A file opened for writing cannot be moved. After the 3rd-party program closes the file, it can be moved.
The necessary steps:
thread_1 watches for files created by the 3rd-party program.
thread_1 tries to move each one to an xyztmp folder (every 10 or 20 or ... seconds).
thread_1 looks for new incoming files in the xyztmp folder and puts them into the PriorityBlockingQueue.
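A sketch of that move trick, under the answer's assumption that the OS refuses to move a file the writer still holds open (this behaviour is OS- and filesystem-dependent); folder names are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class MoveClaim {
    // Tries to claim a detected file by moving it into the temp folder.
    // If the 3rd-party program still has it open, the move fails and we
    // simply retry on the next polling pass.
    static boolean tryClaim(Path file, Path xyztmp) {
        try {
            Files.move(file, xyztmp.resolve(file.getFileName()));
            return true;  // move succeeded: the writer has closed the file
        } catch (IOException e) {
            return false; // presumably still being written; retry later
        }
    }
}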
Solution b. is more complicated.
thread_1 puts the incoming filenames and their sizes into a control array and compares the sizes 3-5 times (every 5 seconds or more):
Array
(filenamexyz.dat, size1, size2, size3, ...)
(filenameabc.dat, size1, size2, size3, ...)
(filenamefgh.dat, size1, size2, size3, ...)
....
If all 5 recorded sizes for a file are the same, the 3rd-party program has finished writing to that file.
Now it can be put into the PriorityBlockingQueue.
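A minimal sketch of that size-stability check, with the check count and polling interval as assumed parameters:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SizeStabilityCheck {
    // Blocks until the file's size has stayed unchanged for `checks`
    // consecutive polls, then assumes the writer has finished.
    static void awaitStableSize(Path file, int checks, long intervalMillis)
            throws IOException, InterruptedException {
        long previous = -1;
        int stable = 0;
        while (stable < checks) {
            Thread.sleep(intervalMillis);
            long current = Files.size(file);
            stable = (current == previous) ? stable + 1 : 0;
            previous = current;
        }
    }
}

For example, awaitStableSize(path, 5, 5000) mirrors the "5 comparative sizes, every 5 seconds" scheme above.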
Let's look at it step by step.
We assume thread_2 starts when the queue size is 2!
The 3rd-party program starts writing files one by one.
3rd party program starts writing FILE_1.
thread_1 detects created FILE_1, puts it in the list.
3rd party program finished writing FILE_1.
3rd party program starts writing FILE_2.
thread_1 detects created FILE_2, puts it in the list.
if (priorityBlockingQueue.size() > 1) TRUE
thread_2 starts reading and processing the first file in the list, FILE_1.
3rd party program finished writing FILE_2.
3rd party program starts writing FILE_3.
thread_1 detects created FILE_3, puts it in the list.
thread_2 finished processing FILE_1.
thread_2 starts with the next file in the list, FILE_2.
3rd party program finished writing FILE_3.
3rd party program starts writing FILE_4.
thread_1 detects created FILE_4, puts it in the list.
thread_2 finished processing FILE_2.
thread_2 starts with the next file in the list, FILE_3.
NOW THE TROUBLE STARTS
3rd party program finished writing FILE_4.
3rd party program starts writing FILE_5 (FILE_5 is larger than FILE_4).
thread_1 detects created FILE_5, puts it in the list.
thread_2 finished processing FILE_3.
thread_2 starts with the next file in the list, FILE_4.
thread_2 finished processing FILE_4.
thread_2 starts with the next file in the list, FILE_5.
thread_2 finished processing FILE_5.
3rd party program finished writing FILE_5.
If the file the 3rd-party program writes is larger and needs more time to write, and thread_2 has finished reading the smaller FILE_4,
thread_2 takes the next file out of the list, FILE_5, whether the file is ready to read or not.
FILE_5 is the file the 3rd-party program is still writing.
FILE_5 is the file thread_2 is reading and processing. The bytes thread_2 reads are only the bytes the 3rd-party program had written at that point in time.
