Multithreading approach to find a text pattern in files - Java

Consider a simple Java application that traverses a file tree on a disk to find a specific pattern in the body of each file.
I'm wondering whether it is possible to achieve better performance using multithreading, for example: when we find a new folder, we submit a new Runnable to a fixed thread pool, and that Runnable task traverses the folder to discover further folders, and so on. In my opinion this operation should be IO bound, not CPU bound, so spawning new threads would not improve performance.
Does it depend on the hard drive type (HDD, SSD, etc.)?
Does it depend on the OS?
IMHO the only thing that can run in parallel is spawning a new thread to parse the file content and look for the pattern in the file body.
What is the common pattern for solving this problem? Should it be multi-threaded or single-threaded?

I've done some research in this area while working on a test project; you can look at the project on GitHub at: http://github.com/4ndrew/filesearcher. Of course the main problem is disk I/O speed, but if you use an optimal number of threads to perform the reading/searching in parallel, you can get better results overall.
UPD: Also take a look at this article: http://drdobbs.com/parallel/220300055

I did some experiments on just this question some time ago. In the end I concluded that I could achieve a far better improvement by changing the way I accessed the file.
Here's the file walker I eventually ended up using (Hunter is the search object: it is offered the file one byte at a time via check(), and ok() reports whether it wants more):
// 4k buffer size ... near-optimal for Windows.
static final int SIZE = 4 * 1024;
// Transfer buffer for pulling bytes out of the mapped region.
static final byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter h, FileInputStream f)
        throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    FileChannel ch = f.getChannel();
    // How much I've read.
    long red = 0L;
    do {
        // How much to read this time around.
        long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        // Map a byte buffer to the file.
        MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        // How much to get.
        int nGet;
        // Walk the buffer to the end or until the hunter has finished.
        while (mb.hasRemaining() && h.ok()) {
            // Get a max of 4k.
            nGet = Math.min(mb.remaining(), SIZE);
            // Get that much.
            mb.get(buffer, 0, nGet);
            // Offer each byte to the hunter.
            for (int i = 0; i < nGet && h.ok(); i++) {
                h.check(buffer[i]);
            }
        }
        // Keep track of how far we've got.
        red += read;
        // Stop at the end of the file.
    } while (red < ch.size() && h.ok());
    // Finish off.
    h.close();
    ch.close();
    f.close();
}

You are right that you need to determine whether your task is CPU or IO bound, and then decide whether it could benefit from multithreading. Disk operations are generally pretty slow, so unless the amount of data you need to parse is large and your parsing is complex, you might not benefit much from multithreading. I would just write a simple test: read the files without parsing in a single thread, measure it, then add the parsing, see if it's much slower, and then decide.
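A minimal sketch of such a test (the directory name is a placeholder, and the commented-out parse step stands in for your real parsing; run it once as-is and once with parsing enabled, then compare):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IoVsCpuProbe {
    public static void main(String[] args) throws IOException {
        long t0 = System.nanoTime();
        long bytes = 0;
        try (Stream<Path> tree = Files.walk(Paths.get("data"))) {
            for (Path p : (Iterable<Path>) tree.filter(Files::isRegularFile)::iterator) {
                byte[] content = Files.readAllBytes(p); // pure I/O
                bytes += content.length;
                // parse(content); // second run: enable your parsing here
            }
        }
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("read " + bytes + " bytes in " + ms + " ms");
    }
}

If the two timings barely differ, the job is I/O bound and extra threads won't buy you much.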
Perhaps a good design would be to use two threads: one reader thread that reads the files and places the data in a (bounded) queue, and another thread (or better, an ExecutorService) that parses the data. That would give you a nice separation of concerns, and you could always tweak the number of threads doing the parsing. I'm not sure it makes much sense to read the disk with multiple threads (unless you need to read from multiple physical disks, etc.).

What you could do is this: implement a single-producer multi-consumer pattern, where one thread searches the disk and retrieves files, and the consumer threads process them.
You are right that in this case using multiple threads to scan the disk would not be beneficial; in fact it would probably degrade performance, since the disk needs to seek to the next reading position every time, so you end up bouncing the disk between the threads.
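A minimal sketch of that pattern (pool size, queue capacity, and the hard-coded root directory and search string are all illustrative): one producer walks the tree and queues file paths, and a pool of consumers scans the files.

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SearchFarm {
    // Identity-compared sentinel telling consumers to stop.
    private static final Path POISON = Paths.get("");

    public static void main(String[] args) throws Exception {
        final BlockingQueue<Path> files = new ArrayBlockingQueue<>(256);
        int consumers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    for (Path p = files.take(); p != POISON; p = files.take()) {
                        scan(p);
                    }
                    files.put(POISON); // pass the sentinel on so every consumer stops
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single producer: only one thread walks the directory tree.
        Files.walkFileTree(Paths.get("root"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path p, BasicFileAttributes attrs) {
                try {
                    files.put(p);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                return FileVisitResult.CONTINUE;
            }
        });
        files.put(POISON);
        pool.shutdown();
    }

    private static void scan(Path p) {
        try {
            for (String line : Files.readAllLines(p)) {
                if (line.contains("pattern")) {
                    System.out.println(p + ": " + line);
                }
            }
        } catch (IOException e) {
            // unreadable or non-text file: skip it
        }
    }
}

The poison-pill path lets every consumer drain the queue and shut down cleanly without interrupting threads.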

Related

Java: Asynchronous concurrent writes to disk

I have a function in my main thread which writes some data to disk. I don't want my main thread to get stuck (disk I/O has high latency), and creating a new thread just to write is overkill. I have decided to use an ExecutorService.
ExecutorService executorService = Executors.newFixedThreadPool(3);
Future<Boolean> future = executorService.submit(new Callable<Boolean>() {
    public Boolean call() throws Exception {
        logger.log(Level.INFO, "Writing data to disk");
        return writeToDisk();
    }
});
writeToDisk is the function which writes to disk.
Is this a nice way to do it? Could somebody suggest a better approach, if any?
UPDATE: The data size will be greater than 100 MB, and the disk bandwidth is 40 MB/s, so the write operation could take a couple of seconds. I don't want the calling function to get stuck, as it has other jobs to do, so I am looking for a way to schedule the disk I/O asynchronously to the execution of the calling thread.
I need to delegate the task and forget about it!
Your code looks good. Anyway, I've used AsynchronousFileChannel from the newer non-blocking IO; the implementation uses MappedByteBuffer through FileChannel. It might give you the performance that @Chris stated. Below is a simple example:
public static void main(String[] args) {
    String filePath = "D:\\tmp\\async_file_write.txt";
    Path file = Paths.get(filePath);
    try (AsynchronousFileChannel asyncFile = AsynchronousFileChannel.open(file,
            StandardOpenOption.WRITE,
            StandardOpenOption.CREATE)) {
        asyncFile.write(ByteBuffer.wrap("Some text to be written".getBytes()), 0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
There are two approaches that I am aware of: spin up a thread (or use a pool of threads as you have), or memory-map the file and let the OS manage it. Both are good approaches; memory mapping a file can be as much as 10x faster than using Java writers/streams, and it does not require a separate thread, so I often lean towards it when performance is key.
Either way, a few tips to optimize disk writing: try to preallocate the file where possible, since resizing a file is expensive. Spinning disks do not like seeking, and SSDs do not like mutating data that has been previously written.
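To illustrate the preallocation tip, a file can be grown to its final size before any data is written (a sketch; the file name and size are arbitrary):

import java.io.IOException;
import java.io.RandomAccessFile;

public class Preallocate {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("out.dat", "rw")) {
            // Reserve the full 100 MB up front so later writes never resize the file.
            raf.setLength(100L * 1024 * 1024);
        }
    }
}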
I wrote some benchmarks to help me explore this area a while back; feel free to run the benchmarks yourself. Amongst them is an example of memory mapping a file.
I would agree with Õzbek about using a non-blocking approach. Yet, as pointed out by Dean Hiller, we cannot close the AsynchronousFileChannel beforehand with a try-with-resources, because that may close the channel before the write has completed.
Thus, I would add a CompletionHandler on asyncFile.write(…, new CompletionHandler<>(){…}) to track completion and close the underlying AsynchronousFileChannel after the write operation concludes. To simplify use, I turned the CompletionHandler into a CompletableFuture, which we can easily chain with a continuation to do whatever we want after write completion. The final auxiliary method is CompletableFuture<Integer> write(ByteBuffer bytes), which returns a CompletableFuture of the final file index after the completion of the corresponding write operation.
I placed all this logic in an auxiliary class AsyncFiles that can be used like this:
Path path = Paths.get("output.txt");
AsyncFiles
    .write(path, bytes)
    .thenAccept(index -> { /* called on completion from a background thread */ });
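For reference, a minimal sketch of what such a helper could look like (an illustration of the idea, not the author's actual AsyncFiles class):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public final class AsyncFiles {
    // Writes the buffer at position 0 and completes the future with the
    // number of bytes written, closing the channel only after completion.
    public static CompletableFuture<Integer> write(Path path, ByteBuffer bytes) {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        try {
            AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                    path, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
            channel.write(bytes, 0, null, new CompletionHandler<Integer, Void>() {
                public void completed(Integer written, Void att) {
                    closeQuietly(channel);
                    result.complete(written);
                }
                public void failed(Throwable exc, Void att) {
                    closeQuietly(channel);
                    result.completeExceptionally(exc);
                }
            });
        } catch (IOException e) {
            result.completeExceptionally(e);
        }
        return result;
    }

    private static void closeQuietly(AsynchronousFileChannel ch) {
        try { ch.close(); } catch (IOException ignored) { }
    }
}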

Access File through multiple threads

I want to access a large file (the size may vary from 30 MB to 1 GB) through 10 threads, then process each line in the file and write the processed lines to another file through 10 threads. If I use only one thread to access the IO, the other threads are blocked. The processing takes some time, almost equivalent to reading a line from the file system. There is one more constraint: the data in the output file must be in the same order as in the input file.
I want your thoughts on the design of this system. Is there any existing API to support concurrent access to files?
Also, writing to the same file may lead to deadlock.
Please suggest how to achieve this, as I am concerned about the time constraint.
I would start with three threads.
a reader thread that reads the data, breaks it into "lines" and puts them in a bounded blocking queue (Q1),
a processing thread that reads from Q1, does the processing and puts them in a second bounded blocking queue (Q2), and
a writer thread that reads from Q2 and writes to disk.
Of course, I would also ensure that the output file is on a physically different disk than the input file.
If processing tends to be slower than the I/O (monitor the queue sizes), you could then start experimenting with two or more parallel "processors" that are synchronized in how they read and write their data.
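A minimal sketch of that three-thread arrangement (file names, queue capacities, and the identity-compared EOF sentinel are all illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {
    private static final String EOF = new String("EOF"); // sentinel, compared by identity

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q1 = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> q2 = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    q1.put(line); // blocks when q1 is full, throttling the reader
                }
                q1.put(EOF);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        Thread processor = new Thread(() -> {
            try {
                String line;
                while ((line = q1.take()) != EOF) {
                    q2.put(line.toUpperCase()); // stand-in for the real processing
                }
                q2.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter("output.txt")) {
                String line;
                while ((line = q2.take()) != EOF) {
                    out.println(line);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        reader.start(); processor.start(); writer.start();
        reader.join(); processor.join(); writer.join();
    }
}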
You should abstract from the file reading. Create a class that reads the file and dispatches the content to a number of worker threads.
The class shouldn't dispatch strings; it should wrap them in a Line class that contains meta information, e.g. the line number, since you want to keep the original sequence.
You need a processing class that does the actual work on the collected data. In your case there is no work to do, so the class just stores the information; you can extend it someday to do additional stuff (e.g. reverse the string, append some other strings, ...).
Then you need a merger class that does some kind of multiway merge sort on the processing threads and collects all the references to the Line instances in sequence.
The merger class could also write the data back to a file, but to keep the code clean I'd recommend creating an output class that again abstracts from all the file handling and such.
Of course you need a lot of memory for this approach. If you are short on main memory, you'd need a stream-based approach that works in place to keep the memory overhead small.
UPDATE: Stream-based approach
Everything stays the same except:
The reader thread pumps the read data into a balloon. The balloon holds a certain number of Line instances (the bigger the number, the more main memory you consume).
The processing threads take Lines from the balloon, and the reader pumps more lines into the balloon as it gets emptier.
The merger class takes the lines from the processing threads as above, and the writer writes the data back to a file.
Maybe you should use FileChannel in the I/O threads, since it's better suited to reading big files and probably consumes less memory while handling the file (but that's just an educated guess).
Any sort of IO, whether it be disk, network, etc., is generally the bottleneck.
By using multiple threads you exacerbate the problem, as it is very likely that only one thread can have access to the IO resource at a time.
It would be best to use one thread to read, pass the info off to a worker pool of threads, and then write directly from there. But again, if the workers all write to the same place, there will be bottlenecks, as only one can hold the lock. It is easily fixed by passing the data to a single writer thread.
In "short":
Single reader thread writes to BlockingQueue or the like, this gives it a natural ordered sequence.
Then worker pool threads wait on the queue for data, recording its sequence number.
Worker threads then write the processed data to another BlockingQueue this time attaching its original sequence number so that
The writer thread can take the data and write it in sequence.
This will likely yield the fastest implementation possible.
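A minimal sketch of the reordering step on the writer side (class and field names are illustrative): results carry their original sequence number, and the writer buffers out-of-order results in a PriorityQueue until the next expected number arrives.

import java.io.PrintWriter;
import java.util.PriorityQueue;

// A processed line tagged with its original position in the input.
class Numbered implements Comparable<Numbered> {
    final long seq;
    final String line;

    Numbered(long seq, String line) {
        this.seq = seq;
        this.line = line;
    }

    public int compareTo(Numbered o) {
        return Long.compare(seq, o.seq);
    }
}

// Used only by the single writer thread, so no synchronization is needed here.
class OrderedWriter {
    private final PriorityQueue<Numbered> pending = new PriorityQueue<>();
    private long next = 0;

    void offer(Numbered n, PrintWriter out) {
        pending.add(n);
        // Flush every result that is now contiguous with what has been written.
        while (!pending.isEmpty() && pending.peek().seq == next) {
            out.println(pending.poll().line);
            next++;
        }
    }
}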
One possible way is to create a single thread that reads the input file and puts the lines it reads into a blocking queue. Several threads then wait for data from this queue and process it.
Another possible solution is to separate the file into chunks and assign each chunk to a separate thread.
To avoid blocking you can use asynchronous IO. You may also take a look at the Proactor pattern from Pattern-Oriented Software Architecture, Volume 2.
You can do this using a FileChannel in Java, which allows multiple threads to access the same file. FileChannel allows you to read and write starting from a given position. See the sample code below:
import java.io.*;
import java.nio.*;
import java.nio.channels.*;

public class OpenFile implements Runnable
{
    private FileChannel _channel;
    private FileChannel _writeChannel;
    private int _startLocation;
    private int _size;

    public OpenFile(int loc, int sz, FileChannel chnl, FileChannel write)
    {
        _startLocation = loc;
        _size = sz;
        _channel = chnl;
        _writeChannel = write;
    }

    public void run()
    {
        try
        {
            System.out.println("Reading the channel: " + _startLocation + ":" + _size);
            ByteBuffer buff = ByteBuffer.allocate(_size);
            if (_startLocation == 0)
                Thread.sleep(100);
            _channel.read(buff, _startLocation);
            ByteBuffer wbuff = ByteBuffer.wrap(buff.array());
            int written = _writeChannel.write(wbuff, _startLocation);
            System.out.println("Read the channel: " + buff + ":" + new String(buff.array()) + ":Written:" + written);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
        throws Exception
    {
        FileOutputStream ostr = new FileOutputStream("OutBigFile.dat");
        FileInputStream str = new FileInputStream("BigFile.dat");
        String b = "Is this written";
        //ostr.write(b.getBytes());
        FileChannel chnl = str.getChannel();
        FileChannel write = ostr.getChannel();
        ByteBuffer buff = ByteBuffer.wrap(b.getBytes());
        write.write(buff);

        Thread t1 = new Thread(new OpenFile(0, 10000, chnl, write));
        Thread t2 = new Thread(new OpenFile(10000, 10000, chnl, write));
        Thread t3 = new Thread(new OpenFile(20000, 10000, chnl, write));
        t1.start();
        t2.start();
        t3.start();
        t1.join();
        t2.join();
        t3.join();
        write.force(false);
        str.close();
        ostr.close();
    }
}
In this sample, there are three threads reading from the same file and writing to the same file without conflicting. The logic in this sample does not take into account that the assigned sizes need not end exactly at a line boundary, etc. You will have to find the right logic based on your data.
I have encountered a similar situation before and the way I've handled it is this:
Read the file in the main thread line by line and submit the processing of each line to an executor. A reasonable starting point on ExecutorService is here. If you are planning on using a fixed number of threads, you might be interested in the Executors.newFixedThreadPool(10) factory method in the Executors class. The javadocs on this topic aren't bad either.
Basically, I'd submit all the jobs, call shutdown, and then in the main thread continue to write to the output file, in order, over all the Futures that are returned. You can leverage the blocking nature of the Future class' get() method to ensure order, but you really shouldn't use multithreading to write, just like you wouldn't use it to read. Makes sense?
However, 1 GB data files? If I were you, I'd first be interested in meaningfully breaking down those files.
PS: I've deliberately avoided code in the answer as I'd like the OP to try it himself. Enough pointers to the specific classes, API methods and an example have been provided.
Be aware that the ideal number of threads is limited by the hardware architecture and other factors (you could think about consulting the thread pool to calculate the best number of threads). Assuming that "10" is a good number, we proceed. =)
If you are looking for performance, you could do the following:
Read the file using the threads you have and process each line according to your business rule. Keep one control variable that indicates the next line expected to be inserted into the output file.
If the next expected line is done processing, append it to a buffer (a Queue) (it would be ideal if you could find a way to insert directly into the output file, but you would have locking problems). Otherwise, store this "future" line inside a binary search tree, ordering the tree by line position. A binary search tree gives you a time complexity of "O(log n)" for searching and inserting, which is really fast for your context. Continue to fill the tree until the next "expected" line is done processing.
Activate the thread that will be responsible for opening the output file, consuming the buffer periodically and writing the lines into the file.
Also, keep track of the "smallest" expected node of the BST yet to be inserted into the file. You can use it to check whether the future line is inside the BST before starting to search for it.
When the next expected line is done processing, insert it into the Queue and check whether the next element is inside the binary search tree. If the next line is in the tree, remove the node from the tree, append the node's content to the Queue, and repeat the search if the line after that is already inside the tree.
Repeat this procedure until all files are done processing, the tree is empty and the Queue is empty.
This approach uses:
- O(n) to read the file (but parallelized)
- O(1) to insert the ordered lines into a Queue
- O(log n) * 2 to read from and write to the binary search tree
- O(n) to write the new file
plus the costs of your business rule and I/O operations.
Hope it helps.
Spring Batch comes to mind.
Maintaining the order would require a post-processing step, i.e. store the read index/key, ordered, in the processing context. The processing logic should store the processed information in the context as well. Once processing is done you can then post-process the list and write to the file.
Beware of OOM issues, though.
Since the order needs to be maintained, the problem in itself says that reading and writing cannot be done in parallel, as they are sequential processes. The only thing you can do in parallel is the processing of records, but that alone doesn't solve much with only one writer.
Here is a design proposal:
Use one thread t1 to read the file and store the data in a LinkedBlockingQueue Q1.
Use another thread t2 to read the data from Q1 and put it in another LinkedBlockingQueue Q2.
Thread t3 reads the data from Q2 and writes it to a file.
To make sure that you don't encounter an OutOfMemoryError, you should initialize the queues with an appropriate bounded size.
You can use a CyclicBarrier to ensure all threads complete their operation.
Additionally, you can set an action on the CyclicBarrier where you can do your post-processing tasks.
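A minimal sketch of that barrier idea (the party count and the action are illustrative): the barrier's action runs once, after all parties have called await().

import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        // The action fires exactly once, when the last of the 3 parties arrives.
        CyclicBarrier barrier = new CyclicBarrier(3,
                () -> System.out.println("all stages done; post-processing runs here"));
        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("stage " + id + " finished its work");
                    barrier.await(); // the last arrival triggers the barrier action
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}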
Good luck, hoping you get the best design.
Cheers!
I have faced a similar problem in the past, where I had to read data from a single file, process it, and write the result to another file. Since the processing part was very heavy, I tried to use multiple threads. Here is the design I followed to solve my problem:
Use the main program as the master: read the whole file in one go (but don't start processing). Create one data object for each line, with its sequence order.
Use one PriorityBlockingQueue, say queue, in main, and add these data objects to it. Share a reference to this queue in the constructor of every thread.
Create different processing units, i.e. threads, that listen on this queue. When we add data objects to the queue, we call notifyAll. All threads process individually.
After processing, put all results in a single map, keyed by the sequence number.
When the queue is empty and all threads are idle, processing is done. Stop the threads, iterate over the map and write the results to a file.

Sorting file with multi threads

I am sorting a big file by reading it in chunks (into an ArrayList), sorting each list using Collections.sort with a custom comparator, writing the sorted results to files, and then applying a merge sort algorithm on all the files.
I do it in one thread.
Will I get any performance boost if I start a new thread for every Collections.sort()?
By this I mean the following:
I read from the file into a list; when the list is full I start a new thread, in which I sort this list and write it to a temp file.
Meanwhile I continue to read from the file, starting a new thread when the list is full again...
Another question that I have:
What is better for sorting:
1) An ArrayList that I fill and, when it's full, sort with Collections.sort()
2) A TreeMap that I fill; I don't need to sort it (it sorts as I insert items)
NOTE: I use Java 1.5
UPDATE:
This is the code I want to use. The problems are that I am reusing the datalines ArrayList that is being used by the threads, and that I need to wait until all threads complete.
How do I fix it?
int MAX_THREADS = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);
List<String> datalines = new ArrayList<String>();
try {
    while (data != null) {
        long currentblocksize = 0;
        while ((currentblocksize <= blocksize) && (data = getNext()) != null) {
            datalines.add(data);
            currentblocksize += data.length();
        }
        executor.submit(new Runnable() {
            public void run() {
                Collections.sort(datalines, mycomparator);
                vector.add(datalines);
            }
        });
I suggest you implement the following scheme, known as a farm:
worker0
reader --> worker1 --> writer
...
workerN
Thus, one thread reads a chunk from the file and hands it to a worker thread (best practice is to have the workers in an ExecutorService) to sort it, and then each worker sends its output to the writer thread to put in a temp file.
Edit: OK, I've looked at your code. To fix the issue with the shared datalines, you can give each thread a private member that stores a copy of the datalines the thread needs to sort:
public class ThreadTask implements Runnable {
    private List<String> datalines = new ArrayList<String>();

    public ThreadTask(List<String> datalines) {
        // Copy the lines so the caller can reuse (and refill) its own list.
        this.datalines.addAll(datalines);
    }

    public void run() {
        Collections.sort(datalines, mycomparator);
        synchronized (vector) {
            vector.add(datalines);
        }
    }
}
You also need to synchronize access to the shared vector collection.
Then, to wait for all tasks in the ExecutorService to finish, use (note that shutdown() must be called first, or awaitTermination will simply wait out the timeout):
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
Whether using threads will speed things up depends on whether you're limited by disk I/O or by CPU speed. That depends on how fast your disks are (an SSD is much faster than a spinning hard disk) and on how complex your comparison function is. If the limit is disk I/O, then there's no point in adding threads or worrying about data structures, because those won't help you read the data from disk any faster. If the limit is CPU speed, you should run a profiler first to make sure your comparison function isn't doing anything slow and silly.
The answer to the first question is yes: you will gain a performance boost if you implement a parallelised version of merge sort. More about this in this Dr. Dobb's article: http://drdobbs.com/parallel/229400239
If your process is CPU bound (which I suspect it's not), you can see an improvement using multiple threads. If your process is IO bound, you need to improve your IO bandwidth and operation speed.
Parallelizing a sequential operation will improve performance in three cases:
You have a CPU-bound application, and have multiple cores that can do work without coordination. In this case, each core can do its work and you'll see linear speedup. If you don't have multiple cores, however, multi-threading will actually slow you down.
You have an IO-bound application, in which you're performing IO via independent channels. This is the case with an application server interacting with multiple sockets. The data on a given socket is relatively unimpeded by whatever's happening on other sockets. It is generally not the case with disk IO, unless you can ensure that your disk operations are going to separate spindles, and potentially separate controllers. You generally won't see much of a speedup here, because the application will still be spending much of its time waiting. However, it can lead to a much cleaner programming model.
You interleave IO and CPU. In this case one thread can be performing the CPU-intensive operation while the other thread waits on IO. The speedup, if any, depends on the balance between CPU and IO in the application; in many (most) cases, the CPU contribution is negligible compared to IO.
You describe case #3, and to determine the answer you'd need to measure your CPU versus IO. One way to do this is with a profiler: if 90% of your time is in FileInputStream.read(), then you're unlikely to get a speedup. However, if 50% of your time is there, and 50% is in Arrays.sort(), you will.
However, I saw one of your comments where you said that you're parsing the lines inside the comparator. If that's the case, and Arrays.sort() is taking a significant amount of time, then I'm willing to bet that you'd get more of a speed boost by parsing on read.

Splitting text file without reading it

Is there any method by which I can split a text file in Java without reading it?
I want to process a large text file, gigabytes in size, so I want to split the file into small parts, apply a thread to each part, and combine the results.
As I will be reading the file in small parts anyway, splitting it by reading it first won't make any sense, as I would have to read the same file twice, which would degrade my performance.
Your threading attempt is ill-formed. If you have to do significant processing of your file data, consider the following threading structure:
1 reader thread (reads the file and feeds the workers)
a queue of read chunks
1..n worker threads (n depends on your CPU cores; they process the data chunks from the reader thread)
a queue or dictionary of processed chunks
1 writer thread (writes the results to some file)
Maybe you could combine the reader and writer threads into one, because it doesn't make much sense to parallelize IO on the same physical hard disk.
It's clear that you need some synchronization between the threads. Especially for the queues, think about semaphores.
Without reading the content of the file you can't do that. It is simply not possible.
I don't think this is possible, for the following reasons:
How do you write a file without "reading" it?
You'll need to read the text to know where a character boundary is (the encoding is not necessarily one byte per character). This means that you cannot treat the file as binary.
Is it really not possible to read line by line and process it like that? That also saves the additional space that the split files would take up alongside the original. For your reference, reading a text file is simply:
public static void loadFileFromInputStream(InputStream in) throws IOException {
    BufferedReader inputStream = new BufferedReader(new InputStreamReader(in));
    String record = inputStream.readLine();
    while (record != null) {
        // do something with the record
        // ...
        record = inputStream.readLine();
    }
}
You're only reading one line at a time, so the size of the file does not impact performance at all. You can also stop at any time. If you're adventurous, you can hand the lines to separate threads to speed up processing; that way, IO can continue churning along while you process your data.
Good luck! If, for some reason, you do find a solution, please post it here. Thanks!
Technically speaking, it can't be done without reading the file. But you also don't need to keep the entire file contents in memory to do the splitting. Just open a stream to the file and write out to other files, redirecting the output to the next file after a certain number of bytes has been written to the current one. This way you are not required to keep more than one byte of file data in memory at any given time. But using a larger buffer, about 8 or 16 KB, will dramatically increase performance.
Something has to read your file to split it (and you probably want to split it at line boundaries, not at some multiple of kilobytes).
If running on a Linux machine, you could delegate the splitting to an external command like csplit. So your Java program would simply run a csplit yourbigfile.txt command.
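From Java, that delegation could look roughly like this (a sketch; -k keeps the pieces when the last repetition runs past end-of-file, and "1000 {99}" asks for a split every 1000 lines, up to 100 pieces; adjust the arguments to taste):

import java.io.IOException;

public class SplitWithCsplit {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Run the external csplit command and inherit its console output.
        Process p = new ProcessBuilder("csplit", "-k", "yourbigfile.txt", "1000", "{99}")
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}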
In the literal sense, no: to literally split a file into smaller files, you have to read the large one and write the smaller ones.
However, I think what you really want to know is whether you can have different threads sequentially reading different "parts" of a file at the same time. And the answer is that you can. Just have each thread create its own RandomAccessFile object for the file, seek to the relevant place, and start reading.
(A FileInputStream would probably work too, though I don't think the Java API spec guarantees that skip is implemented using an OS-level "seek" operation on the file.)
There are a couple of possible complications:
If the file is text, you presumably want each thread to start processing at the start of some line in the file. So each thread has to start by finding the end of a line, and must make sure that it reads to the end of the last line in its own "part".
If the file uses a variable-width character encoding (e.g. UTF-8), then you need to deal with the case where your partition boundaries fall in the middle of a character.
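A minimal sketch of that per-thread RandomAccessFile approach (offsets and the processing hook are illustrative; note that RandomAccessFile.readLine() assumes a byte-per-character encoding, which is exactly the second caveat above):

import java.io.IOException;
import java.io.RandomAccessFile;

class ChunkWorker implements Runnable {
    private final String file;
    private final long start, end;

    ChunkWorker(String file, long start, long end) {
        this.file = file;
        this.start = start;
        this.end = end;
    }

    public void run() {
        // Each thread opens its own RandomAccessFile, so seeks don't interfere.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);
            if (start > 0) {
                raf.readLine(); // skip the partial line; the previous chunk finishes it
            }
            // Keep reading while a line starts at or before our end offset;
            // the last line may run past it, mirroring the skip above.
            while (raf.getFilePointer() <= end) {
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                process(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void process(String line) {
        // application-specific work goes here
    }
}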

Reading a file by multiple threads

I have a 250 MB file to be read, and the application is multi-threaded. If I allow all threads to read the file, memory starvation occurs and I get an OutOfMemoryError.
To avoid it, I want to have only one copy of the String (which is read from the stream) in memory, and I want all the threads to use it.
while (true) {
    String str;
    synchronized (buffer) {
        num = is.read(buffer);
        if (num < 0) break; // stop at end of stream
        str = new String(buffer, 0, num);
    }
    sendToPC(str);
}
Basically, I want to have only one copy of the string; when all threads have completed sending it, I want to read the second string, and so on.
Why multiple threads? You only have one disk and it can only go so fast. Multithreading almost certainly won't help here. And any software design that relies on having an entire file in memory is seriously flawed in the first place.
How about defining your actual problem first?
I realize this is kind of late, but I think what you want here is the map function in the FileChannel class. Once you map a region of the file into memory, all of your threads can read or write to that block of memory, and the OS will synchronize that memory region with the file periodically (or when you call MappedByteBuffer.force()). If you want each thread to work with a different part of the file, you can create several maps, each mapping a specific region of the file, and use one map per thread.
See the javadoc for FileChannel, RandomAccessFile, and MappedByteBuffer.
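A minimal sketch of the one-map-per-thread idea (file name, thread count, and the per-byte work are illustrative; real code would align region boundaries to record boundaries):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRegions {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
             FileChannel ch = raf.getChannel()) {
            int n = 4;
            long chunk = ch.size() / n;
            Thread[] workers = new Thread[n];
            for (int i = 0; i < n; i++) {
                long start = i * chunk;
                long size = (i == n - 1) ? ch.size() - start : chunk;
                // One read-only mapping per thread: no shared file position to fight over.
                final MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, start, size);
                workers[i] = new Thread(() -> {
                    long count = 0;
                    while (region.hasRemaining()) {
                        if (region.get() == '\n') {
                            count++; // stand-in for real per-byte processing
                        }
                    }
                    System.out.println("lines seen in region: " + count);
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }
}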
Could you use streams directly instead of reading the whole file into memory?
You could register all the threads as callbacks in the file-reading class. So have something like an array or list of classes implementing an interface StringReaderThread, which has the method processString(String input). After reading each line from the file, iterate over this array/list and call processString() on each of the threads. Would this solve your problem?
