Reading 30GB file using multithreading - java

I am trying to read a huge file of 30GB (25 million lines). I want to write code that creates a thread pool in which each thread reads 1000 lines in parallel (the first thread reads the first 1000 lines, the second thread reads the next 1000, and so on).
I have read the entire file and created the thread pool, but now I am stuck on how to ensure that each thread reads only 1000 lines, and on how to keep track of the line numbers that have already been read so that the next thread does not re-read them.

A. If it's acceptable for all threads to have approximately equal numbers of lines, you can:
Assume the thread pool size is N. The 1st thread seeks to file offset 0 and reads [0, 30GB/N); the 2nd thread seeks to offset 30GB/N and reads [30GB/N, 30GB/N*2); and so on.
The 2nd thread may not land at the beginning of a line but in the middle of one. That's OK: just skip the partial line and start with the next complete one. Likewise, the 1st thread may end on a partial line; just keep reading until the '\n'. The remaining threads do the same.
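A minimal sketch of approach A, assuming each worker owns a byte range [start, end) (class and method names are illustrative). Backing up one byte before discarding a line handles both the mid-line case and the case where a line begins exactly at the boundary:

import java.io.IOException;
import java.io.RandomAccessFile;

class ChunkReader implements Runnable {
    private final String path;
    private final long start;
    private final long end;

    ChunkReader(String path, long start, long end) {
        this.path = path;
        this.start = start;
        this.end = end;
    }

    @Override
    public void run() {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (start > 0) {
                // Back up one byte and discard a line: this consumes the tail
                // of a line straddling `start`, and consumes exactly the
                // preceding '\n' when a line begins exactly at `start`.
                raf.seek(start - 1);
                raf.readLine();
            }
            String line;
            // Read every line that *starts* before `end`; the final line may
            // run past `end`, which is fine: the next worker skips it.
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                process(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void process(String line) { /* transform/extract/clean here */ }
}

Each worker opens its own RandomAccessFile, so the threads don't share a file pointer. Note that readLine() on RandomAccessFile is unbuffered and slow; a production version would buffer, but the point here is the boundary handling.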
B. If all threads must have exactly equal numbers of lines, say 1000 each, you can:
Have one thread read the whole file and build an index map. The map holds entries like: lines 0~999 start at offset 0, lines 1000~1999 start at offset 13521, and so on.
Each thread then seeks to its corresponding offset and reads 1000 lines.
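A sketch of the index pass for approach B (class and method names are illustrative): one sequential scan records the byte offset at which each block of lines begins.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class LineIndexer {
    public static List<Long> buildBlockIndex(String path, int linesPerBlock)
            throws IOException {
        List<Long> blockOffsets = new ArrayList<>();
        blockOffsets.add(0L); // block 0 (lines 0..999) starts at offset 0
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long offset = 0;
            long lineCount = 0;
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n' && ++lineCount % linesPerBlock == 0) {
                    // The next block starts right after this newline. If the
                    // file ends exactly at a block boundary, the last offset
                    // equals the file length (an empty block callers can skip).
                    blockOffsets.add(offset);
                }
            }
        }
        return blockOffsets;
    }
}

Each worker then seeks to blockOffsets.get(i) with its own reader and reads exactly linesPerBlock lines.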
Approach A reads the file once; approach B reads it twice.
With either approach, all threads can process the file (transforming, extracting, cleaning, ...) in parallel. But if the processing is very fast, the bottleneck is disk speed and your application is I/O bound; in that case you should just have one thread read the file and do the processing serially.

Related

Process a text file line by line using parallelism but preserving order

I need to process the content of a plain text file line by line.
Since processing every single line requires some time-consuming processing (access to external resources), I'd like to execute it concurrently.
I could easily do that with a ThreadPoolExecutor, but the problem is that I need to write the output maintaining the input order (even though I know this is non-optimal from a CPU-usage standpoint).
Another constraint is that the input file could be huge, so keeping it all in memory in some sort of structure is not an option.
Any idea?
You could use the typical Producer Consumer pattern.
1) A thread reads the input file and creates blocks of work. A block can hold one line from the file or, for efficiency (depending on the use case), several. Each block gets a monotonically increasing order id.
2) A thread pool works on the blocks created/submitted in the step above. The result of the processing is written to a priority queue (sorted by order id).
3) A thread reads from this priority queue; it also maintains a counter of the last block it wrote. So if the head of the queue is 3 and the last block written had id 1, it must wait for block 2 to arrive.
The same can also be implemented in an event-driven way using callbacks. There will be some memory requirement in step 3; for example, results arrive for 1, then 3, then 4, then 2. Blocks 3 and 4 must be kept in memory until the result for block 2 arrives.
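A sketch of step 3 along those lines, where Result is a hypothetical carrier for the order id and the processed payload: the writer buffers out-of-order results in a priority queue and flushes whenever the head matches the next expected id.

import java.util.PriorityQueue;
import java.util.concurrent.BlockingQueue;

record Result(long id, String payload) implements Comparable<Result> {
    public int compareTo(Result other) { return Long.compare(id, other.id()); }
}

class OrderedWriter implements Runnable {
    private final BlockingQueue<Result> completed;   // fed by the worker pool
    private final PriorityQueue<Result> reorderBuffer = new PriorityQueue<>();
    private long nextId = 0; // id of the next block we are allowed to write

    OrderedWriter(BlockingQueue<Result> completed) { this.completed = completed; }

    @Override
    public void run() {
        try {
            while (true) { // shutdown signalling omitted for brevity
                reorderBuffer.add(completed.take()); // block for a finished result
                // Flush everything that is now contiguous with the output.
                while (!reorderBuffer.isEmpty()
                        && reorderBuffer.peek().id() == nextId) {
                    write(reorderBuffer.poll().payload());
                    nextId++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void write(String s) { System.out.println(s); } // stand-in for real output
}

This also illustrates the memory note above: if results arrive as 1, 3, 4, 2, then 3 and 4 sit in the buffer until 2 shows up.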

How to clone or copy a BufferedReader?

I'm creating an Android app that reads very large files, around 1,000 to 40,000 lines, which takes quite a bit of time with a single loop. Therefore, I'm trying to create a multithreaded reader that spawns multiple threads, each reading a specific part of the file, and then puts all the small parts together into one big array or String.
I'm using a BufferedReader that loops through each line in the file and stores the line count.
Each time the loop runs, I check whether lineNumber % LINES_PER_READER == 0 is true. If it is, I create a new reader thread, which should read the next LINES_PER_THREAD lines of the file.
I wonder (because the files can be huge) whether I can copy or clone the BufferedReader in some way, so that the new reader thread can start reading from the line at which it was created (since I already have a loop positioned at that line), instead of creating a new BufferedReader, reading line by line until I reach the specified line, and only then reading the actual values.
Don't clone the BufferedReader. It will create trouble. Just send batches of lines to the individual processing threads.
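A minimal sketch of that batching idea (LINES_PER_BATCH, processBatch, and the pool size are illustrative choices):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchingReader {
    private static final int LINES_PER_BATCH = 1000;

    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            List<String> batch = new ArrayList<>(LINES_PER_BATCH);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == LINES_PER_BATCH) {
                    List<String> work = batch;       // hand ownership to the task
                    pool.submit(() -> processBatch(work));
                    batch = new ArrayList<>(LINES_PER_BATCH);
                }
            }
            if (!batch.isEmpty()) {                  // flush the final partial batch
                List<String> work = batch;
                pool.submit(() -> processBatch(work));
            }
        }
        pool.shutdown();
    }

    private static void processBatch(List<String> lines) { /* per-batch work */ }
}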

Access File through multiple threads

I want to access a large file (size may vary from 30 MB to 1 GB) through 10 threads, then process each line in the file and write the processed lines to another file through 10 threads. If I use only one thread for the IO, the other threads are blocked. Processing a line takes almost as long as reading it from the file system. There is one more constraint: the data in the output file should be in the same order as in the input file.
I want your thoughts on the design of this system. Is there an existing API that supports concurrent access to files?
Also, writing to the same file may lead to deadlock.
Please suggest how to achieve this, given that I am concerned about the time constraint.
I would start with three threads:
a reader thread that reads the data, breaks it into "lines" and puts them in a bounded blocking queue (Q1),
a processing thread that reads from Q1, does the processing and puts them in a second bounded blocking queue (Q2), and
a writer thread that reads from Q2 and writes to disk.
Of course, I would also ensure that the output file is on a physically different disk than the input file.
If processing tends to be slower than the I/O (monitor the queue sizes), you could then start experimenting with two or more parallel "processors" that are synchronized in how they read and write their data.
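A minimal sketch of that three-thread wiring (file arguments, queue capacities, and the POISON sentinel are illustrative):

import java.io.*;
import java.nio.file.*;
import java.util.concurrent.*;

public class Pipeline {
    // Identity-compared sentinel: new String() guarantees it can never equal
    // a line actually read from the file.
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> q1 = new ArrayBlockingQueue<>(10_000); // reader -> processor
        BlockingQueue<String> q2 = new ArrayBlockingQueue<>(10_000); // processor -> writer

        Thread reader = new Thread(() -> {
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) q1.put(line);
                q1.put(POISON);                       // signal end of input
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        Thread processor = new Thread(() -> {
            try {
                String line;
                while ((line = q1.take()) != POISON) q2.put(process(line));
                q2.put(POISON);                       // pass the signal downstream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]))) {
                String line;
                while ((line = q2.take()) != POISON) { out.write(line); out.newLine(); }
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        reader.start(); processor.start(); writer.start();
        reader.join(); processor.join(); writer.join();
    }

    private static String process(String line) { return line; /* real work here */ }
}

The bounded queues give natural backpressure: if the writer is the bottleneck, q2 fills up and the processor blocks on put(), which in turn slows the reader.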
You should abstract the file reading away. Create a class that reads the file and dispatches the content to a number of threads.
The class shouldn't dispatch raw strings; it should wrap them in a Line class that carries meta information, e.g. the line number, since you want to keep the original sequence.
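For example, a minimal sketch of such a wrapper (field names are illustrative):

// The payload plus the metadata needed to restore the original order later.
record Line(long number, String content) {}

The merger can then order purely on number().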
You need a processing class that does the actual work on the collected data. In your case there is no work to do, so the class just stores the information; but you can extend it someday to do additional stuff (e.g. reverse the string, append other strings, ...).
Then you need a merger class that performs a kind of multiway merge on the processing threads' output and collects all the references to the Line instances in sequence.
The merger class could also write the data back to a file, but to keep the code clean,
I'd recommend creating an output class that again abstracts away all the file handling.
Of course this approach needs a lot of memory. If you are short on main memory, you'd need a stream-based approach that works more or less in place to keep the memory overhead small.
UPDATE Stream-based approach
Everything stays the same, except:
The reader thread pumps the read data into a balloon. The balloon holds a certain number of Line instances (the bigger the number, the more main memory you consume).
The processing threads take Lines from the balloon, and the reader pumps more lines in as it empties.
The merger class takes the lines from the processing threads as above, and the writer writes the data back to a file.
Maybe you should use FileChannel in the I/O threads, since it's better suited to reading big files and probably consumes less memory while handling the file (but that's just an educated guess).
Any sort of IO, whether disk, network, etc., is generally the bottleneck.
By using multiple threads you exacerbate the problem, since it is very likely that only one thread can access the IO resource at a time.
It is best to use one thread to read, pass the data off to a pool of worker threads, and then write directly from there. But again, if the workers write to the same place, there will be bottlenecks, as only one can hold the lock at a time. This is easily fixed by passing the data to a single writer thread.
In "short":
A single reader thread writes to a BlockingQueue or the like; this gives the work a natural ordered sequence.
The worker pool threads wait on the queue for data, recording each item's sequence number.
The worker threads then write the processed data to another BlockingQueue, this time attaching the original sequence number, so that
the writer thread can take the data and write it out in sequence.
This will likely yield the fastest implementation possible.
One possible way would be to create a single thread that reads the input file and puts the read lines into a blocking queue. Several threads then wait for data on this queue and process it.
Another possible solution is to split the file into chunks and assign each chunk to a separate thread.
To avoid blocking you can use asynchronous IO (see the sketch below). You may also take a look at the Proactor pattern from Pattern-Oriented Software Architecture, Volume 2.
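For the asynchronous-IO route, a sketch using NIO.2's AsynchronousFileChannel (buffer size and file argument are illustrative): the completion handler runs on a pool thread when the read finishes, so the caller never blocks on the disk.

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class AsyncRead {
    public static void main(String[] args) throws Exception {
        AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                Paths.get(args[0]), StandardOpenOption.READ);
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        ch.read(buf, 0, buf, new CompletionHandler<Integer, ByteBuffer>() {
            @Override
            public void completed(Integer bytesRead, ByteBuffer b) {
                b.flip();
                // Hand the chunk to a processing stage here.
                System.out.println("read " + bytesRead + " bytes");
            }

            @Override
            public void failed(Throwable exc, ByteBuffer b) {
                exc.printStackTrace();
            }
        });
        Thread.sleep(1000); // crude: keep the JVM alive for the demo callback
        ch.close();
    }
}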
You can do this using FileChannel in Java, which allows multiple threads to access the same file. FileChannel lets you read and write starting from an explicit position. See the sample code below:
import java.io.*;
import java.nio.*;
import java.nio.channels.*;

public class OpenFile implements Runnable {
    private final FileChannel _channel;        // shared read channel
    private final FileChannel _writeChannel;   // shared write channel
    private final int _startLocation;          // byte offset this task owns
    private final int _size;                   // number of bytes to copy

    public OpenFile(int loc, int sz, FileChannel chnl, FileChannel write) {
        _startLocation = loc;
        _size = sz;
        _channel = chnl;
        _writeChannel = write;
    }

    public void run() {
        try {
            System.out.println("Reading the channel: " + _startLocation + ":" + _size);
            ByteBuffer buff = ByteBuffer.allocate(_size);
            // Positional read: does not touch the channel's shared position,
            // so concurrent readers do not interfere with each other.
            int read = _channel.read(buff, _startLocation);
            // flip() sets the limit to the bytes actually read, so a short
            // read near end-of-file doesn't write stale buffer contents.
            buff.flip();
            // Positional write at the same offset keeps input and output aligned.
            int written = _writeChannel.write(buff, _startLocation);
            System.out.println("Read " + read + " bytes at " + _startLocation
                    + ", written: " + written);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        FileOutputStream ostr = new FileOutputStream("OutBigFile.dat");
        FileInputStream str = new FileInputStream("BigFile.dat");
        FileChannel chnl = str.getChannel();
        FileChannel write = ostr.getChannel();

        // Three tasks copy adjacent 10000-byte chunks concurrently.
        Thread t1 = new Thread(new OpenFile(0, 10000, chnl, write));
        Thread t2 = new Thread(new OpenFile(10000, 10000, chnl, write));
        Thread t3 = new Thread(new OpenFile(20000, 10000, chnl, write));
        t1.start();
        t2.start();
        t3.start();
        t1.join();
        t2.join();
        t3.join();

        write.force(false); // flush file content to the storage device
        str.close();
        ostr.close();
    }
}
In this sample, three threads read the same file and write to the same file without conflicting. The logic in this sample does not take into account that the assigned chunk sizes need not end at a line boundary, etc. You will have to find the right logic for your data.
I have encountered a similar situation before and the way I've handled it is this:
Read the file in the main thread line by line and submit the processing of each line to an executor. A reasonable starting point on ExecutorService is here. If you are planning on using a fixed number of threads, you might be interested in the Executors.newFixedThreadPool(10) factory method in the Executors class. The javadocs on this topic aren't bad either.
Basically, I'd submit all the jobs, call shutdown, and then in the main thread continue writing to the output file in order, using the Futures that are returned. You can leverage the blocking nature of the Future class's get() method to ensure order, but you really shouldn't use multithreading for the writing, just as you wouldn't use it for the reading. Makes sense?
However, 1 GB data files? If I were you, I'd be first interested in meaningfully breaking down those files.
PS: I've deliberately avoided code in the answer as I'd like the OP to try it himself. Enough pointers to the specific classes, API methods and an example have been provided.
Be aware that the ideal number of threads is limited by the hardware architecture and other factors (you could think about asking the thread pool to calculate the best number of threads). Assuming that "10" is a good number, we proceed. =)
If you are looking for performance, you could do the following:
Read the file using the threads you have and process each line according to your business rule. Keep one control variable that indicates the next line expected to be inserted into the output file.
If the line just processed is the next expected one, append it to a buffer (a Queue); it would be ideal if you could insert it directly into the output file, but you would have lock problems. Otherwise, store this "future" line in a binary search tree, ordering the tree by line position. A binary search tree gives you O(log n) time for searching and inserting, which is really fast for this context. Continue filling the tree until the next expected line has been processed.
Activate the thread responsible for opening the output file, consuming the buffer periodically, and writing the lines to the file.
Also, keep track of the smallest line number currently stored in the BST. You can use it to check whether a "future" line is inside the BST before searching for it.
When the next expected line is done processing, insert it into the Queue and check whether its successor is inside the binary search tree. If it is, remove that node from the tree, append its content to the Queue, and repeat the check for the line after that.
Repeat this procedure until the whole file is processed, the tree is empty, and the Queue is empty (a sketch of the reordering step follows the cost summary below).
This approach costs:
- O(n) to read the file (parallelized)
- O(1) to insert each ordered line into the Queue
- O(log n) each to read from and write to the binary search tree
- O(n) to write the new file
plus the costs of your business rule and I/O operations.
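A minimal sketch of the reordering step, with java.util.TreeMap standing in for the binary search tree (TreeMap is a red-black tree, so insertion and lookup are the promised O(log n)); the class and method names are illustrative:

import java.util.TreeMap;

class OrderedBuffer {
    private final TreeMap<Long, String> pending = new TreeMap<>(); // the "BST"
    private long nextLine = 0; // next line number expected in the output

    // Called by processing threads whenever a line finishes.
    synchronized void offer(long lineNo, String content) {
        pending.put(lineNo, content);
        // Drain every line that is now contiguous with the output.
        while (!pending.isEmpty() && pending.firstKey() == nextLine) {
            write(pending.pollFirstEntry().getValue());
            nextLine++;
        }
    }

    private void write(String s) { /* append to the output file */ }
}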
Hope it helps.
Spring Batch comes to mind.
Maintaining the order would require a post-processing step, i.e. storing the read index/key, ordered, in the processing context. The processing logic should store the processed information in the context as well. Once processing is done, you can post-process the list and write it to a file.
Beware of OOM issues though.
Since the order needs to be maintained, the problem itself says that reading and writing cannot be done in parallel; they are sequential processes. The only thing you can do in parallel is the processing of records, but that alone doesn't solve much with only one writer.
Here is a design proposal:
Use one thread t1 to read the file and store the data into a LinkedBlockingQueue Q1.
Use another thread t2 to read the data from Q1, process it, and put it into another LinkedBlockingQueue Q2.
Use a thread t3 to read the data from Q2 and write it to a file.
To make sure that you don't encounter an OutOfMemoryError, you should initialize the queues with an appropriate (bounded) capacity.
You can use a CyclicBarrier to ensure all threads complete their operation.
Additionally, you can set an action on the CyclicBarrier to do your post-processing tasks; a sketch follows.
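A minimal sketch of that barrier idea (the thread bodies are stubs): the barrier action runs exactly once, after all three threads have called await().

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        CyclicBarrier barrier = new CyclicBarrier(3,
                () -> System.out.println("all stages done: post-processing here"));
        Runnable stage = () -> {
            try {
                // ... read / process / write work happens here ...
                barrier.await();              // signal this stage is finished
            } catch (InterruptedException | BrokenBarrierException e) {
                Thread.currentThread().interrupt();
            }
        };
        new Thread(stage, "t1").start();
        new Thread(stage, "t2").start();
        new Thread(stage, "t3").start();
    }
}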
Good Luck, hoping you get the best design.
Cheers !!
I have faced a similar problem in the past, where I had to read data from a single file, process it, and write the result to another file. Since the processing part was very heavy, I tried using multiple threads. Here is the design I followed to solve my problem:
Use the main program as master: read the whole file in one go (but don't start processing), and create one data object for each line carrying its sequence number.
Use one PriorityBlockingQueue, say queue, in main, and add these data objects to it. Share a reference to this queue in the constructor of every thread.
Create different processing units, i.e. threads, that listen on this queue. Since it is a blocking queue, waiting threads are woken up automatically when data objects are added; no explicit notifyAll is needed. All threads process items individually.
After processing, put all results into a single map, keyed by sequence number.
When the queue is empty and all threads are idle, processing is done. Stop the threads, then iterate over the map and write the results to a file.

How to overcome hardware limitations when reading/writing to a file.

I was asked this question in an interview recently.
Given an input file, a regex and an output file. Read the input file, match each line to the regex and write the matching lines to an output file.
I came up with the rough scheme of using a BufferedReader chained to a FileReader (to optimize reads from the disk). I used a similar scheme for writing.
The interviewer then said that this process takes 3 seconds to read a line from the file, 1 second to compare the regex with the line, and another 5 seconds to write back, so it takes a total of 9 seconds per line. How can we improve this?
I suggested reading the entire file at once, processing it, and writing the entire output file at once. However, I was told that won't help (writing 1 line = 5 seconds, writing 2 lines = 10 seconds).
The interviewer further said that this is due to a hardware/hard-drive limitation. I was asked how I could improve my code to reduce the total seconds (currently 9) per line.
I could only think of buffered reading/writing and could not find much on SO either. Any thoughts?
I think that the interviewer was looking for a solution that performs reading/regex checking in parallel with writing the output. If you set up a work queue that you fill asynchronously by reading and filtering, and put writing in a separate thread, then the combined process would take five seconds per line, starting with the second line.
The assumption here is that reading, parsing, and writing can happen independently of each other. In this case, you can be reading line 2 while line 1 is being written: you need only four seconds to read and apply your regex, and you've got a whole five seconds before the writer is ready for the second line. The writing remains your bottleneck, but the whole process gets sped up by some 44%, which isn't bad.
Well, since the time for a read is fixed and the time for a write is fixed, the only option you have in this case is to change the nature of the regex part.
You could write code that applies the regex test quickly, without the overhead of all the clever things a full regex engine can do.
If, on the other hand, the problem is that each IO request takes several seconds to execute but the limitation is not the actual drive, then have several readers reading simultaneously.
Tough question, since we don't know much about the system.
My guess would be to use threads/async processing: use one thread to read, one to process, and two or more for writing, thus reducing the time spent waiting on IO.
Let me try to convert this into an ASCII chart:
R/r means reading (3 Sec)
P/p means processing (1 Sec)
W/w means writing (5 Sec)
Upper-case letters mark the start of a unit of work, lower-case letters mark continuing work. A "." means the thread is idle.
Thread 1: RrrRrrRrrRrrRr
Thread 2: ...P..P..P..P.
Thread 3: ....Wwwww
Thread 4: .......Wwwww
With this setup the first batch is written back after 9 seconds (not much to gain there), but the second one completes after 12 seconds; single-threaded, the second batch would need 18 seconds in total.

BufferedReader in a multi-threaded environment

How can I read from a BufferedReader simultaneously from multiple threads?
Well, you won't be able to have them all actually performing a read simultaneously. However, you could:
Synchronize all the reads on one lock, so that only one thread reads at a time, but all of them get to read eventually (see the sketch below); or
have one thread do all the reading, populating a thread-safe queue of some sort (see java.util.concurrent for various options) from which the other threads fetch items.
Are you wanting to read lines at a time, or arbitrary blocks of characters?
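A minimal sketch of the first option (names are illustrative): every worker funnels its reads through one lock.

import java.io.BufferedReader;
import java.io.IOException;

class SharedReader {
    private final BufferedReader reader;
    private final Object lock = new Object();

    SharedReader(BufferedReader reader) { this.reader = reader; }

    String nextLine() throws IOException {
        synchronized (lock) {          // only one thread reads at a time
            return reader.readLine();
        }
    }
}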
If all threads are to read all lines from the file, then you should create a separate buffered reader for each thread. If each thread processes one line at a time (and the order of lines doesn't matter), then you should probably use the producer/consumer model, where only one thread actually reads from the file and places the work in a BlockingQueue, while the other threads remove workloads and process them.
Note that you can reduce locking overhead by reading N lines into a list and placing the list in the blocking queue, instead of placing each individual line in it directly, since that allows multiple lines to be read/extracted with a single synchronization operation. Placing and removing every single line into/out of the queue individually would be very inefficient, especially if processing each one is fairly quick.
