I am quite new to the Spring Batch framework.
I am currently writing a batch with a reader and a writer.
The reader reads from a DB and the writer writes to a flat file. There are 1 million records. Writing to the file takes a lot of time and I want to improve on that.
What is the best way I can achieve multithreading in writer so that write() method runs in parallel?
Note: In the @BeforeStep and @AfterStep callbacks, I write the header and footer of the file. The write() method writes the records to the file.
EDIT:
I have found out that writing to the file isn't what takes much time; one of our internal methods, which does some sort of decryption, takes about 500 ms per record, and we have 1 million such records.
Can I improve the performance by doing the decryption in multiple threads? I am not able to figure out how to improve from here.
This is not really a Spring-specific question. Typically what people do is implement some kind of streaming: instead of reading the entire query result and then writing everything, you read small bits one after another and pass each bit on to the writer, so it can begin writing before you have finished reading. This is quicker and also keeps memory use down. For instance, if you had 10 GB of data to read and write, you could split it into 10 MB queries instead of reading the whole 10 GB at once. You should read up on streams.

Parallel writes to the same file, however, will yield no benefit, or will even reduce performance. You should also be careful not to start too many threads; that again will reduce performance, and unless your threads are really cheap I don't recommend making more than two. As some have already mentioned, the performance of most apps is I/O bound, and there is no getting around that; you can only mitigate the effect by buffering, streaming, caching, not blocking threads, or whatever else you can do in your app.
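Given the edit above, where the real cost is per-record decryption rather than the file write, one option is to parallelize only the decryption while keeping a single writer. A minimal plain-Java sketch, assuming a hypothetical decrypt() method standing in for the slow internal call and an in-memory list standing in for the DB reader:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelDecryptSingleWriter {

    // Hypothetical stand-in for the internal ~500 ms decryption call.
    static String decrypt(String record) {
        return new StringBuilder(record).reverse().toString();
    }

    public static void main(String[] args) throws Exception {
        List<String> records = List.of("rec1", "rec2", "rec3"); // stands in for a DB chunk
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Decrypt in parallel, collecting Futures in submission order...
        List<Future<String>> results = new ArrayList<>();
        for (String record : records) {
            results.add(pool.submit(() -> decrypt(record)));
        }

        // ...so a single writer can emit the output in the original order.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("out.txt"))) {
            for (Future<String> f : results) {
                out.write(f.get());
                out.newLine();
            }
        } finally {
            pool.shutdown();
        }
    }
}

In Spring Batch terms, the AsyncItemProcessor/AsyncItemWriter pair from spring-batch-integration implements the same idea: items are processed on a task executor and the writer unwraps the Futures, so the @BeforeStep/@AfterStep header and footer logic stays single-threaded.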
Related
I'm trying to read a large (in GBs) file with JSON lines, do some 'processing' and write the result to another file.
I'll be using the GSON streaming API for this purpose.
To speed up the processing, I'd like to multithread the 'processing' part.
I'm reading the file line by line since I can't load the whole file into memory. My 'processing' depends on two different lines (possibly thousands of lines apart) that meet certain conditions. Is it possible to multithread this 'processing' without loading the whole thing into memory?
Any suggestions on how to go about this?
Well, a high-level design would be to have a reader thread, a writer thread, and an ExecutorService instance to do the processing.
The reader thread reads the JSON file using a streaming API¹. When it has identified a unit of work to be performed, it creates a task, submits it to the executor service, and repeats.
The executor service processes the tasks it is given. You should use a service with a bounded thread pool, and possibly a bounded/blocking work queue.
The writer thread scans the Future objects created by task submission and uses them to get the task results (in order), generate the output from the results, and write the output to the file.
If the output file doesn't need to be in order, you could dispense with the writer thread², and have the tasks write to the file. They will need to use a shared lock or mutex so that only one task is writing to the file at a time.
1 - If you don't use a streaming API, then: 1) you need to be able to parse and hold the entire input file in memory, and 2) the reader thread won't be able to start submitting tasks until it has finished parsing the input.
2 - Do this if it simplifies things, not for performance reasons. The need for mutual exclusion while writing kills any hypothetical performance benefits.
As @Thilo notes, there is little to be gained by trying to have multiple reader threads. (And a whole lot of complexity if you try!)
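A sketch of that three-part design, assuming newline-delimited input where each line is a unit of work (a real version would identify units with GSON's JsonReader instead), and a hypothetical process() method:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.*;

public class ReaderPoolWriterSketch {

    // Hypothetical per-unit work.
    static String process(String unit) {
        return unit.toUpperCase();
    }

    // Sentinel telling the writer thread that no more results are coming.
    static final FutureTask<String> POISON = new FutureTask<>(() -> null);

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);                // bounded pool
        BlockingQueue<Future<String>> pending = new ArrayBlockingQueue<>(100); // bounded handoff

        // Writer thread: takes Futures in submission order, so output order matches input order.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("out.jsonl"))) {
                Future<String> f;
                while ((f = pending.take()) != POISON) {
                    out.write(f.get());
                    out.newLine();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Reader thread (here: the main thread) submits a task per unit and repeats.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("in.jsonl"))) {
            String line;
            while ((line = in.readLine()) != null) {
                final String unit = line;
                pending.put(pool.submit(() -> process(unit)));
            }
        }
        pending.put(POISON);
        writer.join();
        pool.shutdown();
    }
}

The bounded ArrayBlockingQueue gives the back-pressure mentioned above: if processing falls behind, the reader blocks instead of flooding memory with queued work.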
I think you'll have a single thread reading from the file which adds workers (Runnable/Callable) to a queue. You then have a pool of threads which consumes from the queue and executes the workers in parallel.
See the Executors class's static factory methods, which can help with creating an ExecutorService.
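For example, a minimal sketch (the pool size of 4 is an arbitrary assumption):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolDemo {
    public static void main(String[] args) {
        // Fixed-size worker pool that consumes submitted tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(() -> System.out.println("processing one unit of work"));
        pool.shutdown(); // no new tasks accepted; queued tasks still run
    }
}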
I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is actually an entire table (Postgres) being read using CopyManager and written. Each line in the file is a single tuple from the table and I am writing 100s of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode.
Each thread thereafter tries writing to the file.
Following are the issues I face:
Once in a while, the contents of the file get overwritten, i.e., one line remains incomplete and the next line starts right there. My assumption here is that the buffers for the writer are getting full. This forces the writer to immediately write the data to the file; the data written may not be a complete line, and before it can write the remainder, the next thread writes its content to the file.
While using FileWriter, once in a while I see a single blank line in the file.
Any suggestions on how to avoid this data integrity issue?
Shared Resource == Contention
Writing to a normal file is by definition a serialized operation. You gain no performance by trying to write to it from multiple threads; I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
Concurrent access to a shared resource can be complicated (and slow)
If you have multiple threads that are doing expensive calculations, then you have options; if you are just using multiple threads because you think you are going to speed something up, you are going to do the opposite. Contention for I/O always slows down access to the resource; it never speeds it up, because of lock waits and other overhead.
You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.
If your application is primarily:
CPU Bound: You can use some locking mechanism/data structure so that only one thread out of many writes to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if the threads are CPU bound with little I/O, it might work.
I/O Bound: This is the most common case. You must use a message-passing system with a queue of some sort: have all the threads post to the queue/buffer and have a single thread pull from it and write to the file. This will be the most scalable and easiest-to-implement solution.
Journaling - Async Writes
If you need to create a single super large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each process write to a separate file and then concatenate the multiple files into a single large file at the end. This is a very old-school, low-tech solution that works well and has for decades.
Obviously, the more storage I/O bandwidth you have, the better the final concatenation will perform.
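A minimal sketch of the final concatenation step, assuming the per-thread part files are named part-0.txt through part-3.txt (hypothetical names):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ConcatParts {
    public static void main(String[] args) throws IOException {
        Path target = Paths.get("huge-output.txt");
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < 4; i++) {                    // 4 part files, for illustration
                Path part = Paths.get("part-" + i + ".txt"); // written by worker i
                Files.copy(part, out);                       // append this part to the target
            }
        }
    }
}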
I am trying to write a single huge file in Java using multiple threads.
I would recommend that you have X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement as opposed to doing file locking and the like.
You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would add(...) to the queue and your writer would be in a take() loop on the queue. When the readers finish, they could add some special IM_DONE string constant and as soon as the writing thread sees X of these constants (i.e. one for each reader), it would close the output file and exit.
So then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls instead of the local IO. Certainly the extra thread isn't going to slow you down at all.
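A sketch of that layout, with hypothetical reader tasks standing in for the database readers and the IM_DONE constant used as described:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueWriterSketch {

    static final String IM_DONE = "IM_DONE"; // sentinel, one per reader
    static final int READERS = 2;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        ExecutorService readers = Executors.newFixedThreadPool(READERS);

        // Each "reader" stands in for a thread pulling rows from the database.
        for (int r = 0; r < READERS; r++) {
            final int id = r;
            readers.submit(() -> {
                for (int i = 0; i < 100; i++) {
                    queue.put("row " + i + " from reader " + id);
                }
                queue.put(IM_DONE); // announce that this reader is finished
                return null;
            });
        }

        // Single writer: the only thread touching the file, so no locking is needed.
        int done = 0;
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("table.txt"))) {
            while (done < READERS) {
                String line = queue.take();
                if (line.equals(IM_DONE)) {
                    done++;
                } else {
                    out.write(line);
                    out.newLine();
                }
            }
        }
        readers.shutdown();
    }
}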
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
If you are adamant about having your reading threads also do the writing, then you should add a synchronized block around access to a single shared BufferedWriter -- you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue, since each thread would have to know whether the others have exited. Each thread could increment a shared AtomicInteger when it runs and decrement it when it is done. The thread that looks at the run count and sees 0 would then be the one to close the writer.
Instead of synchronized methods, the better solution would be a thread pool with a single thread, backed by a blocking queue. The messages the application wants to write are pushed onto the blocking queue. The writer thread keeps reading from the blocking queue (blocking when the queue is empty) and writes each message to the single file.
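Executors.newSingleThreadExecutor() already gives you exactly that: one worker thread plus an internal queue that serializes submissions from any thread. A minimal sketch (the file name and messages are assumptions):

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadWriter {
    public static void main(String[] args) throws Exception {
        BufferedWriter out = Files.newBufferedWriter(Paths.get("app.log"));
        // One worker thread backed by an internal queue: writes submitted
        // from any thread are executed one at a time, in submission order.
        ExecutorService writer = Executors.newSingleThreadExecutor();

        for (int i = 0; i < 10; i++) {
            final String msg = "message " + i;
            writer.submit(() -> {
                out.write(msg);
                out.newLine();
                return null; // Callable, so the IOException may propagate to the Future
            });
        }

        writer.shutdown();
        writer.awaitTermination(1, TimeUnit.MINUTES);
        out.close();
    }
}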
I have to process around 2 million text files and generate their triples.
Suppose I have a txt file xyz.txt (one of the 2 million input files); it is processed as below:
start(xyz.txt)---->module1(xyz.tpd)------>module2(xyz.adv)-------->module3(xyz.tpl)
Suggest a logic or concept so that I can process faster and in an optimized way on x64, 4 GB Windows systems.
module1 (working): it parses the txt file using a .bat file in which the parser is invoked; it runs as a separate system thread, and after 15 seconds it starts parsing another txt file, and so on....
module2 (working): it accepts a .tpd file as input and generates an .adv file.
module3 (working): it accepts an .adv file as input and generates .tpl (triples).
Should I start threads from the txt files, or at some other point?
I am afraid the CPU will get stuck in context switching.
Does anyone have a better logic that I can try?
Use a ThreadPoolExecutor. Tune its parameters, like the number of active threads and others, to suit your environment and system.
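A hedged example of such tuning (all the numbers are illustrative assumptions, not recommendations):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TunedPool {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4,                                          // core pool size
                8,                                          // max pool size
                60, TimeUnit.SECONDS,                       // idle-thread keep-alive
                new ArrayBlockingQueue<>(1000),             // bounded work queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when the queue is full

        pool.execute(() -> System.out.println("processing one file"));
        pool.shutdown();
    }
}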
Most importantly, you have to write the program, profile it, and see where the bottleneck is. It is more than probable that the disk I/O operations will be the bottleneck and no amount of multithreading will solve your problems.
In that case, using two (three? four?) separate hard drives may yield more speed gain than the best multithreaded solution.
Furthermore, the general rule is that you should optimize your application only when you have working code and you really know what to optimize. Profile, profile, profile.
Taking the future multithreaded optimizations into account when writing is OK; the architecture should be flexible enough to allow for future optimizations.
Not much is said here about your hardware environment, but the basic solution would be to use a fixed-size ExecutorService, where the size would, at first, be the number of your execution units:
private static final int NR_CPUS = Runtime.getRuntime().availableProcessors();
// Then:
final ExecutorService executor = Executors.newFixedThreadPool(NR_CPUS);
Then, for each file, you can create a Runnable to process it, and submit it to the thread pool using its .execute() method.
Note that .execute() is asynchronous; if the submitted runnable cannot be run right now, it will be queued.
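Putting the two pieces together, a sketch that submits one Runnable per file (the input directory and the handleFile() method are hypothetical):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerFilePool {

    private static final int NR_CPUS = Runtime.getRuntime().availableProcessors();

    // Hypothetical per-file pipeline (parse, convert, generate triples, ...).
    static void handleFile(Path file) {
        System.out.println("processing " + file);
    }

    public static void main(String[] args) throws IOException {
        final ExecutorService executor = Executors.newFixedThreadPool(NR_CPUS);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get("input"))) {
            for (Path file : files) {
                executor.execute(() -> handleFile(file)); // queued if all threads are busy
            }
        }
        executor.shutdown();
    }
}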
This sounds like a typical batch application needed for data integration. I do not intend to throw hyperlinks at you without completely understanding your needs, but you probably need a solution that works in a single VM now and that, over time, you would like to extend to multiple VMs/machines, and we are presumably not dealing with PBs of data to start with. Try Spring Batch: not only will it solve the problem in the given context, you will also learn to structure your thoughts (think vocabulary!) for solving similar problems.
As a starting point, I would create one I/O thread and a pool of CPU threads. The I/O thread reads in text files and offers them to a BlockingQueue, while the CPU threads take the files from the BlockingQueue and process them.

Then profile the application to see how many CPU threads you need to keep pace with the I/O thread. You can also determine this dynamically, e.g. start with one CPU thread and add another whenever the size of the BlockingQueue exceeds a threshold (say, 20 files).

It's possible that you'll find you only need one CPU thread to keep pace with the I/O thread, in which case your program is I/O bound, and you'll need to e.g. place the text files next to each other on disk (so that you can use sequential reads on all but the first file) or put them on separate disks in order to speed up the application. One idea is to zip the files together and read them in with a ZipInputStream; this reduces the number of disk seeks when reading the files and also reduces the amount of data you need to read.
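On that last idea, a minimal sketch of reading the bundled files back sequentially with ZipInputStream (the archive name is an assumption):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipBundleReader {
    public static void main(String[] args) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new FileInputStream("batch.zip"))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) { // one entry per text file
                // The reader hits end-of-stream at the entry boundary,
                // so it never reads into the next file.
                BufferedReader in = new BufferedReader(new InputStreamReader(zip));
                int lines = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    lines++; // hand each line (or the whole file) to a CPU worker here
                }
                System.out.println(entry.getName() + ": " + lines + " lines");
            }
        }
    }
}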
I have a huge line-separated text file and I want to make some calculations on each line. I need to make a multithreaded program to process it because it is the processing of each line that takes the most time to complete rather than reading each line. (the bottleneck lies in the CPU processing, rather than the IO)
There are two options I came up with:
1) Open the file from main thread, create a lock on the file handle and pass the file handle around the worker threads and then let each worker read-access the file directly
2) Create a producer / consumer setup where only the main thread has direct read-access to the file, and feeds lines to each worker thread using a shared queue
Things to know:
I am really interested in speed performance for this task
Each line is independent
I am working in C++, but I guess the issue here is a bit language-independent
Which option would you choose and why?
I would suggest the second option, since it is cleaner design-wise and less complicated than the first. The first option is less scalable and requires additional communication among the threads to synchronize their progress on the file's lines. In the second option you have one dispatcher that deals with I/O and starts the worker threads' computation, and each computational thread is completely independent of the others, which allows you to scale. Moreover, the second option separates your logic more clearly.
If we are talking about a massively large file that needs to be processed by a large cluster, MapReduce is probably the best solution.
The framework gives you great scalability and already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to receive files read from a file system [originally GFS] as input.
Note that there is an open-source implementation of MapReduce: Apache Hadoop.
If each line is really independent and processing is much slower than reading the file, what you can do is read all the data at once and store it in an array, such that each line is an element of the array.
Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread can perform the calculation on 50 lines. Moreover, since this method is embarrassingly parallel, you could easily use OpenMP for it.
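That answer assumes C++ and OpenMP; since the rest of this thread uses Java, here is a hedged Java analogue of the same read-everything-then-parallelize idea, using a parallel stream in place of OpenMP (the per-line calculate() method is hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ParallelLines {

    // Hypothetical per-line computation.
    static long calculate(String line) {
        return line.chars().sum();
    }

    public static void main(String[] args) throws IOException {
        // Read the whole file into memory first -- the premise of this approach.
        List<String> lines = Files.readAllLines(Paths.get("input.txt"));

        // Each line is independent, so the work is embarrassingly parallel.
        long total = lines.parallelStream()
                          .mapToLong(ParallelLines::calculate)
                          .sum();

        System.out.println("total = " + total);
    }
}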
I would suggest the second option because it is definitely better design-wise and would allow you to have better control over the work that the worker threads are doing.
Moreover, that would increase performance, since inter-thread communication in that case is the minimum of the two options you described.
Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion among the threads.
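A hedged Java sketch of that option (a C++ version would use mmap); here each thread gets its own disjoint region, which sidesteps mutual exclusion for reads, though real code would still have to align region boundaries to line breaks:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRegions {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("input.txt"),
                                               StandardOpenOption.READ)) {
            long size = ch.size();
            int threads = 4; // arbitrary, for illustration
            long chunk = size / threads;
            Thread[] workers = new Thread[threads];
            for (int t = 0; t < threads; t++) {
                long start = t * chunk;
                long len = (t == threads - 1) ? size - start : chunk;
                // Each thread reads only its own mapped region: no locking needed.
                MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
                workers[t] = new Thread(() -> {
                    long sum = 0;
                    while (region.hasRemaining()) sum += region.get();
                    System.out.println("region byte sum: " + sum);
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
        }
    }
}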