I was asked this question in an interview recently.
Given an input file, a regex, and an output file: read the input file, match each line against the regex, and write the matching lines to the output file.
I came up with the rough scheme of using a BufferedReader chained to a FileReader (to optimize reads from the disk). I used a similar scheme for writing.
The interviewer then said that this process takes 3 seconds to read a line from the file, 1 second to compare the regex with the line, and another 5 seconds to write it back. So it takes a total of 9 seconds per line. How can we improve this?
I suggested reading the entire file at once, processing it, and writing the entire output file at once. However, I was told that won't help (writing 1 line = 5 seconds, writing 2 lines = 10 seconds).
The interviewer further said that this is due to a hardware/hard-drive limitation. I was asked how I can improve my code to reduce the total seconds (currently 9) per line.
I could only think of buffered reading/writing and could not find much on SO either. Any thoughts?
I think that the interviewer was looking for a solution that performs reading/regex checking in parallel with writing the output. If you set up a work queue that you fill asynchronously by reading and filtering, and put writing in a separate thread, then the combined process would take five seconds per line, starting with the second line.
The assumption here is that reading, parsing, and writing can happen independently of each other. In this case, you can be reading line 2 while line 1 is being written: you need only four seconds to read and apply your regex, and you've got a whole five seconds before the writer is ready for the second line. The writing remains your bottleneck, but the whole process gets sped up by some 44%, which isn't bad.
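Here is a minimal sketch of that setup, with a BlockingQueue between the read/filter thread and the writer thread (file names and the pattern are placeholders):

import java.io.*;
import java.util.concurrent.*;
import java.util.regex.Pattern;

public class PipelinedGrep {
    // Sentinel telling the writer that no more lines are coming.
    private static final String EOF = new String();

    public static void main(String[] args) throws Exception {
        Pattern pattern = Pattern.compile("ERROR");   // hypothetical regex
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Writer thread: spends its five seconds per line while we read ahead.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
                for (String line = queue.take(); line != EOF; line = queue.take()) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Main thread: read and filter; matching lines go onto the queue.
        try (BufferedReader in = new BufferedReader(new FileReader("in.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    queue.put(line);
                }
            }
        }
        queue.put(EOF);   // signal end of input
        writer.join();
    }
}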
Well, since the time for a read is fixed and the time for a write is fixed, the only option you have in this case is to change the nature of the regex bit.
You could write code to apply the regex test quickly, without the overhead of all the clever things that regex can do.
If, on the other hand, the problem is that each IO request takes several seconds to execute, but the limitation is not the actual drive, then have several readers reading simultaneously.
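For instance, something as simple as this keeps the matching cheap; the pattern here is hypothetical:

import java.util.regex.Pattern;

class LineMatcher {
    // Compile the pattern once; calling String.matches(regex) per line
    // would recompile it for every single line.
    private static final Pattern PATTERN = Pattern.compile("ERROR");   // hypothetical pattern

    static boolean matchesRegex(String line) {
        return PATTERN.matcher(line).find();
    }

    // If the pattern is really just a literal, a plain substring test
    // skips the regex engine entirely.
    static boolean matchesLiteral(String line) {
        return line.contains("ERROR");
    }
}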
Tough question, since we don't know much about the system.
My guess would be using threads/async processing: one thread to read, one to process, and two or more for writing, thus reducing the time spent in I/O wait.
Let me try to convert this into an ASCII chart:
R/r means reading (3 Sec)
P/p means processing (1 Sec)
W/w means writing (5 Sec)
Upper-case letters mark the start of a step, lower-case letters mark continuing work. A "." means the thread is idle.
Thread 1: RrrRrrRrrRrrRr
Thread 2: ...P..P..P..P.
Thread 3: ....Wwwww
Thread 4: .......Wwwww
With this setup the first line is written back after 9 seconds (not much to do here), but the second one completes after 12 seconds. Single-threaded, the second one would need 18 seconds in total.
Related
I am quite new to Spring batch framework.
I am currently writing a batch with a reader and a writer.
The reader reads from the DB and the writer writes to a flat file. The number of records is 1 million. Writing to the file takes a lot of time and I want to improve on that.
What is the best way I can achieve multithreading in writer so that write() method runs in parallel?
Note: In the @BeforeStep and @AfterStep callbacks, I am writing the header and footer of the file. The write() method writes the records to the file.
EDIT:
I have found out that writing to the file isn't taking much time, but one of our internal methods, which does some sort of decryption, takes about 500 ms per record. And we have 1 million such records.
Can I improve the performance by doing the decryption in multiple threads? I am not able to work out how to improve things from here.
This is not really a Spring-specific question. Typically what people do is implement some kind of streaming: instead of reading the entire query result and then writing everything, you read little bits one after another and pass each bit on to the writer, so it can begin writing before you have finished reading. This is quicker and also keeps memory usage down. For instance, if you had 10 GB of data to read and write, you could split it up into 10 MB queries instead of reading the whole 10 GB. You should read up on streams.

Parallel writes to the same file, however, will bring no benefit, or will even reduce performance. You should also be careful not to start too many threads, as that too will reduce performance; unless your threads are really cheap, I wouldn't make more than two. As others have already mentioned, most apps are I/O bound, and there is no getting around that, only mitigating its effects by buffering/streaming/caching, not blocking threads, or anything else you can do in your app.
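Given the edit above (the real cost is a ~500 ms decryption per record, not the file write), that CPU-bound step is exactly what you can fan out to multiple threads while keeping a single writer. A rough sketch, where decrypt() is a stand-in for your internal method:

import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

class ParallelDecryption {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Stand-in for the internal method that takes ~500 ms per record.
    private String decrypt(String record) {
        return record;
    }

    // Decrypt a chunk of records in parallel, then write them out in order
    // from a single thread, so the file itself is never written concurrently.
    void processChunk(List<String> records, Writer out) throws Exception {
        List<Future<String>> futures = new ArrayList<>();
        for (String record : records) {
            futures.add(pool.submit(() -> decrypt(record)));
        }
        for (Future<String> future : futures) {   // Future.get() preserves input order
            out.write(future.get());
            out.write('\n');
        }
        // Remember to pool.shutdown() once the whole step is complete.
    }
}

Processing chunk by chunk keeps memory bounded even with a million records.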
I have two (Java) processes on different JVMs running repeatedly. The first one regularly finds some "information" and needs to store it somewhere. The second process regularly reads this information to handle it. The intervals are more or less random, so process 1 may find three pieces of information until process 2 reads them or vice versa.
My approach is to write this information to text files. But I am afraid that appending to and reading the text files could happen at the same time, so that I run into locking problems. Writing a new text file for each piece of information, on the other hand, seems like overkill.
What would be a better solution?
EDIT: I am sorry, I did not make clear: The java processes run in different JVMs. They cannot see each other directly.
You can get this to work, provided you are careful with file handling and you don't have a high update rate e.g. 10 updates per second.
Note: you could do it with file renaming instead of locks.
What would be a better solution?
Just about anything. SO is not for recommending things, but in this case I could recommend just about anything without more specific requirements. I could, for example, recommend my library Chronicle Queue, because I wrote it and I'm sure it could do what you want; however, there are many possible alternatives.
I am sending about one line of text every minute.
So you can write a temporary file for each message, rename it when finished. The consumer can have a directory watcher so it knows as soon as you have done this. The consumer could delete the file when done. This has an overhead but it would be less than 10 ms.
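A minimal sketch of both sides, assuming a shared directory called "inbox" that both JVMs can see; java.nio's WatchService does the directory watching:

import java.nio.file.*;

class FileMessaging {
    static final Path DIR = Paths.get("inbox");   // hypothetical shared directory

    // Producer: write under a temporary name, then rename. The consumer can
    // never observe a half-written ".msg" file. ATOMIC_MOVE requires both
    // paths to be on the same filesystem, which they are here.
    static void send(String name, String message) throws Exception {
        Path tmp = DIR.resolve(name + ".tmp");
        Files.write(tmp, message.getBytes("UTF-8"));
        Files.move(tmp, DIR.resolve(name + ".msg"), StandardCopyOption.ATOMIC_MOVE);
    }

    // Consumer: block on a WatchService and pick up finished files.
    static void receive() throws Exception {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        DIR.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = DIR.resolve((Path) event.context());
                if (created.toString().endsWith(".msg")) {
                    System.out.println(new String(Files.readAllBytes(created), "UTF-8"));
                    Files.delete(created);   // consumer deletes the file when done
                }
            }
            key.reset();
        }
    }
}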
If you want to keep a record of all messages, the producer can also write to a log file.
I want to simulate CPU-bound jobs in my simulator, and I need a calculation or piece of code that runs for 1 second on the CPU. How will I do that?
I am using the following code:
long Time1 = System.currentTimeMillis();
///calculation or loop that spends 1 second in cpu
long Time2 = System.currentTimeMillis();
System.out.println(Time2-Time1);
Now I need the calculation that takes 1 second... I also need to simulate for 2, 3, and 4 seconds.
What code should I put in line 2?
If you really meant binding a job to the CPU for 1 second, I don't think it is possible with pure Java alone. The reason is that the OS will still remove the process from the CPU, schedule it again, and so on, a number of times within that second. You would need to make a special kind of request to the OS, one that affects its process-scheduling algorithm. But this is not how we want our applications to consume the CPU; we want the CPU to honor interrupts and so on. So maybe your intention is not clearly stated in the question, or you might be trying to test something special and uncommon.
If you are just simulating something, you might simply use a sleep() call, as suggested in the comments; it would not actually consume the CPU for 1 second, but it allows you to assume so for simulation purposes.
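For what it's worth, here is what the two options might look like; the busy-wait variant keeps the CPU occupied (minus whatever the scheduler takes away), while the sleep variant merely pretends to:

class CpuSimulation {
    // Busy-wait: spins until the deadline has passed.
    static long burnCpu(long seconds) {
        long deadline = System.currentTimeMillis() + seconds * 1000;
        long sink = 0;
        while (System.currentTimeMillis() < deadline) {
            sink += System.nanoTime();   // arbitrary work so the loop isn't optimized away
        }
        return sink;   // returning the value keeps the JIT from eliminating the loop
    }

    // Pure simulation: the thread gives up the CPU entirely for the interval.
    static void pretendCpu(long seconds) throws InterruptedException {
        Thread.sleep(seconds * 1000);
    }

    public static void main(String[] args) {
        long time1 = System.currentTimeMillis();
        burnCpu(1);                          // this is what goes in "line 2" above
        long time2 = System.currentTimeMillis();
        System.out.println(time2 - time1);   // prints roughly 1000
    }
}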
Is there any method so that I can split a text file in java without reading it?
I want to process a large text file, several GB in size, so I want to split the file into small parts, apply a thread to each part, and combine the results.
Since I will be reading the parts anyway, splitting the file by reading it makes no sense: I would have to read the same file twice, which would degrade my performance.
Your threading attempt is ill-formed. If you have to do significant processing with your file data, consider the following threading structure (sketched in code after this list):
1 reader thread (reads the file and feeds the workers)
Queue with read chunks
1..n worker threads (n depends on your CPU cores; processes the data chunks from the reader thread)
Queue or dictionary with processed chunks
1 writer thread (writes the results to some file)
Maybe you could combine the reader/writer thread into one thread, because it doesn't make much sense to parallelize I/O on the same physical hard disk.
It's clear that you need some synchronization between the threads. Especially for the queues, think about semaphores.
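A rough sketch of that structure; BlockingQueue takes care of the semaphore-style synchronization internally, and poison-pill sentinels handle shutdown. The file names and the process() body are placeholders:

import java.io.*;
import java.util.concurrent.*;

public class SplitWork {
    private static final String EOF = new String();   // poison-pill sentinel

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        BlockingQueue<String> toWorkers = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> toWriter = new ArrayBlockingQueue<>(1024);

        // 1..n worker threads: take a chunk, process it, pass the result on.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                for (String line = toWorkers.take(); line != EOF; line = toWorkers.take()) {
                    toWriter.put(process(line));
                }
                toWriter.put(EOF);   // tell the writer this worker is done
                return null;
            });
        }

        // 1 writer thread: drains results until every worker has signalled EOF.
        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("out.txt")))) {
                int done = 0;
                while (done < workers) {
                    String result = toWriter.take();
                    if (result == EOF) done++;
                    else out.println(result);
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // 1 reader thread (here simply the main thread) feeds the workers.
        try (BufferedReader in = new BufferedReader(new FileReader("big.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                toWorkers.put(line);
            }
        }
        for (int i = 0; i < workers; i++) {
            toWorkers.put(EOF);   // one pill per worker
        }
        pool.shutdown();
        writer.join();
    }

    private static String process(String line) {
        return line.toUpperCase();   // stand-in for the real per-chunk work
    }
}

Note that output order is not preserved here; if order matters, tag each chunk with a sequence number and reorder in the writer.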
Without reading the content of the file, you can't do that. It is simply not possible.
I don't think this is possible for the following reasons:
How do you write a file without "reading" it?
You'll need to read in the text to know where a character boundary is (the encoding is not necessarily 1 byte). This means that you cannot treat the file as binary.
Is it really not possible to read line by line and process it like that? That also saves the additional space that the split files would take up alongside the original. For your reference, reading a text file is simply:
public static void loadFileFromInputStream(InputStream in) throws IOException {
    BufferedReader inputStream = new BufferedReader(new InputStreamReader(in));
    String record = inputStream.readLine();
    while (record != null) {
        // do something with the record
        // ...
        record = inputStream.readLine();
    }
}
You're only reading one line at a time... so the size of the file has no impact on memory use at all. You can also stop any time you have to. If you're adventurous, you can also hand the lines off to separate threads to speed up processing. That way, I/O can keep churning along while you process your data.
Good luck! If, for some reason, you do find a solution, please post it here. Thanks!
Technically speaking, it can't be done without reading the file. But you also don't need to keep the entire file contents in memory to do the splitting. Just open a stream to the file and write out to other files, redirecting output to the next file after a certain number of bytes have been written to the current one. This way you are not required to keep more than one byte of file data in memory at any given time. Having a larger buffer, around 8 or 16 KB, will dramatically increase performance.
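A sketch of that byte-wise splitting; note that it cuts at arbitrary byte positions, so lines (and multi-byte characters) can be cut in half, as other answers point out:

import java.io.*;

class ByteSplitter {
    // Split `source` into parts of roughly `bytesPerPart` bytes, holding at
    // most one buffer's worth of data in memory at a time.
    static void split(File source, long bytesPerPart) throws IOException {
        byte[] buffer = new byte[16 * 1024];   // the larger buffer mentioned above
        try (InputStream in = new BufferedInputStream(new FileInputStream(source))) {
            int part = 0;
            long written = 0;
            OutputStream out = newPart(source, part++);
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                written += read;
                if (written >= bytesPerPart) {   // redirect output to the next file
                    out.close();
                    out = newPart(source, part++);
                    written = 0;
                }
            }
            out.close();
        }
    }

    private static OutputStream newPart(File source, int n) throws IOException {
        return new BufferedOutputStream(new FileOutputStream(source.getPath() + ".part" + n));
    }
}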
Something has to read your file to split it (and you probably want to split it at line boundaries, not at some multiple of kilobytes).
If running on Linux machine, you could delegate the splitting to an external command like csplit. So your Java program would simply run a csplit yourbigfile.txt command.
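Something like this; I've used split -l here for a simple fixed-line-count split, since csplit additionally needs a context pattern to split on:

import java.io.IOException;

class ExternalSplit {
    // Delegate the splitting to an external command (Linux/Unix only).
    static void split(String file) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("split", "-l", "100000", file)
                .inheritIO()   // show the command's output and errors on our console
                .start();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("split exited with code " + exit);
        }
    }
}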
In the literal sense no. To literally split a file into smaller files, you have to read the large one and write the smaller ones.
However, I think you really want to know if you can have different threads sequentially reading different "parts" of a file at the same time. And the answer is that you can do that. Just have each thread create its own RandomAccessFile object for the file, seek to the relevant place, and start reading.
(A FileInputStream would probably work too, though I don't think the Java API spec guarantees that skip is implemented using an OS-level "seek" operation on the file.)
There are a couple of possible complications:
If the file is text, you presumably want each thread to start processing at the start of some line in the file. So each thread has to start by finding the end of a line, and make sure that it reads to the end of the last line in its "part".
If the file uses a variable width character encoding (e.g. UTF-8), then you need to deal with the case where your partition boundaries fall in the middle of a character.
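A sketch of one such thread, including the line-boundary handling from the first complication (note that RandomAccessFile.readLine does not handle multi-byte encodings, which is exactly the second complication):

import java.io.*;

// Each thread constructs its own PartReader over a distinct byte range.
class PartReader implements Runnable {
    private final File file;
    private final long start, end;   // the byte range this thread is responsible for

    PartReader(File file, long start, long end) {
        this.file = file;
        this.start = start;
        this.end = end;
    }

    @Override
    public void run() {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);
            // Unless we start at offset 0, we probably landed mid-line: skip to
            // the next line boundary. The thread owning the previous part reads
            // that partial line to its end, so nothing is lost or duplicated.
            if (start > 0) {
                raf.readLine();
            }
            String line;
            // A line that starts inside our range is ours, even if it runs past `end`.
            while (raf.getFilePointer() <= end && (line = raf.readLine()) != null) {
                process(line);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private void process(String line) { /* per-line work */ }
}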
I have one text file that needs to be read by two threads, but I need the reading to happen sequentially. Example: thread 1 gets the lock and reads the first line, the lock is freed; thread 2 gets the lock and reads line 2; and so on.
I was thinking in sharing the same buffer reader or something like that, but I'm not so sure about it.
Thanks in advance!
EDITED
There will be two classes, each with its own thread. Both classes will read the same file.
You can lock the BufferedReader, as you say.
I would warn you that the performance is likely to be worse than using just one thread. However you can do it as an exercise.
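As an exercise, the locking could look like this; each call hands exactly one line to exactly one thread, though it does not force a strict 1-2-1-2 alternation between the threads:

import java.io.*;

// Both threads share one instance; synchronizing makes each readLine() atomic.
class SharedLineSource {
    private final BufferedReader reader;

    SharedLineSource(String path) throws FileNotFoundException {
        this.reader = new BufferedReader(new FileReader(path));
    }

    String nextLine() throws IOException {
        synchronized (reader) {
            return reader.readLine();   // null tells both threads the file is done
        }
    }
}

Each thread then loops with while ((line = source.nextLine()) != null) { ... }.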
It would probably be more performant to read the file line by line in one thread and pass the resulting lines to a thread pool via a queue such as ConcurrentLinkedQueue, if you want to guarantee at least the order in which processing of the file's lines starts. It is much simpler to implement, with no contention on whatever class you use to read the file.
Unless there's some cast-iron reason why you need the reading to happen local to each thread, I'd avoid sharing the file like this.