How to clone or copy a BufferedReader? - java

I'm creating an Android app which is going to read very large files of about 1,000 to 40,000 lines, which takes quite a bit of time with a single loop. Therefore, I'm trying to create a multithreaded reader that spawns multiple threads, each of which reads a specific part of the file, and then puts all the small parts together into one big array or String.
I'm using a BufferedReader which loops through each line in the file and stores the line count.
Each time the loop runs, I check whether lineNumber % LINES_PER_READER == 0 is true. If it is, I create a new reader thread, which should read the next LINES_PER_THREAD lines of the file.
I wonder, because the files can be huge, if I can copy or clone the BufferedReader in some way, so that the new reader thread can just start reading from the line where it was created (I already have a loop that is positioned at that line), instead of creating a new BufferedReader, reading each line until it gets to the specified line, and only then starting to read the actual values.

Don't clone the BufferedReader. It will create trouble. Just send batches of lines to the individual processing threads.
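A minimal sketch of that pattern, assuming a placeholder file name and an arbitrary pool size: one BufferedReader reads the lines in order and hands each full batch to an ExecutorService, so no reader ever has to be cloned.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchedFileReader {

    static final int LINES_PER_READER = 1000; // name taken from the question

    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try (BufferedReader reader = new BufferedReader(new FileReader("big.txt"))) {
            List<String> batch = new ArrayList<>(LINES_PER_READER);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == LINES_PER_READER) {
                    final List<String> work = batch;      // hand off the full batch
                    pool.submit(() -> process(work));
                    batch = new ArrayList<>(LINES_PER_READER);
                }
            }
            if (!batch.isEmpty()) {
                final List<String> work = batch;          // leftover lines
                pool.submit(() -> process(work));
            }
        }
        pool.shutdown();
    }

    static void process(List<String> lines) {
        // placeholder for whatever per-batch work the app actually does
    }
}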

The consequences of not closing a PrintWriter object

I wrote a really basic program which takes two equally sized lists of temperatures and creates a .CSV file from them. The weird part is that when I forgot to close the PrintWriter object and ran the program, the output file was missing around 300 results (I don't know which results were missing), and depending on whether the comma in the println was written as ", " or as ",", it would be missing a different number of results. When I closed the PrintWriter, the output file would have the correct number of results regardless of how the comma was written. I was just wondering if anyone could explain why this was happening. I thought closing the PrintWriter would just close it in the same way as closing a Scanner object would?
I don't have access to the code right now, but it was just a for loop which would print the values at the current index of the two lists in this format:
writer.println(list1.get(i) + "," + list2.get(i));
Typically, output is collected in memory and only written to disk from time to time. This is done since larger disk writes are much more efficient.
When you don't close the writer, you miss the last buffer full of output. But there are other negative consequences as well: the file will stay open until the program exits, and if you do this repeatedly it will lead to resource leaks.
Aside from the writer content not being properly flushed and thus getting partially lost, every open writer hogs resources (RAM and OS file handles), and also blocks the file from being accessed by other processes.
Always close every writer and reader once you're done with it.
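A minimal sketch of the fix, with placeholder names: try-with-resources guarantees the PrintWriter is closed, and therefore flushed, even if an exception is thrown partway through.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

public class CsvWriterExample {
    static void writeCsv(List<Double> list1, List<Double> list2) throws IOException {
        try (PrintWriter writer = new PrintWriter(new FileWriter("temperatures.csv"))) {
            for (int i = 0; i < list1.size(); i++) {
                writer.println(list1.get(i) + "," + list2.get(i));
            }
        } // writer.close() runs here, flushing the last buffer full of output
    }
}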

Add information to a BufferedWriter before flushing it - java

I understand that the BufferedWriter stores the information in memory before it is actually written to a file, and that the actual writing happens when flush(), append(), etc. are executed.
I gather information from multiple sources, so currently I am looping over each source and appending the data directly to the file each time. What I'm trying to accomplish is to add all the information to the BufferedWriter and, after finishing the loop, write it to the file in one go. How could that be done?
I am trying to improve performance by not flushing the data to the file so many times. Performance is an issue because this might loop a million times.
Here is what I'm currently doing:
1. Open the BufferedWriter
2. Read data from a different source and store it in the buffer
3. Append the stored data to the text file (here the buffer is emptied)
4. Repeat steps 2 and 3 fifty times
5. Close the text file
Here is what I'm trying to do:
1. Open the BufferedWriter
2. Read data from a different source and store it in the buffer
3. Repeat step 2 fifty times
4. Append all the data collected (the data gathered over the 50 loops)
5. Close the file
Here is the code:
for (int mainLoop = 0; mainLoop < 50; mainLoop++) {
    try {
        BufferedWriter writer = new BufferedWriter(
                new FileWriter("path to file in computer" + mainLoop + ".txt", true));
        for (int forloop = 0; forloop < 50; forloop++) {
            final Document pageHtml = Jsoup.connect("link to a page").get();
            Elements body = pageHtml.select("p");
            writer.append(System.getProperty("line.separator"));
            writer.append(System.getProperty("line.separator"));
            writer.append(body.text());
            System.out.println(forloop);
        }
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I am trying to improve performance by not flushing the data to the file so many times.
Are you flushing the data manually after each write? Don't do that.
Otherwise, specify a larger size when you instantiate your BufferedWriter.
You have the option of using a StringBuilder to aggregate the output first. However, I assume you have more output than you want to store in memory.
Finally, is there really a performance cost?
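For illustration, a sketch of the "larger buffer" suggestion; the file name and the 1 << 16 size are arbitrary example values, not recommendations from the answer:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class BigBufferExample {
    public static void main(String[] args) throws IOException {
        // The second constructor argument is the buffer size in characters.
        try (BufferedWriter writer = new BufferedWriter(
                new FileWriter("out.txt", true), 1 << 16)) {
            writer.append("many small appends go here");
        } // close() flushes whatever is still buffered
    }
}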
===
The BufferedWriter will optimize the actual writes it performs. As long as you specify a large buffer size, e.g., 10,000, multiple small writes to the buffer will not cause a write until the buffer is full. I see a comment that you are "clearing" the buffer. Don't do that. Leave the BufferedWriter alone and let it do its thing.
If you are accumulating information and then, for some reason, discarding it, use a StringBuilder to accumulate and then write the StringBuilder content to the Writer.
A buffered writer will flush when you instruct it to, and also any time the buffer becomes full. It can be tricky to determine when the buffer becomes full, and really, you should not care: a buffered writer will improve performance regardless of precisely when it flushes. Instead, your output code should use the BufferedWriter just like any other Writer.
I also see in your code that you repeatedly open and close the output file. You almost certainly don't need to do that. Instead open and close the file at a higher level in your program, so it remains open for each iteration.
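A sketch of that restructuring, assuming the output really is meant to be a single file (the Jsoup call from the question is replaced by a placeholder string): the writer is opened once, the buffer batches the small appends, and the single close at the end flushes everything.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class SingleOpenExample {
    public static void main(String[] args) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new FileWriter("path to file in computer.txt", true))) {
            for (int mainLoop = 0; mainLoop < 50; mainLoop++) {
                for (int forloop = 0; forloop < 50; forloop++) {
                    // fetch the next chunk of text here (Jsoup in the original code)
                    String body = "placeholder chunk " + mainLoop + "/" + forloop;
                    writer.append(System.getProperty("line.separator"));
                    writer.append(body);
                }
            }
        } // the one and only close: the remaining buffer is flushed exactly here
    }
}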

Reading a 30GB file using multithreading

I am trying to read a huge file of 30GB (25 million lines). I want to write code which will create a thread pool in which each thread reads 1000 lines in parallel (the first thread would read the first 1000 lines, the second thread the next 1000, and so on).
I have read the entire file and created the thread pool, but now I am stuck on how to ensure that each thread reads only 1000 lines, and on how to keep track of the line numbers that have been read so that the next thread does not have to read those lines.
A. If it's acceptable for all threads to have approximately equal numbers of lines, you can:
Assume the thread pool size is N. The 1st thread seeks to file offset 0 and reads [0, 30GB/N); the 2nd thread seeks to offset 30GB/N and reads [30GB/N, 30GB/N*2); etc.
The 2nd thread may not start at the beginning of a line but in the middle of one. That's OK: just skip the partial line and read from the next complete line onwards. Likewise, the 1st thread may end with a partial line; that's OK too, just keep reading until the '\n' is reached. The remaining threads do the same thing.
B. If all threads must have exactly equal numbers of lines, say 1000 lines, you can:
Have one thread read the whole file and build an index map. The map holds information such as: lines 0~999 start at offset 0, lines 1000~1999 start at offset 13521, etc.
All the threads then read the file from the corresponding offsets, each reading 1000 lines.
Approach A reads the file once. Approach B reads the file twice.
With approach A or B, you can have all threads processing the file (transforming, extracting, cleaning, ...) in parallel. But if the processing is very fast, the bottleneck is disk speed and your application is I/O bound; in that case you should just have one thread read the file and do the processing serially.
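A minimal sketch of the indexing pass in approach B, assuming a single-byte '\n' line terminator and an ASCII-compatible encoding (adjust for \r\n or multi-byte encodings); worker threads would later seek straight to these recorded offsets with a RandomAccessFile:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineIndexer {
    // Returns the byte offset at which every 1000th line starts.
    static List<Long> indexEvery1000Lines(String path) throws IOException {
        List<Long> chunkOffsets = new ArrayList<>();
        chunkOffsets.add(0L);                  // lines 0~999 start at offset 0
        long offset = 0;
        long lineCount = 0;
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n' && ++lineCount % 1000 == 0) {
                    chunkOffsets.add(offset);  // the next line starts right after '\n'
                }
            }
        }
        return chunkOffsets;
    }
}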

IO: Writing and reading to the same text file from a C++ program and from another Java program simultaneously?

Is it possible to read and write to the same text file from both a C++ application and a Java application at the same time, without writing conflicting lines/characters to it? I have tested with two Java applications for now, and it seems that it's possible to write to the file from one process even if the other process has opened the stream but not closed it. Is there any way to lock the file so that the other process has to wait?
I think yes; for example, boost::interprocess (http://www.boost.org/doc/libs/1_50_0/doc/html/interprocess.html) provides file locks (http://www.boost.org/doc/libs/1_50_0/doc/html/interprocess/synchronization_mechanisms.html#interprocess.synchronization_mechanisms.file_lock).
For two processes that are writing to the same file: as long as you flush your output buffers on line boundaries (i.e., flush after you write a newline character sequence), the data written to the file should be interleaved nicely.
If one process is writing while another is reading from the same file, you just have to ensure that the reads don't get ahead of the writes. If a read gets an end-of-file condition (or worse, a partial data line), then you know that the reading process must wait until the writing process has finished writing another chunk of data to the file.
If you need more complicated read/write control, you should consider some kind of locking mechanism.
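On the Java side, a minimal sketch of such a locking mechanism using java.nio's FileChannel/FileLock (the file name is a placeholder); the C++ process would need a cooperating advisory lock, e.g. the boost::interprocess file_lock mentioned above:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class SharedFileWriter {
    static void appendLine(String line) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("shared.txt", "rw");
             FileChannel channel = file.getChannel();
             FileLock lock = channel.lock()) {   // blocks until the lock is free
            file.seek(file.length());            // append at the current end
            file.write((line + System.lineSeparator()).getBytes("UTF-8"));
        } // lock released, channel and file closed
    }
}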

Splitting text file without reading it

Is there any method by which I can split a text file in Java without reading it?
I want to process a large text file, several GB in size, so I want to split the file into small parts, apply a thread to each part, and combine the results.
Since I will be reading it in small parts anyway, splitting the file by reading it won't make any sense, as I would have to read the same file twice, and that would degrade my performance.
Your threading approach is ill-conceived. If you have to do significant processing on your file data, consider the following threading structure (sketched below):
1 reader thread (reads the file and feeds the workers)
a queue with read chunks
1..n worker threads (n depends on your CPU cores; they process the data chunks from the reader thread)
a queue or dictionary with processed chunks
1 writer thread (writes the results to some file)
Maybe you could combine the reader and writer threads into one thread, because it doesn't make much sense to parallelize IO on the same physical hard disk.
It's clear that you need some synchronization between the threads, especially for the queues; think about semaphores.
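A minimal sketch of that structure (file name, chunk granularity, and pool size are illustrative): a BlockingQueue provides the semaphore-style blocking between the reader and the workers, and a sentinel object tells each worker to stop.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWorkerPipeline {
    // Unique sentinel object, compared by identity, so a real "EOF" line is safe.
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        int workers = Runtime.getRuntime().availableProcessors();

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        process(line);                    // the CPU-heavy part
                    }
                } catch (InterruptedException ignored) { }
            }).start();
        }

        // the single reader thread (here: the main thread) feeds the queue
        try (BufferedReader reader = new BufferedReader(new FileReader("big.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line);
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);                            // one stop signal per worker
        }
        // a writer thread collecting processed chunks would follow the same pattern
    }

    static void process(String line) { /* placeholder */ }
}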
Without reading the content of the file you can't do that; it is not possible.
I don't think this is possible for the following reasons:
How do you write a file without "reading" it?
You'll need to read in the text to know where a character boundary is (the encoding is not necessarily 1 byte). This means that you cannot treat the file as binary.
Is it really not possible to read line-by-line and process the file like that? That also saves the additional space that the split files would take up alongside the original. For your reference, reading a text file is simply:
public static void loadFileFromInputStream(InputStream in) throws IOException {
    BufferedReader inputStream = new BufferedReader(new InputStreamReader(in));
    String record = inputStream.readLine();
    while (record != null) {
        // do something with the record
        // ...
        record = inputStream.readLine();
    }
}
You're only reading one line at a time, so the size of the file does not impact memory usage at all. You can also stop any time you have to. If you're adventurous, you can also hand the lines off to separate threads to speed up processing. That way, IO can continue churning along while you process your data.
Good luck! If, for some reason, you do find a solution, please post it here. Thanks!
Technically speaking, it can't be done without reading the file. But you also don't need to keep the entire file contents in memory to do the splitting. Just open a stream to the file and write out to other files, redirecting the output to a new file after a certain number of bytes has been written to the current one. This way you are not required to keep more than one byte of file data in memory at any given time, though using a larger buffer, about 8 or 16 KB, will dramatically increase performance.
Something has to read your file to split it (and you probably want to split it at line boundaries, not at some multiple of kilobytes).
If running on a Linux machine, you could delegate the splitting to an external command like csplit. Your Java program would then simply run a csplit yourbigfile.txt command.
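A sketch of that delegation via ProcessBuilder; the exact csplit arguments ("split at line 1000, then repeat") are an assumption here, so check the csplit man page for the variant you need:

import java.io.IOException;

public class ExternalSplit {
    public static void main(String[] args) throws IOException, InterruptedException {
        // GNU csplit: split yourbigfile.txt at line 1000 and repeat as often as possible
        Process p = new ProcessBuilder("csplit", "yourbigfile.txt", "1000", "{*}")
                .inheritIO()   // forward csplit's stdout/stderr to our console
                .start();
        System.out.println("csplit exited with code " + p.waitFor());
    }
}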
In the literal sense no. To literally split a file into smaller files, you have to read the large one and write the smaller ones.
However, I think you really want to know if you can have different threads sequentially reading different "parts" of a file at the same time. And the answer is that you can do that. Just have each thread create its own RandomAccessFile object for the file, seek to the relevant place, and start reading.
(A FileInputStream would probably work too, though I don't think that the Java API spec guarantees that skip is implemented using an OS-level "seek" operation on the file.)
There are a couple of possible complications:
If the file is text, you presumably want each thread to start processing at the start of some line in the file. So each thread has to start by finding the end of a line, and make sure that it reads to the end of the last line in its "part".
If the file uses a variable width character encoding (e.g. UTF-8), then you need to deal with the case where your partition boundaries fall in the middle of a character.
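A minimal sketch of this RandomAccessFile approach, handling the first complication so that every line is processed by exactly one thread; note that RandomAccessFile.readLine() treats each byte as one character, so a variable-width encoding such as UTF-8 needs different handling, as noted above.

import java.io.IOException;
import java.io.RandomAccessFile;

public class FilePartReader {
    // Processes every line whose first byte lies in [start, end).
    static void readPart(String path, long start, long end) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (start > 0) {
                file.seek(start - 1);
                file.readLine();   // consume to the end of the line containing byte start-1
            }
            String line;
            while (file.getFilePointer() < end && (line = file.readLine()) != null) {
                // process(line); the last line may run past 'end', which is fine:
                // the thread that owns the next part will skip it
            }
        }
    }
}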
