I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
    try (BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
        br.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write(line + "\n");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}
This takes approx. 7 minutes. Is it possible to speed up that process even more?
If it is an option you could use GZipInputStream/GZipOutputStream to minimize disk I/O.
Files.newBufferedReader/Writer use a default buffer size, 8 KB I believe. You might try a larger buffer.
Converting to String (Unicode) slows things down (and uses twice the memory), and the UTF-8 decoding involved is not as simple as StandardCharsets.ISO_8859_1 would be.
Best would be if you can work with bytes for the most part and only for specific CSV fields convert them to String.
A memory-mapped file might be the most appropriate approach. Parallelism could then be applied per file range, splitting up the file.
try (FileChannel sourceChannel = new RandomAccessFile("source.csv","r").getChannel(); ...
MappedByteBuffer buf = sourceChannel.map(...);
This becomes a fair amount of code to get the line splitting right on (byte) '\n', but it is not overly complex.
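For illustration, here is a rough, untested sketch of that approach for a single mapped region. A real solution would map the 35 GB file in several regions of at most 2 GB each, align each region boundary to a line break, and hand the regions to separate threads; the filter/write step inside the loop is a placeholder.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRegionScan {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("source.csv", "r");
             FileChannel sourceChannel = raf.getChannel()) {
            // Map the first region only; a single MappedByteBuffer is limited to ~2 GB.
            long regionSize = Math.min(sourceChannel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = sourceChannel.map(FileChannel.MapMode.READ_ONLY, 0, regionSize);

            int lineStart = 0;
            for (int i = 0; i < buf.limit(); i++) {
                if (buf.get(i) == (byte) '\n') {          // found the end of a line
                    byte[] line = new byte[i - lineStart];
                    buf.position(lineStart);
                    buf.get(line);                        // copy this line's bytes
                    // apply the filter and write the matching bytes here (placeholder)
                    lineStart = i + 1;
                }
            }
        }
    }
}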
You can try this:
try (BufferedWriter writer = new BufferedWriter(new FileWriter(targetFile), 1024 * 1024 * 64)) {
    try (BufferedReader br = new BufferedReader(new FileReader(sourceFile), 1024 * 1024 * 64)) {
I think it will save you one or two minutes; on my machine the test finishes in about 4 minutes just by specifying the buffer size. Could it be faster? Try this:
final char[] cbuf = new char[1024 * 1024 * 128];

try (Writer writer = new FileWriter(targetFile)) {
    try (Reader br = new FileReader(sourceFile)) {
        int cnt = 0;
        while ((cnt = br.read(cbuf)) > 0) {
            // add your code to process/split the buffer into lines.
            writer.write(cbuf, 0, cnt);
        }
    }
}
This should save you three or four minutes.
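The tricky part of the char[] approach is the "process/split the buffer into lines" comment: a line can straddle two reads. A minimal, untested sketch of how that could be handled, carrying the partial last line over to the next iteration (file names, buffer size, and the blank-line check are placeholders for the real condition):

import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class ChunkedLineCopy {
    public static void main(String[] args) throws IOException {
        final char[] cbuf = new char[1024 * 1024 * 8];      // chunk size is arbitrary
        StringBuilder carry = new StringBuilder();          // partial line left over from the previous read
        try (Writer writer = new FileWriter("target.csv");
             Reader reader = new FileReader("source.csv")) {
            int cnt;
            while ((cnt = reader.read(cbuf)) > 0) {
                int lineStart = 0;
                for (int i = 0; i < cnt; i++) {
                    if (cbuf[i] == '\n') {                  // assumes '\n' line endings
                        carry.append(cbuf, lineStart, i - lineStart);
                        String line = carry.toString();
                        carry.setLength(0);
                        if (!line.trim().isEmpty()) {       // placeholder for the real filter condition
                            writer.write(line);
                            writer.write('\n');
                        }
                        lineStart = i + 1;
                    }
                }
                carry.append(cbuf, lineStart, cnt - lineStart);   // keep the unterminated tail for the next read
            }
            if (carry.length() > 0 && !carry.toString().trim().isEmpty()) {
                writer.write(carry.toString());                   // last line had no trailing '\n'
                writer.write('\n');
            }
        }
    }
}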
If that's still not enough (I guess the reason you are asking is that you need to execute the task repeatedly), and you want to get it done in one minute or even a couple of seconds, then you should process the data and save it into a database, and then split the task across multiple servers.
Thanks for all your suggestions. The fastest variant I came up with was exchanging the writer for a BufferedOutputStream, which gave approximately a 25% improvement:
try (BufferedReader reader = Files.newBufferedReader(Paths.get("sample.csv"))) {
    try (BufferedOutputStream writer = new BufferedOutputStream(Files.newOutputStream(Paths.get("target.csv")), 1024 * 16)) {
        reader.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write((line + "\n").getBytes());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}
Still, the BufferedReader performs better than a BufferedInputStream in my case.
Related
I am trying to use the code below to download and read data from a file, but it goes OOM, exactly while reading the file. The size of the S3 file is 22 MB; when I download it through the browser it is 650 MB uncompressed; but when I monitor through VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone please guide me so that I can find the reason for the high memory usage? Thanks.
public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
    System.out.println("Unzipping.............");
    GZIPInputStream gzis = null;
    try {
        gzis = new GZIPInputStream(in);
        InputStreamReader reader = new InputStreamReader(gzis);
        BufferedReader br = new BufferedReader(reader);
        double mb = 0;
        String readed;
        int i = 0;
        while ((readed = br.readLine()) != null) {
            mb = mb + readed.getBytes().length / (1024 * 1024);
            i++;
            if (i % 100 == 0) { System.out.println(mb); }
        }
    } catch (IOException e) {
        e.printStackTrace();
        LOG.error("Invoked AWSUtils getS3Content : json ", e);
    } finally {
        closeStreams(gzis, in);
    }
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)
This is a theory, but I can't think of any other reasons why your example would OOM.
Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes.
Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.
Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very long line, then the StringBuffer is going to get very large.
Each text character in the uncompressed string becomes a char in the char[] that is the buffer part of the StringBuffer.
Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.
So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters, and while the data is being copied a total of 3 x N characters' worth of storage will be in use. Since each char takes 2 bytes, a 650M-character line could easily turn into a heap demand of more than 6 x 650M bytes.
The other thing to note is that the 2 x N array has to be a single contiguous heap node.
Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
So what is the solution?
It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.
public static String unzip(InputStream in)
        throws IOException, CompressorException, ArchiveException {
    System.out.println("Unzipping.............");
    try (
        GZIPInputStream gzis = new GZIPInputStream(in);
        InputStreamReader reader = new InputStreamReader(gzis);
        BufferedReader br = new BufferedReader(reader);
    ) {
        int ch;
        long i = 0;
        while ((ch = br.read()) >= 0) {
            i++;
            if (i % (100 * 1024 * 1024) == 0) {
                System.out.println(i / (1024 * 1024));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        LOG.error("Invoked AWSUtils getS3Content : json ", e);
    }
    return null; // return value omitted; this version only counts characters
}
I also thought of the too-long line.
On second thought, I think the StringBuffer that readLine() uses internally needs to be converted to its result type: a String. Strings are immutable, but for speed reasons the runtime will not even check whether a line is a duplicate, so it may allocate the String many times, ultimately filling up the heap with no-longer-used String fragments.
My recommendation would be not to read lines or characters, but chunks of bytes. A byte[] is allocated on the heap and can be thrown away afterwards. Of course you would then count bytes instead of characters. Unless you know the difference and actually need characters, that should be the more stable and performant solution.
This code is just written from memory and not tested:
public static String unzip(InputStream in)
        throws IOException, CompressorException, ArchiveException {
    System.out.println("Unzipping.............");
    try (
        GZIPInputStream gzis = new GZIPInputStream(in);
    ) {
        byte[] buffer = new byte[8192];
        long i = 0;
        int read = gzis.read(buffer);
        while (read >= 0) {
            i += read;
            if (i / (100 * 1024 * 1024) > (i - read) / (100 * 1024 * 1024)) { // crossed another 100 MB boundary
                System.out.println(i / (1024 * 1024));
            }
            read = gzis.read(buffer);
        }
    } catch (IOException e) {
        e.printStackTrace();
        LOG.error("Invoked AWSUtils getS3Content : json ", e);
    }
    return null; // return value omitted; this version only counts bytes
}
I have a java program that sends a series of GET requests to a webservice and stores the response body as a text file.
I have implemented the following example code (much of the code is filtered out to highlight the relevant part), which appends to the text file, writing each response as a new line at the end of the file. The code works, but performance suffers as the file grows bigger.
The total size of the data is almost 4 GB, and each append adds about 500 KB to 1 MB of data on average.
do
{
    //send the GET request & fetch data as string
    String resultData = HTTP.GET <uri>;

    // buffered writer to create a file
    BufferedWriter writer = new BufferedWriter(new FileWriter(path, true));

    //write or append the file
    writer.write(resultData + "\n");
}
while(resultData.exists());
These files are created on a daily basis and moved to HDFS for Hadoop consumption and as a real-time archive. Is there a better way to achieve this?
1) You are opening a new writer every time, without closing the previous writer object.
2) Don't open the file for each write operation; instead, open it before the loop and close it after the loop.
BufferedWriter writer = new BufferedWriter(new FileWriter(path, true));
do {
    String resultData = HTTP.GET <uri>;
    writer.write(resultData + "\n");
} while (resultData.exists());
writer.close();
3) The default buffer size of BufferedWriter is 8192 characters. Since you have 4 GB of data, I would increase the buffer size to improve performance, but at the same time make sure your JVM has enough memory to hold the data.
BufferedWriter writer = new BufferedWriter(new FileWriter(path, true), 8192 * 4);
do {
    String resultData = HTTP.GET <uri>;
    writer.write(resultData + "\n");
} while (resultData.exists());
writer.close();
4) Since you are making a GET web service call, the performance also depends on the response time of the web service.
According to this answer, Java difference between FileWriter and BufferedWriter, what you are doing right now is inefficient.
The code you provided is incomplete: brackets are missing and there is no close statement for the writer. But if I understand correctly, for every resultData you open a new buffered writer and call write once. This means you might as well use the FileWriter directly, since the way you are doing it the buffer is just overhead.
If what you want is to get data in a loop and write it to a single file, then you should do something like this:
try (BufferedWriter writer = new BufferedWriter(new FileWriter("PATH_HERE", true))) {
    String resultData = "";
    do {
        //send the GET request & fetch data as string
        resultData = HTTP.GET <uri>;

        //write or append the file
        writer.write(resultData + "\n");
    } while (resultData != null && !resultData.isEmpty());
} catch (Exception e) {
    e.printStackTrace();
}
The above uses try-with-resources, which will handle closing the writer after exiting the try block. This has been available since Java 7.
I was trying to read a file into an array by using FileInputStream, and an ~800KB file took about 3 seconds to read into memory. I then tried the same code except with the FileInputStream wrapped into a BufferedInputStream and it took about 76 milliseconds. Why is reading a file byte by byte done so much faster with a BufferedInputStream even though I'm still reading it byte by byte? Here's the code (the rest of the code is entirely irrelevant). Note that this is the "fast" code. You can just remove the BufferedInputStream if you want the "slow" code:
InputStream is = null;
try {
    is = new BufferedInputStream(new FileInputStream(file));
    int[] fileArr = new int[(int) file.length()];
    for (int i = 0, temp = 0; (temp = is.read()) != -1; i++) {
        fileArr[i] = temp;
    }
The BufferedInputStream version is over 30 times faster (closer to 40, in fact). So why is this, and is it possible to make this code more efficient (without using any external libraries)?
In FileInputStream, the method read() reads a single byte. From the source code:
/**
 * Reads a byte of data from this input stream. This method blocks
 * if no input is yet available.
 *
 * @return     the next byte of data, or <code>-1</code> if the end of the
 *             file is reached.
 * @exception  IOException  if an I/O error occurs.
 */
public native int read() throws IOException;
This is a native call to the OS which uses the disk to read the single byte. This is a heavy operation.
With a BufferedInputStream, the method delegates to an overloaded read() method that reads up to 8192 bytes at once and buffers them until they are needed. It still returns only the single byte (but keeps the others in reserve). This way the BufferedInputStream makes far fewer native calls to the OS to read from the file.
For example, your file is 32768 bytes long. To get all the bytes in memory with a FileInputStream, you will require 32768 native calls to the OS. With a BufferedInputStream, you will only require 4, regardless of the number of read() calls you will do (still 32768).
As to how to make it faster, you might want to consider Java 7's NIO FileChannel class, but I have no evidence to support this.
Note: if you used FileInputStream's read(byte[], int, int) method directly instead, with a byte[>8192] you wouldn't need a BufferedInputStream wrapping it.
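To illustrate that last point, here is a minimal sketch of reading through a FileInputStream with a large byte[] instead of byte by byte; the file name and buffer size are arbitrary placeholders.

import java.io.FileInputStream;
import java.io.IOException;

public class BulkRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[64 * 1024];               // one native read fills up to 64 KB
        try (FileInputStream in = new FileInputStream("file.bin")) {
            int n;
            long total = 0;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                total += n;                                // process buffer[0..n) here
            }
            System.out.println("read " + total + " bytes");
        }
    }
}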
A BufferedInputStream wrapped around a FileInputStream will request data from the FileInputStream in big chunks (8 KB by default). Thus if you read 1000 characters one at a time, the FileInputStream will only have to go to the disk once. This will be much faster!
It is because of the cost of disk access. Let's assume you have a file whose size is 8 KB. Without a BufferedInputStream, 8*1024 disk accesses will be needed to read this file byte by byte.
At this point, BufferedInputStream comes onto the scene and acts as a middleman between the FileInputStream and the caller.
In one shot it pulls a chunk of bytes (8 KB by default) into memory, and subsequent read() calls are served from that buffer rather than from the file.
This decreases the time of the operation.
private void exercise1WithBufferedStream() {
    long start = System.currentTimeMillis();
    try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
        BufferedInputStream bufferedInputStream = new BufferedInputStream(myFile);
        boolean eof = false;
        while (!eof) {
            int inByteValue = bufferedInputStream.read();
            if (inByteValue == -1) eof = true;
        }
    } catch (IOException e) {
        System.out.println("Could not read the stream...");
        e.printStackTrace();
    }
    System.out.println("time passed with buffered:" + (System.currentTimeMillis() - start));
}

private void exercise1() {
    long start = System.currentTimeMillis();
    try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
        boolean eof = false;
        while (!eof) {
            int inByteValue = myFile.read();
            if (inByteValue == -1) eof = true;
        }
    } catch (IOException e) {
        System.out.println("Could not read the stream...");
        e.printStackTrace();
    }
    System.out.println("time passed without buffered:" + (System.currentTimeMillis() - start));
}
I am trying to read a text file which contains about 1000 very long lines. The entire file is about 1.4 MB.
I am using BufferedReader's readLine method to read the file. It takes 8-10 seconds to print the output on the console. I tried the same thing using PHP's fgets and it prints all the same lines in the blink of an eye! How is that possible?
Below is the code I am using
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ClickLogDataImporter {

    public static void main(String[] args) {
        try {
            new ClickLogDataImporter().getFileData();
        } catch (Exception ex) {
            Logger.getLogger(ClickLogDataImporter.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public void getFileData() throws FileNotFoundException, IOException {
        String path = "/home/shantanu/Documents";
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(path + "/sample.txt")));
        String line = "";
        while ((line = (br.readLine())) != null) {
            System.out.println(line);
        }
    }
}
PHP code
<?php
$fileName = "/home/shantanu/Documents/sample.txt";
$file = fopen($fileName, 'r');
while (($line = fgets($file)) != false) {
    echo $line."\n";
}
?>
Please enlighten me about this issue
I'm not sure, but I think PHP just dumps the file to output with the method you used, whereas Java reads the file and extracts every line from it, which means checking every character for a line break; the two processes are not the same at all. (PHP's file_get_contents reads the whole file into a string in one go.)
If you try to print each line one by one from the file with PHP, it should be slower.
8 seconds for that code sounds much too long to me. I suspect something else is going on, to be honest. Are you sure it's not console output which is taking a long time?
I suggest you time it (e.g. with System.nanoTime) writing out the total time at the end, but run it with a console minimized. I suspect you'll find it's fast enough then.
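For example, a small timing harness along those lines, counting lines instead of printing them (a sketch; the path is taken from the question and may need adjusting):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadTiming {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long lines = 0;
        try (BufferedReader br = new BufferedReader(new FileReader("/home/shantanu/Documents/sample.txt"))) {
            while (br.readLine() != null) {
                lines++;                         // count only; no console output inside the loop
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines read in " + elapsedMs + " ms");
    }
}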
Isn't it just the console output that is slow? Now that you know your file is read correctly, try commenting out the line System.out.println(line);.
file_get_contents loads all the file contents into a String, whereas your Java code reads and prints the file line by line.
If you are testing inside an IDE like Eclipse, the console output can be quite slow.
If you want the exact behavior of file_get_contents, you can use this dirty code :
File f = new File(path, "sample.txt");
ByteArrayOutputStream bos = new ByteArrayOutputStream(new Long(Math.min(Integer.MAX_VALUE, f.length())).intValue());
FileInputStream fis = new FileInputStream(f);
byte[] buf = new byte[1024 * 8];
int size;
while ((size = fis.read(buf)) > 0) {
    bos.write(buf, 0, size);
}
fis.close();
bos.close();
System.out.println(new String(bos.toByteArray()));
Well, if you are using readLine, the file will be accessed once for each of the 1000 lines. Try using the read function with a very big buffer, say over 28000 characters; it will then hit the file only about 60 times for 1.4 MB, which is far fewer than 1000. If you use a small buffer of 1000, it is going to hit the file around 1300 times, which is even slower than the 1000 reads of readLine. Also, when printing the lines use print instead of println, since what you get back is not exactly lines but an array of characters.
Readers are usually slow; you should try stream readers, which are faster. Also make sure that the file-opening step is not what takes the time: if the file is opened and the stream objects are created before you start measuring, you can figure out whether the time goes into opening the file or into reading it. Make sure the system I/O load is not high at the time of this operation, otherwise your measurement will be off.
BufferedInputStream reader = new BufferedInputStream(new FileInputStream("/home/shantanu/Documents/sample.txt"));
byte[] line = new byte[1024];
int n;
while ((n = reader.read(line)) > 0) {
    System.out.print(new String(line, 0, n)); // only print the bytes actually read
}
Here is how I compressed the string into a file:
public static void compressRawText(File outFile, String src) {
    FileOutputStream fo = null;
    GZIPOutputStream gz = null;
    try {
        fo = new FileOutputStream(outFile);
        gz = new GZIPOutputStream(fo);
        gz.write(src.getBytes());
        gz.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            gz.close();
            fo.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is how I decompressed it:
static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;

public static String decompressRawText(File inFile) {
    InputStream in = null;
    InputStreamReader isr = null;
    StringBuilder sb = new StringBuilder(STRING_SIZE); //constant resizing is costly, so set the STRING_SIZE
    try {
        in = new FileInputStream(inFile);
        in = new BufferedInputStream(in, BUFFER_SIZE);
        in = new GZIPInputStream(in, BUFFER_SIZE);
        isr = new InputStreamReader(in);
        char[] cbuf = new char[BUFFER_SIZE];
        int length = 0;
        while ((length = isr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, length);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            in.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }
    return sb.toString();
}
The decompression seems to take forever. I have a feeling that I am doing too many redundant steps in the decompression bit. Any idea how I could speed it up?
EDIT: I have modified the code to the above based on the recommendations given.
1. I changed the pattern to simplify my code a bit, but if I can't use IOUtils, is it still OK to use this pattern?
2. I set the StringBuilder buffer to 2M, as suggested by entonio; should I set it a little higher? The memory is still OK: I still have around 10M available according to the heap monitor in Eclipse.
3. I cut the BufferedReader and added a BufferedInputStream, but I am still not sure about the BUFFER_SIZE; any suggestions?
The above modification has improved the time taken to loop over all my 30 2M files from almost 30 seconds to around 14, but I need to reduce it to under 10. Is that even possible on Android? Basically, I need to process 60M of text in total; I have divided it up into 30 files of 2M each, and the timing above covers just looping over all the files and getting the String in each file into memory, before I start processing each string. Since I don't have much experience, would it be better if I used 60 files of 1M instead? Are there any other improvements I should adopt? Thanks.
ALSO: Since physical IO is quite time consuming, and since my compressed versions of the files are all quite small (around 2K from 2M of text), is it possible for me to still do the above, but on a file that is already mapped to memory, possibly using Java NIO? Thanks.
The BufferedReader's only purpose is the readLine() method, which you don't use, so why not just read from the InputStreamReader? Also, maybe decreasing the buffer size would be helpful. You should also probably specify the encoding when both reading and writing, though that shouldn't have an impact on performance.
edit: more data
If you know the size of the string ahead of time, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized in order to accommodate the result, and that's costly.
edit: clarification
2MB implies a lot of resizes. There is no harm in specifying a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
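A sketch of what that suggestion could look like, reusing BUFFER_SIZE from the code above and leaving exception handling to the caller; the extra parameter name is illustrative only:

// Sketch: pass the expected character count so the StringBuilder never has to grow.
public static String decompressRawText(File inFile, int expectedChars) throws IOException {
    StringBuilder sb = new StringBuilder(expectedChars);
    try (InputStream in = new GZIPInputStream(
                 new BufferedInputStream(new FileInputStream(inFile), BUFFER_SIZE), BUFFER_SIZE);
         InputStreamReader isr = new InputStreamReader(in)) {
        char[] cbuf = new char[BUFFER_SIZE];
        int length;
        while ((length = isr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, length);
        }
    }
    return sb.toString();
}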
You should wrap the FileInputStream with a BufferedInputStream before wrapping with a GZIPInputStream, rather than using a BufferedReader.
The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.
Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.
Edit: one nice pattern to follow when building up a bunch of stream delegates is to use a single InputStream variable. Then you only have one thing to close in your finally block (and you can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).
InputStream in = null;
try
{
    in = new FileInputStream("foo");
    in = new BufferedInputStream(in);
    in = new GZIPInputStream(in);

    // do something with the stream
}
finally
{
    IOUtils.closeQuietly(in);
}
Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.
Similarly when writing.
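Just to make the wrapping order concrete in both directions, here is a minimal sketch; the class name, helper names, and path parameter are placeholders.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBuffering {
    // Reading: the buffer sits between the file and the gzip decoder,
    // so the decoder's small reads are served from memory.
    static GZIPInputStream openCompressedForRead(String path) throws IOException {
        return new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)));
    }

    // Writing: same idea in reverse; the gzip encoder's output is collected
    // in the buffer before hitting the file.
    static GZIPOutputStream openCompressedForWrite(String path) throws IOException {
        return new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(path)));
    }
}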