OOM when trying to process s3 file


I am trying to use the code below to download and read data from a file, but it goes OOM exactly while reading the file. The size of the S3 file is 22 MB; downloaded through the browser it is 650 MB, but when I monitor with VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone guide me so that I can find the reason for the high memory usage? Thanks.
public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
System.out.println("Unzipping.............");
GZIPInputStream gzis = null;
try {
gzis = new GZIPInputStream(in);
InputStreamReader reader = new InputStreamReader(gzis);
BufferedReader br = new BufferedReader(reader);
double mb = 0;
String readed;
int i=0;
while ((readed = br.readLine()) != null) {
mb = mb+readed.getBytes().length / (1024*1024);
i++;
if(i%100==0) {System.out.println(mb);}
}
} catch (IOException e) {
e.printStackTrace();
LOG.error("Invoked AWSUtils getS3Content : json ", e);
} finally {
closeStreams(gzis, in);
}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)

This is a theory, but I can't think of any other reasons why your example would OOM.
Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes.
Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.
Internally, the readLine() method reads characters from its internal buffer and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very large line, then the StringBuffer is going to get very large.
Each text character in the uncompressed string becomes a char in the char[] that is the buffer part of the StringBuffer.
Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.
So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters. And while the data is being copied, a total of 3 x N characters of storage will be in use.
Since each char occupies 2 bytes, 3 x N characters is 6 x N bytes, so 650MB could easily turn into a heap demand of > 6 x 650M bytes.
The other thing to note is that the 2 x N array has to be a single contiguous heap node.
Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
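The arithmetic above can be sketched as a small simulation. This is only a rough model under stated assumptions: the buffer starts at 80 chars, grows by a `2 * capacity + 2` rule (an assumption about the JDK in use), each char takes 2 bytes, and the old and new arrays coexist while the copy runs. The class and method names are hypothetical.

```java
final class BufferGrowthEstimate {

    /**
     * Estimates the peak number of bytes held while a doubling char buffer
     * grows until it can hold lineChars characters.
     */
    static long peakBytesDuringGrowth(long lineChars, long initialCapacity) {
        long capacity = initialCapacity;
        long peakChars = capacity;
        while (capacity < lineChars) {
            long grown = capacity * 2 + 2;              // assumed growth rule
            // The old and the new array are both live while the copy runs.
            peakChars = Math.max(peakChars, capacity + grown);
            capacity = grown;
        }
        return peakChars * Character.BYTES;             // 2 bytes per char
    }

    public static void main(String[] args) {
        long peak = peakBytesDuringGrowth(650_000_000L, 80);
        System.out.println(peak / (1024 * 1024) + " MB held at the peak");
    }
}
```

For a 650-million-character line this model puts the peak at about 2 GB for the line buffer alone, before counting anything else on the heap.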
So what is the solution?
It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.
public static String unzip(InputStream in)
throws IOException, CompressorException, ArchiveException {
System.out.println("Unzipping.............");
try (
GZIPInputStream gzis = new GZIPInputStream(in);
InputStreamReader reader = new InputStreamReader(gzis);
BufferedReader br = new BufferedReader(reader);
) {
int ch;
long i = 0;
while ((ch = br.read()) >= 0) {
i++;
if (i % (100 * 1024 * 1024) == 0) {
System.out.println(i / (1024 * 1024));
}
}
} catch (IOException e) {
e.printStackTrace();
LOG.error("Invoked AWSUtils getS3Content : json ", e);
}
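If the data really is line-oriented and lines are still wanted, one hedged option is to cap the line length so that an unexpectedly long line fails fast instead of exhausting the heap. `BoundedLineReader` and `maxChars` are hypothetical names, not part of the JDK; the sketch treats `\n` as the terminator and simply drops every `\r`, which is a simplification.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class BoundedLineReader {

    /**
     * Reads one line from the reader. Returns null at end of stream and
     * throws IOException if the line grows beyond maxChars characters.
     * Pass a buffered reader; this method reads one char per call.
     */
    static String readLine(Reader in, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        int ch;
        while ((ch = in.read()) >= 0) {
            sawAny = true;
            if (ch == '\n') {
                return sb.toString();
            }
            if (ch == '\r') {
                continue;                       // simplification: drop CR
            }
            if (sb.length() >= maxChars) {
                throw new IOException("Line exceeds " + maxChars + " characters");
            }
            sb.append((char) ch);
        }
        return sawAny ? sb.toString() : null;   // last line without '\n', or EOF
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("short\nanother short line\n");
        String line;
        while ((line = readLine(r, 1024)) != null) {
            System.out.println(line.length());
        }
    }
}
```

The cap bounds the memory held per line by `maxChars`, whatever the input looks like.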

I also thought of an overly long line.
On second thought, I think the StringBuffer that is used internally by the JVM needs to be converted to the result type of readLine(): a String. Strings are immutable, but for speed reasons the JVM would not even look up whether a line is a duplicate. So it may allocate the String many times, ultimately filling up the heap with String fragments that are no longer used.
My recommendation would be not to read lines or characters, but chunks of bytes. A byte[] is allocated on the heap and can be thrown away afterwards. Of course you would then count bytes instead of characters. Unless you know the difference and actually need characters, that could be the more stable and performant solution.
This code is written from memory and not tested:
public static String unzip(InputStream in)
throws IOException, CompressorException, ArchiveException {
System.out.println("Unzipping.............");
try (
GZIPInputStream gzis = new GZIPInputStream(in);
) {
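byte[] buffer = new byte[8192];
long i = 0;
// read() may return fewer bytes than requested, so the running total
// rarely lands exactly on a multiple of 100 MB; compare against a
// threshold instead of using a modulo test.
long nextReport = 100L * 1024 * 1024;
int read = gzis.read(buffer);
while (read >= 0) {
i += read;
if (i >= nextReport) {
System.out.println(i / (1024 * 1024));
nextReport += 100L * 1024 * 1024;
}
read = gzis.read(buffer);
}
} catch (IOException e) {
e.printStackTrace();
LOG.error("Invoked AWSUtils getS3Content : json ", e);
}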

Related

How to copy large data files line by line?

I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
try (BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
br.lines().parallel()
.filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
.forEach(line -> {
writer.write(line + "\n");
});
}
}
This takes approx. 7 minutes. Is it possible to speed up that process even more?

If it is an option, you could use GZIPInputStream/GZIPOutputStream to minimize disk I/O.
Files.newBufferedReader/Writer use a default buffer size, 8 KB I believe. You might try a larger buffer.
Converting to String (Unicode) slows things down too, and uses twice the memory. Decoding UTF-8, which is used here, is not as simple as decoding StandardCharsets.ISO_8859_1.
Best would be if you can work with bytes for the most part and only for specific CSV fields convert them to String.
A memory-mapped file might be the most appropriate. Parallelism might be applied to file ranges, splitting up the file.
try (FileChannel sourceChannel = new RandomAccessFile("source.csv","r").getChannel(); ...
MappedByteBuffer buf = sourceChannel.map(...);
This will take a bit more code, since line boundaries have to be found at (byte)'\n', but it is not overly complex.
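A minimal sketch of the memory-mapped idea, under several assumptions: the whole file fits in a single mapping (less than 2 GB, so a 35 GB file would have to be mapped window by window), "blank" means only spaces, tabs or `\r`, and the class and method names are hypothetical. Lines are found by scanning for the `'\n'` byte and are never converted to String.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedLineFilter {

    /** Copies every non-blank line of source to target; returns the number of lines kept. */
    static long copyNonBlankLines(Path source, Path target) throws IOException {
        long kept = 0;
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(target), 1 << 16)) {
            long size = in.size();
            if (size >= Integer.MAX_VALUE) {
                // A single mapping is limited to 2 GB; larger files need windows.
                throw new IOException("File too large for one mapping: " + size);
            }
            MappedByteBuffer buf = in.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int limit = buf.limit();
            int lineStart = 0;
            for (int pos = 0; pos <= limit; pos++) {
                boolean atEnd = pos == limit;
                if (atEnd || buf.get(pos) == (byte) '\n') {
                    if (!isBlank(buf, lineStart, pos)) {
                        for (int i = lineStart; i < pos; i++) {
                            out.write(buf.get(i));      // buffered, so cheap per byte
                        }
                        out.write('\n');
                        kept++;
                    }
                    lineStart = pos + 1;
                }
            }
        }
        return kept;
    }

    /** True if the range [from, to) holds only spaces, tabs or carriage returns. */
    private static boolean isBlank(ByteBuffer buf, int from, int to) {
        for (int i = from; i < to; i++) {
            byte b = buf.get(i);
            if (b != ' ' && b != '\t' && b != '\r') {
                return false;
            }
        }
        return true;
    }
}
```

To parallelize, each worker could be given its own byte range, with the range boundaries moved forward to the next `'\n'` so that no line is split between workers.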

You can try this:
try (BufferedWriter writer = new BufferedWriter(new FileWriter(targetFile), 1024 * 1024 * 64)) {
try (BufferedReader br = new BufferedReader(new FileReader(sourceFile), 1024 * 1024 * 64)) {
I think it will save you one or two minutes. On my machine the test finishes in about 4 minutes when the buffer size is specified.
Could it be faster? Try this:
final char[] cbuf = new char[1024 * 1024 * 128];
try (Writer writer = new FileWriter(targetFile)) {
try (Reader br = new FileReader(sourceFile)) {
int cnt = 0;
while ((cnt = br.read(cbuf)) > 0) {
// add your code to process/split the buffer into lines.
writer.write(cbuf, 0, cnt);
}
}
}
This should save you three or four minutes.
If that's still not enough (I guess you are asking because you need to execute the task repeatedly) and you want to get it done in one minute or even a couple of seconds, then you should load the data into a database and have multiple servers process the task.

Thanks to all your suggestions, the fastest I came up with was exchanging the writer with BufferedOutputStream, which gave approx 25% improvement:
try (BufferedReader reader = Files.newBufferedReader(Paths.get("sample.csv"))) {
try (BufferedOutputStream writer = new BufferedOutputStream(Files.newOutputStream(Paths.get("target.csv")), 1024 * 16)) {
reader.lines().parallel()
.filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
.forEach(line -> {
writer.write((line + "\n").getBytes());
});
}
}
Still the BufferedReader performs better than BufferedInputStream in my case.
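One detail about the snippets in this thread: `write(...)` throws a checked IOException, so calling it directly inside the `forEach` lambda does not compile as posted. A hedged sketch of one way to handle that is shown here. It is sequential and keeps the line order, it replaces `StringUtils.isNotBlank` with a plain JDK check, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineFilterCopy {

    /** Copies the non-blank lines of source to target, preserving their order. */
    static void copyNonBlankLines(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source);
             BufferedWriter writer = Files.newBufferedWriter(target)) {
            reader.lines()
                  .filter(line -> !line.trim().isEmpty())   // stand-in for the real condition
                  .forEachOrdered(line -> {
                      try {
                          writer.write(line);
                          writer.write('\n');
                      } catch (IOException e) {
                          // Lambdas cannot throw checked exceptions; wrap and rethrow.
                          throw new UncheckedIOException(e);
                      }
                  });
        } catch (UncheckedIOException e) {
            throw e.getCause();                             // unwrap to the original IOException
        }
    }
}
```

Whether `parallel()` helps depends on how expensive the filter is; with a trivial condition the single writer is likely to remain the bottleneck.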

InputStream.available() and reading a file completely: notes from Oracle

According to:
Note that while some implementations of InputStream will return the
total number of bytes in the stream, many will not. It is never
correct to use the return value of this method to allocate a buffer
intended to hold all data in this stream.
from:
http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#available%28%29
and this note
In particular, code of the form
int n = in.available();
byte[] buf = new byte[n];
in.read(buf);
is not guaranteed to read all of the remaining bytes from the given input stream.
http://docs.oracle.com/javase/8/docs/technotes/guides/io/troubleshooting.html
Does this mean that the function below may not read the file completely?
/**
* Reads a file from /raw/res/ and returns it as a byte array
* @param res Resources instance for Mosembro
* @param resourceId ID of resource (ex: R.raw.resource_name)
* @return byte[] if successful, null otherwise
*/
public static byte[] readRawByteArray(Resources res, int resourceId)
{
InputStream is = null;
byte[] raw = new byte[] {};
try {
is = res.openRawResource(resourceId);
raw = new byte[is.available()];
is.read(raw);
}
catch (IOException e) {
e.printStackTrace();
raw = null;
}
finally {
try {
is.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
return raw;
}

available() returns the number of bytes that can be read without blocking. There is no necessary correlation between that number, which can be zero, and the total length of the file.

Yes, it does not necessarily read everything. Compare RandomAccessFile.read(byte[]) as opposed to RandomAccessFile.readFully(byte[]). Furthermore, if available() returns 0, the code physically reads 0 bytes.
On a slow device like a file system, it probably reads only the first block.
The principle:
The file is being read by the underlying system software, normally
buffered, so you have a couple of blocks already in memory, and
sometimes the system is already reading further. The software reads
blocks asynchronously, and blocks if you try to read more than the
system has already read.
So in general the software has a read loop over blocks, and every so often a read operation blocks until the physical read has buffered enough data.
To hope for non-blocking reads you would need to do:
InputStream is = res.openRawResource(resourceId);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
for (;;) {
// Read bytes until no longer available:
for (;;) {
int n = is.available();
if (n == 0) {
break;
}
byte[] part = new byte[n];
int nread = is.read(part);
assert nread == n;
baos.write(part, 0, nread);
}
// Still a probably blocking read:
byte[] part = new byte[128];
int nread = is.read(part);
if (nread <= 0) {
break; // End of file
}
baos.write(part, 0, nread);
}
return baos.toByteArray();
Now, before you copy that code, simply do a blocking read loop immediately. I cannot see an advantage of using available() unless you can do something with partial data while reading the rest.
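The plain blocking read loop recommended here can be sketched as follows. The names are hypothetical; `trickle` is only a demo helper that imitates a stream which reports `available() == 0` and hands out a few bytes per call, to show that the loop does not depend on either.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class BlockingReadLoop {

    /** Reads the stream until end-of-stream (-1), ignoring available() entirely. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);     // only the n bytes actually read
        }
        return out.toByteArray();
    }

    /** Demo helper: at most 3 bytes per read, and available() always returns 0. */
    static InputStream trickle(byte[] data) {
        return new InputStream() {
            private int pos = 0;

            @Override
            public int read() {
                return pos < data.length ? (data[pos++] & 0xFF) : -1;
            }

            @Override
            public int read(byte[] b, int off, int len) {
                if (len == 0) {
                    return 0;
                }
                if (pos >= data.length) {
                    return -1;
                }
                int n = Math.min(3, Math.min(len, data.length - pos));
                System.arraycopy(data, pos, b, off, n);
                pos += n;
                return n;
            }

            @Override
            public int available() {
                return 0;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes();
        System.out.println(readAll(trickle(data)).length);
    }
}
```

With this helper, a function that allocates `new byte[is.available()]` would get an empty array, while the loop still returns all the bytes.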

Why wrap a FileReader with a BufferedReader? [duplicate]

I was trying to read a file into an array by using FileInputStream, and an ~800KB file took about 3 seconds to read into memory. I then tried the same code except with the FileInputStream wrapped into a BufferedInputStream and it took about 76 milliseconds. Why is reading a file byte by byte done so much faster with a BufferedInputStream even though I'm still reading it byte by byte? Here's the code (the rest of the code is entirely irrelevant). Note that this is the "fast" code. You can just remove the BufferedInputStream if you want the "slow" code:
InputStream is = null;
try {
is = new BufferedInputStream(new FileInputStream(file));
int[] fileArr = new int[(int) file.length()];
for (int i = 0, temp = 0; (temp = is.read()) != -1; i++) {
fileArr[i] = temp;
}
BufferedInputStream is over 30 times faster. Far more than that. So, why is this, and is it possible to make this code more efficient (without using any external libraries)?

In FileInputStream, the method read() reads a single byte. From the source code:
/**
* Reads a byte of data from this input stream. This method blocks
* if no input is yet available.
*
* @return the next byte of data, or <code>-1</code> if the end of the
* file is reached.
* @exception IOException if an I/O error occurs.
*/
public native int read() throws IOException;
This is a native call to the OS, which goes to the disk to read the single byte. This is a heavy operation.
With a BufferedInputStream, the method delegates to an overloaded read() method that reads 8192 bytes and buffers them until they are needed. It still returns only the single byte (but keeps the others in reserve). This way the BufferedInputStream makes fewer native calls to the OS to read from the file.
For example, your file is 32768 bytes long. To get all the bytes in memory with a FileInputStream, you will require 32768 native calls to the OS. With a BufferedInputStream, you will only require 4, regardless of the number of read() calls you will do (still 32768).
As to how to make it faster, you might want to consider Java 7's NIO FileChannel class, but I have no evidence to support this.
Note: if you used FileInputStream's read(byte[], int, int) method directly instead, with a byte[>8192] you wouldn't need a BufferedInputStream wrapping it.
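The difference in the number of underlying reads can be made visible with a small counting wrapper. This is a sketch, not a benchmark: `ReadCallCounter` is a hypothetical class, and a ByteArrayInputStream stands in for the file so the example runs without touching the disk.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter extends FilterInputStream {

    long calls = 0;     // number of read calls that reached the wrapped stream

    ReadCallCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }

    /** Reads data byte by byte and reports how many calls hit the underlying stream. */
    static long countCalls(byte[] data, boolean buffered) throws IOException {
        ReadCallCounter counter = new ReadCallCounter(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(counter) : counter;
        while (in.read() != -1) {
            // consume one byte at a time, as in the question
        }
        return counter.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[32768];
        System.out.println("unbuffered calls: " + countCalls(data, false));
        System.out.println("buffered calls:   " + countCalls(data, true));
    }
}
```

Unbuffered, every `read()` reaches the underlying stream; buffered, only a handful of bulk reads do, and with a real FileInputStream each of those calls is a native call.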

A BufferedInputStream wrapped around a FileInputStream will request data from the FileInputStream in big chunks (8192 bytes by default). Thus if you read 1000 bytes one at a time, the FileInputStream will only have to go to the disk once. This will be much faster!

It is because of the cost of disk access. Let's assume you have a file whose size is 8 KB. Without BufferedInputStream, 8 * 1024 disk accesses will be needed to read this file.
At this point, BufferedInputStream comes onto the scene and acts as a middleman between your code and the FileInputStream.
In one shot, it fetches a chunk of bytes (8 KB by default) into memory, and then your code reads bytes from this middleman.
This decreases the time of the operation.
private void exercise1WithBufferedStream() {
long start= System.currentTimeMillis();
try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
BufferedInputStream bufferedInputStream = new BufferedInputStream(myFile);
boolean eof = false;
while (!eof) {
int inByteValue = bufferedInputStream.read();
if (inByteValue == -1) eof = true;
}
} catch (IOException e) {
System.out.println("Could not read the stream...");
e.printStackTrace();
}
System.out.println("time passed with buffered:" + (System.currentTimeMillis()-start));
}
private void exercise1() {
long start= System.currentTimeMillis();
try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
boolean eof = false;
while (!eof) {
int inByteValue = myFile.read();
if (inByteValue == -1) eof = true;
}
} catch (IOException e) {
System.out.println("Could not read the stream...");
e.printStackTrace();
}
System.out.println("time passed without buffered:" + (System.currentTimeMillis()-start));
}

RXTX java, inputStream does not return all the buffer

This is my code, I'm using rxtx.
public void Send(byte[] bytDatos) throws IOException {
this.out.write(bytDatos);
}
public byte[] Read() throws IOException {
byte[] buffer = new byte[1024];
int len = 20;
while(in.available()!=0){
in.read(buffer);
}
System.out.print(new String(buffer, 0, len) + "\n");
return buffer;
}
The rest of the code is just the same as this; I just changed 2 things.
InputStream in = serialPort.getInputStream();
OutputStream out = serialPort.getOutputStream();
They are global variables now and...
(new Thread(new SerialReader(in))).start();
(new Thread(new SerialWriter(out))).start();
no longer exist.
I'm sending this every second:
Send(("123456789").getBytes());
And this is what I get:
123456789123
456789
123456789
1234567891
23456789
Can anybody help me?
EDIT
Later, I found a better way to solve it. Thanks; this is the Read code:
public byte[] Read(int intEspera) throws IOException {
try {
Thread.sleep(intEspera);
} catch (InterruptedException ex) {
Logger.getLogger(COM_ClComunica.class.getName()).log(Level.SEVERE, null, ex);
}//*/
byte[] buffer = new byte[528];
int len = 0;
while (in.available() > 0) {
len = in.available();
in.read(buffer,0,528);
}
return buffer;
}
It was impossible for me to remove that sleep, but it is not a problem, so thanks, veer.

You should indeed note that InputStream.available is defined as follows...
Returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream. The next invocation might be the same thread or another thread. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.
As you can see, this is not what you expected. Instead, you want to check for end-of-stream, which is indicated by InputStream.read() returning -1.
In addition, since you don't remember how much data you have already read in prior iterations of your read loop, you are potentially overwriting prior data in your buffer, which is again not something you likely intended.
What you appear to want is something like the following:
private static final int MESSAGE_SIZE = 20;
public byte[] read() throws IOException {
final byte[] buffer = new byte[MESSAGE_SIZE];
int total = 0;
int read = 0;
while (total < MESSAGE_SIZE
&& (read = in.read(buffer, total, MESSAGE_SIZE - total)) >= 0) {
total += read;
}
return buffer;
}
This should force it to read up to 20 bytes, less in the case of reaching the end of the stream.
Special thanks to EJP for reminding me to maintain the quality of my posts and make sure they're correct.

Get rid of the available() test. All it is doing is telling you whether there is data ready to be read without blocking. That isn't the same thing as telling you where an entire message ends. There are few correct uses for available(), and this isn't one of them.
And advance the buffer pointer when you read. You need to keep track of how many bytes you have read so far, and use that as the 2nd parameter to read(), with buffer.length minus that count as the third parameter.
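The offset bookkeeping described here can be sketched like this, assuming fixed-size messages. `MessageReader` and `messageSize` are hypothetical names, and the returned array is trimmed so the caller can tell how many bytes are valid if the stream ends early.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class MessageReader {

    /**
     * Reads up to messageSize bytes, continuing where the previous read left
     * off. Returns fewer bytes only if the stream ends first.
     */
    static byte[] readMessage(InputStream in, int messageSize) throws IOException {
        byte[] buffer = new byte[messageSize];
        int total = 0;                                  // bytes read so far = next offset
        while (total < messageSize) {
            int n = in.read(buffer, total, messageSize - total);
            if (n < 0) {
                break;                                  // end of stream
            }
            total += n;
        }
        return total == messageSize ? buffer : Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("123456789123456789".getBytes());
        System.out.println(new String(readMessage(in, 9)));
        System.out.println(new String(readMessage(in, 9)));
    }
}
```

One caveat, stated as an assumption: if the underlying stream can return 0 from `read` (for example on a receive timeout), this loop keeps polling, so a timeout or retry limit may be needed for a serial port.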

Trying to upload in chunks

I am trying to accomplish a large file upload on a BlackBerry. I am successfully able to upload a file, but only if I read the file and upload it 1 byte at a time. For large files I think this is decreasing performance. I want to be able to read and write something more like 128 KB at a time. If I try to initialise my buffer to anything other than 1, then I never get a response back from the server after writing everything.
Any ideas why I can upload using only 1 byte at a time?
z.write(boundaryMessage.toString().getBytes());
DataInputStream fileIn = fc.openDataInputStream();
boolean isCancel = false;
byte[]b = new byte[1];
int num = 0;
int left = buffer;
while((fileIn.read(b)>-1))
{
num += b.length;
left = buffer - num * 1;
Log.info(num + "WRITTEN");
if (isCancel == true)
{
break;
}
z.write(b);
}
z.write(endBoundary.toString().getBytes());

It's a bug in BlackBerry OS that appeared in OS 5.0, and persists in OS 6.0. If you try using a multi-byte read before OS 5, it will work fine. OS5 and later produce the behavior you have described.
You can also get around the problem by creating a secure connection, as the bug doesn't manifest itself for secure sockets, only plain sockets.

Most input streams aren't guaranteed to fill a buffer on every read. (DataInputStream has a special method for this, readFully(), which will throw an EOFException if there aren't enough bytes left in the stream to fill the buffer.) And unless the file is a multiple of the buffer length, no stream will fill the buffer on the final read. So, you need to store the number of bytes read and use it during the write:
while(!isCancel)
{
int n = fileIn.read(b);
if (n < 0)
break;
num += n;
Log.info(num + "WRITTEN");
z.write(b, 0, n);
}
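The readFully() behaviour mentioned above can be seen in a small plain Java SE sketch (not BlackBerry-specific). `ReadFullyDemo` and `tryFill` are hypothetical names, and the stream is an in-memory stand-in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

final class ReadFullyDemo {

    /** Returns true if the buffer was filled completely, false if the stream ended first. */
    static boolean tryFill(DataInputStream in, byte[] buffer) throws IOException {
        try {
            in.readFully(buffer);       // blocks until the buffer is full or EOF
            return true;
        } catch (EOFException e) {
            return false;               // not enough bytes left to fill the buffer
        }
    }

    public static void main(String[] args) throws IOException {
        // Five bytes of data, read with a 4-byte buffer.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[5]));
        System.out.println(tryFill(in, new byte[4]));
        System.out.println(tryFill(in, new byte[4]));
    }
}
```

Because readFully() fails on a short final chunk, it suits fixed-size records; for copying a file of arbitrary length, the loop that uses the return value of read() is the better fit.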

Your loop isn't correct. You should take care of the return value from read. It returns how many bytes were actually read, and that isn't always the same as the buffer size.
Edit:
This is how you usually write loops that do what you want to do:
OutputStream z = null; //Shouldn't be null
InputStream in = null; //Shouldn't be null
byte[] buffer = new byte[1024 * 32];
int len = 0;
while ((len = in.read(buffer)) > -1) {
z.write(buffer, 0, len);
}
Note that you might want to use buffered streams instead of unbuffered streams.
