Read S3 Object and write into InMemory Buffer - java

I am trying to read an object from S3 and write it into an in-memory buffer like this:
def inMemoryDownload(bucketName: String, key: String): String = {
val s3Object = s3client.getObject(new GetObjectRequest(bucketName, key))
val s3Stream = s3Object.getObjectContent()
val outputStream = new ByteArrayOutputStream()
val buffer = new Array[Byte](10* 1024)
var bytesRead:Int =s3Stream.read(buffer)
while (bytesRead > -1) {
info("writing.......")
outputStream.write(buffer)
info("reading.......")
bytesRead = s3Stream.read(buffer)
}
val data = new String(outputStream.toByteArray)
outputStream.close()
s3Object.getObjectContent.close()
data
}
But it gives me a heap space error, even though the file on S3 is only 4 MB.

You should be writing only the bytes you just read into the stream. As written, your code writes the entire buffer on every iteration. I doubt that is the cause of your memory problem, but it could be. Imagine that read returns a single byte every time and you write 10K into the stream: for a 4 MB file, that's 40 GB right there.
Another problem: I am not 100% sure, but I suspect that getObjectContent creates a new input stream every time it is called. If so, calling it inside the loop means you just keep reading the same bytes over and over again. You should put it into a variable instead.
Also, if I may make a suggestion, try rewriting your code in actual Scala, not just syntactically but idiomatically. Avoid mutable state and use functional transformations. If you are going to write Scala code, you might as well take some time to get into the right mindset. You'll grow to appreciate it eventually, I promise :)
Something like this, perhaps?
val input = s3Object.getObjectContent
Stream
.continually(input.read(buffer))
.takeWhile(_ > 0)
.foreach { output.write(buffer, 0, _) }
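Since the question is tagged java, here is roughly the same fix as a plain Java loop. This is only a sketch: the class and method names are made up for illustration, the S3 stream is replaced by any InputStream, and decoding as UTF-8 is my assumption (the original code used the platform default charset).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the AWS SDK.
class InMemoryDownload {
    // Copies everything from `in` into memory, writing only the bytes
    // that each read() call actually returned.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[10 * 1024];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead); // not out.write(buffer)
        }
        return out.toByteArray();
    }

    // Assumes the object is UTF-8 text.
    static String readAllAsString(InputStream in) throws IOException {
        return new String(readAll(in), StandardCharsets.UTF_8);
    }
}
```

You would pass it the stream you got from s3Object.getObjectContent(), held in a variable, and close that stream afterwards.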

Related

FileInputStream and DataOutputStream - handling byte[] buffer

I've been working on an app to move files between two hosts. I got the transfer process to work (the code is still really messy, sorry for that, I'm still fixing it), but I'm left wondering how exactly it handles the buffer. I'm fairly new to networking in Java, so I don't want to end up with a "meh, I got it to work so let's move on" attitude.
File sending code.
public void sendFile(String filepath, DataOutputStream dos) throws Exception{
if (new File(filepath).isFile()&&dos!=null){
long size = new File(filepath).length();
String strsize = Long.toString(size) +"\n";
//System.out.println("File size in bytes: " + strsize);
outToClient.writeBytes(strsize);
FileInputStream fis = new FileInputStream(filepath);
byte[] filebuffer = new byte[8192];
while(fis.read(filebuffer) > 0){
dos.write(filebuffer);
dos.flush();
}
File receiving code
public void saveFile() throws Exception{
String size = inFromServer.readLine();
long longsize = Long.parseLong(size);
//System.out.println(longsize);
String tmppath = currentpath + "\\" + tmpdownloadname;
DataInputStream dis = new DataInputStream(clientSocket.getInputStream());
FileOutputStream fos = new FileOutputStream(tmppath);
byte[] filebuffer = new byte[8192];
int read = 0;
int remaining = (int)longsize;
while((read = dis.read(filebuffer, 0, Math.min(filebuffer.length, remaining))) > 0){
//System.out.println(Math.min(filebuffer.length, remaining));
//System.out.println(read);
//System.out.println(remaining);
remaining -= read;
fos.write(filebuffer,0, read);
}
}
I'd like to know how exactly the buffers on both sides are handled, so that I avoid writing wrong bytes. (I know how the receiving code avoids that, but I'd still like to know how the byte array is handled.)
Does fis/dis always wait for the buffer to fill up completely? The receiving code always asks for either a full array or the remaining length if that is less than filebuffer.length, but what about fis in the sending code?
In fact, your code could have a subtle bug, exactly because of the way you handle buffers.
When you read a buffer from the original file, the read(byte[]) method returns the number of bytes actually read. There is no guarantee that, in fact, all 8192 bytes have been read.
Suppose you have a file with 10000 bytes. Your first read operation reads 8192 bytes. Your second read operation, however, will only read 1808 bytes. The third operation will return -1.
In the first read, you write exactly the bytes that you have read, because you read a full buffer. But in the second read, your buffer actually contains 1808 correct bytes, and the remaining 6384 bytes are wrong - they are still there from the previous read.
In this case you are lucky, because this only happens in the last buffer that you write. Thus, the fact that you stop reading on your client side when you reach the pre-sent length causes you to skip those 6384 wrong bytes which you shouldn't have sent anyway.
But in fact, there is no actual guarantee that reading from the file will return 8192 bytes even if the end was not reached yet. The method's contract does not guarantee that, and it's up to the OS and underlying file system. It could, for example, send you 5000 bytes in your first read, and 5000 in your second read. In this case, you would be sending 3192 wrong bytes in the middle of the file.
Therefore, your code should actually look like:
byte[] filebuffer = new byte[8192];
int read = 0;
while(( read = fis.read(filebuffer)) > 0){
dos.write(filebuffer,0,read);
dos.flush();
}
much like the code you have on the receiving side. This guarantees that only the actual bytes read will be written.
So there is nothing actually magical about the way buffers are handled. You give the stream a buffer, you tell it how much of the buffer it's allowed to fill, but there is no guarantee it will fill all of it. It may fill less and you have to take care and use only the portion it tells you it fills.
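To make the 5000 + 5000 scenario above concrete, here is a small sketch that compares the two loops. Everything in it is made up for illustration: the chunked() helper is a stand-in for a file system that hands back at most 5000 bytes per read, and the class and method names are mine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ShortReadDemo {
    // Hypothetical stream that never returns more than maxChunk bytes per read.
    static InputStream chunked(final byte[] data, final int maxChunk) {
        return new InputStream() {
            private int pos = 0;
            @Override public int read() {
                return pos < data.length ? (data[pos++] & 0xFF) : -1;
            }
            @Override public int read(byte[] b, int off, int len) {
                if (len == 0) return 0;
                if (pos >= data.length) return -1;
                int n = Math.min(Math.min(len, maxChunk), data.length - pos);
                System.arraycopy(data, pos, b, off, n);
                pos += n;
                return n;
            }
        };
    }

    // The loop from the question: always writes the whole buffer.
    static byte[] copyWholeBuffer(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (in.read(buffer) > 0) {
            out.write(buffer);
        }
        return out.toByteArray();
    }

    // The corrected loop: writes only what was read.
    static byte[] copyBytesRead(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) > 0) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}
```

With a 10000-byte input and chunks of 5000, the first loop produces 16384 bytes (two full buffers, each padded with stale or zero bytes), while the second produces exactly the 10000 bytes that went in.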
Another grave mistake you are making, though, is to just convert the long that you received into an int in this line:
int remaining = (int)longsize;
Files may be larger than an int can represent (anything over about 2 GB), especially things like long videos. This is why you get that number as a long in the first place. Don't truncate it like that. Keep remaining as a long and convert it to int only after you have taken the minimum (because you know the minimum will always be in the range of an int).
long remaining = longsize;
long fileBufferLen = filebuffer.length;
while((read = dis.read(filebuffer, 0, (int)Math.min(fileBufferLen, remaining))) > 0){
...
}
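Filled in, the receiving loop could look something like the sketch below. The class and method names are hypothetical, and throwing EOFException when the stream ends early is my choice, not something from the original code (which simply stops).

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class Receiver {
    // Copies exactly `size` bytes from in to out, keeping the counter as a long.
    static void receive(InputStream in, OutputStream out, long size) throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = size;
        while (remaining > 0) {
            // The minimum never exceeds buffer.length, so the cast is safe.
            int toRead = (int) Math.min((long) buffer.length, remaining);
            int read = in.read(buffer, 0, toRead);
            if (read == -1) {
                throw new EOFException("stream ended with " + remaining + " bytes still expected");
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
    }
}
```

Because it never asks for more than remaining bytes, whatever follows the file on the socket stays unread for the next message.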
By the way, there is no real reason to use a DataOutputStream and DataInputStream for this. The read(byte[]), read(byte[],int,int), write(byte[]), and write(byte[],int,int) methods are inherited from the underlying InputStream and OutputStream classes, so there is no reason not to use the socket's OutputStream/InputStream directly, or to wrap them in a BufferedOutputStream/BufferedInputStream. There is also no need to call flush until you have finished writing.
Also, do not forget to close at least your file input/output streams when you are done with them. You may want to keep the socket input/output streams open for continued communication, but there is no need to keep the files themselves open, it may cause problems. Use a try-with-resources to guarantee that they are closed.
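As a rough sketch of the sending side with these points applied (the FileSender class and its sendFile signature are made up; the length header from the question is left out to keep it short): the file stream is opened in a try-with-resources block, so it is closed even if a write fails, while the socket stream is left open and flushed once at the end.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class FileSender {
    // Streams the file to socketOut and returns the number of bytes sent.
    // socketOut is deliberately not closed here.
    static long sendFile(Path file, OutputStream socketOut) throws IOException {
        long sent = 0;
        try (InputStream fileIn = new BufferedInputStream(Files.newInputStream(file))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = fileIn.read(buffer)) != -1) {
                socketOut.write(buffer, 0, read);
                sent += read;
            }
        } // fileIn is closed here, whether or not an exception was thrown
        socketOut.flush(); // one flush after the whole file
        return sent;
    }
}
```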

How can I prevent memory leaks in a Scala code?

I have the code below, which first reads a file and puts its contents into a HashMap (indexCategoryVectors). The HashMap maps a String (key) to a Long (value). The code uses the Long value to seek to a specific position in another file with RandomAccessFile.
Using the information read from that second file, plus some manipulation, the code writes new information to another file (filename4). The only variable that accumulates information is the buffer (var buffer = new ArrayBuffer[Map[Int, Double]]()), but the buffer is cleared after each iteration (buffer.clear).
The foreach should run more than 4 million times, and I am seeing memory accumulate. I tested the code with one million iterations and it used more than 32 GB of memory. I don't know the reason for that; maybe it's related to garbage collection or something else in the JVM. Does anybody know what I can do to prevent this memory leak?
def main(args: Array[String]): Unit = {
val indexCategoryVectors = getIndexCategoryVectors("filename1")
val uriCategories = getMappingURICategories("filename2")
val raf = new RandomAccessFile("filename3", "r")
var buffer = new ArrayBuffer[Map[Int, Double]]()
// Through each hashmap key.
uriCategories.foreach(uri => {
var emptyInterpretation = true
uri._2.foreach(categoria => {
val position = indexCategoryVectors.get(categoria)
// go to position
raf.seek(position.get)
var vectorSpace = parserVector(raf.readLine)
buffer += vectorSpace
//write the information of buffer in file
writeInformation("filename4")
buffer.clear
}
})
})
println("Success!")
}

Why CodedInputStream.readRawVarint64() is reading all the bytes from underlying stream?

Here is a sample code demonstrating the problem.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
CodedOutputStream cos = CodedOutputStream.newInstance(bos);
cos.writeRawVarint64(25);
cos.flush();
bos.write("something else".getBytes());
System.out.println("size(bos) = " + bos.size()); // This gives 15
ByteArrayInputStream bis = new ByteArrayInputStream(bos.toByteArray());
CodedInputStream cis = CodedInputStream.newInstance(bis);
System.out.println("size(bis) = " + bis.available()); // This gives 15
long l = cis.readRawVarint64();
System.out.println(cis.getTotalBytesRead()); // This gives 1, which is correct
System.out.println("Raw varint64 = " + l); // This gives 25, which is correct
System.out.println("size(bis) = " + bis.available()); // This now gives 0!!
All I am trying to do is encode a 64-bit integer and add some more data to the payload. I can read the encoded data correctly, but for some reason doing so drains the underlying stream. Does anyone know why this is happening? How can I read the varint from the stream and then read the remaining bytes, as indicated by the varint?
Any help would be great
I have no idea what CodedInputStream does, but it could very well buffer the input, meaning it reads, e.g., 100 bytes at a time.
Either way, you should not wrap an input stream B around an input stream A and then continue reading from A, precisely because you don't know what B does.
For instance, B may need to look ahead in the data to reach some conclusion, or it may use buffering, or...
Additional note: available() is usually a bad idea, though it should work correctly on a ByteArrayInputStream specifically.
EDIT:
In conclusion: just continue reading from the CodedInputStream; don't try to read from the underlying one.
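You can see the same effect with nothing but the JDK. The sketch below is an analogy, not CodedInputStream itself: it uses BufferedInputStream as the wrapper B, and the class and method names are made up. Reading a single byte through the wrapper pulls the whole 15-byte payload out of the underlying stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

class ReadAheadDemo {
    // Returns { underlying.available() before the read,
    //           underlying.available() after the read,
    //           wrapper.available() after the read }.
    static int[] availableBeforeAndAfter(byte[] payload) throws IOException {
        ByteArrayInputStream underlying = new ByteArrayInputStream(payload);
        BufferedInputStream wrapper = new BufferedInputStream(underlying);
        int before = underlying.available();
        wrapper.read(); // consume ONE byte through the wrapper
        int afterUnderlying = underlying.available();
        int afterWrapper = wrapper.available();
        return new int[] { before, afterUnderlying, afterWrapper };
    }
}
```

For a 15-byte payload the underlying stream reports 15 before and 0 after, while the wrapper still has the other 14 bytes in its own buffer. Nothing is lost; the data just lives in the wrapper now, which is why you should keep reading from it.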

Load text file to memory in Java

I have wiki.txt file and its size is 50 MB.
I need to do several things with the file, so I thought that the best way in terms of performance would be to load the file into memory. Is that correct?
This is the code that I wrote:
File file = new File("wiki.txt");
FileInputStream fileInputStream = new FileInputStream(file);
FileChannel fileChannel = fileInputStream.getChannel();
MappedByteBuffer mapByteBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length());
System.out.println((char)mapByteBuffer.get());
I get an error on this line: mapByteBuffer.get().
I tried a few variants of the get() function, but all of them fail, and e.getMessage() doesn't give me a message; I just get null.
Another important thing to note: my text file contains English words, and what I need to do is check whether a given expression exists in the file.
Thank you.
I would suggest using a memory-mapped file (MappedByteBuffer) to read the file directly from the disk instead of loading it into memory.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// The file is opened read-only, so the mapping must be READ_ONLY as well; map the whole file.
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
And then you can read the buffer as usual.
My answers for point (1):
It depends on what you want to do with the file. If your processing doesn't involve rewinding (looking back at what was read before), it's best to just read the file as a stream and process it in one go instead of loading it all into memory.
Even if you need random access across the file, you may also be interested in processing the file in blocks, because your solution may not scale well when the file grows.
RandomAccessFile if you are on Java 1.4 or above.
For random access, the operating system usually handles file buffer caching quite well, so you don't have to handle it yourself.
It is important to read the whole error, not just the message. Often the real information is in the exception's name not the text associated with it.
You will get an error if the file is empty as there is no first byte.
Note: the approach you are using assumes ASCII 7-bit characters. If you want to assume ISO-8859-1 characters you can use (char) (byteBuffer.get() & 0xFF)
However, if you have plain text, you may find that strings are simpler to use and not much slower, e.g. you can read a 50 MB file as text in less than a second. I would only use a memory-mapped file if this takes far too long.
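A minimal sketch of the string approach, assuming the file is UTF-8 encoded (the question doesn't say) and using made-up class and method names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextSearch {
    // Loads the whole file into one String. Assumes UTF-8.
    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // True if the expression occurs anywhere in the text (case-sensitive).
    static boolean containsExpression(String text, String expression) {
        return text.contains(expression);
    }
}
```

You load once with TextSearch.load(Paths.get("wiki.txt")) and then run as many searches as you like against the resulting string.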
I would suggest using BufferedReader. It is much faster and requires relatively fewer resources.
First count the number of lines:
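InputStream is = new BufferedInputStream(new FileInputStream(filename));
byte[] chars = new byte[1024];
int count = 0;
int numberOfChars = 0;
boolean endsWithNewline = true; // stays true for an empty file
while ((numberOfChars = is.read(chars)) != -1)
{
for (int i = 0; i < numberOfChars; ++i)
{
if (chars[i] == '\n')
{
++count;
}
}
if (numberOfChars > 0)
{
// remember whether the data read so far ends with a newline
endsWithNewline = chars[numberOfChars - 1] == '\n';
}
}
is.close();
if (!endsWithNewline)
{
count++; // the last line has no trailing newline
}
return count; // number of lines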
Then read the lines:
BufferedReader in = new BufferedReader(new FileReader(fileName));
for (int i = 0; i < endLine; i++)
{
String oneLine = in.readLine();
}
In these strings you can then search for what you need.

Reading and writing binary file in Java (seeing half of the file being corrupted)

I have some working code in python that I need to convert to Java.
I have read quite a few threads on this forum but could not find an answer. I am reading in a JPG image and converting it into a byte array. I then write this buffer to a different file. When I compare the files written by the Java and Python code, the bytes at the end do not match. Please let me know if you have a suggestion. I need to use the byte array to pack the image into a message that will be sent to a remote server.
Java code (Running on Android)
Reading the file:
File queryImg = new File(ImagePath);
int imageLen = (int)queryImg.length();
byte [] imgData = new byte[imageLen];
FileInputStream fis = new FileInputStream(queryImg);
fis.read(imgData);
Writing the file:
FileOutputStream f = new FileOutputStream(new File("/sdcard/output.raw"));
f.write(imgData);
f.flush();
f.close();
Thanks!
InputStream.read is not guaranteed to read any particular number of bytes and may read less than you asked it to. It returns the actual number read so you can have a loop that keeps track of progress:
public void pump(InputStream in, OutputStream out, int size) throws IOException {
byte[] buffer = new byte[4096]; // Or whatever constant you feel like using
int done = 0;
while (done < size) {
int read = in.read(buffer);
if (read == -1) {
throw new IOException("Something went horribly wrong");
}
out.write(buffer, 0, read);
done += read;
}
// Maybe put cleanup code in here if you like, e.g. in.close, out.flush, out.close
}
I believe Apache Commons IO has classes for doing this kind of stuff so you don't need to write it yourself.
Your file length might be more than an int can hold, and then you end up with the wrong array length and do not read the entire file into the buffer.
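If you want the whole image in one byte array, as in the question, one option is DataInputStream.readFully, which keeps reading until the array is full and throws EOFException if the stream ends first. A small sketch (the ImageLoader class and readExactly method are made-up names):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ImageLoader {
    // Reads exactly `length` bytes, looping internally over short reads.
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        new DataInputStream(in).readFully(data);
        return data;
    }
}
```

In the question's code that would replace the single fis.read(imgData) call, e.g. imgData = ImageLoader.readExactly(fis, imageLen). This still assumes the file fits in an int-sized array.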
