I have the code below, which first reads a file and puts that information into a HashMap (indexCategoryVectors). The HashMap contains a String (key) and a Long (value). The code uses the Long value to seek to a specific position in another file with RandomAccessFile.
Using the information read from that file, plus some manipulation, the code writes new information to another file (filename4). The only variable that accumulates information is the buffer (var buffer = new ArrayBuffer[Map[Int, Double]]()), but after each iteration the buffer is cleared (buffer.clear).
The foreach should run more than 4 million times, and what I'm seeing is memory accumulating. I tested the code with one million iterations and it used more than 32 GB of memory. I don't know the reason; maybe it's related to garbage collection or something else in the JVM. Does anybody know what I can do to prevent this memory leak?
def main(args: Array[String]): Unit = {
val indexCategoryVectors = getIndexCategoryVectors("filename1")
val uriCategories = getMappingURICategories("filename2")
val raf = new RandomAccessFile("filename3", "r")
var buffer = new ArrayBuffer[Map[Int, Double]]()
// Iterate over each HashMap key.
uriCategories.foreach(uri => {
var emptyInterpretation = true
uri._2.foreach(categoria => {
val position = indexCategoryVectors.get(categoria)
// go to position
raf.seek(position.get)
var vectorSpace = parserVector(raf.readLine)
buffer += vectorSpace
// write the buffer's contents to filename4
writeInformation("filename4")
buffer.clear
})
})
println("Success!")
}
I have an application that lets users get data from a database and download it as a CSV file.
The general workflow is as follows:
1. The user clicks the download button on the frontend.
2. The backend (Spring Boot in this case) starts an async thread to get the data from the database.
3. Generate CSV files with the data from step (2) and upload them to Google Cloud Storage.
4. Send the user an email with a signed URL to download the data.
My problem is that the backend keeps throwing an "OOM Java heap space" error in some extreme cases. In the extreme case, all my memory (4 GB) was filled. My initial plan was to load data from the database via pagination (not all at once, to save memory) and generate a CSV for each page of data. That way, GC could reclaim the memory once each CSV was generated, keeping overall memory usage low. However, in practice memory keeps increasing until it is all used up; the GC does not work as expected. In the extreme case there are 18 pages in total, with around 200,000 records (from the DB) per page.
I used JProfiler to monitor heap usage and found that the retained size of those large byte[] objects is not 0, which suggests there are still references to them (I guess that's why GC does not clear them from memory as expected).
How should I optimize my code and VM settings so that memory usage stays below 1 GB in the extreme case? What keeps those large byte[] objects from being cleared by GC as expected?
The code to get data from the database and generate the CSV files:
@Override
@Async
@Transactional(timeout = DOWNLOAD_DATA_TRANSACTION_TIME_LIMIT)
public void startDownloadDataInCSVBySearchQuery(SearchQuery query, DownloadRequestRecord downloadRecord) throws IOException {
logger.debug(Thread.currentThread().getName() + ": starts to process download data");
String username = downloadRecord.getUsername();
// get posts from database first
List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
// get ids of posts
List<String> postsIDs = this.getPostsIDsFromPosts(posts);
int postsSize = postsIDs.size();
// do pagination db search. For each page, there are 1500 posts
int numPages = postsSize / POSTS_COUNT_PER_PAGE + 1;
for (int i = 0; i < numPages; i++) {
logger.debug("Download comments: start at page {}, out of total page {}", i + 1, numPages);
int pageStartPos = i * POSTS_COUNT_PER_PAGE; // this is set to 1500
int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
// get post ids per page
List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
// use posts ids to get corresponding comments from db, via sql "IN"
List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
// generate csv file for page data and upload to google cloud
String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";
this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, out);
this.googleCloudStorageInstance.uploadDownloadOutputStreamData(out.toByteArray(), commentsFileName);
}
} catch (Exception ex) {
logger.error("Exception from downloading data: ", ex);
}
}
The code to generate a CSV file:
// use Apache csv
public void generateCommentsCsvFileStream(List<Comment> comments, String filename, ByteArrayOutputStream out) throws IOException {
CSVPrinter csvPrinter = new CSVPrinter(new OutputStreamWriter(out), CSVFormat.DEFAULT.withHeader(PostHeaders.class).withQuoteMode(QuoteMode.MINIMAL));
for (Comment comment: comments) {
List<Object> record = Arrays.asList(
// write csv content
comment.getPageId(),
...
);
csvPrinter.printRecord(record);
}
// close printer to release memory
csvPrinter.flush();
csvPrinter.close();
}
The code to upload a file to Google Cloud Storage:
public Blob uploadDownloadOutputStreamData(byte[] fileStream, String filename) {
logger.debug("Upload file: '{}' to google cloud storage", filename);
BlobId blobId = BlobId.of(this.DownloadDataBucketName, filename);
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
return this.cloudStorage.create(blobInfo, fileStream);
}
The heap usage keeps increasing as the page number increases, and the G1 Old Gen usage is still very high after the system crashes.
The G1 Eden space is almost empty; the big byte arrays go directly into the Old Gen.
Old Gen GC activity is low; most GC activity happens in the Eden space.
The heap walker shows that the retained size of those big byte[] objects is not 0.
You're using a single instance of ByteArrayOutputStream, which just writes to an in-memory byte array.
That looks like a mistake: you seem to want to upload one page at a time, not the accumulated result so far (which includes ALL pages).
By the way, doing this is useless:
try (ByteArrayOutputStream out = new ByteArrayOutputStream())
ByteArrayOutputStream does not need to be closed, since it lives entirely in memory, so just remove the try-with-resources. Also, create a new instance for each page (inside the pages for loop) instead of reusing the same instance for all pages, and it might just work fine.
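To make that concrete, here is a minimal sketch of the restructured loop, reusing the names from the question (commentsService, csvUtil, googleCloudStorageInstance, POSTS_COUNT_PER_PAGE); the pageOut variable name is just illustrative:

// Sketch only: each page gets its own ByteArrayOutputStream, so a page's bytes
// become unreachable (and collectable) as soon as its upload finishes.
for (int i = 0; i < numPages; i++) {
    int pageStartPos = i * POSTS_COUNT_PER_PAGE;
    int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
    List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
    List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
    String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";

    // fresh buffer per page instead of one shared, ever-growing buffer
    ByteArrayOutputStream pageOut = new ByteArrayOutputStream();
    this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, pageOut);
    this.googleCloudStorageInstance.uploadDownloadOutputStreamData(pageOut.toByteArray(), commentsFileName);
}

Since generateCommentsCsvFileStream already creates and closes its own CSVPrinter on every call, nothing carries over between iterations once each page has its own buffer.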
EDIT
Another piece of advice would be to break this code up into more methods, not just because smaller methods are more readable, but because you're keeping temporary variables in scope for too long (causing memory to stick around longer than needed).
For example:
List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
// get ids of posts
List<String> postsIDs = this.getPostsIDsFromPosts(posts);
....
From this point on, posts is not used anymore, and I assume it contains a lot of data... so you should "drop" that variable once you have the IDs.
Do something like this instead:
List<String> postsIDs = getAllPostIds(query);
....
List<String> getAllPostIds(SearchQuery query) {
// this variable will be GC'd after this method returns as it's no longer referenced (assuming getPostIDsFromPosts() doesn't store it in a field)
List<? extends SocialPost> posts = this.postsService.getPosts(query);
return this.getPostsIDsFromPosts(posts);
}
So, imagine that I have a Scala Vert.x Web REST API that receives file uploads via HTTP multipart requests. However, it doesn't receive the incoming file data as a single InputStream. Instead, each file is received as a series of byte buffers handed over via a few callback functions.
The callbacks basically look like this:
// the callback that receives byte buffers (chunks) of the file being uploaded
// it is called multiple times until the full file has been received
upload.handler { buffer =>
// send chunk to backend
}
// the callback that gets called after the full file has been uploaded
// (i.e. after all chunks have been received)
upload.endHandler { _ =>
// do something after the file has been uploaded
}
// callback called if an exception is raised while receiving the file
upload.exceptionHandler { e =>
// do something to handle the exception
}
Now, I'd like to use these callbacks to save the file into a MinIO bucket (MinIO, if you're unfamiliar, is basically self-hosted S3, and its API is pretty much the same as the S3 Java API).
Since I don't have a file handle, I need to use putObject() to put an InputStream into MinIO.
The inefficient work-around I'm currently using with the MinIO Java API looks like this:
// this is all inside the context of handling a HTTP request
val out = new PipedOutputStream()
val in = new PipedInputStream()
var size = 0
in.connect(out)
upload.handler { buffer =>
out.write(buffer.getBytes)
size += buffer.length()
}
upload.endHandler { _ =>
minioClient.putObject(
PutObjectArgs.builder()
.bucket("my-bucket")
.object("my-filename")
.stream(in, size, 50000000)
.build())
}
Obviously, this isn't optimal. Since I'm using a simple java.io stream here, the entire file ends up getting loaded into memory.
I don't want to save the File to disk on the server before putting it into object storage. I'd like to put it straight into my object storage.
How could I accomplish this using the S3 API and a series of byte buffers given to me via the upload.handler callback?
EDIT
I should add that I am using MinIO because I cannot use a commercially-hosted cloud solution, like S3. However, as mentioned on MinIO's website, I can use Amazon's S3 Java SDK while using MinIO as my storage solution.
I attempted to follow this guide on Amazon's website for uploading objects to S3 in chunks.
The solution I attempted looks like this:
context.request.uploadHandler { upload =>
println(s"Filename: ${upload.filename()}")
val partETags = new util.ArrayList[PartETag]
val initRequest = new InitiateMultipartUploadRequest("docs", "my-filekey")
val initResponse = s3Client.initiateMultipartUpload(initRequest)
upload.handler { buffer =>
println("uploading part", buffer.length())
try {
val request = new UploadPartRequest()
.withBucketName("docs")
.withKey("my-filekey")
.withPartSize(buffer.length())
.withUploadId(initResponse.getUploadId)
.withInputStream(new ByteArrayInputStream(buffer.getBytes()))
val uploadResult = s3Client.uploadPart(request)
partETags.add(uploadResult.getPartETag)
} catch {
case e: Exception => println("Exception raised: ", e)
}
}
// this gets called for EACH uploaded file sequentially
upload.endHandler { _ =>
// upload successful
println("done uploading")
try {
val compRequest = new CompleteMultipartUploadRequest("docs", "my-filekey", initResponse.getUploadId, partETags)
s3Client.completeMultipartUpload(compRequest)
} catch {
case e: Exception => println("Exception raised: ", e)
}
context.response.setStatusCode(200).end("Uploaded")
}
upload.exceptionHandler { e =>
// handle the exception
println("exception thrown", e)
}
}
This works for files that are small (my test small file was 11 bytes), but not for large files.
In the case of large files, the processing inside upload.handler gets progressively slower as the file continues to upload. Also, upload.endHandler is never called, and the file somehow continues uploading after 100% of it has been received.
However, as soon as I comment out the s3Client.uploadPart(request) portion inside upload.handler and the s3Client.completeMultipartUpload parts inside upload.endHandler (basically throwing away the file instead of saving it to object storage), the file upload progresses as normal and terminates correctly.
I figured out what I was doing wrong (when using the S3 client). I was not accumulating bytes inside my upload.handler. I need to accumulate bytes until the buffer size is big enough to upload a part, rather than upload each time I receive a few bytes.
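For reference, here is a rough, untested sketch of what that accumulation could look like if you stayed with the S3 client from the attempt above, written in plain Java against the same Vert.x and AWS SDK v1 calls (s3Client and upload are assumed from the surrounding code; the partBuffer accumulator, the 5 MB threshold, and the explicit withPartNumber calls are the additions):

// Sketch only: accumulate incoming Vert.x buffers until at least 5 MB is collected,
// upload that chunk as one S3 part, then flush the remainder as the final part.
List<PartETag> partETags = new ArrayList<>();
ByteArrayOutputStream partBuffer = new ByteArrayOutputStream(); // illustrative accumulator
int[] nextPartNumber = {1};                                     // mutable counter usable inside lambdas
InitiateMultipartUploadResult init =
        s3Client.initiateMultipartUpload(new InitiateMultipartUploadRequest("docs", "my-filekey"));

upload.handler(buffer -> {
    partBuffer.write(buffer.getBytes(), 0, buffer.length());
    if (partBuffer.size() >= 5 * 1024 * 1024) {                 // S3's minimum part size (except the last part)
        byte[] part = partBuffer.toByteArray();
        UploadPartRequest request = new UploadPartRequest()
                .withBucketName("docs").withKey("my-filekey")
                .withUploadId(init.getUploadId())
                .withPartNumber(nextPartNumber[0]++)
                .withPartSize(part.length)
                .withInputStream(new ByteArrayInputStream(part));
        partETags.add(s3Client.uploadPart(request).getPartETag());
        partBuffer.reset();                                     // start accumulating the next part
    }
});

upload.endHandler(v -> {
    if (partBuffer.size() > 0) {                                // the final, possibly smaller, part
        byte[] part = partBuffer.toByteArray();
        UploadPartRequest request = new UploadPartRequest()
                .withBucketName("docs").withKey("my-filekey")
                .withUploadId(init.getUploadId())
                .withPartNumber(nextPartNumber[0]++)
                .withPartSize(part.length)
                .withInputStream(new ByteArrayInputStream(part));
        partETags.add(s3Client.uploadPart(request).getPartETag());
    }
    s3Client.completeMultipartUpload(
            new CompleteMultipartUploadRequest("docs", "my-filekey", init.getUploadId(), partETags));
});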
Since neither Amazon's S3 client nor the MinIO client did what I want, I decided to dig into how putObject() was actually implemented and make my own. This is what I came up with.
This implementation is specific to Vert.x; however, it can easily be generalized to work with built-in java.io InputStreams via a while loop and a pair of Piped streams.
This implementation is also specific to MinIO, but it can easily be adapted to use the S3 client since, for the most part, the two APIs are the same.
In this example, Buffer is basically a container around a byte array, and I'm not doing anything special with it. I replaced it with a plain byte array to make sure the approach would still work, and it did.
package server
import com.google.common.collect.HashMultimap
import io.minio.MinioClient
import io.minio.messages.Part
import io.vertx.core.buffer.Buffer
import io.vertx.core.streams.ReadStream
import scala.collection.mutable.ListBuffer
class CustomMinioClient(client: MinioClient) extends MinioClient(client) {
def putReadStream(bucket: String = "my-bucket",
objectName: String,
region: String = "us-east-1",
data: ReadStream[Buffer],
objectSize: Long,
contentType: String = "application/octet-stream"
) = {
val headers: HashMultimap[String, String] = HashMultimap.create()
headers.put("Content-Type", contentType)
var uploadId: String = null
try {
val parts = new ListBuffer[Part]()
val createResponse = createMultipartUpload(bucket, region, objectName, headers, null)
uploadId = createResponse.result.uploadId()
var partNumber = 1
var uploadedSize = 0L // Long, since objectSize is a Long and objects can exceed 2 GB
// an array to use to accumulate bytes from the incoming stream until we have enough to make a `uploadPart` request
var partBuffer = Buffer.buffer()
// S3's minimum part size is 5mb, excepting the last part
// you should probably implement your own logic for determining how big
// to make each part based off the total object size to avoid unnecessary calls to S3 to upload small parts.
val minPartSize = 5 * 1024 * 1024
data.handler { buffer =>
partBuffer.appendBuffer(buffer)
val availableSize = objectSize - uploadedSize - partBuffer.length
val isMinPartSize = partBuffer.length >= minPartSize
val isLastPart = uploadedSize + partBuffer.length == objectSize
if (isMinPartSize || isLastPart) {
val partResponse = uploadPart(
bucket,
region,
objectName,
partBuffer.getBytes,
partBuffer.length,
uploadId,
partNumber,
null,
null
)
parts.addOne(new Part(partNumber, partResponse.etag))
uploadedSize += partBuffer.length
partNumber += 1
// empty the part buffer since we have already uploaded it
partBuffer = Buffer.buffer()
}
}
data.endHandler { _ =>
completeMultipartUpload(bucket, region, objectName, uploadId, parts.toArray, null, null)
}
data.exceptionHandler { exception =>
// should also probably abort the upload here
println("Handler caught exception in custom putObject: " + exception)
}
} catch {
// and abort it here as well...
case e: Exception =>
println("Exception thrown in custom `putObject`: " + e)
abortMultipartUpload(
bucket,
region,
objectName,
uploadId,
null,
null
)
}
}
}
This can all be used pretty easily.
First, set up the client:
private val _minioClient = MinioClient.builder()
.endpoint("http://localhost:9000")
.credentials("my-username", "my-password")
.build()
private val myClient = new CustomMinioClient(_minioClient)
Then, where you receive the upload request:
context.request.uploadHandler { upload =>
myClient.putReadStream(objectName = upload.filename(), data = upload, objectSize = myFileSize)
context.response().setStatusCode(200).end("done")
}
The only catch with this implementation is that you need to know the file sizes in advance of the request.
However, this can easily be solved the way I did it, especially if you're using a web UI (a rough sketch of the server-side bookkeeping follows the list):
- Before attempting to upload the files, send a request to the server containing a map of file name to file size.
- That pre-request generates a unique ID for the upload.
- The server saves the filename -> filesize map, using the upload ID as an index.
- The server sends the upload ID back to the client.
- The client sends the multipart upload request using that upload ID.
- The server pulls the list of files and their sizes and uses it to call .putReadStream().
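A minimal sketch of that bookkeeping, kept library-agnostic; the UploadRegistry class name and its methods are hypothetical:

import java.util.Collections;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only (hypothetical helper): remembers each upload's filename -> size map
// under a generated upload ID, so the upload handler can look sizes up later.
public class UploadRegistry {
    private final Map<String, Map<String, Long>> sizesByUploadId = new ConcurrentHashMap<>();

    // called by the pre-request endpoint with the client's filename -> size map
    public String register(Map<String, Long> fileSizes) {
        String uploadId = UUID.randomUUID().toString();
        sizesByUploadId.put(uploadId, fileSizes);
        return uploadId; // the client echoes this ID back with the multipart upload
    }

    // called by the upload handler to find the size to pass to putReadStream()
    public long sizeOf(String uploadId, String filename) {
        return sizesByUploadId
                .getOrDefault(uploadId, Collections.emptyMap())
                .getOrDefault(filename, -1L);
    }
}

With something like this in place, the myFileSize value in the uploadHandler above would come from sizeOf(uploadId, upload.filename()).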
I am trying to read from S3 and write into an in-memory buffer, like this:
def inMemoryDownload(bucketName: String, key: String): String = {
val s3Object = s3client.getObject(new GetObjectRequest(bucketName, key))
val s3Stream = s3Object.getObjectContent()
val outputStream = new ByteArrayOutputStream()
val buffer = new Array[Byte](10* 1024)
var bytesRead:Int =s3Stream.read(buffer)
while (bytesRead > -1) {
info("writing.......")
outputStream.write(buffer)
info("reading.......")
bytesRead = s3Stream.read(buffer)
}
val data = new String(outputStream.toByteArray)
outputStream.close()
s3Object.getObjectContent.close()
data
}
But it is giving me a heap space error (the size of the file on S3 is 4 MB).
You should be using the number of bytes you just read when writing into the stream. The way you have it written, it writes the entire buffer every time. I doubt that is the cause of your memory problem, but it could be: imagine that read returns a single byte every time, and you write 10K into the stream. That's 40G, right there.
Another problem is that (I am not 100% sure, but I suspect) getObjectContent creates a new input stream every time, so you just keep reading the same bytes over and over again in the loop. You should put it into a variable instead.
Also, if I may make a suggestion, try rewriting your code in actual Scala, not just syntactically but idiomatically: avoid mutable state and use functional transformations. If you are going to write Scala code, you might as well take some time to get into the right mindset. You'll grow to appreciate it eventually, I promise :)
Something like this, perhaps? (Here input is the object content stream, while buffer and output are the Array[Byte] and ByteArrayOutputStream from your code.)
val input = s3Object.getObjectContent
Stream
.continually(input.read(buffer))
.takeWhile(_ > 0)
.foreach { output.write(buffer, 0, _) }
I have the same problem as asked in "JVM Monitor char array memory usage", but I did not get a clear answer from that question and I couldn't add a comment because of low reputation, so I am asking here.
I wrote a multithreaded program that calculates word co-occurrence frequencies. I read words lazily from a file and do the calculations. In the program, I have a map that holds word pairs and their co-occurrence counts. After the counting finishes, I write this map to a file.
Here is my problem:
After writing the frequency map to a file, the file size is, for example, 3 GB, but while the program runs it uses 35 GB of RAM plus 5 GB of swap. I monitored the JVM with a profiler (the memory, garbage collector, and VM parameter screenshots are omitted here).
How can char[] arrays occupy this much memory when the output file size is only 3 GB? Thanks.
Okay, here is the code that causes this problem.
This code is not multithreaded; it is used for merging two files that contain co-occurring words and their counts. It causes the same memory usage problem, and furthermore it triggers lots of GC calls because of the high heap usage, so the normal program cannot run because of the stop-the-world garbage collector:
import java.io.{BufferedWriter, File, FileWriter, FilenameFilter}
import java.util.regex.Pattern
import core.WordTuple
import scala.collection.mutable.{Map => mMap}
import scala.io.{BufferedSource, Source}
class PairWordsMerger(path: String, regex: String) {
private val wordsAndCounts: mMap[WordTuple, Int] = mMap[WordTuple, Int]()
private val pattern: Pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
private val dir: File = new File(path)
private var sWordAndCount: Array[String] = Array.fill(3)("")
private var tempTuple: WordTuple = WordTuple("","")
private val matchedFiles: Array[File] = dir.listFiles(new FilenameFilter {
override def accept(dir: File, name: String): Boolean = pattern.matcher(name).matches()
})
def merge(): Unit = {
for(fileName <- matchedFiles) {
val file: BufferedSource = Source.fromFile(fileName)
val iter: Iterator[String] = file.getLines()
while(iter.hasNext) {
//here I used split like this because entries in the file
//are hold in this format: word1,word2,frequency
sWordAndCount = iter.next().split(",")
tempTuple = WordTuple(sWordAndCount(0), sWordAndCount(1))
try {
wordsAndCounts += (tempTuple -> (wordsAndCounts.getOrElse(tempTuple, 0) + sWordAndCount(2).toInt))
} catch {
case e: NumberFormatException => println("Cannot parse to int...")
}
}
file.close()
println("One pair words map update done")
}
writeToFile()
}
private def writeToFile(): Unit = {
val f: File = new File("allPairWords.txt")
val out = new BufferedWriter(new FileWriter(f))
for(elem <- wordsAndCounts) {
out.write(elem._1 + "," + elem._2 + "\n")
}
out.close()
}
}
object PairWordsMerger {
def apply(path: String, regex: String): PairWordsMerger = new PairWordsMerger(path, regex)
}
I have, for example, 1000 images whose names are all very similar and differ only in the number: "ImageNmbr0001", "ImageNmbr0002", ..., "ImageNmbr1000", etc.
I would like to load every image and store them in an ImageProcessor array.
So if I call a method on an element of this array, that method is applied to the corresponding picture, for example counting the black pixels in it.
I can use a for loop to get the numbers from 1 to 1000, turn them into strings, build the file names from those numbers, and load each image.
However, I would still have to turn each loaded image into an element I can store in an array, and I don't have a method yet that receives a string (the file path) and returns the corresponding ImageProcessor.
Also, my approach at the moment seems rather clumsy and not very elegant, so I would be very happy if someone could show me a better way to do this using methods from these packages:
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;
I think I found a solution:
Opener opener = new Opener(); // ij.io.Opener
String imageFilePath = "somePath";
ImagePlus imp = opener.openImage(imageFilePath);
ImageProcessor ip = imp.getProcessor();
That does the job, but thank you for your time/effort.
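Building on that, here is a short sketch of the loop described in the question, using the same Opener approach; the ImageLoader class name, the directory argument, and the .tif extension are assumptions, so adjust them to your files:

import ij.ImagePlus;
import ij.io.Opener;
import ij.process.ImageProcessor;

// Sketch only: load ImageNmbr0001 ... ImageNmbr1000 into an ImageProcessor array.
public class ImageLoader {
    public static ImageProcessor[] loadProcessors(String directory, int count) {
        Opener opener = new Opener();
        ImageProcessor[] processors = new ImageProcessor[count];
        for (int i = 1; i <= count; i++) {
            // build the zero-padded file name, e.g. "ImageNmbr0001.tif"
            String path = directory + "/ImageNmbr" + String.format("%04d", i) + ".tif";
            ImagePlus imp = opener.openImage(path); // returns null if the image cannot be opened
            processors[i - 1] = (imp != null) ? imp.getProcessor() : null;
        }
        return processors;
    }
}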
I'm not sure I understand exactly what you want... but I definitely would not save the information for each image in a separate file, for two reasons:
- It's slower to save and read the contents of many files compared with one medium-sized file.
- Each file adds overhead (each needs a path, a minimum size on disk, etc.).
If you want performance, group multiple image descriptions into a single description file.
If you don't want to make a binary description file, you can always use a database, which is built for this and performs well on reads and usually on writes too.
I don't know exactly what your needs are, but I guess you can try making a binary file with fixed-size records and reading it back later.
Example:
public static void main(String[] args) throws IOException {
FileOutputStream fout = null;
FileInputStream fin = null;
try {
fout = new FileOutputStream("description.bin");
DataOutputStream dout = new DataOutputStream(fout);
for (int x = 0; x < 1000; x++) {
dout.writeInt(10); // Write Int data
}
fin = new FileInputStream("description.bin");
DataInputStream din = new DataInputStream(fin);
for (int x = 0; x < 1000; x++) {
System.out.println(din.readInt()); // Read Int data
}
} catch (Exception e) {
e.printStackTrace(); // at least report the failure instead of swallowing it
} finally {
if (fout != null) {
fout.close();
}
if (fin != null) {
fin.close();
}
}
}
In this example, the code writes integers to the "description.bin" file and then reads them back.
This is pretty fast in Java, since Java uses "channels" for files by default.