I have the same problem as asked here: JVM Monitor char array memory usage. But I did not get a clear answer from that question and I couldn't add a comment because of low reputation, so I am asking here.
I wrote a multithreaded program that calculates word co-occurrence frequencies. I read words lazily from a file and do the calculations. The program keeps a map that holds word pairs and their co-occurrence counts. After the counting finishes, I write this map to a file.
Here is my problem:
After writing the frequency map to a file, the output file is, for example, 3 GB. But while the program runs, it uses 35 GB of RAM plus 5 GB of swap. Monitoring the JVM shows the heap dominated by char[] arrays (the memory, garbage collector, and VM-parameter screenshots are omitted here).
How can char[] arrays occupy this much memory when the output file is only 3 GB? Thanks.
Okay, here is the code that causes this problem:
This code is not multithreaded; it is used for merging two files that contain co-occurring words and their counts. It causes the same memory usage problem, and the high heap usage also triggers lots of GC calls, so the normal program cannot run because of the stop-the-world garbage collector:
import java.io.{BufferedWriter, File, FileWriter, FilenameFilter}
import java.util.regex.Pattern

import core.WordTuple

import scala.collection.mutable.{Map => mMap}
import scala.io.{BufferedSource, Source}

class PairWordsMerger(path: String, regex: String) {
  private val wordsAndCounts: mMap[WordTuple, Int] = mMap[WordTuple, Int]()
  private val pattern: Pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
  private val dir: File = new File(path)
  private var sWordAndCount: Array[String] = Array.fill(3)("")
  private var tempTuple: WordTuple = WordTuple("", "")

  private val matchedFiles: Array[File] = dir.listFiles(new FilenameFilter {
    override def accept(dir: File, name: String): Boolean = pattern.matcher(name).matches()
  })

  def merge(): Unit = {
    for (file <- matchedFiles) {
      val source: BufferedSource = Source.fromFile(file)
      val iter: Iterator[String] = source.getLines()
      while (iter.hasNext) {
        // split like this because entries in the file are
        // held in this format: word1,word2,frequency
        sWordAndCount = iter.next().split(",")
        tempTuple = WordTuple(sWordAndCount(0), sWordAndCount(1))
        try {
          wordsAndCounts += (tempTuple -> (wordsAndCounts.getOrElse(tempTuple, 0) + sWordAndCount(2).toInt))
        } catch {
          case e: NumberFormatException => println("Cannot parse to int...")
        }
      }
      source.close()
      println("One pair words map update done")
    }
    writeToFile()
  }

  private def writeToFile(): Unit = {
    val f: File = new File("allPairWords.txt")
    val out = new BufferedWriter(new FileWriter(f))
    for (elem <- wordsAndCounts) {
      out.write(elem._1 + "," + elem._2 + "\n")
    }
    out.close()
  }
}

object PairWordsMerger {
  def apply(path: String, regex: String): PairWordsMerger = new PairWordsMerger(path, regex)
}
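As context for those numbers (my own estimate, not from the original post): on a 64-bit pre-Java-9 JVM every String is backed by a char[] at two bytes per character, and each map entry adds object headers, a WordTuple, and a boxed Int, so a 3 GB text file can plausibly expand by an order of magnitude in the heap. A rough sketch of the arithmetic, with all sizes approximate:

object EntryOverhead {
  // Approximate sizes for a 64-bit JVM with compressed oops; the exact
  // numbers depend on the JVM version and flags.
  def stringBytes(chars: Int): Long =
    24 + 16 + 2L * chars // String object (~24) + char[] header (~16) + 2 bytes per char

  def main(args: Array[String]): Unit = {
    val avgWord = 8 // assumed average word length
    val perEntry =
      2 * stringBytes(avgWord) + // two words held by a WordTuple
        16 +                     // the WordTuple object itself, roughly
        48                       // hash map entry plus boxed Int, roughly
    val onDisk = 2 * avgWord + 5 // "word1,word2,count\n" on disk, roughly
    println(s"~$perEntry bytes in heap vs ~$onDisk bytes on disk per pair")
  }
}

On these assumptions each pair costs roughly eight times its on-disk size, which is consistent with a 3 GB file inflating to tens of gigabytes of heap.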
Related
A byte order mark is making my regex fail when using scala.io.Source to read from a file. This answer is a lightweight solution using java.io. Is there anything similar for scala.io.Source, or will I have to revert to Java because of a single byte?
Based on Joe K's idea in his comment, and using Andrei Punko's answer to the problem in Java and Alvin Alexander's Scala code, the simplest solution for reading a file that may contain a byte order mark into an array of strings is:
import java.io.{BufferedReader, FileInputStream, IOException, InputStreamReader, Reader}

import scala.collection.mutable.ArrayBuffer

// Skip a leading byte order mark if one is present; otherwise rewind.
// (Assumes the charset in use decodes the BOM to '\ufeff', e.g. UTF-8.)
@throws[IOException]
def skip(reader: Reader): Unit = {
  reader.mark(1)
  val possibleBOM = new Array[Char](1)
  reader.read(possibleBOM)
  if (possibleBOM(0) != '\ufeff') reader.reset()
}

// `file` is the java.io.File being read.
val br = new BufferedReader(new InputStreamReader(new FileInputStream(file)))
skip(br)
val lines = {
  val ls = new ArrayBuffer[String]()
  var l: String = null
  while ({ l = br.readLine(); l != null }) {
    ls.append(l)
  }
  br.close()
  ls.toArray
}
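For a Source-only variant (my sketch, not from the answers cited above): since the BOM decodes to the single character '\ufeff', you can also just strip it from the first line after reading:

import scala.io.Source

def linesWithoutBOM(path: String): Array[String] = {
  val src = Source.fromFile(path, "UTF-8")
  try {
    val lines = src.getLines().toArray
    // a BOM, if present, survives decoding as a leading '\ufeff'
    if (lines.nonEmpty) lines(0) = lines(0).stripPrefix("\ufeff")
    lines
  } finally src.close()
}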
I'm having a hard time finding the correct way to do any advanced file-system operations on Linux using Scala.
The one which I really can't figure out is best described by the following pseudo-code:
with fd = open(path, append | create):
with flock(fd, exclusive_lock):
fd.write(string)
Basically: open a file in append mode (create it if it doesn't exist), get an exclusive lock on it, and write to it (with the implicit unlock and close afterwards).
Is there an easy, clean and efficient way of doing this if I know my program will only be run on Linux? (Preferably without glossing over the exceptions that should be handled.)
Edit:
The answer I got is, as far as I've seen and tested, correct. However, it's quite verbose, so I'm marking it as accepted but leaving here the snippet of code I ended up using (not sure if it's correct, but as far as I can see it does everything I need):
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{Paths, StandardOpenOption}

val fc = FileChannel.open(Paths.get(file_path), StandardOpenOption.CREATE, StandardOpenOption.APPEND)
try {
  fc.lock()
  fc.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)))
} finally {
  fc.close()
}
You can use FileChannel.lock and FileLock to get what you wanted:
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{Path, Paths, StandardOpenOption}

import scala.util.{Failure, Success, Try}

object ExclusiveFsWrite {
  def main(args: Array[String]): Unit = {
    val path = Paths.get("/tmp/file")
    val buffer = ByteBuffer.wrap("Some text data here".getBytes(StandardCharsets.UTF_8))
    val fc = getExclusiveFileChannel(path)
    try {
      fc.write(buffer)
    } finally {
      // closing the channel also releases the lock
      fc.close()
    }
    ()
  }

  private def getExclusiveFileChannel(path: Path): FileChannel = {
    // Append if the file exists, or create it if it does not.
    // With APPEND, writes always go to the end of the file, so no
    // explicit positioning is needed.
    val fc = FileChannel.open(path, StandardOpenOption.WRITE, StandardOpenOption.APPEND,
      StandardOpenOption.CREATE)
    // get an exclusive lock
    Try(fc.lock()) match {
      case Success(lock) =>
        println("Is shared lock: " + lock.isShared)
        fc
      case Failure(ex) =>
        Try(fc.close())
        throw ex
    }
  }
}
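One caveat worth adding (mine, not from the answer): FileLock is advisory and held per JVM, so a second lock attempt from another thread in the same process fails fast instead of blocking. A minimal sketch, reusing the /tmp/file path from above:

import java.nio.channels.{FileChannel, OverlappingFileLockException}
import java.nio.file.{Paths, StandardOpenOption}

object LockCaveat {
  def main(args: Array[String]): Unit = {
    val path = Paths.get("/tmp/file")
    val a = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
    val b = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
    val lock = a.lock()
    try b.lock() // same JVM: throws instead of waiting
    catch { case _: OverlappingFileLockException => println("already locked in this JVM") }
    lock.release()
    a.close()
    b.close()
  }
}

Other processes only observe the lock if they also acquire one; the OS does not stop an uncooperative writer.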
I have the code below that first reads a file and then puts the information into a HashMap (indexCategoryVectors). The HashMap maps a String (key) to a Long (value). The code uses the Long value to access a specific position in another file with RandomAccessFile.
Using the information read from this second file, plus some manipulation, the code writes new information to another file (filename4). The only variable that accumulates information is the buffer (var buffer = new ArrayBuffer[Map[Int, Double]]()), but after each iteration the buffer is cleared (buffer.clear).
The foreach should run more than 4 million times, and what I'm seeing is memory accumulating. I tested the code with a million iterations and it used more than 32 GB of memory. I don't know the reason for that; maybe it's about garbage collection or something else in the JVM. Does anybody know what I can do to prevent this memory leak?
import java.io.RandomAccessFile

import scala.collection.mutable.ArrayBuffer

def main(args: Array[String]): Unit = {
  val indexCategoryVectors = getIndexCategoryVectors("filename1")
  val uriCategories = getMappingURICategories("filename2")
  val raf = new RandomAccessFile("filename3", "r")
  var buffer = new ArrayBuffer[Map[Int, Double]]()
  // iterate over each hashmap key
  uriCategories.foreach(uri => {
    var emptyInterpretation = true
    uri._2.foreach(categoria => {
      val position = indexCategoryVectors.get(categoria)
      // jump to the stored position
      raf.seek(position.get)
      var vectorSpace = parserVector(raf.readLine())
      buffer += vectorSpace
      // write the contents of the buffer to the file
      writeInformation("filename4")
      buffer.clear()
    })
  })
  println("Success!")
}
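One way to tell a genuine leak from the JVM simply growing its heap (my suggestion, not from the post; the flags are standard HotSpot options, and the jar and main-class names are placeholders):

java -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -cp yourapp.jar YourMain

If the run completes within the cap, the growth was just default heap sizing; if it dies with OutOfMemoryError, the dump shows which objects are retained. In the code above, the usual suspects would be whatever writeInformation and parserVector keep alive, since buffer itself is cleared every iteration.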
I am successfully serving videos using the Play framework, but I'm experiencing an issue: each time a file is served, the Play framework creates a copy in C:\Users\user\AppData\Temp. I'm serving large files so this quickly creates a problem with disk space.
Is there any way to serve a file in Play without creating a copy? Or have Play automatically delete the temp file?
The code I'm using to serve is essentially:
public Result video() {
    return ok(new File("whatever"));
}
Use Streaming
I use the following method for video streaming. This code does not create temp copies of the media file.
Basically, the code responds to the Range queries sent by the browser. If the browser does not support Range queries, I fall back to sending the whole file with Ok.sendFile (internally Play also tries to stream the file), which might create temp files, but this happens very rarely, only when the browser does not support Range queries.
GET /media controllers.MediaController.media
Put this code inside a controller called MediaController:
// Imports assume Play 2.5.x with the iteratees module available.
import java.io.{File, FileInputStream}

import akka.stream.scaladsl.Source
import akka.util.ByteString
import play.api.http.HttpEntity
import play.api.libs.MimeTypes
import play.api.libs.iteratee.{Enumeratee, Enumerator}
import play.api.libs.streams.Streams

def media = Action { req =>
  val file = new File("/Users/something/Downloads/somefile.mp4")
  val rangeHeaderOpt = req.headers.get(RANGE)
  rangeHeaderOpt.map { range =>
    // the header has the form "bytes=start-end"; the end may be absent
    val strs = range.substring("bytes=".length).split("-")
    if (strs.length == 1) {
      val start = strs.head.toLong
      val end = file.length() - 1L
      partialContentHelper(file, start, end)
    } else {
      val start = strs.head.toLong
      val end = strs.tail.head.toLong
      partialContentHelper(file, start, end)
    }
  }.getOrElse {
    // no Range header: fall back to sending the whole file
    Ok.sendFile(file)
  }
}

def partialContentHelper(file: File, start: Long, end: Long) = {
  val fis = new FileInputStream(file)
  fis.skip(start) // note: skip may skip fewer bytes than requested
  val byteStringEnumerator = Enumerator.fromStream(fis).&>(Enumeratee.map(ByteString.fromArray(_)))
  val mediaSource = Source.fromPublisher(Streams.enumeratorToPublisher(byteStringEnumerator))
  PartialContent.sendEntity(HttpEntity.Streamed(mediaSource, None, None)).withHeaders(
    CONTENT_TYPE -> MimeTypes.forExtension("mp4").get,
    CONTENT_LENGTH -> ((end - start) + 1).toString,
    CONTENT_RANGE -> s"bytes $start-$end/${file.length()}",
    ACCEPT_RANGES -> "bytes",
    CONNECTION -> "keep-alive"
  )
}
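To check the Range handling (my example, assuming the route above and Play's default dev port), a request such as

curl -v -H "Range: bytes=0-1023" http://localhost:9000/media

should come back as 206 Partial Content with matching Content-Range and Content-Length headers, while a plain request without the header falls through to Ok.sendFile.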
I have, for example, 1000 images whose names are all very similar and differ only in their number: "ImageNmbr0001", "ImageNmbr0002", ..., "ImageNmbr1000", etc.
I would like to load every image and store it in an ImageProcessor array.
So if I call a method on an element of this array, the method is applied to that picture, for example counting the black pixels in it.
I can use a for loop to get the numbers from 1 to 1000, turn them into strings, and splice them into the file names to load each image.
However, I would still have to turn the result into an element I can store in an array, and I don't have a method yet that receives a string (the file path) and returns the ImageProcessor stored at its end.
Also, my approach at the moment seems rather clumsy and not too elegant. So I would be very happy if someone could show me a better way to do this using methods from these packages:
import ij.ImagePlus;
import ij.plugin.filter.PlugInFilter;
import ij.process.ImageProcessor;
I think I found a solution:
import ij.io.Opener;

Opener opener = new Opener();
String imageFilePath = "somePath";
ImagePlus imp = opener.openImage(imageFilePath);
ImageProcessor ip = imp.getProcessor();
That does the job, but thank you for your time and effort.
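To tie it back to the loop described in the question (my sketch, not from the answer; written in Scala like the rest of this page, though the ImageJ calls are identical from Java, and the directory and extension are assumptions):

import ij.io.Opener
import ij.process.ImageProcessor

val opener = new Opener()
val processors: Array[ImageProcessor] =
  (1 to 1000).map { i =>
    // zero-padded name; assumes every file opens successfully
    // (openImage returns null on failure)
    val path = f"/some/dir/ImageNmbr$i%04d.png"
    opener.openImage(path).getProcessor
  }.toArray

Each element is then an ImageProcessor, so a per-image operation such as counting black pixels is just a method call on processors(i).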
I'm not sure if I understand what you want exactly... but I definitely would not save the information for each image in a separate file, for two reasons:
- It's slower to save and read the contents of many files than of one medium-sized file
- Each file adds overhead (every file needs a path, a minimum size on disk, etc.)
If you want performance, group multiple image descriptions into a single description file.
If you don't want to make a binary description file, you can always use a database, which is built for exactly this and performs well on reads and usually on writes.
I don't know exactly what your needs are, but I guess you can try making a binary file with fixed-size records and reading it back later.
Example:
public static void main(String[] args) throws IOException {
    FileOutputStream fout = null;
    FileInputStream fin = null;
    try {
        fout = new FileOutputStream("description.bin");
        DataOutputStream dout = new DataOutputStream(fout);
        for (int x = 0; x < 1000; x++) {
            dout.writeInt(10); // write int data
        }
        dout.flush();
        fin = new FileInputStream("description.bin");
        DataInputStream din = new DataInputStream(fin);
        for (int x = 0; x < 1000; x++) {
            System.out.println(din.readInt()); // read int data
        }
    } catch (Exception e) {
        e.printStackTrace(); // don't swallow errors silently
    } finally {
        if (fout != null) {
            fout.close();
        }
        if (fin != null) {
            fin.close();
        }
    }
}
In this example, the code writes integers to the "description.bin" file and then reads them back.
This is reasonably fast in Java as-is; if you need more throughput, wrapping the streams in BufferedOutputStream/BufferedInputStream or switching to NIO file channels helps further.
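The fixed record size is what makes the binary file convenient: record i lives at byte offset 4 * i, so you can seek straight to it instead of reading everything before it. A sketch (in Scala like the rest of this page; the file name comes from the example above):

import java.io.RandomAccessFile

val raf = new RandomAccessFile("description.bin", "r")
try {
  val i = 42             // record index, chosen arbitrarily
  raf.seek(4L * i)       // each int record occupies exactly 4 bytes
  println(raf.readInt()) // the i-th value written above
} finally raf.close()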