Scanner.findInLine() leaks memory massively

I'm running a simple Scanner to parse a string, but I've discovered that if it is called often enough I get OutOfMemory errors. This code is called as part of the constructor of an object that is built repeatedly for an array of strings:
Edit: Here's the full constructor for more information; not much else happens outside of the try-finally regarding the Scanner.
public Header(String headerText) {
char[] charArr;
charArr = headerText.toCharArray();
// Check that all characters are printable characters
if (charArr.length > 0 && !commonMethods.isPrint(charArr)) {
throw new IllegalArgumentException(headerText);
}
// Check for header suffix
Scanner sc = new Scanner(headerText);
MatchResult res;
try {
sc.findInLine("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
res = sc.match();
} finally {
sc.close();
}
if (res.group(1) == null || res.group(1).isEmpty()) {
throw new IllegalArgumentException("Missing header keyword found"); // Empty header to store
} else {
mnemonic = res.group(1).toLowerCase(); // Store header
}
if (res.group(2) == null || res.group(2).isEmpty()) {
suffix = -1;
} else {
try {
suffix = Integer.parseInt(res.group(2)); // Store suffix if it exists
} catch (NumberFormatException e) {
throw new NumberFormatException(headerText);
}
}
if (res.group(3) == null || res.group(3).isEmpty()) {
isQuery= false;
} else {
if (res.group(3).equals("?")) {
isQuery = true;
} else {
throw new IllegalArgumentException(headerText);
}
}
// If command was of the form *ABC, reject suffixes and prefixes
if (mnemonic.contains("*")
&& suffix != -1) {
throw new IllegalArgumentException(headerText);
}
}
A profiler memory snapshot shows the read(Char) method of Scanner.findInLine() being allocated massive amounts of memory as I scan through a few hundred thousand strings; after a few seconds it has already been allocated over 38MB.
I would think that calling close() on the scanner after using it in the constructor would flag the old object to be cleared by the GC, but somehow it remains, and the read method accumulates gigabytes of data before filling the heap.
Can anybody point me in the right direction?

You haven't posted all your code, but given that you are scanning for the same regex repeatedly, it would be much more efficient to compile a static Pattern beforehand and use this for the scanner's find:
static Pattern p = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
and in the constructor:
sc.findInLine(p);
This may or may not be the source of the OOM issue, but it will definitely make your parsing a bit faster.
Related: java.util.regex - importance of Pattern.compile()?
Update: after you posted more of your code, I see some other issues. If you're calling this constructor repeatedly, it means you are probably tokenizing or breaking up the input beforehand. Why create a new Scanner to parse each line? They are expensive; you should be using the same Scanner to parse the entire file, if possible. Using one Scanner with a precompiled Pattern will be much faster than what you are doing now, which is creating a new Scanner and a new Pattern for each line you are parsing.
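A minimal sketch of the precompiled-pattern idea, under some assumptions: the class name HeaderPatternDemo and the method parse are hypothetical, and the sketch applies the question's regex with a Matcher directly on the string instead of creating a Scanner per header, which may or may not fit the rest of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderPatternDemo {
    // Compiled once and shared by every call; Pattern instances are immutable and thread-safe.
    private static final Pattern HEADER = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");

    /** Returns {mnemonic, suffix, trailer}, or null when the text does not match. */
    static String[] parse(String headerText) {
        Matcher m = HEADER.matcher(headerText);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1).toLowerCase(), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("MEAS12?");
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
    }
}
```

Returning null for a non-match is only a choice made for this sketch; the original constructor would presumably throw IllegalArgumentException at that point.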

The strings that are filling up your memory were created in findInLine(). Therefore, the repeated Pattern creation is not the problem.
Without knowing what the rest of the code does, my guess would be that one of the groups you get out of the matcher is being kept in a field of your object. Then that string would have been allocated in findInLine(), as you see here, but the fact that it is being retained would be due to your code.
Edit:
Here's your problem:
mnemonic = res.group(1).toLowerCase();
What you might not realize is that toLowerCase() returns this if there are no uppercase letters in the string. Also, group(int) returns a substring(), which, on JDKs before Java 7 update 6, creates a new string backed by the same char[] as the full string. So, mnemonic may actually retain the char[] for the entire line.
The fix would just be:
mnemonic = new String(res.group(1).toLowerCase());

I think your code snippet is incomplete. I believe you are calling scanner.findInLine() in a loop. Anyway, try calling scanner.reset(). I hope this solves your problem.

The JVM apparently does not have time to Garbage collect. Possibly because it's using the same code (the constructor) repeatedly to create multiple instances of the same class. The JVM may not do anything about GC until something changes on the run time stack -- and in this case that's not happening. I've been warned in the past about doing "too much" in a constructor as some of the memory management behaviors are not quite the same when other methods are being called.

Your problem is that you are scanning through a couple hundred thousand strings and you are passing the pattern in as a string, so you have a new pattern object for every single iteration of the loop. You can pull the pattern out of the loop, like so:
Pattern toMatch = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
Scanner sc = new Scanner(headerText);
MatchResult res;
try {
sc.findInLine(toMatch);
res = sc.match();
} finally {
sc.close();
}
Then you will only be passing the object reference to toMatch instead of having the overhead of creating a new pattern object for every attempt at a match. This will fix your leak.

Well I've found the source of the problem, it wasn't Scanner exactly but the list holding the objects doing the scanning in the constructor.
The problem was an overrun of a list that held references to the objects doing the parsing: essentially, more strings were received per unit of time than could be processed, and the list grew and grew until there was no more RAM. Bounding this list to a maximum size now prevents the parser from overloading the memory; I'll be adding some synchronization between the parser and the data source to avoid this overrun in the future.
Thank you all for your suggestions, I've already made some changes performance wise regarding the scanner and thank you to #RobI for pointing me to jvisualvm which allowed me to trace back the exact culprits holding the references. The memory dump wasn't showing the reference linking.
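One way to bound such a list, sketched under assumptions: the original code is not shown, so the class BoundedHeaderQueue and its methods are hypothetical, and a java.util.concurrent.ArrayBlockingQueue stands in for whatever list the original used. The data source calls submit() and backs off when it returns false.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedHeaderQueue {
    private final BlockingQueue<String> pending;

    BoundedHeaderQueue(int capacity) {
        // Fixed capacity: the queue can never hold more than this many strings.
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the data source; returns false when the parser has fallen behind. */
    boolean submit(String headerText) {
        return pending.offer(headerText);
    }

    /** Called by the parser; returns null when nothing is waiting. */
    String next() {
        return pending.poll();
    }

    public static void main(String[] args) {
        BoundedHeaderQueue queue = new BoundedHeaderQueue(2);
        System.out.println(queue.submit("ABC1"));
        System.out.println(queue.submit("DEF2"));
        System.out.println(queue.submit("GHI3")); // rejected: the queue is full
        System.out.println(queue.next());
    }
}
```

If the producer should wait instead of dropping or retrying, the blocking put() and take() methods of BlockingQueue give that behaviour at the cost of handling InterruptedException.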

Related

Trying to add substrings from newLines in a large file to a list

I downloaded my extended listening history from Spotify and I am trying to make a program to turn the data into a list of artists without doubles I can easily make sense of. The file is rather huge because it has data on every stream I have done since 2016 (307790 lines of text in total). This is what 2 lines of the file looks like:
{"ts":"2016-10-30T18:12:51Z","username":"edgymemes69endmylifepls","platform":"Android OS 6.0.1 API 23 (HTC, 2PQ93)","ms_played":0,"conn_country":"US","ip_addr_decrypted":"68.199.250.233","user_agent_decrypted":"unknown","master_metadata_track_name":"Devil's Daughter (Holy War)","master_metadata_album_artist_name":"Ozzy Osbourne","master_metadata_album_album_name":"No Rest for the Wicked (Expanded Edition)","spotify_track_uri":"spotify:track:0pieqCWDpThDCd7gSkzx9w","episode_name":null,"episode_show_name":null,"spotify_episode_uri":null,"reason_start":"fwdbtn","reason_end":"fwdbtn","shuffle":true,"skipped":null,"offline":false,"offline_timestamp":0,"incognito_mode":false},
{"ts":"2021-03-26T18:15:15Z","username":"edgymemes69endmylifepls","platform":"Android OS 11 API 30 (samsung, SM-F700U1)","ms_played":254120,"conn_country":"US","ip_addr_decrypted":"67.82.66.3","user_agent_decrypted":"unknown","master_metadata_track_name":"Opportunist","master_metadata_album_artist_name":"Sworn In","master_metadata_album_album_name":"Start/End","spotify_track_uri":"spotify:track:3tA4jL0JFwFZRK9Q1WcfSZ","episode_name":null,"episode_show_name":null,"spotify_episode_uri":null,"reason_start":"fwdbtn","reason_end":"trackdone","shuffle":true,"skipped":null,"offline":false,"offline_timestamp":1616782259928,"incognito_mode":false},
It is formatted in the actual text file so that each stream is on its own line. NetBeans is telling me the exception is happening at line 19 and it only fails when I am looking for a substring bounded by the indexOf function. My code is below. I have no idea why this isn't working, any ideas?
import java.util.*;
public class MainClass {
public static void main(String args[]){
File dat = new File("SpotifyListeningData.txt");
List<String> list = new ArrayList<String>();
Scanner swag = null;
try {
swag = new Scanner(dat);
}
catch(Exception e) {
System.out.println("pranked");
}
while (swag.hasNextLine())
if (swag.nextLine().length() > 1)
if (list.contains(swag.nextLine().substring(swag.nextLine().indexOf("artist_name"), swag.nextLine().indexOf("master_metadata_album_album"))))
System.out.print("");
else
try {list.add(swag.nextLine().substring(swag.nextLine().indexOf("artist_name"), swag.nextLine().indexOf("master_metadata_album_album")));}
catch(Exception e) {}
System.out.println(list);
}
}

Find a JSON parser you like.
Create a class with the fields you care about, annotated according to the parser's specs.
Read the file into a collection of objects. Most parsers will stream the contents, so you're not storing one massive string.
You can then load the data into objects and store them as you see fit. For your purposes, a TreeSet is probably what you want.
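The TreeSet part of these steps can be sketched with the standard library alone. Note the swap: to stay dependency-free this sketch uses a regular expression as a stand-in for a real JSON parser, so it assumes the artist value contains no escaped quotes. The class name UniqueArtists and the method collect are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueArtists {
    // Simplification: matches a quoted artist value with no escaped quotes inside it.
    private static final Pattern ARTIST =
            Pattern.compile("\"master_metadata_album_artist_name\":\"([^\"]*)\"");

    /** Collects the distinct artist names in sorted order. */
    static Set<String> collect(Iterable<String> lines) {
        Set<String> artists = new TreeSet<>();
        for (String line : lines) {
            Matcher m = ARTIST.matcher(line);
            if (m.find()) {              // lines whose artist is null do not match
                artists.add(m.group(1)); // a Set silently ignores duplicates
            }
        }
        return artists;
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList(
                "{\"master_metadata_album_artist_name\":\"Sworn In\",\"shuffle\":true},",
                "{\"master_metadata_album_artist_name\":\"Ozzy Osbourne\",\"shuffle\":true},",
                "{\"master_metadata_album_artist_name\":\"Sworn In\",\"shuffle\":false},",
                "{\"master_metadata_album_artist_name\":null,\"shuffle\":false},")));
    }
}
```

A TreeSet removes the need for the list.contains() check, which scans the whole list for every line.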

Your code will throw a lot of exceptions simply because you don't use braces. Please use braces in every block, whether it is an if, an else, a loop, or anything else. It's good practice and prevents unnecessary bugs.
However, every time scanner.nextLine() is called, it reads the next line from the file, so calling it several times in one expression reads several different lines. Call it once per iteration and store the result.
The best way to deal with this is to write a class with the same fields as the JSON on each line of the file, map the JSON to that class, and read the desired field value from it.
Your approach is risky and depends heavily on the structure of the data, even on whitespace. However, I fixed some lines in your code so that it works for your purpose, although I don't actually recommend manipulating strings this way.
while (swag.hasNextLine()) {
String swagNextLine = swag.nextLine();
if (swagNextLine.length() > 1) {
String toBeAdded = swagNextLine.substring(swagNextLine.indexOf("artist_name") + "artist_name".length() + 2
, swagNextLine.indexOf("master_metadata_album_album") - 2);
if (list.contains(toBeAdded)) {
System.out.print("Match");
} else {
try {
list.add(toBeAdded);
} catch (Exception e) {
System.out.println("Add to list failed");
}
}
System.out.println(list);
}
}

Does Java create object even if it's not initialized directly?

If I initialize a String array directly like this: String[] Distro = Distros.split(","); then it'll create an object because the variable Distro is holding the array.
But if I do it this way, will it also create an object?
String Distros = "CentOS,RHEL,Debian,Ubuntu";
for (String s : Distros.split(",")) {
System.out.println(s);
}
My goal is to reduce object creation to minimize garbage.

Your reasoning “then it'll create an object because variable Distro is holding the array” indicates that you are confusing object creation with variable assignment.
The object is created by the expression Distros.split(","), not the subsequent assignment. It should become obvious when you consider that the split method is an ordinary Java method creating and returning the array without any knowledge about what the caller will do with the result.
When the operation happens in performance-critical code, you might use
int p = 0;
for(int e; (e = Distros.indexOf(',', p)) >= 0; p = e+1)
System.out.println(Distros.substring(p, e));
System.out.println(Distros.substring(p));
instead. It’s worth pointing out that this saves the array creation but still performs the creation of the substrings, which is the more expensive aspect of it. Without knowing what you are actually going to do with the substrings, it’s impossible to say whether there are alternatives which can save the substring creation¹.
But this loop still has an advantage over the split method. The split method creates all substrings and returns an array holding references to them, forcing them to exist at the same time, during the entire loop. The loop above calls substring when needed and doesn’t keep a reference when going to the next. Hence, the strings are not forced to exist all the time and the garbage collector is free to decide when to collect them, depending on the current memory utilization.
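The indexOf loop above can be wrapped in a reusable method, sketched here under assumptions: the names TokenLoop and forEachToken are hypothetical. Each substring is handed to a callback and no reference to it is kept afterwards.

```java
import java.util.function.Consumer;

public class TokenLoop {
    /** Passes each delimiter-separated token to action without building an array of all tokens. */
    static void forEachToken(String text, char delimiter, Consumer<String> action) {
        int p = 0;
        for (int e; (e = text.indexOf(delimiter, p)) >= 0; p = e + 1) {
            action.accept(text.substring(p, e));
        }
        action.accept(text.substring(p)); // remainder after the last delimiter
    }

    public static void main(String[] args) {
        forEachToken("CentOS,RHEL,Debian,Ubuntu", ',', System.out::println);
    }
}
```

One behavioural difference from split(","): this loop also reports trailing empty tokens, which split drops by default.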
¹ I assume that printing is just an example. But to stay at the example, you could replace
System.out.println(Distros.substring(p, e));
with
System.out.append(Distros, p, e).println();
The problem is, this only hides the substring creation, at least in the reference implementation which will eventually perform the substring creation behind the scenes.
An alternative is
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(FileDescriptor.out)));
try {
int p = 0; for(int e; (e = Distros.indexOf(',', p)) >= 0; p = e+1) {
bw.write(Distros, p, e - p);
bw.write(System.lineSeparator());
}
bw.write(Distros, p, Distros.length() - p);
bw.write(System.lineSeparator());
bw.flush();
}
catch(IOException ex) {
ex.printStackTrace();
}
which truly writes the strings without creating substrings. But it forces us to deal with potential exceptions, which PrintStream normally hides.

The method split(delimiter) returns a String array built from the string based on the delimiter. Here the array is created inside the for-each statement and its scope ends after the loop, so it becomes eligible for GC afterwards.
String Distros = "CentOS,RHEL,Debian,Ubuntu";
for (String s : Distros.split(",")) {
System.out.println(s);
}
is equivalent to
String Distros = "CentOS,RHEL,Debian,Ubuntu";
System.out.println("start scope");
{
String[] splitArray = Distros.split(",");
for (String s : splitArray) {
System.out.println(s);
}
}
System.out.println("end scope");

How to make a condition block in java execute only once?

Consider a file reading scenario, where the first line is a header. When looping the file line by line, an 'if condition' would be used to check whether the current line is the first line.
for(Line line : Lines)
{
if(firstLine)
{
//parse the headers
}else
{
//parse the body
}
}
Now, since control has already entered the 'if block' once, checking the 'if condition' on all the other iterations is a waste.
Is there any concept in java so that, once after executing, the particular line of code just vanishes away?
I think this would be a great feature if introduced or does it already exist?
Edit: Thank you for your answers. As I see, there were many approaches based on 'divide the data set'. So instead of 'first line', consider there is a 'special line'..
for(Line line : Lines) //1 million lines
{
if(specialLine)
{
//handle the special line
}else
{
//handle the rest
}
}
In the above code block, suppose the special line comes only after half way to 1 million iterations, how would you handle this scenario efficiently?

How about changing the logic a little?
Because the header is the first line, you can get the first line and parse it on its own, then parse the body starting from the second line:
Line firstLine = Lines.get(0); // get the first line
// parse the headers
for (int i = 1; i < Lines.size(); i++) { // start from the 2nd line
// parse the body
}
In this case you don't need any verification.

Try to divide the dataset.
public List<Line> head(List<Line> list) {
return list.subList(0, 1);
}
public List<Line> tail(List<Line> list) {
return list.subList(1, list.size());
}
head(lines).forEach(line -> { /* parse the header */ });
tail(lines).forEach(line -> { /* parse the body */ });

You could (pseudo code) simply reverse things:
// parse headers
for(Line line: remainingLines)
{
// parse the body
}
But then: you would be using a simple boolean flag for that check. That if check is almost for free, and even more: when this loop sees so many repetitions that it is worth optimizing, the JIT will kick in and create optimized code for it.
And if the JIT doesn't kick in, it isn't worth optimizing this. And of course, there is also branch prediction on the hardware level - in other words: you can expect that modern hardware/software makes sure that the optimal thing happens. Without you doing much else besides writing clear, simple code.
In that sense, your question is a typical example of premature optimization. You are worrying about the wrong aspects. Focus on writing clean, simple code that gets the job done, because that is what enables the JIT to do its runtime magic. Most optimisations that you apply at the Java source code level do not matter at runtime.

The simplest answer is to use some kind of flags to ensure that you parse the first line as header only once.
boolean isHeaderParsed = false;
for(Line line: Lines)
{
if(!isHeaderParsed && firstLine)
{
//parse the headers
isHeaderParsed = true;
}else
{
//parse the body
}
}
This will achieve what you are looking for. But if you are looking for something fancier, interfaces are what you need.
Parser parser = new HeaderParser();
for(Line line: Lines)
{
parser = parser.parse(line);
}
// Parser interface and its implementations
public interface Parser {
Parser parse(Line line);
}
public class HeaderParser implements Parser{
public Parser parse(Line line) {
// your business logic
return new BodyParser();
}
}
public class BodyParser implements Parser{
public Parser parse(Line line) {
// your logic
return this;
}
}
This is essentially the State pattern: each parser returns the parser that should handle the next line.

Since everybody else is posting some weird answers. You could have an interface.
interface StringConsumer{
void consume(String s);
}
Then just change interfaces after the first attempt.
StringConsumer firstLine = s->{System.out.printf("first line: %s\n", s);};
StringConsumer rest = s->{System.out.printf("more: %s\n", s);};
StringConsumer consumer = firstLine;
for(String line: lines){
consumer.consume(line);
consumer = rest;
}
If you don't like the assignment, you can make consumer a field and change its value in the first consumer's consume method.
class StringProvider{
StringConsumer consumer;
List<String> lines;
//assume this is initialized and populated somewhere.
void process(){
StringConsumer rest = s->{System.out.printf("more: %s\n", s);};
consumer = s->{
System.out.printf("firstline: %s\n", s);
consumer = rest;
};
for(String line: lines){
consumer.consume(line);
}
}
}
Now the first consume call will switch the consumer that is being used, and subsequent calls will go to the 'rest' consumer.

Keep a boolean flag outside the if-else and change its state when the header branch is entered. Evaluate this flag along with the condition:
boolean flag=true;
for(Line line: Lines)
{
if(firstLine && flag)
{
flag=false;
//parse the headers
}else
{
//parse the body
}
}

As no one said that, I'll just go ahead and say that this extra if statement is actually pretty cheap, performance wise.
At the lowest level of your machine, there is a thing called branch prediction. if statements tend to be pretty expensive when the hardware can't predict their result. But when it does get it right, testing a boolean with an if comes out pretty much for free. And since your if will evaluate to false 99.99% of the time, any hardware will do just fine.
So don't worry about that too much. You're not wasting cycles there.
And answering your question... It kind of exists already. It doesn't vanish, but if you code it right it should be pretty close to vanished

which of the two is a better way of creating and destroying objects?

i have a question on lines 26 & 27:
String dumb = input.nextLine();
output.println(dumb.replaceAll(REMOVE, ADD));
i was hoping that i'd be able to shrink this down to a single line and be able to save space, so i did:
output.println(new String(input.nextLine()).replaceAll(REMOVE, ADD));
but now i'm wondering about performance. i understand that this program is quite basic and doesn't need optimization, but i'd like to learn this.
the way i look at it, in the first scenario i'm creating a string object dumb, but once i leave the loop the object is abandoned and the JVM should clean it up, right? but does the JVM clean up the abandoned object faster than the program goes through the loop? or will there be several string objects waiting for garbage collection once the program is done?
and is my logic correct that in the second scenario the String object is created on the fly and destroyed once the program has passed through that line? and is this in fact a performance gain?
i'd appreciate it if you could clear this up for me.
thank you,
p.s. in case you are wondering about the program (i assumed it was straight forward) it takes in an input file, and output file, and two words, the program takes the input file, replaces the first word with the second and writes it into the second file. if you've actually read this far and would like to suggest ways i could make my code better, PLEASE DO SO. i'd be very grateful.
import java.io.File;
import java.util.Scanner;
import java.io.PrintWriter;
public class RW {
public static void main(String[] args) throws Exception{
String INPUT_FILE = args[0];
String OUTPUT_FILE = args[1];
String REMOVE = args[2];
String ADD = args[3];
File ifile = new File(INPUT_FILE);
File ofile = new File(OUTPUT_FILE);
if (ifile.exists() == false) {
System.out.println("the input file does not exists in the current folder");
System.out.println("please provide the input file");
System.exit(0);
}
Scanner input = new Scanner(ifile);
PrintWriter output = new PrintWriter(ofile);
while(input.hasNextLine()) {
String dumb = input.nextLine();
output.println(dumb.replaceAll(REMOVE, ADD));
}
input.close();
output.close();
}
}

The very, very first thing I'm going to say is this:
Don't worry about optimizing performance prematurely. The Java compiler is smart, it'll optimize a lot of this stuff for you, and even if it didn't you're optimizing out incredibly tiny amounts of time. The stream IO you've got going there is already running for orders of magnitude longer than the amount of time you're talking about.
What is most important is how easy the code is to understand. You've got a nice code style, going from your example, so keep that up. Which of the two code snippets is easier for someone other than you to read? That is the best option. :)
That said, here are some more specific answers to your questions:
Garbage collection will absolutely pick up objects which are instantiated inside the scope of a loop. Once such an object falls out of scope and nothing else references it, it becomes eligible for clean up. The next time GC runs, it can reclaim all of the objects that have become unreachable.
Creating an object inline will still create an object. The constructor is still called, memory is still allocated... Under the hood, they are really, really similar. It's just that in one case that object has a name, and in the other it doesn't. You're not going to save any real resources by combining two lines of code into one.
"input.nextLine()" already returns a String, so you don't need to wrap it in a new String(). (So yes, removing that actually will result in one less object being instantiated!)

Local objects are eligible for GC once they go out of scope and are no longer referenced. That does not mean that GC cleans them up at that very moment. Eligible objects go through a lifecycle; GC may or may not collect them immediately.
As far as your program is concerned, there is not much to optimize except a line or two. Below is a restructured program.
import java.io.File;
import java.util.Scanner;
import java.io.PrintWriter;
public class Test {
public static void main(String[] args) throws Exception {
String INPUT_FILE = args[0];
String OUTPUT_FILE = args[1];
String REMOVE = args[2];
String ADD = args[3];
File ifile = new File(INPUT_FILE);
File ofile = new File(OUTPUT_FILE);
if (ifile.exists() == false) {
System.out.println("the input file does not exists in the current folder\nplease provide the input file");
System.exit(0);
}
Scanner input = null;
PrintWriter output = null;
try {
input = new Scanner(ifile);
output = new PrintWriter(ofile);
while (input.hasNextLine()) {
output.println(input.nextLine().replaceAll(REMOVE, ADD));
}
} finally {
if (input != null)
input.close();
if(output != null)
output.close();
}
}
}

If you are concerned about object creation and performance, use a profiler to measure your code. And keep in mind that doing new String(input.nextLine()) is totally pointless since input.nextLine() returns an immutable instance of String. So just do output.println(input.nextLine().replaceAll(REMOVE, ADD));.
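For comparison, here is a sketch of the same program using try-with-resources (Java 7 and later), which closes both files without an explicit finally block. The class name ReplaceWords and the method rewrite are hypothetical. The sketch also swaps replaceAll() for replace(), on the assumption that the word to remove is meant literally rather than as a regular expression.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

public class ReplaceWords {
    /** Copies input to output, replacing every literal occurrence of remove with add. */
    static void rewrite(File input, File output, String remove, String add) throws IOException {
        // Both resources are closed automatically, even if an exception is thrown.
        try (Scanner in = new Scanner(input);
             PrintWriter out = new PrintWriter(output)) {
            while (in.hasNextLine()) {
                // replace() treats 'remove' literally; replaceAll() would treat it as a regex.
                out.println(in.nextLine().replace(remove, add));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("usage: ReplaceWords <input> <output> <remove> <add>");
            return;
        }
        rewrite(new File(args[0]), new File(args[1]), args[2], args[3]);
    }
}
```

If regex behaviour is actually wanted, switching back to replaceAll(remove, add) is a one-word change.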

What is more efficient StringBuffer new() or delete(0, sb.length())?

It is often argued that avoiding creating objects (especially in loops) is considered good practice.
Then, what is most efficient regarding StringBuffer?
StringBuffer sb = new StringBuffer();
ObjectInputStream ois = ...;
for (int i=0;i<1000;i++) {
for (int j=0;j<10;j++) {
sb.append(ois.readUTF());
}
...
// Which option is the most efficient?
sb = new StringBuffer(); // new StringBuffer instance?
sb.delete(0,sb.length()); // or deleting content?
}
I mean, one could argue that creating an object is faster than looping through an array.

First, StringBuffer is thread-safe, which costs performance compared to StringBuilder. StringBuilder is not thread-safe but is faster as a result. Finally, I prefer just setting the length to 0 using setLength.
sb.setLength(0);
This is similar to .delete(...) except you don't really care about the length. It is also probably a little faster since it doesn't need to 'delete' anything. Creating a new StringBuilder (or StringBuffer) would be less efficient. Any time you see new, Java is creating a new object and placing it on the heap.
Note: After looking at the implementation of .delete and .setLength, .delete sets length = 0, and .setLength sets every thing to '\0' So you may get a little win with .delete

Just to amplify the previous comments:
From looking at source, delete() always calls System.arraycopy(), but if the arguments are (0,count), it will call arraycopy() with a length of zero, which will presumably have no effect. IMHO, this should be optimized out since I bet it's the most common case, but no matter.
With setLength(), on the other hand, the call will increase the StringBuilder's capacity if necessary via a call to ensureCapacityInternal() (another very common case that should have been optimized out IMHO) and then truncates the length as delete() would have done.
Ultimately, both methods just wind up setting count to zero.
Neither call does any iterating in this case. Both make an unnecessary function call. However ensureCapacityInternal() is a very simple private method, which invites the compiler to optimize it nearly out of existence so it's likely that setLength() is slightly more efficient.
I'm extremely skeptical that creating a new instance of StringBuilder could ever be as efficient as simply setting count to zero, but I suppose that the compiler might recognize the pattern involved and convert the repeated instantiations into repeated calls to setLength(0). But at the very best, it would be a wash. And you're depending on the compiler to recognize the case.
Executive summary: setLength(0) is the most efficient. For maximum efficiency, pre-allocate the buffer space in StringBuilder when you create it.
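A small sketch of that summary, with assumptions labelled: the class BuilderReuse and the method joinGroups are hypothetical, lists of strings stand in for the ObjectInputStream reads in the question, and the initial capacity of 256 is an arbitrary guess.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BuilderReuse {
    /** Joins each group of parts into one string, reusing a single StringBuilder. */
    static List<String> joinGroups(List<List<String>> groups) {
        StringBuilder sb = new StringBuilder(256); // pre-allocated; 256 is an arbitrary guess
        List<String> result = new ArrayList<>();
        for (List<String> group : groups) {
            for (String part : group) {
                sb.append(part);
            }
            result.add(sb.toString()); // toString() copies the characters out
            sb.setLength(0);           // reset for the next group, keeping the capacity
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(joinGroups(Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c", "d", "e"))));
    }
}
```

Whether this beats creating a new builder per iteration is something only a measurement on the real workload can settle.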

The delete method is implemented this way:
public AbstractStringBuilder delete(int start, int end) {
if (start < 0)
throw new StringIndexOutOfBoundsException(start);
if (end > count)
end = count;
if (start > end)
throw new StringIndexOutOfBoundsException();
int len = end - start;
if (len > 0) {
System.arraycopy(value, start+len, value, start, count-end);
count -= len;
}
return this;
}
As you can see it doesn't iterate through the array.
