High memory usage with Files.lines - java

I've found a few other questions on SO that are close to what I need but I can't figure this out. I'm reading a text file line by line and getting an out of memory error. Here's the code:
System.out.println("Total memory before read: " + Runtime.getRuntime().totalMemory()/1000000 + "MB");
String wp_posts = new String();
try(Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)){
wp_posts = stream
.filter(line -> line.startsWith("INSERT INTO `wp_posts`"))
.collect(StringBuilder::new, StringBuilder::append,
StringBuilder::append)
.toString();
} catch (Exception e1) {
System.out.println(e1.getMessage());
e1.printStackTrace();
}
try {
System.out.println("wp_posts Mega bytes: " + wp_posts.getBytes("UTF-8").length/1000000);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
System.out.println("Total memory after read: " + Runtime.getRuntime().totalMemory()/1000000 + "MB");
Output is like (when run in an environment with more memory):
Total memory before read: 255MB
wp_posts Mega bytes: 18
Total memory after read: 1035MB
Note that in my production environment I cannot increase the heap size.
I've tried explicitly closing the stream, calling System.gc(), and putting the stream in parallel mode (which consumed even more memory).
My questions are:
Is this amount of memory usage expected?
Is there a way to use less memory?

Your problem is in collect(StringBuilder::new, StringBuilder::append, StringBuilder::append). When you append something to a StringBuilder and its internal array is too small, it allocates a new array of double the size and copies the old contents into it.
Use new StringBuilder(int capacity) to predefine the size of the internal array.
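For example, a rough sketch of the same collect with a pre-sized builder (it assumes the file size on disk is an acceptable upper bound for the capacity; any reasonable estimate of the filtered output would do):

try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
    // Pre-size using the file length so the builder never has to grow and copy.
    int capacity = (int) Math.min(Files.size(path), Integer.MAX_VALUE - 8);
    wp_posts = stream
            .filter(line -> line.startsWith("INSERT INTO `wp_posts`"))
            .collect(() -> new StringBuilder(capacity),
                     StringBuilder::append,
                     StringBuilder::append)
            .toString();
} catch (IOException e1) {
    e1.printStackTrace();
}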
The second problem is that you have a big file but collect the whole result into a StringBuilder. That looks strange to me: it is effectively the same as reading the whole file into a String without using a Stream at all.

Your Runtime.totalMemory() calculation is pointless if you are allowing the JVM to resize the heap. Java will allocate heap memory as needed as long as it doesn't exceed the -Xmx value. Since the JVM is smart, it won't allocate heap memory one byte at a time, because that would be very expensive. Instead the JVM requests a larger amount of memory at a time (the actual value is platform and JVM implementation specific).
Your code is currently loading the content of the file into memory, so there will be objects created on the heap. Because of that the JVM will most likely request memory from the OS and you will observe an increased Runtime.totalMemory() value.
Try running your program with a strictly sized heap, e.g. by adding the -Xms300m -Xmx300m options. If you don't get an OutOfMemoryError, decrease the heap until you do. However, you also need to pay attention to GC cycles; these things go hand in hand and are a trade-off.
Alternatively you can create a heap dump after the file is processed and then explore the data with MemoryAnalyzer.

The way you calculated memory is incorrect due to the following reasons:
You have taken the total memory (not the used memory). JVM allocates memory lazily and when it does, it does it in chunks. So, when it needs an additional 1 byte memory, it may allocate 1MB memory (provided the total memory does not exceed the configured max heap size). Thus a good portion of allocated heap memory may remain unused. Therefore, you need to calculate the used memory: Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()
A good portion of the memory you see with the above formula may already be eligible for garbage collection. The JVM would definitely do a garbage collection before throwing OutOfMemoryError. Therefore, to get an idea, you should call System.gc() before calculating the used memory. Of course, you don't call gc in production, and calling gc does not guarantee that the JVM will actually trigger a garbage collection. But for testing purposes, I think it works well.
You got the OutOfMemoryError while the stream processing was still in progress. At that time the String had not yet been formed and the StringBuilder was still strongly referenced. You should call the capacity() method of the StringBuilder to get the actual number of char elements in its internal array and then multiply that by 2 to get the number of bytes, because Java internally uses UTF-16, which needs 2 bytes to store an ASCII character.
Finally, the way your code is written (i.e. not specifying a big enough size for the StringBuilder initially), every time your StringBuilder runs out of space it doubles the size of the internal array by creating a new array and copying the content. This means that, at times, triple the size of the actual String is allocated. You cannot measure this because it happens inside the StringBuilder class, and by the time control leaves the StringBuilder the old array is already eligible for garbage collection. So there is a high chance that when you get the OutOfMemoryError, you get it at the point where the StringBuilder tries to allocate a double-sized array, or more specifically in the Arrays.copyOf method.
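Putting the first two points together, a measurement along these lines (a test-only sketch; remember that System.gc() is only a hint) gives a much better picture than totalMemory() alone:

// Test-only sketch: approximate used heap after requesting a GC.
Runtime rt = Runtime.getRuntime();
System.gc(); // hint only; fine for a local experiment, not for production code
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println("Used heap: " + usedBytes / 1000000 + " MB");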
How much memory is expected to be consumed by your program as is? (A rough estimate)
Let's consider a program similar to yours.
public static void main(String[] arg) {
    // Initialize the arraylist to emulate a
    // file with 32 lines each containing
    // 1000 ASCII characters
    List<String> strList = new ArrayList<String>(32);
    for (Integer i = 0; i < 32; i++) {
        strList.add(String.format("%01000d", i));
    }

    StringBuilder str = new StringBuilder();
    strList.stream().map(element -> {
        // Print the number of char
        // reserved by the StringBuilder
        System.out.print(str.capacity() + ", ");
        return element;
    }).collect(() -> {
        return str;
    }, (response, element) -> {
        response.append(element);
    }, (response, element) -> {
        response.append(element);
    }).toString();
}
Here after every append, I'm printing the capacity of the StringBuilder.
The output of the program is as follows:
16, 1000, 2002, 4006, 4006, 8014, 8014, 8014, 8014,
16030, 16030, 16030, 16030, 16030, 16030, 16030, 16030,
32062, 32062, 32062, 32062, 32062, 32062, 32062, 32062,
32062, 32062, 32062, 32062, 32062, 32062, 32062,
If your file has "n" lines (where n is a power of 2) and each line has an average "m" ASCII characters, the capacity of the StringBuilder at the end of the program execution will be: (n * m + 2 ^ (a + 1) ) where (2 ^ a = n).
E.g. if your file has 256 lines and an average of 1500 ASCII characters per line, the total capacity of the StringBuilder at the end of the program will be: (256 * 1500 + 2 ^ 9) = 384512 characters.
Assuming you have only ASCII characters in your file, each character will occupy 2 bytes in the UTF-16 representation. Additionally, every time the StringBuilder array runs out of space, a new array twice the size of the original is created (see the capacity growth numbers above) and the content of the old array is copied to the new array. The old array is then left for garbage collection. Therefore, if you add another 2 ^ (a+1) or 2 ^ 9 characters, the StringBuilder would create a new array for holding (n * m + 2 ^ (a + 1) ) * 2 + 2 characters and start copying the content of the old array into the new one. Thus, there will be two big arrays inside the StringBuilder while the copying goes on.
Thus the total memory will be: 384512 * 2 + (384512 * 2 + 2) * 2 = 2,307,076 bytes = 2.2 MB (approx.) to hold only 0.7 MB of data.
I have ignored the other memory-consuming items like array headers, object headers, references, etc., as those are negligible or constant compared to the array size.
So, in conclusion, 256 lines with 1500 characters each consume 2.2 MB (approx.) to hold only 0.7 MB of data (the data is only about a third of the memory used).
If you had initialized the StringBuilder with the size 384,512 at the beginning, you could have accommodated the same number of characters in a third of the memory, and there would have been much less work for the CPU in terms of array copying and garbage collection.
What you may consider doing instead
Finally, for this kind of problem, you may want to work in chunks: write the content of your StringBuilder to a file or database as soon as it has processed, say, 1000 records, clear the StringBuilder, and start over for the next batch of records. That way you never hold more than 1000 (say) records' worth of data in memory.
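A sketch of that chunked approach (the output writer, the outPath variable and the 1000-line batch size are illustrative assumptions, not part of the original code):

// Sketch: flush the filtered lines in batches instead of holding them all.
final int BATCH = 1000; // illustrative batch size
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8);
     BufferedWriter out = Files.newBufferedWriter(outPath, StandardCharsets.UTF_8)) {
    StringBuilder batch = new StringBuilder();
    int[] count = {0};
    stream.filter(line -> line.startsWith("INSERT INTO `wp_posts`"))
          .forEach(line -> {
              batch.append(line);
              if (++count[0] % BATCH == 0) {
                  try {
                      out.write(batch.toString()); // write out and reuse the builder
                      batch.setLength(0);
                  } catch (IOException e) {
                      throw new UncheckedIOException(e);
                  }
              }
          });
    out.write(batch.toString()); // write the final partial batch
}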

Related

How much memory does HashMap consume for N objects? [duplicate]

I was asked in an interview to estimate the memory usage of a HashMap and how much memory it would consume if it held 2 million items.
For example:
Map <String,List<String>> mp=new HashMap <String,List<String>>();
The mapping is like this.
key value
----- ---------------------------
abc ['hello','how']
abz ['hello','how','are','you']
How would I estimate the memory usage of this HashMap Object in Java?
The short answer
To find out how large an object is, I would use a profiler. In YourKit, for example, you can search for the object and then get it to calculate its deep size. This will give you a fair idea of how much memory would be used if the object stood alone, and is a conservative estimate of its size.
The quibbles
If parts of the object are re-used in other structures e.g. String literals, you won't free this much memory by discarding it. In fact discarding one reference to the HashMap might not free any memory at all.
What about Serialisation?
Serialising the object is one approach to getting an estimate, but it can be wildly off, as the serialisation overhead and encoding differ between memory and a byte stream. How much memory is used depends on the JVM (and whether it's using 32/64-bit references), but the serialisation format is always the same.
e.g.
In Sun/Oracle's JVM, an Integer can take 16 bytes for the header, 4 bytes for the number and 4 bytes of padding (objects are 8-byte aligned in memory), for a total of 24 bytes. However, if you serialise one Integer it takes 81 bytes; serialise two Integers and they take 91 bytes. I.e. the size of the first Integer is inflated and the second Integer takes less than what is used in memory.
String is a much more complex example. In the Sun/Oracle JVM, it contains 3 int values and a char[] reference. So you might assume it uses a 16-byte header plus 3 * 4 bytes for the ints, 4 bytes for the char[], 16 bytes for the overhead of the char[] and then two bytes per char, aligned to an 8-byte boundary...
What flags can change the size?
If you have 64-bit references, the char[] reference is 8 bytes long, resulting in 4 bytes of padding. If you have a 64-bit JVM, you can use -XX:+UseCompressedOops to get 32-bit references. (So looking at the JVM bit size alone doesn't tell you the size of its references.)
If you have -XX:+UseCompressedStrings, the JVM will use a byte[] instead of a char array when it can. This can slow down your application slightly but could improve your memory consumption dramatically. When a byte[] is used, the memory consumed is 1 byte per char. ;) Note: for a 4-char String, as in the example, the size used is the same due to the 8-byte boundary.
What do you mean by "size"?
As has been pointed out, HashMap and List are more complex, as many, if not all, of the Strings can be reused, possibly as String literals. What you mean by "size" depends on how the map is used. I.e. how much memory would the structure use alone? How much would be freed if the structure were discarded? How much memory would be used if you copied the structure? These questions can have different answers.
What can you do without a profiler?
If you can determine that the likely conservative size is small enough, the exact size doesn't matter. The conservative case is likely the one where you construct every String and entry from scratch. (I only say likely, as a HashMap can have capacity for 1 billion entries even though it is empty, and Strings with a single char can be sub-strings of a String with 2 billion characters.)
You can perform a System.gc(), take the free memory, create the objects, perform another System.gc() and see how much the free memory has decreased. You may need to create the objects many times and take an average. Repeat this exercise many times; it can give you a fair idea.
(BTW: while System.gc() is only a hint, the Sun/Oracle JVM will, by default, perform a Full GC every time it is called.)
I think the question needs to be clarified, because there is a difference between the size of the HashMap itself and the size of the HashMap plus the objects it contains.
If you consider the size of the HashMap alone, then in the example you provided, the HashMap stores one reference to the String "abc" and one reference to the List. So the multiple elements in the list do not matter. Only the reference to the list is stored in the value.
In a 32-bit JVM, one Map entry takes 4 bytes for the "abc" reference + 4 bytes for the List reference + 4 bytes for the "hashcode" int property of the Map entry + 4 bytes for the "next" property of the Map entry.
You also add 4*(X-1) bytes of references, where "X" is the number of empty buckets that the HashMap created when you called the constructor new HashMap<String,List<String>>(). According to http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html, the default is 16.
There are also the loadFactor, modCount, threshold and size fields (16 more bytes) and the object header (8 bytes).
So in the end, the size of your above HashMap would be roughly 4 + 4 + 4 + 4 + (4*15) + 16 + 8 = 100 bytes.
This is an approximation based on the data that is owned by the HashMap itself. I think the interviewer was probably interested in seeing whether you were aware of the way HashMap works (the fact, for example, that the default constructor creates an array of 16 buckets for Map entries, and the fact that the sizes of the objects stored in the HashMap do not affect the HashMap's size, since it only stores references).
HashMaps are so widely used that, under certain circumstances, it is worth using the constructors that take an initial capacity and load factor.
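For example (the entry count and load factor below are just illustrative numbers):

// Sketch: sizing the map up front avoids repeated rehashing as it grows.
// 2,000,000 expected entries / 0.75 load factor => roughly 2.7M buckets needed.
Map<String, List<String>> mp =
        new HashMap<String, List<String>>((int) (2000000 / 0.75f) + 1, 0.75f);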
Summary:
memory = hashmap_array_size * bucket_size
         + n * chained_list_node_size
         + sum(key string sizes)
         + sum(list_string_size(string_list) for each HashMap List<String> value)
       = 254 MB
(theoretical in-interview estimate)
Test program total-memory-used-size for 2 million sample entries: (see below)
= 640 MB
(I recommend a simple test program like this for a quick true-total-size estimate)
A minimal estimate (actual implementation probably has a bit more overhead):
Assumed data structure:
Bucket: (Pointer to String key, Pointer to hash-chain-list first-node)
Chained List Node: (Pointer to List<String> value, Next-pointer)
(HashMap is a chained hash - each bucket has a list/tree of values)
(as of Java 8, the list switches to a tree after 8 items)
List<String> instance: (Pointer to first node)
List<String> Node: (Pointer to String value, Next-pointer)
Assumption to simplify this estimate: zero collisions, each bucket has max 1 value (ask interviewer if this is ok - to give a rough, initial answer)
Assumption: 64-bit JVM so 64-bit pointers so pointer_size=8 bytes
Assumption: HashMap underlying array is 50% full (by default, at 75% full, the hashmap is rehashed with double the size), so hashmap_array_size = 2*n
memory = hashmap_array_size * bucket_size
         + n * chained_list_node_size
         + sum(key string sizes)
         + sum(list_string_size(string_list) for each HashMap List<String> value)
So:
memory = (n*2)*(8*2)
         + n*(8*2)
         + (2 length_field + 3 string_length)*n
         + n*(8 + 3*(8*2) + 3*(2 length_field + 4 string_length))
       = 2000000*(2*8*2 + 8*2 + (2+3) + (8 + 3*8*2 + 3*(2+4)))
       = 254,000,000 bytes
       = 254 MB
n = number of items in the hash map
bucket_size = pointer_size*2
chained_list_node_size = pointer_size*2
list_string_size(list) = pointer_size
                         + list.size() * list_string_node_size
                         + sum(string value sizes in this List<String> list)
list_string_node_size = pointer_size*2
String length bytes = length_field_size + string_characters
(UTF-8 is 1 byte per ascii character)
(length_field_size = size of integer = 2)
Assume all keys are length 3.
(we have to assume something to calculate space used)
so: sum(key string sizes) = (2 length_field + 3 string_length)*n
Assume all value string-lists are length 3 and each string is of length 4. So:
sum(list_string_size(string_list) for each hashmap List<String> value)
= n*(8 + 3*(8*2) + 3*(2 length_field + 4 string_length))
A simple test program would give a better real answer:
import java.util.*;

class TempTest {
    public static void main(String[] args) {
        HashMap<String, List<String>> map = new HashMap<>();
        System.gc();
        printMemory();
        for (int i = 0; i < 2000000; ++i) {
            map.put(String.valueOf(i), Arrays.asList(String.valueOf(i), String.valueOf(i) + "b", String.valueOf(i) + "c"));
        }
        System.gc();
        printMemory();
    }

    private static void printMemory() {
        Runtime runtime = Runtime.getRuntime();
        long totalMemory = runtime.totalMemory();
        long freeMemory = runtime.freeMemory();
        System.out.println("Memory: Used=" + (totalMemory - freeMemory) + " Total=" + totalMemory + " Free=" + freeMemory);
    }
}
For me, this took 640MB (after.Used - before.Used).
You can't know in advance without knowing what all the strings are, how many items are in each list, or whether the strings are all unique references.
The only way to know for sure is to serialize the whole thing to a byte array (or temp file) and see exactly how many bytes that was.
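A minimal sketch of that measurement (it assumes the stored lists are serializable, and keep in mind the caveat above that serialized size and heap size are not the same thing):

// Sketch: measure the serialized size of the map (a rough proxy, not heap usage).
ByteArrayOutputStream bos = new ByteArrayOutputStream();
try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
    oos.writeObject(mp); // mp is the HashMap<String, List<String>> from the question
}
System.out.println("Serialized size: " + bos.size() + " bytes");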

Android Inserting words into ArrayList, out of memory

I have two files: a dictionary containing words of length 3 to 6 and a dictionary containing words of length 7. The words are stored in text files, separated by newlines. This method loads a file and inserts its words into an ArrayList which I store in an application class.
The files are 386 KB and 380 KB and each contains fewer than 200k words.
private void loadDataIntoDictionary(String filename) throws Exception {
    Log.d(TAG, "loading file: " + filename);
    AssetFileDescriptor descriptor = getAssets().openFd(filename);
    FileReader fileReader = new FileReader(descriptor.getFileDescriptor());
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String word = null;
    int i = 0;
    MyApp appState = ((MyApp) getApplicationContext());
    while ((word = bufferedReader.readLine()) != null) {
        appState.addToDictionary(word);
        word = null;
        i++;
    }
    Log.d(TAG, "added " + i + " words to the dictionary");
    bufferedReader.close();
}
The program crashes on an emulator running 2.3.3 with a 64MB sd card.
These errors are reported via logcat:
The heap grows past 24 MB. I then see clamp target GC heap from 25.XXX to 24.000 MB.
GC_FOR_MALLOC freed 0K, 12% free, external 1657k/2137K, paused 208ms.
GC_CONCURRENT freed XXK, 14% free
Out of memory on a 24-byte allocation and then FATAL EXCEPTION, memory exhausted.
How can I load these files without getting such a large heap?
Inside MyApp:
private ArrayList<String> dictionary = new ArrayList<String>();

public void addToDictionary(String word) {
    dictionary.add(word);
}
Irrespective of any other problems/bugs, ArrayList can be very wasteful for this kind of storage, because as a growing ArrayList runs out of space, it doubles the size of its underlying storage array. So it's possible that nearly half of your storage is wasted. If you can pre-size a storage array or ArrayList to the correct size, then you may get significant saving.
Also (with paranoid data-cleansing hat on) make sure that there's no extra whitespace in your input files - you can use String.trim() on each word if necessary, or clean up the input files first. But I don't think this can be a significant problem given the file sizes you mention.
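A sketch of both suggestions combined, reusing the question's bufferedReader loop (the 200,000 capacity is an illustrative estimate based on the stated word counts, and this stores into a local list rather than the MyApp helper):

// Sketch: pre-size the list so it never has to grow, and trim each word.
ArrayList<String> dictionary = new ArrayList<String>(200000);
String word;
while ((word = bufferedReader.readLine()) != null) {
    dictionary.add(word.trim()); // defensive trim of stray whitespace
}
dictionary.trimToSize(); // give back any unused backing-array slack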
I'd expect your inputs to take less than 2MB to store the text itself (remember that Java uses UTF-16 internally, so would typically take 2 bytes per character) but there's maybe 1.5MB overhead for the String object references, plus 1.5MB overhead for the String lengths, and possibly the same again and again for the offset and hashcode (take a look at String.java)... whilst 24MB of heap still sounds a little excessive, it's not far off if you are getting the near-doubling effect of an unlucky ArrayList re-size.
In fact, rather than speculate, how about a test? The following code, run with -Xmx24M gets to about 560,000 6-character Strings before stalling (on a Java SE 7 JVM, 64-bit). It eventually crawls up to around 580,000 (with much GC thrashing, I imagine).
ArrayList<String> list = new ArrayList<String>();
int x = 0;
while (true)
{
list.add(new String("123456"));
if (++x % 1000 == 0) System.out.println(x);
}
So I don't think there's a bug in your code - storing large numbers of small Strings is just not very efficient in Java. For the test above it takes over 7 bytes per character because of all the overheads (which may differ between 32-bit and 64-bit machines, incidentally, and depend on JVM settings too)!
You might get slightly better results by storing an array of byte arrays rather than an ArrayList of Strings. There are also more efficient data structures for storing strings, such as tries.
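For instance, a rough sketch of the byte-array idea, again reusing the question's reader loop (it assumes the words are plain ASCII, so UTF-8 gives 1 byte per character):

// Sketch: store each word as a UTF-8 byte[] instead of a String
// (1 byte per ASCII char rather than 2, and no per-String field overhead).
ArrayList<byte[]> compactDictionary = new ArrayList<byte[]>(200000);
String word;
while ((word = bufferedReader.readLine()) != null) {
    compactDictionary.add(word.trim().getBytes(Charset.forName("UTF-8")));
}
// Convert back on demand:
String first = new String(compactDictionary.get(0), Charset.forName("UTF-8"));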

Heap: Survivor Space

I wrote a sample Java application which allocates memory and then runs forever.
Why is the memory used by the survivor space 0 KB?
List<String> stringlist = new ArrayList<String>();
while (true) {
    stringlist.add("test");
    if (stringlist.size() >= 5000000)
        break;
}

while (true)
    for (String s : stringlist);
Because "test" is a String literal it will end up in permanent memory not heap.
Memory size of objects you create is 5000000 + 4*2 ~ 5MB which will easily fit into Eden space.
Modify
stringlist.add("test");
to
stringlist.add(new String("test"));
and you will get 5000000 * 4 * 2 = 38 MB, which most probably will still fit into Eden. You can either increase your list size or the String length to make sure you have survivors.
"test" is a String literal and, regardless of how it’s stored (this has changed during the Java development), the important point here is, that it is a single object.
Recall the Java Language Specification:
…a string literal always refers to the same instance of class String. This is because string literals - or, more generally, strings that are the values of constant expressions (§15.28) - are "interned" so as to share unique instances, using the method String.intern
So there are no new Strings created within your loop as "test" always refers to the same String instance. The only heap change occurs when the ArrayList’s internal capacity is exhausted.
The memory finally required for the ArrayList’s internal array depends on the size of an object reference, usually it’s 5000000*4 bytes for 32Bit JVMs and 64Bit JVMs with compressed oops and 5000000*8 bytes for 64Bit JVMs without compressed oops.
The interesting point here is described on www.kdgregory.com:
if your object is large enough, it will be created directly in the tenured generation. User-defined objects won't (shouldn't!) have anywhere near the number of members needed to trigger this behavior, but arrays will: with JDK 1.5, arrays larger than a half megabyte go directly into the tenured generation.
This harmonizes with these words found on oracle.com:
If survivor spaces are too small, copying collection overflows directly into the tenured generation.
which gives another reason why larger arrays might not show up in the survivor space. So it depends on the exact configuration whether they do not appear because they were copied from the Eden space to the Tenured Generation or because they were created in the Tenured Generation in the first place. But the result of not showing up in the survivor space is the same.
So when the ArrayList is created with its default capacity of 10, the internal array is smaller than this threshold, and so are the next ones created on each capacity enlargement. However, by the time a new array exceeds this threshold, all the old ones are garbage and hence won't show up as "survivors".
So at the end of the first loop you have only one remaining array which has a size exceeding the threshold by far and hence bypassed the Survivor space. Your second loop does not add anything to the memory management. It creates temporary Iterators but these never “survive”.

Reducing memory churn when processing large data set

Java has a tendency to create a large number of objects that need to be garbage collected when processing large data sets. This happens fairly frequently when streaming large amounts of data from the database, creating reports, etc. Is there a strategy to reduce the memory churn?
In the example below, the object-based version spends a significant amount of time (2+ seconds) generating objects and performing garbage collection, whereas the boolean array version completes in a fraction of a second without any garbage collection whatsoever.
How do I reduce the memory churn (the need for large number of garbage collections) when processing large data sets?
java -verbose:gc -Xmx500M UniqChars
...
----------------
[GC 495441K->444241K(505600K), 0.0019288 secs] x 45 times
70000007
================
70000007
import java.util.HashSet;
import java.util.Set;

public class UniqChars {
    static String a = null;

    public static void main(String[] args) {
        // Generate data set
        StringBuffer sb = new StringBuffer("sfdisdf");
        for (int i = 0; i < 10000000; i++) {
            sb.append("sfdisdf");
        }
        a = sb.toString();
        sb = null; // free sb
        System.out.println("----------------");
        compareAsSet();
        System.out.println("================");
        compareAsAry();
    }

    public static void compareAsSet() {
        Set<String> uniqSet = new HashSet<String>();
        int n = 0;
        for (int i = 0; i < a.length(); i++) {
            String chr = a.substring(i, i);
            uniqSet.add(chr);
            n++;
        }
        System.out.println(n);
    }

    public static void compareAsAry() {
        boolean uniqSet[] = new boolean[65536];
        int n = 0;
        for (int i = 0; i < a.length(); i++) {
            int chr = (int) a.charAt(i);
            uniqSet[chr] = true;
            n++;
        }
        System.out.println(n);
    }
}
Well, as pointed out by one of the comments, it's your code, not Java, that's at fault for the memory churn. So let's see: you've written code that builds an insanely large String from a StringBuffer and calls toString() on it. Then, in a loop, it calls substring() on that insanely large string, creating a.length() new Strings. Then it does some in-place work on an array, which really will perform pretty fast since there is no object creation, but ultimately just writes true to the same 5-6 locations in a huge array. Waste much? So what did you think would happen? Ditch StringBuffer and use StringBuilder, since it's not synchronized, which will be a little faster.
OK, so here's where your algorithm is probably spending its time. The StringBuffer allocates an internal character array to store stuff each time you call append(). When that character array fills up entirely, it has to allocate a larger character array, copy all the junk you just wrote into the new array, then append what you originally called it with. So your code is filling the array up, allocating a bigger chunk, copying the junk to the new array, and repeating that process until it has appended 10000000 times. You can speed that up by pre-allocating the character array for the StringBuffer. Roughly that's 10000000 * "sfdisdf".length(). That will keep Java from creating tons of memory that it just dumps over and over.
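Something like this sketch of the pre-allocation idea (the capacity works out to the 70,000,007 characters the original program ends up with):

// Sketch: reserve the full final length up front so append() never has
// to grow and copy the internal array.
StringBuffer sb = new StringBuffer(10000001 * "sfdisdf".length());
sb.append("sfdisdf");
for (int i = 0; i < 10000000; i++) {
    sb.append("sfdisdf");
}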
Next is the compareAsSet() mess. Your line String chr = a.substring(i,i); creates a NEW String a.length() times. Since a.substring(i,i) is only a single character, you could just use charAt(i) and then there's no allocation happening at all. There's also the option of CharSequence, which doesn't create a new String with its own character array but simply points at the original underlying char[] with an offset and length: String.subSequence().
Plug this same code into any other language and it'll be slow there too. In fact I'd say far, far worse. Just try this in C++ and watch it be significantly worse than Java if you allocate and deallocate this much. Java memory allocation is much faster than C++ because everything in Java is allocated from a memory pool, so creating objects is orders of magnitude faster. But there are limits. Furthermore, Java compacts its memory should it become too fragmented; C++ doesn't. So as you allocate memory and dump it, in just the same way, you'll probably run the risk of fragmenting the memory in C++. That could mean your StringBuffer might lose the ability to grow large enough to finish, and would crash.
In fact that might also explain some of the performance issues with GC, because it has to make room for a contiguous block big enough after lots of trash has been taken out. So Java is not only cleaning up the memory, it's also having to compact the memory address space so it can get a block big enough for your StringBuffer.
Anyway, I'm sure you're just kicking the tires, but testing with code like this isn't really smart because it'll never perform well - it's unrealistic memory allocation. You know the old adage: Garbage In, Garbage Out. And that's what you got: garbage.
In your example your two methods are doing very different things.
In compareAsSet() you are generating the same 4 Strings ("s", "d", "f" and "i") and calling String.hashCode() and String.equals(String) (HashSet does this when you try to add them) 70000007 times. What you end up with is a HashSet of size 4. While you are doing this you are allocating String objects each time String.substring(int, int) returns which will force a minor collection every time the 'new' generation of the garbage collector gets filled.
In compareAsAry() you've allocated a single array 65536 elements wide, changed some values in it, and then it goes out of scope when the method returns. This is a single heap memory operation vs. the 70000007 done in compareAsSet. You do have a local int variable being changed 70000007 times, but this happens in stack memory, not in heap memory. This method does not really generate that much garbage in the heap compared to the other method (basically just the array).
Regarding churn, your options are recycling objects or tuning the garbage collector.
Recycling is not really possible with Strings in general, as they are immutable; though the VM may perform interning operations, this only reduces the total memory footprint, not the garbage churn. A solution targeted at the above scenario that recycles objects could be written, but the implementation would be brittle and inflexible.
Tuning the garbage collector so that the 'new' generation is larger could reduce the total number of collections that have to be performed during your method call and thus increase the throughput of the call; you could also just increase the heap size in general, which would accomplish the same thing.
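For example, something along these lines (the flag values are illustrative, not tuned for this workload; -Xmn sets the young generation size):

java -verbose:gc -Xmx500M -Xmn256m UniqChars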
For further reading on garbage collector tuning in Java 6, I recommend the Oracle white paper linked below.
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
For comparison, if you wrote this it would do the same thing.
public static void compareLength() {
    // All the loop does is count the length in a complex way.
    System.out.println(a.length());
}

// I assume you intended to write this.
public static void compareAsBitSet() {
    BitSet uniqSet = new BitSet();
    for (int i = 0; i < a.length(); i++)
        uniqSet.set(a.charAt(i));
    System.out.println(uniqSet.size());
}
Note: the BitSet uses 1 bit per element rather than 1 byte per element. It also expands as required, so if you have ASCII text the BitSet might use 128 bits, or 16 bytes (plus about 32 bytes of overhead). The boolean[] uses 64 KB, which is much higher. Ironically, using a boolean[] can be faster, as it involves less bit shifting and only the portion of the array actually used needs to be in memory.
As you can see, with either solution, you get a much more efficient result because you use a better algorithm for what needs to be done.

Why do I get "Not enough storage is available to process this command" using Java MappedByteBuffers?

I have a very large array of doubles that I am handling with a disk-based file and a paging List of MappedByteBuffers; see this question for more background. I am running on Windows XP using Java 1.5.
Here is the key part of my code that does the allocation of the buffers against the file...
try
{
    // create a random access file and size it so it can hold all our data = the extent x the size of a double
    f = new File(_base_filename);
    _filename = f.getAbsolutePath();
    _ioFile = new RandomAccessFile(f, "rw");
    _ioFile.setLength(_extent * BLOCK_SIZE);
    _ioChannel = _ioFile.getChannel();

    // make enough MappedByteBuffers to handle the whole lot
    _pagesize = bytes_extent;
    long pages = 1;
    long diff = 0;
    while (_pagesize > MAX_PAGE_SIZE)
    {
        _pagesize /= PAGE_DIVISION;
        pages *= PAGE_DIVISION;

        // make sure we are at double boundaries. We cannot have a double spanning pages
        diff = _pagesize % BLOCK_SIZE;
        if (diff != 0) _pagesize -= diff;
    }

    // what is the difference between the total bytes associated with all the pages and the
    // total overall bytes? There is a good chance we'll have a few left over because of the
    // rounding down that happens when the page size is halved
    diff = bytes_extent - (_pagesize * pages);
    if (diff > 0)
    {
        // check whether adding on the remainder to the last page will tip it over the max size
        // if not then we just need to allocate the remainder to the final page
        if (_pagesize + diff > MAX_PAGE_SIZE)
        {
            // need one more page
            pages++;
        }
    }

    // make the byte buffers and put them on the list
    int size = (int) _pagesize; // safe cast because of the loop which drops maxsize below Integer.MAX_INT
    int offset = 0;
    for (int page = 0; page < pages; page++)
    {
        offset = (int) (page * _pagesize);

        // the last page should be just big enough to accommodate any left over odd bytes
        if ((bytes_extent - offset) < _pagesize)
        {
            size = (int) (bytes_extent - offset);
        }

        // map the buffer to the right place
        MappedByteBuffer buf = _ioChannel.map(FileChannel.MapMode.READ_WRITE, offset, size);

        // stick the buffer on the list
        _bufs.add(buf);
    }

    Controller.g_Logger.info("Created memory map file :" + _filename);
    Controller.g_Logger.info("Using " + _bufs.size() + " MappedByteBuffers");
    _ioChannel.close();
    _ioFile.close();
}
catch (Exception e)
{
    Controller.g_Logger.error("Error opening memory map file: " + _base_filename);
    Controller.g_Logger.error("Error creating memory map file: " + e.getMessage());
    e.printStackTrace();
    Clear();
    if (_ioChannel != null) _ioChannel.close();
    if (_ioFile != null) _ioFile.close();
    if (f != null) f.delete();
    throw e;
}
I get the error mentioned in the title after I allocate the second or third buffer.
I thought it was something to do with contiguous memory available, so have tried it with different sizes and numbers of pages, but to no overall benefit.
What exactly does "Not enough storage is available to process this command" mean and what, if anything, can I do about it?
I thought the point of MappedByteBuffers was the ability to be able to handle structures larger than you could fit on the heap, and treat them as if they were in memory.
Any clues?
EDIT:
In response to an answer below (#adsk) I changed my code so I never have more than a single active MappedByteBuffer at any one time. When I refer to a region of the file that is currently unmapped I junk the existing map and create a new one. I still get the same error after about 3 map operations.
The bug quoted with GC not collecting the MappedByteBuffers still seems to be a problem in JDK 1.5.
I thought the point of MappedByteBuffers was the ability to be able to handle structures larger than you could fit on the heap, and treat them as if they were in memory.
No. The idea is / was to allow you to address more than 2**31 doubles ... on the assumption that you had enough memory and were using a 64-bit JVM.
(I am assuming that this is a followup question to this question.)
EDIT: Clearly, more explanation is needed.
There are a number of limits that come into play.
Java has a fundamental restriction that the length attribute of an array, and array indexes have type int. This, combined with the fact that int is signed and an array cannot have a negative size means that the largest possible array can have 2**31 elements. This restriction applies to 32bit AND 64bit JVMs. It is a fundamental part of the Java language ... like the fact that char values go from 0 to 65535.
Using a 32-bit JVM places a (theoretical) upper bound of 2**32 on the number of bytes that are addressable by the JVM. This includes the entire heap, your code, the library classes that you use, the JVM's native code core, memory used for mapped buffers ... everything. (In fact, depending on your platform, the OS may give you considerably less than 2**32 bytes of address space.)
The parameters that you give on the java command line determine how much heap memory the JVM will allow your application to use. Memory mapped using MappedByteBuffer objects does not count towards this.
The amount of memory that the OS will give you depends (on Linux/UNIX) on the total amount of swap space configured, the 'process' limits and so on. Similar limits probably apply to Windows. And of course, you can only run a 64bit JVM if the host OS is 64bit capable, and you are using 64bit capable hardware. (If you have a Pentium, you are plain out of luck.)
Finally, the amount of physical memory in your system comes into play. In theory, you can ask your JVM to use a heap, etc. that is many times bigger than your machine's physical memory. In practice, this is a bad idea. If you over-allocate virtual memory, your system will thrash and application performance will go through the floor.
The take away is this:
If you use a 32 bit JVM, you probably are limited to somewhere between 2**31 and 2**32 bytes of addressable memory. That's enough space for a MAXIMUM of between 2**29 and 2**30 doubles, whether you use an array or a mapped Buffer.
If you use a 64-bit JVM, you can represent a single array of 2**31 doubles. The theoretical limit for a mapped Buffer would be 2**63 bytes or 2**61 doubles, but the practical limit is roughly the amount of physical memory your machine has.
When memory mapping a file, it is possible to run out of address space in a 32-bit VM. This happens even if the file is mapped in small chunks and those ByteBuffers are no longer reachable. The reason is that GC never kicks in to free the buffers.
Refer to the bug at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6417205
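If you must work around that on an old JVM, one commonly cited trick is to release the mapping explicitly through the buffer's cleaner. This is unsupported, Sun/Oracle-JDK-specific and only works on pre-Java-9 runtimes, so treat the sketch below as a last resort rather than a supported API:

// Unsupported, implementation-specific workaround: release the mapping
// explicitly instead of waiting for GC to finalize the buffer.
static void unmap(java.nio.MappedByteBuffer buffer) {
    sun.misc.Cleaner cleaner = ((sun.nio.ch.DirectBuffer) buffer).cleaner();
    if (cleaner != null) {
        cleaner.clean(); // frees the mapped address range immediately
    }
}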
