Performance in reading data to memory in Java

I am trying to read a 512MB file into java memory. Here is my code:
String url_part = "/homes/t1.csv";
File f = new File(url_part);
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
ArrayList<String> mem = new ArrayList<String>();
System.out.println("Start loading.....");
System.gc();
double start = System.currentTimeMillis();
String line = br.readLine();
int count = 0;
while (line != null) {
    mem.add(line);           // store the current line before reading the next one
    line = br.readLine();
    count++;
    if (count % 500000 == 0) {
        System.out.println(count);
    }
}
The file contains 40,000,000 lines. Performance is fine up to about 18,500,000 lines, but the program gets stuck somewhere after reading about 20,000,000 lines. (It freezes there, but continues after a long wait of about 10 seconds.)
I kept track of memory use: even though the file is only 512 MB, memory grows to about 2 GB while the program runs. Also, the 8-core CPU keeps working at 100% utilization.
I just want to read the file into memory so that I can later access the data I want faster. Am I doing this the right way? Thanks!

First, Java stores strings in UTF-16, so if your input file contains mostly Latin-1 characters, you will need twice as much memory to store them; thus about 1 GB is used just for the chars. Second, there is an overhead per line. We can roughly estimate it:
Reference from ArrayList to String - 4 bytes (assuming compressed oops)
Reference from String to char[] array - 4 bytes
String object header - at least 8 bytes
hash String field (to store hashCode) - 4 bytes
char[] object header - at least 8 bytes
char[] array length - 4 bytes
So in total at least 32 bytes are wasted per line. Usually it's more, as objects must be padded. So for 20,000,000 lines you have at least 640,000,000 bytes of overhead.
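If the goal is just fast access to the data, one way to avoid most of that per-line overhead is to keep the raw bytes in a single array plus an int index of line starts. A minimal sketch, assuming the file is single-byte encoded (Latin-1/ASCII) and fits in one array; the index sizing is a guess:
byte[] data = Files.readAllBytes(Paths.get("/homes/t1.csv"));
int[] lineStarts = new int[41_000_000];   // rough upper bound on the line count
int lines = 0;
lineStarts[lines++] = 0;
for (int i = 0; i < data.length; i++) {
    if (data[i] == '\n') lineStarts[lines++] = i + 1;
}
// Decode a single line only when it is actually needed:
int n = 5;                                // hypothetical line number
int from = lineStarts[n];
int to = (n + 1 < lines) ? lineStarts[n + 1] - 1 : data.length;
String line = new String(data, from, to - from, StandardCharsets.ISO_8859_1);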

Related

High memory usage with Files.lines

I've found a few other questions on SO that are close to what I need but I can't figure this out. I'm reading a text file line by line and getting an out of memory error. Here's the code:
System.out.println("Total memory before read: " + Runtime.getRuntime().totalMemory()/1000000 + "MB");
String wp_posts = new String();
try(Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)){
wp_posts = stream
.filter(line -> line.startsWith("INSERT INTO `wp_posts`"))
.collect(StringBuilder::new, StringBuilder::append,
StringBuilder::append)
.toString();
} catch (Exception e1) {
System.out.println(e1.getMessage());
e1.printStackTrace();
}
try {
System.out.println("wp_posts Mega bytes: " + wp_posts.getBytes("UTF-8").length/1000000);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
System.out.println("Total memory after read: " + Runtime.getRuntime().totalMemory()/1000000 + "MB");
Output is like (when run in an environment with more memory):
Total memory before read: 255MB
wp_posts Mega bytes: 18
Total memory after read: 1035MB
Note that in my production environment, I cannot increase the memory heap.
I've tried explicitly closing the stream, forcing a GC, and putting the stream in parallel mode (which consumed more memory).
My questions are:
Is this amount of memory usage expected?
Is there a way to use less memory?
Your problem is in collect(StringBuilder::new, StringBuilder::append, StringBuilder::append). When you append something to a StringBuilder whose internal array is too small, it doubles the array and copies the content over from the old one.
Use new StringBuilder(int capacity) to pre-size the internal array.
The second problem is that you have a big file but collect the result into a single StringBuilder. This is very strange to me: it is effectively the same as reading the whole file into one String without using a Stream at all.
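A minimal sketch of the pre-sizing idea (the capacity value is a made-up estimate, and exception handling is elided):
int estimatedChars = 20_000_000;          // hypothetical estimate of the matched content size
String wp_posts;
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
    wp_posts = stream
            .filter(line -> line.startsWith("INSERT INTO `wp_posts`"))
            .collect(() -> new StringBuilder(estimatedChars),   // pre-sized, no doubling
                     StringBuilder::append,
                     StringBuilder::append)
            .toString();
}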
Your Runtime.totalMemory() calculation is pointless if you are allowing the JVM to resize the heap. Java will allocate heap memory as needed, as long as it doesn't exceed the -Xmx value. Since the JVM is smart, it won't allocate heap memory one byte at a time, because that would be very expensive; instead, it requests a larger amount of memory at a time (the actual value is platform and JVM implementation specific).
Your code is currently loading the content of the file into memory, so objects will be created on the heap. Because of that, the JVM will most likely request memory from the OS and you will observe an increased Runtime.totalMemory() value.
Try running your program with a strictly sized heap, e.g. by adding the -Xms300m -Xmx300m options. If you don't get an OutOfMemoryError, decrease the heap until you do. However, you also need to pay attention to GC cycles; these things go hand in hand and are a trade-off.
Alternatively you can create a heap dump after the file is processed and then explore the data with MemoryAnalyzer.
The way you calculated memory is incorrect for the following reasons:
You have taken the total memory (not the used memory). The JVM allocates memory lazily, and when it does, it does so in chunks. So when it needs an additional 1 byte of memory, it may allocate 1 MB (provided the total does not exceed the configured max heap size). Thus a good portion of the allocated heap may remain unused. Therefore, you need to calculate the used memory: Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()
A good portion of the memory you see with the above formula may be ready for garbage collection. The JVM will definitely garbage-collect before saying OutOfMemory. Therefore, to get a realistic number, you should call System.gc() before calculating the used memory. Of course, you don't call gc in production, and calling it does not guarantee that the JVM will actually trigger a collection, but for testing purposes it works well enough (see the measurement sketch after this list).
You got the OutOfMemory while the stream processing was still in progress. At that point the String had not been formed yet and the StringBuilder held a strong reference to the data. To estimate its footprint, call the capacity() method of the StringBuilder to get the actual number of char elements in its internal array, then multiply by 2 to get the number of bytes, because Java internally uses UTF-16, which needs 2 bytes even for an ASCII character.
Finally, the way your code is written (i.e. not specifying a big enough initial size for the StringBuilder), every time the StringBuilder runs out of space it doubles the size of the internal array by creating a new array and copying the content over. This means that, at that moment, roughly triple the size of the actual data is allocated. You cannot measure this directly because it happens inside the StringBuilder class, and by the time control returns from it the old array is ready for garbage collection. So there is a high chance that when you get the OutOfMemory error, you get it at the point where the StringBuilder tries to allocate a double-sized array, or more specifically in the Arrays.copyOf method.
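Putting points 1 and 2 together, a rough used-memory measurement looks like this sketch (System.gc() is only a hint to the JVM):
Runtime rt = Runtime.getRuntime();
System.gc();                              // hint only; fine for a rough test, never for production
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println("Used heap: " + usedBytes / 1_000_000 + " MB");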
How much memory is expected to be consumed by your program as is? (A rough estimate)
Let's consider a program similar to yours.
public static void main(String[] arg) {
    // Initialize the ArrayList to emulate a
    // file with 32 lines, each containing
    // 1000 ASCII characters
    List<String> strList = new ArrayList<String>(32);
    for (int i = 0; i < 32; i++) {
        strList.add(String.format("%01000d", i));
    }

    StringBuilder str = new StringBuilder();
    strList.stream().map(element -> {
        // Print the number of chars
        // reserved by the StringBuilder
        System.out.print(str.capacity() + ", ");
        return element;
    }).collect(() -> {
        return str;
    }, (response, element) -> {
        response.append(element);
    }, (response, element) -> {
        response.append(element);
    }).toString();
}
Here after every append, I'm printing the capacity of the StringBuilder.
The output of the program is as follows:
16, 1000, 2002, 4006, 4006, 8014, 8014, 8014, 8014,
16030, 16030, 16030, 16030, 16030, 16030, 16030, 16030,
32062, 32062, 32062, 32062, 32062, 32062, 32062, 32062,
32062, 32062, 32062, 32062, 32062, 32062, 32062,
If your file has n lines (where n is a power of 2) and each line has an average of m ASCII characters, the capacity of the StringBuilder at the end of the program execution will be approximately (n * m + 2^(a+1)), where 2^a = n.
E.g. if your file has 256 lines and an average of 1500 ASCII characters per line, the total capacity of the StringBuilder at the end of the program will be: 256 * 1500 + 2^9 = 384,512 characters.
Assuming you have only ASCII characters in your file, each character occupies 2 bytes in the UTF-16 representation. Additionally, every time the StringBuilder's array runs out of space, a new array twice the size of the original is created (see the capacity growth numbers above), the content of the old array is copied over, and the old array is left for garbage collection. Therefore, if you add another 2^(a+1) (here 2^9) characters, the StringBuilder creates a new array to hold (n * m + 2^(a+1)) * 2 + 2 characters and starts copying the old array's content into it. Thus, while the copying goes on, there are two big arrays inside the StringBuilder.
Thus the total memory will be: 384512 * 2 + (384512 * 2 + 2) * 2 = 2,307,076 bytes = 2.2 MB (approx.) to hold only 0.7 MB of data.
I have ignored the other memory-consuming items like array headers, object headers, references etc., as those are negligible or constant compared to the array size.
So, in conclusion: 256 lines with 1500 characters each consume 2.2 MB (approx.) to hold only 0.7 MB of data; the actual data is only a third of the memory used.
If you had initialized the StringBuilder with a size of 384,512 at the beginning, you could have accommodated the same number of characters in a third of the memory, and there would also have been much less work for the CPU in terms of array copying and garbage collection.
What you may consider doing instead
Finally, for this kind of problem you may want to work in chunks: write the content of your StringBuilder to a file or database as soon as it has processed, say, 1000 records, then clear the StringBuilder and start over for the next batch. That way you never hold more than 1000 records' worth of data in memory, as in the sketch below.
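A rough sketch of that batching idea (the batch size and output file name are assumptions for illustration; exception handling is elided):
int batchSize = 1000;                      // hypothetical batch size
StringBuilder buf = new StringBuilder();
int count = 0;
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8);
     BufferedWriter out = Files.newBufferedWriter(Paths.get("wp_posts.out"))) {
    Iterator<String> it = stream
            .filter(line -> line.startsWith("INSERT INTO `wp_posts`"))
            .iterator();
    while (it.hasNext()) {
        buf.append(it.next());
        if (++count % batchSize == 0) {
            out.write(buf.toString());     // flush the finished batch to disk
            buf.setLength(0);              // reuse the builder for the next batch
        }
    }
    out.write(buf.toString());             // write whatever is left
}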

Can InputStream.read overflow the buffer

Does read() check the size of the buffer when filling it with data, or is there a chance that data is lost because the buffer isn't big enough? In other words, if there are ten bytes of data available to be read, will the server continue to store the remaining 2 bytes of data until the next read?
I'm just using 8 as an example here to overdramatise the situation.
InputStream stdout;
...
while (condition) {
    ...
    byte[] buffer = new byte[8];
    int len = stdout.read(buffer);
}
No, read() won't lose any data just because you haven't given it enough space for all the available bytes.
It's not clear what you mean by "the server" here, but the final two bytes of a 10-byte message would be available after the first read. (Or possibly the first read() would only read the first six bytes, leaving four still to read, for example.)
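A standard loop for this looks something like the following sketch (process is a hypothetical handler, not part of the question's code):
byte[] buffer = new byte[8];
int len;
while ((len = stdout.read(buffer)) != -1) {
    // Only the first len bytes of buffer are valid on this iteration;
    // anything not yet read stays in the stream for the next call.
    process(buffer, 0, len);               // hypothetical handler
}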

Android Inserting words into ArrayList, out of memory

I have two files: a dictionary containing words of length 3 to 6, and a dictionary containing words of length 7. The words are stored in text files, separated by newlines. This method loads a file and inserts its words into an ArrayList, which I store in an application class.
The file sizes are 386 KB and 380 KB, and each contains fewer than 200k words.
private void loadDataIntoDictionary(String filename) throws Exception {
    Log.d(TAG, "loading file: " + filename);
    AssetFileDescriptor descriptor = getAssets().openFd(filename);
    FileReader fileReader = new FileReader(descriptor.getFileDescriptor());
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String word = null;
    int i = 0;
    MyApp appState = ((MyApp) getApplicationContext());
    while ((word = bufferedReader.readLine()) != null) {
        appState.addToDictionary(word);
        i++;
    }
    Log.d(TAG, "added " + i + " words to the dictionary");
    bufferedReader.close();
}
The program crashes on an emulator running 2.3.3 with a 64MB sd card.
The errors reported by logcat:
The heap grows past 24 MB. I then see clamp target GC heap from 25.XXX to 24.000 MB.
GC_FOR_MALLOC freed 0K, 12% free, external 1657k/2137K, paused 208ms.
GC_CONCURRENT freed XXK, 14% free
Out of memory on a 24-byte allocation and then FATAL EXCEPTION, memory exhausted.
How can I load these files without getting such a large heap?
Inside MyApp:
private ArrayList<String> dictionary = new ArrayList<String>();

public void addToDictionary(String word) {
    dictionary.add(word);
}
Irrespective of any other problems or bugs, ArrayList can be very wasteful for this kind of storage, because when a growing ArrayList runs out of space it doubles the size of its underlying storage array. So it's possible that nearly half of your storage is wasted. If you can pre-size a storage array or ArrayList to the correct size, you may get a significant saving.
Also (with my paranoid data-cleansing hat on), make sure there's no extra whitespace in your input files; you can use String.trim() on each word if necessary, or clean up the input files first. But I don't think this can be a significant problem given the file sizes you mention.
I'd expect your inputs to take less than 2 MB to store the text itself (remember that Java uses UTF-16 internally, so it typically takes 2 bytes per character), but there's maybe 1.5 MB of overhead for the String object references, plus 1.5 MB for the String lengths, and possibly the same again and again for the offset and hashcode fields (take a look at String.java)... whilst 24 MB of heap still sounds a little excessive, it's not far off if you are getting the near-doubling effect of an unlucky ArrayList resize.
In fact, rather than speculate, how about a test? The following code, run with -Xmx24M gets to about 560,000 6-character Strings before stalling (on a Java SE 7 JVM, 64-bit). It eventually crawls up to around 580,000 (with much GC thrashing, I imagine).
ArrayList<String> list = new ArrayList<String>();
int x = 0;
while (true) {
    list.add(new String("123456"));
    if (++x % 1000 == 0) System.out.println(x);
}
So I don't think there's a bug in your code: storing large numbers of small Strings is just not very efficient in Java. For the test above it takes over 7 bytes per character because of all the overheads (which may differ between 32-bit and 64-bit machines, incidentally, and depend on JVM settings too)!
You might get slightly better results by storing an array of byte arrays rather than an ArrayList of Strings, as sketched below. There are also more efficient data structures for storing strings, such as tries.
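A minimal sketch of the byte-array idea (assuming ASCII input; the pre-sized capacity is a guess based on the stated word counts, and the reader setup is as in the question):
List<byte[]> dictionary = new ArrayList<byte[]>(200000);  // pre-sized, per the advice above
String word;
while ((word = bufferedReader.readLine()) != null) {
    // UTF-8 stores ASCII words at 1 byte per character,
    // versus 2 bytes per char inside a String
    dictionary.add(word.trim().getBytes("UTF-8"));
}
// Decode on demand when a word is actually needed:
String first = new String(dictionary.get(0), "UTF-8");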

Size of a char on Android

According to the Java specification, the size of the char data type is 16 bits, or two bytes.
So I have written the following code:
private static final int BUFFER_SIZE = 1024;

char[] buffer = new char[BUFFER_SIZE];
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
int byteFromStream;
long totalBytesLoaded = 0;
while (true) {
    byteFromStream = br.read(buffer);
    if (byteFromStream == -1) break;
    totalBytesLoaded = totalBytesLoaded + byteFromStream * 2;
}
But for some strange reason I am reading more bytes than are available on the stream, even though according to the specification read() returns the number of characters actually read from the stream.
Oh, and I am getting the total stream size via
bytesTotal = conn.getContentLength();
which works pretty well, as I uploaded the files to the server myself and know their sizes.
The method returns the number of characters read. That value cannot simply be multiplied by 2, especially since you cannot make a general assumption about the byte size of a character coming from a stream.
The number of bytes per character depends on the character encoding (it can be 1 byte, for example). The reader component knows this and only tells you the number of characters read.
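If what you actually need is the number of bytes consumed from the connection, one option is to count at the byte level, before the reader decodes anything. A sketch (the counting wrapper is my own illustration, not a JDK or Android API):
// Counts raw bytes as they pass through, before any charset decoding
class CountingInputStream extends FilterInputStream {
    long bytesRead = 0;
    CountingInputStream(InputStream in) { super(in); }
    @Override public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }
    @Override public int read(byte[] b, int off, int len) throws IOException {
        int n = super.read(b, off, len);
        if (n != -1) bytesRead += n;
        return n;
    }
}

CountingInputStream counter = new CountingInputStream(conn.getInputStream());
BufferedReader br = new BufferedReader(new InputStreamReader(counter));
// ...read as before; afterwards counter.bytesRead holds the true byte count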

Read a very long string from console using Java Scanner takes time?

Currently I'm writing a console program that reads one line containing a very long String, using java Scanner.
The sample data looks like this:
50,000 integers on one line, separated by whitespace,
"11 23 34 103 999 381 ....." up to the 50,000th integer
This data is entered by the user via the console, not read from a file.
Here's my code:
System.out.print("Input of integers : ");
Scanner sc = new Scanner(System.in);
long start = System.currentTimeMillis();
String Z = sc.nextLine();
long end = System.currentTimeMillis();
System.out.println("String Z created in "+(end-start)+"ms, Z character length is "+Z.length()+" characters");
Then I execute it, and this is the result I get:
String Z created within 49747ms, Z character length is 194539 characters
My question is: why does it take so long?
Is there any faster way to read a very long string?
I have tried BufferedReader, but it's not much different:
String Z created within 41881ms, Z character length is 194539 characters
It looks like Scanner uses a regular expression to match the end of line; this is likely causing the inefficiency, especially since you're matching the regex against a String roughly 200k characters long.
The pattern used is, effectively, .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$
My guess would be memory allocation: as Scanner reads the line, it fills an internal char buffer, which gets larger and larger, so the text read so far has to be copied again and again. Each time it makes the internal buffer N times larger, so it is not atrociously slow, but for your huge line it still is slow.
And the processing of the regexp itself does not help either. But my guess is that the reallocation and copying are the main source of the slowdown.
And maybe it needs to run a GC to free memory for the new allocations, which is another slowdown.
You can test my hypothesis by copying the Scanner class and changing BUFFER_SIZE to be equal to your line length (or larger, to be sure).
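A related experiment that sidesteps both the regex and the repeated reallocation is to give a BufferedReader a large buffer up front (a sketch; the 1 MB buffer size is an arbitrary assumption comfortably above the expected line length). Note the question reports plain BufferedReader was not much faster, so if this is also slow the bottleneck may be the console input itself rather than the Java side:
// Pre-sized reader: the whole line fits in the buffer, so there is no regex
// matching and far less internal reallocation than Scanner's default path
BufferedReader in = new BufferedReader(new InputStreamReader(System.in), 1 << 20);
long start = System.currentTimeMillis();
String z = in.readLine();
long end = System.currentTimeMillis();
System.out.println("Read " + z.length() + " chars in " + (end - start) + " ms");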
