Any way to compress a Java ArrayList?

I have a data structure:
ArrayList<String>[] a = new ArrayList[100000];
Each list holds about 1000 strings of about 100 characters each.
I'm doing a one-off job with it, and it costs a little more memory than I can bear.
I think I can get away with changing less code if I can find ways to reduce some of the memory cost, as the overshoot is not too large and it's just a one-off job. So, please tell me all the possible ways you know.
Some added info: the reason I'm using an array of ArrayLists is that the outer size, 100000, is known up front. But I don't know the size of each ArrayList before I have worked through all the data.
And the problem is indeed too much data, so I want to find ways to compress it. It's not an allocation problem; there will simply be too much data to fit in memory.

it cost a little more memory than I can bear
So, how much is "a little"?
Some quick estimates:
You have collections of 1000 strings of 100 characters each. That should be about 1000×100×2 = 200 KB of string data per collection.
If you have 100000 of those, you'll need almost 20 GB for the data alone.
Compared to the 200 KB of each collection's data, the overhead of your data structures is minuscule, even if it were 100 bytes per collection (0.05%).
So there is not much to be gained here.
Hence, the only viable ways are:
Data compression of some kind to reduce the size of the 20Gb payload
Use of external storage, e.g. by only reading in strings which are needed at the moment and then discarding them
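For the first option, here is a minimal sketch using `Deflater`/`Inflater` from `java.util.zip` in the standard library; whether this pays off depends on how repetitive your strings are, and the class and method names are just illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressedText {
    // Compress a string to a byte[]; text with repetition shrinks a lot.
    static byte[] compress(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate back to the original string only when it is needed.
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

You would store the `byte[]` in your lists and decompress each entry on access, trading CPU for memory.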
To me it is not clear whether your memory problem really comes from the data structure you showed (did you profile the program?) or from the total memory usage of the program. As I commented on another answer, resizing an array or ArrayList temporarily requires at least 2x the size of the array for the copying operation. Also note that you can create memory leaks in Java, or simply hold on to data you actually won't need again.
Edit:
A String in Java consists of an array of chars. Every char occupies two bytes.
You can convert a String to a byte[], where any ASCII character should need one byte only (non-ASCII characters will still need 2 (or more) bytes):
str.getBytes(StandardCharsets.UTF_8)
Then you make a Comparator for byte[] and you're good to go. (Notice though that byte has a range of [-128,127] which makes comparing non-intuitive in this case; you may want to compare (((int)byteValue) & 0xff).)
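Such a comparator might look like the sketch below (on Java 9+, `Arrays.compareUnsigned(byte[], byte[])` does the same job in one call):

```java
import java.util.Comparator;

public class ByteArrayComparator implements Comparator<byte[]> {
    @Override
    public int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Mask to [0, 255] so the comparison follows unsigned byte
            // order rather than Java's signed byte range [-128, 127].
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length; // on a common prefix, shorter sorts first
    }
}
```

Comparing UTF-8 bytes unsigned this way preserves the code point order of the original strings.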

Why are you using an array when you don't know the size at compile time? Not knowing the size up front is the main reason linked lists (and other growable lists) are preferable to arrays.
ArrayList<String>[] a = new ArrayList[100000];
Why are you allocating so much memory at once up front? An ArrayList resizes itself whenever required; you need not do it manually.
I think the structure below will satisfy your requirement:
List<List<String>> yourListOfStringList = new ArrayList<>();
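To connect this to the question's setup, where the inner sizes are unknown, a small get-or-create helper (the names `Buckets` and `bucket` are just for illustration) keeps the nested-list version as convenient as the array one:

```java
import java.util.ArrayList;
import java.util.List;

public class Buckets {
    // Get-or-create the i-th inner list; the outer list grows on demand,
    // so no upfront allocation of 100000 slots is needed.
    static List<String> bucket(List<List<String>> lists, int i) {
        while (lists.size() <= i) {
            lists.add(new ArrayList<>());
        }
        return lists.get(i);
    }
}
```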

Related

From InputStream to List<String>: why is Java allocating space twice in the JVM?

I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet :
try (BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
    List<String> data = reader.lines()
            .collect(Collectors.toList());
}
The problem is, the file itself is less than 2 GB, but when I look at the memory, the JVM is allocating twice the size of the file:
Also, here are the heaviest objects in memory:
So what I understand is that Java is allocating twice the memory needed for the operation: once to put the content of the file in a byte array, and once more to instantiate the string list.
My question is: can we optimize that and avoid needing twice the memory?
tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there is a good number of caveats here, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String, if and only if the characters used allow it (effectively, if they fit into the fixed internal encoding that the JVM picked for the string). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead, process and write away the data as you read it, thus never having more than a handful of records in memory at once.
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
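The second option above (process and write away as you read) could be sketched like this, assuming each record can be handled independently; the method and its counting task are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class StreamingProcessor {
    // Process one line at a time: only a single line is resident in
    // memory at once, instead of the whole file as a List<String>.
    static long countMatching(Reader source, String needle) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) count++;
                // `line` becomes unreachable here; nothing accumulates
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

With this shape, peak heap usage is bounded by the longest line rather than the file size.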

Spark streaming gc setup questions

My logic is as follows.
Use createDirectStream to get a topic by log type in Kafka.
After repartition, the log is processed through various processing.
Create a single string using combineByKey for each log type (use StringBuilder).
Finally, save to HDFS by log type.
There are a lot of operations that append strings, so GC happens frequently.
How is it better to set up GC in this situation?
//////////////////////
There is various logic, but I think the problem is in the combineByKey step.
rdd.combineByKey[StringBuilder](
  (s: String) => new StringBuilder(s),
  (sb: StringBuilder, s: String) => sb.append(s),
  (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)
The simplest thing with greatest impact you can do with that combineByKey expression is to size the StringBuilder you create so that it does not have to expand its backing character array as you merge string values into it; the resizing amplifies the allocation rate and wastes memory bandwidth by copying from old to new backing array. As a guesstimate, I would say pick the 90th percentile of string length of the resulting data set's records.
A second thing to look at (after collecting some statistics on your intermediate values) would be for your combiner function to pick the StringBuilder instance that has room to fit in the other one when you call sb1.append(sb2).
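In plain Java terms, the two suggestions might look like this sketch; the `expectedLength` parameter stands in for whatever percentile you actually measure, and the class and method names are invented for illustration:

```java
public class BuilderMerge {
    // Create a builder with enough headroom so that later appends do not
    // trigger a resize-and-copy of the backing array. The expected length
    // would come from statistics on your data (e.g. a 90th percentile).
    static StringBuilder create(String s, int expectedLength) {
        StringBuilder sb = new StringBuilder(Math.max(expectedLength, s.length()));
        sb.append(s);
        return sb;
    }

    // Merge by appending the shorter builder into the longer one, which is
    // more likely to already have spare capacity. Note this can change the
    // concatenation order, so it is only valid if record order is irrelevant.
    static StringBuilder merge(StringBuilder a, StringBuilder b) {
        return (a.length() >= b.length()) ? a.append(b) : b.append(a);
    }
}
```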
Another good step is to use Java 8; it has optimizations that make a significant difference when there is heavy work on strings and string builders.
Last but not least, profile to see where you are actually spending your cycles. This workload (excluding any additional custom processing you are doing) shouldn't need to promote a lot of objects (if any) to old generation, so you should make sure that young generation has ample size and is collected in parallel.

JVM Tuning of a Java Class

My Java class reads in a 60 MB file and produces a HashMap of HashMaps with over 300 million records.
HashMap<Integer, HashMap<Integer, Double>> pairWise =
new HashMap<Integer, HashMap<Integer, Double>>();
I already tuned the VM arguments to:
-Xms512M -Xmx2048M
But system still goes for:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.createEntry(HashMap.java:869)
at java.util.HashMap.addEntry(HashMap.java:856)
at java.util.HashMap.put(HashMap.java:484)
at com.Kaggle.baseline.BaselineNew.createSimMap(BaselineNew.java:70)
at com.Kaggle.baseline.BaselineNew.<init>(BaselineNew.java:25)
at com.Kaggle.baseline.BaselineNew.main(BaselineNew.java:315)
How big a heap will it take to run without failing with an OOME?
Your dataset is really too large to process entirely in memory, so this is not a final solution, just an optimization.
You're using boxed primitives, which is a very painful thing to look at.
According to this question, a boxed integer can be 20 bytes larger than an unboxed integer. This is not what I call memory efficient.
You can optimize this with specialized collections, which don't box the primitive values. One project providing these is Trove. You could use a TIntDoubleMap instead of your HashMap<Integer, Double> and a TIntObjectHashMap instead of your HashMap<Integer, …>.
Therefore your type would look like this:
TIntObjectHashMap<TIntDoubleHashMap> pairWise =
new TIntObjectHashMap<TIntDoubleHashMap>();
Now, do the math.
300.000.000 boxed Doubles, at about 24 bytes each, use 7.200.000.000 bytes of memory, that is 7.2 GB.
If you store 300.000.000 primitive doubles, taking 8 bytes each, you only need 2.400.000.000 bytes, which is 2.4 GB.
Congrats, you saved around two thirds of the memory you previously used for storing your numbers!
Note that this calculation is rough, depends on the platform and implementation, and does not account for the memory used for the HashMap/T*Maps.
Your data set is large enough that holding all of it in memory at one time is not going to happen.
Consider storing the data in a database and loading partial data sets to perform manipulation.
Edit: My assumption was that you were going to do more than one pass over the data. If all you are doing is loading it and performing one action on each item, then Lex Webb's suggestion (comment below) is a better solution than a database. If you are performing more than one action per item, then a database appears to be the better solution. The database does not need to be an SQL database; if your data is record-oriented, a NoSQL database might be a better fit.
You are using the wrong data structures for data of this volume. Java adds significant overhead in memory and time for every object it creates -- and at the 300 million object level you're looking at a lot of overhead. You should consider leaving this data in the file and use random access techniques to address it in place -- take a look at memory mapped files using nio.
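A minimal sketch of that memory-mapped approach with java.nio; the fixed-width layout of 8-byte doubles and the class and method names here are assumptions for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedAccess {
    // Map the file and read the i-th 8-byte double in place; the OS pages
    // data in and out on demand, so the Java heap stays small no matter
    // how large the file is.
    static double readDouble(Path file, int index) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return map.getDouble(index * 8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: write a few doubles to a temp file, 8 bytes each.
    static Path writeDemoFile(double... values) {
        try {
            Path file = Files.createTempFile("pairwise", ".bin");
            ByteBuffer buf = ByteBuffer.allocate(8 * values.length);
            for (double v : values) buf.putDouble(v);
            buf.flip();
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                ch.write(buf);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real implementation would keep the channel and mapping open across reads rather than remapping per access.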

Java Strings : how the memory works with immutable Strings

I have a simple question.
byte[] responseData = ...;
String str = new String(responseData);
String withKey = "{\"Abcd\":" + str + "}";
In the above code, are these three lines taking 3x the memory? For example, if responseData is 1 MB, will line 2 take an extra 1 MB in memory, and line 3 an extra 1 MB + xx? Is this true? If not, how does it work? If yes, what is the optimal way to fix this? Will StringBuffer help here?
Yes, that sounds about right. Probably even more, because your 1 MB byte array needs to be turned into UTF-16; depending on the encoding, it may be even bigger (2 MB if the input was ASCII).
Note that the garbage collector can reclaim memory as soon as the variables that use it go out of scope. You could set them to null as early as possible to help it make this as timely as possible (for example responseData = null; after you constructed your String).
if yes, then what is the optimal way to fix this
"Fix" implies a problem. If you have enough memory there is no problem.
the problem is that I am getting OutOfMemoryException as the byte[] data coming from server is quite big,
If you don't, you have to think about a better alternative to keeping a 1MB string in memory. Maybe you can stream the data off a file? Or work on the byte array directly? What kind of data is this?
The problem is that I am getting OutOfMemoryException as the byte[] data coming from server is quite big, thats why I need to figure it out first that am I doing something wrong ....
Yes. Well basically your fundamental problem is that you are trying to hold the entire string in memory at one time. This is always going to fail for a sufficiently large string ... even if you code it in the most optimal memory efficient fashion possible. (And that would be complicated in itself.)
The ultimate solution (i.e. the one that "scales") is to do one of the following:
stream the data to the file system, or
process it in such a way that you don't ever need the entire "string" to be represented.
You asked if StringBuffer will help. It might help a bit ... provided that you use it correctly. The trick is to make sure that you preallocate the StringBuffer (actually a StringBuilder is better!!) to be big enough to hold all of the characters required. Then copy data into it using a charset decoder (directly or using a Reader pipeline).
But even with optimal coding, you are likely to need a peak of 3 times the size of your input byte[].
Note that your OOME problem is probably nothing to do with GC or storage leaks. It is actually about the fundamental space requirements of the data types you are using ... and the fact that Java does not offer a "string of bytes" data type.
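The preallocate-then-decode idea could be sketched like this, assuming mostly-ASCII UTF-8 input so one char per byte is a safe worst case; the class and method names are invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DecodeToBuilder {
    // Decode bytes into a builder preallocated to a worst-case size
    // (one char per byte for ASCII-heavy UTF-8), so the builder never
    // has to resize-and-copy its backing array while filling up.
    static StringBuilder decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb;
    }
}
```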
There is no such OutOfMemoryException in my API docs. If it's OutOfMemoryError, especially on the server side, you definitely have a problem.
When you receive big requests from clients, those String-related statements are not the first problem; reducing 3x to 1x is not the solution. I'm sorry I can't help more without further code.
Use back-end storage
You should not store the whole request body in a byte[]. You can stream it directly to any back-end storage such as a local file, a remote database, or cloud storage.
I would copy the stream from the request to the back-end storage with a small chunked buffer.
Use streams
If you can, use streams, not objects.
I would do something like:
response.getWriter().write("{\"Abcd\":");
// copy <your back-end stored data as stream> to the writer
response.getWriter().write("}");
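With plain java.io streams (no servlet API), the same idea might look like this sketch; the "Abcd" key follows the question's example, and the class name is invented:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class JsonStreamer {
    // Write the JSON wrapper around a stored payload without ever
    // materialising the whole payload as a String on the heap.
    static void writeWrapped(InputStream payload, OutputStream out) {
        try {
            out.write("{\"Abcd\":".getBytes(StandardCharsets.UTF_8));
            payload.transferTo(out); // copies in small chunks (Java 9+)
            out.write('}');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```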
Yes, if you use a StringBuffer for the code you have, you would save 1 MB of heap space in the last step. However, considering the size of your data, I recommend an external-memory algorithm where you bring only part of your data into memory, process it, and put it back to storage.
As others have mentioned, you should really try not to have such a big Object in your mobile app, and that streaming should be your best solution.
That said, there are some techniques to reduce the amount of memory your app is using now:
Remove byte[] responseData entirely if possible, so the memory it used can be released ASAP (assuming it is not used anywhere else)
Create the largest String first, and then substring() it. Android uses Apache Harmony for its standard Java library implementation. If you check its String class implementation, you'll see that substring() is implemented simply by creating a new String object with the proper start and end offsets into the original data; no duplicate copy is created. So doing the following cuts the overall memory consumption by at least 1/3:
String withKey = new StringBuilder().append("{\"Abcd\":").append(str).append("}").toString();
String str = withKey.substring("{\"Abcd\":".length(), withKey.length() - "}".length());
Never ever use something like "{\"Abcd\":" + str + "}" for large Strings. Under the hood, "string_a" + "string_b" is implemented as new StringBuilder().append("string_a").append("string_b").toString(), so implicitly you are creating two (or at least one, if the compiler is smart) StringBuilders. For large Strings it's better to take over this process yourself, as you have deep domain knowledge about your program that the compiler doesn't, and you know how best to manipulate the strings.

Why is my HashSet so memory-consuming?

I found out that the memory use of my program is increasing because of the code below. I am reading a file that is about 7 GB big, and I believe what ends up stored in the HashSet is less than 10 MB, but the program's memory keeps increasing to 300 MB and then crashes with an OutOfMemoryError. If the HashSet is the problem, which data structure should I choose?
if (tagsStr != null) {
    if (tagsStr.contains("a") || tagsStr.contains("b") || tagsStr.contains("c")) {
        maTable.add(postId);
    }
} else {
    if (maTable.contains(parentId)) {
        // do sth else, no memories added here
    }
}
You haven't really told us what you're doing, but:
If your file is currently in something like ASCII, each character you read will be one byte in the file but two bytes in memory.
Each string will have an object overhead - this can be significant if you're storing lots of small strings
If you're reading lines with BufferedReader (or taking substrings from large strings), each one may have a large backing buffer - you may want to use maTable.add(new String(postId)) to avoid this
Each entry in the hash set needs a separate object to keep the key/hashcode/value/next-entry values. Again, with a lot of entries this can add up
In short, it's quite possible that you're doing nothing wrong, but a combination of memory-increasing factors is working against you. Most of these are unavoidable, but the third one may be relevant.
You've either got a memory leak or your understanding of the amount of string data that you are storing is incorrect. We can't tell which without seeing more of your code.
The scientific solution is to run your application using a memory profiler, and analyze the output to see which of your data structures is using an unexpectedly large amount of memory.
If I was to guess, it would be that your application (at some level) is doing something like this:
String line;
while ((line = br.readLine()) != null) {
    // search for tag in line
    String tagStr = line.substring(pos1, pos2);
    // code as per your example
}
This uses a lot more memory than you'd expect. The substring(...) call creates a tagStr object that refers to the backing array of the original line string. Your tag strings that you expect to be short actually refer to a char[] object that holds all characters in the original line.
The fix is to do this:
String tagStr = new String(line.substring(pos1, pos2));
This creates a String object that does not share the backing array of the argument String.
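As a runnable sketch of that fix (the tag layout, class name, and method name are made up for illustration):

```java
public class SubstringCopy {
    // Take a small tag out of a long line. On Harmony-era (old Android)
    // and pre-7u6 Oracle JVMs, substring() shares the backing char[] of
    // `line`; the explicit new String(...) copies just the tag's chars so
    // the long line can be garbage collected. (Modern JVMs already copy
    // inside substring(), making the outer new String(...) redundant.)
    static String extractTag(String line, int pos1, int pos2) {
        return new String(line.substring(pos1, pos2));
    }
}
```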
UPDATE - this or something similar is an increasingly likely explanation ... given your latest data.
To expand on another of Jon Skeet's points, the overheads of a small String are surprisingly high. For instance, on a typical 32-bit JVM, the memory usage of a one-character String is:
String object header: 2 words
String object fields: 3 words
Padding: 1 word (I think)
Backing array object header: 3 words
Backing array data: 1 word
Total: 10 words - 40 bytes - to hold one char of data ... or one byte of data if your input is in an 8-bit character set.
(This is not sufficient to explain your problem, but you should be aware of it anyway.)
Couldn't it be possible that the data read into memory (from the 7 GB file) is somehow not freed? Something like Jon says: since strings are immutable, every string read requires a new String object, which might lead to out-of-memory if GC is not quick enough...
If that is the case, you might insert some 'breakpoints' into your code/iteration, i.e. at some defined points, trigger GC and wait until it finishes.
Run your program with -XX:+HeapDumpOnOutOfMemoryError. You'll then be able to use a memory analyser like MAT to see what is using up all of the memory - it may be something completely unexpected.
