My logic is as follows.
Use createDirectStream to get a topic by log type in Kafka.
After repartitioning, the logs go through various processing steps.
Create a single string using combineByKey for each log type (use StringBuilder).
Finally, save to HDFS by log type.
There are a lot of operations that append strings, so GC happens frequently.
How should I configure GC in this situation?
//////////////////////
There is various logic involved, but I think the problem is in the combineByKey step.
rdd.combineByKey[StringBuilder](
  (s: String) => new StringBuilder(s),
  (sb: StringBuilder, s: String) => sb.append(s),
  (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)
The simplest change with the greatest impact you can make to that combineByKey expression is to size the StringBuilder you create so that it does not have to grow its backing character array as you merge string values into it; the resizing amplifies the allocation rate and wastes memory bandwidth by copying from the old backing array to the new one. As a guesstimate, I would pick the 90th percentile of the string lengths in the resulting data set's records.
A second thing to look at (after collecting some statistics on your intermediate values) would be for your combiner function to pick the StringBuilder instance that has room to fit the other one when you call sb1.append(sb2); see the sketch below.
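For illustration only, here is a rough sketch of both ideas using Spark's Java API (the helper name is made up and estimatedLength is a placeholder for the percentile you measure):
import org.apache.spark.api.java.JavaPairRDD;

// Sketch only: presize each combiner, and append the smaller builder into the one with spare room.
static JavaPairRDD<String, String> concatByKey(JavaPairRDD<String, String> rdd, int estimatedLength) {
    JavaPairRDD<String, StringBuilder> combined = rdd.combineByKey(
            s -> new StringBuilder(Math.max(estimatedLength, s.length())).append(s),
            (sb, s) -> sb.append(s),
            (sb1, sb2) -> sb1.capacity() - sb1.length() >= sb2.length()
                    ? sb1.append(sb2)
                    : sb2.append(sb1)); // fragment order may flip; fine if ordering does not matter
    return combined.mapValues(StringBuilder::toString);
}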
Another good thing to take care of is to use Java 8; it has optimizations that make a significant difference when there is heavy work on strings and string buffers.
Last but not least, profile to see where you are actually spending your cycles. This workload (excluding any additional custom processing you are doing) shouldn't need to promote a lot of objects (if any) to old generation, so you should make sure that young generation has ample size and is collected in parallel.
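As an illustration only (the flag values are starting points to measure against, not a tuned recommendation), requesting a larger, parallel-collected young generation on the executors could look something like:
--conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC -XX:NewRatio=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"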
Related
I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream into a List<String>. I do that via the following snippet:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
    List<String> data = reader.lines()
        .collect(Collectors.toList());
}
The problem is, the file itself is less than 2 GB, but when I look at the memory, the JVM is allocating twice the size of the file:
Also, here are the heaviest objects in memory:
So what I understand is that Java is allocating twice the memory needed for the operation: once to put the content of the file in a byte array, and once more to instantiate the String list.
My question is: can we optimize that and avoid needing twice the memory?
tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there are a good number of caveats to this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively, if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead, process and write away the data as you read it, so you never hold more than a handful of records in memory at once (see the sketch after this list).
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
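A minimal sketch of the streaming option above (the UTF-8 charset and the callback are assumptions; adapt both to your data):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Sketch: each line is handed to a callback as it is read, so at most one line is held at a time.
static void processLines(InputStream zipInputStream, Consumer<String> handleLine) throws IOException {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(zipInputStream, StandardCharsets.UTF_8))) {
        reader.lines().forEach(handleLine);
    }
}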
I have an application that uses a ton of String objects. One of my objects (let's call it Person) contains 9 of them. The data written to each String object is never written more than once, but will be read several times afterwards. There will be several hundred thousand or so Person objects at a given time, and many of these Person objects will share first name, last name, etc...
I am trying to think of immediate ways to reduce the amount of memory consumed by the Person object, but I am no expert when it comes to how Java manages its memory underneath.
Before I go down this rabbit hole, I would like to know what drawbacks there would be if I went down these paths, and whether it even makes sense in the first place:
Using StringBuilder or StringBuffer solely because of the trimToSize() method, which would allow me to reduce the number of bytes allocated for the string.
Store the strings as byte[] arrays and provide a getter that converts the byte[] to String and a setter that accepts a String and converts it to byte[] - the data is read quite a bit, so would this be too expensive?
Create a hash table for (let's just say) "names" that would prevent duplicate allocations (using a pointer) for the same name over and over (there could be thousands of names with 10+ characters).
Before I pointlessly head down any of these roads, does it make sense to do so? Maybe Java is already reducing String allocations and checking for duplicates?
I don't mind a good read either. I have found some documentation, but nothing that explores the topic to this depth.
Obviously, StringBuilder and StringBuffer won't help in this case. String is an immutable object, so these two classes were introduced for building Strings, not for storing them. That said, you may (in most cases, must) use StringBuilder when you concatenate Strings or insert/delete characters in the middle of them.
In my opinion, the second option could lead to increased memory consumption, because a new String will be created every time the byte[] is converted back.
A handwritten StringDeduplicator is a very reasonable solution, especially if you are stuck on Java 5, 6, or 7.
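A minimal sketch of what such a deduplicator could look like (the class and method names are made up, not from any library):
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: returns a canonical instance so equal strings share one object.
public final class StringDeduplicator {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    public String dedup(String s) {
        if (s == null) {
            return null;
        }
        String canonical = pool.putIfAbsent(s, s);
        return canonical != null ? canonical : s;
    }
}
A call like person.setFirstName(dedup.dedup(firstName)) (the names here are hypothetical) then makes all equal names point to a single String instance.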
Java 8/9 has a String Deduplication option. By default, this option is disabled. To use it in Java 8, you must enable the G1 garbage collector, while in Java 9 G1 is the default.
-XX:+UseStringDeduplication
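On Java 8 that means running with both flags together, for example:
-XX:+UseG1GC -XX:+UseStringDeduplication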
Regarding String Deduplication, see:
JEP 192: String Deduplication in G1
Java 8 Update 20 Release Notes
Other Stack Overflow posts
I have a data structure:
ArrayList<String>[] a = new ArrayList[100000];
each list has about 1000 strings with about 100 characters.
I'm doing a one-off job with it, and it cost a little more memory than I can bear.
I think I can change less code if I can find ways to reduce some of the memory cost, as the overhead is not too much and it's just a one-off job. So please tell me all the possible ways you know.
Some additional info: the reason I'm using an array of ArrayLists is that 100000 is the size I know up front, but I don't know the size of each ArrayList before I have worked through all the data.
And the problem is indeed too much data, so I want to find ways to compress it. It's not an allocation problem; in the end there is simply too much data for the available memory.
it cost a little more memory than I can bear
So, how much is "a little"?
Some quick estimates:
You have collections of strings of 1000 x 100 characters. That should be about 1000 x 100 x 2 bytes = 200 KB of string data per collection.
If you have 100000 of those, you'll need almost 20 GB for the data alone.
Compared to the 200 KB of each collection's data, the overhead of your data structures is minuscule, even if it were 100 bytes per collection (0.05%).
So, not much to be gained here.
Hence, the only viable ways are:
Data compression of some kind to reduce the size of the 20Gb payload
Use of external storage, e.g. by only reading in strings which are needed at the moment and then discarding them
To me, it is not clear if your memory problem really comes from the data structure you showed (did you profile the program?) or from the total memory usage of the program. As I commented on another answer, resizing an array(list) for instance temporarily requires at least 2x the size of the array(list) for the copying operation. Then notice that you can create memory leaks in Java - or just be holding on to data you actually won't need again.
Edit:
A String in Java consists of an array of chars. Every char occupies two bytes.
You can convert a String to a byte[], where any ASCII character should need one byte only (non-ASCII characters will still need 2 (or more) bytes):
str.getBytes(Charset.forName("UTF-8"))
Then you make a Comparator for byte[] and you're good to go. (Notice though that byte has a range of [-128,127] which makes comparing non-intuitive in this case; you may want to compare (((int)byteValue) & 0xff).)
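For example, such a comparator could look like this (a sketch; it orders the byte arrays lexicographically with each byte treated as unsigned):
import java.util.Comparator;

// Compares the UTF-8 bytes as unsigned values; on a common prefix the shorter array sorts first.
Comparator<byte[]> byteOrder = (x, y) -> {
    int n = Math.min(x.length, y.length);
    for (int i = 0; i < n; i++) {
        int cmp = (x[i] & 0xff) - (y[i] & 0xff);
        if (cmp != 0) {
            return cmp;
        }
    }
    return Integer.compare(x.length, y.length);
};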
Why are you using an array when you don't know the size at compile time? Not knowing the size up front is the main reason linked lists are preferable to arrays.
ArrayList<String>[] a = new ArrayList[100000];
Why are you allocating so much memory at once initially? An ArrayList will resize itself whenever required; you need not do it manually.
I think the structure below will satisfy your requirement:
List<List<String>> yourListOfStringList = new ArrayList<>();
My Java class reads in a 60 MB file and produces a HashMap of HashMaps with over 300 million records.
HashMap<Integer, HashMap<Integer, Double>> pairWise =
new HashMap<Integer, HashMap<Integer, Double>>();
I have already tuned the VM arguments to:
-Xms512M -Xmx2048M
But the system still fails with:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.createEntry(HashMap.java:869)
at java.util.HashMap.addEntry(HashMap.java:856)
at java.util.HashMap.put(HashMap.java:484)
at com.Kaggle.baseline.BaselineNew.createSimMap(BaselineNew.java:70)
at com.Kaggle.baseline.BaselineNew.<init>(BaselineNew.java:25)
at com.Kaggle.baseline.BaselineNew.main(BaselineNew.java:315)
How big a heap would it take to run without failing with an OOME?
Your dataset is ridiculously large to process entirely in memory; this is not a final solution, just an optimization.
You're using boxed primitives, which is a very painful thing to look at.
According to this question, a boxed integer can be 20 bytes larger than an unboxed integer. This is not what I call memory efficient.
You can optimize this with specialized collections, which don't box the primitive values. One project providing these is Trove. You could use a TIntDoubleMap instead of your HashMap<Integer, Double> and a TIntObjectHashMap instead of your HashMap<Integer, …>.
Therefore your type would look like this:
TIntObjectHashMap<TIntDoubleHashMap> pairWise =
new TIntObjectHashMap<TIntDoubleHashMap>();
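For completeness, a small sketch of how a nested put could look with those Trove types (the package names are Trove 3's; the helper is purely illustrative):
import gnu.trove.map.hash.TIntDoubleHashMap;
import gnu.trove.map.hash.TIntObjectHashMap;

// Illustrative only: store (i, j) -> value without boxing the keys or the double.
static void put(TIntObjectHashMap<TIntDoubleHashMap> pairWise, int i, int j, double value) {
    TIntDoubleHashMap inner = pairWise.get(i);   // no Integer boxing on lookup
    if (inner == null) {
        inner = new TIntDoubleHashMap();
        pairWise.put(i, inner);
    }
    inner.put(j, value);                         // primitive int/double entry
}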
Now, do the math.
300,000,000 boxed Doubles, at 24 bytes each, use 7,200,000,000 bytes of memory, that is 7.2 GB.
If you store 300,000,000 primitive doubles, taking 8 bytes each, you only need 2,400,000,000 bytes, which is 2.4 GB.
Congrats, you saved roughly two thirds of the memory you previously used for storing your numbers!
Note that this calculation is rough, depends on the platform and implementation, and does not account for the memory used for the HashMap/T*Maps.
Your data set is large enough that holding all of it in memory at one time is not going to happen.
Consider storing the data in a database and loading partial data sets to perform manipulation.
Edit: My assumption was that you were going to do more than one pass over the data. If all you are doing is loading it and performing one action on each item, then Lex Webb's suggestion (comment below) is a better solution than a database. If you are performing more than one action per item, then a database appears to be a better solution. The database does not need to be an SQL database; if your data is record oriented, a NoSQL database might be a better fit.
You are using the wrong data structures for data of this volume. Java adds significant overhead in memory and time for every object it creates -- and at the 300 million object level you're looking at a lot of overhead. You should consider leaving this data in the file and using random access techniques to address it in place -- take a look at memory-mapped files using nio.
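A minimal sketch of that approach (the file name and the fixed int-plus-double record layout are assumptions, not part of the question):
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch: read records in place from a mapped file instead of building millions of objects.
// Note that a single mapping is limited to 2 GB; larger files need several mappings.
static void scan(String path) throws IOException {
    try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        while (buf.remaining() >= 12) {
            int key = buf.getInt();         // assumed layout: 4-byte key
            double value = buf.getDouble(); // followed by an 8-byte value
            // ... use key/value ...
        }
    }
}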
I'm having this problem: I'm reading 900 files and, after processing them, my final output will be a HashMap<String, HashMap<String, Double>>. The first string is the file name, the second string is the word, and the double is the word frequency. The processing order is as follows:
read the first file
read the first line of the file
split the important tokens to a string array
copy the string array to my final map, incrementing word frequencies
repeat for all files
I'm using a BufferedReader. The problem is that after processing the first few files, the hash becomes so big that performance drops off badly. I would like to hear solutions for this. My idea is to create a size-limited hash; once the limit is reached, store it to a file, repeat until everything is processed, and merge all the hashes at the end.
Why not just read one file at a time, and dump that file's results to disk, then read the next file etc? Clearly each file is independent of the others in terms of the mapping, so why keep the results of the first file while you're writing the second?
You could possibly write the results for each file to another file (e.g. foo.txt => foo.txt.map), or you could create a single file with some sort of delimiter between results, e.g.
==== foo.txt ====
word - 1
the - 3
get - 3
==== bar.txt ====
apple - 2
// etc
By the way, why are you using double for the frequency? Surely it should be an integer value...
The time for a hash map operation shouldn't increase significantly as the map grows. It is possible that your map is skewed by an unsuitable hash function or is filling up too much. Unless you're using more RAM than you can get from the system, you shouldn't have to break things up.
What I have seen with Java when running huge hash maps (or any collection) with lots of objects in memory is that the VM goes crazy trying to run the garbage collector. It gets to the point where 90% of the time is spent on the JVM kicking off the garbage collector, which takes a while and finds that almost every object is still referenced.
I suggest profiling your application, and if it is the garbage collector, then increasing heap space and tuning the garbage collector. Also, it will help if you can approximate the needed size of your hash maps and provide sufficiently large allocations (see initialCapacity and loadFactor options in the constructor).
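For example, a minimal sketch for the outer map in this question (the variable name is made up; the 900-file count is from the question, so adjust to your real numbers):
import java.util.HashMap;
import java.util.Map;

// Sketch: presize for ~900 files so the outer map never rehashes
// (initialCapacity of roughly expectedEntries / loadFactor).
Map<String, HashMap<String, Double>> wordFreqByFile = new HashMap<>(1200, 0.75f);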
I am trying to rethink your problem:
Since you are trying to construct an inverted index:
Use a Multimap rather than Map<String, Map<String, Integer>>
Multimap<word, (fileName, frequency, ...something else tomorrow)>
Now, read one file, construct the Multimap and save it on disk. (similar to Jon's answer)
After reading x files, merge all the Multimaps together: putAll(multimap) if you really need one common map of all the values.
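Assuming the Multimap here is Guava's (my assumption), a small sketch of the per-file build and the merge could look like this, with the "fileName:count" value encoding being a placeholder of my own:
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;
import java.util.Map;

// Sketch only: values are encoded as "fileName:count" strings to keep the example short;
// a small value class would be cleaner in real code.
static Multimap<String, String> indexOneFile(String fileName, Map<String, Integer> wordCounts) {
    Multimap<String, String> perFile = ArrayListMultimap.create();
    wordCounts.forEach((word, count) -> perFile.put(word, fileName + ":" + count));
    return perFile;
}

static Multimap<String, String> mergeAll(Iterable<Multimap<String, String>> parts) {
    Multimap<String, String> merged = ArrayListMultimap.create();
    for (Multimap<String, String> part : parts) {
        merged.putAll(part);   // merge the per-file results into one common multimap
    }
    return merged;
}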
You could try using this library to improve your performance.
http://high-scale-lib.sourceforge.net/
It is similar to the java collections api, but for high performance. It would be ideal if you can batch and merge these results after processing them in small batches.
Here is an article that will help you with some more inputs.
http://www.javaspecialists.eu/archive/Issue193.html
Why not use a custom class,
public class CustomData {
    private String word;
    private double frequency;
    // Setters and getters
}
and use your map as
Map<fileName, List<CustomData>>
This way, at least, you will have only 900 keys in your map.
-Ivar