We are trying to run a Google Cloud Dataflow job in the cloud, but we keep getting "java.lang.OutOfMemoryError: Java heap space".
We are processing 610 million records from a BigQuery table and writing the processed records to 12 different outputs (main + 11 side outputs).
We have tried increasing the number of workers to 64 n1-standard-4 instances, but we are still getting the error.
The Xmx value on the VMs seems to be set to ~4 GB (-Xmx3951927296), even though the instances have 15 GB of memory. Is there any way to increase the Xmx value?
The job ID is - 2015-06-11_21_32_32-16904087942426468793
You can't directly set the heap size. Dataflow, however, scales the heap size with the machine type. You can pick a machine with more memory by setting the flag "--machineType". The heap size should increase linearly with the total memory of the machine type.
Dataflow deliberately limits the heap size to avoid negatively impacting the shuffler.
Is your code explicitly accumulating values from multiple records in memory? Do you expect 4GB to be insufficient for any given record?
Dataflow's memory requirements should scale with the size of individual records and the amount of data your code is buffering in memory. Dataflow's memory requirements shouldn't increase with the number of records.
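To sanity-check whether a larger machine type actually raised the worker heap, here is a minimal sketch using only standard JDK calls (the class name and the n1-highmem-8 example are my own assumptions, not taken from the job above); Runtime.maxMemory() reports the same ceiling that -Xmx sets:
public class HeapReport {
    public static void main(String[] args) {
        // Reports the effective -Xmx; run it (or log the same call from worker code)
        // before and after switching --machineType, e.g. to n1-highmem-8.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Effective max heap: %.2f GB%n", maxHeapBytes / (1024.0 * 1024 * 1024));
    }
}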
I have two questions about how large the heap of a Cassandra or Geode JVM will be when using the new parameter -XX:MaxRAMPercentage in a Kubernetes cluster, under the following conditions:
1)
My JVM container has only a Memory Request set (in the Helm chart), with no Memory Limit, in order to obtain Best Effort memory use. Since -XX:MaxRAMPercentage is relative to the limit, and no limit is set, what heap size will be obtained?
2)
My JVM container has neither a Memory Request nor a Memory Limit defined (in the Helm chart), in order to obtain Best Effort memory use. Since -XX:MaxRAMPercentage is relative to the limit, and no limit is set, what heap size will be obtained?
BR,
Thomas
If no limit is set, then the RAM of the node your pod is running on will be used (your pod uses the node's kernel).
-XX:MaxRAMPercentage is not very practical for running in a cluster: either you set a limit and cannot use it fully, or you don't set a limit and the percentage is taken from the node's memory, which is usually not something you can control.
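To see what heap the JVM actually selects in a given container, a minimal sketch (the class name and the 75.0 value are just illustrative):
public class MaxHeapProbe {
    public static void main(String[] args) {
        // Launch inside the container, e.g.: java -XX:MaxRAMPercentage=75.0 MaxHeapProbe
        // With no container memory limit, the percentage is applied to the RAM the JVM
        // can see (the node's memory), so the resulting heap can be much larger than expected.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("JVM max heap: " + (maxHeap / (1024 * 1024)) + " MiB");
    }
}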
I am running a Hive INSERT OVERWRITE query on a Google Dataproc cluster, copying 13,783,531 records from one table into another partitioned table without any transformation. It fails with the error:
Diagnostic Messages for this Task:
Error: Java heap space
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 34 Cumulative CPU: 1416.18 sec HDFS Read: 6633737937
HDFS Write: 0 FAIL
Cluster details: n1-standard-16 (16 vCPU, 60.0 GB memory) with 5 worker nodes.
The error varies between "Java heap space" and "GC overhead limit exceeded".
I tried setting the following parameters:
set mapreduce.map.memory.mb=7698;
set mapreduce.reduce.memory.mb=7689;
set mapreduce.map.java.opts=-Xmx7186m;
set mapreduce.reduce.java.opts=-Xmx7186m;
It still fails.
So the issue was that INSERT OVERWRITE was trying to create too many small files.
It seems we have a fix:
set hive.optimize.sort.dynamic.partition=true;
https://community.hortonworks.com/articles/89522/hive-insert-to-dynamic-partition-query-generating.html
There are two solutions available, and both of them worked:
1. use set hive.optimize.sort.dynamic.partition=true;
2. use DISTRIBUTE BY <PARTITION_COLUMN>
Either of these will work.
It is better not to use Solution #1. The JIRA below says it inserts records into the wrong partition when used with GROUP BY, which is why it was disabled by default in recent Hive versions:
https://issues.apache.org/jira/browse/HIVE-8151
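For reference, a minimal sketch of Solution #2 through the HiveServer2 JDBC driver (assuming the Hive JDBC driver is on the classpath; the connection string, table names src/dst and partition column dt are hypothetical):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InsertWithDistributeBy {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
            // DISTRIBUTE BY the partition column so each reducer writes only a few
            // partitions instead of every mapper opening a file per partition.
            stmt.execute("INSERT OVERWRITE TABLE dst PARTITION (dt) "
                    + "SELECT col1, col2, dt FROM src DISTRIBUTE BY dt");
        }
    }
}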
There are a couple of things you need to address here:
Total JVM memory allocated vs. JVM heap memory
The total JVM memory allocated is set through these parameters:
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
The JVM heap memory is set through these parameters:
mapreduce.map.java.opts
mapreduce.reduce.java.opts
You must always ensure that total memory > heap memory. (Notice how little headroom the parameter values you provided leave: -Xmx7186m inside a 7698 MB container.)
Total-to-heap ratio
One of our vendors recommended that, as a rule of thumb, roughly 80% of the total memory should go to the heap. Even with this recommendation you will often encounter various memory errors.
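Applied to the container sizes used above, the 80% guideline would look roughly like this (illustrative, rounded values):
set mapreduce.map.memory.mb=7698;
set mapreduce.map.java.opts=-Xmx6158m;
set mapreduce.reduce.memory.mb=7698;
set mapreduce.reduce.java.opts=-Xmx6158m;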
Error: Java heap space
You probably need to increase both total and heap memory.
Error: PermGen space
You need to increase the off-heap memory, which means you might be able to decrease the heap memory without having to increase the total memory.
Error: GC overhead limit exceeded
This refers to the amount of time the JVM is allowed to spend garbage collecting. If too little space is reclaimed after a long time collecting, the JVM errors out. Try increasing both total and heap memory.
I'm reading a huge volume of data from I/O, and I need to store it as key-value pairs in a Map or a properties file before I can use it to generate reports. But when I store this huge data set in a Map or Properties object, I get a heap memory exception. If I use SQLite instead, retrieving the data takes a very long time. Is there a different way to achieve this? Please suggest.
Java Heap Space Important Points
Java heap memory is part of the memory allocated to the JVM by the operating system.
Whenever we create objects, they are allocated on the heap in Java.
For the sake of garbage collection, the Java heap space is divided into three regions, or generations: the new (young) generation, the old (tenured) generation, and the permanent generation. The permanent generation is garbage collected during a full GC in the HotSpot JVM.
You can increase or change the size of the Java heap space by using the JVM command-line options -Xms, -Xmx and -Xmn. Don't forget to append "m" or "g" to the size to indicate megabytes or gigabytes.
For example, you can set the maximum heap size to 256 MB by executing: java -Xmx256m JavaClassName (your program's class name).
You can use JConsole, or Runtime.maxMemory(), Runtime.totalMemory() and Runtime.freeMemory(), to query the heap size programmatically in Java.
You can use the "jmap" command to take a heap dump and "jhat" to analyze that heap dump.
Java heap space is different from the stack, which stores the call hierarchy and local variables.
The Java garbage collector is responsible for reclaiming memory from dead objects and returning it to the Java heap.
Don't panic when you get java.lang.OutOfMemoryError; sometimes it's just a matter of increasing the heap size, but if it recurs, look for a memory leak.
Use a profiler and a heap dump analyzer to understand the Java heap and how much memory is allocated to each object.
Reference links for more details:
https://docs.oracle.com/cd/E19159-01/819-3681/abeii/index.html
https://docs.oracle.com/cd/E40520_01/integrator.311/integratoretl_users/src/ti_troubleshoot_memory_errors.html
You need to do a rough estimate of the memory needed for your map. How many keys and values? How large are the keys and values? For example, if the keys are longs and the values are strings 40 characters long on average, the absolute minimum for 2 billion key-value pairs is (40 + 8) * 2E9 bytes, approximately 100 GB. Of course, the real requirement is larger than this minimum estimate, as much as two times larger depending on the nature of the keys and values.
If the estimated amount of memory is beyond reasonable (100 GB is beyond reasonable unless you have lots of money), you need to figure out a way to partition your processing: read in a large chunk of data, run some algorithm on it to reduce it to something small, then do the same for all the other chunks one by one, making sure not to keep the old chunk around while you process the new one. Finally, look at the results from all chunks and compute the final result. For a better description of this approach, look up "map-reduce".
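A rough sketch of that chunked approach, assuming line-oriented "key<TAB>value" input and a made-up reduce step (counting occurrences per key, then merging the per-chunk counts):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class ChunkedAggregation {
    public static void main(String[] args) throws IOException {
        // Usage: java ChunkedAggregation <input-file>
        Map<String, Long> totals = new HashMap<>(); // only the small, merged result is kept
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            Map<String, Long> chunk = new HashMap<>();
            String line;
            int linesInChunk = 0;
            while ((line = in.readLine()) != null) {
                String key = line.split("\t", 2)[0];       // hypothetical "key<TAB>value" format
                chunk.merge(key, 1L, Long::sum);
                if (++linesInChunk == 1_000_000) {          // reduce the chunk, then drop it
                    chunk.forEach((k, v) -> totals.merge(k, v, Long::sum));
                    chunk = new HashMap<>();                // old chunk becomes garbage
                    linesInChunk = 0;
                }
            }
            chunk.forEach((k, v) -> totals.merge(k, v, Long::sum)); // last partial chunk
        }
        System.out.println("Distinct keys: " + totals.size());
    }
}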
If the estimated amount of memory is somewhat reasonable (say, 8 GB, and you have a 16 GB machine), use a 64-bit JVM, set the maximum heap with the -Xmx switch, and make sure you use efficient data structures such as Trove maps.
Good luck!
Increasing the heap size is one option, but there is an alternative: store the data off-heap using memory-mapped files in Java. You can refer to this.
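For illustration, a minimal sketch of reading fixed-width records through a memory-mapped buffer with standard NIO (the file name data.bin and the 8-byte record layout are assumptions); the mapped pages live in the OS page cache rather than on the Java heap, though a single mapping is limited to 2 GB, so very large files need several mappings:
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedFileExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Paths.get("data.bin"), StandardOpenOption.READ)) {
            // Cap the mapping at 2 GB; larger files would be mapped in slices.
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long sum = 0;
            while (buffer.remaining() >= Long.BYTES) {
                sum += buffer.getLong();   // read 8-byte records straight from the mapping
            }
            System.out.println("Sum of 8-byte records: " + sum);
        }
    }
}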
I have always given my application an assumed heap size, and while the app is running I monitor and modify/tune the heap size.
Is there a way to calculate the required initial heap more or less accurately?
For the best performance in a Java EE style environment, i.e. when the application is meant to run for very long periods of time (weeks or months), it is best to set the minimum and maximum heap size to the same value.
The best way to size the heap in this case is to gather data on how your application runs over time. With a week's worth of verbose GC logs, we can import that data into GCViewer. Looking at the troughs of the graph, we can take an average and see the minimum retained set after each garbage collection. This is the amount of data, on average, kept in the heap during normal running, so the heap size should be at least that level. Since we're setting minimum and maximum to the same value here, we need to add more space to compensate for spikes. How much to add depends on your use case, but somewhere between 25% and 35% is a start.
With this method, always remember to keep monitoring your heap and GC behaviour using verbose GC (which Oracle recommends running even in production).
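For example, a Java 8 style launch that produces a log GCViewer can read (the heap size, log path and application jar are placeholders):
java -Xms4g -Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar app.jar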
Important: this assumes that your application is an always-on type of application. If you're considering a desktop Java application, the behaviour will be very different and this should not be seen as a reliable method in that case. As @juhist said in their answer, just the maximum should be set and the JVM will handle the rest.
Why do you monitor and modify / tune the heap size?
Java has been designed in a way that the heap automatically grows to the needed size as long as the maximum heap size is large enough. You shouldn't modify the heap size manually at all.
I would rely on the automatic Java algorithms for adjusting the heap size and just specify the maximum heap size at program startup. You should take into account how many java processes you're running in parallel when deciding the maximum heap size.
I need Java code to monitor Redis's memory usage, because Redis stores all its data in RAM and it will crash if memory fills up.
It looks like Redis uses the whole OS's memory, so the Runtime methods in Java are not the right tool, because they only count memory inside the JVM.
Is there any Java method to monitor the whole system's memory usage, or is there some magic Redis method?
You could make periodic requests to Redis, sending an INFO command, and parse the result to get the value of used_memory, which is the number of bytes allocated by the Redis memory allocator.
Update: Redis won't crash, it will swap, and its performance will fall dramatically. You can detect swapping by comparing used_memory_rss to used_memory: used_memory_rss much greater than used_memory means swapping has occurred. Before that, you can be warned that swapping is about to occur when used_memory gets close to the total memory available to Redis.
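A minimal sketch of that polling approach, assuming the Jedis client library (the host, port and 10-second interval are arbitrary):
import redis.clients.jedis.Jedis;

public class RedisMemoryMonitor {
    public static void main(String[] args) throws InterruptedException {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            while (true) {
                // INFO memory returns lines such as "used_memory:1024000"
                String info = jedis.info("memory");
                for (String line : info.split("\r\n")) {
                    if (line.startsWith("used_memory:") || line.startsWith("used_memory_rss:")) {
                        System.out.println(line);
                    }
                }
                Thread.sleep(10_000);
            }
        }
    }
}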
If you are using Redis as a cache, you can limit its memory consumption by adding these lines to the config file:
maxmemory 2mb
maxmemory-policy allkeys-lru
In this example, it is limited to 2 MB.
Update: when the maxmemory limit is reached, Redis either rejects new write operations with an error (with the default noeviction policy) or starts deleting keys according to the configured eviction policy; allkeys-lru evicts the least recently used keys, which is appropriate for a cache.