Error: Java heap space on Google Dataproc cluster - java

I am running a Hive INSERT OVERWRITE query on a Google Dataproc cluster, copying 13,783,531 records from one table into another, partitioned, table without any transformation. The query fails with the following error:
Diagnostic Messages for this Task:
Error: Java heap space
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 34 Cumulative CPU: 1416.18 sec HDFS Read: 6633737937
HDFS Write: 0 FAIL
Cluster details: n1-standard-16 (16 vCPU, 60.0 GB memory) with 5 worker nodes.
The error varies between Java heap space and GC overhead limit exceeded.
I tried setting these parameters:
set mapreduce.map.memory.mb=7698;
set mapreduce.reduce.memory.mb=7689;
set mapreduce.map.java.opts=-Xmx7186m;
set mapreduce.reduce.java.opts=-Xmx7186m;
It still fails.

So the issue was that the INSERT OVERWRITE was trying to create too many small files.
It turns out there is a fix:
set hive.optimize.sort.dynamic.partition=true;
https://community.hortonworks.com/articles/89522/hive-insert-to-dynamic-partition-query-generating.html
There are two solutions available, and both of them worked:
1. set hive.optimize.sort.dynamic.partition=true;
or
2. add DISTRIBUTE BY <PARTITION_COLUMN> to the query, so that all rows for a given partition go to the same reducer.
Either of these will work.
It is better not to use solution #1. The JIRA below reports that it can insert records into the wrong partition when used with GROUP BY, which is why it was disabled by default in recent Hive releases:
https://issues.apache.org/jira/browse/HIVE-8151

There are a couple of things you need to address here:
Total JVM memory allocated vs. JVM heap memory
The total JVM memory allocated is set through these parameters:
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
The JVM heap memory is set through these parameters:
mapreduce.map.java.opts
mapreduce.reduce.java.opts
You must always ensure that total memory > heap memory. (Note that in the values you posted the heap is only about 500 MB below the container size, which leaves far less headroom than the ratio recommended below.)
Total-to-heap ratio
One of our vendors recommended that, as a rule of thumb, roughly 80% of the total memory be used for the heap. Even with this recommendation you will often encounter various memory errors.
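For example, with the mapreduce.map.memory.mb=7698 container size from the question, an 80% heap would be roughly 0.8 × 7698 ≈ 6158 MB, i.e. mapreduce.map.java.opts=-Xmx6158m, leaving about 1.5 GB of the container for off-heap overhead.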
Error: Java heap space
You probably need to increase both the total and the heap memory.
Error: PermGen space
You need to increase the off-heap memory, which means you might be able to decrease the heap memory without having to increase the total memory.
Error: GC overhead limit exceeded
This refers to the amount of time the JVM is allowed to spend garbage collecting. If too little memory is reclaimed after spending too long in GC, the JVM errors out. Try increasing both total and heap memory.
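As an illustration of that last case, here is a toy program (unrelated to the Hive job; the class name is just an example) that typically dies with this error when run with a deliberately small heap such as java -Xmx64m GcOverheadDemo, although depending on the JVM it may report Java heap space instead:

import java.util.HashMap;
import java.util.Map;

public class GcOverheadDemo {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        int i = 0;
        // Every object stays reachable, so the GC spends more and more time
        // reclaiming almost nothing before the JVM finally gives up.
        while (true) {
            map.put(i, "value-" + i);
            i++;
        }
    }
}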

Related

How come Java can allocate more memory than the specified heap size

We are basically tuning our JVM options:
-J-Xms1536M -J-Xmx1536M -J-Xss3M -J-Djruby.memory.max=1536M -J-Djruby.thread.pool.enabled=true -J-Djruby.compile.mode=FORCE -J-XX:NewRatio=3 -J-XX:NewSize=256M -J-XX:MaxNewSize=256M -J-XX:+UseParNewGC -J-XX:+CMSParallelRemarkEnabled -J-XX:+UseConcMarkSweepGC -J-XX:CMSInitiatingOccupancyFraction=75 -J-XX:+UseCMSInitiatingOccupancyOnly -J-XX:SurvivorRatio=5 -J-server -J-Xloggc:/home/deploy/gcLog/gc.log -J-XX:+PrintGCDateStamps -J-XX:+PrintGCDetails -J-XX:+PrintGCApplicationStoppedTime -J-XX:+PrintSafepointStatistics -J-XX:PrintSafepointStatisticsCount=1
We have set -J-Xms and -J-Xmx to a value of 1536M. Now,
if I understood this correctly, -J-Xmx represents the maximum size of the heap.
The machine has 4 cores and 15 GB of RAM.
But when I check the RSS (using top) of my running Java process, I see it consuming more than the -J-Xmx1536M value, around ~2 GB.
So the process's memory usage has clearly grown beyond the specified -Xmx value.
My questions are:
Why am I not seeing any Java out-of-memory exception?
And what is an ideal setting for -J-Xmx with 4 cores and 15 GB RAM (given that no other process is running on the system besides the Java application)?
Why am I not seeing any Java out-of-memory exception?
Because you did not run out of heap memory. Start VisualVM and examine the process after setting -Xmx: you'll notice there is a region called Metaspace (its compressed class space is capped at 1 GB by default, but Metaspace as a whole is unbounded unless you set -XX:MaxMetaspaceSize). Besides that, there are other ways in which the process uses additional memory, e.g. the code cache (JIT-compiled native code), thread stacks, and direct buffers.
And what is an ideal setting for -J-Xmx with 4 cores and 15 GB RAM (given that no other process is running on the system besides the Java application)?
There's no "clear" answer for that; it differs from application to application, and you should monitor your memory usage under various scenarios. The first thing to do might be to set the heap high, but if you're not actually using most of it and you have a memory leak, an oversized heap will only hide the leak for longer and complicate diagnosing it.

When is Java GC called on a 64-bit system if max heap size is not specified

I am using a 64-bit system with 64 GB of RAM for processing large amounts of data.
While processing the data I did not provide any max heap size parameter.
My program keeps consuming memory and the Java GC does not appear to run.
Only about 2 GB of memory is needed to process the data.
Can anyone explain when GC is called by the JVM? And is it necessary to provide a max heap size parameter?
I am using a 64-bit system with 64 GB of RAM for processing large amounts of data. While processing the data I did not provide any max heap size parameter.
The default heap size for a 64-bit JVM is 1/4 of main memory, or 16 GB in your case.
My program keeps consuming memory and the Java GC does not appear to run.
I would be very surprised if the GC were not running at all, regardless of the maximum heap size, especially since you haven't tuned anything.
Only about 2 GB of memory is needed to process the data.
You should be seeing minor GC collections, unless you only use large arrays and very few small objects, i.e. your Eden space never fills up.
Can anyone explain when GC is called by the JVM?
When the current Eden size is reached, it performs a minor GC. If the GC decides more memory is needed and the maximum hasn't been reached, it will grow the available heap.
And is it necessary to provide a max heap size parameter?
Only if you want it to use less memory (but this is likely to be slower).
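To check what default the JVM actually picked, here is a minimal sketch (the class name is just an example; the exact figure depends on the JVM version and its ergonomics) that prints the current and maximum heap sizes:

public class DefaultHeap {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // maxMemory() reports the -Xmx limit, or the ergonomic default
        // (typically about 1/4 of physical RAM on a 64-bit server JVM).
        System.out.println("Max heap:  " + rt.maxMemory() / mb + " MB");
        System.out.println("Committed: " + rt.totalMemory() / mb + " MB");
        System.out.println("Free:      " + rt.freeMemory() / mb + " MB");
    }
}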
If you want to limit the amount of memory that the JVM uses, it is necessary to provide a max heap size option. If you don't do this, the JVM will use a default heap size, which is often a quarter of the available RAM (depending on your JVM type and version).
Can anyone explain when GC is called by the JVM?
It is called when the JVM thinks it is necessary. Typically, that is when the Eden space (where most new objects are created) is too full. It can also happen if you attempt to allocate an object that is too big for the Eden space anyway, and the tenured space where the object would need to be allocated is too full. The actual decision making depends on the kind of GC your JVM is using. (And it is opaque; i.e. not specified or documented in any published documentation, and probably irrelevant to the problem you are actually trying to solve.)
The GC does not treat minimizing memory usage as a primary goal when optimizing performance. Rather, it will attempt to optimize either for minimum GC pause time or for maximum throughput.
So, when the GC runs, the JVM is reluctant to reduce the size of the heap. This is because the GC tends to run more efficiently if the heap size is significantly larger than your application's working set of reachable objects.

How to avoid Java Heap Space Exception in java

I'm reading a huge volume of data from IO, and I need to store it as key-value pairs in a Map or a properties file; only then can I use that data for generating reports. But when I store this huge data set in a Map or Properties object, a heap memory exception occurs. If I use SQLite instead, it takes a very long time to retrieve the data. Is there a different way to achieve this? Please suggest.
Java Heap Space Important Points
Java heap memory is part of the memory allocated to the JVM by the operating system.
Whenever we create objects, they are created inside the heap in Java.
Java heap space is divided into regions or generations for the sake of garbage collection, called the New (Young) Generation, the Old or Tenured Generation, and the Permanent Generation. The permanent generation is garbage collected during a full GC in the HotSpot JVM.
You can change the size of the Java heap space by using the JVM command-line options -Xms, -Xmx and -Xmn. Don't forget to add "m" or "g" after the size to indicate megabytes or gigabytes.
For example, you can set the max Java heap size to 256 MB by executing the following command: java -Xmx256m javaClassName (your program's class name).
You can use either JConsole or Runtime.maxMemory(), Runtime.totalMemory() and Runtime.freeMemory() to query the heap size programmatically in Java.
You can use the command "jmap" to take a heap dump in Java and "jhat" to analyze that heap dump (a programmatic alternative is sketched after the reference links below).
Java heap space is different from the stack, which is used to store the call hierarchy and local variables.
The Java garbage collector is responsible for reclaiming memory from dead objects and returning it to the Java heap space.
Don't panic when you get java.lang.OutOfMemoryError; sometimes it's just a matter of increasing the heap size, but if it's recurrent then look for a memory leak in Java.
Use a profiler and a heap dump analyzer to understand Java heap space and how much memory is allocated to each object.
Reference links for more details:
https://docs.oracle.com/cd/E19159-01/819-3681/abeii/index.html
https://docs.oracle.com/cd/E40520_01/integrator.311/integratoretl_users/src/ti_troubleshoot_memory_errors.html
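As a programmatic alternative to jmap, a heap dump can also be triggered from inside the JVM. Here is a minimal sketch using the HotSpot-specific diagnostic MXBean (so it only works on HotSpot-based JVMs; the class name and output path are placeholders):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDump {
    public static void main(String[] args) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Dump only live (reachable) objects; open the resulting file with jhat or a profiler.
        bean.dumpHeap("/tmp/heap.hprof", true);
    }
}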
You need to do a rough estimate of the memory needed for your map. How many keys and values? How large are the keys and values? For example, if the keys are longs and the values are strings 40 characters long on average, the absolute minimum for 2 billion key-value pairs is (40 + 8) × 2E9 bytes, approximately 100 GB. Of course, the real requirement is larger than this minimum estimate, as much as two times larger depending on the nature of the keys and values.
If the estimated amount of memory is beyond reasonable (and 100 GB is beyond reasonable unless you have lots of money), you need to figure out a way to partition your processing. Read in a large chunk of data, then run some algorithm on it to reduce it to a small result. Then do the same for all the other chunks one by one, making sure not to keep the old chunk around while you process the new one. Finally, look at the results from all chunks and compute the final result (see the sketch below). For a better description of this approach, look up "map-reduce".
If the estimated amount of memory is somewhat reasonable (say, 8 GB, and you have a 16 GB machine), use a 64-bit JVM, set the maximum heap memory using the -Xmx switch, and make sure you use memory-efficient data structures such as Trove maps.
Good luck!
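A minimal sketch of the chunked approach described above, assuming the input can be read line by line and each chunk can be reduced to a small partial result (the file name, chunk size and reduce step are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ChunkedProcessing {
    public static void main(String[] args) throws IOException {
        final int chunkSize = 1_000_000;            // records per chunk (placeholder)
        long runningTotal = 0;                      // small "reduced" state kept across chunks

        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    runningTotal += reduce(chunk);  // shrink the chunk to a small result
                    chunk.clear();                  // drop the old chunk before reading the next
                }
            }
            runningTotal += reduce(chunk);          // handle the final partial chunk
        }
        System.out.println("Final result: " + runningTotal);
    }

    // Placeholder reduce step: here it just counts records.
    private static long reduce(List<String> chunk) {
        return chunk.size();
    }
}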
Increasing the heap size is one option, but there is an alternative: store the data off-heap by using memory-mapped files in Java.
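A minimal sketch of the memory-mapped-file approach, assuming fixed-size records of an 8-byte key and an 8-byte value (the file name, record count and layout are placeholders):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class OffHeapStore {
    public static void main(String[] args) throws IOException {
        long entries = 10_000_000L;   // placeholder record count
        long bytes = entries * 16;    // 8-byte key + 8-byte value per record

        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping lives outside the Java heap, so it does not count against -Xmx.
            // Note: a single MappedByteBuffer is limited to 2 GB; larger files need several mappings.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);

            // Write record i: key at offset i*16, value at offset i*16 + 8.
            int i = 42;
            map.putLong(i * 16, 123L);
            map.putLong(i * 16 + 8, 456L);

            // Read the value back.
            System.out.println("value = " + map.getLong(i * 16 + 8));
        }
    }
}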

Cloud Dataflow - Increase JVM Xmx Value

We are trying to run a Google Cloud Dataflow job in the cloud but we keep getting "java.lang.OutOfMemoryError: Java heap space".
We are trying to process 610 million records from a BigQuery table and write the processed records to 12 different outputs (main + 11 side outputs).
We have tried increasing the number of workers to 64 n1-standard-4 instances, but we are still getting the error.
The Xmx value on the VMs seems to be set to ~4 GB (-Xmx3951927296), even though the instances have 15 GB of memory. Is there any way of increasing the Xmx value?
The job ID is - 2015-06-11_21_32_32-16904087942426468793
You can't directly set the heap size. Dataflow, however, scales the heap size with the machine type. You can pick a machine with more memory by setting the flag "--machineType". The heap size should increase linearly with the total memory of the machine type.
Dataflow deliberately limits the heap size to avoid negatively impacting the shuffler.
Is your code explicitly accumulating values from multiple records in memory? Do you expect 4GB to be insufficient for any given record?
Dataflow's memory requirements should scale with the size of individual records and the amount of data your code is buffering in memory. Dataflow's memory requirements shouldn't increase with the number of records.
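A minimal sketch of picking a larger machine type when constructing the pipeline options (hedged: the class and setter names below are assumptions about the Dataflow Java SDK and may differ between SDK versions, and n1-highmem-8 is just an example machine type, not one recommended in the answer):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class LaunchWithBiggerWorkers {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        // Assumed setter corresponding to the machine-type flag mentioned in the answer above;
        // the same value can also be passed as a command-line flag when launching the job.
        options.setWorkerMachineType("n1-highmem-8");

        Pipeline p = Pipeline.create(options);
        // ... build the rest of the pipeline here ...
        p.run();
    }
}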

Can you explain how memory is used by a multi-threaded Java program?

Consider a multi-threaded Java program with 10 threads, with its max heap size set to 128 MB. When the application is executed on Windows or Linux, it shows a memory usage of 160 MB. Can you explain how the memory is used?
The max heap size only limits the amount of memory the JVM can allocate for objects on the heap. The heap cannot exceed this limit, but other parts of the process will get additional memory from the OS.
Every single thread has its own stack allocated (typically around 512 KB to 1 MB per thread on HotSpot, tunable with -Xss) where all local variables, method parameters and return addresses are pushed and popped over the lifetime of the thread.
In addition, the JVM garbage collector will also be given a certain amount of memory, beyond the heap, to deal with the chaos left in your program's wake.
You can read more about JVM tweaking and what it does here.
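A rough budget for the numbers in the question (the per-thread stack and Metaspace figures are typical HotSpot defaults used as assumptions, not measured values): 128 MB of heap + 10 threads × ~1 MB of stack + a few tens of MB for Metaspace, the JIT code cache and GC bookkeeping comfortably reaches the ~160 MB reported by the OS.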
