I'm using Apache Spark 2.0.2 together with Spark JobServer 0.7.0.
I know this is not a best practice, but it is a first step. My server has 52 GB of RAM and 6 CPU cores, CentOS 7 x64, Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and it runs the following applications with the specified memory configuration:
JBoss AS 7 (6 Gb)
PDI Pentaho 6.0 (12 Gb)
MySQL (20 Gb)
Apache Spark 2.0.2 (8 Gb)
I start it and everything works as expected, and it keeps working that way for several hours.
I have a jar with two implemented jobs that extend the VIQ_SparkJob base class shown below.
public class VIQ_SparkJob extends JavaSparkJob {

    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);
        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
This job imports the data from some .csv files and stores them as parquet files partitioned by tenant. There are two entities: users, which occupies 674 MB on disk as parquet files, and user_processed, which occupies 323 MB.
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String entity = jobConfig.getString("entity");
    Dataset<Row> ds = sparkSession.read()
            .option("header", "true")
            .option("inferSchema", true)
            .csv(csvPath);
    ds.withColumn("tenant_id", ds.col("tenant_id").cast("int"))
            .write()
            .mode(SaveMode.Append)
            .partitionBy(JavaConversions.asScalaBuffer(asList("tenant_id")))
            .parquet("/value_iq/spark-warehouse/" + entity);
    return null;
}
This one is to query the parquet files:
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String query = jobConfig.getString("query");
    Dataset<Row> lookup_values = getDataFrameFromMySQL("value_iq", "lookup_values")
            .filter(new Column("lookup_domain").equalTo("customer_type"));
    Dataset<Row> users = getDataFrameFromParket(USERS + "/tenant_id=" + TENANT_ID);
    Dataset<Row> user_profiles = getDataFrameFromParket(USER_PROCESSED + "/tenant_id=" + TENANT_ID);
    lookup_values.createOrReplaceTempView("lookup_values");
    users.createOrReplaceTempView("users");
    user_profiles.createOrReplaceTempView("user_processed");
    // CREATING VIEWS DE AND EN
    sparkSession
            .sql(Here I join the 3 datasets)
            .coalesce(200)
            .createOrReplaceTempView("cube_users_v_de");
    List<String> list = sparkSession.sql(query).limit(1000).toJSON().takeAsList(1000);
    String result = "[";
    for (int i = 0; i < list.size(); i++) {
        result += (i == 0 ? "" : ",") + list.get(i);
    }
    result += "]";
    return result;
}
Every day I run the first job, saving some CSV files as parquet, and during the day I execute some queries against the second one. But after some hours it crashes with an out-of-memory error. This is the error log:
k.memory.TaskMemoryManager [] [akka://JobServer/user/context-supervisor/application_analytics] - Failed to allocate a page (8388608 bytes), try again.
WARN .netty.NettyRpcEndpointRef [] [] - Error sending message [message = Heartbeat(0,[Lscala.Tuple2;@18c54652,BlockManagerId(0, 157.97.107.42, 55223))] in 1 attempt
I have the master and one worker on this server. My spark-defaults.conf:
spark.debug.maxToStringFields 256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
This is my Spark Jobserver settings.sh
DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8
I create my context with the following curl command:
curl -k --basic --user 'user:password' -d "" 'https://localhost:4810/contexts/application?num-cpu-cores=5&memory-per-node=8G'
The Spark driver uses 2 GB.
The created application looks like:
ExecutorID Worker Cores Memory State Logs
0 worker-20170203084218-157.97.107.42-50199 5 8192 RUNNING stdout stderr
Those are my executors
Executor ID Address ▴ Status RDD Blocks Storage Memory Disk Used Cores
driver 157.97.107.42:55222 Active 0 0.0 B / 1018.9 MB 0.0 B 0
0 157.97.107.42:55223 Active 0 0.0 B / 4.1 GB 0.0 B 5
I have a process that checks the memory used by each process, and the peak was 8468 MB.
There are 4 processes related to Spark:
The master process: starts with 1 GB assigned (I don't know where this configuration comes from, but it seems to be enough); it uses only 0.4 GB at peak.
The worker process: same memory usage as the master.
The driver process: configured with 2 GB.
The context: configured with 8 GB.
In the following table you can see how the memory used by the driver and the context behaves. After java.lang.OutOfMemoryError: Java heap space the context fails, but the driver accepts another context, so it stays fine.
system_user | RAM(Mb) | entry_date
--------------+----------+---------------------
spark.driver    2472.11   2017-02-07 10:10:18  //Till here everything was fine
spark.context   5470.19   2017-02-07 10:10:18  //it was running for more than 48 hours
spark.driver    2472.11   2017-02-07 10:11:18  //Then I executed three big concurrent queries
spark.context      0.00   2017-02-07 10:11:18  //and I got java.lang.OutOfMemoryError: Java heap space
                                               //in $LOG_FOLDER/job-server-master/server_startup.log
# I've checked and the context was still present in the jobserver but unresponsive.
# In Spark the application was killed.
spark.driver    2472.11   2017-02-07 10:16:18  //Here I have deleted and created the context again
spark.context    105.20   2017-02-07 10:16:18
spark.driver    2577.30   2017-02-07 10:19:18  //Here I executed the three big
spark.context   3734.46   2017-02-07 10:19:18  //concurrent queries again.
spark.driver    2577.30   2017-02-07 10:20:18  //Here, after the queries were
spark.context   5154.60   2017-02-07 10:20:18  //executed: no memory issue.
I have 2 questions:
1- Why, when I check the Spark GUI, does my driver, which has 2 GB configured, only show about 1 GB, and executor 0, which has 8 GB configured, only show 4.1 GB of storage memory? Where does the rest of the configured memory go? Yet when I look at the processes on the system, the driver does use 2 GB.
2- If I have enough memory on the server, why do I run out of memory?
Related
Could someone please make a suggestion here? I am executing the following code with spark-submit. Every 5 minutes it takes data from the source Hive table (Hive is fed by a Spark Streaming job whose source is Kafka) and does some aggregation over the last 2 hours of partitions on the target side.
val spark_sql = """insert OVERWRITE table hivedesttable(partition_date,partition_hour)
select "some business logic with aggregation and group by condition" from hivesrctable
where concat(partitionDate,":",partitionHour) in ${partitionDateHour}"""
new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      val currentTs = java.time.LocalDateTime.now
      var partitionDateHour = (0 until 2)
        .map(h => currentTs.minusHours(h))
        .map(ts => s"'${ts.toString.substring(0, 10)}${":"}${ts.toString.substring(11, 13)}'")
        .toList.toString().drop(4)
      /** replacing ${partitionDateHour} value in the query dynamically from the current thread's value */
      val sparksql = spark_sql.replace("${partitionDateHour}", partitionDateHour)
      spark.sql(sparksql)
      Thread.sleep(300000)
    }
  }
}).start()
scala.io.StdIn.readLine()
It was working properly for 5 to 6 hours before displaying the error message below.
Exception in thread "dispatcher-event-loop-5" in java.lang.OutOfMemoryError: GC overhead
limit exceeded
22/12/13 19:07:42 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning
+- *HashAggregate
+- Exchange hashpartitioning
+- *HashAggregate
+- *Filter
My spark-submit configuration:
--conf "spark.hadoop.hive.exec.dynamic.partition=true"
--conf "spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict"
--num-executors 1
--driver-cores 1
--driver-memory 1G
--executor-cores 1
--executor-memory 1G
I attempted to increase --executor-memory, but the issue persisted. I'm wondering if there's a way to release all of the GC process's resources after each run of the thread, so that unneeded objects and resources can be freed up.
Could someone please advise me on how to handle this situation?
java.lang.OutOfMemoryError: GC overhead limit exceeded occurs when the Java process spends more than approximately 98% of its time doing garbage collection while recovering less than 2% of the heap. Read more at docs.oracle.com.
There are two possible reasons: (1) you have a memory leak, which in most cases turns out to be the root cause, or (2) you are not allocating sufficient heap memory.
I would suggest doing a heap dump analysis to see whether you have a memory leak. Also, try increasing the memory and see how it goes.
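For example, here is a minimal sketch of bumped-up spark-submit flags for the configuration shown above. The sizes are illustrative starting points (not recommendations, they depend on what your cluster can spare), and the extra JVM options simply write a heap dump on OOM so you have something to analyze:
--num-executors 2
--driver-cores 1
--driver-memory 4G
--executor-cores 2
--executor-memory 4G
--conf "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
--conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
The resulting .hprof file can then be opened in a heap analyzer such as Eclipse MAT to see which objects dominate the heap.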
I have seen and tried many existing Stack Overflow posts regarding this issue, but none of them work. I guess my Java heap space is not as large as expected for my large dataset; it contains 6.5M rows. My Linux instance has 64 GB of RAM and 4 cores. As per this suggestion I need to fix my code, but I think making a dictionary from a pyspark dataframe should not be very costly. Please advise me if there is any other way to compute it.
I just want to make a Python dictionary from my pyspark dataframe. This is the content of my pyspark dataframe:
property_sql_df.show() shows,
+--------------+------------+--------------------+--------------------+
| id|country_code| name| hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
| BOND-9129450| US|Scotron Home w/Ga...|90cb0946cf4139e12...|
| BOND-1742850| US|Sited in the Mead...|d5c301f00e9966483...|
| BOND-3211356| US|NEW LISTING - Com...|811fa26e240d726ec...|
| BOND-7630290| US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
| BOND-7175508| US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is to make a dictionary with hash_of_cc_pn_li as the key and the list of ids as the value.
Expected Output
{
"90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"]
"d5c301f00e9966483": ["BOND-1742850","BOND-7630290"]
}
What I have tried so far,
Way 1: causing java.lang.OutOfMemoryError: Java heap space
%%time
duplicate_property_list = {}
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]
Way 2: Not working because pyspark has no native OFFSET
%%time
i = 0
limit = 1000000
for offset in range(0, total_record, limit):
    i = i + 1
    if i != 1:
        offset = offset + 1
    duplicate_property_list = {}
    duplicate_properties = {}
    # Preparing dataframe
    url = '''select id, hash_of_cc_pn_li from properties_df LIMIT {} OFFSET {}'''.format(limit, offset)
    properties_sql_df = spark.sql(url)
    # Grouping dataset
    rows = properties_sql_df.groupBy("hash_of_cc_pn_li").agg(F.collect_set("id").alias("ids")).collect()
    duplicate_property_list = {row.hash_of_cc_pn_li: row.ids for row in rows}
    # Filter the dictionary to keep only elements where the duplicate count >= 2
    duplicate_properties = filterTheDict(duplicate_property_list, lambda elem: len(elem[1]) >= 2)
    # Writing to file
    with open('duplicate_detected/duplicate_property_list_all_' + str(i) + '.json', 'w') as fp:
        json.dump(duplicate_property_list, fp)
What I get now on the console:
java.lang.OutOfMemoryError: Java heap space
and this error is shown in the Jupyter notebook output:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)
This is the follow-up question that I asked here: Creating dictionary from Pyspark dataframe showing OutOfMemoryError: Java heap space
Why not keep as much data and processing in Executors, rather than collecting to Driver? If I understand this correctly, you could use pyspark transformations and aggregations and save directly to JSON, therefore leveraging executors, then load that JSON file (likely partitioned) back into Python as a dictionary. Admittedly, you introduce IO overhead, but this should allow you to get around your OOM heap space errors. Step-by-step:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession  # needed for the builder below

spark = SparkSession.builder.getOrCreate()
data = [
("BOND-9129450", "90cb"),
("BOND-1742850", "d5c3"),
("BOND-3211356", "811f"),
("BOND-7630290", "d5c3"),
("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])
df.groupBy(
f.col("hash_of_cc_pn_li"),
).agg(
f.collect_set("id").alias("id") # use f.collect_list() here if you're not interested in deduplication of BOND-XXXXX values
).write.json("./test.json")
Inspecting the output path:
ls -l ./test.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 part-00000-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 50 Jul 27 08:29 part-00039-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00043-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 65 Jul 27 08:29 part-00159-1fb900a1-c624-4379-a652-8e5b9dee8651-c000.json
-rw-r--r-- 1 jovyan users 0 Jul 27 08:29 _SUCCESS
Loading to Python as dict:
import json
from glob import glob

data = []
for file_name in glob('./test.json/*.json'):
    with open(file_name) as f:
        try:
            data.append(json.load(f))
        except json.JSONDecodeError:  # there is definitely a better way - this is here because some partitions might be empty
            pass
Finally
{item['hash_of_cc_pn_li']:item['id'] for item in data}
{'d5c3': ['BOND-7630290', 'BOND-1742850'],
'811f': ['BOND-3211356'],
'90cb': ['BOND-9129450', 'BOND-7175508']}
I hope this helps! Thank you for the good question!
I am trying to write a Spark dataframe to HDFS using partitionBy, but it is throwing a Java heap space error.
Below are the cluster configuration and my Spark configuration.
Cluster Configuration:
5 nodes
No of cores/node: 32 cores
RAM/Node: 252GB
Spark Configuration:
spark.driver.memory = 50g
spark.executor.cores = 10
spark.executor.memory = 40g
df_final is created by reading an avro file and doing some transformations (quite simple ones, like splitting a column and adding new columns with default values).
The size of the source file is around 15M.
df_final.count() = 361016
I am facing a Java heap space error while writing the final df to HDFS:
df_final.write.partitionBy("col A", "col B", "col C", "col D").mode("append").format("orc").save("output")
I even tried to use Spark dynamic allocation:
spark.dynamicAllocation.enabled = 'true'
spark.shuffle.service.enabled = 'true'
I am still getting the Java heap space error.
I even tried to write the df without partitions, but it still fails with a Java heap space or GC overhead error.
This is the exact stage at which I am getting the Java heap space error:
WARN TaskSetManager: Stage 30 contains a task of very large size (16648KB). The maximum recommended task size is 100 KB
How can I fine-tune my Spark configuration to avoid this Java heap space issue?
I am using Spark to receive data from a Kafka stream carrying status updates about IoT devices, which send regular health updates and the state of their various sensors. My Spark application listens to a single topic and receives the update messages through a Spark direct stream. I need to trigger different alarms based on the state of the sensors of each device. However, when I add more IoT devices sending data through Kafka, Spark does not scale, despite adding more machines and increasing the number of executors. Below is a stripped-down version of my Spark application, with the notification-triggering part removed, that shows the same performance issues.
// Method to update the device state; DeviceState is just an in-memory object that tracks the device state.
private static Optional<DeviceState> trackDeviceState(Time time, String key, Optional<ProtoBufEventUpdate> updateOpt,
        State<DeviceState> state) {
    int batchTime = toSeconds(time);
    ProtoBufEventUpdate eventUpdate = (updateOpt == null) ? null : updateOpt.orNull();
    if (eventUpdate != null)
        eventUpdate.setBatchTime(ProximityUtil.toSeconds(time));
    if (state != null && state.exists()) {
        DeviceState deviceState = state.get();
        if (state.isTimingOut()) {
            deviceState.markEnd(batchTime);
        }
        if (updateOpt.isPresent()) {
            deviceState = DeviceState.updatedDeviceState(deviceState, eventUpdate);
            state.update(deviceState);
        }
    } else if (updateOpt.isPresent()) {
        DeviceState deviceState = DeviceState.newDeviceState(eventUpdate);
        state.update(deviceState);
        return Optional.of(deviceState);
    }
    return Optional.absent();
}
SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
        .set("spark.rpc.netty.dispatcher.numThreads", String.valueOf(Runtime.getRuntime().availableProcessors()));
JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(10));

Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("zookeeper.connect", "192.168.60.20:2181,192.168.60.21:2181,192.168.60.22:2181");
kafkaParams.put("metadata.broker.list", "192.168.60.20:9092,192.168.60.21:9092,192.168.60.22:9092");
kafkaParams.put("group.id", "spark_iot");

HashSet<String> topics = new HashSet<>();
topics.add("iottopic");

JavaPairInputDStream<String, ProtoBufEventUpdate> inputStream = KafkaUtils.
        createDirectStream(context, String.class, ProtoBufEventUpdate.class, KafkaKryoCodec.class, ProtoBufEventUpdateCodec.class, kafkaParams, topics);
JavaPairDStream<String, ProtoBufEventUpdate> updatesStream = inputStream.mapPartitionsToPair(t -> {
    List<Tuple2<String, ProtoBufEventUpdate>> eventupdateList = new ArrayList<>();
    t.forEachRemaining(tuple -> {
        String key = tuple._1;
        ProtoBufEventUpdate eventUpdate = tuple._2;
        Util.mergeStateFromStats(eventUpdate);
        eventupdateList.add(new Tuple2<String, ProtoBufEventUpdate>(key, eventUpdate));
    });
    return eventupdateList.iterator();
});

JavaMapWithStateDStream<String, ProtoBufEventUpdate, DeviceState, DeviceState> devceMapStream = null;
devceMapStream = updatesStream.mapWithState(StateSpec.function(Engine::trackDeviceState)
        .numPartitions(20)
        .timeout(Durations.seconds(1800)));
devceMapStream.checkpoint(new Duration(batchDuration * 1000));

JavaPairDStream<String, DeviceState> deviceStateStream = devceMapStream
        .stateSnapshots()
        .cache();

deviceStateStream.foreachRDD(rdd -> {
    if (rdd != null && !rdd.isEmpty()) {
        rdd.foreachPartition(tuple -> {
            tuple.forEachRemaining(t -> {
                SparkExecutorLog.error("Engine::getUpdates Tuple data " + t._2);
            });
        });
    }
});
Even when the load increases I don't see the CPU usage of the executor instances going up; most of the time the executor CPUs are idling. I tried increasing the Kafka partitions (currently Kafka has 72 partitions; I also tried bringing it down to 36). I also tried increasing the devceMapStream partitions, but I couldn't see any performance improvement. The code is not spending any time on IO.
I am running our Spark application with 6 executor instances on Amazon EMR (YARN), with each machine having 4 cores and 32 GB of RAM. I tried to increase the number of executor instances to 9 and then to 15, but didn't see any performance improvement. I also played around with spark.default.parallelism, setting it to 20, 36, 72, and 100, and 20 was the value that gave me the best performance (maybe the number of cores per executor has some influence on this).
spark-submit --deploy-mode cluster --class com.ajay.Engine --supervise --driver-memory 5G --driver-cores 8 --executor-memory 4G --executor-cores 4 --conf spark.default.parallelism=20 --num-executors 36 --conf spark.dynamicAllocation.enabled=false --conf spark.streaming.unpersist=false --conf spark.eventLog.enabled=false --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties --conf spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError --conf spark.executor.extraJavaOptions=-XX:HeapDumpPath=/tmp --conf spark.executor.extraJavaOptions=-XX:+UseG1GC --conf spark.driver.extraJavaOptions=-XX:+UseG1GC --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties s3://test/engine.jar
At present Spark is struggling to complete the processing in 10 seconds (I have even tried different batch durations like 5, 10, and 15 seconds). It's taking 15-23 seconds to complete one batch with an input rate of 1600 records per second and about 17000 records per batch. I need to use the state stream to check the state of the devices periodically, to see whether a device is raising any alarms or any sensors have stopped responding. I am not sure how I can improve the performance of my Spark application.
mapWithState does the following:
applying a function to every key-value element of this stream, while maintaining some state data for each unique key
as per its docs: PairDStreamFunctions#mapWithState
which also means that, for every batch, all the elements with the same key are processed in sequence. And because the function in StateSpec is arbitrary and provided by us, with no state combiners defined, it can't be parallelized any further, no matter how you partition the data before mapWithState. In other words, when keys are diverse, parallelization will be good, but if all the RDD elements share just a few unique keys, then the whole batch will mostly be processed by only as many cores as there are unique keys.
In your case, keys come from Kafka:
t.forEachRemaining(tuple->{
String key=tuple._1;
and your code snippet doesn't show how they are generated.
From my experience, this is what may be happening: part of your batches is getting quickly processed by multiple cores, and another part, where a substantial share of the elements has the same key, takes more time and delays the batch; that's why you see just a few tasks running most of the time while other executors are under-loaded.
To see if this is true, check your key distribution: how many elements are there for each key? Could it be that just a couple of keys hold 20% of all the elements? If this is true, you have these options:
change your key generation algorithm
artificially split problematic keys before mapWithState and combine the state snapshots later to make sense of the whole (see the sketch after this list)
cap the number of elements with the same key processed in each batch: either ignore elements after the first N in every batch, or send them elsewhere, into some "can't process in time" Kafka stream, and deal with them separately
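Here is a minimal sketch of the second option (key salting), not your actual pipeline: it reuses updatesStream, SparkExecutorLog and the types from your snippet, while SALT_BUCKETS, saltedDeviceMapStream and DeviceState.merge() are hypothetical names you would have to fill in. The first lines also show a quick way to inspect the key distribution mentioned above.
// Needs: import java.util.concurrent.ThreadLocalRandom; import scala.Tuple2;

// Quick diagnostic: log how many elements each key has per batch.
updatesStream.foreachRDD(rdd ->
        rdd.mapToPair(t -> new Tuple2<>(t._1, 1L))
           .reduceByKey(Long::sum)
           .foreach(kv -> SparkExecutorLog.error("key " + kv._1 + " has " + kv._2 + " elements")));

// Option 2, "key salting": spread a hot device key over several sub-keys so its updates
// can be processed by more than one core. SALT_BUCKETS is a made-up constant to tune,
// and this assumes device keys never contain '#'.
final int SALT_BUCKETS = 8;

JavaPairDStream<String, ProtoBufEventUpdate> saltedStream = updatesStream.mapToPair(t -> {
    int salt = ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
    return new Tuple2<>(t._1 + "#" + salt, t._2);
});

// Run mapWithState over saltedStream exactly as you do now (call the result
// saltedDeviceMapStream), then strip the salt from the snapshots and combine the
// partial states per device. DeviceState.merge() is hypothetical: you would have to
// implement how two partial states of the same device are reconciled.
JavaPairDStream<String, DeviceState> combinedStateStream = saltedDeviceMapStream
        .stateSnapshots()
        .mapToPair(t -> new Tuple2<>(t._1.substring(0, t._1.lastIndexOf('#')), t._2))
        .reduceByKey(DeviceState::merge);
Whether this pays off depends on how expensive your state update function is compared with the extra shuffle introduced by reduceByKey.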
I run the following hql:
select new.uid as uid, new.category_id as category_id, new.atag as atag,
new.rank_idx + CASE when old.rank_idx is not NULL then old.rank_idx else 0 END as rank_idx
from (
select a1.uid, a1.category_id, a1.atag, row_number() over(distribute by a1.uid, a1.category_id sort by a1.cmt_time) as rank_idx from (
select app.uid,
CONCAT(cast(app.knowledge_point_id_list[0] as string),'#',cast(app.type_id as string)) as category_id,
app.atag as atag, app.cmt_time as cmt_time
from model.mdl_psr_app_behavior_question_result app
where app.subject = 'english'
and app.dt = '2016-01-14'
and app.cmt_timelen > 1000
and app.cmt_timelen < 120000
) a1
) new
left join (
select uid, category_id, rank_idx from model.mdl_psr_mlc_app_count_last
where subject = 'english'
and dt = '2016-01-13'
) old
on new.uid = old.uid
and new.category_id = old.category_id
Originally mdl_psr_mlc_app_count_last and mdl_psr_mlc_app_count_day were stored with JsonSerde, and the query ran fine.
My colleague suggested that JsonSerde is highly inefficient and occupies too much space, and that PARQUET would be a better choice for me.
When I switched, the query broke with the following error log:
org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1 rows: used memory = 1024506232
2016-01-19 16:36:56,119 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10 rows: used memory = 1024506232
2016-01-19 16:36:56,130 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100 rows: used memory = 1024506232
2016-01-19 16:36:56,248 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1000 rows: used memory = 1035075896
2016-01-19 16:36:56,694 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10000 rows: used memory = 1045645560
2016-01-19 16:36:57,056 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100000 rows: used memory = 1065353232
It looks like a Java memory problem. Somebody suggested that I try:
SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=8048;
SET mapreduce.reduce.java.opts='-Xmx8048M';
SET mapreduce.map.memory.mb=1024;
set mapreduce.map.java.opts='-Xmx4096M';
set mapred.child.map.java.opts='-Xmx4096M';
It still breaks, with the same error message. Now someone else suggests:
SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=1024;
SET mapreduce.reduce.java.opts='-Xmx1024M';
SET mapreduce.map.memory.mb=1024;
set mapreduce.map.java.opts='-Xmx1024M';
set mapreduce.child.map.java.opts='-Xmx1024M';
set mapred.reduce.tasks = 40;
Now it runs without a glitch.
Can someone explain to me why?
================================
By the way: although it runs, the reduce step is very slow. While you are at it, could you explain why?
For some reason, YARN has poor support for the parquet format.
To quote MapR:
For example, if the MapReduce job sorts parquet files, Mapper needs to cache the whole Parquet row group in memory. I have done tests to prove that the larger the row group size of parquet files is, the larger Mapper memory is needed. In this case, make sure the Mapper memory is large enough without triggering OOM.
I am not exactly sure why the different settings in the question matter, but the simple solution is to drop parquet and use ORC: a bit of performance loss in exchange for no bugs. A sketch of the conversion is below.
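This is a minimal sketch of that conversion for one of the tables, assuming the column names from the query above, that subject and dt are partition columns, and guessing the column types; the _orc table name is made up:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE model.mdl_psr_mlc_app_count_last_orc (
    uid         string,
    category_id string,
    rank_idx    int
)
PARTITIONED BY (subject string, dt string)
STORED AS ORC;

-- Copy the existing data over, letting Hive create the partitions dynamically.
INSERT OVERWRITE TABLE model.mdl_psr_mlc_app_count_last_orc PARTITION (subject, dt)
SELECT uid, category_id, rank_idx, subject, dt
FROM model.mdl_psr_mlc_app_count_last;
After that, point the query at the _orc table (and convert the other table the same way if needed).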