Pyspark - Write DF into partitions efficiently - java

I am trying to write a Spark DataFrame to HDFS using partitionBy, but it is throwing a Java heap space error.
Below are the cluster configuration and my Spark configuration.
Cluster Configuration:
5 nodes
No of cores/node: 32 cores
RAM/Node: 252GB
Spark Configuration:
spark.driver.memory = 50g
spark.executor.cores = 10
spark.executor.memory = 40g
df_final is created by reading an Avro file and doing some transformations (quite simple ones, like splitting a column and adding new columns with default values).
The size of the source file is around 15 MB.
df_final.count() = 361016
I am facing the Java heap space error while writing the final DataFrame to HDFS:
df_final.write.partitionBy("col A", "col B", "col C", "col D").mode("append").format("orc").save("output")
I even tried to enable Spark dynamic allocation:
spark.dynamicAllocation.enabled = 'true'
spark.shuffle.service.enabled = 'true'
I am still getting the Java heap space error.
I even tried writing the DataFrame without partitions, but it still fails with a Java heap space error or a GC overhead error.
This is the exact stage at which I am getting the Java heap space error:
WARN TaskSetManager: Stage 30 contains a task of very large size (16648KB). The maximum recommended task size is 100 KB
How can I fine-tune my Spark configuration to avoid this Java heap space issue?
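Since the warning points at oversized tasks in the write stage, one thing worth trying (a hedged sketch, not a guaranteed fix) is to repartition on the same columns before partitionBy, so each task only carries the rows for the output partitions it writes:
# Cluster the data on the partition columns before writing, so each task
# handles a bounded slice and writes few files per output partition.
df_final.repartition("col A", "col B", "col C", "col D") \
    .write.partitionBy("col A", "col B", "col C", "col D") \
    .mode("append").format("orc").save("output")
With only ~361k rows this also keeps the number of output files per partition down; if the error persists, the next step would be checking how df_final is built, since a ~16 MB serialized task usually means data or a large closure is being shipped inside the task itself.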

Related

Exception in thread "dispatcher-event-loop-5" java.lang.OutOfMemoryError: GC overhead limit exceeded : Spark

Could someone please make a suggestion here? I am executing the following code with spark-submit; every 5 minutes it takes data from the source Hive table (Hive gets its data from a Spark Streaming job whose source is Kafka) and does some aggregation over the last 2 hours of partitions on the target side.
val spark_sql = """insert OVERWRITE table hivedesttable partition(partition_date, partition_hour)
  select <some business logic with aggregation and group by condition> from hivesrctable
  where concat(partitionDate, ":", partitionHour) in ${partitionDateHour}"""
new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      val currentTs = java.time.LocalDateTime.now
      var partitionDateHour = (0 until 2)
        .map(h => currentTs.minusHours(h))
        .map(ts => s"'${ts.toString.substring(0, 10)}${":"}${ts.toString.substring(11, 13)}'")
        .toList.toString().drop(4)
      /** Replace the ${partitionDateHour} placeholder in the query with the value computed for this iteration. */
      val sparksql = spark_sql.replace("${partitionDateHour}", partitionDateHour)
      spark.sql(sparksql)
      Thread.sleep(300000)
    }
  }
}).start()
scala.io.StdIn.readLine()
It was working properly for 5 to 6 hours before displaying the error message below.
Exception in thread "dispatcher-event-loop-5" java.lang.OutOfMemoryError: GC overhead
limit exceeded
22/12/13 19:07:42 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning
+- *HashAggregate
+- Exchange hashpartitioning
+- *HashAggregate
+- *Filter
My spark-submit configuration:
--conf "spark.hadoop.hive.exec.dynamic.partition=true"
--conf "spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict"
--num-executors 1
--driver-cores 1
--driver-memory 1G
--executor-cores 1
--executor-memory 1G
I attempted to increase --executor-memory, but the issue persisted. I'm wondering if there's a way to release resources at the end of each loop iteration so that unneeded objects can be garbage-collected.
Could someone please advise me on how to handle this situation?
The java.lang.OutOfMemoryError: GC overhead limit exceeded occurs when the Java process spends more than approximately 98% of its time doing garbage collection while recovering less than 2% of the heap. Read more at docs.oracle.com.
There are two likely reasons: (1) you have a memory leak, which in most cases turns out to be the root cause, or (2) you are not using sufficient heap memory.
I would suggest doing a heap dump analysis to see whether you have a memory leak. Also, try increasing memory and see how it goes.
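For the heap dump, one hedged option (these are standard JVM and Spark options, not something taken from the question) is to extend the existing spark-submit configuration so that an .hprof file is written on the next OutOfMemoryError, and then open that file in a heap analyzer such as Eclipse MAT:
--conf "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
--conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
If the dump shows no obvious leak, raising --driver-memory and --executor-memory from 1G would be the next thing to try, since the job aggregates two hours of partitions every five minutes.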

How to reduce task size in Spark MLlib?

I'm trying to implement a Random Forest Classifier using Apache Spark (2.2.0) and Java.
Basically, I've followed the example from the Spark documentation.
For test purposes I'm using a local cluster:
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.appName(appName)
.getOrCreate();
My training/test data comprises 30k rows. The data is fetched from REST APIs and transformed into a Spark Dataset.
List<PreparedWUMLogFile> logs = //... get from REST API
Dataset<PreparedWUMLogFile> dataSet = spark.createDataset(logs, Encoders.bean(PreparedWUMLogFile.class));
Dataset<Row> data = dataSet.toDF();
For many stages I get the following warning message:
[warn] o.a.s.s.TaskSetManager - Stage 0 contains a task of very large size (3002 KB). The maximum recommended task size is 100 KB.
How can I reduce the task size in this case?
Edit:
To be more concrete: 5 of the 30 stages produce these warning messages:
rdd at StringIndexer.scala:111 (two times)
take at VectorIndexer.scala:119
rdd at VectorIndexer.scala:122
rdd at Classifier.scala:82
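Because the rows come from a REST API into a local List before createDataset, the data ends up embedded in the tasks that operate on it, which is what inflates the task size. A hedged sketch (PreparedWUMLogFile and logs are names from the question; numSlices is an illustrative value) of building the Dataset from an explicitly sliced RDD so each task carries only a small chunk:
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
// ...
List<PreparedWUMLogFile> logs = //... get from REST API
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
int numSlices = 100; // illustrative: pick so each slice stays well under the warning threshold
Dataset<PreparedWUMLogFile> dataSet = spark.createDataset(
        jsc.parallelize(logs, numSlices).rdd(),          // split the local list across many partitions
        Encoders.bean(PreparedWUMLogFile.class));
Dataset<Row> data = dataSet.toDF();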

Spark mapWithState stateSnapshots not scaling (Java)

I am using Spark to receive data from a Kafka stream carrying the status of IoT devices, which send regular health updates along with the state of their various sensors. My Spark application listens to a single topic and receives the update messages using a Spark direct stream. I need to trigger different alarms based on the state of the sensors for each device. However, when I add more IoT devices sending data to Spark through Kafka, Spark does not scale despite adding more machines and increasing the number of executors. Below I have given a stripped-down version of my Spark application, with the notification-triggering part removed, which shows the same performance issues.
// Method to update the device state; DeviceState is just an in-memory object that tracks the device state.
private static Optional<DeviceState> trackDeviceState(Time time, String key,
        Optional<ProtoBufEventUpdate> updateOpt, State<DeviceState> state) {
    int batchTime = toSeconds(time);
    ProtoBufEventUpdate eventUpdate = (updateOpt == null) ? null : updateOpt.orNull();
    if (eventUpdate != null)
        eventUpdate.setBatchTime(ProximityUtil.toSeconds(time));
    if (state != null && state.exists()) {
        DeviceState deviceState = state.get();
        if (state.isTimingOut()) {
            deviceState.markEnd(batchTime);
        }
        if (updateOpt.isPresent()) {
            deviceState = DeviceState.updatedDeviceState(deviceState, eventUpdate);
            state.update(deviceState);
        }
    } else if (updateOpt.isPresent()) {
        DeviceState deviceState = DeviceState.newDeviceState(eventUpdate);
        state.update(deviceState);
        return Optional.of(deviceState);
    }
    return Optional.absent();
}
SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    .set("spark.rpc.netty.dispatcher.numThreads", String.valueOf(Runtime.getRuntime().availableProcessors()));
JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(10));
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("zookeeper.connect", "192.168.60.20:2181,192.168.60.21:2181,192.168.60.22:2181");
kafkaParams.put("metadata.broker.list", "192.168.60.20:9092,192.168.60.21:9092,192.168.60.22:9092");
kafkaParams.put("group.id", "spark_iot");
HashSet<String> topics = new HashSet<>();
topics.add("iottopic");
JavaPairInputDStream<String, ProtoBufEventUpdate> inputStream = KafkaUtils.createDirectStream(
    context, String.class, ProtoBufEventUpdate.class, KafkaKryoCodec.class, ProtoBufEventUpdateCodec.class, kafkaParams, topics);
JavaPairDStream<String, ProtoBufEventUpdate> updatesStream = inputStream.mapPartitionsToPair(t -> {
    List<Tuple2<String, ProtoBufEventUpdate>> eventupdateList = new ArrayList<>();
    t.forEachRemaining(tuple -> {
        String key = tuple._1;
        ProtoBufEventUpdate eventUpdate = tuple._2;
        Util.mergeStateFromStats(eventUpdate);
        eventupdateList.add(new Tuple2<String, ProtoBufEventUpdate>(key, eventUpdate));
    });
    return eventupdateList.iterator();
});
JavaMapWithStateDStream<String, ProtoBufEventUpdate, DeviceState, DeviceState> devceMapStream =
    updatesStream.mapWithState(StateSpec.function(Engine::trackDeviceState)
        .numPartitions(20)
        .timeout(Durations.seconds(1800)));
devceMapStream.checkpoint(new Duration(batchDuration * 1000));
JavaPairDStream<String, DeviceState> deviceStateStream = devceMapStream
    .stateSnapshots()
    .cache();
deviceStateStream.foreachRDD(rdd -> {
    if (rdd != null && !rdd.isEmpty()) {
        rdd.foreachPartition(tuple -> {
            tuple.forEachRemaining(t -> {
                SparkExecutorLog.error("Engine::getUpdates Tuple data " + t._2);
            });
        });
    }
});
Even when the load increases I don't see the CPU usage of the executor instances increasing; most of the time the executor CPUs are idling. I tried increasing the Kafka partitions (currently Kafka has 72 partitions; I also tried bringing it down to 36), and I tried increasing the devceMapStream partitions, but I couldn't see any performance improvement. The code is not spending any time on IO.
I am running our Spark application with 6 executor instances on Amazon EMR (YARN), with each machine having 4 cores and 32 GB RAM. I tried to increase the number of executor instances to 9 and then to 15, but didn't see any performance improvement. I also played around a bit with the spark.default.parallelism value, setting it to 20, 36, 72 and 100, but 20 was the value that gave me the best performance (maybe the number of cores per executor has some influence on this).
spark-submit --deploy-mode cluster --class com.ajay.Engine --supervise --driver-memory 5G --driver-cores 8 --executor-memory 4G --executor-cores 4 --conf spark.default.parallelism=20 --num-executors 36 --conf spark.dynamicAllocation.enabled=false --conf spark.streaming.unpersist=false --conf spark.eventLog.enabled=false --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties --conf spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError --conf spark.executor.extraJavaOptions=-XX:HeapDumpPath=/tmp --conf spark.executor.extraJavaOptions=-XX:+UseG1GC --conf spark.driver.extraJavaOptions=-XX:+UseG1GC --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties s3://test/engine.jar
At present Spark is struggling to complete the processing within 10 seconds (I have also tried different batch durations like 5, 10 and 15 seconds). It is taking 15-23 seconds to complete one batch with an input rate of 1600 records per second and around 17000 records per batch. I need to use the state stream to check the state of the devices periodically, to see whether a device is raising any alarms or any sensors have stopped responding. I am not sure how I can improve the performance of my Spark application.
mapWithState does the following:
applying a function to every key-value element of this stream, while maintaining some state data for each unique key
as per its docs: PairDStreamFunctions#mapWithState
which also means that for every batch, all the elements with the same key are processed in sequence; and because the function in StateSpec is arbitrary and provided by us, with no state combiners defined, it can't be parallelized any further, no matter how you partition the data before mapWithState. In other words, when keys are diverse, parallelization will be good, but if all the RDD elements share just a few unique keys among them, the whole batch will mostly be processed by a number of cores equal to the number of unique keys.
In your case, keys come from Kafka:
t.forEachRemaining(tuple->{
String key=tuple._1;
and your code snippet doesn't show how they are generated.
From my experience, this is what may be happening: part of each batch is processed quickly by multiple cores, while another part, in which a substantial fraction of the elements share the same key, takes more time and delays the batch; that's why you see just a few tasks running most of the time while other executors are under-loaded.
To see if this is true, check your key distribution: how many elements are there for each key, and could it be that just a couple of keys hold 20% of all the elements? If so, you have these options:
change your key generation algorithm
artificially split problematic keys before mapWithState and combine the state snapshots later so they make sense for the whole (see the sketch after this list)
cap the number of elements with the same key to be processed in each batch: either ignore elements after the first N in every batch, or send them elsewhere, into some "can't process in time" Kafka stream, and deal with them separately
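A hedged sketch of the second option (key salting), reusing the stream, class, and StateSpec settings from the question; saltBuckets and DeviceState.merge are hypothetical and would have to fit the real state semantics:
// requires java.util.concurrent.ThreadLocalRandom and scala.Tuple2
int saltBuckets = 8; // hypothetical: number of sub-keys per hot device key
JavaPairDStream<String, ProtoBufEventUpdate> saltedStream = updatesStream.mapToPair(tuple -> {
    // Spread each device key over several state partitions by appending a random salt.
    String salt = String.valueOf(ThreadLocalRandom.current().nextInt(saltBuckets));
    return new Tuple2<>(tuple._1 + "#" + salt, tuple._2);
});
JavaMapWithStateDStream<String, ProtoBufEventUpdate, DeviceState, DeviceState> saltedMapStream =
    saltedStream.mapWithState(StateSpec.function(Engine::trackDeviceState)
        .numPartitions(20)
        .timeout(Durations.seconds(1800)));
// Strip the salt from the snapshots and merge the per-salt partial states back
// into one state per device before checking alarms.
JavaPairDStream<String, DeviceState> mergedSnapshots = saltedMapStream
    .stateSnapshots()
    .mapToPair(t -> new Tuple2<>(t._1.substring(0, t._1.lastIndexOf('#')), t._2))
    .reduceByKey((a, b) -> DeviceState.merge(a, b)); // DeviceState.merge is a hypothetical combiner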

Apache Spark with Spark JobServer crash after some hours

I'm using Apache Spark 2.0.2 together with Spark JobServer 0.7.0.
I know this is not a best practice, but this is a first step. My server has 52 GB RAM and 6 CPU cores, CentOS 7 x64, Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and it has the following running applications with the specified memory configuration.
JBoss AS 7 (6 Gb)
PDI Pentaho 6.0 (12 Gb)
MySQL (20 Gb)
Apache Spark 2.0.2 (8 Gb)
I start it and everything works as expected, and it keeps working that way for several hours.
I have a jar with 2 implemented jobs that extend the following base class (VIQ_SparkJob).
public class VIQ_SparkJob extends JavaSparkJob {
    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);
        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
This job imports the data from some .csv files and stores them as Parquet files partitioned by tenant. There are 2 entities: users, which occupies 674 MB on disk as Parquet files, and user_processed, which occupies 323 MB.
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String entity = jobConfig.getString("entity");
    Dataset<Row> ds = sparkSession.read()
            .option("header", "true")
            .option("inferschema", true)
            .csv(csvPath);
    ds.withColumn("tenant_id", ds.col("tenant_id").cast("int"))
            .write()
            .mode(SaveMode.Append)
            .partitionBy(JavaConversions.asScalaBuffer(asList("tenant_id")))
            .parquet("/value_iq/spark-warehouse/" + entity);
    return null;
}
This one is to query the parquet files:
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String query = jobConfig.getString("query");
    Dataset<Row> lookup_values = getDataFrameFromMySQL("value_iq", "lookup_values")
            .filter(new Column("lookup_domain").equalTo("customer_type"));
    Dataset<Row> users = getDataFrameFromParket(USERS + "/tenant_id=" + TENANT_ID);
    Dataset<Row> user_profiles = getDataFrameFromParket(USER_PROCESSED + "/tenant_id=" + TENANT_ID);
    lookup_values.createOrReplaceTempView("lookup_values");
    users.createOrReplaceTempView("users");
    user_profiles.createOrReplaceTempView("user_processed");
    // CREATING VIEWS DE AND EN
    sparkSession
            .sql("<here I join the 3 datasets>")
            .coalesce(200)
            .createOrReplaceTempView("cube_users_v_de");
    List<String> list = sparkSession.sql(query).limit(1000).toJSON().takeAsList(1000);
    String result = "[";
    for (int i = 0; i < list.size(); i++) {
        result += (i == 0 ? "" : ",") + list.get(i);
    }
    result += "]";
    return result;
}
Every day I run the first job, saving some CSVs as Parquet files, and during the day I execute some queries with the second one. But after some hours it crashes with an out-of-memory error; this is the error log:
k.memory.TaskMemoryManager [] [akka://JobServer/user/context-supervisor/application_analytics] - Failed to allocate a page (8388608 bytes), try again.
WARN .netty.NettyRpcEndpointRef [] [] - Error sending message [message = Heartbeat(0,[Lscala.Tuple2;#18c54652,BlockManagerId(0, 157.97.107.42, 55223))] in 1 attempt
I have the master and one worker on this server. My spark-defaults.conf:
spark.debug.maxToStringFields 256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
This is my Spark Jobserver settings.sh
DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8
I create my context with the following curl:
curl -k --basic --user 'user:password' -d "" 'https://localhost:4810/contexts/application?num-cpu-cores=5&memory-per-node=8G'
And the Spark driver uses 2 GB.
The created application looks like this:
ExecutorID Worker Cores Memory State Logs
0 worker-20170203084218-157.97.107.42-50199 5 8192 RUNNING stdout stderr
Those are my executors
Executor ID Address ▴ Status RDD Blocks Storage Memory Disk Used Cores
driver 157.97.107.42:55222 Active 0 0.0 B / 1018.9 MB 0.0 B 0
0 157.97.107.42:55223 Active 0 0.0 B / 4.1 GB 0.0 B 5
I have a process that checks the memory used per process, and the peak amount was 8468 MB.
There are 4 processes related to Spark:
The master process. It starts with 1 GB of memory assigned (I don't know where this configuration comes from, but it seems to be enough) and uses only 0.4 GB at peak.
The worker process. Same memory usage as the master.
The driver process, which has 2 GB configured.
The context, which has 8 GB configured.
In the following table you can see how the memory used by the driver and the context behaves. After the java.lang.OutOfMemoryError: Java heap space, the context fails, but the driver accepts another context, so it remains fine.
system_user | RAM(Mb) | entry_date
--------------+----------+---------------------
spark.driver 2472.11 2017-02-07 10:10:18 //Till here everything was fine
spark.context 5470.19 2017-02-07 10:10:18 //it was running for more thant 48 hours
spark.driver 2472.11 2017-02-07 10:11:18 //Then I execute three big concurrent queries
spark.context 0.00 2017-02-07 10:11:18 //and I get java.lang.OutOfMemoryError: Java heap space
//in $LOG_FOLDER/job-server-master/server_startup.log
# I've check and the context was still present in the jobserver but unresponding.
#in spark the application was killed
spark.driver 2472.11 2017-02-07 10:16:18 //Here I have deleted and created again
spark.context 105.20 2017-02-07 10:16:18
spark.driver 2577.30 2017-02-07 10:19:18 //Here I execute the three big
spark.context 3734.46 2017-02-07 10:19:18 //concurrent queries again.
spark.driver 2577.30 2017-02-07 10:20:18 //Here after the queries where
spark.context 5154.60 2017-02-07 10:20:18 //executed. No memory issue.
I have 2 questions:
1 - Why, when I check the Spark GUI, does my driver, which has 2 GB configured, only use 1 GB, and likewise executor 0, which only uses 4.4 GB? Where does the rest of the configured memory go? Yet when I check the processes on the system, the driver does use 2 GB.
2 - If I have enough memory on the server, then why am I running out of memory?

Java heap size memory at map step on hive sql

I run the following hql:
select new.uid as uid, new.category_id as category_id, new.atag as atag,
new.rank_idx + CASE when old.rank_idx is not NULL then old.rank_idx else 0 END as rank_idx
from (
select a1.uid, a1.category_id, a1.atag, row_number() over(distribute by a1.uid, a1.category_id sort by a1.cmt_time) as rank_idx from (
select app.uid,
CONCAT(cast(app.knowledge_point_id_list[0] as string),'#',cast(app.type_id as string)) as category_id,
app.atag as atag, app.cmt_time as cmt_time
from model.mdl_psr_app_behavior_question_result app
where app.subject = 'english'
and app.dt = '2016-01-14'
and app.cmt_timelen > 1000
and app.cmt_timelen < 120000
) a1
) new
left join (
select uid, category_id, rank_idx from model.mdl_psr_mlc_app_count_last
where subject = 'english'
and dt = '2016-01-13'
) old
on new.uid = old.uid
and new.category_id = old.category_id
Originally, mdl_psr_mlc_app_count_last and mdl_psr_mlc_app_count_day were stored with JsonSerde, and the query ran fine.
My colleague suggested that JsonSerde is highly inefficient and occupies too much space, and that Parquet would be a better choice for me.
When I switched, the query broke with the following error log:
org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1 rows: used memory = 1024506232
2016-01-19 16:36:56,119 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10 rows: used memory = 1024506232
2016-01-19 16:36:56,130 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100 rows: used memory = 1024506232
2016-01-19 16:36:56,248 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 1000 rows: used memory = 1035075896
2016-01-19 16:36:56,694 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 10000 rows: used memory = 1045645560
2016-01-19 16:36:57,056 INFO [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: ExecMapper: processing 100000 rows: used memory = 1065353232
It looks like a Java memory problem. Somebody suggested that I try:
SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=8048;
SET mapreduce.reduce.java.opts='-Xmx8048M';
SET mapreduce.map.memory.mb=1024;
set mapreduce.map.java.opts='-Xmx4096M';
set mapred.child.map.java.opts='-Xmx4096M';
It still breaks, with the same error message. Now someone else suggests:
SET mapred.child.java.opts=-Xmx900m;
SET mapreduce.reduce.memory.mb=1024;
SET mapreduce.reduce.java.opts='-Xmx1024M';
SET mapreduce.map.memory.mb=1024;
set mapreduce.map.java.opts='-Xmx1024M';
set mapreduce.child.map.java.opts='-Xmx1024M';
set mapred.reduce.tasks = 40;
Now it runs without a glitch.
Can someone explain to me why?
================================
By the way: although it runs, the reduce step is very slow. While you are at it, could you explain why?
For some reason, YARN has poor support for the Parquet format.
Quoting MapR:
For example, if the MapReduce job sorts parquet files, Mapper needs to cache the whole Parquet row group in memory. I have done tests to prove that the larger the row group size of parquet files is, the larger Mapper memory is needed. In this case, make sure the Mapper memory is large enough without triggering OOM.
I am not exactly sure why the different settings in the question matter, but the simple solution is to drop Parquet and use ORC: a bit of performance loss in exchange for being bug-free.
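A hedged sketch of that suggestion in HiveQL (the _orc table name is hypothetical, subject and dt are assumed to be the partition columns because the query filters on them, and the real table very likely has more columns than the three referenced in the query):
CREATE TABLE model.mdl_psr_mlc_app_count_last_orc (
  uid string,
  category_id string,
  rank_idx int
)
PARTITIONED BY (subject string, dt string)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE model.mdl_psr_mlc_app_count_last_orc PARTITION (subject, dt)
SELECT uid, category_id, rank_idx, subject, dt
FROM model.mdl_psr_mlc_app_count_last;
The query in the question can then read from the ORC copy unchanged.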
