How to reduce task size in Spark MLlib? - java

I'm trying to implement a Random Forest classifier using Apache Spark (2.2.0) and Java.
Basically, I've followed the example from the Spark documentation.
For test purposes I'm using a local cluster:
SparkSession spark = SparkSession
.builder()
.master("local[*]")
.appName(appName)
.getOrCreate();
My training/test data consists of about 30k rows. The data is fetched from REST APIs and transformed into a Spark Dataset:
List<PreparedWUMLogFile> logs = //... get from REST API
Dataset<PreparedWUMLogFile> dataSet = spark.createDataset(logs, Encoders.bean(PreparedWUMLogFile.class));
Dataset<Row> data = dataSet.toDF();
For many stages I get the following warning message:
[warn] o.a.s.s.TaskSetManager - Stage 0 contains a task of very large size (3002 KB). The maximum recommended task size is 100 KB.
How can I reduce the task size in this case?
Edit:
To be more concrete: 5 of the 30 stages produce these warning messages:
rdd at StringIndexer.scala:111 (two times)
take at VectorIndexer.scala:119
rdd at VectorIndexer.scala:122
rdd at Classifier.scala:82
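A possible direction (a hedged, untested sketch rather than a verified fix): parallelize the local list into an RDD with an explicit number of partitions before turning it into a Dataset, so that each task carries only a slice of the 30k rows instead of the whole driver-side collection. The partition count of 100 below is just an assumption.
// Untested sketch: build the Dataset from an RDD with an explicit partition count.
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<PreparedWUMLogFile> logsRdd = jsc.parallelize(logs, 100); // 100 is an assumed partition count
Dataset<PreparedWUMLogFile> dataSet =
        spark.createDataset(logsRdd.rdd(), Encoders.bean(PreparedWUMLogFile.class));
Dataset<Row> data = dataSet.toDF();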

Related

Pyspark - Write DF into partitions efficiently

I am trying to write a Spark DataFrame to HDFS using partitionBy, but it is throwing a Java heap space error.
Below are the cluster configuration and my Spark configuration.
Cluster Configuration:
5 nodes
No of cores/node: 32 cores
RAM/Node: 252GB
Spark Configuration:
spark.driver.memory = 50g
spark.executor.cores = 10
spark.executor.memory = 40g
df_final is created by reading an Avro file and doing some transformations (quite simple transformations like splitting a column and adding new columns with default values).
The size of the source file is around 15M
df_final.count() = 361016
I am facing a Java heap space error while writing the final DataFrame to HDFS:
df_final.write.partitionBy("col A", "col B", "col C", "col D").mode("append").format("orc").save("output")
I even tried to use Spark dynamic allocation:
spark.dynamicAllocation.enabled = 'true'
spark.shuffle.service.enabled = 'true'
I am still getting the Java heap space error.
I even tried to write the DataFrame without partitions, but it still fails with a Java heap space error or a GC overhead limit error.
This is the exact stage at which I am hitting the Java heap space error:
WARN TaskSetManager: Stage 30 contains a task of very large size (16648KB). The maximum recommended task size is 100 KB
How can I fine-tune my Spark configuration to avoid this Java heap space issue?
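One direction that sometimes helps with this kind of heap pressure (a hedged sketch, not a verified fix for this cluster): repartition the DataFrame by the same columns used in partitionBy before writing, so each task holds rows for only a few output partitions while writing the ORC files. Sketched here in Java, the language used elsewhere on this page; df_final and the column names come from the question, and the repartition call is the only addition.
import static org.apache.spark.sql.functions.col;

// Repartition by the partitionBy columns first, so each writing task only
// buffers rows for a small number of output partitions.
df_final
    .repartition(col("col A"), col("col B"), col("col C"), col("col D"))
    .write()
    .partitionBy("col A", "col B", "col C", "col D")
    .mode("append")
    .format("orc")
    .save("output");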

Window data on an hourly (clock-hour) basis in Apache Beam

I am trying to aggregate streaming data for each clock hour (like 12:00 to 12:59 and 01:00 to 01:59) in a Dataflow/Apache Beam job.
Following is my use case:
Data is streaming from Pub/Sub and it has a timestamp (the order date). I want to count the number of orders I get in each hour, and I also want to allow a delay of 5 hours for late data. Following is the sample code that I am using:
LOG.info("Start Running Pipeline");
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
PCollection<String> tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));
PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
.apply("Flatten Data from PubSub", Flatten.<String>pCollections());
flattenData
.apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
// Adding Window
.apply(
Window.<SalesAndUnits>into(
SlidingWindows.of(Duration.standardMinutes(15))
.every(Duration.standardMinutes(1)))
)
// Data Enrich with Dimensions
.apply(ParDo.of(new DataEnrichWithDimentions()))
// Group And Hourly Sum
.apply(new GroupAndSumSales())
.apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
pipeline.run();
LOG.info("Finish Running Pipeline");
I'd use a window that matches the requirements you have. Something along the lines of:
Window.into(
FixedWindows.of(Duration.standardHours(1))
).withAllowedLateness(Duration.standardHours(5))
Possibly followed by a count as that's what I understood you need.
Hope it helps
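For illustration, a hedged sketch of how this could plug into the asker's pipeline: fixed one-hour windows with 5 hours of allowed lateness, followed by a count per window. SalesAndUnits and DataParse are the asker's own classes; Window, FixedWindows, Combine and Count come from the Beam SDK.
// Sketch only: replaces the 15-minute sliding window with hourly fixed windows
// and counts the parsed orders in each window.
PCollection<Long> ordersPerHour =
    flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
        .apply("Hourly window",
            Window.<SalesAndUnits>into(FixedWindows.of(Duration.standardHours(1)))
                .withAllowedLateness(Duration.standardHours(5)))
        .apply("Count orders per hour",
            Combine.globally(Count.<SalesAndUnits>combineFn()).withoutDefaults());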

Spark mapWithState stateSnapshots not scaling (Java)

I am using Spark to receive data from a Kafka stream carrying the status of IoT devices, which send regular health updates and the state of the various sensors present in the devices. My Spark application listens to a single topic and receives the update messages using a Spark direct stream. I need to trigger different alarms based on the state of the sensors for each device. However, when I add more IoT devices sending data to Spark through Kafka, Spark does not scale, despite adding more machines and increasing the number of executors. Below I have given a stripped-down version of my Spark application, with the notification-triggering part removed; it shows the same performance issues.
// Method for updating the device state; DeviceState is just an in-memory object which tracks the state of a device.
private static Optional<DeviceState> trackDeviceState(Time time, String key, Optional<ProtoBufEventUpdate> updateOpt,
State<DeviceState> state) {
int batchTime = toSeconds(time);
ProtoBufEventUpdate eventUpdate = (updateOpt == null)?null:updateOpt.orNull();
if(eventUpdate!=null)
eventUpdate.setBatchTime(ProximityUtil.toSeconds(time));
if (state!=null && state.exists()) {
DeviceState deviceState = state.get();
if (state.isTimingOut()) {
deviceState.markEnd(batchTime);
}
if (updateOpt.isPresent()) {
deviceState = DeviceState.updatedDeviceState(deviceState, eventUpdate);
state.update(deviceState);
}
} else if (updateOpt.isPresent()) {
DeviceState deviceState = DeviceState.newDeviceState(eventUpdate);
state.update(deviceState);
return Optional.of(deviceState);
}
return Optional.absent();
}
SparkConf conf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.streaming.receiver.writeAheadLog.enable", "true")
.set("spark.rpc.netty.dispatcher.numThreads", String.valueOf(Runtime.getRuntime().availableProcessors()))
JavaStreamingContext context= new JavaStreamingContext(conf, Durations.seconds(10));
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("zookeeper.connect", "192.168.60.20:2181,192.168.60.21:2181,192.168.60.22:2181");
kafkaParams.put("metadata.broker.list", "192.168.60.20:9092,192.168.60.21:9092,192.168.60.22:9092");
kafkaParams.put("group.id", "spark_iot");
HashSet<String> topics=new HashSet<>();
topics.add("iottopic");
JavaPairInputDStream<String, ProtoBufEventUpdate> inputStream = KafkaUtils.
createDirectStream(context, String.class, ProtoBufEventUpdate.class, KafkaKryoCodec.class, ProtoBufEventUpdateCodec.class, kafkaParams, topics);
JavaPairDStream<String, ProtoBufEventUpdate> updatesStream = inputStream.mapPartitionsToPair(t -> {
List<Tuple2<String, ProtoBufEventUpdate>> eventupdateList=new ArrayList<>();
t.forEachRemaining(tuple->{
String key=tuple._1;
ProtoBufEventUpdate eventUpdate =tuple._2;
Util.mergeStateFromStats(eventUpdate);
eventupdateList.add(new Tuple2<String, ProtoBufEventUpdate>(key,eventUpdate));
});
return eventupdateList.iterator();
});
JavaMapWithStateDStream<String, ProtoBufEventUpdate, DeviceState, DeviceState> devceMapStream = null;
devceMapStream=updatesStream.mapWithState(StateSpec.function(Engine::trackDeviceState)
.numPartitions(20)
.timeout(Durations.seconds(1800)));
devceMapStream.checkpoint(new Duration(batchDuration*1000));
JavaPairDStream<String, DeviceState> deviceStateStream = devceMapStream
.stateSnapshots()
.cache();
deviceStateStream.foreachRDD(rdd->{
if(rdd != null && !rdd.isEmpty()){
rdd.foreachPartition(tuple->{
tuple.forEachRemaining(t->{
SparkExecutorLog.error("Engine::getUpdates Tuple data "+ t._2);
});
});
}
});
Even when the load increases, I don't see the CPU usage of the executor instances going up; most of the time the executor CPUs are idling. I tried changing the number of Kafka partitions (currently Kafka has 72 partitions; I also tried bringing it down to 36). I also tried increasing the number of devceMapStream partitions, but I couldn't see any performance improvement. The code is not spending any time on IO.
I am running our Spark application with 6 executor instances on Amazon EMR (YARN), with each machine having 4 cores and 32 GB of RAM. I tried increasing the number of executor instances to 9 and then to 15, but didn't see any performance improvement. I also played around with the spark.default.parallelism value, setting it to 20, 36, 72 and 100, and 20 was the value that gave me the best performance (maybe the number of cores per executor has some influence on this).
spark-submit --deploy-mode cluster --class com.ajay.Engine --supervise --driver-memory 5G --driver-cores 8 --executor-memory 4G --executor-cores 4 --conf spark.default.parallelism=20 --num-executors 36 --conf spark.dynamicAllocation.enabled=false --conf spark.streaming.unpersist=false --conf spark.eventLog.enabled=false --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties --conf spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError --conf spark.executor.extraJavaOptions=-XX:HeapDumpPath=/tmp --conf spark.executor.extraJavaOptions=-XX:+UseG1GC --conf spark.driver.extraJavaOptions=-XX:+UseG1GC --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties s3://test/engine.jar
At present Spark is struggling to complete the processing within 10 seconds (I have even tried different batch durations like 5, 10 and 15 seconds). It is taking 15-23 seconds to complete one batch, with an input rate of 1600 records per second and about 17000 records per batch. I need to use the state snapshot stream to check the state of the devices periodically, to see whether a device is raising any alarms or any sensors have stopped responding. How can I improve the performance of my Spark application?
mapWithState does the following:
applying a function to every key-value element of this stream, while maintaining some state data for each unique key
as per its docs: PairDStreamFunctions#mapWithState
which also means that, for every batch, all the elements with the same key are processed in sequence; and because the function in StateSpec is arbitrary and provided by us, with no state combiners defined, it can't be parallelized any further, no matter how you partition the data before mapWithState. In other words, when keys are diverse, parallelization will be good, but if all the RDD elements share just a few unique keys, the whole batch will mostly be processed by only as many cores as there are unique keys.
In your case, keys come from Kafka:
t.forEachRemaining(tuple->{
String key=tuple._1;
and your code snippet doesn't show how they are generated.
From my experience, this is what may be happening: one part of each batch gets processed quickly by multiple cores, while another part, which shares the same key across a substantial fraction of the elements, takes much longer and delays the batch; that's why you see just a few tasks running most of the time while the executors are under-loaded.
To see if this is true, check your key distribution: how many elements are there for each key? Can it be that just a couple of keys hold 20% of all the elements (a diagnostic sketch follows the options below)? If so, you have these options:
change your keys generation algorithm
artificially split problematic keys before mapWithState and combine state snapshots later to make sense for the whole
cap the number of elements with the same key to be processed in each batch: either ignore elements after the first N in every batch, or send them elsewhere, into some "can't process in time" Kafka stream, and deal with them separately
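For the "check your key distribution" part, here is a small diagnostic sketch (purely illustrative; it assumes the updatesStream from the question and the asker's SparkExecutorLog helper):
// Count elements per key in every batch and log the ten heaviest keys,
// to see whether a couple of keys dominate the batch.
updatesStream.foreachRDD(rdd -> {
    java.util.Map<String, Long> countsByKey = rdd.keys().countByValue();
    countsByKey.entrySet().stream()
        .sorted(java.util.Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(10)
        .forEach(e -> SparkExecutorLog.error(
            "key=" + e.getKey() + " count=" + e.getValue()));
});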

Does Storm Trident newValuesStream after persistentAggregate maintain the partitioning from groupBy?

I am currently trying to scale a Trident topology that does some post-processing after a groupBy and persistentAggregate, using newValuesStream to stream values after the aggregate step. I was wondering whether the tuples remain partitioned as they were during the groupBy step, or whether they are redistributed in some other fashion.
relevant code:
.groupBy(new Fields("key"))
.name("GroupBy")
.persistentAggregate(new MemoryMapState.Factory(), new Fields("foo", "bar"), new Aggregator(), new Fields("foobar"))
.newValuesStream()
.name("NewValueStream")

How to write selected columns to Kafka topic?

I am using spark-sql 2.4.1 with Java 1.8, and the Kafka artifacts spark-sql-kafka-0-10_2.11:2.4.3 and kafka-clients 0.10.0.0.
StreamingQuery queryComapanyRecords =
comapanyRecords
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.option("auto.create.topics.enable", "false")
.option("key.serializer","org.apache.kafka.common.serialization.StringDeserializer")
.option("value.serializer", "com.spgmi.ca.prescore.serde.MessageRecordSerDe")
.option("checkpointLocation", "/app/chkpnt/" )
.outputMode("append")
.start();
queryComapanyRecords.awaitTermination();
It gives this error:
Caused by: org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at org.apache.spark.sql.kafka010.KafkaWriter$$anonfun$6.apply(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:121)
I tried to fix it as below, but I am unable to send the value, which is a Java bean in my case.
StreamingQuery queryComapanyRecords =
comapanyRecords.selectExpr("CAST(company_id AS STRING) AS key", "to_json(struct(\"company_id\",\"fiscal_year\",\"fiscal_quarter\")) AS value")
.writeStream()
.format("kafka")
.option("kafka.bootstrap.servers",KAFKA_BROKER)
.option("topic", "in_topic")
.start();
So is there any way in Java to handle/send this value (i.e. a Java bean as the record)?
Kafka data source requires a specific schema for reading (loading) and writing (saving) datasets.
Quoting the official documentation (highlighting the most important field / column):
Each row in the source has the following schema:
...
value binary
...
In other words, you have Kafka records in the value column when reading from a Kafka topic, and you have to make the data you want to save to a Kafka topic available in the value column as well.
Whatever is, or is going to be, in Kafka lives in the value column; the value column is where you "store" the business records (the data).
On to your question:
How to write selected columns to Kafka topic?
You should "pack" the selected columns together so they can all be part of the value column. The to_json standard function is a good fit, so the selected columns become a single JSON message.
Example
Let me give you an example.
Don't forget to start a Spark application or spark-shell with the Kafka data source. Mind the versions of Scala (2.11 or 2.12) and Spark (e.g. 2.4.4).
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4
Let's start by creating a sample dataset. Any multiple-field dataset would work.
val ns = Seq((0, "zero")).toDF("id", "name")
scala> ns.show
+---+----+
| id|name|
+---+----+
| 0|zero|
+---+----+
If we tried to write the dataset to a Kafka topic, it would error out because the value column is missing. That's what you faced initially.
scala> ns.write.format("kafka").option("topic", "in_topic").save
org.apache.spark.sql.AnalysisException: Required attribute 'value' not found;
at org.apache.spark.sql.kafka010.KafkaWriter$.$anonfun$validateQuery$6(KafkaWriter.scala:71)
at scala.Option.getOrElse(Option.scala:138)
...
You have to come up with a way to "pack" multiple fields (columns) together and make them available as the value column. The struct and to_json standard functions will do it.
val vs = ns.withColumn("value", to_json(struct("id", "name")))
scala> vs.show(truncate = false)
+---+----+----------------------+
|id |name|value |
+---+----+----------------------+
|0 |zero|{"id":0,"name":"zero"}|
+---+----+----------------------+
Saving to a Kafka topic should now be a breeze.
vs.write.format("kafka").option("topic", "in_topic").save
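For completeness, a hedged sketch of the same idea in Java (the asker's language). It assumes comapanyRecords has the columns company_id, fiscal_year and fiscal_quarter from the question; note that the columns are passed as Column objects rather than quoted string literals, so to_json serializes their values.
import static org.apache.spark.sql.functions.*;

// Pack the selected columns into a JSON value column and cast the key to string.
Dataset<Row> withValue = comapanyRecords
    .withColumn("key", col("company_id").cast("string"))
    .withColumn("value", to_json(struct(col("company_id"), col("fiscal_year"), col("fiscal_quarter"))));

StreamingQuery query = withValue
    .selectExpr("key", "value")
    .writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("topic", "in_topic")
    .option("checkpointLocation", "/app/chkpnt/")
    .start();
query.awaitTermination();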
