Processing RDDs in a DStream in parallel - java

I came across the following code which processes messages in Spark Streaming:
val listRDD = ssc.socketTextStream(host, port)
listRDD.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Should I start a separate thread for each RDD and/or Partition?
    partition.foreach(message => {
      Processor.processMessage(message)
    })
  })
})
This is working for me but I am not sure if this is the best way. I understand that a DStream consists of "one to many" RDDs, but this code processes RDDs sequentially one after the other, right? Isn't there a better way - a method or function - that I can use so that all the RDDs in the DStream get processed in parallel? Should I start a separate thread for each RDD and/or Partition? Have I misunderstood how this code works under Spark?
Somehow I think this code is not taking advantage of the parallelism in Spark.

Streams are partitioned into small RDDs for convenience and efficiency (see micro-batching). But you really don't need to break every RDD into partitions, or even break the stream into RDDs.
It all depends on what Processor.processMessage really is. If it is a single transformation function, you can just do listRDD.map(Processor.processMessage) and you get a stream of whatever the result of processing a message is, computed in parallel with no need for you to do much else.
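In the Java API (matching the question's tag), the stateless case could look roughly like the sketch below, assuming jssc is your JavaStreamingContext and Processor.processMessage is a static method that takes a String and, here, returns a String:
JavaReceiverInputDStream<String> lines = jssc.socketTextStream(host, port);

// Each batch's RDD is processed in parallel across its partitions;
// Spark schedules the tasks, so no manual threads are needed.
JavaDStream<String> processed = lines.map(message -> Processor.processMessage(message));
processed.print();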
If Processor is a mutable object that holds state (say, counting the number of messages) then things are more complicated, as you will need to define many such objects to account for parallelism and will also need to somehow merge results later on.
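For instance, if the state were just a message count, one rough sketch (reusing the lines DStream from the sketch above) would keep one counter per partition and merge the partial results with reduce(), so no mutable Processor is shared across tasks:
JavaDStream<Long> countsPerBatch = lines.mapPartitions(messages -> {
    long count = 0L;
    while (messages.hasNext()) {
        messages.next();
        count++;
    }
    // One partial result per partition
    return java.util.Collections.singletonList(count).iterator();
}).reduce(Long::sum);
countsPerBatch.print();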

Related

Apache Spark take Action on Executors in fully distributed mode

I am new to Spark; I have a basic idea of how transformations and actions work (guide). I am trying some NLP operations on each line (basically paragraphs) in a text file. After processing, the result should be sent to a server (REST API) for storage. The program is run as a Spark job (submitted using spark-submit) on a cluster of 10 nodes in YARN mode. This is what I have done so far.
...
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> processedLines = lines
    .map(line -> {
        // processed here
        return result;
    });
processedLines.foreach(line -> {
    // Send to server
});
This works, but the foreach loop seems sequential; it seems like it is not running in distributed mode on the worker nodes. Am I correct?
I tried the following code but it doesn't work. Error: java: incompatible types: inferred type does not conform to upper bound(s). Obviously it's wrong, because map is a transformation, not an action.
lines.map(line -> { /* processing */ })
     .map(line -> { /* Send to server */ });
I also tried take(), but it requires an int and processedLines.count() is of type long.
processedLines.take(processedLines.count()).forEach(pl -> { /* Send to server */ });
The data is huge (greater than 100 GB). What I want is that both the processing and the sending to the server are done on the worker nodes. The processing part in the map definitely takes place on the worker nodes. But how do I send the processed data from the worker nodes to the server, given that foreach seems to be a sequential loop running in the driver (if I am correct)? Simply put, how do I execute the action on the worker nodes and not in the driver program?
Any help will be highly appreciated.
foreach is an action in Spark. It basically takes each element of the RDD and applies a function to that element.
foreach is performed on the executor (worker) nodes; it does not get applied on the driver node. Note that in Spark's local execution mode both the driver and the executor can reside in the same JVM.
Check this for reference: foreach explanation.
Your approach looks okay: you map each element of the RDD and then apply foreach to each element. The reason I can think of for why it is taking time is the size of the data you are dealing with (~100 GB).
One way of optimizing this is to repartition the input data set. Ideally each partition should be around 128 MB for good performance. There are many articles about best practices for repartitioning data; I would suggest you follow them, as it can give some performance benefit.
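As a rough sketch, using the lines RDD from the question (the numbers are assumptions you would tune for your data and cluster):
// ~100 GB of input at a ~128 MB target partition size is roughly 800 partitions.
int targetPartitions = (int) Math.ceil(100.0 * 1024 / 128);
JavaRDD<String> repartitionedLines = lines.repartition(targetPartitions);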
The second optimization you can think of is the memory you assign to each executor. It plays a very important role in Spark tuning.
The third optimization is to batch the network calls to the server. You are currently making a network call to the server for each element of the RDD. If your design allows you to batch these calls, so that you send more than one element per request, that might help as well, especially if the latency is mostly due to these network calls.
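A sketch of such batching with foreachPartition (the batch size of 500 and the sendBatch helper are assumptions, not part of the original code):
processedLines.foreachPartition(lineIterator -> {
    List<String> batch = new ArrayList<>();
    while (lineIterator.hasNext()) {
        batch.add(lineIterator.next());
        if (batch.size() >= 500) {   // assumed batch size
            sendBatch(batch);        // hypothetical helper: one REST call per batch
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        sendBatch(batch);            // flush the remainder
    }
});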
I hope this helps.
Firstly, when your code is running on the executors it is already in distributed mode. Now, when you want to utilize all the CPU resources on the executors for more parallelism, you should go for an async option, preferably operating batch-wise to avoid creating an excess of client connection objects, as below.
You can replace your code with
processedLines.foreach(line -> {
with either of the following solutions:
processedLines.foreachAsync(line -> {
    // Send to server
}).get();

// To iterate batch-wise, I would go for this
processedLines.foreachPartitionAsync(lineIterator -> {
    // Create your output client connection here
    while (lineIterator.hasNext()) {
        String line = lineIterator.next();
    }
}).get();
Both functions create a Future object, i.e. they submit the work as a non-blocking call on a separate thread, which automatically adds parallelism to your code.

Flink Consumer with DataStream API for Batch Processing - How do we know when to stop & How to stop processing [ 2 fold ]

I am basically trying to use the same Flink pipeline (of transformations, with different input parameters to distinguish between real-time and batch modes) in both batch mode and real-time mode. I want to use the DataStream API, as most of my transformations depend on it.
My producer is Kafka, and the real-time pipeline works just fine. Now I want to build a batch pipeline with the exact same code, using different topics for batch and real-time mode. How does my batch processor know when to stop processing?
One way I thought of was to add an extra parameter to the producer record to say this is the last record; however, with multi-partitioned topics, record delivery across partitions does not guarantee ordering (delivery within one partition is guaranteed, though).
What is the best practice to design this?
PS: I don't want to use DataSet API.
You can use the DataStream API for batch processing without any issue. Basically, Flink will inject a barrier that marks the end of the stream, so that your application works on finite streams instead of infinite ones.
I am not sure if Kafka is the best solution for the problem to be completely honest.
Generally, when implementing KafkaDeserializationSchema you have the method isEndOfStream() that marks that the stream has finished. Perhaps you could inject an end marker into each partition and simply check whether all of the markers have been read, and then finish the stream. But this would require you to know the number of partitions beforehand.
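A minimal sketch of that idea (the __END__ marker value and the expected marker count are assumptions; note that each parallel source instance only sees its own partitions, so the expected count has to match how the partitions are distributed):
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.nio.charset.StandardCharsets;

public class BoundedStringSchema implements KafkaDeserializationSchema<String> {

    private static final String END_MARKER = "__END__"; // assumed marker value
    private final int expectedMarkers;                  // markers this source instance must see
    private int markersSeen = 0;

    public BoundedStringSchema(int expectedMarkers) {
        this.expectedMarkers = expectedMarkers;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) {
        String value = new String(record.value(), StandardCharsets.UTF_8);
        if (END_MARKER.equals(value)) {
            markersSeen++;
        }
        return value;
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // The source stops once every expected partition has delivered its end marker.
        return markersSeen >= expectedMarkers;
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}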

Kafka Streams windowing aggregation batching

I have Kafka Streams processing in my application:
myStream
    .mapValues(customTransformer::transform)
    .groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
    .windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
    .aggregate(CustomCollectorObject::new,
        (key, value, aggregate) -> aggregate.collect(value),
        Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
            .withValueSerde(new CustomCollectorSerde()))
    .toStream()
    .foreach((k, v) -> /* do something very important */);
Expected behavior: incoming messages are grouped by key and aggregated into a CustomCollectorObject within some time interval. CustomCollectorObject is just a class with a List inside. Every 10 seconds, in foreach, I do something very important with my aggregated data. Crucially, I expect foreach to be called every 10 seconds!
Actual behavior: I can see that my foreach is called less often, approximately every 30-35 seconds (the exact interval doesn't matter much), and, importantly, I receive 3-4 messages at once.
The question is: how can I achieve the expected behavior? I need my data to be processed at runtime, without delays.
I've tried to set cache.max.bytes.buffering: 0 but in this case windowing doesn't work at all.
Kafka Streams has a different execution model and provides different semantics, i.e., your expectations don't match what Kafka Streams does. There are multiple similar questions already:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store and use punctuations to get the semantics you want (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor).
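A rough sketch of that approach (the "agg-store" name, the SomeValue input type, the serdes, and the 10-second interval are placeholders built around the question's classes, not a drop-in implementation; builder is assumed to be the StreamsBuilder):
StoreBuilder<KeyValueStore<String, CustomCollectorObject>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("agg-store"),
        Serdes.String(),
        new CustomCollectorSerde());
builder.addStateStore(storeBuilder);

myStream
    .mapValues(customTransformer::transform)
    .transform(() -> new Transformer<String, SomeValue, KeyValue<String, CustomCollectorObject>>() {
        private ProcessorContext context;
        private KeyValueStore<String, CustomCollectorObject> store;

        @SuppressWarnings("unchecked")
        @Override
        public void init(ProcessorContext context) {
            this.context = context;
            this.store = (KeyValueStore<String, CustomCollectorObject>) context.getStateStore("agg-store");
            // Emit the current aggregates every 10 seconds of wall-clock time,
            // whether or not new records arrived in that interval.
            context.schedule(10000L, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
                try (KeyValueIterator<String, CustomCollectorObject> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, CustomCollectorObject> entry = it.next();
                        this.context.forward(entry.key, entry.value); // downstream foreach runs here
                        store.delete(entry.key);                      // reset for the next interval
                    }
                }
            });
        }

        @Override
        public KeyValue<String, CustomCollectorObject> transform(String key, SomeValue value) {
            CustomCollectorObject agg = store.get(key);
            if (agg == null) {
                agg = new CustomCollectorObject();
            }
            agg.collect(value);
            store.put(key, agg);
            return null; // results are emitted only from the punctuator
        }

        @Override
        public void close() { }
    }, "agg-store")
    .foreach((k, v) -> { /* do something very important */ });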

Java8 stream().map().reduce() is really map reduce

I saw this code somewhere using stream().map().reduce().
Does this map() function really run in parallel? If yes, what is the maximum number of threads it can start for the map() function?
What if I use parallelStream() instead of just stream() for the particular use case below?
Can anyone give me a good example of where NOT to use parallelStream()?
The code below just extracts a tName for each tCode and returns a comma-separated String.
String ts = atList.stream().map(tCode -> {
    return CacheUtil.getTCache().getTInfo(tCode).getTName();
}).reduce((tName1, tName2) -> {
    return tName1 + ", " + tName2;
}).get();
This stream().map().reduce() is not parallel; a single thread acts on the stream.
You have to add parallel(), or in other cases use parallelStream() (it depends on the API, but it's the same thing). By default a parallel stream uses the ForkJoinPool#commonPool, which has (number of available processors - 1) worker threads; the calling thread is used too, so there will usually be 2, 4, 8, etc. threads in total. To check how many processors are available, use:
Runtime.getRuntime().availableProcessors()
You can use a custom pool and get as many threads as you want, as shown here.
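For example, a sketch of running the question's pipeline inside a custom ForkJoinPool (the pool size of 4 is arbitrary):
ForkJoinPool customPool = new ForkJoinPool(4); // arbitrary pool size for illustration
String ts = customPool.submit(() ->
        atList.parallelStream()
              .map(tCode -> CacheUtil.getTCache().getTInfo(tCode).getTName())
              .reduce((tName1, tName2) -> tName1 + ", " + tName2)
              .get()
).join();
customPool.shutdown();
This way a long-running pipeline does not tie up the common pool that other parallel streams share.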
Also notice that the entire pipeline is run in parallel, not just the map operation.
There isn't a golden rule about when to use parallel streams and when not to; the best way is to measure. But there are obvious cases, like a stream of 10 elements - that is far too little to see any real benefit from parallelization.
All parallel streams use the common fork-join thread pool, and if you submit a long-running task, you effectively block all threads in the pool. Consequently, you block all other tasks that are using parallel streams.
There are only two options how to make sure that such thing will never happen. The first is to ensure that all tasks submitted to the common fork-join pool will not get stuck and will finish in a reasonable time. But it's easier said than done, especially in complex applications. The other option is to not use parallel streams and wait until Oracle allows us to specify the thread pool to be used for parallel streams.
Use case
Let's say you have a collection (a List) that is loaded with values at application startup and to which no new values are added at any later point. In that scenario you can use a parallel stream without any concerns.
Don't worry, streams are efficient and safe.

How to parallelize an RDD of LinkedLists?

I am developing an application in Spark, using the Spark Streaming Framework.
Right now my aim is to learn how parallelization works in Spark and how I can use it to speed up my input data processing.
Here is my question:
I have a DStream that in each batch interval has an RDD in which only one partition has data, 4 LinkedLists within that partition to be precise (I am not sure how many partitions the RDD has exactly; perhaps 4, given the number of cores in my PC, since I am running in local mode).
I use the following to try and parallelize my RDD:
JavaDStream<LinkedList<Integer>> rddWithPartitions = rddWithDataInOnePartition.repartition(4);
That is, with this I intend to parallelize my RDD so that it has one LinkedList per partition, and not four in one single partition.
When I do rddWithPartitions.print(), I indeed see what I think are 4 partitions filled with data, but when I go to the Spark UI, namely the Executors tab, I only see one executor, meaning (I think) that I am only using one worker, and thus parallelization wasn't achieved.
I do have more than one task (although it is three and not four, as I thought would be the case), but I am not sure if I am using all four cores of my PC, each one processing one partition of my RDD.
How can I make sure that I achieved this parallelization?
I hope I was not confusing.
Thank you so much.
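One rough way to check (a sketch with assumed names): run with a local[4] master so all four cores are available, and print the partition count of each batch's RDD. Note that in local mode there is only a single executor process, so seeing one executor in the Executors tab is expected; parallelism shows up as multiple tasks per stage instead.
// Run in local mode with four worker threads so all four cores can be used.
SparkConf conf = new SparkConf().setAppName("partition-check").setMaster("local[4]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

// Print how many partitions each batch's RDD actually has after repartition(4);
// the Stages tab should then show one task per partition.
rddWithPartitions.foreachRDD(rdd ->
    System.out.println("partitions in this batch: " + rdd.getNumPartitions()));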
