I am new to Spark; I have a basic idea of how transformations and actions work (guide). I am trying some NLP operations on each line (basically paragraphs) of a text file. After processing, the result should be sent to a server (REST API) for storage. The program is run as a Spark job (submitted using spark-submit) on a cluster of 10 nodes in YARN mode. This is what I have done so far.
...
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> processedLines = lines
.map(line -> {
// processed here
return result;
});
processedLines.foreach(line -> {
// Send to server
});
This works, but the foreach loop seems sequential; it looks like it is not running in distributed mode on the worker nodes. Am I correct?
I tried the following code, but it doesn't work. Error: java: incompatible types: inferred type does not conform to upper bound(s). Obviously it's wrong, because map is a transformation, not an action.
lines.map(line -> { /* processing */ })
.map(line -> { /* Send to server */ });
I also tried take(), but it requires an int and processedLines.count() is of type long.
processedLines.take(processedLines.count()).forEach(pl -> { /* Send to server */ });
The data is huge (greater than 100 GB). What I want is for both the processing and the sending to the server to happen on the worker nodes. The processing part in the map definitely takes place on the worker nodes. But how do I send the processed data from the worker nodes to the server, since the foreach seems to be a sequential loop running in the driver (if I am correct)? Simply put: how do I execute the action on the worker nodes and not in the driver program?
Any help will be highly appreciated.
foreach is an action in Spark. It takes each element of the RDD and applies a function to that element.
foreach is performed on the executor (worker) nodes, not on the driver node. Note that in Spark's local execution mode, the driver and executor can reside in the same JVM.
Check this for reference: foreach explanation
Your approach looks OK: you map each element of the RDD and then apply foreach to each element. The reason I can think of for why it is taking time is the size of the data you are dealing with (~100 GB).
One way of optimizing this is to repartition the input data set. Ideally each partition should be around 128 MB for better performance. There are many articles about best practices for repartitioning data; I would suggest you follow them, as it will give some performance benefit.
The second optimization to consider is the memory you assign to each executor node. It plays a very important role in Spark tuning.
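For example, executor memory and cores are typically set when submitting the job. An illustrative spark-submit for your 10-node YARN cluster might look like this (the numbers, the class name, and the jar name are placeholders to tune and replace):

spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.NlpJob \
  nlp-job.jar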
The third optimization to consider is batching the network calls to the server. You are currently making one network call per element of the RDD. If your design allows it, batch these calls so that more than one element is sent in a single request; this can help a lot if most of the latency comes from these network calls. A sketch of this idea follows.
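A rough sketch of batching inside foreachPartition, assuming the processedLines RDD from your code; BATCH_SIZE and sendBatchToServer(...) are placeholders for your own client code, and the repartition count is only illustrative:

processedLines
    .repartition(100)                        // aim for roughly 128 MB per partition
    .foreachPartition(lineIterator -> {
        java.util.List<String> batch = new java.util.ArrayList<>();
        while (lineIterator.hasNext()) {
            batch.add(lineIterator.next());
            if (batch.size() >= BATCH_SIZE) {
                sendBatchToServer(batch);    // one REST call for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sendBatchToServer(batch);        // flush the remainder
        }
    });

This runs entirely on the executors, one iterator per partition, so both the batching and the sending happen on the worker nodes.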
I hope this helps.
Firstly, when your code is running on the executors it is already in distributed mode. If you want to utilize all the CPU resources on the executors for more parallelism, you should go for the async options, preferably with a batch-mode operation to avoid creating an excessive number of client connection objects, as below.
You can replace your code,
processedLines.foreach(line -> { ... });
with either of the following:
processedLines.foreachAsync(line -> {
// Send to server
}).get();
// To iterate batch-wise I would go for this
processedLines.foreachPartitionAsync(lineIterator -> {
    // Create your output client connection here (once per partition)
    while (lineIterator.hasNext()) {
        String line = lineIterator.next();
        // Send the line (or add it to a batch) over that connection
    }
}).get();
Both functions return a Future object, i.e. they submit the work as a non-blocking call on a separate thread, which adds parallelism to your code.
In my Java 8 Spring Boot application, I have a list of 40,000 records. For each record, I have to call an external API and save the result to the DB. How can I do this with better performance, in as little time as possible? Each API call takes about 20 seconds to complete. I used a parallel stream to reduce the time, but there was no considerable change.
if (!mainList.isEmpty()) {
AtomicInteger counter = new AtomicInteger();
List<List<PolicyAddressDto>> secondList =
new ArrayList<List<PolicyAddressDto>>(
mainList.stream()
.collect(Collectors.groupingBy(it -> counter.getAndIncrement() / subArraySize))
.values());
for (List<PolicyAddressDto> listOfList : secondList) {
listOfList.parallelStream()
.forEach(t -> {
        callAtheniumData(t, listDomain1, listDomain2); // listDomain1 and listDomain2 declared globally
});
if (!listDomain1.isEmpty()) {
listDomain1Repository.saveAll(listDomain1);
}
if (!listDomain2.isEmpty()) {
listDomain2Repository.saveAll(listDomain2);
}
}
}
Solving a problem in parallel always involves performing more actual work than doing it sequentially. Overhead is involved in splitting the work among several threads and joining or merging the results. Problems like converting short strings to lower-case are small enough that they are in danger of being swamped by the parallel splitting overhead.
As far as I can see, the API call response is not being saved.
Also, all the API calls are disjoint with respect to each other.
Can we try creating a new thread for each API call?
for (List<PolicyAddressDto> listOfList : secondList) {
listOfList.parallelStream()
.forEach(t -> {
new Thread(() -> callAtheniumData(t, listDomain1, listDomain2)).start();
});
}
That's because a parallel stream divides the work across the common pool, which usually has one thread per core minus one. If every call you make to the external API takes 20 seconds and you have 4 cores, this means 3 concurrent requests, each waiting for 20 seconds.
You can increase the concurrency of your calls as described in https://stackoverflow.com/a/21172732/574147 (a sketch of that idea is below), but I think you're just moving the problem.
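For illustration, the idea behind that link is to run the parallel stream inside your own ForkJoinPool instead of the common pool. This is only a sketch: the pool size of 50 is a placeholder to tune against what the API can handle, it reuses callAtheniumData and the lists from the question, and those lists would need to be thread-safe:

import java.util.concurrent.ForkJoinPool;

ForkJoinPool ioPool = new ForkJoinPool(50);   // dedicated pool sized for IO-bound work
try {
    ioPool.submit(() ->
        listOfList.parallelStream()
            .forEach(t -> callAtheniumData(t, listDomain1, listDomain2))
    ).join();                                 // wait for the whole batch to finish
} finally {
    ioPool.shutdown();
}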
An API call that takes 20 seconds has a really slow "typical" response time. If it is doing a really complex, CPU-bound elaboration, how would that service be able to handle 10 concurrent requests while keeping the same performance? It probably wouldn't.
Otherwise, if the elaboration is IO-bound and takes 20 seconds, you probably need a service that is able to take (and work with!) a list of elements.
Each of the API calls will take about 20 secs to complete.
Your external API is where you are being bottlenecked. There's really nothing your code can do to speed it up on the client side except parallelize the process. You've already done that, so if the external API is within your organization, you need to look into performance improvements there. If not, you can do something like offload the processing via Kafka to Apache NiFi or StreamSets so that your Spring Boot API doesn't have to wait for hours while the data is processed.
I have Kafka Streams processing in my application:
myStream
.mapValues(customTransformer::transform)
.groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
.windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
.aggregate(CustomCollectorObject::new,
(key, value, aggregate) -> aggregate.collect(value),
Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
.withValueSerde(new CustomCollectorSerde()))
.toStream()
.foreach((k, v) -> /* do something very important */);
Expected behavior: incoming messages are grouped by key and, within some time interval, aggregated into a CustomCollectorObject. CustomCollectorObject is just a class with a List inside. Every 10 seconds, in the foreach, I do something very important with my aggregated data. Crucially, I expect the foreach to be called every 10 seconds!
Actual behavior: I can see that the processing in my foreach is called less often, approximately every 30-35 seconds; the exact interval doesn't matter much. What is very important is that I receive 3-4 messages at once.
The question is: how can I achieve the expected behavior? I need my data to be processed at runtime, without any delays.
I've tried to set cache.max.bytes.buffering: 0 but in this case windowing doesn't work at all.
Kafka Streams has a different execution model and provides different semantics, i.e., your expectations don't match what Kafka Streams does. There are multiple similar questions already:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store and use punctuation to get the semantics you want (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor). A sketch of that approach follows.
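A minimal sketch of the transform()-plus-punctuation idea. This is only an illustration: it assumes String values, a state store registered under the placeholder name "collector_store", and it reuses the CustomCollectorObject from your code; you would still need to register the store with the StreamsBuilder (e.g. via Stores.keyValueStoreBuilder) and pass its name to transform():

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class CollectAndEmitTransformer
        implements Transformer<String, String, KeyValue<String, CustomCollectorObject>> {

    private ProcessorContext context;
    private KeyValueStore<String, CustomCollectorObject> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, CustomCollectorObject>) context.getStateStore("collector_store");
        // Emit the buffered aggregates every 10 seconds of wall-clock time, independent of caching
        context.schedule(10_000L, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, CustomCollectorObject> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, CustomCollectorObject> entry = it.next();
                    context.forward(entry.key, entry.value);   // "do something very important" downstream
                    store.delete(entry.key);
                }
            }
        });
    }

    @Override
    public KeyValue<String, CustomCollectorObject> transform(String key, String value) {
        CustomCollectorObject collector = store.get(key);
        if (collector == null) {
            collector = new CustomCollectorObject();
        }
        collector.collect(value);   // same collect() your aggregate step uses
        store.put(key, collector);
        return null;                // nothing emitted here; the punctuator does the emitting
    }

    @Override
    public void close() { }
}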
The DynamoDB documentation says that there are shards, that they need to be iterated first, and that for each shard you then need to get its records.
The documentation also says:
(If you use the DynamoDB Streams Kinesis Adapter, this is handled for you: Your application will process the shards and stream records in the correct order, and automatically handle new or expired shards, as well as shards that split while the application is running. For more information, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.)
OK, but I use Lambda, not Kinesis (or are they related to each other?), and if a Lambda function is attached to a DynamoDB stream, should I care about shards or not? Or should I just write the Lambda code and expect that the AWS environment passes some records to that Lambda?
When using Lambda to consume a DynamoDB Stream, the work of polling the API and keeping track of shards is all handled for you automatically. If your table has multiple shards then multiple Lambda functions will be invoked. From your perspective as a developer, you just have to write the code for your Lambda function and the rest is taken care of for you.
In-order processing is still guaranteed by DynamoDB Streams, so with a single shard only one instance of your Lambda function will be invoked at a time. However, with multiple shards you may see multiple instances of your Lambda function running at the same time. This fan-out is transparent and may cause issues or lead to surprising behavior if you are not aware of it while coding your Lambda function.
For a deeper explanation of how this works I'd recommend the YouTube video AWS re:Invent 2016: Real-time Data Processing Using AWS Lambda (SVR301). While the focus is mostly on Kinesis Streams the same concepts for consuming DynamoDB Streams apply as the technology is nearly identical.
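To make the "just write the code" part concrete, a minimal handler in Java could look something like this (a sketch using the standard aws-lambda-java-events types; the logging is only illustrative):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

public class StreamHandler implements RequestHandler<DynamodbEvent, Void> {
    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        // Lambda polls the stream for you and hands over a batch of records; shards are invisible here
        for (DynamodbStreamRecord record : event.getRecords()) {
            context.getLogger().log(record.getEventName() + ": " + record.getDynamodb());
        }
        return null;
    }
}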
We use DynamoDB to process close to a billion records every day, auto-expire those records, and send them to streams.
Everything is taken care of by AWS; we don't need to do anything except configure the stream (what type of image you want) and add the triggers.
The only fine-tuning we did:
When we got more data, we just increased the batch size so each invocation processes faster and the overhead from the number of calls to Lambda is reduced (see the example below).
If you are using an external process to iterate over the stream, you might need to do the same.
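If the trigger is an event source mapping, the batch size can be changed with something like the following AWS CLI call (the UUID and the batch size are placeholders for your own mapping and tuning):

aws lambda update-event-source-mapping \
    --uuid <your-event-source-mapping-uuid> \
    --batch-size 500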
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Hope it helps.
I came across the following code which processes messages in Spark Streaming:
val listRDD = ssc.socketTextStream(host, port)
listRDD.foreachRDD(rdd => {
rdd.foreachPartition(partition => {
// Should I start a separate thread for each RDD and/or Partition?
partition.foreach(message => {
Processor.processMessage(message)
})
})
})
This is working for me but I am not sure if this is the best way. I understand that a DStream consists of "one to many" RDDs, but this code processes RDDs sequentially one after the other, right? Isn't there a better way - a method or function - that I can use so that all the RDDs in the DStream get processed in parallel? Should I start a separate thread for each RDD and/or Partition? Have I misunderstood how this code works under Spark?
Somehow I think this code is not taking advantage of the parallelism in Spark.
Streams are partitioned into small RDDs for convenience and efficiency (see micro-batching). But you really don't need to break every RDD into partitions, or even break the stream into RDDs.
It all depends on what Processor.processMessage really is. If it is a single transformation function, you can just do listRDD.map(Processor.processMessage) and you get a stream of whatever the result of processing a message is, computed in parallel with no need for you to do much else.
If Processor is a mutable object that holds state (say, counting the number of messages) then things are more complicated, as you will need to define many such objects to account for parallelism and will also need to somehow merge results later on.
I have a DStream "Crowd" and I want to write each element in "Crowd" to a socket. When I try to read from that socket, it doesn't print anything. I am using the following lines of code:
val server = new ServerSocket(4000,200);
val conn = server.accept()
val out = new PrintStream(conn.getOutputStream());
crowd.foreachRDD(rdd => {rdd.foreach(record=>{out.println(record)})})
But if I use the following (this is not what I want, though):
crowd.foreachRDD(rdd => out.println(rdd))
It does write something to the socket.
I suspect there is a problem with using rdd.foreach(), although it should work. I am not sure what I am missing.
The code outside the DStream closure is executed in the driver, while the rdd.foreach(...) will be executed on each distributed partition of the RDD.
So, there's a socket created on the driver's machine and the job tries to write to it on another machine - that will not work for the obvious reasons.
DStream.foreachRDD is executed on the driver, so in that instance, the socket and the computation are performed in the same host. Therefore it works.
With the distributed nature of an RDD computation, this Server Socket approach will be hard to make work as dynamic service discovery becomes a challenge i.e. "where is my server socket open?". Look into some system that will allow you to have centralized access to distributed data. Kafka is a good alternative for this kind of streaming process.
Here in the official documentation you have the answer!
You have to create the connection inside the foreachRDD function. If you want to do it optimally, you should create a "pool" of connections, then grab a connection inside the foreachPartition function and call foreach to send the elements through that connection. This is the example code for doing it the best way:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
In any case, check the other comments as they provide good knowledge about the context of the problem.
crowd.foreachRDD(rdd => {rdd.collect.foreach(record=>{out.println(record)})})
The code you suggested in your comment will work fine, but in that case you have to collect all records of the RDD in the driver. If the number of records is small that will be OK, but if it is larger than the driver's memory it will become a bottleneck. Your first attempt, which processes the data on the workers, is the better direction. Remember, an RDD is distributed across worker machines, so collect() means first bringing all records of the RDD to the driver, which results in increased communication, and that is a killer in distributed computing. So, as stated, your code will only be OK when there is a limited number of records in the RDD.
I am working on similar problems and have been searching for how to pool connections and serialize them to the client machines. If somebody has an answer to that, it would be great.
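For what it's worth, the usual pattern behind the "static, lazily initialized pool" comment in the docs snippet above is that the pool is never serialized at all: it lives in a static field, so each executor JVM creates its own pool the first time a task touches it. A rough Java sketch of that idea (MyClient is a placeholder for whatever connection/client type you actually use):

import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical holder: the static field is initialized lazily, once per executor JVM,
// so nothing connection-related ever has to be shipped from the driver.
public final class ConnectionPool {

    private static final ConcurrentLinkedQueue<MyClient> POOL = new ConcurrentLinkedQueue<>();

    public static MyClient getConnection() {
        MyClient client = POOL.poll();
        return (client != null) ? client : new MyClient();   // create on demand, per executor
    }

    public static void returnConnection(MyClient client) {
        POOL.offer(client);
    }

    private ConnectionPool() { }
}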