I use Spark Streaming to get data from a queue in MQ via a custom receiver.
The JavaStreamingContext batch duration is 10 seconds.
There is one task defined for the input from the queue.
In the event timeline in the Spark UI, I see a job being submitted in every 10-second interval, even when there is no data from the receiver.
Is this normal behavior, or how can I stop jobs from being submitted when there is no data?
JavaDStream<String> customReceiverStream = ssc.receiverStream(new JavaCustomReceiver(host, port));
JavaDStream<String> words = customReceiverStream.flatMap(new FlatMapFunction<String, String>() { ... });
words.print();
ssc.start();
ssc.awaitTermination();
As a workaround,
you can use Livy to submit the Spark jobs (using Java code instead of CLI commands).
The Livy job would constantly check a database that holds an indicator of whether data is flowing in or not. As soon as the data flow stops, change the indicator in the database, and this results in the Spark job being killed through Livy (use Livy sessions).
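A minimal sketch of that idea, assuming a Livy server at http://livy-host:8998 and a hypothetical dataFlowActive() check against your database; the session id would be the one Livy returned when the job was submitted:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivyWatchdog {
    private static final String LIVY_URL = "http://livy-host:8998"; // assumption

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        int sessionId = 42; // the session id Livy returned when the job was submitted

        // Poll the indicator until the data flow stops.
        while (dataFlowActive()) {
            Thread.sleep(10_000);
        }

        // Data flow has stopped: delete the Livy session, which kills the Spark job.
        HttpRequest kill = HttpRequest.newBuilder()
                .uri(URI.create(LIVY_URL + "/sessions/" + sessionId))
                .DELETE()
                .build();
        client.send(kill, HttpResponse.BodyHandlers.ofString());
    }

    private static boolean dataFlowActive() {
        return false; // placeholder: query the indicator column in your database here
    }
}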
We have a Java/Spring application which runs on EKS pods, and we have records stored in a MongoDB collection.
STATUS: READY, STARTED, COMPLETED
The application needs to pick the records which are in READY status and update the status to STARTED. Once the processing of a record is completed, the status is updated to COMPLETED.
Once a record is STARTED, it may take a few hours to complete; until then other pods (other instances of the same app) should not pick this record. If an exception occurs, the app changes the status back to READY so that other pods (or the same pod) can pick the READY record for processing.
Requirement: if a pod crashes while a record is being processed (STARTED), before the status is changed to READY/COMPLETED, another pod should be able to pick this record and start processing it again.
We have some solutions in mind but are trying to find the best one. Please suggest some good approaches.
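For context, the atomic pick-and-mark step described in the question might look roughly like this with Spring Data MongoDB; WorkRecord and its field names are assumptions, not the application's actual classes:

import org.springframework.data.mongodb.core.FindAndModifyOptions;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

// Hypothetical document; field names are assumptions.
class WorkRecord {
    public String id;
    public String status;
}

class RecordPicker {
    private final MongoTemplate mongoTemplate;

    RecordPicker(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Atomically claim one READY record by flipping its status to STARTED,
    // so two pods cannot pick up the same record.
    WorkRecord claimNext() {
        Query query = Query.query(Criteria.where("status").is("READY"));
        Update update = new Update().set("status", "STARTED");
        return mongoTemplate.findAndModify(
                query, update,
                FindAndModifyOptions.options().returnNew(true),
                WorkRecord.class);
    }
}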
You can use a shutdown hook from Spring:
@Component
public class Bean1 {

    @PreDestroy
    public void destroy() {
        // handle the database change here, e.g. reset the record status
        System.out.println("Status changed to READY");
    }
}
Beyond that, this kind of job could work better in a messaging architecture, using SQS for example. Instead of using the status in the database to handle and orchestrate the work, you can publish a message for each record that needs to be consumed (the records that were in READY state) and have a pool of workers consuming messages from the SQS queue. If something crashes, or the pod running one of these workers is reclaimed, the message goes back to SQS and can be consumed by another pod.
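A rough sketch of such a worker with the AWS SDK for Java v2 (the queue URL, timeouts and processing logic are placeholders); the key point is that the message is only deleted after processing succeeds, so a crashed pod's message becomes visible again and another pod picks it up:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class SqsWorker {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/records"; // placeholder
        SqsClient sqs = SqsClient.create();

        while (true) {
            ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(1)
                    .waitTimeSeconds(20)      // long polling
                    .visibilityTimeout(3600)  // hide the message while this pod works on it
                    .build();
            for (Message message : sqs.receiveMessage(receive).messages()) {
                process(message.body());      // placeholder for the actual record processing
                // Delete only after successful processing; if the pod dies before this point,
                // the message reappears after the visibility timeout and another pod consumes it.
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(message.receiptHandle())
                        .build());
            }
        }
    }

    private static void process(String body) {
        // ... the long-running work for one record ...
    }
}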
I have a streaming job which will run forever and execute a query on a Kafka topic. I am going through the Dataproc documentation for submitting a job via Java; here is the link.
// Submit an asynchronous request to execute the job.
OperationFuture<Job, JobMetadata> submitJobAsOperationAsyncRequest =
jobControllerClient.submitJobAsOperationAsync(projectId, region, job);
Job response = submitJobAsOperationAsyncRequest.get();
For the above line of code I am not able to get the response; the code just keeps on running. Is it because it's a streaming job and it runs forever?
How can I get a response, so that I can provide the end user with some job information, like a URL where they can see their jobs, or some monitoring dashboard?
The OperationFuture<Job, JobMetadata> class has a getMetadata() method which returns a com.google.api.core.ApiFuture<JobMetadata>. You can get the job metadata before the job finishes by calling jobMetadataApiFuture.get().
See more details about the OperationFuture.getMetadata() method and the ApiFuture class.
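For example, building on the snippet from the question, something along these lines should give you the job id without waiting for the streaming job to finish (assuming JobMetadata exposes the job id; the console URL format below is an assumption):

// Submit asynchronously as before, but read the metadata instead of blocking on get().
OperationFuture<Job, JobMetadata> submitJobAsOperationAsyncRequest =
    jobControllerClient.submitJobAsOperationAsync(projectId, region, job);

// This future completes once the job has been submitted, not when the job finishes.
JobMetadata metadata = submitJobAsOperationAsyncRequest.getMetadata().get();
String jobId = metadata.getJobId();

// Hypothetical console link you could hand back to the end user.
String url = String.format(
    "https://console.cloud.google.com/dataproc/jobs/%s?region=%s&project=%s",
    jobId, region, projectId);
System.out.println("Submitted job " + jobId + ": " + url);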
When I upload a file to an S3 bucket, an event is triggered and an AWS Batch job is started. Is there any way to check the status of the AWS Batch job in my Java code? I have to perform some operation when the status of the AWS Batch job is SUCCEEDED.
You have the choice of using the ListJobs / DescribeJobs APIs to poll for status.
ListJobsResult listJobs(ListJobsRequest listJobsRequest)
Returns a list of AWS Batch jobs.
You must specify only one of the following items:
A job queue ID to return a list of jobs in that job queue
A multi-node parallel job ID to return a list of that job's nodes
An array job ID to return a list of that job's children
You can filter the results by job status with the jobStatus parameter.
If you don't specify a status, only RUNNING jobs are returned.
Or you can listen for the CloudWatch Events which are emitted as jobs transition from one state to another if you prefer an event-driven architecture.
See the ListJobsRequest documentation for details.
To solve this problem, I created a separate Callable task which loops until the status of the job is SUCCEEDED or FAILED. The job status is extracted with the DescribeJobs API, based on the job id.
class ReturnJobStatus implements Callable<String>
{
    private String jobStatus = "SUBMITTED"; // updated inside the loop via DescribeJobs

    @Override
    public String call() throws Exception
    {
        while (!(jobStatus.equals("SUCCEEDED") || jobStatus.equals("FAILED")))
        {
            // extract the job status using the DescribeJobs API, passing the jobId
            Thread.sleep(2000);
        }
        return jobStatus;
    }
}
I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I am also able to get any updates (if any) on Kafka via the streaming.
The issue comes in when I have a new request to stream a new topic. Since there can be only one SparkStreamingContext per JVM, I cannot create a new stream for every new request.
The ways I figured out are:
First, once a DStream is created and Spark streaming is already in progress, just attach a new stream to it. This does not seem to work; createDirectStream (for a new topic2) does not return a stream and further processing is stopped. The streaming keeps continuing on the first request (say topic1).
Second, I thought to stop the stream, create the DStream and then start streaming again. I cannot use the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old stream topic (topic1) is lost and it streams only the new one.
Here is the code, have a look:
JavaStreamingContext javaStreamingContext;
if(null == javaStreamingContext) {
javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
StreamingContextState streamingContextState = javaStreamingContext.getState();
if(streamingContextState == StreamingContextState.STOPPED) {
javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
}
}
Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
.map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
.foreachRDD(impl);
if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
javaStreamingContext.start();
javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl, this is a custom class which is an implementation of VoidFunction.
The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; the control just terminates without an error. The further steps are not executed.
I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming being done is on multiple different topics, and each of them is handled differently.
Please help.
The thing is, the Spark master sends out code to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I could think of:
Spark Job Server: every time you want to subscribe/stream from a different topic, instead of touching the already running job, start a new job. From your API body you can supply the parameters or topic name. If you want to stop streaming from a specific topic, just stop the respective job. This will give you a lot of flexibility and control over resources.
[Theoretical] Topic filter: subscribe to all the topics you think you will want; when records are pulled for a duration, filter out records based on a list of topics. Manipulate this list of topics through an API to increase or decrease your scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all (see the sketch after these options).
Another workaround is to relay your topic-2 data to topic-1 using a microservice whenever you need it, and stop when you don't.
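A minimal sketch of the topic-filter idea, reusing javaStreamingContext, getKafkaParamMap() and impl from the question's code; allTopics and allowedTopics are hypothetical, and as noted above a value captured in the closure stays fixed on the executors unless the job is restarted:

// Subscribe to every topic you might ever need, then filter per record.
Collection<String> allTopics = Arrays.asList("topic1", "topic2", "topic3");       // assumption
Set<String> allowedTopics = new HashSet<>(Arrays.asList("topic1"));               // managed via your API

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(javaStreamingContext,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(allTopics, getKafkaParamMap()));

stream.filter(record -> allowedTopics.contains(record.topic()))  // drop records from inactive topics
      .map(ConsumerRecord::value)
      .foreachRDD(impl);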
We have a Storm topology in which we have configured one spout and two bolts. The spout queries data from a DB continuously and sends the tuples to the first bolt for some processing. The first bolt does some processing and sends the tuples to the second bolt, which calls a third-party web service and sends the data. After some time the last bolt stops getting any tuples, and if we restart the topology it works fine. Only the last bolt has this problem; the spout and the first bolt keep running fine. I am not using the acking framework, and I have configured only one worker in this case.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
builder.setBolt("notifierBolt", new NotifierBolt(),40).shuffleGrouping("processorBolt");
Config conf = new Config();
conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 10000);
//conf.setMessageTimeoutSecs(600);
conf.setDebug(true);
StormSubmitter.submitTopology(TOPOLOGY, conf, builder.createTopology());
It's quite likely that you're having problems with a backlog of tuples causing timeouts. Try increasing the parallelism hint for the 2nd bolt, since it sounds like its processing time is much longer than that of the first bolt (that's why there would be a backlog into the 2nd bolt). If you're running this topology on a cluster, look at the Storm UI to see the specifics.
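For illustration, the topology from the question could be rewired with a much higher parallelism hint on the slow bolt; the numbers here are only illustrative and should be tuned against what the Storm UI shows:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
// Give the web-service bolt many more executors, since each external call is slow.
builder.setBolt("notifierBolt", new NotifierBolt(), 100).shuffleGrouping("processorBolt");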
When I was debugging my topology, I found that if, say, the spout is sending messages fast but a bolt is processing them slowly, the messages queue up in the LMAX Disruptor queue and the spout task then waits for it to empty. If you take a thread dump, you will find the threads in the TIMED_WAITING state. So we need to configure the topology in such a way that its inflow and outflow stay balanced.