I am trying to access the current status of a data pipeline from the Java Data Pipeline client. My use case is to activate a pipeline and wait until it reaches a completed state.
I tried the answer from this thread: AWS Data Pipeline - Components, Instances and Attempts and Pipeline Status, but I only ever get the current state as Scheduled even though the pipeline is in a running state. This is my code snippet:
DescribePipelinesRequest describePipelinesRequest = new DescribePipelinesRequest();
describePipelinesRequest.setPipelineIds(Arrays.asList(pipelineId));
final DescribePipelinesResult describePipelinesResult =
        dataPipelineClient.describePipelines(describePipelinesRequest);
final List<Field> testPipeline =
        describePipelinesResult.getPipelineDescriptionList().get(0).getFields();
for (Field field : testPipeline) {
    log.debug("Field: {} and {}", field.getKey(), field.getStringValue());
    if (field.getKey().equals("#pipelineState")) {
        log.debug("Pipeline state current: {}", field.getStringValue());
    }
}
Has anyone faced issues like this before? By the way, this pipeline has been set up as an on-demand (trigger) pipeline scheduled to run once every 100 years; we need to trigger it manually.
I'm not sure this does exactly what you want, but it should help point you in the right direction. You'll need to query the objects in the pipeline and get their status; those objects are what is actually running.
Java code
String pipelineid = "df-06036888777666777"; // replace with your pipeline id
DataPipelineClient client = new DataPipelineClient();
QueryObjectsResult tasks = client.queryObjects(
        new QueryObjectsRequest().withPipelineId(pipelineid).withSphere("INSTANCE"));
DescribeObjectsResult results = client.describeObjects(
        new DescribeObjectsRequest().withObjectIds(tasks.getIds()).withPipelineId(pipelineid));
for (PipelineObject obj : results.getPipelineObjects()) {
    for (Field field : obj.getFields()) {
        if (field.getKey().equals("#status") && !field.getStringValue().equals("FINISHED")) {
            System.out.println(obj.getName() + " is still running...");
        }
    }
}
OUTPUT:
#CliActivity_2020-01-11T21:34:45 is still running...
#Ec2Instance_2020-01-11T21:34:45 is still running...
What you're doing currently is getting the pipeline information, which will only show that it has been created successfully and scheduled.
We need to trigger this pipeline manually.
To do this, activate the pipeline again. This will create new task objects which Data Pipeline will start to process. As currently described, this is an on-demand pipeline which will only create new tasks when it is activated manually.
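For reference, manual activation from the Java SDK looks roughly like this (a minimal sketch, reusing the dataPipelineClient and pipelineId variables from the question):
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest;

// Re-activate the on-demand pipeline so Data Pipeline creates a fresh set of
// task objects to process; the polling code above can then watch their #status.
dataPipelineClient.activatePipeline(
        new ActivatePipelineRequest().withPipelineId(pipelineId));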
I am new to Vert.x and was exploring request-reply using the event bus.
I want to implement the flow below:
The user requests some data
The controller sends a message on the event bus to a redis-processor verticle
The redis-processor waits up to n seconds until the value is available in Redis (there is a background process which keeps refreshing the cache, hence the wait)
The redis-processor sends the reply back to the controller
The controller responds to the user
In short I want to do something like this:
Now I want to implement this in Vert.x, since Vert.x runs asynchronously. Using the event bus I can isolate the controller from the processor, so the controller can accept multiple user requests and stay responsive under load.
(I hope I am right with this!)
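Roughly, the controller side of that request/reply would be something like the sketch below (assuming Vert.x Web; REQUEST_PROCESSOR is the shared event-bus address and router is a vertx-web Router):
// Controller sketch: forward the user's transaction id over the event bus and
// answer the HTTP request once the redis-processor replies.
router.get("/data/:txnId").handler(ctx -> {
    String txnId = ctx.pathParam("txnId");
    vertx.eventBus().request(REQUEST_PROCESSOR, txnId, reply -> {
        if (reply.succeeded()) {
            ctx.response().end(reply.result().body().toString());
        } else {
            ctx.response().setStatusCode(504).end(reply.cause().getMessage());
        }
    });
});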
I have implemented this in a very crude fashion in Java/Vert.x and am stuck on the part below.
// receive request from controller
vertx.eventBus().consumer(REQUEST_PROCESSOR, evtHandler -> {
    String txnId = evtHandler.body().toString();
    LOGGER.info("Received message:: {}", txnId);
    this.redisAPI.get(txnId, result -> { // <=====
        String value = result.result().toString();
        LOGGER.info("Value in redis : {}", value);
        evtHandler.reply(value); // reply to controller
    });
});
Please see the line marked with the arrow. How can I wait for x seconds there without blocking the event loop?
Please help.
That's actually very simple: you need a timer. Please see the docs for details, but you will need more or less something like this:
vertx.setTimer(1000, id -> {
    this.redisAPI.get(txnId, result -> {
        String value = result.result().toString();
        LOGGER.info("Value in redis : {}", value);
        evtHandler.reply(value); // reply to controller
    });
});
You might want to store the timer IDs somewhere so that you can cancel them, or at least know that something is still running when a shutdown request comes in for your verticle, so you can delay it. But this all depends on your needs.
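A minimal sketch of that bookkeeping (the class and field names are just placeholders; REQUEST_PROCESSOR is the address from the question):
import io.vertx.core.AbstractVerticle;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: keep pending timer ids so they can be cancelled when the verticle stops.
public class RedisProcessorVerticle extends AbstractVerticle {

    private final Set<Long> pendingTimers = ConcurrentHashMap.newKeySet();

    @Override
    public void start() {
        vertx.eventBus().consumer(REQUEST_PROCESSOR, msg -> {
            long timerId = vertx.setTimer(1000, id -> {
                pendingTimers.remove(id);
                // ... redis lookup and msg.reply(...) as in the snippet above ...
            });
            pendingTimers.add(timerId);
        });
    }

    @Override
    public void stop() {
        // Cancel anything still pending so no timer fires after undeploy/shutdown.
        pendingTimers.forEach(vertx::cancelTimer);
    }
}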
As @mohamnag said, you could use a Vert.x timer.
Here is another example of how to use a timer.
Note that the timer value is in milliseconds.
As an improvement, I recommend checking that the callback has succeeded before attempting to get the value from redisAPI. This is done using the succeeded() method.
In an asynchronous environment, getting that result could fail due to several issues (network errors, etc.):
vertx.setTimer(n * 1000, id -> {
    this.redisAPI.get(txnId, result -> {
        if (result.succeeded()) { // the callback succeeded in getting a value from redis
            String value = result.result().toString();
            LOGGER.info("Value in redis : {}", value);
            evtHandler.reply(value); // reply to controller
        } else {
            LOGGER.error("Value could not be read from redis : {}", result.cause());
            // reply with failure-related info (fail() takes an int code and a String message)
            evtHandler.fail(someIntegerCode, result.cause().getMessage());
        }
    });
});
I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I am also able to get any updates (if any) on Kafka via the stream.
The issue comes in when I have a new request to stream a new topic. Since there can only be one SparkStreamingContext per JVM, I cannot create a new stream for every new request.
The ways I figured out are:
Once a DStream is created and Spark Streaming is already in progress, just attach a new stream to it. This does not seem to work: the createDirectStream call (for a new topic2) does not return a stream, and further processing is stopped. The streaming keeps going only for the first request (say topic1).
Second, I thought to stop the stream, create the DStream and then start streaming again. I cannot use the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old stream's topic (topic1) is lost and it streams only the new one.
Here is the code, have a look:
JavaStreamingContext javaStreamingContext;
if (null == javaStreamingContext) {
    javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
    StreamingContextState streamingContextState = javaStreamingContext.getState();
    if (streamingContextState == StreamingContextState.STOPPED) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    }
}

Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
    .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
    .foreachRDD(impl);

if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
    javaStreamingContext.start();
    javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl; this is a custom class which is an implementation of VoidFunction.
The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a DStream; control just terminates without an error, and the further steps are not executed.
I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming that is done is on multiple different topics, and each of them is handled differently.
Please help
The thing is that the Spark master sends code out to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.
A few options I can think of:
Spark Job Server: every time you want to subscribe to/stream from a different topic, start a new job instead of touching the already-running one. From your API body you can supply the parameters or topic name. If you want to stop streaming from a specific topic, just stop the respective job. It will give you a lot of flexibility and control over resources.
[Theoretical] Topic filter: subscribe to all the topics you think you will want; when records are pulled for a duration, filter them based on a list of topics. Manipulate this list of topics through an API to increase or decrease your scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all (see the sketch after this list).
Another workaround is to relay your topic-2 data to topic-1 using a microservice whenever you need it, and stop when you don't.
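A rough, untested sketch of the topic-filter idea, reusing the javaStreamingContext, getKafkaParamMap() and impl from the question (allPossibleTopics and the broadcast contents are placeholders):
// Subscribe to a broad topic list up front, then filter each record by the set
// of currently "active" topics. Note that changing a broadcast variable later
// requires unpersisting/recreating it, which is the weak point of this idea.
Broadcast<Set<String>> activeTopics = javaStreamingContext.sparkContext()
        .broadcast(new HashSet<>(Arrays.asList("topic1")));

KafkaUtils.createDirectStream(javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(allPossibleTopics, getKafkaParamMap()))
    .filter(record -> activeTopics.value().contains(record.topic()))
    .map(record -> record.value())
    .foreachRDD(impl);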
We recently upgraded Kafka to v1.1 and Confluent to v4.0. But upon upgrading we have encountered a persistent problem with state stores. Our application starts a collection of streams, and we check for the state stores to be ready before killing the application after 100 tries. But after the upgrade there is at least one stream that will report Store is not ready : the state store, <your stream>, may have migrated to another instance
The stream itself is in the RUNNING state and the messages flow through, but the state of the store still shows up as not ready, so I have no idea what may be happening.
Should I not check for store state?
And since our application has a lot of streams (~15), would starting them simultaneously cause problems?
Should we not do a hard restart -- currently we run it as a service on Linux?
We are running Kafka in a cluster with 3 brokers. Below is a sample stream (not the entire code):
public BaseStream createStreamInstance() {
    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
    final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
    final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
    MessagePayLoadParser<Note> noteParser = new MessagePayLoadParser<Note>(Note.class);
    GenericJsonSerde<Note> noteSerde = new GenericJsonSerde<Note>(Note.class);

    StreamsBuilder builder = new StreamsBuilder();

    // The reducer below uses sets to combine:
    // value1 is what is already present in the store;
    // value2 is the incoming message, and for notes it should have at most 1 item in its list
    // (since it is 1 attachment / 1 tag per row, but multiple rows per note).
    Reducer<Note> reducer = new Reducer<Note>() {
        @Override
        public Note apply(Note value1, Note value2) {
            value1.merge(value2);
            return value1;
        }
    };

    KTable<Long, Note> noteTable = builder
        .stream(this.subTopic, Consumed.with(jsonSerde, jsonSerde))
        .map(noteParser::parse)
        .groupByKey(Serialized.with(Serdes.Long(), noteSerde))
        .reduce(reducer);

    noteTable.toStream().to(this.pubTopic, Produced.with(Serdes.Long(), noteSerde));

    this.stream = new KafkaStreams(builder.build(), this.properties);
    return this;
}
There are some open questions here, like the ones Matthias raised in the comments, but I will try to answer/give help on your actual questions:
Should I not check for store state?
Rebalancing is usually the cause here. But in that case you should not see that partition's thread keep consuming; the processing should be "transferred" to another thread that took over. Make sure it is actually that very thread that keeps on processing the partition, and not a new one. Check the kafka-consumer-groups utility to follow the consumers (threads) there.
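If you do keep the readiness check, the usual pattern for the 1.x API is to retry KafkaStreams#store until it stops throwing InvalidStateStoreException (for example while a rebalance is in progress), rather than giving up after a fixed number of tries. A rough sketch:
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.errors.InvalidStateStoreException;
import org.apache.kafka.streams.state.QueryableStoreType;

// Sketch: block until the store is queryable, tolerating rebalances.
public static <T> T waitForStore(KafkaStreams streams, String storeName,
                                 QueryableStoreType<T> storeType) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, storeType);
        } catch (InvalidStateStoreException e) {
            // Store not ready yet (e.g. migrating during a rebalance); back off and retry.
            Thread.sleep(100);
        }
    }
}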
And since our application has a lot of streams (~15), would starting them simultaneously cause problems? No, rebalancing is automatic.
Should we not do a hard restart -- currently we run it as a service on Linux? Are you keeping your state stores in a certain non-default directory? You should configure your state store directory properly and make sure it is accessible and survives application restarts. I am not sure how you perform your hard restart, but some exception-handling code should protect against it by closing your streams application.
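For instance, a minimal sketch of both points (the application id, broker list and directory are example values only):
// Sketch: pin the state store directory and close the Streams app cleanly on
// service shutdown so the local state survives restarts.
Properties properties = new Properties();
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, "notes-app");          // example id
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");    // example brokers
properties.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");  // example path

KafkaStreams streams = new KafkaStreams(builder.build(), properties);
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
streams.start();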
How would it be possible to stop a specific running job in Kettle?
I'm using the following code:
KettleEnvironment.init();
JobMeta jobmeta = new JobMeta(
        "C://Users//Admin//DBTOOL//EDW_Testing_Tool - 1.8(VersionUpgraded)//data-integration//Regress_bug//Start_Validation.kjb",
        null);
Job job = new Job(null, jobmeta);
job.initializeVariablesFrom(null);
job.setVariable("Internal.Job.Filename.Directory", Constants.JOB_EXECUTION_KJB_FILE_PATH);
job.setVariable("jobId", jobId.toString());
job.getJobMeta().setInternalKettleVariables(job);
job.stopAll();
How would I ensure that the job which I want to stop actually gets stopped and is not executed after setting the flag?
I'm using a REST API to stop the job, and I'm not able to get the Job object.
If I use CarteSingleton and store the object in a map, I'm not able to execute the job: it gives a driver error, could not connect to the database (e.g. jtds), URL not working.
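For what it's worth, the registry idea usually looks something like the rough sketch below (untested; the map and method names are made up, and the same instance must be shared between the start and stop REST endpoints):
// Sketch: shared registry so the stop endpoint can reach the same Job instance
// that the start endpoint launched.
private static final Map<String, Job> RUNNING_JOBS = new ConcurrentHashMap<>();

public void startJob(String jobId, JobMeta jobmeta) {
    Job job = new Job(null, jobmeta);
    job.setVariable("jobId", jobId);
    RUNNING_JOBS.put(jobId, job);
    job.start();                 // run the job asynchronously
}

public void stopJob(String jobId) {
    Job job = RUNNING_JOBS.remove(jobId);
    if (job != null) {
        job.stopAll();           // set the stop flag on that specific running job
        job.waitUntilFinished(); // optionally block until it has actually stopped
    }
}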
Is it possible to wait until a BatchJob (BatchRequest object) in GCP is completed?
E.g. you can do it with a normal Job:
final Job job = createJob(jobId, projectId, datasetId, tableId, destinationBucket);
service.jobs().insert(projectId, job).execute();
final Get request = service.jobs().get(projectId, jobId);
JobStatus response;
while (true) {
    Thread.sleep(500); // improve this sleep policy
    response = request.execute().getStatus();
    if (response.getState().equals("DONE") || response.getState().equals("FAILED"))
        break;
}
Something like the above code works fine. The problem with BatchRequest is that the jobRequest.execute() method does not return a Response object.
When you execute it, the batch request returns after it has initialised all the jobs specified in its queue, but it does not wait until all of them have really finished. Indeed, your execute() method returns, but jobs can still fail later on (e.g. errors due to quota issues, schema issues, etc.), and I can't notify the client in time with the right information.
You can just check the status of all the created jobs in the web UI with the job history button in the BigQuery view, but you can't return an error message to a client.
Any ideas on how to handle that?
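One possible direction (a rough, unverified sketch using the standard com.google.api.client batch API and the same service as above; jobsToRun is just a placeholder) is to collect the job references from the batch callbacks and then poll them with the same loop as for a single job:
// Sketch: queue the inserts on a BatchRequest, remember each job reference from
// the callbacks, then poll every job until it is DONE and inspect getErrorResult().
List<JobReference> submitted = new ArrayList<>();
BatchRequest batch = service.batch();
for (Job job : jobsToRun) {
    service.jobs().insert(projectId, job).queue(batch, new JsonBatchCallback<Job>() {
        @Override
        public void onSuccess(Job inserted, HttpHeaders responseHeaders) {
            submitted.add(inserted.getJobReference());
        }
        @Override
        public void onFailure(GoogleJsonError error, HttpHeaders responseHeaders) {
            // The insert itself was rejected; report this to the client immediately.
        }
    });
}
batch.execute(); // returns once the inserts are queued, not when the jobs finish

for (JobReference ref : submitted) {
    JobStatus status;
    do {
        Thread.sleep(500); // improve this sleep policy
        status = service.jobs().get(projectId, ref.getJobId()).execute().getStatus();
    } while (!"DONE".equals(status.getState()));
    if (status.getErrorResult() != null) {
        // The job finished with an error (quota, schema, ...); surface it to the client.
    }
}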