The problem:
I have three instances of a Java application running in Kubernetes. The application uses Apache Camel to read from a Kinesis stream. I'm currently observing a few related issues:
Each of the three running instances of my application is processing the records coming into the stream, when I only want each record to be processed once (I want three up and running for scaling purposes). I was hoping that while one instance is processing record A, a second could be picking up record B, etc.
Every time my application is re-deployed in Kubernetes, each instance starts processing the stream from the beginning again (in other words, it has no idea where it left off or which records have previously been processed).
After 5 minutes, the shard iterator that my application uses to poll Kinesis times out. I know that this is normal behavior, but what I don't understand is why my application is not grabbing a new iterator. This screenshot shows the error from Datadog.
What I've tried:
First off, I believe these issues are caused by inconsistent shard iterator IDs and Kinesis consumer IDs across the three instances of my application, and across deploys. However, I have been unable to locate where these values are set in code, or how I could go about setting them. Of course, there may also be a better solution altogether. I have found very little documentation on Kinesis, Kubernetes, and Camel working together, so few outside sources have been helpful.
The documentation on AWS Kinesis :: Apache Camel is very limited, but what I have tried is playing around with the iterator type and building a custom ClientConfiguration.
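For reference, the kind of tweaking I mean looked roughly like this (iteratorType is an option documented for the camel-aws-kinesis component; the ClientConfiguration values and stream name are placeholders):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

// Custom low-level client settings (placeholder values)
ClientConfiguration clientConfig = new ClientConfiguration()
        .withMaxErrorRetry(3)
        .withConnectionTimeout(10_000);

AmazonKinesis kinesisClient = AmazonKinesisClientBuilder.standard()
        .withClientConfiguration(clientConfig)
        .build();

// Endpoint with an explicit iterator type instead of the default
String rawKinesisName = "my-raw-stream"; // placeholder
String inputUri = String.format(
        "aws-kinesis://%s?amazonKinesisClient=#kinesisClient&iteratorType=LATEST",
        rawKinesisName);
```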
Let me know if you need any additional information, thanks.
Configuring the client:
main.bind("kinesisClient", AmazonKinesisClientBuilder.defaultClient());
.
.
.
inputUri = String.format("aws-kinesis://%s?amazonKinesisClient=#kinesisClient", rawKinesisName);
main.configure().addRoutesBuilder(new RawDataRoute(inputUri, inputTransform));
My route:
public class RawDataRoute extends RouteBuilder {

    private static final Logger LOG = new Logger(RawDataRoute.class, true);

    private String rawDataStreamUri;
    private Expression transform;

    public RawDataRoute(final String rawDataStreamUri, final Expression transform) {
        this.rawDataStreamUri = rawDataStreamUri;
        this.transform = transform;
    }

    @Override
    public void configure() {
        // TODO add error handling
        from(rawDataStreamUri)
            .routeId("raw_data_stream")
            .transform(transform)
            .to("direct:main_input_stream");
    }
}
Related
I am unable to read in batches with the Camel Kafka consumer, despite following an example posted here. Are there changes I need to make to my producer, or is the problem most likely with my consumer configuration?
The application in question uses the Camel Kafka component to ingest messages from a REST endpoint, validate them, and place them on a topic. I then have a separate service that consumes them from the topic and persists them in a time-series database.
The messages were being produced and consumed one at a time, but the database expects the messages to be consumed and committed in batches for optimal performance. Without touching the producer, I tried adjusting the consumer to match the example in the answer to this question:
How to transactionally poll Kafka from Camel?
I wasn't sure how the messages would appear, so for now I'm just logging them:
from(kafkaReadingConsumerEndpoint)
    .routeId("rawReadingsConsumer")
    .process(exchange -> {
        // simple approach to generating errors
        String body = exchange.getIn().getBody(String.class);
        if (body.startsWith("error")) {
            throw new RuntimeException("can't handle the message");
        }
        log.info("BODY:{}", body);
    })
    .process(kafkaOffsetManager);
But the messages still appear to be coming across one at a time with no batch read.
My consumer config is this:
kafka:
  host: myhost
  port: myport
  consumer:
    seekTo: beginning
    maxPartitionFetchBytes: 55000
    maxPollRecords: 50
    consumerCount: 1
    autoOffsetReset: earliest
    autoCommitEnable: false
    allowManualCommit: true
    breakOnFirstError: true
Does my config need work, or are there changes I need to make to the producer to have this work correctly?
At the lowest layer, the KafkaConsumer#poll method returns a ConsumerRecords object that you still iterate over record by record; there's no way around that.
I don't have in-depth experience with Camel, but in order to get a "batch" of records you'll need some intermediate collection to "queue" the data that you eventually want to send downstream to some "collection consumer" process. Then you will need a "switch" processor that decides either "wait, process this batch" or "continue filling this batch".
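A bare-bones sketch of that idea (plain Java, not a Camel API; all names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Buffers records and releases them as a batch once the size threshold is hit.
class RecordBatcher<T> {
    private final int batchSize;
    private final List<T> buffer = new ArrayList<>();

    RecordBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Adds one record; returns a full batch when ready, otherwise an empty list. */
    synchronized List<T> add(T record) {
        buffer.add(record);
        if (buffer.size() < batchSize) {
            return Collections.emptyList();      // "continue filling this batch"
        }
        List<T> batch = new ArrayList<>(buffer); // "wait, process this batch"
        buffer.clear();
        return batch;
    }
}
```

In practice you would also want a time-based flush so a partially filled batch doesn't sit around forever.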
As far as databases go, that process is exactly what Kafka Connect JDBC Sink does with batch.size config.
We solved a similar requirement by using the Aggregation [1] capability provided by Camel.
A rough code snippet:
@Override
public void configure() throws Exception {
    // 1. Define your Aggregation Strategy
    AggregationStrategy agg = AggregationStrategies.flexible(String.class)
        .accumulateInCollection(ArrayList.class)
        .pick(body());

    from("kafka:your-topic?and-other-params")
        // 2. Define your Aggregation Strategy params
        .aggregate(constant(true), agg)
            .completionInterval(1000)
            .completionSize(100)
            .parallelProcessing(true)
        // 3. Generate bulk insert statement
        .process(exchange -> {
            List<String> body = (List<String>) exchange.getIn().getBody();
            String query = generateBulkInsertQueryStatement("target-table", body);
            exchange.getMessage().setBody(query);
        })
        .to("jdbc:dataSource");
}
There are a variety of strategies you can implement, but we chose this particular one because it lets you build a List of Strings from the message contents that we need to ingest into the DB. [2]
We set a variety of different params such as completionInterval and completionSize. The most important one for us was parallelProcessing(true) [3]; without it we weren't getting anywhere near the required throughput.
Once the aggregation has either collected 100 messages or 1000 ms has passed, the processor generates a bulk insert statement, which then gets sent to the DB.
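For completeness, generateBulkInsertQueryStatement isn't shown above; a naive, purely illustrative stand-in (no escaping or parameter binding, so don't use it verbatim) could look like:

```java
import java.util.List;

// Purely illustrative; assumes each element is already a "(v1, v2, ...)" values tuple
// and performs no escaping or parameter binding.
static String generateBulkInsertQueryStatement(String table, List<String> rows) {
    return "INSERT INTO " + table + " VALUES " + String.join(", ", rows);
}
```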
[1] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html
[2] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html#_aggregating_into_a_list
[3] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html#_worker_pools
We recently upgraded Kafka to v1.1 and Confluent to v4.0. But upon upgrading we have encountered a persistent problem regarding state stores. Our application starts a collection of streams, and we check for the state stores to be ready (giving up and killing the application after 100 tries). But after the upgrade there's at least one stream that will report Store is not ready : the state store, <your stream>, may have migrated to another instance
The stream itself is in the RUNNING state and messages flow through, but the state of the store still shows up as not ready, so I have no idea what may be happening.
Should I not check for store state?
And since our application has a lot of streams (~15), would starting them simultaneously cause problems?
Should we not do a hard restart -- currently we run it as a service on Linux?
We are running Kafka in a cluster with 3 brokers. Below is a sample stream (not the entire code):
public BaseStream createStreamInstance() {
    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
    final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
    final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);

    MessagePayLoadParser<Note> noteParser = new MessagePayLoadParser<Note>(Note.class);
    GenericJsonSerde<Note> noteSerde = new GenericJsonSerde<Note>(Note.class);

    StreamsBuilder builder = new StreamsBuilder();

    // The reducer below will use sets to combine.
    // value1 is what is already present in the store;
    // value2 is the incoming message and for notes should have at most 1 item in its list
    // (since it's 1 attachment / 1 tag per row, but multiple rows per note).
    Reducer<Note> reducer = new Reducer<Note>() {
        @Override
        public Note apply(Note value1, Note value2) {
            value1.merge(value2);
            return value1;
        }
    };

    KTable<Long, Note> noteTable = builder
        .stream(this.subTopic, Consumed.with(jsonSerde, jsonSerde))
        .map(noteParser::parse)
        .groupByKey(Serialized.with(Serdes.Long(), noteSerde))
        .reduce(reducer);

    noteTable.toStream().to(this.pubTopic, Produced.with(Serdes.Long(), noteSerde));

    this.stream = new KafkaStreams(builder.build(), this.properties);
    return this;
}
There are some open questions here, like the ones Matthias raised in the comments, but I'll try to address your actual questions:
Should I not check for store state?
Rebalancing is usually the cause here. But in that case you should not see the partition's original thread keep consuming; that processing should be "transferred" to another thread that took over. Check whether it really is that same thread still processing the partition, and not the new one. Use the kafka-consumer-groups utility to follow the consumers (threads) there.
And since our application has a lot of streams (~15), would starting them simultaneously cause problems? No, rebalancing is automatic.
Should we not do a hard restart -- currently we run it as a service on Linux. Are you keeping your state stores in a specific, non-default directory? You should configure the state store directory (state.dir) properly and make sure it is accessible and unaffected by application restarts. I'm not sure how you perform your hard restart, but some exception handling code should guard against it by closing your streams application cleanly.
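A minimal sketch of that wiring, assuming Kafka Streams 1.1 (application id, bootstrap servers and the state directory path are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
// Pin the state store directory to a stable, writable location that survives restarts
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");

StreamsBuilder builder = new StreamsBuilder();
// ... build your topology as in the question ...

KafkaStreams streams = new KafkaStreams(builder.build(), props);

// Close the application cleanly instead of relying on a hard kill
streams.setUncaughtExceptionHandler((thread, throwable) ->
        new Thread(streams::close).start()); // close from another thread to avoid blocking the failing one
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

streams.start();
```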
I am trying to use spring-kafka 1.3.x (1.3.3 and 1.3.4). What is not clear is whether there is a safe way to consume messages in batches without skipping a message (or set of messages) when an exception occurs, e.g. a network outage. My preference is also to leverage the container capabilities as much as possible so as to remain within the Spring framework, rather than trying to create a custom framework for dealing with this challenge.
I am setting the following properties onto a ConcurrentMessageListenerContainer:
.setAckOnError(false);
.setAckMode(AckMode.MANUAL);
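For context, the container factory wiring looks roughly like this (the bean name matches the containerFactory reference in the listener below; the consumer factory and generic types are placeholders):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> conatinerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setBatchListener(true);  // deliver List<...> payloads to the listener
    factory.getContainerProperties().setAckOnError(false);
    factory.getContainerProperties().setAckMode(AbstractMessageListenerContainer.AckMode.MANUAL);
    return factory;
}
```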
I am also setting the following kafka specific consumer properties:
enable.auto.commit=false
auto.offset.reset=earliest
If I set a RetryTemplate, I get a class cast exception since it only works for non-batch consumers. Documentation states retry is not available for batch so this may be OK.
I then setup a consumer such as this one:
```java
@KafkaListener(containerFactory = "conatinerFactory",
               groupId = "myGroup",
               topics = "myTopic")
public void onMessage(@Payload List<Entries> batchedData,
                      @Header(required = false, value = KafkaHeaders.OFFSET) List<Long> offsets,
                      Acknowledgment ack) {
    log.info("Working on: {}", offsets);
    int x = 1;
    if (x == 1) {
        log.info("Failure on: {}", offsets);
        throw new RuntimeException("mock failure");
    }
    // do nothing else for now
    // unreachable code
    ack.acknowledge();
}
```
When I send a message into the system to trigger the mock exception above, the only visible action is that the listener reports the exception.
When I send another (new) message into the system, the container consumes the new message. The old message is skipped since the offset is advanced to the next offset.
Since I have asked the container not to acknowledge (directly or indirectly), and since there are no other properties I can see that would tell the container not to advance, I am confused as to why the container does advance.
What I noticed is that for a similar scenario, the recommendation is to upgrade to 2.1.x and use the container stop capability that was added to the ContainerAwareErrorHandler there.
But what if you are stuck on 1.3.x for the time being: is there a way, or a property I am missing, to ensure the container does not advance to the next message or batch of messages?
I can see an option to create a custom framework around the consumer in order to achieve the desired effect. But are there other, simpler, more Spring-friendly options?
Thoughts?
From @garyrussell (spring-kafka GitHub project):
The offset has not been committed but the broker won't send the data again. You have to re-seek the topics/partitions.
2.1 provides the SeekToCurrentBatchErrorHandler which will re-seek automatically for you.
2.0 Added consumer-aware listeners, giving you access to the consumer (for seeking) in the listener.
With 1.3.x you have to implement ConsumerSeekAware and perform the seeks yourself (in the listener after catching the exception). Save off the ConsumerSeekCallback in a ThreadLocal.
You will need to add the partitions to your method signature; then seek to the lowest offset in the list for each partition.
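A rough sketch of that 1.3.x approach, pieced together from the above (the partition/topic header wiring and the ThreadLocal handling are illustrative, not a drop-in implementation; Entries is the payload type from the question):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.ConsumerSeekAware;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.messaging.handler.annotation.Payload;

public class BatchListener implements ConsumerSeekAware {

    // Save the callback per consumer thread, as suggested above
    private static final ThreadLocal<ConsumerSeekCallback> SEEK_CALLBACK = new ThreadLocal<>();

    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
        SEEK_CALLBACK.set(callback);
    }

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // no-op
    }

    @Override
    public void onIdleContainer(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // no-op
    }

    @KafkaListener(containerFactory = "conatinerFactory", groupId = "myGroup", topics = "myTopic")
    public void onMessage(@Payload List<Entries> batchedData,
                          @Header(KafkaHeaders.OFFSET) List<Long> offsets,
                          @Header(KafkaHeaders.RECEIVED_PARTITION_ID) List<Integer> partitions,
                          @Header(KafkaHeaders.RECEIVED_TOPIC) List<String> topics,
                          Acknowledgment ack) {
        try {
            process(batchedData); // your real processing
            ack.acknowledge();
        } catch (RuntimeException e) {
            // Seek each partition back to the lowest offset of the failed batch so it is redelivered
            Map<Integer, Long> lowestOffsetPerPartition = new HashMap<>();
            for (int i = 0; i < offsets.size(); i++) {
                lowestOffsetPerPartition.merge(partitions.get(i), offsets.get(i), Math::min);
            }
            lowestOffsetPerPartition.forEach((partition, offset) ->
                    SEEK_CALLBACK.get().seek(topics.get(0), partition, offset)); // single topic assumed
            throw e;
        }
    }

    private void process(List<Entries> batch) {
        // ...
    }
}
```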
Suppose you have Spark with the Standalone cluster manager. You opened a Spark session with some configs and want to launch SomeSparkJob 40 times in parallel with different arguments.
Questions
How to set the number of retries on job failure?
How to restart jobs programmatically on failure? This could be useful if jobs fail due to lack of resources. Then I could launch, one by one, all the jobs that require extra resources.
How to restart the Spark application on job failure? This could be useful if a job lacks resources even then. Then, to change cores, CPU, etc. configs, I would need to relaunch the application in the Standalone cluster manager.
My workarounds
1) I'm pretty sure the 1st point is possible, since it's possible in Spark local mode. I just don't know how to do that in standalone mode.
2-3) It's possible to hook a listener onto the Spark context, like spark.sparkContext().addSparkListener(new SparkListener() {...}). But it seems SparkListener lacks failure callbacks.
There is also a bunch of methods with very poor documentation. I've never used them, but perhaps they could help solve my problem:
spark.sparkContext().dagScheduler().runJob();
spark.sparkContext().runJob()
spark.sparkContext().submitJob()
spark.sparkContext().taskScheduler().submitTasks();
spark.sparkContext().dagScheduler().handleJobCancellation();
spark.sparkContext().statusTracker()
You can use SparkLauncher and control the flow.
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setAppResource("/my/app.jar")
                .setMainClass("my.spark.app.Main")
                .setMaster("local")
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .launch();
        spark.waitFor();
    }
}
See API for more details.
Since it creates a Process, you can check the process status and retry, e.g. using:
public boolean isAlive()
If the process is not alive, start it again; see the API for more details.
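For instance, a minimal retry loop around the launcher might look like this (retry count and launcher settings are placeholders):

```java
import org.apache.spark.launcher.SparkLauncher;

public class RetryingLauncher {
    public static void main(String[] args) throws Exception {
        int maxRetries = 3; // placeholder
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Process spark = new SparkLauncher()
                    .setAppResource("/my/app.jar")
                    .setMainClass("my.spark.app.Main")
                    .setMaster("local")
                    .launch();
            int exitCode = spark.waitFor(); // blocks until the application finishes
            if (exitCode == 0) {
                return; // success
            }
            System.err.println("Attempt " + attempt + " failed with exit code " + exitCode);
        }
    }
}
```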
Hopefully this gives a high-level idea of how to achieve what you mentioned in your question. There may be more ways to do the same thing, but I thought I'd share this approach.
Cheers!
Check your spark.sql.broadcastTimeout and spark.broadcast.blockSize properties and try increasing them.
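For example, if you build the session yourself (values are placeholders to tune for your workload):

```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("my-app")
        .config("spark.sql.broadcastTimeout", "600") // broadcast join timeout, in seconds
        .config("spark.broadcast.blockSize", "8m")   // block size used when broadcasting variables
        .getOrCreate();
```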
I'm currently developing some web services in Java (& JPA with MySQL connection) that are being triggered by an SAP System.
To simplify my problem I'm referring the two crucial entities as BlogEntry and Comment. A BlogEntry can have multiple Comments. A Comment always belongs to exactly one BlogEntry.
So I have three services (which I can't and don't want to redefine, since they're defined by the WSDL I exported from SAP, which is also used in parallel to communicate with other systems): CreateBlogEntry, CreateComment, CreateCommentForUpcomingBlogEntry
They are triggered properly, and there's absolutely no problem with CreateBlogEntry or CreateComment when they're called separately.
But: the service CreateCommentForUpcomingBlogEntry sends the Comment and a "foreign key" to identify the "upcoming" BlogEntry. Internally, it also calls CreateBlogEntry to create the actual BlogEntry. Due to their asynchronous nature, these two services race against each other.
So I have two options:
create a dummy BlogEntry, connect the Comment to it, and update the BlogEntry once CreateBlogEntry "arrives"
wait for CreateBlogEntry and connect the Comment to the new BlogEntry afterwards
Currently I'm trying the former, but once both services have fully executed, I end up with two BlogEntries. One of them only has the ID delivered by CreateCommentForUpcomingBlogEntry, but it is properly connected to the Comment (or rather the other way round). The other BlogEntry has all the other information (such as postDate or body), but the Comment isn't connected to it.
Here's the code snippet of the service implementation CreateCommentForUpcomingBlogEntry:
@EJB
private BlogEntryFacade blogEntryFacade;

@EJB
private CommentFacade commentFacade;

...

List<BlogEntry> blogEntries = blogEntryFacade.findById(request.getComment().getBlogEntryId().getValue());

BlogEntry persistBlogEntry;

if (blogEntries.isEmpty()) {
    persistBlogEntry = new BlogEntry();
    persistBlogEntry.setId(request.getComment().getBlogEntryId().getValue());
    blogEntryFacade.create(persistBlogEntry);
} else {
    persistBlogEntry = blogEntries.get(0);
}

Comment persistComment = new Comment();
persistComment.setId(request.getComment().getID().getValue());
persistComment.setBody(request.getComment().getBody().getValue());
/*
 * set other properties
 */
persistComment.setBlogEntry(persistBlogEntry);

commentFacade.create(persistComment);

...
And here's the code snippet of the implementation CreateBlogEntry:
@EJB
private BlogEntryFacade blogEntryFacade;

...

List<BlogEntry> blogEntries = blogEntryFacade.findById(request.getBlogEntry().getId().getValue());

BlogEntry persistBlogEntry;
Boolean update = false;

if (blogEntries.isEmpty()) {
    persistBlogEntry = new BlogEntry();
} else {
    persistBlogEntry = blogEntries.get(0);
    update = true;
}

persistBlogEntry.setId(request.getBlogEntry().getId().getValue());
persistBlogEntry.setBody(request.getBlogEntry().getBody().getValue());
/*
 * set other properties
 */

if (update) {
    blogEntryFacade.edit(persistBlogEntry);
} else {
    blogEntryFacade.create(persistBlogEntry);
}

...
This is some fiddling around that doesn't make things happen the way they're supposed to.
Sadly, I haven't found a way to synchronize these simultaneous service calls. I could let CreateCommentForUpcomingBlogEntry sleep for a few seconds, but I don't think that's the proper way to do it.
Can I force each instance of my facades and their respective EntityManagers to reload their datasets? Can I put my requests in some sort of queue that is being emptied based on certain conditions?
So: what's the best practice to make it wait for the BlogEntry to exist?
Thanks in advance,
David
Info:
GlassFish Server 3.1.2
EclipseLink, version: Eclipse Persistence Services - 2.3.2.v20111125-r10461
If you are sure you will get a CreateBlogEntry call, queue the CreateCommentForUpcomingBlogEntry calls, then dequeue and process them once you receive the CreateBlogEntry call.
Since you are on an application server, for queues you can probably use JMS queues that autoflush to storage, or use a DB/cache engine (Ehcache?), in case you receive a lot of calls or want to provide a recovery mechanism across restarts.
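A rough sketch of the queueing idea with plain JMS 1.1 (the JNDI names, the property-based selector, and the persistence steps are all placeholders, not your actual API):

```java
import java.io.Serializable;

import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.ObjectMessage;
import javax.jms.Queue;
import javax.jms.Session;

@Stateless
public class PendingCommentQueue {

    @Resource(mappedName = "jms/ConnectionFactory")      // placeholder JNDI names
    private ConnectionFactory connectionFactory;

    @Resource(mappedName = "jms/PendingCommentsQueue")
    private Queue pendingComments;

    // Called from CreateCommentForUpcomingBlogEntry instead of persisting right away.
    public void enqueue(Serializable commentPayload, String blogEntryId) throws Exception {
        Connection connection = connectionFactory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            ObjectMessage message = session.createObjectMessage(commentPayload);
            message.setStringProperty("blogEntryId", blogEntryId); // used as a selector later
            session.createProducer(pendingComments).send(message);
        } finally {
            connection.close();
        }
    }

    // Called from CreateBlogEntry after the BlogEntry has been persisted.
    public void drainFor(String blogEntryId) throws Exception {
        Connection connection = connectionFactory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer =
                    session.createConsumer(pendingComments, "blogEntryId = '" + blogEntryId + "'");
            connection.start();
            ObjectMessage message;
            while ((message = (ObjectMessage) consumer.receiveNoWait()) != null) {
                // persist the Comment from message.getObject(), now that the BlogEntry exists
            }
        } finally {
            connection.close();
        }
    }
}
```

Whether you drain with a selector like this or use a message-driven bean that re-enqueues unmatched comments is a design choice that depends on the expected call volume.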