How to write Kafka consumers: single-threaded vs. multi-threaded (Java)

I have written a single Kafka consumer (using Spring Kafka) that reads from a single topic and is part of a consumer group. Once a message is consumed, the consumer performs all downstream operations and moves on to the next message offset. I have packaged this as a WAR file, and my deployment pipeline pushes it out to a single instance. Using my deployment pipeline, I could potentially deploy this artifact to multiple instances in my deployment pool.
However, when I want multiple consumers as part of my infrastructure, I am weighing the following two approaches -
I can define multiple instances in my deployment pool and have this WAR running on all of them. This would mean all of them listen to the same topic, are part of the same consumer group, and divide the partitions among themselves. The downstream logic works as is. This works perfectly fine for my use case; however, I am not sure if it is the optimal approach to follow.
Reading online, I came across resources here and here, where people define a single consumer thread but internally create multiple worker threads. There are also examples where we could define multiple consumer threads that do the downstream logic. Thinking about these approaches and mapping them to deployment environments, we could achieve the same result (as my theoretical solution above), but with fewer machines.
Personally, I think my solution is simple and scalable but might not be optimal, while the second approach might be optimal; I wanted to know your experiences, suggestions, or any other metrics/constraints I should consider. Also, I am thinking that with my theoretical solution, I could employ bare-bones, simple machines as Kafka consumers.
While I know I haven't posted any code, please let me know if I need to move this question to another forum. If you need specific code examples, I can provide them too, but I didn't think they were important in the context of my question.

Your existing solution is best. Handing off to another thread will cause problems with offset management. Spring Kafka allows you to run multiple threads in each instance, as long as you have enough partitions.
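For reference, a minimal sketch of that multiple-threads-per-instance setup, assuming Spring Kafka's ConcurrentKafkaListenerContainerFactory (the broker address and group id below are placeholders):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.annotation.EnableKafka;
    import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
    import org.springframework.kafka.core.ConsumerFactory;
    import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

    @Configuration
    @EnableKafka
    public class KafkaConsumerConfig {

        @Bean
        public ConsumerFactory<String, String> consumerFactory() {
            Map<String, Object> props = new HashMap<>();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            return new DefaultKafkaConsumerFactory<>(props);
        }

        @Bean
        public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
            ConcurrentKafkaListenerContainerFactory<String, String> factory =
                    new ConcurrentKafkaListenerContainerFactory<>();
            factory.setConsumerFactory(consumerFactory());
            // Three listener threads in this one instance. Each thread gets its own
            // Kafka Consumer and its own share of the partitions, so offset
            // management stays correct without any hand-off between threads.
            factory.setConcurrency(3);
            return factory;
        }
    }

With, say, three instances deployed and concurrency set to 3, you'd want at least nine partitions on the topic; any surplus thread simply sits idle.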

If your current approach works, just stick to it. It's the simple and elegant way to go.
You would only go for approach 2 if for some reason you cannot increase the number of partitions but need a higher level of parallelism. But then you have ordering and race conditions to worry about. If you ever need to go that route, I'd recommend the akka-stream-kafka library, which provides facilities to handle offset commits correctly, do what you need in parallel, and then merge back into a single stream preserving the original ordering, etc. Otherwise, these things are error-prone to do yourself.

Related

Axon Framework - is it possible to have a single tracking event processor for multiple sagas?

Let's start off by saying that I'm using version 3.1.2 of Axon Framework with tracking event processors enabled for both @EventHandlers and Sagas.
The current default behaviour when creating event processors for Sagas, as I'm seeing it, is to create a single tracking event processor per Saga. This works quite well at microservice scale, but might turn out to be a problem in big monolithic applications which implement a lot of Sagas. Since I'm writing such an application, I want better control over the number of running threads, which in turn gives me better control over the usage of the database connection pool, context switching, and memory usage. Ideally, I would like to have as many tracking event processors as CPU cores, where each event processor executes multiple Sagas and/or @EventHandlers.
I have already figured out that I'm able to do that for @EventHandlers via either the @ProcessingGroup annotation or the EventHandlingConfiguration::assignHandlersMatching method, but SagaConfiguration does not seem to expose a similar API. In fact, the most specific method, SagaConfiguration::trackingSagaManager, is hardcoded to create a new TrackingEventProcessor object, which makes me think that what I'm trying to achieve is currently impossible. So here's my question: is there some non-straightforward way that I'm missing which will let me execute multiple Sagas in the context of a single event processor?
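For illustration, here is roughly what that looks like on the @EventHandler side (a sketch assuming Axon 3.1's EventHandlingConfiguration; the processor name and package are hypothetical):

    import org.axonframework.config.EventHandlingConfiguration;

    // Pin all event handlers from one package onto a single tracking processor.
    // Nothing comparable seems to exist on SagaConfiguration, which is the gap.
    EventHandlingConfiguration eventHandling = new EventHandlingConfiguration()
            .usingTrackingProcessors()
            .assignHandlersMatching("shared-processor",
                    handler -> handler.getClass().getPackage().getName()
                            .startsWith("com.example.projections"));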
I can confirm that it is (currently) not possible to have multiple Sagas managed by a single EventProcessor. That said, I'm unsure about the pros and cons of doing so, as your scenario doesn't sound too weird at first glance.
I recommend dropping a feature request on the AxonFramework GitHub page. That way we (1) document this idea/desire and (2) have a good place to discuss whether to implement it or not.

Why should we have 1 Pact file for EVERY consumer if all consumers use the same API in the same way?

I am trying to introduce the Pact framework in our company, and one of the concerns raised was the following:
Scenario: this xyz API is called by 40 consumers, and every consumer currently needs the same functionality. So why should we maintain 40 Pact files as opposed to just maintaining a single file?
Is there any better approach than having ONE pact file for EACH consumer, considering pact file maintenance?
If you have 40 consumers using exactly the same functionality, there wouldn't be an issue with using a single pact file for all those interactions.
However, I find it extremely hard to believe that this is your case; as far as I've seen, it never actually happens in reality. Your provider might have all this functionality, but each consumer doesn't have to test for functionality it doesn't actually use. Furthermore, each consumer might have a different way to call/access this functionality, with different types of data, headers, or URLs, which makes each contract unique and necessary to test the provider fully for all potential edge cases.
Also, unless a consumer updates to use new functionality, there's no reason to update its pact file. The point is to create independent files for each consumer-provider interaction; this does, however, increase maintenance when you have multiple consumers for one provider.
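To make that concrete, here is a consumer-side sketch, assuming the JUnit 4 DSL from pact-jvm's pact-jvm-consumer-junit module (the provider name, state, path, and payload are all made up). A second consumer calling the same endpoint with different headers or expecting different fields would produce a genuinely different pact file:

    import static org.junit.Assert.assertEquals;

    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.junit.Rule;
    import org.junit.Test;

    import au.com.dius.pact.consumer.Pact;
    import au.com.dius.pact.consumer.PactProviderRuleMk2;
    import au.com.dius.pact.consumer.PactVerification;
    import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
    import au.com.dius.pact.model.RequestResponsePact;

    public class ConsumerAPactTest {

        @Rule
        public PactProviderRuleMk2 provider =
                new PactProviderRuleMk2("xyz-api", "localhost", 8080, this);

        // This interaction records only what consumer-a actually uses.
        @Pact(provider = "xyz-api", consumer = "consumer-a")
        public RequestResponsePact pact(PactDslWithProvider builder) {
            return builder
                    .given("user 42 exists")
                    .uponReceiving("consumer-a reads user 42")
                        .path("/users/42")
                        .method("GET")
                    .willRespondWith()
                        .status(200)
                        .body("{\"name\": \"Alice\"}")
                    .toPact();
        }

        @Test
        @PactVerification("xyz-api")
        public void fetchesUser() throws Exception {
            // Hit the mock provider started by the rule.
            HttpURLConnection con = (HttpURLConnection)
                    new URL(provider.getUrl() + "/users/42").openConnection();
            assertEquals(200, con.getResponseCode());
        }
    }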
We will take your feedback and see what we can do for the future of the product to make this maintenance easier to do or minimize the impact.

Distributed Metrics

I have been working on a single-box application which uses Codahale Metrics heavily for instrumentation. We are now moving to the cloud, and I have the questions below about how to monitor metrics when the application is distributed.
1. Is there a metrics reporter that can write metrics data to Cassandra?
2. When and how does the aggregation happen if there are records per server in the database?
3. Can I define the time interval at which the metrics data gets saved to the database?
4. Are there any built-in frameworks available to achieve this?
Thanks a bunch and appreciate all your help.
I am answering your questions first, but I think you are misunderstanding how to use Metrics.
1. You can google this fairly easily. I don't know of any (I also don't understand what you'd do with it in Cassandra). You would normally use something like Graphite for that. In any case, a reporter implementation is very straightforward and easy.
2. That question does not make much sense. Why would you aggregate over two different servers? They are independent. Each of your monitored instances should be standalone. Aggregation happens on the receiving side (e.g. Graphite).
3. You can; see 1. Write a reporter and configure it accordingly.
4. Not that I know of.
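To illustrate points 1 and 3, here is a minimal sketch of a custom Codahale/Dropwizard ScheduledReporter (the class and metric names are made up). The backing store is whatever you write in report(), and the save interval is whatever you pass to start():

    import java.util.SortedMap;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.Gauge;
    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricFilter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.ScheduledReporter;
    import com.codahale.metrics.Timer;

    public class MyStoreReporter extends ScheduledReporter {

        public MyStoreReporter(MetricRegistry registry) {
            super(registry, "my-store-reporter", MetricFilter.ALL,
                    TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
        }

        @Override
        public void report(SortedMap<String, Gauge> gauges,
                           SortedMap<String, Counter> counters,
                           SortedMap<String, Histogram> histograms,
                           SortedMap<String, Meter> meters,
                           SortedMap<String, Timer> timers) {
            // Replace this with writes to whatever store you choose
            // (Cassandra, Graphite, plain files, ...).
            counters.forEach((name, counter) ->
                    System.out.printf("%s count=%d%n", name, counter.getCount()));
        }

        public static void main(String[] args) throws InterruptedException {
            MetricRegistry registry = new MetricRegistry();
            registry.counter("requests").inc();
            MyStoreReporter reporter = new MyStoreReporter(registry);
            // The interval here is question 3's "time interval": save every 30 seconds.
            reporter.start(30, TimeUnit.SECONDS);
            Thread.sleep(61_000L); // keep the demo alive for two reporting cycles
        }
    }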
Now to metrics in general:
I think you have the wrong idea. You can monitor X servers; that is not a problem at all, but you should not aggregate on the client side (or database side). How would that even work? Restarts zero the clients, which essentially means you would need to track the state of each of your servers for the aggregation to work. And how would you manage outages?
The way you should monitor your servers with metrics:
Create a namespace:
io.my.server.{hostname}.my.metric
Now you have X different namespaces, but they all share a common prefix. That means you have grouped them.
Send them to your preferred monitoring solution.
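As a sketch of that namespace idea with the stock metrics-graphite reporter (the Graphite host and port below are placeholders):

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    public class GraphiteSetup {
        public static void main(String[] args) throws Exception {
            MetricRegistry registry = new MetricRegistry();
            String hostname = InetAddress.getLocalHost().getHostName();

            Graphite graphite =
                    new Graphite(new InetSocketAddress("graphite.example.com", 2003));
            GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                    // Produces names like io.my.server.{hostname}.my.metric,
                    // so all hosts group under the common prefix.
                    .prefixedWith("io.my.server." + hostname)
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build(graphite);
            reporter.start(30, TimeUnit.SECONDS);
        }
    }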
There are heaps out there. I do not understand why you want this to be Cassandra; what advantage do you gain from that? http://graphite.wikidot.com/ for example is a graphing solution. Your applications can automatically submit data there (Graphite comes with a Java reporter that you can use). See http://graphite.wikidot.com/screen-shots for how it looks.
The main point is that Graphite (and all or most providers) knows how to handle your namespaces. Also look at Zabbix, for example, which can do the same thing.
Aggregations
Now the aggregation happens on the receiving side. Your provider knows how to do that, and you can define rules.
For example, you could wildcard alerts like:
io.my.server.{hostname}.my.metric.count > X
Graphite (I believe) even supports operations, e.g:
sum(io.my.server.{hostname}.my.metric.request) - which would sum up ALL your hosts' requests
That is where the aggregation happens. At that point, your servers are again standalone (as they should be) and have no dependency on each other or on any monitoring database etc. They simply report their own metrics (which is what they should do), and you - as the consumer of those metrics - are responsible for creating the right alerts/aggregations/formulas on the receiving end.
Aggregating this on server side would involve:
Discover all other servers
Monitor their state
Receive/send metrics back and forth
Synchronise what they report etc
That just sounds like a nightmare for maintenance :) I hope that gives you some insight/ideas.
(Disclaimer: I'm neither a Metrics dev nor a Graphite dev - this is just how I did this in the past, and the approach I still use)
Edit:
With your comment in mind, here are my two favourite solutions for what you want to achieve:
DB
You can use the DB and store dates, e.g. for the start message and the end message.
This is not really a metrics thing, so it may not be preferred. As per your question, you could write your own reporter for that, but it would get complicated with regards to upserts/updates etc. I think option 2 is easier and has more potential.
Logs
This is, I think, what you need. Your servers independently log Start/Stop/Pause events etc. - whatever it is you want to report on. You then set up Logstash and collect those logs.
Logstash allows you to track these events over time and create metrics from them; see:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html
Or:
https://github.com/logstash-plugins/logstash-filter-elapsed
The first one uses actual metrics; the second one is a different plugin that just measures times between start/stop events.
This is the option with the most potential because it does not rely on any particular format or data store. You even get Kibana for plotting out of the box if you use the entire ELK stack.
Say you wanted to measure your messages: you can just look at the logs; there are no application changes involved. The solution does not even touch your application (storing your reporting data manually takes up threads and processing in your application, so if you need to be real-time compatible this would drag your overall performance down); it is a completely standalone solution. Later on, when you want to measure other metrics, you can easily add to your Logstash configuration and start collecting them.
I hope this helps

Parallel and Transactional Processing in Java (Java EE)

I have an architectural question about how to handle big tasks in Java/Java EE in a way that is both transactional and scalable.
The general challenge
I have a web application (Tomcat right now, but that should not limit the solution space, so just take this to illustrate what I'd like to achieve). This web application is distributed over several (virtual and physical) nodes, connected to a central DBMS (MySQL in this case, but again, this should not limit the solution...) and able to handle some 1000s of users, serving pages, doing stuff, just as you'd expect from your average web-based information system.
Now, there are some tasks which affect a larger portion of data, and the system should be optimized to carry out these tasks reasonably fast (faster than processing everything sequentially, that is). So I'd make the task parallel and distribute it over several (or all) nodes.
(Note: the data portions which are processed are independent, so there are no database or locking conflicts here).
The problem is, I'd like the (whole) task to be transactional. So if one of the parallel subtasks fails, I'd like to have all other tasks rolled back as a result. Otherwise the system would be in a potentially inconsistent state from a domain perspective.
Current implementation
As I said, the current implementation uses Tomcat and MySQL. The nodes use JMS to communicate: there is a JMS server to which a dispatcher sends a message for each subtask; executors take tasks from the message queue, execute them, and post the results to a result queue, from which the dispatcher collects the results. The dispatcher blocks and waits for all results to come in, and if everything is fine, it terminates with an OK status.
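In code, the dispatcher's fan-out/fan-in loop looks roughly like this (a sketch against the plain JMS API; the session, producer and consumer wiring, and the timeout are all assumptions):

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    import javax.jms.JMSException;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.ObjectMessage;
    import javax.jms.Session;

    public class Dispatcher {

        private final Session session;                 // session for creating messages
        private final MessageProducer taskProducer;    // bound to the task queue
        private final MessageConsumer resultConsumer;  // bound to the result queue

        public Dispatcher(Session session, MessageProducer taskProducer,
                          MessageConsumer resultConsumer) {
            this.session = session;
            this.taskProducer = taskProducer;
            this.resultConsumer = resultConsumer;
        }

        public List<Serializable> dispatch(List<? extends Serializable> subTasks)
                throws JMSException {
            // Fan out: one message per subtask.
            for (Serializable task : subTasks) {
                taskProducer.send(session.createObjectMessage(task));
            }
            // Fan in: block until every executor has reported back.
            List<Serializable> results = new ArrayList<>();
            while (results.size() < subTasks.size()) {
                ObjectMessage msg = (ObjectMessage) resultConsumer.receive(30_000L);
                if (msg == null) {
                    throw new JMSException("Timed out waiting for subtask results");
                }
                results.add(msg.getObject());
            }
            return results;
        }
    }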
The problem here is that all the executors have their own local transaction context, so each subtask commits or rolls back on its own.
If for some reason one of the subtasks fails, its local transaction is rolled back and the dispatcher gets an error result. (There is a failsafe mechanism here which tries to repeat the failed transaction, but let's assume that for some reason the one task cannot be completed.)
The problem is that the system is now in a state where all transactions but one are already committed and completed. And because I cannot get that one final transaction to finish successfully, I cannot get out of this state.
Possible solutions
These are the thoughts which I have followed so far:
I could implement a domain-specific rollback mechanism myself. Because the dispatcher knows which tasks have been carried out, it could revert their effects explicitly (e.g. by storing old values somewhere and reverting already-committed values back to the previous ones). Of course, in this case I must guarantee that no other process changes anything in between, so I'd also have to set the system to a read-only state for as long as the big operation is running.
More or less, I'd need to simulate a transaction in business logic...
I could choose not to parallelize and do everything on a single node in one big transaction (but as stated at the beginning, I need to speed up processing, so this is not an option...)
I have tried to find out about XA transactions and distributed transactions in general, but this seems to be an advanced Java EE feature which is not implemented in all Java EE servers, and it would not really solve the basic problem, because there does not seem to be a way to transfer a transaction context over to a remote node in an asynchronous call (e.g. section 4.5.3 of the EJB 3.1 specification: "Client transaction context does not propagate with an asynchronous method invocation. From the Bean Developer's view, there is never a transaction context flowing in from the client.").
The Question
Am I overlooking something? Is it not possible to distribute a task asynchronously over several nodes and at the same time have a (shared) transactional state which can be rolled back as a whole?
Thanks for any pointers, hints, propositions ...
If you want to distribute your application as described, JTA is your friend in a Java EE context. Since it's part of the Java EE spec, you should be able to use it in any compliant container. As with all implementations of the spec there are differences in the details and configuration (as with JPA, for example), but in real life it's very uncommon to change application servers often.
But without knowing the details and complexity of your problem, my advice is to rethink whether you really need to distribute the task execution for a single use case, or whether it's better to keep everything belonging to one use case within one node, even though you might need several nodes for the overall application. If you really have to use several nodes to fulfill your requirements, then I'd go for distributed tasks which do not write directly to the database, but give back results which are then committed or rolled back by the one component which initiated the tasks.
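A sketch of that last suggestion, boiled down to one JVM for readability (the JDBC URL, table, and SubResult shape are assumptions, and it uses a Java 16+ record for brevity; in the distributed case the futures would be replaced by the JMS result queue): workers only compute, and the initiating component applies every write in a single transaction.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class Coordinator {

        /** Result of one subtask: which row to update and its new value. */
        public record SubResult(long rowId, String newValue) {}

        public void runBigTask(List<Callable<SubResult>> subTasks) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            try {
                // Fan out: compute in parallel; no database writes happen here.
                List<Future<SubResult>> futures = pool.invokeAll(subTasks);

                // Fan in: apply all results in one transaction, all-or-nothing.
                try (Connection con =
                             DriverManager.getConnection("jdbc:mysql://db-host/app")) {
                    con.setAutoCommit(false);
                    try (PreparedStatement ps = con.prepareStatement(
                            "UPDATE items SET value = ? WHERE id = ?")) {
                        for (Future<SubResult> f : futures) {
                            SubResult r = f.get(); // throws if any subtask failed
                            ps.setString(1, r.newValue());
                            ps.setLong(2, r.rowId());
                            ps.addBatch();
                        }
                        ps.executeBatch();
                        con.commit();
                    } catch (Exception e) {
                        con.rollback(); // nothing was made visible
                        throw e;
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }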
And don't forget to measure first, before over-engineering the architecture. Try to keep it simple at first, assuming that one node can handle it, and then write a stress test which tries to break your system, to learn the maximum load it can handle with the given architecture.

Monitoring Changes of a Bean to build deltas?

I have several beans in my application which get updated regularly via the usual setter methods. I want to synchronize these beans with a remote application which has the same bean classes. In my case bandwidth matters, so I have to keep the number of transferred bytes as low as possible. My idea was to create deltas of the state changes and transfer them instead of the whole objects. Currently I plan to write the protocol for transferring those changes myself, but I'm not bound to that and would prefer an existing solution.
Is there already a solution for this problem out there? And if not, how could I easily monitor those state changes in a generalized way? AOP?
Edit: This problem is not caching-related, even if it may seem so at first. The data must be replicated from a central server to several clients (about 4 to 10) over the internet. The client is a standalone desktop application.
This sounds remarkably similar to JBossCache running in POJO mode.
This is a distributed, delta-based cache that breaks Java objects down into a tree structure and only transmits changes to the bits of the tree that change.
Should be a perfect fit for you.
I like your idea of creating deltas and sending them.
A simple Map could hold the delta for one object; serializing it would give you the actual message to send.
To reduce the number of messages, which would otherwise kill your performance, you should group the deltas for all objects and send them as a whole. So you could have other collections or maps to contain them.
To monitor all changes to many beans, AOP seems like a good solution.
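Without pulling in an AOP framework, that Map-based delta can also be collected with the JDK's own PropertyChangeSupport; a sketch with an invented bean (fire the change in every setter, then drain and serialize the map):

    import java.beans.PropertyChangeSupport;
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    public class MonitoredBean implements Serializable {

        private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
        private final Map<String, Object> delta = new HashMap<>();

        private String name;

        public MonitoredBean() {
            // Record every property change into the delta map. Note that
            // firePropertyChange skips events where old and new are equal,
            // which is exactly what you want for deltas.
            changes.addPropertyChangeListener(evt ->
                    delta.put(evt.getPropertyName(), evt.getNewValue()));
        }

        public void setName(String name) {
            String old = this.name;
            this.name = name;
            changes.firePropertyChange("name", old, name);
        }

        /** Returns the accumulated changes to serialize/send, and starts a fresh delta. */
        public Map<String, Object> drainDelta() {
            Map<String, Object> out = new HashMap<>(delta);
            delta.clear();
            return out;
        }
    }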
EDIT: see Skaffmann's answer.
Using an existing cache technology could be better.
Many problems could already have solutions implemented...
