Kafka - Assign messages to specific Consumer Groups - java

I have a small question about Kafka group IDs. In Java I can use this annotation:
@KafkaListener(topics = "insert", groupId = "user")
There I can set the groupId the listener consumes with, but it does not consume only messages meant for that group ID, and apparently I cannot send to a specific group ID either. How can I send a message to just one particular group ID? What is the group ID actually for, or do I need a dedicated topic to send Kafka messages to specific consumers?
I already tried to find an answer online, but found nothing, maybe I'm just searching wrong haha.
I hope this is understandable, if not please ask :)
Thanks a lot already!

Welcome to Kafka! First of all: You can't send to a consumer group, you send to a Topic.
Too much text below. Be aware of possible drowsiness while trying to read the entire answer.
If you are still reading this, I assume you truly want to know how to direct messages to specific clients, or you really need to get some sleep ASAP.
Maybe both. Do not drive afterwards.
Back to your question.
From that topic, multiple consumer groups can read. Every CG is independent from the others, so each one will read the topic from start to end on its own. Think of a CG as a union of endophobic consumers: they don't care about other groups, they will never talk to another group, they don't even know the others exist.
I can think of three different ways to achieve your goal, using different methodologies and/or architectures. Only the first one uses consumer groups, but the other two may also be helpful:
subscribe
assign
Multiple Topics
The first two are based on mechanisms to divide messages within a single topic. The third one is only justified in certain cases. Let's go through these options.
1. Subscribe and Consumer Groups
You could create a new Topic, fill it with messages, and add some metadata in order to recognize who needs to process each message (to whom each message is directed).
Messages stored in Kafka contain, among other fields, a KEY and a VALUE (the message payload itself).
So let's say you want only GROUP-A to process some specific messages. One simple solution could be including an identifier on the key, such as a suffix. One of your keys could look like: key#GA.
On the consumer side, you poll() the messages from that topic and add a little extra conditional logic before processing them: you read the key and check the suffix. If it corresponds to the specified consumer group, in this case if it contains GA, then the consumer from GROUP-A knows that it must process the message.
For example, your Topic stores messages of two different natures, and you want them to be directed to two groups: GROUP-A and GROUP-Z.
KEY       VALUE
[11#GA]   [MESSAGE]
[21#GZ]   [MESSAGE]
[33#GZ]   [MESSAGE]
[44#GA]   [MESSAGE]
Both consumer groups will poll those 4 messages, but only some of them will be processed by each group.
Group-A will discard the 2nd and 3rd messages. It will process the 1st and 4th.
Group-Z will discard the 1st and 4th messages. It will process the 2nd and 3rd.
This is basically what you are aiming for, but using some extra logic and playing with Kafka's architecture: messages with a certain suffix are "directed" to a specific consumer group and ignored by the other ones.
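To make it concrete, here is a minimal sketch of the GROUP-A side of that filter, assuming a plain KafkaConsumer, String keys/values and a local broker; the topic name "insert", the #GA suffix and the process() method are just placeholders. The producer simply sends records with keys like "11#GA" or "21#GZ".

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupAFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("group.id", "GROUP-A");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("insert"));   // topic name is a placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Only process messages whose key carries this group's suffix;
                    // everything else is consumed but simply skipped.
                    if (record.key() != null && record.key().endsWith("#GA")) {
                        process(record.value());
                    }
                }
            }
        }
    }

    private static void process(String value) {
        System.out.println("GROUP-A processing: " + value);
    }
}

If you are on Spring Kafka, you could do the same check at the top of your @KafkaListener method, or plug the condition into a RecordFilterStrategy so filtered records never reach your listener.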
2. Assign
The above solution is focused on consumer groups and Kafka's subscribe methodology. Another possible solution, instead of subscribing consumer groups, would be to use Kafka's assign method. No ConsumerGroup is involved here, so references to the previous groups will be quoted in order to avoid any confusion.
Assign allows you to directly specify the topic/partition from which your consumer must read.
On the producer side, you should partition your messages in order to divide them between the partitions within your topic, using your own logic. Some more in-depth info about custom partitioners here (yes, the author of that link seems like a complete douche).
For example, let's say you have 5 different types of consumers. So you create a Topic with 5 partitions, one for each "group". Your producer's custom partitioner identifies the corresponding partition for each message, so after producing the messages from the previous example, each message ends up in its "group's" partition.
In order to direct the messages to their corresponding "groups" :
"Group-Z" is assigned the 5th partition.
"Group-A" is assigned the 1st partition.
The advantage of this solution is that fewer resources are wasted: each "group" only polls its own messages, and since every message polled is already known to be directed at the consumer that polled it, you avoid the discard/accept logic: less traffic on the wire, fewer objects in memory, less CPU work.
The disadvantage is a more complex producer mechanism, which involves a custom partitioner that will almost certainly need constant updates as your data or topic structures change. Moreover, every time the producer side is altered you will also have to update the defined assignments of your consumers.
Personal note:
Assign offers better performance, but carries a high price: manual and constant control of producers, topics, partitions and consumers, hence being (possibly) more error-prone. I would call it the efficient solution.
Subscribe makes the whole process much simpler and will probably involve fewer problems/errors in the system, hence being more reliable. I would call it the effective solution.
Anyway, this is a totally subjective opinion.
Not finished yet
3. Multi-topic solution
The previously proposed solutions assume that the messages share the same nature, and hence are produced into the same Topic.
In order to explain what I'm trying to say here, let's say a Topic is represented as a storage building.
<-- laptops, tablets, smartphones, ...
The previous solutions assume that you store similar elements there, for example, electronic devices: their end of life is similar, the storage method is similar regardless of the specific device type, the machinery you use is the same, etc. With this in mind, it's completely logical to store all those elements in the same warehouse, divided into different sections (into the same topic, divided into different partitions).
There is no real reason to build a new warehouse for each electronic-device family (one for TVs, another for mobile phones, ... unless you are rolling in money). The previous solutions assume that your messages are different types of "electronic devices".
But time passes by, you are doing well, and you decide to start a new business: fruit storage.
Fruit has a shorter shelf life (log.retention.ms, anyone?), must be stored within a temperature range, and the storage elements and techniques from your first warehouse will probably differ a lot. Moreover, your fruit business could be closed during certain periods of the year, while electronic devices are received 24/365. Even if you open your devices warehouse daily, maybe the fruit storage only operates on Mondays and Tuesdays (and, with luck, is not temporarily closed because it's out of season).
As fruits and electronic devices need different types of storage management, you decide to build a new warehouse. Your new fruits topic.
<--bananas, kiwis, apples, chicozapotes,...
Creating a second topic is justified here, since each one may need different configuration values and each one stores content of a very different nature, which also leads to consumers with very different processing logic.
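If it helps, this is roughly what that looks like with the AdminClient; the topic names, partition counts, replication factor and retention values are invented for the warehouse analogy:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateWarehouseTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Durable goods: longer retention, more partitions for the device "groups".
            NewTopic devices = new NewTopic("electronic-devices", 5, (short) 1)
                    .configs(Collections.singletonMap("retention.ms", "604800000")); // 7 days
            // Perishables: much shorter retention, its own partition count.
            NewTopic fruits = new NewTopic("fruits", 2, (short) 1)
                    .configs(Collections.singletonMap("retention.ms", "86400000"));  // 1 day
            admin.createTopics(Arrays.asList(devices, fruits)).all().get();
        }
    }
}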
So, is this a 3rd possible solution?
Well, it does make you forget about consumer groups, partitioning mechanisms, manual assignments, etc. You only have to decide which consumers subscribe to which topic, and you're done: you have effectively directed the messages to specific consumers.
But, if you build a warehouse and start storing computers, would you really build another warehouse to store the phones that have just arrived? In real life you'd have to pay for the construction of the second building, as well as pay taxes on two buildings, pay for the cleaning of two buildings, and so on.
laptops here-> <-tablets here
In Kafka's world, this translates into extra work for the Kafka cluster (twice the replication requests, ZooKeeper gets a newborn with new ACLs and controllers, ...) and extra time for the human assigned to the job, who is now responsible for managing two topics: a worker spending time on something avoidable is synonymous with money lost by the company. Also, I am not aware whether they already do this or ever will, but cloud providers are somewhat fond of adding small fees for certain operations, such as creating a topic (this is just a possibility, and I may be wrong here).
To sum up, this is not necessarily a bad idea: it just needs a justified context. Use it if you are working with bananas and Qualcomm chips.
If you are working with Laptops and Tablets, go for the consumer group and partition solutions previously shown.

Related

Kafka Streams: Should we advance stream time per key to test Windowed suppression?

I learned from this blog and this tutorial that, in order to test suppression with event-time semantics, one should send dummy records to advance stream time.
I've tried to advance time by doing just that, but it does not seem to work unless time is advanced for that particular key.
I have a custom TimestampExtractor which associates my preferred "stream-time" with the records.
My stream topology pseudocode is as follows (I use the Kafka Streams DSL API):
source.mapValues(someProcessingLambda)
.flatMap(flattenRecordsLambda)
.groupByKey(Grouped.with(Serdes.ByteArray(), Serdes.ByteArray()))
.windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ZERO))
.aggregate(()->null, aggregationLambda)
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
My input is of the following format:
1 - {"stream_time":"2019-04-09T11:08:36.000-04:00", id:"1", data:"..."}
2 - {"stream_time":"2019-04-09T11:09:36.000-04:00", id:"1", data:"..."}
3 - {"stream_time":"2019-04-09T11:18:36.000-04:00", id:"2", data:"..."}
4 - {"stream_time":"2019-04-09T11:19:36.000-04:00", id:"2", data:"..."}
...
Now records 1 and 2 belong to a 10 minute window according to stream_time and 3 and 4 belong to another.
Within that window, records are aggregated as per id.
I expected that record 3 would signal that the stream has advanced and cause suppress to emit the data corresponding to 1st window.
However, the data is not emitted until I send a dummy record with id:1 to advance the stream time for that key.
Have I understood the testing instruction incorrectly? Is this expected behavior? Does the key of the dummy record matter?
I’m sorry for the trouble. This is indeed a tricky problem. I have some ideas for adding some operations to support this kind of integration testing, but it’s hard to do without breaking basic stream processing time semantics.
It sounds like you’re testing a “real” KafkaStreams application, as opposed to testing with TopologyTestDriver. My first suggestion is that you’ll have a much better time validating your application semantics with TopologyTestDriver, if it meets your needs.
It sounds to me like you might have more than one partition in your input topic (and therefore your application). In the event that key 1 goes to one partition, and key 3 goes to another, you would see what you’ve observed. Each partition of your application tracks stream time independently.
TopologyTestDriver works nicely because it only uses one partition, and also because it processes data synchronously. Otherwise, you’ll have to craft your “dummy” time advancement messages to go to the same partition as the key you’re trying to flush out.
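For reference, a minimal sketch of what the TopologyTestDriver route could look like. It assumes a recent Kafka Streams version (with TestInputTopic), that your topology reads from "input-topic" and writes its suppressed results to "output-topic", and that record timestamps drive stream time; buildTopology() stands in for the topology from your question:

import java.time.Duration;
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;

public class SuppressionTest {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put("application.id", "suppression-test");
        config.put("bootstrap.servers", "dummy:1234");   // never contacted by the test driver

        Topology topology = buildTopology();             // placeholder for the topology above
        try (TopologyTestDriver driver = new TopologyTestDriver(topology, config)) {
            TestInputTopic<String, String> input = driver.createInputTopic(
                    "input-topic", Serdes.String().serializer(), Serdes.String().serializer());
            TestOutputTopic<String, String> output = driver.createOutputTopic(
                    "output-topic", Serdes.String().deserializer(), Serdes.String().deserializer());

            Instant start = Instant.parse("2019-04-09T15:08:36Z");
            input.pipeInput("1", "{...}", start);                              // window 1, key 1
            input.pipeInput("1", "{...}", start.plus(Duration.ofMinutes(1)));  // window 1, key 1
            // A later record advances stream time past window 1 + grace, whatever its key,
            // because the driver has a single partition and processes synchronously.
            input.pipeInput("2", "{...}", start.plus(Duration.ofMinutes(11)));

            System.out.println(output.readKeyValuesToList());                  // window-1 result
        }
    }

    private static Topology buildTopology() {
        throw new UnsupportedOperationException("wire up the topology from the question here");
    }
}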
This is going to be especially tricky because your “flatMap().groupByKey()” is going to repartition the data. You’ll have to craft the dummy message so that it goes into the right partition after the repartition. Or you could experiment with writing your dummy messages directly into the repartition topic.
If you do need to test with KafkaStreams instead of TopologyTestDriver, I guess the easiest thing is just to write a “time advancement” message per key, as you were suggesting in your question. Not because it’s strictly necessary, but because it’s the easiest way to meet all these caveats.
I’ll also mention that we are working on some general improvements to stream time handling in Kafka Streams that should simplify the situation significantly, but that doesn’t help you right now, of course.

program that simulates queuing and service by requests at a fast food restaurant

I was asked to code a program without using any data structure libraries.
Inputs are:
The number of primary servers in the system.
The number of secondary servers in the system.
A set of service requests each consisting of an arrival time and two service times.
This set is terminated by a dummy record with arrival time and service times all equal to 0. (Note: the arrival times are sorted in ascending order).
I'm quite new to Java, so I would like advice on the best way of doing this, or resources that would help me understand the concept better.
I know we need to create two queues, one for the primary and one for the secondary servers, to hold requests while they are waiting to be served.
I'll probably have to create counters to increment and decrement for the time. Hopefully my thought process is right.
But I'm unsure how to go about creating multiple queues and what data structure I would use for the servers.
The way to approach this problem is to draw a diagram of the real-world situation. Customers come in and line up. There are X servers, each of which can handle one customer at a time. Also model the role of the secondary servers, whatever it is.
Then start handling transactions: server engages customer, passes order to secondary server queue and waits. Gets response from secondary server, finishes with customer, etc. Describe every place where information is exchanged between customers, servers, and secondary servers.
If you do that, you have a very good understanding of the problem you're trying to solve, and a real-world solution. Then you just have to model that in code. Your best bet is to first write a basic outline in pseudocode that describes the data structures and algorithms you're going to use. Once you have that, you can simulate its operation by (again) resorting to pencil and paper.
When you're convinced that you have the algorithm right, then you sit down to write code. And writing the code is pretty much a straightforward translation of your pseudocode.
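Since the collections library is off the table, the queue itself is usually a small fixed-capacity circular array. A minimal sketch of that building block (the ServiceRequest fields are guesses based on your input description):

// Minimal fixed-capacity circular queue, no java.util collections involved.
public class RequestQueue {
    private final ServiceRequest[] items;
    private int head = 0;   // index of the oldest element
    private int size = 0;

    public RequestQueue(int capacity) { items = new ServiceRequest[capacity]; }

    public boolean isEmpty() { return size == 0; }

    public void enqueue(ServiceRequest r) {
        if (size == items.length) throw new IllegalStateException("queue full");
        items[(head + size) % items.length] = r;
        size++;
    }

    public ServiceRequest dequeue() {
        if (isEmpty()) throw new IllegalStateException("queue empty");
        ServiceRequest r = items[head];
        items[head] = null;
        head = (head + 1) % items.length;
        size--;
        return r;
    }
}

// Assumed shape of a request, based on the input description.
class ServiceRequest {
    int arrivalTime;
    int primaryServiceTime;
    int secondaryServiceTime;
}

With one of these per server type (primary and secondary), the simulation loop from your pseudocode just enqueues arriving requests and dequeues them as servers become free.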

Choosing databasetype for a decentralized calendar project

I am developing a calendar system which is decentralized. It should save the data on each device and synchronize when both have an internet connection. My first idea was to just use a relational database and try to synchronize the data after connecting, but the theory says something else. Brewer's CAP theorem describes the theory behind it, but I am not sure whether this theorem is perhaps outdated. If I apply the theorem, I have an "AP [Availability/Partition Tolerance] system": "A" because I need the calendar data at any given time, and "P" because it can happen that there is no connection between the devices and the data can't be synchronized. The example databases are CouchDB, Riak or Cassandra. I have only worked with relational databases and don't know how to go on now. Is it that bad to use a relational database for my project?
This is for my bachelor thesis. I just wanted to start using Postgres, but then I found this theorem...
The whole project is based on Java.
I think the CAP theorem isn't really helpful for your scenario. Distributed systems that deal with partitions need to decide what to do when one part wants to make a modification to the data but can't reach the other part. One solution is to make the write wait - this is giving up "availability" because of the "partition", one of the options presented by the CAP theorem. But there are more useful options. The most useful (highly available) option is to allow both parts to be written independently and to reconcile the conflicts when they can connect again. The question is how to do that, and different distributed systems choose different approaches.
Some systems, like Cassandra or Amazon's DynamoDB, use "last writer wins": when we see a conflict between two conflicting writes, the last one (according to some synchronized clock) wins. For this approach to make sense you need to be very careful about how you model your data (e.g., watch out for cases where the conflict resolution results in an invalid mixture of two states).
In other systems (and also in Cassandra and DynamoDB - in their "collection" types) writes can still happen independently on different nodes, but there is more sophisticated conflict resolution. A good example is Cassandra's "list": one update can say "add item X to the list" and another "add item Y to the list". If these updates happen on different partitions, the conflict is later resolved by adding both X and Y to the list. A data structure like this list - which allows the content to be modified independently in certain ways on two nodes and then automatically reconciled in a sensible way - is known as a Conflict-free Replicated Data Type (CRDT).
Finally, another approach was used in Amazon's Dynamo paper (not to be confused with their current DynamoDB service!), known as "vector clocks": when you want to write to an object - e.g., a shopping cart - you first read the current state of the object and get with it a "vector clock", which you can think of as the "version" of the data you got. You then make the modification (e.g., add an item to the shopping cart) and write back the new version while saying which old version you started with. If two of these modifications happen in parallel on different partitions, we later need to reconcile the two updates. The vector clocks allow the system to determine whether one modification is "newer" than the other (in which case there is no conflict) or whether they really do conflict. And when they do, application-specific logic is used to reconcile the conflict. In the shopping cart example, if we see that in one partition item A was added to the cart and in the other partition item B was added, the straightforward resolution is to just add both items A and B to the cart.
You should probably pick one of these approaches. Just saying "the CAP theorem doesn't let me do this" is usually not an option ;-) In fact, in some ways the problem you're facing is different from some of the systems I mentioned. In those systems, the common case is that every node is always connected (no partition), with very low latency, and they want this common case to be fast. In your case you can probably assume the opposite: the two parts are usually not connected, or if they are connected there is high latency, so conflict resolution becomes the norm rather than the exception. So you need to decide how to do this conflict resolution: what happens if one adds a meeting on one device and a different meeting on the other device (most likely just keep both as two meetings...)? How do you know that one device modified a pre-existing meeting and didn't add a second meeting (vector clocks? unique meeting ids? etc.), so that the conflict resolution ends up fixing the existing meeting instead of adding a second one? And so on. Once you do that, where you store the data on both partitions (probably completely different database implementations on the client and the server) and which protocol you use to send the updates become implementation details.
There's another issue you'll need to consider: when do we do these reconciliations? In many of the systems I listed above, reconciliation happens on read: if the client wants to read data and we suddenly see two conflicting versions on two reachable nodes, we reconcile. In your calendar application you need a slightly different approach: it is possible that the client will only ever try to read (use) the calendar when not connected, so you need to use the rare opportunities when it is connected to reconcile all the differences. Moreover, you may need to "push" changes - e.g., if the data on the server changed, the client may need to be told "hey, I have some changed data, come and reconcile", so the end user immediately sees an announcement about a new meeting, for example, that was added remotely (perhaps by a different user sharing the same calendar). You'll need to figure out how you want to do this. Again, there is no magic solution like "use Cassandra".
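Just to illustrate the kind of reconciliation logic meant here, a very rough Java sketch: events carry a stable id and a lastModified timestamp, and conflicts are resolved per id with last-writer-wins. The class and field names are made up; a real solution might instead use vector clocks or a CRDT as described above.

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

class CalendarEvent {
    final String id;            // stable, device-independent meeting id
    final String title;
    final Instant lastModified; // used for last-writer-wins
    CalendarEvent(String id, String title, Instant lastModified) {
        this.id = id; this.title = title; this.lastModified = lastModified;
    }
}

class CalendarMerge {
    // Merge the events known to two devices: ids unknown locally are simply added,
    // conflicting ids are resolved by keeping the most recently modified copy.
    static Map<String, CalendarEvent> merge(Map<String, CalendarEvent> local,
                                            Map<String, CalendarEvent> remote) {
        Map<String, CalendarEvent> merged = new HashMap<>(local);
        for (CalendarEvent r : remote.values()) {
            CalendarEvent l = merged.get(r.id);
            if (l == null || r.lastModified.isAfter(l.lastModified)) {
                merged.put(r.id, r);
            }
        }
        return merged;
    }
}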

Some questions regarding architecture/design for this use case

My application needs to work as middleware: it gets orders (in the form of XML) from various customers, each containing a supplier id. Once it gets an order, it needs to send an order request to the different suppliers, also as XML. I am double-minded about three aspects of it. Here they are:
Questions:
What I am planning at a high level is: as soon as a request comes in, put it on a JMS queue. (Now I am not sure whether I should create a queue for each supplier or whether one queue would be sufficient. I think one queue will be sufficient, as maintaining a large number of queues will be overhead.) The advantage of maintaining a separate queue per supplier is that messages can be processed faster, as there will be a separate producer on each queue.
Before putting the object on the queue I need to do some business validations. Also, the structure of the input XML I receive and the output XML I need to send to the supplier are different. For this I am planning to convert the input XML to a Java object and then put it on the queue, so that validation can be done with ease on the consumer side. Another thought is: don't convert the XML into a Java object, just get all the element values through the XPath/XStream API, validate them, and put the XML string as-is on the queue; then on the consumer side convert the XML to a Java object and then to the different XML format. Is there a better way of doing it?
Now my requirement is that the consumer on the queue processes the messages every 5 hours and sends the XML requests to the suppliers. I am planning to use the Quartz scheduler here, which will pick the jobs one by one and send each to the corresponding supplier based on supplierId. My question is: if my job picks the messages one by one and then sends them to the supplier, it will be too slow. I am planning to handle this by having the Quartz job create a thread pool of, say, ten threads that concurrently process the messages from the queue (so there will be multiple consumers on the queue; I think that's not valid for a queue. Do I need a topic here instead of a queue?). Is the second approach better, or is there something better than this?
I am expecting a load of 50k requests per hour, which means around 14 requests per second.
Your basic requirements are:
Get orders from customers as XML (you have not said how you receive them).
Do basic business validation.
Send the orders to the suppliers.
And you are expecting 50k requests per hour (you haven't provided an approximate order size).
Assuming an average order size of 10 KB, around 500 MB would be required just to hold one hour's worth of orders in the queue (irrespective of the number of queues). I am not sure which environment you are running in.
For Point #1
I would choose a single queue instead of multiple queues.
- Choose the appropriate persistent store.
I am assuming you would use a distributed queue, so that it can easily scale when adding cluster nodes.
For Point #2
I would convert into a POJO (your own intermediate format) and perform the business validation on that, so that later, if you want to extend the business validation to a rule engine or add other conversions, it is easy to maintain.
- Basically: take the input in any form (XML / POJO / JSON ...), convert it into a middle format (you can write a custom validator / conversion utility on top of the middle format), and keep mappings between the common format and the input as well as the output, so that you can write formatters and reuse them. Changing the format for any specific supplier then has no impact on the rest. Try to externalize the format mapping.
For Point #3
In your case an order needs to be processed only once, so I would go with a queue, and you can have multiple message listeners. Message listeners deliver orders asynchronously, so you can have multiple listeners on a single queue, and each listener runs in a separate thread (see the sketch below).
Is there a problem with sending the orders as soon as they are received? It would be good for you as well as the suppliers to avoid a heavy load at a particular time.
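A bare-bones sketch of several MessageListeners on one queue, assuming plain JMS with ActiveMQ as the provider; the broker URL, queue name and pool size of 10 are placeholders:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class OrderQueueConsumers {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();

        // One session + consumer per listener; each listener is driven on its own thread
        // by the provider, and every order message is delivered to exactly one of them.
        for (int i = 0; i < 10; i++) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue orders = session.createQueue("supplier.orders");
            MessageConsumer consumer = session.createConsumer(orders);
            final int worker = i;
            consumer.setMessageListener(message -> {
                try {
                    String xml = ((TextMessage) message).getText();
                    System.out.println("worker " + worker + " sending order to supplier: " + xml);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        connection.start();   // begin asynchronous delivery
    }
}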
Since you are the middleware, you should handle the data quickly at the point of contact, to keep your hands free for more incoming requests. Therefore you must find a way to distinguish the incoming data as quickly and with as little memory as possible. Leave the processing of the data to modules more specific to the problem: a receptionist just directs the guests to the right spot.
If you really have to read and understand the received data in your specialized worker later on, use a thread pool. This way you can process the data in parallel without worrying too much about running out of memory. Just choose your pool size smartly and use only one pool. You could use a listener pattern to signal new incoming data to the worker multiton. You should avoid JAXB, or better, avoid completely deserializing the data if possible: it eats up memory like hell.
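The thread-pool hand-off could look roughly like this; the class and method names are invented for the sketch:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderDispatcher {
    // One shared, bounded pool; the receiving thread only hands work off and returns.
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    public void onIncomingOrder(String rawXml) {
        pool.submit(() -> processOrder(rawXml));
    }

    private void processOrder(String rawXml) {
        // validation, transformation to the supplier format, and dispatch would go here
    }
}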
I would not use a JMS topic here, because your "messages" are relevant to only one listener.
If possible, send the outgoing message to the supplier as soon as the worker is done with its work. If not, use a persistent store. That way you can later prove that you processed the data, and if something goes wrong or you have to update your software, you do not have to worry about volatile data.

JMS Topic message size

Our application uses a topic to push messages to a small set of subscribers. What sort of things should I look for when modeling a JMS message with respect to the size of the actual message to be pushed? Are there any known limits, or is it application-server specific? Any best practices or suggestions on this topic (pun unintended)?
You are likely to hit practical limits before you hit technical ones. That is, message lengths may have technical limits in the lengths that can be expressed in an int or long, but that's unlikely to be the first constraint you hit.
Message lengths up in the megabytes tend to be heavyweight. Think in terms of a few KB as the sort of ballpark you want to be in.
A technique sometimes used is to send small messages saying "Item 12345 has been updated"; consumers then go and retrieve the data associated with item 12345 from a database or other storage mechanism. That way each client gets only the data it needs, and we don't spray large chunks of data around when subscribers may not need all of it.
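That "notify and fetch" idea in code form, a sketch assuming ActiveMQ and a topic named item.updates (both placeholders); the subscriber's side would read the id and look the record up in its own database:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class ItemUpdateNotifier {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic updates = session.createTopic("item.updates");
            MessageProducer producer = session.createProducer(updates);

            // Publish only a reference to what changed, not the payload itself.
            // Subscribers that care about item 12345 fetch the full record from
            // their database (or cache) instead of receiving it over the wire.
            producer.send(session.createTextMessage("item-12345"));
        } finally {
            connection.close();
        }
    }
}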
I suggest you check the book Enterprise Integration Patterns, where a lot of patterns dealing with issues like the one you are asking about are exhaustively analyzed. In particular, if the size of your message is large, you can use a Message Sequence to solve the problem:
Huge amounts of data: Sometimes applications want to transfer a really large data structure, one that may not fit comfortably in a single message. In this case, break the data into more manageable chunks and send them as a Message Sequence. The chunks have to be sent as a sequence, and not just a bunch of messages, so that the receiver can reconstruct the original data structure.
Quoted from http://www.eaipatterns.com/MessageConstructionIntro.html
The home page for a brief description of each pattern of the book is available at http://www.eaipatterns.com/index.html
It is implementation specific. As you might expect, smaller is better; Tibco, for instance, recommends keeping message sizes under 100 KB.
Small messages are faster, obviously. Having said that, the underlying JMS server implementation may improve performance using, for instance, message compression, as WebLogic 9 does. (http://download.oracle.com/docs/cd/E13222_01/wls/docs92/perform/jmstuning.html#wp1150012)
