I am considering using Apache Kafka as a distributed message publisher to many subscribers. It is the perfect fit for me, since the solution has to scale easily.
Kafka's documentation states that messages can be acknowledged, which ensures delivery. However, today I came across an article stating that there are scenarios in which messages may be lost. Then again, the article is only available in the Google cache, so I do not know whether it is trustworthy.
So I have one question: is there any moment, any scenario, in which a message will be lost? In other words, my main requirement is that each message must reach its destination. Can that be met by using Apache Kafka? Is it the right tool for this job?
The original of the article you are looking for is here: http://engineering.onlive.com/2013/12/12/didnt-use-kafka/
If you read the full article and the comments, you'll see that much of the concern is not about the guarantee of at-least-once delivery, but about whether the message was delivered AND successfully processed by the client. The last couple of comments on the article, including one by the original author, seem to indicate he's satisfied with the approach.
You might also find this article of interest - similar concerns:
https://www.mail-archive.com/users%40kafka.apache.org/msg04492.html
And from some of the documentation:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
Most of the conversations I've seen are not about the guarantee of at least once, but how to go from there to at most once or to exactly once.
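To make the difference concrete, here is a minimal sketch of the two commit orders described in the quoted documentation. It is a toy in-memory model, not the Kafka client API: the class name, the `consume` method, and the single simulated crash are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a consumer reading a log by offset; this is not the Kafka client API. */
final class DeliverySemanticsSketch {

    /**
     * Consumes the log with one simulated crash at offset crashAt, after which the
     * consumer restarts from the last committed offset.
     * commitFirst = true models at-most-once; false models at-least-once.
     */
    static List<String> consume(List<String> log, boolean commitFirst, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;       // offset to resume from after a restart
        boolean crashed = false; // the crash happens only once
        int pos = committed;
        while (pos < log.size()) {
            if (commitFirst) {
                committed = pos + 1;               // commit before processing
                if (!crashed && pos == crashAt) {  // crash: this message is never processed
                    crashed = true;
                    pos = committed;
                    continue;
                }
                processed.add(log.get(pos));
            } else {
                processed.add(log.get(pos));       // process before committing
                if (!crashed && pos == crashAt) {  // crash: the commit never happens
                    crashed = true;
                    pos = committed;
                    continue;
                }
                committed = pos + 1;
            }
            pos = committed;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("m0", "m1", "m2");
        System.out.println("commit first:  " + consume(log, true, 1));
        System.out.println("process first: " + consume(log, false, 1));
    }
}
```

With the crash at offset 1, committing first skips `m1` entirely, while processing first handles `m1` twice. That is why at-least-once consumers usually need to tolerate duplicates.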
Kafka does claim that
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.
It might be worth reading the Message Delivery Semantics section of their documentation for a better understanding.
Does ActiveMQ support an idempotent producer? I know Camel has an idempotent consumer pattern to detect and handle duplicate messages, but I'm wondering if duplicates can be prevented at the source (the producer).
Here is a little background. I have horizontally scaled applications accessing the same database. One particular table maintains the status of a particular process. These application instances should all be able to read the status and invoke another process, but only one of them should actually invoke it. Each instance periodically polls the database and posts a message to a messaging broker once the required condition is met. I want only one of the load-balanced instances to post the message.
One crude approach I'm thinking is...
On Machine 1:
1. Read the database to check whether the necessary condition is met.
2. Before posting the message to the broker, write a record with a unique key that identifies the process to another status table, and commit. If this operation fails with a unique key constraint violation, the process on another machine has already succeeded in posting the message.
3. Post the message to the broker.
4. If posting the message fails for some reason, delete the record from the status table by its unique/primary key.
The same steps can be performed by the same application running on machines 2, 3, 4, etc.
Below is one pitfall I quickly noticed with this approach.
Assume that Machine 1 completes step 2 but fails at step 3 and continues with step 4. Meanwhile Machine 2, having failed at step 2, will move on without attempting to read the status again and post the message.
To address this, I need to retry step 3 until the message is successfully posted to the broker.
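The claim-then-publish steps with a retry on step 3 could be sketched roughly as follows. This is only an in-memory model: the concurrent set stands in for the status table's unique key constraint, the `Predicate` stands in for the broker send, and every name here (`ClaimThenPublishSketch`, `tryPublish`, `maxAttempts`) is hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** In-memory stand-in for the status table and the broker; all names are hypothetical. */
final class ClaimThenPublishSketch {

    // Stands in for the status table with a unique key constraint.
    private final Set<String> claimedKeys = ConcurrentHashMap.newKeySet();

    /**
     * Returns true if this caller claimed the key and published the message.
     * The publisher returns false when a send attempt fails.
     */
    boolean tryPublish(String processKey, Predicate<String> publisher, int maxAttempts) {
        // Step 2: insert the claim; add() returning false models a unique key violation.
        if (!claimedKeys.add(processKey)) {
            return false; // another instance already owns this process
        }
        // Step 3 with retry: hold the claim while the send is retried.
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (publisher.test(processKey)) {
                return true;
            }
        }
        // Step 4: release the claim only once the retries are exhausted.
        claimedKeys.remove(processKey);
        return false;
    }
}
```

In a real deployment the set would be the database table, and a crash between steps 2 and 3 would still leave a stale claim behind, so some expiry or cleanup of old claims would probably be needed as well.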
Another option is to use the idempotent consumer pattern: https://camel.apache.org/components/latest/eips/idempotentConsumer-eip.html. But this is essentially a filter on the consumer side. Though it would serve my purpose, is there a similar out-of-the-box approach available on the message publishing side?
I wonder whether this approach is even correct, whether there is a better alternative, or whether any existing libraries can provide a locking mechanism across JVMs, either local or remote.
It's not clear what version of ActiveMQ you're using (i.e. ActiveMQ 5.x or ActiveMQ Artemis) so I'll try to address this issue for both.
ActiveMQ 5.x doesn't have any built-in support for detecting duplicates sent from clients. However, you could potentially implement this feature using a broker plugin. The only challenge I see here is configuring, managing, and monitoring the cache of duplicate IDs.
ActiveMQ Artemis does have built-in support for detecting duplicates sent from clients. You can read more about duplicate detection in the documentation. Since the broker supports this behavior natively, it provides clean configuration, management, and monitoring.
In either case you'll need to set a special header on each message with "a unique key that identifies the process" just like you would for your potential database solution. Furthermore, using the broker as the duplicate detector is much simpler overall.
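To give a feel for what a cache of duplicate IDs involves, here is a minimal sketch of a bounded one. It is a toy model and not the ActiveMQ 5.x plugin API or the Artemis implementation; the class name, the `accept` method, and the eviction policy (drop the oldest ID once a fixed capacity is exceeded) are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy bounded cache of duplicate IDs; not the ActiveMQ or Artemis implementation. */
final class DuplicateIdCacheSketch {

    private final Map<String, Boolean> seen;

    DuplicateIdCacheSketch(final int capacity) {
        // Insertion-ordered map that evicts the oldest ID once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true if the message should be accepted, false if its ID is still cached. */
    synchronized boolean accept(String duplicateId) {
        return seen.putIfAbsent(duplicateId, Boolean.TRUE) == null;
    }
}
```

The capacity is the main tuning knob: once an ID has been evicted, a late duplicate carrying that ID is accepted again, so the cache has to be large enough to cover the window in which a resend can happen.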
If you're currently using ActiveMQ 5.x but want to move to ActiveMQ Artemis in order to use the duplicate detection feature you don't necessarily need to update your clients as ActiveMQ Artemis fully supports the OpenWire protocol used by 5.x clients. You should just be able to point them to the new instance of ActiveMQ Artemis and have everything work.
We are developing an application in Java. We will use JMS to listen to messages arriving on MQ. We are expecting around 100K messages per day from approximately 100 users (each message approximately 1400 characters long). How many listeners would be good to have for this scenario? What I am trying to find out is how many messages a JMS listener can process per unit of time. An approximate number is enough for now. Is there documentation where we can find this information?
You have to look at two things here: server performance and client performance.
Major JMS providers (HornetQ, ActiveMQ, etc.) can easily handle 5000+ messages per second, so you are covered on that side (if you want more information have a look at the SPECjms2007 results).
Client performance depends on the computing power of your clients (obviously) and on what you want to do with those messages. Technically, there isn't a limit on how many messages a client can process. My experience is that message marshalling/unmarshalling is a huge factor, so as a rough estimate you can assume that your client can handle about the same message load as your server, assuming equally powerful machines and light processing of message content.
In the end you will want to do some load testing.
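For a sense of scale, here is a quick back-of-the-envelope calculation using the numbers from the question (100K messages per day, about 1400 characters each). The 8-hour window is purely an assumption to model traffic concentrated in a working day, and the class and method names are made up for this sketch.

```java
/** Back-of-the-envelope load estimate; the 8-hour window is an assumption. */
final class LoadEstimateSketch {

    /** Average rate if messages are spread evenly over 24 hours. */
    static double averagePerSecond(long messagesPerDay) {
        return messagesPerDay / 86_400.0;
    }

    /** Average rate if all messages arrive within the given number of hours. */
    static double perSecondWithin(long messagesPerDay, double hours) {
        return messagesPerDay / (hours * 3_600.0);
    }

    /** Total characters transferred per day. */
    static long charsPerDay(long messagesPerDay, int charsPerMessage) {
        return messagesPerDay * charsPerMessage;
    }

    public static void main(String[] args) {
        long perDay = 100_000;
        System.out.printf("evenly spread: %.2f msg/s%n", averagePerSecond(perDay));
        System.out.printf("within 8 hours: %.2f msg/s%n", perSecondWithin(perDay, 8.0));
        System.out.printf("volume: %d chars/day%n", charsPerDay(perDay, 1_400));
    }
}
```

Either way the rate is a few messages per second at most, which is orders of magnitude below the 5000+ messages per second mentioned above, so the listener count will be driven by how long each message takes to process rather than by the broker.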
We have RV messaging systems publishing and receiving messages. Recently some underlying jars were upgraded; these are serialization jars used by all publishers and subscribers. However, it seems that some of the publishers are still referencing old versions of the serialization jars, and therefore the receivers fail when trying to deserialize received messages.
Obviously, restarting these publisher services should fix the problem. However, how do I identify all publishers that send messages to a particular topic? Is there some RV admin way of listing all the processes that are publishing to a given topic?
I just gave a similar answer on another question:
There is a really great tool for this called Rai Insight
Basically, it can sit on a box, silently listen to all the multicast data, and present statistics, even in real time. We used it to monitor traffic flow spikes with just a few seconds' delay.
It can give you traffic statistics broken down by multicast group, service number, or even sending machine: traffic flow peak/average, retransmission rate peak/average, anything you can think of.
It will also give you per-service per-topic information.
We are running a high-throughput system that uses tibco-ems JMS to pass large numbers of messages between our main server and our client connections. We've gathered some statistics and have determined that JMS is causing a lot of latency. How can we make tibco JMS more performant? Are there any resources that give a good discussion of this topic?
Using non-persistent messages is one option if you don't need persistence.
Note that even if you do need persistence, sometimes it's better to use non-persistent messages and, in case of a crash, perform a different recovery action (like resending all messages).
This is relevant if:
crashes are rare (as the recovery takes time)
you can easily detect a crash
you can handle duplicate messages (you may not know exactly which messages were delivered before the crash)
EMS also provides some mechanisms that are persistent, but less bulletproof than classic guaranteed delivery.
These include:
instead of "exactly once" message delivery you can use "at least once" or "at most once" delivery.
you may use the pre-fetch mechanism, which causes the client to fetch messages into memory before your application requests them.
EMS should not be the bottleneck. I've done testing and we have gotten a shitload of throughput on our server.
You need to try to determine where the bottleneck is. Is the problem in the producer of the message or the consumer? Are messages piling up on the queue?
What type of scenario are you running?
Pub/sub or request/reply?
Are temporary queues piling up? Too many temporary queues can cause performance issues (mostly when they linger because you didn't close something properly).
Are you publishing to a topic with durable subscribers? If so, try bridging the topic to queues and reading from those. Durable subscribers can cause a little hiccup in performance too, since the server needs to track who has copies of all messages.
Ensure that your sending process has one session and makes multiple calls through that session. Don't open a new session for each operation. Re-use where possible. Do the same for the consumer.
Make sure you CLOSE when you are done. EMS doesn't clean things up. So if you make a connection and just close your app, the connection is still there, sucking up resources.
Review your tolerance for lost messages in the event of a crash. If you are doing client ack and it doesn't matter if you crash while processing the message, then switch to auto. Also, I believe that if you are using TEMS (Tibco EMS for WCF) there's a problem with the session acknowledge: a message is only acknowledged once the whole message has been processed. We switched from client ack to the dups-ok mode and it worked better.
Using HornetQ (in JBoss AS 6.0), I would like to set up a JMS topic to which multiple clients can subscribe.
A producer periodically sends a message to this topic with a reply-to destination, to which all subscribers should reply.
The problem I'm having is that I'm not entirely sure how to check that all subscribers have indeed replied.
One solution could be that each subscriber, after subscribing, first sends a message to the topic with its details (perhaps some GUID). The producer remembers these details and uses them later to check whether all subscribed clients have replied.
However, rather than reinventing the wheel myself, I would like to use something that already exists. This seems like a standard problem, but I could not find any existing solution.
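The registration idea above could be sketched on the producer side roughly like this. It is a plain in-memory model with no JMS or HornetQ calls; the class name `ReplyTrackerSketch` and its methods are hypothetical, and in practice `register` and `onReply` would be driven by messages arriving on the topic and the reply-to destination.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Producer-side bookkeeping of registered subscribers and their replies (hypothetical). */
final class ReplyTrackerSketch {

    private final Set<String> registered = ConcurrentHashMap.newKeySet();
    private final Set<String> replied = ConcurrentHashMap.newKeySet();

    /** Called when a subscriber announces itself with its GUID. */
    void register(String subscriberId) {
        registered.add(subscriberId);
    }

    /** Called just before the producer sends a new request to the topic. */
    void startRound() {
        replied.clear();
    }

    /** Called for each reply; replies from unknown subscribers are ignored. */
    void onReply(String subscriberId) {
        if (registered.contains(subscriberId)) {
            replied.add(subscriberId);
        }
    }

    /** Registered subscribers that have not replied in the current round, sorted. */
    Set<String> missing() {
        Set<String> result = new TreeSet<>(registered);
        result.removeAll(replied);
        return result;
    }

    boolean allReplied() {
        return missing().isEmpty();
    }
}
```

A real version would also need a timeout per round and some way to deregister subscribers that have gone away, otherwise a single dead client keeps `allReplied()` false forever.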
You could use durable subscriptions, and then query the subscriptions and messages.
See http://hornetq.sourceforge.net/docs/hornetq-2.0.0.BETA5/user-manual/en/html/management.html#d0e5742
Note that usage of durable subscriptions and persistent messages will incur a performance penalty. You'll have to gauge the severity of the performance impact according to your specific needs.
JMS itself doesn't support this, it's too simple. If you didn't mind coupling your code to HornetQ, then you could use its native API to find out this stuff. Not ideal, but it's well written and has readable source code, so it wouldn't be too hard.