Deleting Kafka messages after consumers' receipt

In my Kafka Java project I want to delete messages as soon as all interested consumers have received them. After some research I found some old Stack Overflow questions: here, one more and here. After reading them, I have some questions.
As far as I understand, I should rely on retention, either by time or by size. However, the answers are old, so maybe something has changed? Is there any other way to ensure that messages are deleted right after all currently connected consumers have read them? In that case I would need to check whether all consumers have read the message. Would I need a consumer group for that?
Thank you in advance.

You don't need to worry about when the messages will be deleted or compacted to achieve your goal.
As far as your consumers are concerned, if they commit the offset of every message they consume, those processed messages are dead from the consumers' point of view, regardless of how long they stay in the topic.
Note:
An administrative user can reset your consumer so that it reads again from the beginning, including the messages your consumers have already moved on from. But that is a manual admin operation.
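To make the point about committed offsets concrete, here is a minimal sketch. It is a toy in-memory model of a single-partition log, not the Kafka client API; the class and method names (ToyLog, pollAndCommit, resetToBeginning) are made up for illustration. It only shows that records stay in the log while each group's committed offset decides what that group still sees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of a single-partition log; NOT the Kafka client API.
final class ToyLog {
    private final List<String> records = new ArrayList<>();
    // Committed offset per consumer group: the next offset that group will read.
    private final Map<String, Integer> committed = new HashMap<>();

    void append(String value) {
        records.add(value);
    }

    // Returns everything the group has not yet committed, then commits past it.
    List<String> pollAndCommit(String group) {
        int from = committed.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(records.subList(from, records.size()));
        committed.put(group, records.size());
        return batch;
    }

    // Models the manual admin operation: rewind a group to the start of the log.
    void resetToBeginning(String group) {
        committed.put(group, 0);
    }

    // Number of records still stored, regardless of what any group has read.
    int size() {
        return records.size();
    }
}
```

In this model nothing is ever deleted by reading: a group that has committed simply never gets the old records again unless someone rewinds its offset.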

Related

Broker disk usage after topic deletion

I'm using Apache Kafka. I dump huge databases into Kafka, where each database table is a topic.
I cannot delete a topic before it is completely consumed. I cannot set a time-based retention policy because I don't know when a topic will be consumed. I have limited disk space and too much data. I have to write code that orchestrates consumption and deletion programmatically. I understand that the problem appears because we're using Kafka for batch processing, but I can't change the technology stack.
What is the correct way to delete a consumed topic from the brokers?
Currently, I'm calling kafka.admin.AdminUtils#deleteTopic, but I can't find clear documentation for it. The method signature doesn't contain Kafka server URLs. Does that mean that I'm deleting only the topic's metadata and the brokers' disk usage isn't reduced? If so, when does the actual deletion of the log files happen?

Instead of using a time-based retention policy, are you able to use a size-based policy? log.retention.bytes is a per-partition setting that might help you out here.
I'm not sure how you'd want to determine that a topic is fully consumed, but calling deleteTopic against the topic initially marks it for deletion. As soon as there are no consumers/producers connected to the cluster and accessing those topics, and if delete.topic.enable is set to true in your server.properties file, the controller will then delete the topic from the cluster as soon as it is able to do so. This includes purging the data from disk. It can take anywhere between a few seconds and several minutes to do this.

JMS taking too long to process messages

An application has a JMS queue responsible for delivering audit logs. The application sends logs to the JMS queue, and this queue is consumed by an MDB.
However, the messages sent are big XML files that vary from 20 MB to 100 MB. The problem is that the JMS queue takes too long to consume the messages, leading to an OutOfMemory error.
What should I do to solve this problem?

This answer may or may not help jguilhermemv; I just want to share an idea, a workaround for big messages, for anyone who reads this post.
The first thing is to try not to send messages that are too big. We have two options (both require implementation changes, so they can be done at the start or, if system implementation changes are allowed, at a later stage):
Save the log in a DB and send just the log IDs in the JMS messages. (Saving logs in a DB is not recommended, as the size and the time to save will again become a problem at a later stage.)
Save the logs as files at a common location, store the file names in a DB, and share those file name IDs via JMS. After consuming a message, the consumer can then read that log file.
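The second option could look roughly like the sketch below. It covers only the file part: the producer streams the XML into a shared directory and gets back a file name, which would then travel in a small JMS message. The class name LogFileStore, the random file naming, and the omission of the DB lookup table are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Sketch of the file-based option: the payload goes to shared storage and only
// a short reference is sent through JMS. All names here are hypothetical.
final class LogFileStore {
    private final Path sharedDir;

    LogFileStore(Path sharedDir) {
        this.sharedDir = sharedDir;
    }

    // Producer side: stream the XML into the shared directory and return the
    // file name, which would be the body of a small JMS text message.
    String save(InputStream xml) throws IOException {
        String fileName = UUID.randomUUID() + ".xml";
        Files.copy(xml, sharedDir.resolve(fileName));
        return fileName;
    }

    // Consumer side: open the referenced file as a stream, so a 100 MB log
    // never has to be held in memory as a single message.
    InputStream open(String fileName) throws IOException {
        return Files.newInputStream(sharedDir.resolve(fileName));
    }
}
```

Streaming in both directions is the design choice that matters here: the JMS message stays a few bytes long, and the consumer can parse the XML incrementally from the file.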

create a kind of personal aggregation of all chat, group messages and different pubsub publications in xmpp

Sorry for a possibly noobish question and for my English. I want to create a personal aggregation of all messages (chat, group) and posts (from pubsub services) with my XMPP client, e.g. new private messages and posts from different pubsubs would be aggregated in one place (read and unread messages). Furthermore, is it possible to receive this aggregated stream of posts on different resources (even if some of the messages have been read on one device but not on the others)?
Is that possible with XMPP? Do I have to create a dedicated personal (user) pubsub to which I forward (publish) all the messages (or a kind of web service for this, with access to an "inbox" table to store the messages)? Then whichever client of mine goes online first would collect the private messages and posts from the different pubsubs and forward them to the dedicated pubsub (or web service), from which my other resources would get the messages, because all the clients are also subscribed to the dedicated pubsub. Is my thinking right? I hope what I'm writing here isn't all rubbish.
Or is there a XEP for this?
Please help.

To notify and monitor other clients on different devices, and at the same time track which messages are marked as unread on each client, you will need to write quite a lot of boilerplate code.
For sure you will need a centralized web service which will receive the post streams (either in parallel with your client/s or first it will receive them and then send to the client/s). Pub/sub is suitable for this application but you will also need to send some additional data to the service from your clients like the time stamp of the last read message (in order to mark all newer as unread).
I think the easiest way would be to use the webservice as a gateway where all streams will be directed initially and where you can also monitor what is delivered and to which client.
Hope it helped

How can we save Java Message Queues for reference?

How can we keep track of every message that gets into our Java Message Queue? We need to save the message for later reference. We already log it into an application log (log4j) but we need to query them later.

You can store them
in memory - in a collection or in an in-memory database
in a standalone database

You could create a database logging table for the messages, storing the message as is in a BLOB column, the timestamp that it was created / posted to the MQ and a simple counter as primary key. You can also add fields like message type etc if you want to create statistical reports on messages sent.
Cleanup of the table can be done simply by deleting all messages older than the retention period, using the timestamp column.
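As a rough illustration of that table and its cleanup, here is an in-memory stand-in. A real version would use a database table through JDBC; the names MessageLog, Row, and purgeOlderThan are hypothetical, and the list only mimics the three columns described above (counter key, timestamp, raw message).

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for the logging table; a real version would use JDBC.
final class MessageLog {
    // One row: simple counter as primary key, creation timestamp, raw message.
    static final class Row {
        final long id;
        final long createdAtMillis;
        final byte[] payload;

        Row(long id, long createdAtMillis, byte[] payload) {
            this.id = id;
            this.createdAtMillis = createdAtMillis;
            this.payload = payload;
        }
    }

    private final List<Row> rows = new ArrayList<>();
    private long nextId = 1;

    // Stores the message as is and returns the generated key.
    long insert(byte[] payload, long createdAtMillis) {
        long id = nextId++;
        rows.add(new Row(id, createdAtMillis, payload.clone()));
        return id;
    }

    // Plays the role of: DELETE FROM message_log WHERE created_at < cutoff.
    // Returns the number of rows removed.
    int purgeOlderThan(long nowMillis, long retentionMillis) {
        long cutoff = nowMillis - retentionMillis;
        int before = rows.size();
        rows.removeIf(row -> row.createdAtMillis < cutoff);
        return before - rows.size();
    }

    int count() {
        return rows.size();
    }
}
```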

I implemented such a solution in the past, we chose to store messages with all their characteristics in a database and developed a search, replay and cancel application on top of it. This is the Message Store pattern:
(source: eaipatterns.com)
We also used this application for the Dead Letter Channel.
(source: eaipatterns.com)
If you don't want to build a custom solution, have a look at the ReplayService for JMS from CodeStreet.

The best way to do this is to use whatever tracing facility your middleware provider offers. Or possibly, you could set up an intermediate listener whose only job was to log messages and forward on to your existing application.
In most cases, you will find that the middleware provider already has the ability to do this for you with no changes or awareness by your application.

I would change the queue to a topic, and then keep the original consumer that processes the messages, and add another consumer for auditing the messages to a database.
Some JMS providers cater for topic-to-queue-bridge definitions, the consumers then receive from their own dedicated queues, and don't have to read past messages that are left on the queue due to other consumers being inactive.
Alternatively, you could write a log4j appender, which writes your logged messages to a database.

Ensuring serial processing of JMS messages in an OC4J cluster

We have an application that processes JMS messages using a message-driven bean. This application is deployed on an OC4J application server (10.1.3).
We are planning to deploy this application on multiple OC4J application servers that will be configured to run in a cluster.
The problem is with JMS message processing in this cluster. We must ensure, that only a single message is being processed in the entire OC4J cluster at a single time. This is required, since the messages have to be processed in chronological order.
Do you know of a configuration parameter, that would control message processing across an OC4J cluster?
Or do you think we have to implement our own synchronisation code that will synchronise the message driven beans across the cluster?

I've done sequential processing of messages in a cluster on a pretty large scale - 1.5 million+ message/day, using a combination of the Competing Consumers pattern and a Lease pattern.
Here's the kicker, though - your requirement that you can only process one trans at a time is going to keep you from achieving your goals. We had the same basic requirement - messages had to be processed in order. At least, we thought we did. Then we had an epiphany - as we gave the problem more thought, we realized that we didn't require total ordering. We actually required ordering only within each account. Therefore, we could distribute the load across the servers in a cluster by assigning ranges of accounts to different servers in the cluster. Then, each server was responsible to process messages for a given account in order.
Here's the second clever part - we used a Lease pattern to dynamically assign account ranges to various servers in the cluster. If one server in the cluster went down, another would grab the lease and take over the first server's responsibility.
This worked for us, and the process lived in production for about 4 years before being replaced due to a company merger.
Edit:
I explain this solution in more detail here: http://coders-log.blogspot.com/2008/12/favorite-projects-series-installment-2.html
Edit:
Okay, gotcha. You're already doing the processing at the level you need, but since you're being deployed to a cluster, you need to make sure that only one instance of your MDB is actively pulling messages from the queue. Plus, you need the simplest workable solution.
You don't need to abandon your MDB mechanism that you have now, I don't think. Essentially what we're talking about here is a requirement for a distributed lock mechanism, not to put too fancy a phrase to it.
So, let me suggest this. At the point where your MDB registers to receive messages from the queue, it should check the distributed lock, and see if it can grab it. The first MDB to grab the lock wins, and only it will register to receive messages. So, now you have your serialization. What form should this lock take? There are many possibilities. Well, how about this. If you have access to a database, its transactional locking already provides some of what you need. Create a table with a single row. In the row is the identifier of the server that currently holds the lock, and an expiration time. This is the server's lease. Each server needs to have a way to generate its unique identifier, perhaps the server name plus a thread ID, for example.
If a server can get update access to the row, and the lease is expired, it should grab it. Otherwise, it gives up. If it grabs the lease, it needs to update the row with a time in the near future, like five minutes or so, and commit the update. The active server should update the lease before it expires. I recommend updating it when there's half the time remaining, so, every 2-1/2 minutes if the lease expires in five. With this, you now have failover. If the active MDB dies, another MDB (and only one) will take over.
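The acquire-and-renew rule can be sketched as follows. This is an in-memory stand-in for the single-row table, with the clock passed in as a parameter; the names LeaseTable and tryAcquire are made up for illustration. In a real deployment the check and update would have to run inside one database transaction on the locked row, which synchronized only imitates here.

```java
// In-memory stand-in for the single-row lease table. In a real deployment the
// check-then-update below would run in one database transaction on the locked
// row; the synchronized keyword plays that role in this sketch.
final class LeaseTable {
    private String holder;         // server that currently holds the lease, or null
    private long expiresAtMillis;  // time at which the current lease expires

    // Takes the lease if it is free or expired, or renews it if serverId
    // already holds it. Returns false if another server holds a live lease.
    synchronized boolean tryAcquire(String serverId, long nowMillis, long leaseMillis) {
        boolean freeOrExpired = holder == null || nowMillis >= expiresAtMillis;
        if (freeOrExpired || serverId.equals(holder)) {
            holder = serverId;
            expiresAtMillis = nowMillis + leaseMillis;
            return true;
        }
        return false;
    }

    synchronized String holder() {
        return holder;
    }
}
```

With a five-minute lease, the active server would call tryAcquire again roughly every two and a half minutes to renew, while dormant servers call it occasionally and only start consuming when it returns true.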
That should be pretty straightforward, I think. Now, you want to have the dormant MDBs check the lock occasionally to see if it's freed up.
So, the active MDB and the dormant MDBs all have to do something periodically. You might have them spawn a separate thread to do this. Many application engine vendors won't be happy if you do this, but adding one thread is no big deal, especially since it spends most of its time sleeping. Another option would be to tie into the timer mechanism that many engines provide, and have it wake up your MDB periodically to check the lease.
Oh, and by the way - make sure the server admins employ NTP to keep the clocks reasonably synced.

First point: this is a pretty crappy design and you'll seriously limit performance only being able to process a single message at a time. I assume you are clustering only for fault tolerance, because you won't get performance improvements?
Are you using the default JMS implementation with OC4J or another one?
I've used IBM's MQ in the past and that had a feature that a queue could be marked as exclusive, which meant only one client could connect to it. This would appear to offer what you want.
An alternative would be to introduce a sequence ID (as simple as an incrementing counter); the client processing the message would check that the sequence ID is the next expected value and, if not, put the message back. This approach requires the different clients to persist the last valid sequence ID they've seen in some centrally shared data store, such as a database.
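A minimal sketch of that check is shown below. The counter is kept in memory for brevity, whereas the approach described above needs it in a shared store such as a database so that every client in the cluster sees the same value. The names SequenceGate and tryAccept are hypothetical.

```java
// Sketch of the sequence-ID check. The counter lives in memory here; across a
// cluster it would have to be kept in a centrally shared store instead.
final class SequenceGate {
    private long nextExpected;

    SequenceGate(long firstSequenceId) {
        this.nextExpected = firstSequenceId;
    }

    // Returns true if the message is the next one in order and may be
    // processed now; false means the caller should put it back on the queue.
    synchronized boolean tryAccept(long sequenceId) {
        if (sequenceId != nextExpected) {
            return false;
        }
        nextExpected++;
        return true;
    }
}
```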

I agree with stevendick: maybe you're off track with the design. Regarding sequence IDs or similar approaches, I suggest you get some insight into messaging architectures from Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions (by Gregor Hohpe and Bobby Woolf). It's a great book with plenty of useful patterns... I'm sure the forces and the problem you are facing are well described there.
