Broker disk usage after topic deletion - java

I'm using Apache Kafka. I dump huge databases into Kafka, where each database table is a topic.
I cannot delete a topic before it is completely consumed. I cannot set a time-based retention policy because I don't know when a topic will be consumed. I have limited disk space and too much data, so I have to write code that orchestrates consumption and deletion programmatically. I understand that the problem appears because we're using Kafka for batch processing, but I can't change the technology stack.
What is the correct way to delete a consumed topic from the brokers?
Currently, I'm calling kafka.admin.AdminUtils#deleteTopic, but I can't find clear documentation for it. The method signature doesn't contain the Kafka server URLs. Does that mean I'm deleting only the topic's metadata and the brokers' disk usage isn't reduced? If so, when does the real append-log file deletion happen?

Instead of using a time-based retention policy, are you able to use a size-based policy? log.retention.bytes is a per-partition setting that might help you out here.
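For reference, the per-topic form of that setting is retention.bytes (log.retention.bytes is the broker-wide default). A minimal sketch of applying it with the newer org.apache.kafka.clients.admin.AdminClient, assuming a broker at localhost:9092 and a topic named db_table (both placeholders):

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionOverride {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker list

        try (AdminClient admin = AdminClient.create(props)) {
            // Per-topic override: cap each partition's log at ~1 GiB.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "db_table");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> ops =
                    Collections.singletonMap(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(ops).all().get(); // block until the change is applied
        }
    }
}
```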
I'm not sure how you'd want to determine that a topic is fully consumed, but calling deleteTopic against the topic initially only marks it for deletion. Provided delete.topic.enable is set to true in your server.properties file and no consumers or producers are still connected and accessing the topic, the controller will then delete the topic from the cluster as soon as it is able to. This includes purging the data from disk. It can take anywhere from a few seconds to several minutes.
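For what it's worth, kafka.admin.AdminUtils talks to ZooKeeper rather than to the brokers directly, which is why its signature carries no broker URLs; it only writes the deletion marker, and the controller performs the actual log deletion. Newer clients replace it with org.apache.kafka.clients.admin.AdminClient, which does take broker addresses. A minimal sketch, assuming a broker at localhost:9092 and a topic named db_table (both placeholders):

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DeleteTopicsResult;

public class TopicDeleter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker list

        try (AdminClient admin = AdminClient.create(props)) {
            // Marks the topic for deletion; the controller then removes the
            // partition logs from disk asynchronously.
            DeleteTopicsResult result = admin.deleteTopics(Collections.singleton("db_table"));
            result.all().get(30, TimeUnit.SECONDS); // wait for the request to be accepted
        }
    }
}
```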

Related

Deleting Kafka messages after consumers' receipt

In my Kafka Java project I want to delete messages as soon as all interested consumers have received them. After some research I found some old Stack Overflow questions: here, one more and here. After reading all of these, I still have some questions.
As far as I could understand, I really should rely on retention, either by time or by space. However, the answers are old, so maybe something has changed? Is there any other way to really ensure that messages are deleted right after all currently connected consumers have read them? In that case I would need to check whether or not all consumers have read the message. Would I need a consumer group for that?
Thank you in advance.
You don't need to worry about when the messages will be deleted or compacted to achieve your goal.
As far as your consumers are concerned, if they commit the offset of every message they consume, those processed messages, regardless of how long they stay in the topic, are dead from the consumers' point of view.
Note:
An administrative user can reset your consumer group to read again from the beginning, including the messages your consumers have already moved past. But that is a manual admin operation.
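For illustration, a minimal sketch of that approach with manual offset commits, assuming a topic named events and a consumer group audit-group (all names are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "audit-group");             // the group tracks progress per partition
        props.put("enable.auto.commit", "false");          // commit only after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // whatever "consumed" means for your application
                }
                // Once committed, these offsets are never re-read by this group,
                // no matter how long the messages remain in the topic.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```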

How to stream large files through Kafka?

I'm in the process of migrating an ACID-based monolith to an event-based microservice architecture. In the monolith potentially large files are stored in a database and I want to share this information (including the file content) with the microservices.
My approach would be to split the file into numbered blocks and send several messages (e.g. one FileCreatedMessage with metadata and an ID, followed by n FileContentMessage instances containing a block and its sequence number). On the receiving side, messages may not arrive in order, so I'd store the blocks from the messages, then order, join, and persist the result.
Is there any approach that allows me to stream the data through Kafka in one message, or another approach that avoids the overhead of implementing the splitting, ordering, and joining logic for several messages?
I noticed Kafka Streams. It seems to solve different problems than this one.
Kafka is not the right tool for sending large files. First, you need to ensure that all chunks of one message go to the same partition, so that they are processed by the same consumer instance. The weak point here is that your consumer may fail in the middle, losing the chunks it has gathered. If you store the chunks in some storage (a database) until all of them arrive, you will need a separate process to assemble them. You will also need to think about what happens if you lose a chunk or hit an error while processing one. We considered this question in our company and decided not to send files through Kafka at all: we keep them in storage and send a reference to them inside the message.
This article summarizes pros and cons.
Kafka Streams will not help you here; it is a framework that provides high-level constructs for working with streams, but it still just works over Kafka.
I try not to use Kafka to hold large file content. Instead, I store the file on a distributed file system (usually HDFS, but there are other good ones) and then put the URI into the Kafka message along with any other metadata I need. You do need to be careful about replication times within the distributed file system if you process your Kafka topic on a distributed streaming execution platform (e.g. Storm or Flink): there may be cases where the Kafka message is processed before the DFS has replicated the file for access by the local system, but that's easier to solve than the problems caused by storing large file content in Kafka.
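A minimal sketch of this reference-passing pattern, assuming an HDFS URI, a topic named file-events, and a JSON-ish payload (all names are placeholders, and the metadata format is up to you):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FileReferenceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The message carries only a pointer plus metadata; the file body
            // stays on the distributed file system.
            String fileId = "invoice-2023-001";
            String payload = "{\"fileId\":\"" + fileId + "\","
                    + "\"uri\":\"hdfs://namenode:8020/files/" + fileId + ".pdf\","
                    + "\"sizeBytes\":52428800}";
            // Keying by fileId keeps all events for one file on one partition.
            producer.send(new ProducerRecord<>("file-events", fileId, payload));
        }
    }
}
```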

JMS taking too long to process messages

An application has a JMS queue responsible for delivering audit logs. The application sends logs to the JMS queue, and this queue is consumed by an MDB.
However, the messages sent are big XML files that vary from 20 MB to 100 MB. The problem is that consuming messages from the JMS queue takes too long, leading to an OutOfMemoryError.
What should I do to solve this problem?
This answer may or may not help jguilhermemv; I just want to share an idea for those who read this post: a workaround for big messages.
The first thing is to try not to send such big messages. That leaves two options (both require implementation changes, which can be done at the start, or later if system changes are allowed):
Save the log in a database and send just the log IDs in the JMS messages. (Saving logs in a DB is not recommended, as the size and time to save will again become a problem at a later stage.)
Save the logs as files at a common location, store the file names in a DB, and share those file-name IDs via JMS. The consumer can then read the log file after consuming the message; a sketch of this option follows below.
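A minimal sketch of the second option using plain JMS; the JNDI names, queue, and shared directory are placeholders:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;

public class AuditLogSender {
    public void sendLogReference(byte[] xmlLog) throws Exception {
        // 1. Write the large XML payload to a shared location.
        Path dir = Paths.get("/shared/audit-logs"); // placeholder common location
        Path logFile = Files.createTempFile(dir, "audit-", ".xml");
        Files.write(logFile, xmlLog);

        // 2. Send only the file reference over JMS; the queue stays small.
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/AuditQueue"); // placeholder JNDI names

        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage(logFile.toString());
            producer.send(message); // the MDB reads the path and loads the file itself
        } finally {
            connection.close();
        }
    }
}
```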

Easiest point-multipoint data distribution library/framework in Java

I have a table in a central database that gets constantly appended to. I want to batch these appends every couple of minutes and have them sent to a bunch of "slave" servers. These servers will, in turn, process that data and then discard it (think distributed warehousing). Each "slave" server needs a different subset of the data and there is a natural partitioning key I can use for that.
Basically, I need this to be eventually consistent: every batch of data is eventually delivered to every "slave" server (reliability), even if a "slave" was down at the moment the batch was ready to be delivered (durability). I don't care about the order in which the batches are delivered.
Possible solutions I have considered:
MySQL replication does not fit my requirements because I would have to replicate the whole table on each server.
ETL & ESB products are too bloated for this; I am not doing any data processing.
Plain JMS I could use, but I'm looking for something even simpler.
JGroups is interesting, but members that have left the group will not get the messages once they rejoin.
Pushing files & ack files across servers: doable, but I don't know of any framework for it, so I would need to write my own.
Note: This question is about how to move the data from the central server to the N others with reliability & durability; not how to create or ingest it.
(Edited on Aug 24 to add durability requirement)
You may use JGroups for this. It's a toolkit for reliable multicast communication.
I ended up finding Spring Integration, which includes plugins to poll directories via SFTP, for example.
http://www.springsource.org/spring-integration
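For illustration, a minimal sketch of such an SFTP-polling setup using Spring Integration's Java configuration; class and package names follow recent JSch-based versions of spring-integration-sftp, and the host, credentials, and directories are placeholders:

```java
import java.io.File;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.core.MessageSource;
import org.springframework.integration.sftp.inbound.SftpInboundFileSynchronizer;
import org.springframework.integration.sftp.inbound.SftpInboundFileSynchronizingMessageSource;
import org.springframework.integration.sftp.session.DefaultSftpSessionFactory;
import org.springframework.messaging.MessageHandler;

@Configuration
@EnableIntegration
public class BatchPollingConfig {

    @Bean
    public DefaultSftpSessionFactory sftpSessionFactory() {
        DefaultSftpSessionFactory factory = new DefaultSftpSessionFactory();
        factory.setHost("central-server.example.com"); // placeholder host
        factory.setPort(22);
        factory.setUser("batchuser");                  // placeholder credentials
        factory.setPassword("secret");
        factory.setAllowUnknownKeys(true);             // fine for a sketch, not for production
        return factory;
    }

    @Bean
    public SftpInboundFileSynchronizer synchronizer() {
        SftpInboundFileSynchronizer sync = new SftpInboundFileSynchronizer(sftpSessionFactory());
        sync.setRemoteDirectory("/batches/slave-1");   // each slave polls its own subset
        return sync;
    }

    // Poll every two minutes; files left on the server while a slave is down
    // are simply picked up on the next successful poll (the durability part).
    @Bean
    @InboundChannelAdapter(channel = "batchFiles", poller = @Poller(fixedDelay = "120000"))
    public MessageSource<File> sftpMessageSource() {
        SftpInboundFileSynchronizingMessageSource source =
                new SftpInboundFileSynchronizingMessageSource(synchronizer());
        source.setLocalDirectory(new File("/var/batches/incoming"));
        source.setAutoCreateLocalDirectory(true);
        return source;
    }

    @Bean
    @ServiceActivator(inputChannel = "batchFiles")
    public MessageHandler batchHandler() {
        return message -> {
            File batch = (File) message.getPayload();
            // process the batch here, then discard it
            System.out.println("Received batch " + batch.getName());
        };
    }
}
```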

How can we save Java Message Queues for reference?

How can we keep track of every message that gets into our Java Message Queue? We need to save the messages for later reference. We already log them into an application log (log4j), but we need to query them later.
You can store them:
- in memory (in a collection or in an in-memory database)
- in a standalone database
You could create a database logging table for the messages, storing the message as-is in a BLOB column, the timestamp it was created/posted to the MQ, and a simple counter as the primary key. You can also add fields like message type, etc. if you want to create statistical reports on messages sent.
Cleanup of the table can be done simply by deleting all messages older than the retention period, using the timestamp column.
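A minimal JDBC sketch of such a logging table and its cleanup; the table layout, connection URL, and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class MessageLogDao {
    private static final String URL = "jdbc:postgresql://localhost/msglog"; // placeholder

    // Table sketch:
    //   CREATE TABLE message_log (
    //     id         BIGSERIAL PRIMARY KEY,   -- simple counter
    //     msg_type   VARCHAR(50),             -- optional, for reports
    //     created_at TIMESTAMP NOT NULL,      -- when posted to the MQ
    //     payload    BYTEA                    -- the message as-is (BLOB)
    //   );

    public void save(String msgType, byte[] payload) throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO message_log (msg_type, created_at, payload) VALUES (?, ?, ?)")) {
            ps.setString(1, msgType);
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.setBytes(3, payload);
            ps.executeUpdate();
        }
    }

    // Delete everything older than the retention period (here, 30 days).
    public int cleanup() throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "DELETE FROM message_log WHERE created_at < ?")) {
            long thirtyDaysMs = 30L * 24 * 60 * 60 * 1000;
            ps.setTimestamp(1, new Timestamp(System.currentTimeMillis() - thirtyDaysMs));
            return ps.executeUpdate();
        }
    }
}
```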
I implemented such a solution in the past: we chose to store messages with all their characteristics in a database and developed a search, replay, and cancel application on top of it. This is the Message Store pattern:
[Message Store pattern diagram - eaipatterns.com]
We also used this application for the Dead Letter Channel:
[Dead Letter Channel diagram - eaipatterns.com]
If you don't want to build a custom solution, have a look at the ReplayService for JMS from CodeStreet.
The best way to do this is to use whatever tracing facility your middleware provider offers. Alternatively, you could set up an intermediate listener whose only job is to log messages and forward them on to your existing application.
In most cases, you will find that the middleware provider already has the ability to do this for you with no changes or awareness by your application.
I would change the queue to a topic, and then keep the original consumer that processes the messages, and add another consumer for auditing the messages to a database.
Some JMS providers cater for topic-to-queue-bridge definitions, the consumers then receive from their own dedicated queues, and don't have to read past messages that are left on the queue due to other consumers being inactive.
Alternatively, you could write a log4j appender, which writes your logged messages to a database.
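A minimal sketch of such an appender for log4j 1.x; MessageLogDao is a hypothetical DAO (like the one sketched earlier), and any persistence mechanism would do:

```java
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

// A log4j 1.x appender that writes each logged message to a database.
public class DatabaseAppender extends AppenderSkeleton {

    private final MessageLogDao dao = new MessageLogDao(); // hypothetical DAO

    @Override
    protected void append(LoggingEvent event) {
        try {
            // Store the rendered log message; a real implementation might
            // also persist the logger name, level, and timestamp.
            dao.save("log4j", event.getRenderedMessage().getBytes());
        } catch (Exception e) {
            errorHandler.error("Could not persist log event", e, 0);
        }
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}
```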
