How to stream large files through Kafka? - java

I'm in the process of migrating an ACID-based monolith to an event-based microservice architecture. In the monolith potentially large files are stored in a database and I want to share this information (including the file content) with the microservices.
My approach would be to split the file into numbered blocks and send several messages (e.g. one FileCreatedMessage with metadata and an id, followed by n FileContentMessages, each containing a block and its sequence number). On the receiving side messages may not arrive in order. Therefore I'd store the blocks from the messages, order and join them, and store the result.
Is there any approach that allows me to stream the data through Kafka in a single message, or another approach that avoids the overhead of implementing the splitting, ordering and joining logic for several messages?
I noticed Kafka Streams. It seems to solve different problems than this one.
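For concreteness, a minimal sketch of the chunking approach described above, assuming hypothetical topics named file-events (metadata) and file-chunks (content), a local broker, and a simple JSON-ish string for the metadata; all of those names are assumptions. Keying every chunk record by the file id keeps all chunks of one file on the same partition:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.Properties;
    import java.util.UUID;

    public class ChunkedFileProducer {
        private static final int CHUNK_SIZE = 512 * 1024;   // stay well below the broker's message.max.bytes

        public static void publish(InputStream file, String fileName) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
                String fileId = UUID.randomUUID().toString();

                // FileCreatedMessage: metadata only (hypothetical JSON layout)
                String created = "{\"fileId\":\"" + fileId + "\",\"name\":\"" + fileName + "\"}";
                producer.send(new ProducerRecord<>("file-events", fileId, created.getBytes()));

                // FileContentMessages: keyed by fileId so every chunk lands on the same partition
                byte[] buffer = new byte[CHUNK_SIZE];
                int read, sequence = 0;
                while ((read = file.read(buffer)) != -1) {
                    byte[] chunk = Arrays.copyOf(buffer, read);
                    ProducerRecord<String, byte[]> record = new ProducerRecord<>("file-chunks", fileId, chunk);
                    record.headers().add("sequence", Integer.toString(sequence++).getBytes());
                    producer.send(record);
                }
                producer.flush();
            }
        }
    }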

Kafka is not the right approach for sending large files. First, you need to ensure that the chunks of one file end up on the same partition, so that they are processed by the same consumer instance. The weak point here is that your consumer may fail in the middle, losing the chunks it has gathered. If you store the chunks in some storage (a database) until all of them arrive, you will need a separate process to assemble them. You will also need to think about what happens if you lose a chunk or hit an error while processing one. We considered this question in our company and decided not to send files through Kafka at all: we keep them in storage and send a reference to them inside the message.
This article summarizes pros and cons.
Kafka Streams will not help you here; it is a framework that provides high-level constructs for working with streams, but it still runs on top of Kafka.
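A minimal sketch of the reference approach recommended above, assuming the file is first copied to a shared location (here a plain shared mount) and that a hypothetical file-events topic carries a small JSON reference; the paths and topic name are assumptions:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.nio.file.*;
    import java.util.UUID;

    public class FileReferencePublisher {
        public static void publish(Path localFile, KafkaProducer<String, String> producer) throws Exception {
            String fileId = UUID.randomUUID().toString();

            // 1. Put the payload somewhere both sides can reach (shared mount, HDFS, S3, ...).
            Path shared = Paths.get("/mnt/shared-files", fileId);   // assumption: shared mount point
            Files.copy(localFile, shared, StandardCopyOption.REPLACE_EXISTING);

            // 2. Publish only a small reference message; the file itself never goes through Kafka.
            String message = "{\"fileId\":\"" + fileId + "\",\"uri\":\"" + shared.toUri() + "\",\"name\":\""
                    + localFile.getFileName() + "\"}";
            producer.send(new ProducerRecord<>("file-events", fileId, message)).get();   // wait for the ack
        }
    }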

I try not to use Kafka to hold large file content. Instead, I store the file on a distributed file system (usually HDFS, but there are other good ones) and then put the URI into the Kafka message along with any other metadata I need. You do need to be careful about replication times within the distributed file system if you process your Kafka topic on a distributed streaming execution platform (e.g. Storm or Flink). There may be instances where the Kafka message is processed before the DFS can replicate the file for access by the local system, but that's easier to solve than the problems caused by storing large file content in Kafka.
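On the consuming side, a sketch of resolving such a reference, with a short retry loop to ride out the replication window mentioned above. It assumes the message value is simply the file URI; the 5-attempt / 2-second numbers are assumptions:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import java.io.FileNotFoundException;
    import java.net.URI;
    import java.nio.file.*;

    public class FileReferenceResolver {

        /** Fetches the file a record points at, retrying while the replica is not yet visible locally. */
        public static byte[] fetch(ConsumerRecord<String, String> record) throws Exception {
            Path path = Paths.get(URI.create(record.value()));   // assumption: the value is the file URI
            for (int attempt = 1; attempt <= 5; attempt++) {
                if (Files.exists(path)) {
                    return Files.readAllBytes(path);
                }
                Thread.sleep(2_000);   // the replica may not have reached this node yet
            }
            throw new FileNotFoundException("File not replicated in time: " + path);
        }
    }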

Related

Reading huge file and writing in RDBMS

I have a huge text file which is continuously getting appended to from a common place. I need to read it line by line from my Java application and update a SQL RDBMS, such that if the Java application crashes it should start from where it left off and not from the beginning.
It's a plain text file. Each row will contain:
<Datatimestamp> <service name> <paymentType> <success/failure> <session ID>
Also, the data retrieved from the database should be real time, without any performance or availability issues in the web application.
Here is my approach:
Deploy the application on two boxes, each running a heartbeat that pings the other system to check service availability.
When you get a successful heartbeat response, you also get the timestamp of the last successfully read line.
When the next heartbeat response fails, the application on the other system can take over, based on:
1. failed response
2. Last successful time stamp.
Also, since the data retrieval needs to be very close to real time and the data is huge, can I crawl the database and put it into Solr or Elasticsearch for faster retrieval, instead of making database calls?
There are various ways to do it; what is the best way?
I would put a messaging system (for example RabbitMQ) between the text file and the DB-writing applications. In this case, the messaging system functions as a queue: one application constantly reads the file and inserts the rows as messages into the broker, and on the other side multiple "DB-writing applications" can read from the queue and write to the DB.
The advantage of the messaging system is its support for multiple clients reading from the queue. The messaging system takes care of synchronizing the clients, dealing with errors, dead letters, etc. The clients don't care about which payload was processed by other instances.
Regarding maintaining multiple instances of the "DB-writing applications": I would go for ready-made cluster solutions, perhaps a Docker cluster managed by Kubernetes.
Another viable alternative is a streaming platform, like Apache Kafka.
You can use software like Filebeat to read the file and direct its output to RabbitMQ or Kafka. From there a Java program can subscribe to / consume the data and put it into an RDBMS.
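For illustration, a minimal sketch of the consume-and-insert side using a Kafka consumer and plain JDBC. The topic name payment-log, the table payment_events (text columns for simplicity), the Oracle connection string, and the assumption that each field is a single space-separated token are all assumptions, not part of the question:

    import org.apache.kafka.clients.consumer.*;
    import java.sql.*;
    import java.time.Duration;
    import java.util.*;

    public class LogToRdbmsConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // assumption: local broker
            props.put("group.id", "db-writers");                       // several instances can share this group
            props.put("enable.auto.commit", "false");                  // commit offsets only after the DB write succeeds
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // Requires the Oracle JDBC driver on the classpath; URL and credentials are placeholders.
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Connection db = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc", "user", "pwd")) {
                consumer.subscribe(Collections.singletonList("payment-log"));
                String sql = "INSERT INTO payment_events(ts, service, payment_type, status, session_id) VALUES (?,?,?,?,?)";
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    if (records.isEmpty()) continue;
                    try (PreparedStatement ps = db.prepareStatement(sql)) {
                        for (ConsumerRecord<String, String> r : records) {
                            String[] f = r.value().split(" ");          // <ts> <service> <paymentType> <status> <sessionId>
                            for (int i = 0; i < 5; i++) ps.setString(i + 1, f[i]);
                            ps.addBatch();
                        }
                        ps.executeBatch();
                    }
                    consumer.commitSync();                              // offsets advance only after rows are persisted
                }
            }
        }
    }

Because auto-commit is disabled, a crash replays the last uncommitted batch, so the inserts should be idempotent or deduplicated (e.g. on session ID plus timestamp).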

Using AWS S3 as an intermediate storage layer for monitoring platform

We have a use case where we want to use S3 to push event-based + product metrics temporarily until they are loaded into a relational data warehouse (Oracle). These metrics would be sent by more than 200 application servers to S3 and persisted in different files per metric per server. The frequency of some of the metrics could be high, e.g. sending the number of active HTTP sessions on the app server every minute, or the memory usage per minute. Once the metrics are persisted in S3, we would have something on the data warehouse side that reads the CSV files and loads them into Oracle. We chose S3 over a queue (Kafka/ActiveMQ/RabbitMQ) due to various factors including cost, durability and replication. I have a few questions related to the write and read mechanisms with S3:
For event-based metrics, how can we write to S3 such that the app server is not blocked? I see that the Java SDK does support asynchronous writes. Would that guarantee delivery?
How can we update a CSV file created on S3 by appending a record? From what I have read we cannot update an S3 object. What would be an efficient way to push monitoring metrics to S3 at periodic intervals?
When reading from S3, performance isn't a critical requirement. What would be an optimized way of loading the CSV files into Oracle? A couple of ways include using the get-object API from the Java SDK or mounting S3 folders as NFS shares and creating external tables. Are there any other efficient ways of reading?
Thanks
FYI, 200 servers sending one request per minute is not "high". You are likely over-engineering this. SQS is simple, highly redundant/available, and would likely meet your needs far better than rolling your own solution.
To answer your questions in detail:
1) No, you cannot "guarantee delivery", especially with asynchronous S3 operations. You could design recoverable operations, but not guaranteed delivery.
2) That isn't what S3 is for; S3 only supports whole-object writes. You would have to create a system where you add lots of small files instead, and you probably don't want to do this. "Updating" a file (especially from multiple threads) is dangerous: each update replaces the entire object.
3) If you must do this, use the object API, process each file one at a time, and delete each when you are done. You are much better off building a queue-based system.
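As a rough illustration of option 3, a sketch using the AWS SDK for Java v2; the bucket name, the metrics/ prefix, and the loader stub are assumptions, and the actual Oracle load is omitted:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.*;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class S3CsvDrain {
        public static void drain(S3Client s3, String bucket) throws Exception {
            // Note: only the first page of results is handled here; loop with continuation tokens for more.
            ListObjectsV2Response listing = s3.listObjectsV2(
                    ListObjectsV2Request.builder().bucket(bucket).prefix("metrics/").build());
            for (S3Object obj : listing.contents()) {
                // Read one CSV object at a time
                GetObjectRequest get = GetObjectRequest.builder().bucket(bucket).key(obj.key()).build();
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(s3.getObject(get)))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        loadIntoOracle(line);   // hypothetical loader, e.g. a JDBC batch insert
                    }
                }
                // Delete the object only after its rows were loaded successfully
                s3.deleteObject(DeleteObjectRequest.builder().bucket(bucket).key(obj.key()).build());
            }
        }

        private static void loadIntoOracle(String csvRow) {
            // placeholder for the actual DB write
        }
    }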

Broker disk usage after topic deletion

I'm using Apache Kafka. I dump huge databases into Kafka, where each database table is a topic.
I cannot delete a topic before it's completely consumed. I cannot set a time-based retention policy because I don't know when a topic will be consumed. I have limited disk and too much data. I have to write code that will orchestrate consumption and deletion programmatically. I understand that the problem appears because we're using Kafka for batch processing, but I can't change the technology stack.
What is the correct way to delete a consumed topic from the brokers?
Currently, I'm calling kafka.admin.AdminUtils#deleteTopic, but I can't find clear documentation on it. The method signature doesn't contain the Kafka server URLs. Does that mean I'm deleting only the topic's metadata and the brokers' disk usage isn't reduced? So when does the real append-log file deletion happen?
Instead of using a time-based retention policy, are you able to use a size-based policy? log.retention.bytes is a per-partition setting that might help you out here.
I'm not sure how you'd want to determine that a topic is fully consumed, but calling deleteTopic against the topic initially marks it for deletion. As soon as there are no consumers/producers connected to the cluster and accessing those topics, and if delete.topic.enable is set to true in your server.properties file, the controller will then delete the topic from the cluster as soon as it is able to do so. This includes purging the data from disk. It can take anywhere between a few seconds and several minutes to do this.
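For illustration, a sketch of topic deletion using the Java AdminClient (the newer public API, rather than the Scala AdminUtils the question mentions); the bootstrap address and topic name are assumptions:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import java.util.Collections;
    import java.util.Properties;

    public class TopicCleaner {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // assumption: broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Marks the topic for deletion; the brokers purge the log segments from disk afterwards,
                // provided delete.topic.enable=true on the brokers.
                admin.deleteTopics(Collections.singletonList("db_table_topic"))
                     .all()
                     .get();   // block until the controller has accepted the deletion
            }
        }
    }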

How can I handle large files processing via messaging queries in Microservices environment?

Many people suggest that a good way of organizing IPC (inter-microservice communication) is asynchronous communication via queues like Kafka and JMS.
But what if I need to pass large data files between services?
Suppose I have a Video Microservice and a Publisher Microservice. The first one receives videos from the user, verifies them and sends them to the Publisher for converting and publishing. It's obvious a video can be a very large file and it can overload the messaging system (Kafka is not suitable for big messages at all). Of course, I could share one database between them and send a video_id via Kafka, but that couples these services and it's not a real microservices architecture anymore.
Do you have similar situations in practice? How do you handle it?
Thanks
There is an Enterprise Integration Pattern from the book by Hohpe/Woolf called the Claim Check pattern that addresses these concerns.
Essentially, the big blob is removed from the message and stored somewhere that both sender and receiver can access, whether that be a common file share, an FTP server, an Amazon S3 blob, whatever. It leaves a "claim check" behind: some sort of address that describes how to find the blob again.
The tiny message can then be transmitted over Kafka/JMS, or some other message queue system, most of which are fairly bad at dealing with large data blobs.
Of course, a very simple implementation is to leave the files on a file share and only refer to them by file path.
It's more complex when it's preferable to have the blob integrated with the rest of the message, requiring a true Claim Check implementation. This can be handled at an infrastructure level so the message sender and receiver don't need to know any of the details behind how the data is transmitted.
I know that you're in the Java landscape, but in NServiceBus (I work for Particular Software, the makers of NServiceBus) this pattern is implemented with the Data Bus feature in a message pipeline step. All the developer needs to do is identify what type of message properties apply to the data bus, and (in the default file share implementation) configure the location where files are stored. Developers are also free to provide their own data bus implementation.
One thing to keep in mind is that, with the blobs disconnected from the messages, you have to provide for cleanup. If the messages are one-way, you could clean them up as soon as the message is successfully processed. With Kafka (I'm not terribly familiar with it) there's a possibility to process messages from a stream multiple times, correct? If so, you'd want to wait until it was no longer possible to process that message. Or, if the Publish/Subscribe pattern is in use, you would not want to clean up the files until you were sure all subscribers had a chance to process the message. In order to accomplish that, you'd need to set an SLA (a timespan within which each message must be processed) on the message and clean up the blob storage after that timespan has elapsed.
In any case, lots of things to consider, which make it much more useful to implement at an infrastructure level rather than try to roll your own in each instance.
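To illustrate the cleanup concern, a sketch of an SLA-based sweep over a shared blob directory. It assumes the blobs live on a plain file share and a 7-day processing SLA; both are assumptions, and blob-store implementations (S3, HDFS) would use their own listing/delete APIs or lifecycle rules instead:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.stream.Stream;

    public class ClaimCheckCleanup {
        // Assumption: every consumer is expected to have processed a message within this window.
        private static final Duration SLA = Duration.ofDays(7);

        public static void sweep(Path blobDir) throws IOException {
            Instant cutoff = Instant.now().minus(SLA);
            try (Stream<Path> files = Files.list(blobDir)) {
                files.filter(Files::isRegularFile)
                     .filter(p -> creationTime(p).isBefore(cutoff))   // older than the SLA window
                     .forEach(p -> {
                         try {
                             Files.deleteIfExists(p);                 // the claim check is now dangling by design
                         } catch (IOException e) {
                             // leave it for the next sweep
                         }
                     });
            }
        }

        private static Instant creationTime(Path p) {
            try {
                return Files.readAttributes(p, BasicFileAttributes.class).creationTime().toInstant();
            } catch (IOException e) {
                return Instant.MAX;   // unreadable attributes: never treat as expired
            }
        }
    }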

Easiest point-multipoint data distribution library/framework in Java

I have a table in a central database that gets constantly appended to. I want to batch these appends every couple of minutes and have them sent to a bunch of "slave" servers. These servers will, in turn, process that data and then discard it (think distributed warehousing). Each "slave" server needs a different subset of the data and there is a natural partitioning key I can use for that.
Basically, I need this to be eventually consistent : every batch of data to be eventually delivered to every "slave" server (reliability), even if the "slave" was down at the moment the batch is ready to be delivered (durability). I don't care about the order in which the batches are delivered.
Possible solutions I have considered,
MySQL replication does not fit my requirement because I would have to replicate the whole table on each server.
ETL & ESB products are too bloated for this, I am not doing any data processing.
Plain JMS, I could use but I'm looking for something even simpler
JGroups is interesting but members that are left the group will not get the messages once they rejoin.
Pushing files & ack files across servers : can do but I don't know of any framework so would need to write my own.
Note : This question is about how to move the data from the central server to the N others with reliability & durability; not how to create or ingest it.
(Edited on Aug 24 to add durability requirement)
You may use JGroups for this. It's a toolkit for reliable multicast communication.
I ended up finding "Spring Integration", which includes plugins to poll directories via SFTP, for example.
http://www.springsource.org/spring-integration
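For illustration, a rough sketch of such an SFTP polling flow with the Spring Integration Java DSL (assuming Spring Integration 5.x with the SFTP module); the host, directories, credentials and two-minute polling interval are all assumptions:

    import com.jcraft.jsch.ChannelSftp;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.dsl.IntegrationFlow;
    import org.springframework.integration.dsl.IntegrationFlows;
    import org.springframework.integration.dsl.Pollers;
    import org.springframework.integration.file.remote.session.SessionFactory;
    import org.springframework.integration.sftp.dsl.Sftp;
    import org.springframework.integration.sftp.session.DefaultSftpSessionFactory;
    import java.io.File;

    @Configuration
    public class BatchPullConfig {

        @Bean
        public SessionFactory<ChannelSftp.LsEntry> sftpSessionFactory() {
            DefaultSftpSessionFactory factory = new DefaultSftpSessionFactory();
            factory.setHost("central-server");        // assumption: central server host
            factory.setPort(22);
            factory.setUser("batchuser");
            factory.setPassword("secret");
            factory.setAllowUnknownKeys(true);        // for a sketch only; pin host keys in production
            return factory;
        }

        @Bean
        public IntegrationFlow batchPullFlow() {
            return IntegrationFlows
                    .from(Sftp.inboundAdapter(sftpSessionFactory())
                                    .remoteDirectory("/batches")               // assumption: where batch files land
                                    .localDirectory(new File("/var/local/batches"))
                                    .autoCreateLocalDirectory(true),
                            e -> e.poller(Pollers.fixedDelay(120_000)))        // poll every two minutes
                    .handle(message -> process((File) message.getPayload()))
                    .get();
        }

        private void process(File batchFile) {
            // placeholder: load the batch into the local store, then delete the file
        }
    }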
