Kafka - Transmitting large media content - java

I have a requirement to continuously transmit large video files from one system to another. Is Kafka suitable for transmitting large media content? What are the considerations that I must take into account before opting for this solution?

You could use Kafka to send messages that contain an external reference to the large video files. The receiver can then download the file from that external storage (for example, an Amazon S3 bucket). This is called the "Claim Check Pattern" and is documented here: http://www.enterpriseintegrationpatterns.com/patterns/messaging/StoreInLibrary.html
However, Kafka is not designed to transport the large video files themselves. It is not a managed file transfer tool.
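A minimal sketch of that claim-check flow, using the AWS Java SDK (v1) and the Kafka Java client; the bucket name, topic name, file path, and broker address are placeholder assumptions, not part of the original question:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.File;
import java.util.Properties;

public class ClaimCheckProducer {
    public static void main(String[] args) {
        String bucket = "media-bucket";   // placeholder bucket
        String topic = "video-events";    // placeholder topic

        // 1. Upload the large video file to external storage (the "claim check").
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        File video = new File("/path/to/video.mp4");
        String key = "videos/" + video.getName();
        s3.putObject(bucket, key, video);

        // 2. Publish only a small reference message to Kafka.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String reference = "s3://" + bucket + "/" + key;
            producer.send(new ProducerRecord<>(topic, video.getName(), reference));
        }
        // The consumer reads the reference and downloads the file from S3 itself.
    }
}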

You could chunk up the files, put each chunk into Kafka as a message, and then recombine the chunks on the other end. By default the maximum Kafka message size is 1 MB, but this is configurable.
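If you go the chunking route, a rough sketch of the producer side could look like the following; the topic name, chunk size, and file path are hypothetical. Keying every chunk with the same file id sends all chunks of a file to the same partition, so a single consumer instance sees them and can reassemble the file; the 512 KB chunk size stays below the default 1 MB limit, so no broker changes are needed.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;

public class ChunkedFileProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // For chunks larger than ~1 MB you would also raise max.request.size here
        // and message.max.bytes on the broker.

        String fileId = UUID.randomUUID().toString();
        byte[] buffer = new byte[512 * 1024];   // 512 KB chunks, below the default limit

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             InputStream in = Files.newInputStream(Paths.get("/path/to/large-file.bin"))) {
            int sequence = 0;
            int read;
            while ((read = in.read(buffer)) > 0) {
                // The fileId key routes every chunk of this file to the same partition.
                ProducerRecord<String, byte[]> record = new ProducerRecord<>(
                        "file-chunks", fileId, Arrays.copyOf(buffer, read));
                record.headers().add("sequence", Integer.toString(sequence++).getBytes());
                producer.send(record);
            }
        }
    }
}

The consumer then buffers the chunks by fileId, orders them by the sequence header, and writes the reassembled file out.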

Related

How to send Huge data sets to Client from Server

I have a requirement to send data from a server (Tomcat, a Java process exposing OData APIs) to a client (React based).
The data can range from a few KB to hundreds of MB (say 700 MB); it is retrieved from a Redshift database, processed, and sent to the client.
Multiple clients can access the system at the same time, which adds further stress.
We added pagination so that only the data for the current page is loaded, but we also have a feature to export the complete data set in CSV format.
Processing all of the data consumes a lot of memory and the application's heap sometimes gets exhausted. Increasing the heap is not the solution I am looking for; I want to know whether anything can be done on the application side to optimize system resources.
Please suggest the best way to transfer the data. I would also like to know whether there is any other kind of (streaming) API that could help here.
Can you change the integration between the client and your system?
Something like: the client sends a request to export a CSV, with a callback URL in the payload.
You put this request in a queue (e.g. RabbitMQ). A queue consumer processes the request, generates the CSV, and puts it in a temporary area (S3 or behind an NGINX server). The consumer then notifies the client via the callback URL, passing the new URL from which the client can download the full CSV.
This way, the system that handles the incoming requests does not use much heap. You only need to scale the queue consumers, and that is easier because the concurrency is determined by how many consumers you configure, not by the rate of incoming client requests.
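A rough sketch of what that queue consumer could look like, assuming RabbitMQ's Java client and the AWS Java SDK v1; the queue name, bucket, message format, and helper methods are hypothetical. The CSV is written to a temporary file and then parked in S3, so the full result set never has to sit in the heap.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CsvExportConsumer {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                                       // placeholder broker host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare("export-requests", true, false, false, null); // hypothetical queue

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        DeliverCallback handler = (consumerTag, delivery) -> {
            // Hypothetical message format: "<callbackUrl>|<export request>"
            String[] parts = new String(delivery.getBody(), StandardCharsets.UTF_8).split("\\|", 2);
            String callbackUrl = parts[0];
            try {
                Path csv = Files.createTempFile("export-", ".csv");
                streamQueryToCsv(parts[1], csv);                            // stream rows to disk, not to the heap
                String key = "exports/" + csv.getFileName();
                s3.putObject("exports-bucket", key, csv.toFile());          // park the CSV in the temporary area
                // Real deployments would usually hand out a pre-signed URL instead.
                notifyClient(callbackUrl, s3.getUrl("exports-bucket", key).toString());
                Files.delete(csv);
            } catch (Exception e) {
                e.printStackTrace();                                        // real code: retry or dead-letter
            }
        };
        channel.basicConsume("export-requests", true, handler, consumerTag -> { });
    }

    // Run the export query with Statement.setFetchSize(...) and write rows one at a
    // time, so only a single row is in memory at any moment (JDBC details omitted).
    static void streamQueryToCsv(String exportRequest, Path csv) throws Exception {
    }

    // POST the download URL back to the callback URL the client sent with its request.
    static void notifyClient(String callbackUrl, String downloadUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(callbackUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(downloadUrl.getBytes(StandardCharsets.UTF_8));
        }
        conn.getResponseCode();                                             // actually perform the request
        conn.disconnect();
    }
}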

How to stream large files through Kafka?

I'm in the process of migrating an ACID-based monolith to an event-based microservice architecture. In the monolith potentially large files are stored in a database and I want to share this information (including the file content) with the microservices.
My approach would be to split the file into numbered blocks and send several messages (e.g. one FileCreatedMessage with metadata and an id, followed by n FileContentMessages, each containing a block and its sequence number). On the receiving side, messages may not arrive in order. Therefore I'd store the blocks from the messages, then order and join them, and store the result.
Is there any approach that allows me to stream the data through Kafka in one message, or another approach that avoids the overhead of implementing the splitting, ordering, and joining logic across several messages?
I noticed Kafka Streams, but it seems to solve a different problem than this one.
Kafka is not the right approach for sending large files. First, you need to ensure that the chunks of one file end up on the same partition, so that they are processed by the same consumer instance. A weak point here is that your consumer may fail in the middle, losing the chunks it has gathered. If you store the chunks in some storage (a database) until all of them arrive, you will then need a separate process to assemble them. You will also need to think about what happens if you lose a chunk or hit an error while processing one. We considered this question in our company and decided not to send files through Kafka at all, but rather to keep them in storage and send a reference to them inside the message.
This article summarizes pros and cons.
Kafka Streams will not help you here; it is a framework that provides high-level constructs for working with streams of records, but it still runs on top of ordinary Kafka messages.
I try not to use Kafka to hold large file content. Instead, I store the file on a distributed file system (usually HDFS, but there are other good ones) and then put the URI into the Kafka message along with any other metadata I need. You do need to be careful about replication times within the distributed file system if you are processing your Kafka topic on a distributed streaming execution platform (e.g. Storm or Flink). There may be cases where the Kafka message is processed before the DFS has replicated the file for access by the local system, but that is easier to solve than the problems caused by storing large file content in Kafka.
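For illustration, the consuming side of that reference-passing pattern might look roughly like this, assuming the message value is simply the file URI, the key is the file name, and the Hadoop FileSystem API is used to read the referenced file (topic, group, and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class FileReferenceConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "file-processors");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("file-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String uri = record.value();   // e.g. "hdfs://namenode:8020/files/report.bin"
                    // Fetch the actual content from the DFS referenced in the message.
                    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
                    try (InputStream in = fs.open(new Path(uri))) {
                        Files.copy(in, Paths.get("/tmp/" + record.key()),
                                StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
    }
}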

Best strategy to upload files with unknown size to S3

I have a server-side application that runs through a large number of image URLs and uploads the images from these URLs to S3.
The files are served over HTTP. I download them using the InputStream I get from an HttpURLConnection via its getInputStream method, and I hand that InputStream to the AWS S3 client's putObject method (AWS Java SDK v1) to upload the stream to S3. So far so good.
I am trying to introduce a new external image data source. The problem with this data source is that the HTTP server serving these images does not return a Content-Length HTTP header. This means I cannot tell how many bytes the image will be, which is a number required by the AWS S3 client to validate the image was correctly uploaded from the stream to S3.
The only ways I can think of to deal with this issue are either to get the server owner to add a Content-Length HTTP header to their responses (unlikely), or to download the file into a memory buffer first and then upload it to S3 from there.
These are not big files, but I have many of them.
When considering downloading the file first, I am worried about the memory footprint and concurrency implications (not being able to upload and download chunks of the same file at the same time).
Since I am dealing with many small files, I suspect that the concurrency issues might be "resolved" if I focus on concurrency across multiple files rather than within a single file. So instead of concurrently downloading and uploading chunks of the same file, I would use my I/O effectively by downloading one file while uploading another.
I would love your ideas on how to do this, best practices, pitfalls, or any other thoughts on how best to tackle this issue.
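Since the files are small, the buffer-then-upload option mentioned above can be sketched roughly as follows with the AWS Java SDK v1 (bucket, key, and URL are placeholders): the whole body is read into memory so the exact length can be set on the ObjectMetadata before calling putObject.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BufferedImageUpload {

    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();

    // Downloads the image fully into memory so its exact length is known,
    // then uploads it to S3 with that length set on the object metadata.
    static void copyToS3(String imageUrl, String bucket, String key) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        byte[] bytes = buffer.toByteArray();

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(bytes.length);   // the length S3 uses to validate the upload
        S3.putObject(bucket, key, new ByteArrayInputStream(bytes), metadata);
    }
}

With many small files, running several of these copies in parallel (one thread per file) keeps the download and upload bandwidth busy without needing chunk-level concurrency within a single file.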

Using AWS S3 as an intermediate storage layer for monitoring platform

We have a use case where we want to use S3 to hold event-based and product metrics temporarily until they are loaded into a relational data warehouse (Oracle). These metrics would be sent to S3 by more than 200 application servers and persisted in separate files per metric per server. The frequency of some of the metrics could be high, for example sending the number of active HTTP sessions on the app server every minute, or the memory usage per minute. Once the metrics are persisted in S3, something on the data warehouse side would read the CSV files and load them into Oracle. We favored S3 over a queue (Kafka/ActiveMQ/RabbitMQ) due to various factors including cost, durability, and replication. I have a few questions related to the write and read mechanisms with S3:
1) For event-based metrics, how can we write to S3 such that the app server is not blocked? I see that the Java SDK does support asynchronous writes. Would that guarantee delivery?
2) How can we update a CSV file created on S3 by appending a record? From what I have read, we cannot update an S3 object. What would be an efficient way to push monitoring metrics to S3 at periodic intervals?
3) When reading from S3, performance isn't a critical requirement. What would be an optimized way of loading the CSV files into Oracle? A couple of options are using the GetObject API from the Java SDK, or mounting S3 folders as NFS shares and creating external tables. Are there any other efficient ways of reading?
Thanks
FYI, 200 servers sending one request per minute is not "high". You are likely over-engineering this. SQS is simple, highly redundant/available, and would likely meet your needs far better than building your own solution.
To answer your questions in detail:
1) No, you cannot "guarantee delivery", especially with asynchronous S3 operations. You could design recoverable operations, but not guaranteed delivery.
2) That isn't what S3 is for: it only supports whole-object writes. You would have to build a system that keeps adding lots of small files, and you probably don't want to do that. Updating a file (especially from multiple threads) is dangerous, because each update replaces the entire object.
3) If you must do this, use the object API, process each file one at a time, and delete the files when you are done. You are much better off building a queue-based system.
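For comparison, the SQS route suggested above needs only a few calls with the AWS Java SDK v1; the queue name and message payload here are invented for illustration:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class MetricsQueue {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = sqs.getQueueUrl("app-metrics").getQueueUrl();   // hypothetical queue name

        // App server side: each metric is one small message; SQS handles durability and replication.
        sqs.sendMessage(queueUrl, "{\"server\":\"app-042\",\"activeSessions\":57,\"ts\":1700000000}");

        // Warehouse loader side: poll, insert the rows into Oracle, then delete each message.
        for (Message message : sqs.receiveMessage(queueUrl).getMessages()) {
            // load message.getBody() into Oracle here, then acknowledge by deleting
            sqs.deleteMessage(queueUrl, message.getReceiptHandle());
        }
    }
}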

Store a file inside an object

I have a Java client/server desktop application where the communication between client and server is based on sockets, and the messages exchanged between client and server are serialized objects (message objects that encapsulate requests and responses).
Now I need to make the client able to upload a file from the local computer to the server, but I can't send the file through the buffer, since the buffer is already being used for exchanging the message objects.
Should I open another stream to send the file, or is there a better way to upload a file in my situation?
I need to make the client able to upload a file from the local computer to the server
- Open a dedicated connection to the server for file uploading.
- Use the File Transfer Protocol (FTP) to ease your work; it is quite easy and reliable to use Apache's Commons Net library for file uploading and downloading.
See this link:
http://commons.apache.org/net/
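As a rough illustration of the FTP suggestion, uploading a file over a dedicated connection with Apache Commons Net could look like this (host, credentials, and paths are placeholders):

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

import java.io.FileInputStream;
import java.io.InputStream;

public class FtpUpload {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");            // placeholder host
        ftp.login("user", "password");             // placeholder credentials
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);     // binary mode so the file is transferred unchanged

        try (InputStream in = new FileInputStream("/path/to/upload.dat")) {
            ftp.storeFile("upload.dat", in);       // upload over a connection dedicated to files
        } finally {
            ftp.logout();
            ftp.disconnect();
        }
    }
}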
You really only have two options:
Open another connection dedicated to the file upload and send it through that.
Make a message object representing bits of a file being uploaded, and send the file in chunks via these message objects.
The former seems simpler & cleaner to me, requiring less overhead and less complicated code.
You can keep your current solution and pass the file content inside an object, for example as a String; use Base64 encoding (or similar) of the content if it contains troublesome characters.
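A small sketch of that idea (file paths are placeholders); note that Base64 inflates the payload by roughly a third, so it is only sensible for modest file sizes:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class FileAsString {
    public static void main(String[] args) throws Exception {
        // Client side: encode the file content so it fits into the existing String-based message object.
        byte[] raw = Files.readAllBytes(Paths.get("/path/to/file.pdf"));
        String encoded = Base64.getEncoder().encodeToString(raw);
        // ... set `encoded` on the message object and send it as usual ...

        // Server side: decode the String back to bytes and write the file out.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        Files.write(Paths.get("/path/to/received.pdf"), decoded);
    }
}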
