I have been trying to access NiFi FlowFile attributes from a Kafka message in Spark Streaming. I am using Java as the language.
The scenario is that NiFi reads binary files from an FTP location using the GetSFTP processor and publishes byte[] messages to Kafka using the PublishKafka processor. These byte[] messages are converted to ASCII data by a Spark Streaming job, and the decoded ASCII is written back to Kafka for further processing as well as saved to HDFS using a NiFi processor.
My problem is that I cannot keep track of which binary file a decoded ASCII file came from. I have to add a header section (filename, file size, record count, etc.) to my decoded ASCII, but I have failed to figure out how to access the filename from the NiFi FlowFile through the KafkaConsumer object. Is there a way I can do this using standard NiFi processors? Or please share any other suggestions for achieving this functionality. Thanks.
So your data flow is:
FTP -> NiFi -> Kafka -> Spark Streaming -> Kafka -> NiFi -> HDFS
?
Kafka doesn't currently have metadata attributes on each message (although I believe this may be coming in Kafka 0.11), so when NiFi publishes a message to a topic, it can't pass along the flow file attributes with the message.
You would have to construct some type of wrapper data format (maybe JSON or Avro) that contains the original content plus the additional attributes you need, so that you can publish that whole thing as the content of one message to Kafka.
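For illustration, here is a minimal sketch of the JSON-wrapper idea on the publishing side, using Jackson and the plain Kafka producer API. The broker address, topic name, file path, and JSON field names are placeholders, not anything prescribed by NiFi:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class WrappedPublisher {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Wrap the original binary content plus the attributes you care about
        // (filename, size, ...) into one JSON document and send it as the message value.
        byte[] originalContent = Files.readAllBytes(Paths.get("input-file-001.bin")); // placeholder file
        Map<String, Object> envelope = new HashMap<>();
        envelope.put("filename", "input-file-001.bin"); // would come from the FlowFile attribute
        envelope.put("fileSize", originalContent.length);
        envelope.put("content", Base64.getEncoder().encodeToString(originalContent));

        String json = new ObjectMapper().writeValueAsString(envelope);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("binary-files", json)); // placeholder topic
        }
    }
}
```

The Spark Streaming job can then parse the JSON, decode the content, and still know which filename each decoded record belongs to.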
Also, I don't know exactly what you are doing in your Spark Streaming job, but is there a reason you can't just do that part in NiFi? It doesn't sound like anything complex involving windowing or joins, so you could potentially simplify things a bit and have NiFi do the decoding, then have NiFi write it to Kafka and to HDFS.
Related
I have a streaming application where I am using Flink. In it, I read from a Kafka source whose messages contain a file id. That file contains data which I need for further processing, and this data is to be persisted in a database.
Read Kafka message -> Read file id from kafka -> Read file using that file id -> Write records to DB
Now, for each Kafka input, once all the above processing is done, I need to notify the upstream people that I have completed processing that input Kafka message, by producing another Kafka message signalling completion.
Note: The input Kafka messages will keep on coming at different intervals.
Right now, I'm stuck on figuring out when I should write the final completion message to Kafka in the Flink pipeline. How can I know that I have finished processing the file and that it is now time to produce the final Kafka message?
P.S.: I'm a newbie to Apache Flink.
Thanks in advance.
I was asked to write code to send a .csv file to S3 using Amazon Kinesis Firehose. But as someone who has never used Kinesis, I have no idea how I should do this. Can you help with this? If you already have code that does this job, that would also help (Java or Scala).
The CSV data should be sent to Kinesis Firehose to be written to an S3 bucket in gzip format using a Firehose client application.
Thanks in advance.
Firstly, Firehose is for streaming a record (or records) to a destination; it is not a file-transfer tool for copying a CSV file to S3. You can use the S3 CLI commands if you need to copy files from somewhere to S3.
So please first make sure whether what you need is streaming or a file copy. If it is not streaming, then I wonder why Firehose.
There are multiple input sources you can use, so first decide which way to go.
If you use Java + the AWS SDK, then the PutRecord API call would probably be the way; a minimal sketch follows the links below.
Writing to Kinesis Data Firehose Using the AWS SDK
aws-sdk-java/src/samples/AmazonKinesisFirehose/
Put data to Amazon Kinesis Firehose delivery stream using Spring Boot
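As a rough illustration of the PutRecord route with the AWS SDK for Java (v1), assuming a delivery stream already exists; the stream name and file path are placeholders:

```java
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CsvToFirehose {

    public static void main(String[] args) throws Exception {
        AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

        // Read the CSV and send each line as one Firehose record.
        List<String> lines = Files.readAllLines(Paths.get("data.csv")); // placeholder path
        for (String line : lines) {
            Record record = new Record()
                    .withData(ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8)));
            PutRecordRequest request = new PutRecordRequest()
                    .withDeliveryStreamName("my-delivery-stream")        // placeholder stream name
                    .withRecord(record);
            firehose.putRecord(request);
        }
    }
}
```

For anything beyond a handful of rows, PutRecordBatch is the more efficient call, and the gzip-to-S3 part is configured on the delivery stream's S3 destination rather than in the client.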
If you can use AWS Amazon Linux to send the data to Firehose, the Firehose Agent (Kinesis Agent) will be easier. It just monitors a file and sends the deltas to Firehose (and on to S3).
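If you go the agent route, the behaviour is driven by a JSON config on the instance (typically /etc/aws-kinesis/agent.json). A minimal sketch, with the file pattern and delivery stream name as placeholders, might look roughly like this:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/data/exports/*.csv",
      "deliveryStream": "my-delivery-stream"
    }
  ]
}
```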
I'm in the process of migrating an ACID-based monolith to an event-based microservice architecture. In the monolith potentially large files are stored in a database and I want to share this information (including the file content) with the microservices.
My approach would be to split the file into numbered blocks and send several messages (e.g. one FileCreatedMessage with metadata and an id, followed by n FileContentMessage messages containing a block and its sequence number). On the receiving side, messages may not arrive in order, so I'd store the blocks from the messages, then order and join them and store the result.
Is there any approach which allows me to stream the data through Kafka in a single message, or another approach that avoids the overhead of implementing the splitting, ordering, and joining logic for several messages?
I noticed Kafka Streams. It seems to solve different problems than this one.
Kafka is not the right approach for sending large files. First, you need to ensure that the chunks of one message go to the same partition, so that they will be processed by one instance of the consumer. The weak point here is that your consumer may fail in the middle, losing the chunks it has gathered. If you store the chunks in some storage (a database) until all of them arrive, then you will need a separate process to assemble them. You will also need to think about what happens if you lose a chunk or hit an error while processing a chunk. We thought about this question in our company and decided not to send files through Kafka at all, but to keep them in storage and send a reference to them inside the message.
This article summarizes pros and cons.
Kafka Streams will not help you here, as it is a framework that provides high-level constructs for working with streams, but it still just works over Kafka.
I try not to use Kafka to hold large file content. Instead, I store the file on a distributed file system (usually HDFS, but there are other good ones) and then put the URI into the Kafka message along with any other metadata I need. You do need to be careful of replication times within the distributed file system if you process your Kafka topic on a distributed streaming execution platform (e.g. Storm or Flink). There may be instances where the Kafka message is processed before the DFS can replicate the file for access by the local system, but that's easier to solve than the problems caused by storing large file content in Kafka.
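As a small sketch of that pattern (the NameNode address, paths, topic, and JSON field names are all placeholders, and it assumes the Hadoop and Kafka client libraries are on the classpath): copy the file to HDFS first, then publish only its URI plus a little metadata:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.net.URI;
import java.util.Properties;

public class FileReferencePublisher {

    public static void main(String[] args) throws Exception {
        // 1. Put the large file on the distributed file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // placeholder NameNode
        Path source = new Path("/local/data/file-0001.bin"); // placeholder local path
        Path target = new Path("/incoming/file-0001.bin");   // placeholder HDFS path
        fs.copyFromLocalFile(source, target);
        String uri = fs.makeQualified(target).toString();
        long size = fs.getFileStatus(target).getLen();

        // 2. Publish only the URI (plus any metadata) to Kafka.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String payload = "{\"uri\":\"" + uri + "\",\"sizeBytes\":" + size + "}";
            producer.send(new ProducerRecord<>("file-events", "file-0001", payload)); // placeholder topic/key
        }
    }
}
```

Consumers then fetch the file from HDFS themselves, so the Kafka message stays tiny regardless of the file size.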
We have files of up to 8 GB that contain structured content, but important metadata is stored on the last line of the file, and it needs to be appended to each line of content. It is easy to use a ReverseFileReader to grab this last line, but that requires the file to be static on disk, and I cannot find a way to do this within our existing NiFi flow. Is this possible before the data is streamed to the content repository?
Processing an 8 GB file in NiFi might be inefficient. You may try another option:
ListSFTP --> ExecuteSparkInteractive --> RouteOnAttribute ----> ....
Here, you don't need to actually flow the data through NiFi. Just pass the file location (it could be an HDFS or non-HDFS location) in a NiFi attribute and write either PySpark or Spark Scala code to read that file (you can run this code through ExecuteSparkInteractive). The code will be executed on the Spark cluster and only the job result will be sent back to NiFi, which you can then use to route your NiFi flow (using the RouteOnAttribute processor).
Note: You need a Livy setup to run Spark code from NiFi.
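For illustration only, here is roughly what the Spark side of that flow could look like. The answer suggests PySpark or Scala submitted through Livy; this Java sketch just shows the same logic (read the file, grab the footer line, append it to every content line), with the input and output paths passed in as arguments rather than taken from any real NiFi attribute names:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

public class FooterAppendJob {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("footer-append").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        String inputPath = args[0];  // file location handed over from the NiFi attribute

        JavaRDD<String> lines = jsc.textFile(inputPath);
        JavaPairRDD<String, Long> indexed = lines.zipWithIndex();
        long lastIndex = lines.count() - 1;

        // The footer is the line with the highest index.
        String footer = indexed.filter(t -> t._2() == lastIndex).keys().first();
        Broadcast<String> footerBc = jsc.broadcast(footer);

        // Append the footer metadata to every content line and write the result out.
        JavaRDD<String> enriched = indexed
                .filter(t -> t._2() < lastIndex)
                .keys()
                .map(line -> line + "," + footerBc.value());

        enriched.saveAsTextFile(args[1]);  // output location, e.g. an HDFS directory

        spark.stop();
    }
}
```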
Hope this is helpful.
I have a requirement to continuously transmit large video files from one system to another. Is Kafka suitable for transmitting large media content? What are the considerations that I must take into account before opting for this solution?
You could use Kafka to send messages which point to an external reference to the large video files. Then the receiver can download the file from this external storage (for example, Amazon S3 buckets). This is called the "Claim Check Pattern" and is documented here: http://www.enterpriseintegrationpatterns.com/patterns/messaging/StoreInLibrary.html
However, Kafka is not designed to transport the large video files themselves. It is not a managed file transfer tool.
You could chunk up the files and put each chunk as a message into Kafka, then recombine the chunks on the other end. By default the maximum Kafka message size is 1 MB, but this is configurable.
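If you do go the chunking route, a minimal producer-side sketch might look like the following (the topic, file path, and chunk size are placeholders, and record headers require Kafka 0.11+). Keying every chunk by the file id keeps the chunks on one partition and in order, which a consumer needs in order to reassemble the file:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;

public class FileChunkProducer {

    private static final int CHUNK_SIZE = 900 * 1024; // stay under the default 1 MB message limit

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        String fileId = "video-0001"; // placeholder id, also used as the message key
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             InputStream in = Files.newInputStream(Paths.get("video-0001.mp4"))) { // placeholder path

            byte[] buffer = new byte[CHUNK_SIZE];
            int chunkIndex = 0;
            int read;
            while ((read = in.read(buffer)) > 0) {
                byte[] chunk = Arrays.copyOf(buffer, read);
                // Same key -> same partition -> chunks arrive in order at one consumer.
                ProducerRecord<String, byte[]> record =
                        new ProducerRecord<>("video-chunks", fileId, chunk); // placeholder topic
                record.headers().add("chunkIndex",
                        Integer.toString(chunkIndex++).getBytes(StandardCharsets.UTF_8));
                producer.send(record);
            }

            // Final empty marker so the consumer knows how many chunks to expect.
            ProducerRecord<String, byte[]> done =
                    new ProducerRecord<>("video-chunks", fileId, new byte[0]);
            done.headers().add("totalChunks",
                    Integer.toString(chunkIndex).getBytes(StandardCharsets.UTF_8));
            producer.send(done);
        }
    }
}
```

The consumer side then has to buffer and reassemble the chunks, handle missing or duplicate chunks, and deal with failures mid-file, which is exactly the overhead the earlier answers recommend avoiding by sending a reference instead.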