Preprocess a large file using NiFi - Java

We have files of up to 8 GB that contain structured content, but important metadata is stored on the last line of the file and needs to be appended to each line of content. It is easy to use a ReverseFileReader to grab this last line, but that requires the file to be static on disk, and I cannot find a way to do this within our existing NiFi flow. Is this possible before the data is streamed to the content repository?

Processing an 8 GB file inside NiFi might be inefficient. You may want to try another option:
ListSFTP --> ExecuteSparkInteractive --> RouteOnAttribute --> ....
Here, you don't need to actually flow the data through NiFi. Just pass the file location (which could be an HDFS or non-HDFS location) in a NiFi attribute and write either PySpark or Spark Scala code to read that file (you can run this code through ExecuteSparkInteractive). The code will be executed on the Spark cluster and only the job result will be sent back to NiFi, which you can then use to route your NiFi flow (using the RouteOnAttribute processor).
Note: you need a Livy setup to run Spark code from NiFi.
Hope this is helpful.
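For illustration, here is a minimal sketch of the transformation such a Spark job could perform, shown with Spark's Java API (in practice the snippet submitted through ExecuteSparkInteractive/Livy would typically be PySpark or Scala). The input path, the comma delimiter, and the output location are assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AppendTrailerLine {
    public static void main(String[] args) {
        // Hypothetical input path, e.g. passed in from a NiFi flow file attribute.
        String inputPath = args[0];

        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("AppendTrailerLine"));
        JavaRDD<String> lines = sc.textFile(inputPath);

        // Grab the metadata trailer: the last line of the file.
        long total = lines.count();
        String trailer = lines
                .zipWithIndex()
                .filter(t -> t._2() == total - 1)
                .map(t -> t._1())
                .first();

        // Append the trailer to every content line and drop the trailer line itself
        // (the comma delimiter is an assumption).
        JavaRDD<String> enriched = lines
                .zipWithIndex()
                .filter(t -> t._2() < total - 1)
                .map(t -> t._1() + "," + trailer);

        enriched.saveAsTextFile(inputPath + "_enriched");  // assumed output location
        sc.stop();
    }
}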

Related

Read a CSV file kept on the local filesystem in Spark using Java in cluster mode

I am trying to read a CSV file kept on my local filesystem in UNIX; while running in cluster mode, Spark is not able to find the CSV file.
In local mode, it can read both HDFS and file:/// files. However, in cluster mode, it can only read HDFS files.
Is there any suitable way to read the file without copying it into HDFS?
Remember that the executors need to be able to access the file, so you have to look at this from the executor nodes' point of view. Since you mention HDFS works, that means the executor nodes have access to your HDFS cluster.
If you want the Spark cluster to access a local file, consider exporting it over NFS/SMB etc. However, something will still end up copying the data.
I can update my answer if you add more details about your architecture.
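If a shared mount is available, a minimal sketch of what that could look like from the Java API; the /mnt/shared path is an assumption and must resolve on every executor node:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadSharedCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadSharedCsv").getOrCreate();

        // In cluster mode a file:/// path is resolved on each executor node,
        // so /mnt/shared must be the same NFS/SMB mount on every node (hypothetical path).
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("file:///mnt/shared/data.csv");

        df.show();
        spark.stop();
    }
}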

Spooldir source stops processing

I have a remote server which generates files. The server pushes files every 15 min to the Hadoop cluster, and these files are stored in a specific directory. We use Flume to read the files from that local directory and send them to HDFS, and the SpoolDir source seemed the suitable way to process the data.
The problem is that Flume stops processing when a file is still being written into the directory.
I don't know how to make the Flume spooldir source wait until a file is completely written before processing it, or how to block reading of the file until it is completely written, using a shell script or a processor.
Can someone help me?
Set the pollDelay property for the spooling directory source.
The spool dir source polls the given directory for new files at a specific interval.
The default value is 500 ms, which is too fast for many systems, so you should configure it accordingly.
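As a sketch only, an agent configuration could look like the following; the agent, source and channel names, the spool directory, and the 30-second delay are assumptions, and the existing HDFS sink configuration is omitted:

# Hypothetical agent/source/channel names; adjust to your existing configuration.
agent1.sources = src1
agent1.channels = ch1

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /data/incoming
agent1.sources.src1.channels = ch1
# Poll for new files every 30 seconds instead of the default 500 ms.
agent1.sources.src1.pollDelay = 30000

agent1.channels.ch1.type = memory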

NiFi Flowfile Attributes from KafkaConsumer

I have been trying to access NiFi flow file attributes from a Kafka message in Spark Streaming. I am using Java as the language.
The scenario is that NiFi reads binary files from an FTP location using the GetSFTP processor and publishes byte[] messages to Kafka using the PublishKafka processor. These byte[] messages are converted to ASCII data by a Spark Streaming job, and the decoded ASCII is written back to Kafka for further processing as well as saved to HDFS using a NiFi processor.
My problem is that I cannot keep track of the binary filename and the decoded ASCII file. I have to add a header section (for filename, file size, record count etc.) to my decoded ASCII, but I have failed to figure out how to access the file name of the NiFi flow file from the KafkaConsumer object. Is there a way that I can do this using standard NiFi processors? Or please share any other suggestions to achieve this functionality. Thanks.
So your data flow is:
FTP -> NiFi -> Kafka -> Spark Streaming -> Kafka -> NiFi -> HDFS
?
Kafka doesn't currently have metadata attributes on each message (although I believe this may be coming in Kafka 0.11), so when NiFi publishes a message to a topic, it can't pass along the flow file attributes with the message.
You would have to construct some type of wrapper data format (maybe JSON or Avro) that contains the original content plus the additional attributes you need, so that you can publish the whole thing as the content of one message to Kafka.
Also, I don't know exactly what you are doing in your Spark Streaming job, but is there a reason you can't just do that part in NiFi? It doesn't sound like anything complex involving windowing or joins, so you could potentially simplify things a bit and have NiFi do the decoding, then have NiFi write it to Kafka and to HDFS.
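To illustrate the shape such a wrapper could take (whether it is built in NiFi or in custom code), here is a minimal Java sketch using Jackson and the Kafka producer API; the broker address, topic name and attribute names are assumptions:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Base64;
import java.util.Properties;

public class WrappedPublisher {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");           // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        byte[] binaryContent = new byte[]{0x01, 0x02, 0x03};      // stand-in for the file bytes
        ObjectMapper mapper = new ObjectMapper();

        // Wrap the flow file attributes and the Base64-encoded payload in one JSON document.
        ObjectNode wrapper = mapper.createObjectNode();
        wrapper.put("filename", "input-001.bin");                 // e.g. the NiFi filename attribute
        wrapper.put("filesize", binaryContent.length);
        wrapper.put("content", Base64.getEncoder().encodeToString(binaryContent));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("binary-files", mapper.writeValueAsString(wrapper)));
        }
    }
}

The Spark Streaming consumer can then parse the JSON, decode the content, and carry the filename and size through to the header section it writes out.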

Reading data from Azure Blob with Spark

I am having an issue reading data from Azure blobs via Spark Streaming:
JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory");
Code like the above works for HDFS, but it is unable to read a file from an Azure blob:
https://blobstorage.blob.core.windows.net/containerid/folder1/
The above is the path shown in the Azure UI, but it doesn't work. Am I missing something, and how can we access it?
I know Event Hubs are the ideal choice for streaming data, but my current situation demands using storage rather than queues.
In order to read data from blob storage, two things need to be done. First, you need to tell Spark which native file system to use in the underlying Hadoop configuration. This means that you also need the hadoop-azure JAR to be available on your classpath (note there may be runtime requirements for more JARs related to the Hadoop family):
JavaSparkContext ct = new JavaSparkContext();
Configuration config = ct.hadoopConfiguration();
config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");
Now, reference the file using the wasb:// prefix (note the [s] denotes an optional secure connection):
ssc.textFileStream("wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>");
It goes without saying that you'll need proper permissions set from the location making the query to blob storage.
As a supplement, there is a tutorial about HDFS-compatible Azure Blob storage with Hadoop which is very helpful; please see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage.
Meanwhile, there is an official sample on GitHub for Spark Streaming on Azure. Unfortunately, the sample is written in Scala, but I think it's still helpful for you.
For a batch read with PySpark, the equivalent is:
df = spark.read.format("csv").load("wasbs://blob_container@account_name.blob.core.windows.net/example.csv", inferSchema = True)
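Putting the pieces together, a minimal self-contained Java sketch of the streaming case; the account name, key, container and path are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class AzureBlobStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("AzureBlobStream");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Point the underlying Hadoop configuration at the native Azure file system.
        Configuration hadoopConf = ssc.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        hadoopConf.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");

        // Watch a container path for new text files (wasbs:// for the secure endpoint).
        JavaDStream<String> lines = ssc.textFileStream(
                "wasbs://yourcontainer@youraccount.blob.core.windows.net/folder1/");
        lines.print();

        ssc.start();
        ssc.awaitTermination();
    }
}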

How do I know that an Apache Camel route has no more files to copy

I am writing a simple command line application which copies files from an FTP server to a local drive. Let's assume that I am using the following route definition:
File tmpFile = File.createTempFile("repo", "dat");
IdempotentRepository<String> repository = FileIdempotentRepository.fileIdempotentRepository(tmpFile);
from("{{ftp.server}}")
.idempotentConsumer(header("CamelFileName"), repository)
.to("file:target/download")
.log("Downloaded file ${file:name} complete.");
where ftp.server is something like:
ftp://ftp-server.com:21/mypath?username=foo&password=bar&delay=5
Let's assume that the files on the FTP server will not change over time. How do I check whether the copying has finished or there are still more files to copy? I need this because I want to exit my app once all files are copied.
Read about the batch consumer:
http://camel.apache.org/batch-consumer.html
The FTP consumer will set some exchange properties with the number of files, whether it's the last file, etc. A sketch of how that can be used follows below.
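For example, a minimal sketch that reacts to the last exchange of a poll using the standard batch-consumer properties (CamelBatchSize, CamelBatchComplete); the shutdown signalling is only hinted at and the idempotent repository from the question is omitted for brevity:

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FtpBatchRoute extends RouteBuilder {
    private static final Logger LOG = LoggerFactory.getLogger(FtpBatchRoute.class);

    @Override
    public void configure() {
        from("{{ftp.server}}")
            .to("file:target/download")
            .log("Downloaded file ${file:name} complete.")
            .process(exchange -> {
                // Properties the batch (FTP) consumer sets on every exchange of a poll.
                Integer size = exchange.getProperty(Exchange.BATCH_SIZE, Integer.class);
                Boolean complete = exchange.getProperty(Exchange.BATCH_COMPLETE, Boolean.class);
                if (Boolean.TRUE.equals(complete)) {
                    LOG.info("Last file of this poll processed ({} files in total).", size);
                    // e.g. count down a latch here so the main method can stop the context and exit
                }
            });
    }
}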
Do you have any control over the end that publishes the FTP files? E.g. is it your server and your client or can you make a request as a customer?
If so, you could ask for a flag file to be added at the end of their batch process. This is a single-byte file with an agreed name that you watch for; when that file appears, you know the batch is complete.
This is a useful technique if you regularly pull down huge files that take a long time for a batch process to copy to disk at the server end, e.g. a file produced by some streaming process.
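One hedged sketch of how the flag file could be picked up in a Camel route, consuming only the agreed file via the fileName consumer option; the flag file name and the direct: endpoint that starts the real copy are assumptions:

import org.apache.camel.builder.RouteBuilder;

public class FlagFileRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Poll only for the agreed flag file; its arrival means the remote batch is complete.
        from("{{ftp.server}}&fileName=batch.complete")
            .log("Flag file found - remote batch is complete, starting the copy.")
            // Hand off to another (hypothetical) route that copies the actual data files.
            .to("direct:startCopy");
    }
}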
