Reading data from Azure Blob with Spark - java

I am having an issue reading data from Azure Blobs via Spark Streaming.
JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory");
Code like the above works for HDFS, but it is unable to read a file from an Azure Blob:
https://blobstorage.blob.core.windows.net/containerid/folder1/
The above is the path shown in the Azure UI, but it doesn't work. Am I missing something, and how can we access it?
I know Event Hubs are the ideal choice for streaming data, but my current situation demands using storage rather than queues.

In order to read data from Blob storage, there are two things that need to be done. First, you need to tell Spark which native file system to use in the underlying Hadoop configuration. This means that you also need the hadoop-azure JAR to be available on your classpath (note that there may be runtime requirements for additional JARs from the Hadoop family):
JavaSparkContext ct = new JavaSparkContext();
Configuration config = ct.hadoopConfiguration();
config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");
Now, reference the file using the wasb:// prefix (the [s] denotes an optional secure connection):
ssc.textFileStream("wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>");
It goes without saying that you'll need proper permissions set for the location making the query to Blob storage.
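Putting the two pieces together, a minimal end-to-end sketch could look like the following. This is only a sketch under a few assumptions: the hadoop-azure and azure-storage JARs are on the classpath, and youraccount, yourkey, mycontainer, folder1, and the 30-second batch interval are placeholders you would replace with your own values.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BlobStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("BlobStreamingSketch");
        JavaSparkContext ct = new JavaSparkContext(conf);

        // Point the Hadoop layer at the native Azure file system and supply the account key
        Configuration config = ct.hadoopConfiguration();
        config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        config.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");

        // Watch the container folder for new files, in 30-second batches
        JavaStreamingContext ssc = new JavaStreamingContext(ct, Durations.seconds(30));
        JavaDStream<String> lines =
                ssc.textFileStream("wasbs://mycontainer@youraccount.blob.core.windows.net/folder1/");
        lines.print();

        ssc.start();
        ssc.awaitTermination();
    }
}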

As a supplement, there is a tutorial about HDFS-compatible Azure Blob storage with Hadoop which is very helpful; please see https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage.
Meanwhile, there is an official sample on GitHub for Spark Streaming on Azure. Unfortunately, the sample is written in Scala, but I think it's still helpful for you.

df = spark.read.format("csv").load("wasbs://blob_container@account_name.blob.core.windows.net/example.csv", inferSchema = True)
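If you are working with the Java DataFrame API rather than PySpark, a roughly equivalent read might look like this (a sketch only; blob_container, account_name, and example.csv are the same placeholder names used above, and the account key configuration from earlier is assumed to be in place):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("ReadBlobCsv").getOrCreate();
Dataset<Row> df = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .load("wasbs://blob_container@account_name.blob.core.windows.net/example.csv");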

Related

Running ffprobe on file lying in Azure Blob storage

I am trying to get metadata of a file lying in Azure blob storage.
I am using ffprobe for this purpose. Though it works, since the ffprobe binary lies on my local system and the file lies in Blob storage, the entire process is too slow.
What would be the best way to do the above, i.e. get metadata for a remote file?
Two ways for your reference:
1. Use blob.downloadAttributes(), then blob.getMetadata() (see the sketch after this list):
This method populates the blob's system properties and user-defined metadata. Before reading or modifying a blob's properties or metadata, call this method or its overload to retrieve the latest values for the blob's properties and metadata from the Microsoft Azure storage service.
2. Use the Get Metadata activity in ADF (Azure Data Factory) to get a file's metadata.
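A minimal Java sketch of the first option, assuming the classic azure-storage SDK (com.microsoft.azure.storage); the connection string, container name, and blob name are placeholders, and the calls belong inside a method that declares the SDK's checked exceptions:

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

CloudStorageAccount account = CloudStorageAccount.parse("<your-storage-connection-string>");
CloudBlobClient client = account.createCloudBlobClient();
CloudBlobContainer container = client.getContainerReference("mycontainer");
CloudBlockBlob blob = container.getBlockBlobReference("videos/sample.mp4");

// Pull the latest system properties and user-defined metadata from the service
blob.downloadAttributes();

System.out.println("Length: " + blob.getProperties().getLength());
System.out.println("Content type: " + blob.getProperties().getContentType());
blob.getMetadata().forEach((key, value) -> System.out.println(key + " = " + value));

Note that this only returns the blob's stored properties and user-defined metadata; it will not compute media details the way ffprobe does unless you stored them as blob metadata yourself.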

Preprocess a large file using Nifi

We have files of up to 8 GB that contain structured content, but important metadata is stored on the last line of the file and needs to be appended to each line of content. It is easy to use a ReverseFileReader to grab this last line, but that requires the file to be static on disk, and I cannot find a way to do this within our existing Nifi flow. Is this possible before the data is streamed to the content repository?
Processing an 8 GB file in Nifi might be inefficient. You may try another option:
ListSFTP --> ExecuteSparkInteractive --> RouteOnAttributes ----> ....
Here, you don't need to actually flow the data through Nifi. Just pass the file location (which could be an HDFS or non-HDFS location) in a Nifi attribute and write either PySpark or Spark Scala code to read that file (you can run this code through ExecuteSparkInteractive). The code will be executed on the Spark cluster and only the job result will be sent back to Nifi, which you can further use to route your Nifi flow (using the RouteOnAttribute processor).
Note: You need a Livy setup to run Spark code from Nifi.
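As an illustration of what that Spark job could do, here is a rough Java sketch of the append-the-trailer idea (a sketch only: the input and output paths are placeholders, and it assumes the metadata is literally the file's last line):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("AppendTrailerMetadata").getOrCreate();
JavaRDD<String> lines = spark.read().textFile("hdfs://namenode:8020/input/bigfile.txt").javaRDD();

// The trailer (last line) holds the metadata; locate it by index
long lastIndex = lines.count() - 1;
String metadata = lines.zipWithIndex()
        .filter(t -> t._2() == lastIndex)
        .map(t -> t._1())
        .first();

// Append the trailer metadata to every content line except the trailer itself
JavaRDD<String> enriched = lines.zipWithIndex()
        .filter(t -> t._2() < lastIndex)
        .map(t -> t._1() + "," + metadata);

enriched.saveAsTextFile("hdfs://namenode:8020/output/bigfile-enriched");

The job result (for example, the output path or a simple success flag) is what you would hand back to Nifi for routing.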
Hope this is helpful.

How to compress files on azure data lake store

I'm using Azure Data Lake Store as a storage service for my Java app. Sometimes I need to compress multiple files; what I do for now is copy all the files to the server, compress them locally, and then send the zip to Azure. Even though this works, it takes a lot of time, so I'm wondering: is there a way to compress files directly on Azure? I checked the data-lake-store SDK, but there's no such functionality.
Unfortunately, at the moment there is no option to do that sort of compression.
There is an open feature request HTTP compression support for Azure Storage Services (via Accept-Encoding/Content-Encoding fields) that discusses uploading compressed files to Azure Storage, but there is no estimation on when this feature might be released.
The only option for you is to implement such a mechanism on your own (using an Azure Function for example).
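As a sketch of rolling your own mechanism, the following streams a few ADLS files through a ZipOutputStream straight back into a new ADLS file, so nothing is staged on your own server. It assumes the ADLS Gen1 Java SDK (ADLStoreClient with getReadStream/createFile; double-check the exact signatures against the SDK docs), the endpoint, credentials, and paths are placeholders, and the code belongs inside a method that declares throws IOException. Run inside Azure (for example, in an Azure Function), the bytes stay close to the store, although they still pass through whatever compute executes the code.

import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.IfExists;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Placeholder credentials and account; use your own AAD app and store name
AccessTokenProvider provider =
        new ClientCredsTokenProvider("<authTokenEndpoint>", "<clientId>", "<clientSecret>");
ADLStoreClient client = ADLStoreClient.createClient("<account>.azuredatalakestore.net", provider);

String[] sources = { "/data/file1.csv", "/data/file2.csv" };

// Zip the source files directly into a new ADLS file, entry by entry
try (OutputStream out = client.createFile("/data/archive.zip", IfExists.OVERWRITE);
     ZipOutputStream zip = new ZipOutputStream(out)) {
    byte[] buffer = new byte[8192];
    for (String path : sources) {
        zip.putNextEntry(new ZipEntry(path.substring(path.lastIndexOf('/') + 1)));
        try (InputStream in = client.getReadStream(path)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                zip.write(buffer, 0, n);
            }
        }
        zip.closeEntry();
    }
}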
Hope it helps!

Upload File to Cloud Storage directly using SignedURL

I am trying to upload a file directly to Google Cloud Storage using the Java Client Library.
The code I have written is:
Instead of uploading the new file to Cloud Storage, I am getting this output:
What am I missing in the code to make the upload to Cloud Storage work?
You need to configure the authorization keys: a .json key file referenced in your environment. See the documentation: https://cloud.google.com/iam/docs/creating-managing-service-account-keys#iam-service-account-keys-create-gcloud
I don't think you have the correct "BUCKET_NAME" set, please compare the bucket name you are using with your bucket name on your Google Cloud Console so you can see if it's set correctly.
The way it's set, it looks like the compiler resolved a different overload of your BlobInfo.newBuilder call.
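For reference, a hedged sketch of the signed-URL flow with the GCS Java client library is below. The bucket name, object name, and local file path are placeholders, the service-account .json key mentioned above must be available for signing, and the code belongs inside a method that declares throws IOException:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.HttpMethod;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

Storage storage = StorageOptions.getDefaultInstance().getService();

// Placeholder bucket/object names; the bucket must exist in your project
BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of("my-bucket", "uploads/report.csv")).build();

// Generate a V4 signed URL that allows a PUT for the next 15 minutes
URL signedUrl = storage.signUrl(blobInfo, 15, TimeUnit.MINUTES,
        Storage.SignUrlOption.httpMethod(HttpMethod.PUT),
        Storage.SignUrlOption.withV4Signature());

// Upload the file bytes directly to the signed URL
HttpURLConnection connection = (HttpURLConnection) signedUrl.openConnection();
connection.setDoOutput(true);
connection.setRequestMethod("PUT");
try (OutputStream out = connection.getOutputStream()) {
    out.write(Files.readAllBytes(Paths.get("/tmp/report.csv")));
}
System.out.println("Upload response code: " + connection.getResponseCode());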

Create a temporary file, then upload it using FTP (Java webapp)

Users of my web application have an option to start a process that generates a CSV file (populated by some data from a database) and uploads it to an FTP server (and another department will read the file from there). I'm just trying to figure out how to best implement this. I use commons net ftp functionality. It offers two ways to upload data to the FTP server:
storeFile(String remote, InputStream local)
storeFileStream(String remote)
It can take a while to generate all the CSV data so I think keeping a connection open the whole time (storeFileStream) would not be the best way. That's why I want to generate a temporary file, populate it and only then transfer it.
What is the best way to generate a temporary file in a webapp? Is it safe and recommended to use File.createTempFile?
As long as you don't create thousands of CSV files concurrently, the upload time doesn't matter from my point of view. Databases usually output the data row by row, and if this is already the format you need for the CSV file, I strongly recommend not using temporary files at all; just do the conversion on the fly:
Create an InputStream implementation that reads the database data row by row, converts it to CSV and publishes the data via its read() methods.
BTW: You mentioned that the conversion is done by a web application and that it can take a long time. This can be problematic, as the default web client has a timeout. Therefore, the long-running process is better done by a background thread that is only triggered by the webapp interface.
It is OK to use createTempFile; new File(tmpDir, UUID.randomUUID().toString()) can do as well. Just do not use deleteOnExit(), it is a leak master. Make sure you delete the file on your own.
Edit: since you WILL have the data in memory, do not store it anywhere; wrap it in a java.io.ByteArrayInputStream and use the method with the InputStream. A much neater and better solution.
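A short sketch of that in-memory approach with Commons Net is below (the host, credentials, remote path, and CSV content are placeholders, and the snippet belongs inside a method that declares throws IOException):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

// Build the CSV in memory first (placeholder content)
ByteArrayOutputStream csv = new ByteArrayOutputStream();
csv.write("id,name\n".getBytes(StandardCharsets.UTF_8));
csv.write("1,example\n".getBytes(StandardCharsets.UTF_8));

FTPClient ftp = new FTPClient();
try {
    ftp.connect("ftp.example.com");
    ftp.login("user", "password");
    ftp.enterLocalPassiveMode();
    ftp.setFileType(FTP.BINARY_FILE_TYPE);

    // The connection is only held open for the transfer itself
    try (InputStream in = new ByteArrayInputStream(csv.toByteArray())) {
        ftp.storeFile("/outgoing/export.csv", in);
    }
    ftp.logout();
} finally {
    if (ftp.isConnected()) {
        ftp.disconnect();
    }
}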
