I have an Azure Blob with many containers. Each container has multiple folders - and each folder has a bunch of files in it. I want to be able to grab all of the files and return them zipped. I'm currently only able to get one file at a time...
public void downloadAllFromBlob(String containerName) {
    CloudBlobClient blobClient = this.storageAccount.createCloudBlobClient();
    try {
        CloudBlobContainer container = blobClient.getContainerReference(containerName);
        if (container.exists()) {
            // I want to grab all the files in the container and zip them
            for (ListBlobItem blobItem : container.listBlobs()) {
                // I'm only able to list/VIEW the blobs, not go into one and get its contents
            }
        }
    } catch (URISyntaxException | StorageException e) {
        // handle/log the exception
    }
}
Unfortunately there's no batch retrieve capability in Azure Blob Storage. You need to download each blob individually, as you're doing above. You can retrieve blobs in parallel to speed things up.
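Not part of the original answer, but as a rough sketch with the same legacy (v8) SDK used in the question: list the container with a flat listing and stream every blob into a single ZIP on disk. The method name and the all-blobs.zip output path are placeholders, and parallelism is left out for brevity.
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import com.microsoft.azure.storage.blob.CloudBlob;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.ListBlobItem;

public void downloadAllFromBlobAsZip(String containerName) throws Exception {
    CloudBlobClient blobClient = this.storageAccount.createCloudBlobClient();
    CloudBlobContainer container = blobClient.getContainerReference(containerName);
    try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("all-blobs.zip"))) {
        // true = flat listing, so blobs inside "folders" are returned as well
        for (ListBlobItem item : container.listBlobs(null, true)) {
            if (item instanceof CloudBlob) {
                CloudBlob blob = (CloudBlob) item;
                // keep the folder/file path as the entry name inside the zip
                zip.putNextEntry(new ZipEntry(blob.getName()));
                blob.download(zip); // stream the blob's contents into the zip entry
                zip.closeEntry();
            }
        }
    }
}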
Related
I have a Java application which I use to upload some artifacts to Azure Blob Storage. Now I am trying to move these files from one container to another. Below is sample code showing what I currently do.
BlobServiceClient blobServiceClient = new BlobServiceClientBuilder().connectionString(CONNECTION_STRING).buildClient();
BlobContainerClient releaseContainer = blobServiceClient.getBlobContainerClient(RELEASE_CONTAINER);
BlobContainerClient backupContainer = blobServiceClient.getBlobContainerClient(BACKUP_CONTAINER);
for (BlobItem blobItem : releaseContainer.listBlobs()) {
    BlobClient destBlobClient = backupContainer.getBlobClient(blobItem.getName());
    BlobClient sourceBlobClient = releaseContainer.getBlobClient(blobItem.getName());
    destBlobClient.copyFromUrl(sourceBlobClient.getBlobUrl());
    sourceBlobClient.delete();
}
Is there a more straightforward way to do this?
Also, if this is the way to do it, how can I delete the old file? sourceBlobClient.delete() doesn't work right now.
There is a similar discussion on SO; can you refer to the suggestion there and let me know the status?
Additional information: have you checked the AzCopy tool? It is one of the best tools for moving/copying data.
I see you have posted a similar query in the Q&A Forum.
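Not something from the original comments, but one way to make the delete reliable with the v12 SDK used in the question is to wait for the server-side copy to finish before deleting the source, e.g. with beginCopy and its poller. This is only a sketch: the two-second poll interval is arbitrary, releaseContainer/backupContainer are the clients from the question's code, and you may need to append a SAS token to the source URL if the source blobs are not readable anonymously.
import com.azure.core.util.polling.SyncPoller;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.models.BlobCopyInfo;
import com.azure.storage.blob.models.BlobItem;
import java.time.Duration;

for (BlobItem blobItem : releaseContainer.listBlobs()) {
    BlobClient sourceBlobClient = releaseContainer.getBlobClient(blobItem.getName());
    BlobClient destBlobClient = backupContainer.getBlobClient(blobItem.getName());

    // start an asynchronous server-side copy and block until it completes
    SyncPoller<BlobCopyInfo, Void> poller =
            destBlobClient.beginCopy(sourceBlobClient.getBlobUrl(), Duration.ofSeconds(2));
    poller.waitForCompletion();

    // only delete the source once the copy has finished
    sourceBlobClient.delete();
}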
You can move files using the rename method from the azure-storage-file-datalake Java client.
DataLakeServiceClient storageClient = new DataLakeServiceClientBuilder().endpoint(endpoint).credential(credential).buildClient();
DataLakeFileSystemClient dataLakeFileSystemClient = storageClient.getFileSystemClient("storage");
DataLakeFileClient fileClient = dataLakeFileSystemClient.getFileClient("src/path");
fileClient.rename("storage", "dest/path");
See the rename method documentation for details.
Note that the azure-storage-file-datalake Java client gives you good options for working with your storage account as a filesystem, but be aware that it offers no way to copy files.
I'm somewhat of a beginner and have never dealt with cloud-based solutions yet before.
My program uses the PDFBox library to extract data from PDFs and rename the file based on the data. It's all local currently, but eventually will need to be deployed as an Azure Function. The PDFs will be stored in an Azure Blob Container - the Azure Blob Storage trigger for Azure Functions is an important reason for this choice.
Of course I can download the blob locally and read it, but the program should run solely in the cloud. I've tried reading the blobs directly using Java, but this resulted in gibberish data and wasn't compatible with PDFBox. My plan for now is to temporarily store the files elsewhere in the cloud (e.g. OneDrive, Azure File Storage) and try opening them from there. However, this seems like it could quickly turn into an overly messy solution. My questions:
(1) Is there any way a blob can be opened as a File, rather than a CloudBlockBlob so this additional step isn't needed?
(2) If not, what would be a recommended temporary storage in this case?
(3) Are there any alternative ways to approach this issue?
Since you are planning an Azure Function, you can use the blob trigger/binding to get the bytes directly. Then you can use PDFBox's load method to build the document object directly with PDDocument.load(content). You won't need any temporary storage to load the file.
#FunctionName("blobprocessor")
public void run(
#BlobTrigger(name = "file",
dataType = "binary",
path = "myblob/{name}",
connection = "MyStorageAccountAppSetting") byte[] content,
#BindingName("name") String filename,
final ExecutionContext context
) {
context.getLogger().info("Name: " + filename + " Size: " + content.length + " bytes");
PDDocument doc = PDDocument.load(content);
// do your stuffs
}
I want to copy data from a blob path, e.g. storageaccount/container/folder1/folder2/folder3. Now I want to copy folder3's data to a blob container in another subscription.
I am using Java and the Azure SDK, with startCopy to copy from source to destination using SAS, but every time it says that the blob does not exist.
But if I give the source path like this: storageaccount/container/folder1/folder2/folder3/xyz.txt, then it is able to copy the data from source to destination.
Can't we copy the whole of folder3's data to the destination instead of looping through all the files?
You mention the startCopy method, so I suppose you are using the v8 SDK. When you use storageaccount/container/folder1/folder2/folder3 it says the blob does not exist because you are only providing a directory, while startCopy needs a CloudBlockBlob object.
So the right way is to list the blobs under the directory, then loop through them and copy each blob. Below is my test code; for the test I just copy a directory to another container.
CloudStorageAccount storageAccount = CloudStorageAccount.parse(connectStr);
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
try {
    CloudBlobContainer container = blobClient.getContainerReference("test");
    Iterable<ListBlobItem> blobs = container.listBlobs("testfolder/");
    CloudBlobContainer destcontainer = blobClient.getContainerReference("testcontainer");
    for (ListBlobItem blob : blobs) {
        // build a blob reference from the listed URI, then copy it into the destination container
        CloudBlockBlob srcblob = new CloudBlockBlob(blob.getUri());
        CloudBlockBlob destblob = destcontainer.getBlockBlobReference(srcblob.getName());
        destblob.startCopy(srcblob);
    }
} catch (StorageException | URISyntaxException e) {
    e.printStackTrace();
}
Update: regarding the status of the copy action, there is a getCopyState method you can use to get the state details; hopefully that is what you want. For more details, check the method documentation.
CopyState st=destblob.getCopyState();
System.out.println(st.getStatus());
I have an image file
JavaPairRDD<String, PortableDataStream> image = javaSparkContext.binaryFiles("/path/to/image.jpg");
I would like to process it and then save the binary info to HDFS using Spark. Something like:
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible? Not saying 'as simple', just possible to do. If so, how would you do it? I'm trying to keep a one-to-one mapping if possible, i.e. keeping the extension and type, so that if I download the file directly using the HDFS command line it would still be a viable image file.
Yes, it is possible. But you need some data serialization plugin, for example Avro (https://github.com/databricks/spark-avro).
Assume each image is represented as binary (byte[]) in your program, so the images can form a Dataset<byte[]>.
You can save it using
datasetOfImages.write()
.format("com.databricks.spark.avro")
.save("hdfs://cluster:port/path/to/images.avro");
images.avro would be a folder containing multiple partitions, and each partition would be an Avro file storing some of the images.
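Not part of the original answer, but for completeness, the same format can be read back into a Dataset ("spark" is assumed to be an existing SparkSession):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> restored = spark.read()
        .format("com.databricks.spark.avro")
        .load("hdfs://cluster:port/path/to/images.avro");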
Edit:
It is also possible, but not recommended, to save the images as separate files. You can call foreach on the dataset and use the HDFS API to save each image.
See below for a piece of code written in Scala; you should be able to translate it into Java (a rough translation follows after it).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

datasetOfImages.foreachPartition { images =>
  // build a fresh Configuration on the executor; the SparkContext itself is not serializable
  val fs = FileSystem.get(new Configuration())
  images.foreach { image =>
    // "/path/to/this/image" is a placeholder; use a unique path per image in practice
    val out = fs.create(new Path("/path/to/this/image"))
    out.write(image)
    out.close()
  }
}
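Since the question is about Java, here is a rough sketch of the same idea using the Dataset Java API; datasetOfImages is assumed to be a Dataset<byte[]> built elsewhere, and the output path is still a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.ForeachPartitionFunction;

datasetOfImages.foreachPartition((ForeachPartitionFunction<byte[]>) images -> {
    // build a fresh Configuration on the executor instead of serializing the SparkContext
    FileSystem fs = FileSystem.get(new Configuration());
    while (images.hasNext()) {
        byte[] image = images.next();
        // "/path/to/this/image" is a placeholder; use a unique path per image in practice
        try (FSDataOutputStream out = fs.create(new Path("/path/to/this/image"))) {
            out.write(image);
        }
    }
});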
I'm currently working on a small Spring-based web application with an AngularJS frontend. One can upload and download files to it, which are mirrored to some other storage. If I download a file, the application checks whether all replicas are present and valid, and if not, it re-uploads the file to the storage provider that holds the corrupted copy.
What I want to achieve is that, when I download the file, I can additionally transfer data about the replicas. More specifically, I want to tell the user whether the file was corrupted somewhere and had to be re-uploaded, and if so, where this happened.
The code I'm currently using is (I know it's not very efficient to download from all providers every time):
public ResponseEntity downloadFile(@RequestParam("fileName") String filename) {
    // 1) Download the file from each storage provider
    // 2) Check if all replicas are OK
    // 3) If not, find the corrupted ones and re-upload the file
    // 4) Get one of the OK copies and store it in a byte array named "file"
    // 5) Create some headers and store them in a variable named "headers"
    return new ResponseEntity<>(file, headers, HttpStatus.OK);
}
What I want to know is:
Is it possible to return something that holds some additional information about the corrupted replicas and that is still handled by the browser like a normal file? So instead of returning a byte array, could I return some other, "magical" object that holds the content of the file plus some additional data?
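One possible direction (just a sketch, not an authoritative answer): keep the response body as the plain file bytes so the browser still handles it as a normal download, and attach the replica information as custom response headers. The X-Replica-* header names and the corruptedReplicaFound/corruptedProviderName variables below are made up for this illustration.
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;

HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_OCTET_STREAM);
headers.add(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"" + filename + "\"");
// hypothetical headers carrying the replica information alongside the file
headers.add("X-Replica-Corrupted", String.valueOf(corruptedReplicaFound));
headers.add("X-Replica-Corrupted-At", corruptedProviderName);
return new ResponseEntity<>(file, headers, HttpStatus.OK);
If the download goes through XHR rather than a plain link, the AngularJS frontend can read those headers from the response; with CORS in play they may also need to be listed in Access-Control-Expose-Headers.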