I'm somewhat of a beginner and have never dealt with cloud-based solutions before.
My program uses the PDFBox library to extract data from PDFs and rename each file based on that data. It's all local currently, but it will eventually need to be deployed as an Azure Function. The PDFs will be stored in an Azure Blob container; the Blob Storage trigger for Azure Functions is an important reason for this choice.
Of course I can download a blob locally and read it, but the program should run solely in the cloud. I've tried reading the blobs directly in Java, but this resulted in gibberish data that wasn't compatible with PDFBox. My plan for now is to temporarily store the files elsewhere in the cloud (e.g. OneDrive, Azure File Storage) and try opening them from there. However, this seems like it could quickly turn into an overly messy solution. My questions:
(1) Is there any way a blob can be opened as a File, rather than a CloudBlockBlob, so this additional step isn't needed?
(2) If not, what would be a recommended temporary storage in this case?
(3) Are there any alternative ways to approach this issue?
Since you are planning an Azure Function, you can use the blob trigger/binding to get the bytes directly, and then build the document with PDFBox via PDDocument.load(content). You won't need any temporary storage to load the file.
@FunctionName("blobprocessor")
public void run(
        @BlobTrigger(name = "file",
                     dataType = "binary",
                     path = "myblob/{name}",
                     connection = "MyStorageAccountAppSetting") byte[] content,
        @BindingName("name") String filename,
        final ExecutionContext context
) {
    context.getLogger().info("Name: " + filename + " Size: " + content.length + " bytes");
    try (PDDocument doc = PDDocument.load(content)) {
        // do your stuff: extract text, decide the new name, etc.
    } catch (IOException e) {
        context.getLogger().severe("Could not parse PDF " + filename + ": " + e.getMessage());
    }
}
Related
I have an Azure Blob Storage account with many containers. Each container has multiple folders, and each folder has a bunch of files in it. I want to be able to grab all of the files and return them zipped. I'm currently only able to get one file at a time...
public void downloadAllFromBlob(String containerName) {
    CloudBlobClient blobClient = this.storageAccount.createCloudBlobClient();
    try {
        CloudBlobContainer container = blobClient.getContainerReference(containerName);
        if (container.exists()) {
            // I want to grab all the files in the container and zip them
            for (ListBlobItem blobItem : container.listBlobs()) {
                // I'm only able to list/VIEW the blobs, not go into one and get its contents
            }
        }
    } catch (URISyntaxException | StorageException e) {
        e.printStackTrace();
    }
}
Unfortunately there's no batch retrieve capability in Azure Blob Storage. You need to download each blob individually, iterating over the listing as you do above. You can retrieve blobs in parallel to speed things up.
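For example, something like this (an untested sketch: it reuses your storageAccount field, and the method name downloadAllAsZip is just a placeholder):
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import com.microsoft.azure.storage.blob.ListBlobItem;

import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public byte[] downloadAllAsZip(String containerName) throws Exception {
    CloudBlobClient blobClient = this.storageAccount.createCloudBlobClient();
    CloudBlobContainer container = blobClient.getContainerReference(containerName);

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
        // useFlatBlobListing = true flattens the "folders" so every blob shows up with its full name
        for (ListBlobItem item : container.listBlobs(null, true)) {
            if (item instanceof CloudBlockBlob) {
                CloudBlockBlob blob = (CloudBlockBlob) item;
                zip.putNextEntry(new ZipEntry(blob.getName()));
                blob.download(zip);   // stream the blob content straight into the current zip entry
                zip.closeEntry();
            }
        }
    }
    return bytes.toByteArray();
}
Each blob is streamed straight into its zip entry, so nothing is written to local disk; for large containers you could split the listing and download the blobs in parallel before zipping.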
In Java on FLEXIBLE google app engine, how do you disable caching of files? I don't care if it's disabled on the entire bucket with gsutil, or individual files when I save them, or when they're read. (I just don't want anything cached, as files are frequently replaced and use the same filename).
My code to store files:
private static Storage storageService;

public static void uploadStream(
        String name, InputStream stream, String bucketName)
        throws IOException, GeneralSecurityException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.create(BlobInfo.newBuilder(bucketName, name).build(), stream);
}
This code works flawlessly for uploading and replacing pdf files as intended. When the user views the pdf on the web page, if it was recently replaced, they see a cached copy. It takes an hour before the new version can be viewed on the web site.
I'm not sure if this is something where I need to edit the bucket, set no caching when saving the file in java, or set no caching when reading the file. My code for reading the file is:
public ByteArrayOutputStream downloadStream(String bucketName, String filePath)
        throws Exception {
    Storage storage = getService();
    byte[] bytes = storage.readAllBytes(bucketName, filePath);
    ByteArrayOutputStream baos = new ByteArrayOutputStream(bytes.length);
    baos.write(bytes, 0, bytes.length);
    return baos;
}
This is then returned via a web servlet.
Standard Google App Engine is tagged as well because this is really a Google Cloud Storage issue and I'm not sure whether the solution lies with gsutil or the Cloud Console; note, though, that the Java code for accessing Cloud Storage differs between the flexible and standard environments.
Objects are cacheable if they are publicly readable and the Cache-Control header allows caching, so you can disable caching by changing either or both of those. See the gsutil documentation about setting the Cache-Control header, for example:
https://cloud.google.com/storage/docs/gsutil/addlhelp/WorkingWithObjectMetadata#cache-control
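The same thing can be done per object from Java at upload time by setting cacheControl on the BlobInfo. A minimal sketch based on your upload method (the method name and the exact Cache-Control value here are just suggestions):
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.InputStream;

public static void uploadStreamNoCache(String name, InputStream stream, String bucketName) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BlobInfo info = BlobInfo.newBuilder(bucketName, name)
            // tell browsers and edge caches not to reuse the object; overrides the default public max-age=3600
            .setCacheControl("no-cache, max-age=0")
            .build();
    Blob blob = storage.create(info, stream);
}
For objects that already exist in the bucket, the gsutil equivalent is: gsutil setmeta -h "Cache-Control:no-cache, max-age=0" gs://your-bucket/your-object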
I have a bunch of files inside an Amazon S3 bucket, and I want to zip those files and download the result via an S3 URL using Java Spring.
S3 is not a file server, nor does it offer operating system file services, such as data manipulation.
If there are many huge files, your best bet is to:
start a simple EC2 instance
download all those files to the EC2 instance, compress them, and re-upload the archive to the S3 bucket under a new object name
Yes, you can use AWS Lambda to do the same thing, but Lambda is bound to a 900-second (15-minute) execution timeout (so it is recommended to allocate more RAM to boost Lambda execution performance).
Traffic from S3 to EC2 instances and other services in the same region is free.
If your main purpose is just to read those files from EC2 or other services within the same AWS region, then you don't need this extra step. Just access the files directly.
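For the compress-and-reupload step, a rough sketch with the AWS SDK for Java v1 might look like this (untested; the bucket, prefix and zipKey names are placeholders, and the listing only covers the first page of keys):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public static void zipAndReupload(String bucket, String prefix, String zipKey) throws IOException {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    File zipFile = File.createTempFile("bundle", ".zip");

    try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(zipFile))) {
        // first page of results only; handle continuation tokens for buckets with more than 1000 keys
        for (S3ObjectSummary summary : s3.listObjectsV2(bucket, prefix).getObjectSummaries()) {
            try (S3Object object = s3.getObject(bucket, summary.getKey())) {
                zip.putNextEntry(new ZipEntry(summary.getKey()));
                object.getObjectContent().transferTo(zip);  // Java 9+; copy in a loop on older JDKs
                zip.closeEntry();
            }
        }
    }
    s3.putObject(bucket, zipKey, zipFile);  // re-upload the archive under a new object name
    zipFile.delete();
}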
(Update):
As mentioned by @Robert Reiz, you can now also use AWS Fargate to do the job.
Note:
It is recommended to access and share files using the AWS API. If you intend to share files publicly, you must take the security implications seriously and impose download restrictions. AWS traffic out to the internet is never cheap.
Zip them on your end instead of doing it in AWS, ideally in the frontend, directly in the user's browser. You can stream the download of several files in JavaScript, use that stream to create a zip, and save the zip to the user's disk.
The advantages of moving the zipping to the frontend:
You can use it with S3 URLs, a bunch of presigned links, or even a mix of content from different sources: some from S3, some from anywhere else.
You don't waste Lambda memory or have to spin up an EC2 or Fargate instance, which saves money. Let the user's computer do it for you.
It improves the user experience: there is no need to wait for the zip to be created before downloading; the download starts while the zip is being built.
StreamSaver is useful for this purpose, but its zipping example (Saving multiple files as a zip) is limited to files under 4 GB because it doesn't implement zip64. You can combine StreamSaver with client-zip, which does support zip64, with something like this (I haven't tested this):
import { downloadZip } from 'client-zip';
import streamSaver from 'streamsaver';

const files = [
  {
    'name': 'file1.txt',
    'input': await fetch('test.com/file1')
  },
  {
    'name': 'file2.txt',
    'input': await fetch('test.com/file2')
  },
];

downloadZip(files).body.pipeTo(streamSaver.createWriteStream('final_name.zip'));
If you choose this option, keep in mind that if you have CORS enabled on your bucket, you will need to add the frontend URL where the zipping is done to the AllowedOrigins field of your bucket's CORS configuration.
About performance:
As @aviv-day points out in a comment, this may not be suitable for all scenarios. The client-zip library has a benchmark that can give you an idea of whether it fits yours. Generally, if you have a big set of small files (I don't have a hard number for what counts as big, but say somewhere between 100 and 1000) the zipping alone will take a lot of time and will drain the end user's CPU. Also, if you are offering the same set of files zipped to all users, it's better to zip it once and serve it already zipped. Zipping in the frontend works well for a limited, small group of files that can change dynamically depending on what the user chooses to download. I haven't really tested this, and I think the bottleneck would be network speed rather than the zip process, since it happens on the fly, so I don't think the scenario with a big set of files would actually be a problem. If anyone has benchmarks about this, it would be nice to share them with us!
Hi, I recently had to do this for my application: serve a bundle of files in zip format through a URL link that users can download.
In a nutshell, first create an object using BytesIO, then use ZipFile to write into this object by iterating over all the S3 objects, then call put_object on this zip object and create a presigned URL for it.
The code I used looks like this:
First, call this function to get the zip object; ObjectKeys are the S3 objects that you need to put into the zip file.
import os
import zipfile
from io import BytesIO

def zipResults(bucketName, ObjectKeys):
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
        for ObjectKey in ObjectKeys:
            objectContent = S3Helper().readFromS3(bucketName, ObjectKey)
            fileName = os.path.basename(ObjectKey)
            zip_file.writestr(fileName, objectContent)
    buffer.seek(0)
    return buffer
Then call this function; key is the key you give to your zip object:
import logging
from botocore.exceptions import ClientError

def uploadObject(bucketName, body, key):
    s3client = AwsHelper().getClient("s3")
    try:
        response = s3client.put_object(
            Bucket=bucketName,
            Body=body,
            Key=key
        )
    except ClientError as e:
        logging.error(e)
        return None
    return response
Of course, you would need io, zipfile and boto3 modules.
If you need individual files (objects) in S3 compressed, then it is possible to do so in a roundabout way: define a CloudFront endpoint pointing to the S3 bucket, then let CloudFront compress the content on the way out: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
I have an image file
image = JavaSparkContext.binaryFiles("/path/to/image.jpg");
I would like to process it and then save the binary data to HDFS using Spark. Something like:
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible? I'm not saying it has to be simple, just possible. If so, how would you do it? I'm trying to keep it one-to-one if possible, i.e. keep the extension and type, so that if I download it directly with the HDFS command line it is still a viable image file.
Yes, it is possible, but you need a data serialization plugin, for example Avro (https://github.com/databricks/spark-avro).
Assume the image is represented as binary (byte[]) in your program, so the images can be a Dataset<byte[]>.
You can save it using
datasetOfImages.write()
.format("com.databricks.spark.avro")
.save("hdfs://cluster:port/path/to/images.avro");
images.avro will be a folder containing multiple partitions, and each partition will be an Avro file holding some of the images.
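For completeness, a rough Java sketch of getting from the question's binaryFiles call to that Dataset<byte[]> (untested; it assumes a SparkSession named spark, a JavaSparkContext named jsc, and the spark-avro package on the classpath):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// load every image under the path as (path, stream) pairs and pull out the raw bytes
JavaRDD<byte[]> imageBytes = jsc
        .binaryFiles("/path/to/images/*.jpg")
        .map(pathAndStream -> pathAndStream._2().toArray());

// wrap the bytes in a Dataset<byte[]> and write the whole thing out as avro
Dataset<byte[]> datasetOfImages = spark.createDataset(imageBytes.rdd(), Encoders.BINARY());

datasetOfImages.write()
        .format("com.databricks.spark.avro")
        .save("hdfs://cluster:port/path/to/images.avro");
Reading them back is the mirror image via spark.read().format("com.databricks.spark.avro").load(...).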
Edit:
it is also possible, but not recommended, to save the images as separate files. You can call foreachPartition on the dataset and use the HDFS API to save each image.
See below for a piece of code written in Scala; you should be able to translate it into Java.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

datasetOfImages.foreachPartition { images =>
  // build the Hadoop configuration on the executor; the SparkContext is not available inside the closure
  val fs = FileSystem.get(new Configuration())
  images.foreach { image =>
    // in real code, derive a unique path per image
    val out = fs.create(new Path("/path/to/this/image"))
    out.write(image)
    out.close()
  }
}
I am creating a web application which will allow users to upload shapefiles for use later on in the program. I want to be able to read an uploaded shapefile into memory and extract some information from it without doing any explicit writing to disk. The framework I am using (Play Framework) automatically writes a temporary file to disk when a file is uploaded, but it nicely handles the creation and deletion of that file for me. The file has no extension, however, so the traditional means of reading a shapefile via GeoTools, like this
public void readInShpAndDoStuff(File the_upload) throws IOException {
    Map<String, Serializable> map = new HashMap<>();
    map.put("url", the_upload.toURI().toURL());
    DataStore dataStore = DataStoreFinder.getDataStore(map);
}
fails with an exception which states
NAME_OF_TMP_FILE_HERE is not one of the files types that is known to be associated with a shapefile
After looking at the GeoTools source I see that the file type is checked by looking at the file extension, and since this is a tmp file it has none. (Running file FILENAME shows that the OS recognizes the file as a shapefile.)
So, at long last, my question is: is there a way to read in the shapefile without specifying the URL? Some function or constructor which takes a File object as the argument and doesn't rely on a path? Or is it too much trouble and should I just save a copy on disk? The latter option is not preferable, since this will likely run on a VM server at some point and I don't want to deal with file-system-specific stuff.
Thanks in advance for any help!
I can't see how this is going to work for you: a shapefile (despite its name) is a group of 3 (or more) files which share a basename and have the extensions .shp, .dbf and .shx (and usually .prj, .sbn, .fix, .qix, etc.).
Is there some way to make Play write the temp file with the original extension?