Apache beam Dataflow : File Transfer from Azure to GCS

Apache beam Dataflow : File Transfer from Azure to GCS - java

I have tried to transfer a file from Azure container to GCS bucket, but end up with below issues
Order of the records in source file is different from the Destination file's records order as pipeline will do parallel processing
Have to write lot of custom code to provide the custom name for the GCS destination file as pipeline give default name for it.
Is there anyway, Apache pipeline can transfer the file itself without dealing with the content of the file (so that, above mentioned issues won't happen)? As I need to transfer multiple files from Azure container to GCS bucket
below code I am using to transfer the files at the moment
String format = LocalDateTime.now().format(DateTimeFormatter.ofPattern("YYYY_MM_DD_HH_MM_SS3")).toString();
String connectionString = "<<AZURE_STORAGE_CONNECTION_STRING>>";
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BlobstoreOptions.class).setAzureConnectionString(connectionString);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("azfs://storageaccountname/containername/CSVSample.csv"))
.apply("",FileIO.<String>write().to("azfs://storageaccountname/containername/"+format+"/").withNumShards(1).withSuffix(".csv")
.via(TextIO.sink()));
p.run().waitUntilFinish();

You should be able to use FileIO transforms for this purpose.
For example (untested pseudocode),
FileIO.match().filepattern("azfs://storageaccountname/containername/CSVSample.csv")
.apply(FileIO.readMatches())
.apply(ParDo.of(new MyWriteDoFn()));
Above MyWriteDoFn() would be a DoFn that reads bytes from a single file (using AzureBlobStoreFileSystem) and writes to GCS (using GCSFileSystem). You can use the static methods in FileSystems class with the correct prefix instead of directly invoking methods of the underlying FileSystem implementations.

Related

Most practical way to read an Azure Blob (PDF) in the Cloud?

I'm somewhat of a beginner and have never dealt with cloud-based solutions yet before.
My program uses the PDFBox library to extract data from PDFs and rename the file based on the data. It's all local currently, but eventually will need to be deployed as an Azure Function. The PDFs will be stored in an Azure Blob Container - the Azure Blob Storage trigger for Azure Functions is an important reason for this choice.
Of course I can download the blob locally and read it, but the program should run solely in the Cloud. I've tried reading the blobs directly using Java, but this resulted in gibberish data and wasn't compatible with PDFbox. My plan for now is to temp store the files elsewhere in the Cloud (e.g. OneDrive, Azure File Storage) and try opening them from there. However, this seems like it can quickly turn into an overly messy solution. My questions:
(1) Is there any way a blob can be opened as a File, rather than a CloudBlockBlob so this additional step isn't needed?
(2) If no, what would be a recommended temporary storage be in this case?
(3) Are there any alternative ways to approach this issue?

Since you are planning Azure function, you can use blob trigger/binding to get the bytes directly. Then you can use PDFBox PdfDocument load method to directly build the object PDDocument.load(content). You won't need any temporary storage to store the file to load that.
#FunctionName("blobprocessor")
public void run(
#BlobTrigger(name = "file",
dataType = "binary",
path = "myblob/{name}",
connection = "MyStorageAccountAppSetting") byte[] content,
#BindingName("name") String filename,
final ExecutionContext context
) {
context.getLogger().info("Name: " + filename + " Size: " + content.length + " bytes");
PDDocument doc = PDDocument.load(content);
// do your stuffs
}

How to zip files in Amazon s3 Bucket and get its URL

I have a bunch of files inside Amazon s3 bucket, I want to zip those file and download get the contents via S3 URL using Java Spring.

S3 is not a file server, nor does it offer operating system file services, such as data manipulation.
If there is many "HUGE" files, your best bet is
start a simple EC2 instance
Download all those files to EC2 instance, compress them, reupload it back to S3 bucket with a new object name
Yes, you can use AWS lambda to do the same thing, but lambda is bounds to 900 seconds (15 mins) execution timeout (Thus it is recommended to allocate more RAM to boost lambda execution performance)
Traffics from S3 to local region EC2 instance and etc services is FREE.
If your main purpose is just to read those file within same AWS region using EC2/etc services, then you don't need this extra step. Just access the file directly.
(Update) :
As mentioned by #Robert Reiz, now you can also use AWS Fargate to do the job.
Note :
It is recommended to access and share file using AWS API. If you intend to share the file publicly, you must look into security issue seriously and impose download restriction. AWS traffics out to internet is never cheap.

Zip them in your end instead of doing it in AWS, ideally in frontend, directly on user browser. You can stream the download of several files in javascript, use that stream to create a zip and save this zip on user disk.
The advantages of moving the zipping to the frontend:
You can use it with S3 URLs, a bunch of presigned links or even mixing content from different sources, some from S3, some of whatever other place.
You don't waste lambda memory, nor have to up an EC2 fargate instance, that saves money. Let the user computer do it for you.
Improves user experience - no needs to wait the zip is created to start downloading it, just start downloading meanwhile the zip is being created.
StreamSaver is useful for this purpose, but in their zipping examples (Saving multiple files as a zip) is limited by less than 4GB files as it doesn't implement zip64. You can combine StreamSaver with client-zip, that support zip64, with something like this (I haven't test this):
import { downloadZip } from 'client-zip';
import streamSaver from 'streamsaver';
const files = [
{
'name': 'file1.txt',
'input': await fetch('test.com/file1')
},
{
'name': 'file2.txt',
'input': await fetch('test.com/file2')
},
]
downloadZip(files).body.pipeTo(streamSaver.createWriteStream('final_name.zip'));
In case you choose this option, keep in mind that if you have CORS enabled in your bucket you will need to add the frontend url where the zipping is done, right in the AllowedOrigins field from your CORS configuration of your bucket.
About performance:
As #aviv-day complains in a comment this could not be suitable for all scenarios. Client-zip library has a benchmark that can give you an idea if this fit or not with your scenario. Generally, if you have a big set of small files (I don't have a number about what is big here, but I'll say something between 100 and 1000) it will take a lot of time just zipping it, and it will drain the final user CPU. Also, if you are offering the same set of files zipped for all the users, it's better zip it one and present it already zipped. Using this method of zipping in frontend works well with a limited small group of files that can dynamically change depending on user preferences about what to download. I've no really test this and I really think the bottle neck would be the network speed more than the zip process, as it happens on the fly, I don't really think that scenario with a big set of files would actually be a problem. If anyone have benchmarks about this would be nice to share with us!

Hi I recently have to do that for my application -- serve a bundle of files in zip format through a url link that the users can download.
In a nutshell, first create an object using BytesIO method, then use the ZipFile method to write into this object by iterating all the s3 objects, then use put method on this zip object and create a presiged url for it.
The code I used looks like this:
First, call this function to get the zip object, ObjectKeys are the s3 objects that you need to put into the zip file.
def zipResults(bucketName, ObjectKeys):
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
for ObjectKey in ObjectKeys:
objectContent = S3Helper().readFromS3(bucketName, ObjectKey)
fileName = os.path.basename(ObjectKey)
zip_file.writestr(fileName, objectContent)
buffer.seek(0)
return buffer
Then call this function, key is the key you give to your zip object:
def uploadObject(bucketName, body, key):
s3client = AwsHelper().getClient("s3")
try:
response = s3client.put_object(
Bucket=bucketName,
Body=body,
Key=key
)
except ClientError as e:
logging.error(e)
return None
return response
Of course, you would need io, zipfile and boto3 modules.

If you need individual files (objects) in S3 compressed, then it is possible to do so in a round-about way. You can define a CloudFront endpoint pointing to the S3 bucket, then let CloudFront compress the content on the way out: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html

Save image file to HDFS using Spark

I have an image file
image = JavaSparkContext.binaryFiles("/path/to/image.jpg");
I would like to process then save the binary info using Spark to HDFSSomething like :
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible, not saying 'as simple', just possible to do this? if so how would you do this. Trying to keep a one to one if possible as in keeping the extension and type, so if I directly download using hdfs command line it would still be a viable image file.

Yes, it is possible. But you need some data serialization plugin, for example avro(https://github.com/databricks/spark-avro).
Assume image is presented as binary(byte[]) in your program, so the images can be a Dataset<byte[]>.
You can save it using
datasetOfImages.write()
.format("com.databricks.spark.avro")
.save("hdfs://cluster:port/path/to/images.avro");
images.avro would be a folder contains multiple partitions and each partition would be an avro file saving some images.
Edit:
it is also possible but not recommended to save the images as separated files. You can call foreach on the dataset and use HDFS api to save the image.
see below for a piece of code written in Scala. You should be able to translate it into Java.
import org.apache.hadoop.fs.{FileSystem, Path}
datasetOfImages.foreachPartition { images =>
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
images.foreach { image =>
val out = fs.create(new Path("/path/to/this/image"))
out.write(image);
out.close();
}
}

Print the content of streams (Spark streaming) in Windows system

I want just to print the content of streams to console. I wrote the following code but it does not print anything. Anyone can help me to read text file as stream in Spark?? Is there a problem related to Windows system?
public static void main(String[] args) throws Exception {
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[2]")
.setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
.set("spark.executor.memory", "2g");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
dataStream.print();
jssc.start();
jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0

textFileStream is for Monitoring the hadoop Compatible Directories. This operation will watch the provided directory and as you add new files in the provided directory it will read/ stream the data from the newly added files.
You cannot read text/ csv files using textFileStream or rather I would say that you do not need streaming in case you are just reading the files.
My Suggestion would be to monitor some directory (may be HDFS or local file system) and then add files and capture the content of these new files using textFileStream.
May be in your code may be you can replace "C://testStream//copy.csv" with C://testStream" and once your Spark Streaming job is up and running then add file copy.csv to C://testStream folder and see the output on Spark Console.
OR
may be you can write another command line Scala/ Java program which read the files and throw the content over the Socket (at a certain PORT#) and next you can leverage socketTextStream for capturing and reading the data. Once you have read the data, you further apply other transformation or output operations.
You can also think of leveraging Flume too
Refer to API Documentation for more details

This worked for me on Windows 7 and Spark 1.6.3: (removing the rest of code, important one is how to define the folder to monitor)
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
print
...
This monitors directory D:/tmp/data, ssc is my streaming context
Steps:
Create a file say 1.txt in D:/tmp/data
Enter some text
Start the spart application
Rename the file to data.txt (i believe any arbitrary name will do as long as it's changed while directory is monitored by spark)
One other thing I noticed is that I had to change the line separator to Unix style (used Notepad++) otherwise file wasn't getting picked up.

Try below code, it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");

Random-access Zip file without writing it to disk

I have a 1-2GB zip file with 500-1000k entries. I need to get files by name in fraction of second, without full unpacking. If file is stored on HDD, this works fine:
public class ZipMapper {
private HashMap<String,ZipEntry> map;
private ZipFile zf;
public ZipMapper(File file) throws IOException {
map = new HashMap<>();
zf = new ZipFile(file);
Enumeration<? extends ZipEntry> en = zf.entries();
while(en.hasMoreElements()) {
ZipEntry ze = en.nextElement();
map.put(ze.getName(), ze);
}
}
public Node getNode(String key) throws IOException {
return Node.loadFromStream(zf.getInputStream(map.get(key)));
}
}
But what can I do if program downloaded the zip file from Amazon S3 and has its InputStream (or byte array)? While downloading 1GB takes ~1 second, writing it to HDD may take some time, and it is slightly harder to handle multiple files since we don't have HDD garbage collector.
ZipInputStream does not allow to random access to entries.
It would be nice to create a virtual File in memory by byte array, but I couldn't find a way to.

You could mark the file to be deleted on exit.
If you want to go for an in-memory approach: Have a look at the new NIO.2 File API. Oracle provides a filesystem provider for zip/ jar and AFAIK ShrinkWrap provides an in-memory filesystem. You could try a combination of the two.
I've written some utility methods to copy directories and files to/from a Zip file using the NIO.2 File API (the library is Open Source):
Maven:
<dependency>
<groupId>org.softsmithy.lib</groupId>
<artifactId>softsmithy-lib-core</artifactId>
<version>0.3</version>
</dependency>
Tutorial:
http://softsmithy.sourceforge.net/lib/current/docs/tutorial/nio-file/index.html
API: CopyFileVisitor.copy
Especially PathUtils.resolve helps with resolving paths across filesystems.

You can use SecureBlackbox library, it allows ZIP operations on any seekable streams.

I think you should consider using your OS in order to create "in memory" file system (i.e - RAM drive).
In addition, take a look at the FileSystems API.

A completely different approach: If the server has the file on disk (and possibly cached in RAM already): make it give you the file(s) directly. In other words, submit which files you need and then take care to extract and deliver these on the server.

Blackbox library only has Extract(String name, String outputPath) method. Seems that it can randomly access any file in seekable zip-stream indeed, but it can't write result to byte array or return stream.
I couldn't find and documentation for ShrinkWrap. I couldn't find any suitable implementations of FileSystem/FileSystemProvider etc.
However, it turned out that Amazon EC2 instance I'm running (Large) somehow writes 1gb file to disk in ~1 second. So I just write file to the disk and use ZipFile.
If HDD would be slow, I think RAM disk would be the easiest solution.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache beam Dataflow : File Transfer from Azure to GCS - java

Related

Most practical way to read an Azure Blob (PDF) in the Cloud?

How to zip files in Amazon s3 Bucket and get its URL

Save image file to HDFS using Spark

Print the content of streams (Spark streaming) in Windows system

Random-access Zip file without writing it to disk

Categories

Resources