Reading from an S3 location with dynamic addition of files through a Java client - java

I have an S3 bucket, and files are being added to the bucket by an ever-running process. Now I want to write a consumer for these files using the Java AWS S3 SDK. How can I make sure every file is read exactly once?
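One way to approach this, as a minimal sketch only (assuming the AWS SDK for Java v2; the BucketPoller class and its in-memory processedKeys set are illustrative assumptions, and a real consumer would persist the processed keys somewhere durable so files are not re-read after a restart), is to list the bucket periodically and de-duplicate on object key:

import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

import java.util.HashSet;
import java.util.Set;

public class BucketPoller {

    private final S3Client s3 = S3Client.create();
    private final String bucket;
    // Illustration only: in practice this set must be durable (database, DynamoDB, ...).
    private final Set<String> processedKeys = new HashSet<>();

    public BucketPoller(String bucket) {
        this.bucket = bucket;
    }

    // Call this on a schedule from the ever-running consumer.
    public void pollOnce() {
        ListObjectsV2Request list = ListObjectsV2Request.builder().bucket(bucket).build();
        // The paginator follows continuation tokens, so every key is seen even in large buckets.
        s3.listObjectsV2Paginator(list).contents().forEach(this::processIfNew);
    }

    private void processIfNew(S3Object object) {
        if (!processedKeys.add(object.key())) {
            return; // this key was already handled in an earlier poll
        }
        GetObjectRequest get = GetObjectRequest.builder().bucket(bucket).key(object.key()).build();
        try (ResponseInputStream<GetObjectResponse> body = s3.getObject(get)) {
            // consume the object's content here
        } catch (Exception e) {
            processedKeys.remove(object.key()); // allow a retry if processing failed
        }
    }
}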

Related

Play Framework file upload in memory to S3

I am writing a web server with Play Framework 2.6 in Java. I want to upload a file to the web server through a multipart form, do some validations, and then upload the file to S3. The default implementation in Play saves the file to a temporary file in the file system, but I do not want to do that; I want to upload the file straight to AWS S3.
I looked into this tutorial, which explains how to save the file permanently in the file system instead of using a temporary file. To my knowledge I have to make a custom Accumulator or a Sink that saves the incoming ByteString(s) to a byte array, but I cannot find out how to do so. Can someone point me in the right direction?
Thanks
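One possible direction, as a minimal sketch only: fold the incoming ByteString chunks into a single ByteString with an Akka Streams Sink and expose the result through a custom BodyParser. This assumes Play 2.6's Java API; the InMemoryBodyParser class name is made up, and buffering the whole upload in memory is only reasonable for bounded file sizes.

import akka.stream.javadsl.Sink;
import akka.util.ByteString;
import play.libs.F;
import play.libs.concurrent.HttpExecutionContext;
import play.libs.streams.Accumulator;
import play.mvc.BodyParser;
import play.mvc.Http;
import play.mvc.Result;

import javax.inject.Inject;
import java.util.concurrent.CompletionStage;

// Hypothetical body parser: buffers the whole request body in memory as a byte[].
public class InMemoryBodyParser implements BodyParser<byte[]> {

    private final HttpExecutionContext ec;

    @Inject
    public InMemoryBodyParser(HttpExecutionContext ec) {
        this.ec = ec;
    }

    @Override
    public Accumulator<ByteString, F.Either<Result, byte[]>> apply(Http.RequestHeader request) {
        // Concatenate every incoming chunk into a single ByteString...
        Sink<ByteString, CompletionStage<ByteString>> concat =
                Sink.fold(ByteString.emptyByteString(), ByteString::concat);
        // ...then expose it as a byte[] that a controller can hand to the S3 client.
        return Accumulator.fromSink(concat)
                .map(bytes -> F.Either.<Result, byte[]>Right(bytes.toArray()), ec.current());
    }
}

A controller action could then select the parser with @BodyParser.Of(InMemoryBodyParser.class) and pass the resulting byte[] to the S3 client.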

Is it safe to store .ppk files in S3? - use of S3, Java and SSH

Part of the application I'm working on connects to an instance using SSH. It requires a .ppk file, which I currently have stored in S3.
My concern is that this is not secure enough, and I'm looking for a way to make it so.
I've considered encrypting the S3 bucket and allowing programmatic access only; the bucket and file location can be fed to the app via environment variables.
I really don't want to keep the file in the resources, as anyone who gets the jar can unzip it and obtain the file; the same goes for hardcoded values in the codebase. Is this a safe way of storing the file? Would encrypting it be worth the additional steps?
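For the encrypted-bucket-plus-programmatic-access idea, here is a minimal sketch (assuming the AWS SDK for Java v2 and a bucket with server-side encryption, which the SDK decrypts transparently on read; the PPK_BUCKET and PPK_KEY environment variable names are made up):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class PpkLoader {

    // Bucket and key come from the environment rather than being hardcoded or bundled in the jar.
    public static byte[] loadPpk() {
        String bucket = System.getenv("PPK_BUCKET"); // hypothetical variable names
        String key = System.getenv("PPK_KEY");

        try (S3Client s3 = S3Client.create()) {
            GetObjectRequest request = GetObjectRequest.builder()
                    .bucket(bucket)
                    .key(key)
                    .build();
            // With SSE enabled on the bucket, decryption happens server-side;
            // the key material is fetched over TLS and held only in memory.
            return s3.getObjectAsBytes(request).asByteArray();
        }
    }
}

Restricting the bucket policy to the application's IAM role is what "programmatic access only" would amount to here; the environment variables just keep the location out of the codebase.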

Can I use one S3 bucket to upload different Java Lambda functions?

Currently I am using a different S3 bucket for every function.
For example, I have 3 Java Lambda functions created in the Eclipse IDE:
RegisterUser
LoginUser
ResetPassword
When I upload a Lambda function through the Eclipse IDE, I have to upload it through an Amazon S3 bucket, so I created 3 Amazon S3 buckets to upload all 3 functions.
My question is: can I upload all 3 Lambda functions using one Amazon S3 bucket, or do I have to create a separate Amazon S3 bucket for each function?
You don't need to upload to a bucket. You can upload the function code via the command line as well. They only recommend not using the web interface for large Lambda functions; all other methods are fine, and the command line is a very good option.
However, if you really want to upload to a bucket first, just give each zip file that contains the function code a different key (filename) and you're good.
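For the one-bucket route, a minimal sketch (AWS SDK for Java v2; the function names are the ones from the question, while the bucket name and zip keys are placeholders): each function simply points at a different key in the same bucket.

import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.UpdateFunctionCodeRequest;

public class DeployFromOneBucket {

    public static void main(String[] args) {
        String bucket = "my-lambda-artifacts"; // one shared bucket (placeholder name)

        try (LambdaClient lambda = LambdaClient.create()) {
            deploy(lambda, bucket, "RegisterUser", "register-user.zip");
            deploy(lambda, bucket, "LoginUser", "login-user.zip");
            deploy(lambda, bucket, "ResetPassword", "reset-password.zip");
        }
    }

    // Each function's code lives under its own key in the same bucket.
    private static void deploy(LambdaClient lambda, String bucket, String functionName, String key) {
        lambda.updateFunctionCode(UpdateFunctionCodeRequest.builder()
                .functionName(functionName)
                .s3Bucket(bucket)
                .s3Key(key)
                .build());
    }
}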

How can I access a file's content from mappers in Amazon Elastic MapReduce?

If I am running an EMR job (in Java) on Amazon Web Services to process large amounts of data, is it possible to have every single mapper access a small file stored on S3? Note that the small file I am talking about is NOT the input to the mappers. Rather, the mappers need to process the input according to some rules in the small file. Maybe the large input file is a billion lines of text, for example, and I want to filter out words that are in a blacklist or something by reading a small file of blacklisted words stored in an S3 bucket. In this case, each mapper would process different parts of the input data, but they would all need to access the restricted words file on S3. How can I make the mappers do this in Java?
EDIT: I am not using the Hadoop framework, so there are no setup() or map() method calls. I am simply using the EMR streaming service and reading the input file line by line from stdin.
You can access any S3 object from within a mapper using the S3 protocol directly, e.g. s3://mybucket/path/to/file.txt
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
You can use S3 both for your mapper's input files and for any ad hoc lookup file like the one you are planning to use. Previously these were differentiated by using the s3n:// protocol for S3 object access and s3bfs:// for block storage; now you don't have to differentiate and can just use s3://.
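Since you are not inside the Hadoop API, the mapper process can also just fetch the small file itself with the plain AWS SDK before it starts reading stdin. A minimal sketch (AWS SDK for Java v2; the bucket/key and the one-word-per-line blacklist format are assumptions):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class StreamingMapper {

    public static void main(String[] args) throws Exception {
        // Load the small rules file once, before processing the mapper's input.
        Set<String> blacklist = loadBlacklist("mybucket", "path/to/blacklist.txt");

        try (Scanner stdin = new Scanner(System.in, StandardCharsets.UTF_8.name())) {
            while (stdin.hasNextLine()) {
                for (String word : stdin.nextLine().split("\\s+")) {
                    if (!blacklist.contains(word)) {
                        System.out.println(word); // emit only words that are not blacklisted
                    }
                }
            }
        }
    }

    private static Set<String> loadBlacklist(String bucket, String key) throws Exception {
        Set<String> words = new HashSet<>();
        GetObjectRequest request = GetObjectRequest.builder().bucket(bucket).key(key).build();
        try (S3Client s3 = S3Client.create();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(s3.getObject(request), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.trim());
            }
        }
        return words;
    }
}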
Alternatively, you can add an s3distcp step to the EMR cluster to copy the file and make it available in HDFS (this is not what you asked about, but it can be useful): http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

How do I know that an Apache Camel route has no more files to copy?

I am writing a simple command-line application which copies files from an FTP server to a local drive. Let's assume that I am using the following route definition:
File tmpFile = File.createTempFile("repo", "dat");
IdempotentRepository<String> repository = FileIdempotentRepository.fileIdempotentRepository(tmpFile);
from("{{ftp.server}}")
.idempotentConsumer(header("CamelFileName"), repository)
.to("file:target/download")
.log("Downloaded file ${file:name} complete.");
where ftp.server is something like:
ftp://ftp-server.com:21/mypath?username=foo&password=bar&delay=5
Let's assume that the files on the FTP server will not change over time. How do I check whether the copying has finished or there are still more files to copy? I need this because I want to exit my app once all files are copied.
Read about the batch consumer:
http://camel.apache.org/batch-consumer.html
The FTP consumer will set some exchange properties with the number of files, whether the current file is the last one, etc.
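A minimal sketch of that idea, reusing the route from the question (assumptions: the CamelBatchComplete exchange property set by the batch consumer marks the last exchange of a poll, and a CountDownLatch is used so the main method can await it and then stop the CamelContext):

import java.util.concurrent.CountDownLatch;

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.spi.IdempotentRepository;

public class DownloadRoute extends RouteBuilder {

    private final CountDownLatch done;
    private final IdempotentRepository<String> repository;

    public DownloadRoute(CountDownLatch done, IdempotentRepository<String> repository) {
        this.done = done;
        this.repository = repository;
    }

    @Override
    public void configure() {
        from("{{ftp.server}}")
            .idempotentConsumer(header("CamelFileName"), repository)
            .to("file:target/download")
            .log("Downloaded file ${file:name} complete.")
            .choice()
                // CamelBatchComplete is true on the last exchange of a poll, so at this
                // point every file the consumer found in that poll has been copied.
                .when(exchangeProperty("CamelBatchComplete").isEqualTo(true))
                    .process(exchange -> done.countDown());
    }
}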
Do you have any control over the end that publishes the FTP files? E.g. is it your server and your client or can you make a request as a customer?
If so, you could ask for a flag file to be added at the end of their batch process. This is a single byte file with an agreed name that you watch for - when that file appears you know the batch is complete.
This is a useful technique if you regularly pull down huge files and they take a long time for a batch process to copy to disk at the server end. E.g. a file is produced by some streaming process.
