Creating an Avro file in Amazon S3 bucket - java

How do I create an Avro file in an S3 bucket and then append Avro records to it?
I have all the Avro records in the form of a byte array, and they were successfully transferred into an Avro file. But this file is (as far as I know) not a complete Avro file, since a complete Avro file is schema + data.
Following is the code that transfers the byte records to a file in S3.
Does anyone know how to create a schema-based Avro file and then write these bytes into that same file?
public void sendByteData(byte[] b, Schema schema) {
    try {
        AWSCredentials credentials = new BasicAWSCredentials("XXXXX", "XXXXXX");
        AmazonS3 s3Client = new AmazonS3Client(credentials);
        //createFolder("encounterdatasample", "avrofiles", s3Client);
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(b.length);
        InputStream stream = new ByteArrayInputStream(b);
        /* File file = new File("/home/abhishek/sample.avro");
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
        dataFileWriter.create(schema, file);
        s3Client.putObject("encounterdatasample", dataFileWriter.create(schema, file), stream, meta);
        */
        s3Client.putObject("encounterdatasample", "sample.avro", stream, meta);
        System.out.println("Done writing the data");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The commented-out code doesn't work; I was just experimenting with it.
Any help on this would be appreciated.
Thanks.

I believe your assertion is correct: you can't encode both the data and the schema in the byte array alone. You need some container, typically a file, to encode both.
With a few fixes, the code you have commented out should work. I just did something similar from within a Lambda written in Java. I wrote the file out to local disk (/tmp) using DataFileWriter, then put that file to S3 using your syntax without issue.
Two suggestions:
Call dataFileWriter.close() once you're finished writing to the file.
Use the file object directly in the s3Client.putObject call, e.g. s3Client.putObject(bucket, key, file).
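Putting those two suggestions together with the commented-out DataFileWriter code, a minimal sketch might look like the method below. It assumes each byte[] holds one binary-encoded record matching the schema (if your bytes are laid out differently, appendEncoded won't apply as-is), and it reuses the bucket and key from the question:
public void sendAvroData(AmazonS3 s3Client, Schema schema, List<byte[]> encodedRecords) throws Exception {
    // Write schema + records into a local Avro container file first.
    File file = File.createTempFile("sample", ".avro");
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
    dataFileWriter.create(schema, file);                        // writes the schema into the file header
    for (byte[] record : encodedRecords) {
        dataFileWriter.appendEncoded(ByteBuffer.wrap(record));  // append an already-encoded datum
    }
    dataFileWriter.close();                                     // finish the container before uploading

    // Upload the completed file; bucket and key taken from the question.
    s3Client.putObject("encounterdatasample", "sample.avro", file);
    file.delete();
}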

Related

Can we write csv file to S3 without creating a file on local in spring boot?

I have a set of objects and want to store this data as a CSV file in an AWS S3 bucket without creating any local file. Can anyone please suggest how this could be done smoothly, without impacting performance?
For e.g.: Set<Address> addresses; // the Address class contains fields such as city, state, zipcode and country.
It should create a CSV file with the following headers and data:
City, State, ZipCode, Country
-----------------------------
Mumbai, Maharashtra, 4200091, India
One thing I know is that we can write the data to an InputStream and then pass it to PutObjectRequest. But an InputStream usually comes from a file path, and I don't want to waste time creating temp files; I also have multiple operations to do.
PutObjectRequest putObj = new PutObjectRequest(bucketName, KeyName, inputStream, metadata);
s3client.putObject(putObj);
Thanks in advance for your help and time.
You could do something like this:
Create a CSV output stream.
Dependency to be added:
<dependency>
    <groupId>net.sourceforge.javacsv</groupId>
    <artifactId>javacsv</artifactId>
    <version>2.0</version>
</dependency>
Code:
public void createCsv() throws Exception {
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    CsvWriter writer = new CsvWriter(stream, ',', Charset.forName("ISO-8859-1"));
    writer.setRecordDelimiter(';');
    // WRITE your CSV here
    //writer.write("a;b");
    writer.endRecord();
    writer.close();
    stream.close();
}
Convert output stream to input stream via byte array:
InputStream inputStream = new ByteArrayInputStream(stream.toByteArray());
Pass this stream to S3 PutObjectRequest
InputStream is an abstract class (perhaps it should have been an interface). You can use anything that subclasses InputStream, or anything that can produce an InputStream, for example the (long-deprecated) StringBufferInputStream, which streams a StringBuffer.
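Putting the steps above together for the original Set<Address> case, a minimal sketch could look like the method below. It uses the javacsv CsvWriter from the answer above plus the AWS SDK classes already shown in the question; the Address getter names, bucket, and key parameters are assumptions:
public void uploadAddressesAsCsv(AmazonS3 s3Client, String bucketName, String keyName, Set<Address> addresses) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    CsvWriter writer = new CsvWriter(out, ',', StandardCharsets.UTF_8);

    // Header row
    writer.writeRecord(new String[] {"City", "State", "ZipCode", "Country"});

    // One record per Address (getter names are assumed)
    for (Address a : addresses) {
        writer.writeRecord(new String[] {a.getCity(), a.getState(), a.getZipCode(), a.getCountry()});
    }
    writer.close();

    // Wrap the in-memory bytes and upload; no local file involved
    byte[] bytes = out.toByteArray();
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentType("text/csv");
    metadata.setContentLength(bytes.length);
    s3Client.putObject(new PutObjectRequest(bucketName, keyName, new ByteArrayInputStream(bytes), metadata));
}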

Java - Why does BufferedReader(Writer) create a corrupted excel(.xls), but BufferedInput(Output)Stream creates a good one

At the company where I work, we have a job that retrieves emails, gets their attachments and saves them. Until now it only had to handle .xml and .txt files, and it worked well.
We use the JavaMail 1.4.4 package. Existing code (simplified; don't mind the type checks):
Message message = ...;
Multipart mp = (Multipart) message.getContent();
File file = new File(newFileName);
Part part = mp.getBodyPart(indexWhereIsAttachement);
InputStream inputStream = part.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
BufferedWriter writer = new BufferedWriter(new FileWriter(file));
// method that reads everything from reader and writes it to writer
When I use a .xls file, it doesn't work: it creates a corrupted .xls file. I can't open it with LibreOffice, nor can I open it as an Apache POI Workbook in code. But it works for .xml and .txt.
But if I do this:
...
File file = new File(newFileName);
Part part = mp.getBodyPart(indexWhereIsAttachement);
((MimeBodyPart)part).saveFile(file);
It works fine. Looking at the saveFile() method, it uses a BufferedInputStream and BufferedOutputStream, so while reading the file it doesn't convert the data to characters. Is this what's causing the issues? What exactly happens that breaks everything?
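That is essentially it: Reader/Writer decode bytes into characters with some charset and re-encode them on the way out, which silently corrupts binary formats such as .xls, while byte streams copy the data untouched. A plain byte-for-byte copy of the attachment would look roughly like this (reusing the variables from the snippet above):
Part part = mp.getBodyPart(indexWhereIsAttachement);
try (InputStream in = new BufferedInputStream(part.getInputStream());
     OutputStream out = new BufferedOutputStream(new FileOutputStream(newFileName))) {
    byte[] buffer = new byte[8192];
    int read;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);   // raw bytes, no charset conversion
    }
}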

Download file to stream instead of File

I'm implementing a helper class to handle transfers from and to AWS S3 storage in my web application.
In a first version of my class I used an AmazonS3Client directly to handle upload and download, but now I've discovered TransferManager and I'd like to refactor my code to use it.
The problem is that my download method returns the stored file as a byte[], whereas TransferManager only has methods that use a File as the download destination (for example download(GetObjectRequest getObjectRequest, File file)).
My previous code was like this:
GetObjectRequest getObjectRequest = new GetObjectRequest(bucket, key);
S3Object s3Object = amazonS3Client.getObject(getObjectRequest);
S3ObjectInputStream objectInputStream = s3Object.getObjectContent();
byte[] bytes = IOUtils.toByteArray(objectInputStream);
Is there a way to use TransferManager the same way or should I simply continue using an AmazonS3Client instance?
The TransferManager uses File objects to support things like file locking when downloading pieces in parallel. It's not possible to use an OutputStream directly. If your requirements are simple, like downloading small files from S3 one at a time, stick with getObject.
Otherwise, you can create a temporary file with File.createTempFile and read the contents into a byte array when the download is done.
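A rough sketch of that temp-file approach (the method and variable names are just placeholders):
public byte[] downloadToBytes(TransferManager transferManager, String bucket, String key) throws Exception {
    File tempFile = File.createTempFile("s3-download-", ".tmp");
    try {
        Download download = transferManager.download(new GetObjectRequest(bucket, key), tempFile);
        download.waitForCompletion();                  // blocks until all parts have been written
        return Files.readAllBytes(tempFile.toPath());  // same byte[] contract as before
    } finally {
        tempFile.delete();
    }
}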

Upload to S3 using Gzip in Java

I'm new to Java and I'm trying to upload a large file (~10GB) to Amazon S3. Could anyone please help me with how to use a GZIP output stream for it?
I've been through some documentation but got confused about byte streams and GZIP streams. Must they be used together? Can anyone help me with this piece of code?
Thanks in advance.
Have a look at this,
Is it possible to gzip and upload this string to Amazon S3 without ever being written to disk?
ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
GZIPOutputStream gzipOut = new GZIPOutputStream(byteOut);
// write your stuff to gzipOut, then close it so the gzip trailer is flushed
gzipOut.close();
byte[] bytes = byteOut.toByteArray();
// write the bytes to the Amazon stream
Since it's a large file, you might also want to have a look at multipart upload.
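For example, TransferManager takes care of the multipart mechanics for you. A rough sketch, assuming the gzipped data has already been written to a local File called gzippedFile (the bucket and key names are placeholders):
TransferManager transferManager = TransferManagerBuilder.standard()
        .withS3Client(s3Client)
        .build();

// Large files are automatically split into parts and uploaded in parallel.
Upload upload = transferManager.upload("my-bucket", "data.gz", gzippedFile);
upload.waitForCompletion();           // block until every part has been uploaded
transferManager.shutdownNow(false);   // shut down the manager but keep the s3Client alive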
This question could have been more specific, and there are several ways to achieve this. One approach might look like the one below.
The example depends on the commons-io and commons-compress libraries, and uses classes from the java.nio.file package.
public static void compressAndUpload(AmazonS3 s3, InputStream in)
        throws IOException
{
    // Create temp file
    Path tmpPath = Files.createTempFile("prefix", "suffix");

    // Create and write to gzip compressor stream
    OutputStream out = Files.newOutputStream(tmpPath);
    GzipCompressorOutputStream gzOut = new GzipCompressorOutputStream(out);
    IOUtils.copy(in, gzOut);
    gzOut.close(); // the gzip stream must be finished before the temp file is read

    // Read content from temp file
    InputStream fileIn = Files.newInputStream(tmpPath);
    long size = Files.size(tmpPath);
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentType("application/x-gzip");
    metadata.setContentLength(size);

    // Upload file to S3
    s3.putObject(new PutObjectRequest("bucket", "key", fileIn, metadata));
}
Buffering, error handling and the remaining stream closes are omitted for brevity.

Write stream into mongoDB in Java

I have a file to store in MongoDB. I want to avoid loading the whole file (which could be several MBs in size); instead, I want to open a stream and direct it to MongoDB to keep the write operation performant. I don't mind storing the content as a base64-encoded byte[].
Afterwards I want to do the same when reading the file, i.e. not load the whole file into memory but read it as a stream.
I am currently using hibernate-ogm with a Vert.x server, but I am open to switching to a different API if it serves the cause efficiently.
I actually want to store a document with several fields and several attachments.
You can use GridFS. Especially when you need to store larger files (>16MB) this is the recommended method:
File f = new File("sample.zip");
GridFS gfs = new GridFS(db, "zips");
GridFSInputFile gfsFile = gfs.createFile(f);
gfsFile.setFilename(f.getName());
gfsFile.setId(id);
gfsFile.save();
Or in case you have an InputStream in:
GridFS gfs = new GridFS(db, "zips");
GridFSInputFile gfsFile = gfs.createFile(in);
gfsFile.setFilename("sample.zip");
gfsFile.setId(id);
gfsFile.save();
You can load a file using one of the GridFS.find methods:
GridFSDBFile gfsFile = gfs.findOne(id);
InputStream in = gfsFile.getInputStream();
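If you also want to avoid buffering the whole file on the read side, GridFSDBFile can stream the stored chunks directly to an OutputStream; the FileOutputStream below is only an example target (it could just as well be an HTTP response stream):
GridFSDBFile gfsFile = gfs.findOne(id);

// Streams the chunks straight to the target without loading the whole file into memory.
try (OutputStream out = new FileOutputStream("restored-sample.zip")) {
    gfsFile.writeTo(out);
}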
