Write stream into MongoDB in Java

I have a file to store in MongoDB. I want to avoid loading the whole file (which could be several MBs in size) into memory; instead I want to open a stream and direct it to MongoDB to keep the write operation performant. I don't mind storing the content as a base64-encoded byte[].
Afterwards I want to do the same when reading the file, i.e. not load the whole file into memory but read it as a stream.
I am currently using hibernate-ogm with a Vert.x server, but I am open to switching to a different API if it serves the cause more efficiently.
I actually want to store a document with several fields and several attachments.

You can use GridFS. It is the recommended method, especially when you need to store larger files (>16 MB):
File f = new File("sample.zip");
GridFS gfs = new GridFS(db, "zips");
GridFSInputFile gfsFile = gfs.createFile(f);
gfsFile.setFilename(f.getName());
gfsFile.setId(id);
gfsFile.save();
Or in case you have an InputStream in:
GridFS gfs = new GridFS(db, "zips");
GridFSInputFile gfsFile = gfs.createFile(in);
gfsFile.setFilename("sample.zip");
gfsFile.setId(id);
gfsFile.save();
You can load a file back using one of the GridFS.find methods and read it as a stream:
GridFSDBFile gfsFile = gfs.findOne(id);
InputStream in = gfsFile.getInputStream(); // chunks are fetched on demand
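If you are willing to move off the legacy API, the newer GridFSBucket API (com.mongodb.client.gridfs, mongodb-driver 3.x and later) exposes the same streaming behaviour more directly. A minimal sketch, assuming a MongoDatabase handle named database:

GridFSBucket bucket = GridFSBuckets.create(database, "zips");

// upload: the driver consumes the InputStream chunk by chunk
ObjectId fileId = bucket.uploadFromStream("sample.zip", in);

// download: either push the chunks into an OutputStream...
bucket.downloadToStream(fileId, outputStream);

// ...or pull them through an InputStream
InputStream download = bucket.openDownloadStream(fileId);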

Related

Java multipart upload to s3

My method receives a BufferedReader and transforms each line in my file. However, I need to upload the output of this transformation to an S3 bucket. The files are quite large, so I would like to be able to stream my upload into an S3 object.
To do so, I think I need to use a multipart upload, but I'm not sure I'm using it correctly, as nothing seems to get uploaded.
Here is my method:
public void transform(BufferedReader reader)
{
    Scanner scanner = new Scanner(reader);
    String row;
    List<PartETag> partETags = new ArrayList<>();
    InitiateMultipartUploadRequest request = new InitiateMultipartUploadRequest("output-bucket", "test.log");
    InitiateMultipartUploadResult result = amazonS3.initiateMultipartUpload(request);

    while (scanner.hasNext()) {
        row = scanner.nextLine();
        InputStream inputStream = new ByteArrayInputStream(row.getBytes(Charset.forName("UTF-8")));

        log.info(result.getUploadId());

        UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName("output-bucket")
                .withKey("test.log")
                .withUploadId(result.getUploadId())
                .withInputStream(inputStream)
                .withPartNumber(1)
                .withPartSize(5 * 1024 * 1024);

        partETags.add(amazonS3.uploadPart(uploadRequest).getPartETag());
    }

    log.info(result.getUploadId());

    CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            "output-bucket",
            "test.log",
            result.getUploadId(),
            partETags);

    amazonS3.completeMultipartUpload(compRequest);
}
Oh, I see. The UploadPartRequest needs to read from an input stream. This is a valid constraint, since in general you can only write to output streams.
You have probably heard that you can copy data from an InputStream to a ByteArrayOutputStream, take the resulting byte array, create a ByteArrayInputStream from it, and feed that to your request object. BUT: all the data would then sit in one byte array at some point in time. Since your use case is about large files, that cannot be OK.
What you need is a custom input stream class that transforms the original input stream into another input stream. It requires you to work at the byte level of abstraction, but it would offer the best performance. I suggest asking a new question if you would like to know more about that.
Is your transformation code already finished and you don't want to touch it again? Then there is another approach: you can also just "connect" an output stream to an input stream by using pipes (https://howtodoinjava.com/java/io/convert-outputstream-to-inputstream-example/); a rough sketch follows below. The catch: you are dealing with multi-threading here.
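A minimal sketch of that pipe approach, assuming the transformation can be expressed as a hypothetical transformRow helper; the writer side runs on its own thread while the upload reads from the connected PipedInputStream:

PipedOutputStream pipedOut = new PipedOutputStream();
PipedInputStream pipedIn = new PipedInputStream(pipedOut);

new Thread(() -> {
    try (Writer writer = new OutputStreamWriter(pipedOut, StandardCharsets.UTF_8)) {
        String row;
        while ((row = reader.readLine()) != null) {
            writer.write(transformRow(row)); // transformRow is a hypothetical helper
            writer.write('\n');
        }
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}).start();

// pipedIn can now be handed to the S3 upload (e.g. via withInputStream),
// which consumes the data as the writer thread produces it.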

Streaming multiple files into a zip file

Hi, I am generating a report in CSV format using Solr, AngularJS, JAX-RS and Java. The input stream already contains a CSV response because we specify wt=csv when querying Solr. The CSV created from each input stream might be around 300 MB in size. At the Java layer the code is something like:
InputStream is1; // CSV response for the first query (csv1)
InputStream is2; // CSV response for the second query (csv2)
// csvzip is the zip file built from both CSVs;
// it now needs to be downloaded through the browser
Creating such large files and the zip file in memory is surely not a good approach.
Is there any way to handle this?
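Since the CSVs are already streams, one way to avoid building anything in memory is to write the zip entries straight to the HTTP response. A rough JAX-RS sketch, assuming the two Solr input streams from the question; the method, path and entry names are illustrative only:

@GET
@Path("/report")
@Produces("application/zip")
public Response downloadReport() {
    StreamingOutput stream = output -> {
        try (ZipOutputStream zip = new ZipOutputStream(output)) {
            addEntry(zip, "csv1.csv", is1); // is1, is2 as in the question
            addEntry(zip, "csv2.csv", is2);
        }
    };
    return Response.ok(stream)
            .header("Content-Disposition", "attachment; filename=\"report.zip\"")
            .build();
}

private void addEntry(ZipOutputStream zip, String name, InputStream in) throws IOException {
    zip.putNextEntry(new ZipEntry(name));
    byte[] buffer = new byte[8192];
    int read;
    while ((read = in.read(buffer)) != -1) {
        zip.write(buffer, 0, read); // copy chunk by chunk, never the whole CSV at once
    }
    zip.closeEntry();
    in.close();
}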

Download file to stream instead of File

I'm implementing a helper class to handle transfers to and from AWS S3 storage in my web application.
In the first version of my class I was using an AmazonS3Client directly to handle upload and download, but now I have discovered TransferManager and I'd like to refactor my code to use it.
The problem is that my download method returns the stored file as a byte[]. TransferManager, however, only has methods that use a File as the download destination (for example download(GetObjectRequest getObjectRequest, File file)).
My previous code was like this:
GetObjectRequest getObjectRequest = new GetObjectRequest(bucket, key);
S3Object s3Object = amazonS3Client.getObject(getObjectRequest);
S3ObjectInputStream objectInputStream = s3Object.getObjectContent();
byte[] bytes = IOUtils.toByteArray(objectInputStream);
Is there a way to use TransferManager the same way or should I simply continue using an AmazonS3Client instance?
The TransferManager uses File objects to support things like file locking when downloading pieces in parallel. It's not possible to use an OutputStream directly. If your requirements are simple, like downloading small files from S3 one at a time, stick with getObject.
Otherwise, you can create a temporary file with File.createTempFile and read the contents into a byte array when the download is done.
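A rough sketch of that temp-file workaround, assuming an AWS SDK v1 TransferManager instance named transferManager (bucket and key as in the question):

public byte[] downloadAsBytes(String bucket, String key) throws Exception {
    File tempFile = File.createTempFile("s3-download-", ".tmp");
    try {
        Download download = transferManager.download(
                new GetObjectRequest(bucket, key), tempFile);
        download.waitForCompletion();                 // blocks until the transfer finishes
        return Files.readAllBytes(tempFile.toPath()); // same byte[] result as before
    } finally {
        tempFile.delete();                            // don't leave temp files behind
    }
}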

How to perform Update operations in GridFS (using Java)?

I am using Mongo-Java-Driver 2.13. I stored a PDF file (about 30 MB) in GridFS. I am able to perform insert, delete and find operations easily.
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("testDB");
File pdfFile = new File("/home/dev/abc.pdf");
GridFS gfs = new GridFS(db,"books");
GridFSInputFile inputFile = gfs.createFile(pdfFile);
inputFile.setId("101");
inputFile.put("title", "abc");
inputFile.put("author", "xyz");
inputFile.save();
The data is persisted in the books.files and books.chunks collections. Now I want to update:
case 1: the PDF file itself
case 2: the title or author
How do I perform the update for case 1 in GridFS?
I came to know that I need to maintain multiple versions of my files and pick the right version. Can anybody shed some light on this?
Edit:
I can update the metadata (title, author) easily:
GridFSDBFile outputFile = gfs.findOne(new BasicDBObject("_id", "101"));
BasicDBObject updatedMetadata = new BasicDBObject();
updatedMetadata.put("name", "PG");
updatedMetadata.put("age", 22);
outputFile.setMetaData(updatedMetadata);
outputFile.save();
In GridFS you are not removing/deleting a single document but a whole bunch of documents (files are split into chunks, and each chunk is a separate document). That means replacing a file is simply not possible in an atomic manner.
What you can do instead is (a rough sketch follows below):
1. insert a new file under a new name/id
2. after this has happened (use the replica-acknowledged write concern), update all references to the old file so they point to the new one
3. after you got a confirmation for this, delete the old file
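A rough sketch of those three steps with the legacy driver API; the documents collection and its pdfId field are assumptions made purely for illustration:

// 1. insert the new version of the PDF under a new _id
GridFSInputFile newFile = gfs.createFile(new File("/home/dev/abc-v2.pdf"));
newFile.setId("102"); // new id, chosen here for illustration
newFile.put("title", "abc");
newFile.put("author", "xyz");
newFile.save();

// 2. repoint every document that references the old file id
db.getCollection("documents").updateMulti(
        new BasicDBObject("pdfId", "101"),
        new BasicDBObject("$set", new BasicDBObject("pdfId", "102")));

// 3. once the references are switched over, remove the old file and its chunks
gfs.remove(new BasicDBObject("_id", "101"));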
GridFS is kind of a hackish feature. It is often better to just use a separate fileserver with a real filesystem to store the file content and only store the metadata in MongoDB.

Java- using an InputStream as a File

I'm trying to generate a PDF document from an uploaded ".docx" file using JODConverter.
The call to the method that generates the PDF is something like this :
File inputFile = new File("document.doc");
File outputFile = new File("document.pdf");
// connect to an OpenOffice.org instance running on port 8100
OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
connection.connect();
// convert
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
converter.convert(inputFile, outputFile);
// close the connection
connection.disconnect();
I'm using Apache Commons FileUpload to handle uploading the docx file, from which I can get an InputStream object. I'm aware that java.io.File is just an abstract reference to a file in the system.
I want to avoid the disk write (saving the InputStream to disk) and the disk read (reading the saved file back in JODConverter).
Is there any way I can get a File object referring to an input stream? Any other way to avoid disk I/O would also do!
EDIT: I don't care if this ends up using a lot of system memory. The application is going to be hosted on a LAN with very few, if any, parallel users.
File-based conversions are faster than stream-based ones (provided by StreamOpenOfficeDocumentConverter), but they require the OpenOffice.org service to be running locally and to have the correct permissions on the files.
To avoid writing to disk, try the stream-based variant:
convert(java.io.InputStream inputStream, DocumentFormat inputFormat, java.io.OutputStream outputStream, DocumentFormat outputFormat)
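A minimal sketch of that stream-based call, assuming the JODConverter 2.x API (StreamOpenOfficeDocumentConverter and DefaultDocumentFormatRegistry) and that the registry knows the docx format:

OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
connection.connect();

DocumentFormatRegistry registry = new DefaultDocumentFormatRegistry();
DocumentConverter converter = new StreamOpenOfficeDocumentConverter(connection);

// inputStream comes straight from commons-fileupload; the PDF goes to any
// OutputStream (for example the HTTP response) without touching disk
converter.convert(inputStream, registry.getFormatByFileExtension("docx"),
        outputStream, registry.getFormatByFileExtension("pdf"));

connection.disconnect();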
There is no way to do this and keep the code solid. For one, the convert() method only takes two Files as arguments.
So this would mean you'd have to extend File, which is possible in theory but very fragile, as you would have to delve into the library code, which can change at any time and break your extended class.
(Well, there is a way to avoid disk writes if you use a RAM-backed filesystem and read/write from that filesystem, of course.)
Chances are that Commons FileUpload has written the upload to the filesystem anyhow.
Check whether your FileItem is an instance of DiskFileItem. If it is, the write implementation of DiskFileItem will try to move the temporary file to the File object you pass, so you are not causing any extra disk I/O; the write has already happened.
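A short sketch of that check, assuming a FileItem obtained from commons-fileupload (error handling omitted):

// If the upload already sits on disk, DiskFileItem.write() will usually just
// rename/move the temp file instead of copying the bytes again.
if (fileItem instanceof DiskFileItem && !fileItem.isInMemory()) {
    File inputFile = File.createTempFile("upload-", ".docx");
    fileItem.write(inputFile); // moves the existing temp file when possible
    // hand inputFile to JODConverter exactly as in the question's code
}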
