Update InputStream before further processing - java

I've got a system where I process files based on a sample file: I receive an Excel file with headers and then rows of data. Now the users are sending additional header and trailer rows, which makes processing fail in Apache POI.
I've added fields to the GUI where the user can specify how many extra leading and trailing rows there are, so I can remove them while parsing the Excel file. I receive the file as an InputStream on a Spring endpoint, validation happens, and then the file is pushed to S3. So I'm wondering: is there any way to update that InputStream and remove the bad records before the S3 upload?
Do I need to save the updated file and read it back to get a new InputStream, or is there a better way to do that?
public InputStream cleanFileBeforeS3Upload(InputStream inputStream, Definition definition) throws IOException, InvalidFormatException {
    var workbook = WorkbookFactory.create(inputStream);
    var sheet = workbook.getSheetAt(0);
    ExcelUtils.removeLeadingAndTrailingRowsFromExcel(sheet, definition.getTrimLeadingRows(), definition.getTrimTrailingRows());
    // ????? How do I get a new, updated InputStream from the workbook above?
    return inputStream;
}
// The line that uploads the file to S3
var request = new PutObjectRequest(s3Properties.getBucket(), s3ObjectKey, file.getInputStream(), metadata);

I think you can do something like:
...
var out = new ByteArrayOutputStream();
workbook.write(out);
return new ByteArrayInputStream(out.toByteArray());
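For completeness, here is a minimal sketch of the whole method under that approach, assuming your Definition and ExcelUtils helpers behave as shown in the question; note that a new stream is returned rather than the original, and the workbook is closed via try-with-resources:

public InputStream cleanFileBeforeS3Upload(InputStream inputStream, Definition definition)
        throws IOException, InvalidFormatException {
    try (var workbook = WorkbookFactory.create(inputStream)) {
        var sheet = workbook.getSheetAt(0);
        ExcelUtils.removeLeadingAndTrailingRowsFromExcel(sheet,
                definition.getTrimLeadingRows(), definition.getTrimTrailingRows());
        // Serialize the modified workbook to memory and wrap the bytes in a new stream
        var out = new ByteArrayOutputStream();
        workbook.write(out);
        return new ByteArrayInputStream(out.toByteArray());
    }
}

This holds the whole file in memory, which is fine for typical spreadsheets; for very large files you would write to a temp file and stream it back instead.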

Related

Creating an Avro file in Amazon S3 bucket

How do I create an Avro file in an S3 bucket and then append Avro records to it?
I have all the Avro records in the form of a byte array, and they were successfully transferred to an Avro file. But this file is (as far as I know) not a complete Avro file, since a complete Avro file is schema + data.
Following is the code that transfers the byte records to a file in S3.
Does anyone know how to create a schema-based Avro file and then transfer these bytes to that same file?
public void sendByteData(byte[] b, Schema schema) {
    try {
        AWSCredentials credentials = new BasicAWSCredentials("XXXXX", "XXXXXX");
        AmazonS3 s3Client = new AmazonS3Client(credentials);
        //createFolder("encounterdatasample", "avrofiles", s3Client);
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(b.length);
        InputStream stream = new ByteArrayInputStream(b);
        /* File file = new File("/home/abhishek/sample.avro");
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
        dataFileWriter.create(schema, file);
        s3Client.putObject("encounterdatasample", dataFileWriter.create(schema, file), stream, meta);
        */
        s3Client.putObject("encounterdatasample", "sample.avro", stream, meta);
        System.out.println("Done writing the data");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The code in the comments doesn't work; I was just trying to play around with it.
Any help on this would be appreciated.
Thanks.
I believe your assertion is correct: you can't encode both the data and the schema in the byte array. You need to use some container, typically a file, to encode both.
With a few fixes, the code you have commented out should work. I just did something similar from within a Lambda written in Java. I wrote the file out to local disk (/tmp) using DataFileWriter, then put that file to S3 using your syntax without issue.
Two suggestions, both shown in the sketch below:
1. Call dataFileWriter.close() once you're finished writing to the file.
2. Use the file object directly in the s3Client.putObject call, e.g. s3Client.putObject(bucket, key, file).
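Putting the two together, a minimal sketch might look like this (assuming your records are available as GenericRecord instances; the bucket and key names are taken from your snippet):

public void sendAvroFile(List<GenericRecord> records, Schema schema, AmazonS3 s3Client)
        throws IOException {
    File file = new File("/tmp/sample.avro");
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
    dataFileWriter.create(schema, file);   // writes the schema header to the container file
    for (GenericRecord record : records) {
        dataFileWriter.append(record);     // appends the data blocks
    }
    dataFileWriter.close();                // suggestion 1: close once finished
    s3Client.putObject("encounterdatasample", "sample.avro", file);  // suggestion 2: put the File directly
}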

Writing spreadsheet to a blob column and retrieving it back using POI

We have a requirement to store the uploaded spreadsheet in an Oracle database blob column.
User will upload the spreadsheet using the ADF UI and the same spreadsheet will be persisted to the DB and retrieved later.
We are using POI to process the spreadsheet.
The uploaded file is converted to a byte[] and sent to the app server. The app server persists it to the blob column.
But when I try to retrieve it later, I see the message "Excel found unreadable content in *****.xlsx. Do you want to recover the contents of this workbook?"
I could resolve this issue by converting the byte[] to an XSSFWorkbook and converting it back to a byte[] before persisting it.
But according to my requirements I may get very large spreadsheets, and initializing an XSSFWorkbook might result in out-of-memory issues.
The code to get the byte[] from the uploaded spreadsheet is as below:
if (uploadedFile != null) {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    InputStream inputStream = uploadedFile.getInputStream();
    int c;
    // Copy byte by byte, stopping before the -1 end-of-stream marker gets written out
    while ((c = inputStream.read()) != -1) {
        byteArrayOutputStream.write(c);
    }
    bytes = byteArrayOutputStream.toByteArray();
}
and the same byte[] is persisted into a blob column as below:
1. Assign this byte[] to the blob column.
2. Update the SQL update statement with the blob column.
3. Execute the statement.
Once the above steps are done, retrieve the spreadsheet as follows:
1. Read the blob column.
2. Get the byte[] from the blob column.
3. Set the content type of the response to support the spreadsheet.
4. Send the byte[].
But when I open the downloaded spreadsheet, I get the corrupted-spreadsheet error.
If I introduce an additional step, as below, after receiving the byte[] from the UI, the issue is solved:
InputStream is = new ByteArrayInputStream(uploadedSpreadSheetBytes);
XSSFWorkbook uploadedWorkbook = new XSSFWorkbook(is);
and then derive the byte[] again from the XSSFWorkbook as below:
byteArrayOutputStream = new ByteArrayOutputStream();
uploadedWorkbook.write(byteArrayOutputStream);
byte[] spreadSheetBytes = byteArrayOutputStream.toByteArray();
I feel converting the byte[] to an XSSFWorkbook and then converting the XSSFWorkbook back to a byte[] is redundant.
Any help would be appreciated.
Thanks.
The memory issues can be avoided by using event-based parsing (SAX) instead of initializing the entire XSSFWorkbook. That way you only process parts of the file at a time, which consumes less memory. See the POI spreadsheet how-to page for more info on SAX parsing.
You could also increase the JVM memory, but that's no guarantee of course.
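For illustration, a bare-bones SAX pass over the first sheet might look like the following, adapted from the POI how-to; the cell-counting handler is just a placeholder for your own logic:

import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SheetSaxScan {
    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("uploaded.xlsx")) {
            XSSFReader reader = new XSSFReader(pkg);
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                int cells = 0;
                @Override
                public void startElement(String uri, String local, String name, Attributes attrs) {
                    if ("c".equals(name)) cells++;   // "c" is the cell element in sheet XML
                }
                @Override
                public void endDocument() {
                    System.out.println("cells seen: " + cells);
                }
            });
            // Stream the first sheet's XML without building a workbook object in memory
            try (InputStream sheet = reader.getSheetsData().next()) {
                parser.parse(new InputSource(sheet));
            }
        }
    }
}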

Upload to S3 using Gzip in Java

I'm new to Java and I'm trying to upload a large file (~10 GB) to Amazon S3. Could anyone please help me with how to use a GZIP output stream for this?
I've been through some documentation but got confused about byte streams and gzip streams. Must they be used together? Can anyone help me with this piece of code?
Thanks in advance.
Have a look at this:
Is it possible to gzip and upload this string to Amazon S3 without ever being written to disk?
ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
GZIPOutputStream gzipOut = new GZIPOutputStream(byteOut);
// write your stuff to gzipOut
gzipOut.close(); // close (or finish) the gzip stream so the trailer is written
byte[] bytes = byteOut.toByteArray();
// write the bytes to the Amazon stream
Since it's a large file, you might also want to have a look at multipart upload; see the sketch below.
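For example, with the v1 AWS SDK the TransferManager takes care of the multipart mechanics for you (the bucket, key and file path here are just placeholders):

TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3Client).build();
Upload upload = tm.upload("my-bucket", "my-key.gz", new File("/tmp/large-file.gz"));
upload.waitForCompletion();   // blocks until all parts are uploaded
tm.shutdownNow();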
This question could have been more specific, and there are several ways to achieve this. One approach might look like the one below.
The example depends on the commons-io and commons-compress libraries, and uses classes from the java.nio.file package.
public static void compressAndUpload(AmazonS3 s3, InputStream in)
        throws IOException
{
    // Create temp file
    Path tmpPath = Files.createTempFile("prefix", "suffix");
    // Create and write to gzip compressor stream
    OutputStream out = Files.newOutputStream(tmpPath);
    GzipCompressorOutputStream gzOut = new GzipCompressorOutputStream(out);
    IOUtils.copy(in, gzOut);
    gzOut.close();   // must close so the gzip trailer is flushed to the temp file
    // Read content from temp file
    InputStream fileIn = Files.newInputStream(tmpPath);
    long size = Files.size(tmpPath);
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentType("application/x-gzip");
    metadata.setContentLength(size);
    // Upload file to S3
    s3.putObject(new PutObjectRequest("bucket", "key", fileIn, metadata));
}
Buffering, error handling and closing of the remaining streams are omitted for brevity.

Converting HTML to proper Excel File

I currently create xls files by outputting an HTML table with the proper header and file extension. Here is the sample code:
String tabledata = "<table>MORE TABLE HTML HERE</table>";
response.setContentType("application/vnd.ms-excel");
response.setHeader("Content-Disposition", "attachment; filename=\"somefile.xls\"");
out.println(tabledata);
This works great, except you get an error before opening: "The file format and extension of 'somefile.xls' don't match." Once opened, it looks exactly how I want it.
I have used Apache POI elsewhere, so I figured I could try it here, but I don't want to build all the elements in some long script, looping over rows, columns and different styles. I was thinking I could convert the HTML to an InputStream and then create the Workbook from it, but it doesn't like the headers. Here is the code:
InputStream is = new ByteArrayInputStream(tabledata.getBytes("UTF-8"));
HSSFWorkbook wb = new HSSFWorkbook(is);
FileOutputStream fileOut = new FileOutputStream("somefile.xls");
wb.write(fileOut);
fileOut.close();
Is something like this possible if the headers were added/changed? If so, how would I modify this code to work properly?

Append full PDF file to FOP PDF

I have an XML file already being created and rendered as a PDF, sent over a servlet:
TraxInputHandler input = new TraxInputHandler(
new File(XML_LOCATION+xmlFile+".xml"),
new File(XSLT_LOCATION)
);
ByteArrayOutputStream out = new ByteArrayOutputStream();
//driver is just `new Driver()`
synchronized (driver) {
driver.reset();
driver.setRenderer(Driver.RENDER_PDF);
driver.setOutputStream(out);
input.run(driver);
}
//response is HttpServletResponse
byte[] content = out.toByteArray();
response.setContentType("application/pdf");
response.setContentLength(content.length);
response.getOutputStream().write(content);
response.getOutputStream().flush();
This is all working perfectly fine.
However, I now have another PDF file that I need to include in the output. It is a totally separate .pdf file that I was given. Is there any way I can append this file to the response, the driver, out, or anything else, so that it is included in the response to the client? Will that even work? Or is there something else I need to do?
We also use FOP to generate some documents, and we accept uploaded documents, all of which we eventually combine into a single PDF.
You can't just send them sequentially down the stream, because the combined result needs a proper PDF file header, metadata, etc.
We use the iText library to combine the files, starting off with
PdfReader reader = new PdfReader(/*String*/fileName);
reader.consolidateNamedDestinations();
We then loop through, adding pages from each PDF to the new combined destination PDF, adjusting the bookmarks and page numbers as we go; the loop looks roughly like the sketch below.
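As an illustration, the page-copying loop with iText's PdfCopy might look like this (the file names and output path are illustrative, not our actual code):

Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream("combined.pdf"));
document.open();
for (String fileName : fileNames) {
    PdfReader reader = new PdfReader(fileName);
    reader.consolidateNamedDestinations();
    for (int page = 1; page <= reader.getNumberOfPages(); page++) {
        copy.addPage(copy.getImportedPage(reader, page));   // import each page into the combined PDF
    }
    copy.freeReader(reader);   // release the source reader's resources
}
document.close();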
AFAIK, FOP doesn't provide this sort of functionality.
