My method receives a buffered reader and transforms each line in my file. However I need to upload the output of this transformation to an s3 bucket. The files are quite large so I would like to be able to stream my upload into an s3 object.
To do so, I think I need to use a multipart upload however I'm not sure I'm using it correctly as nothing seems to get uploaded.
Here is my method:
public void transform(BufferedReader reader)
{
    Scanner scanner = new Scanner(reader);
    String row;
    List<PartETag> partETags = new ArrayList<>();
    InitiateMultipartUploadRequest request = new InitiateMultipartUploadRequest("output-bucket", "test.log");
    InitiateMultipartUploadResult result = amazonS3.initiateMultipartUpload(request);
    while (scanner.hasNext()) {
        row = scanner.nextLine();
        InputStream inputStream = new ByteArrayInputStream(row.getBytes(Charset.forName("UTF-8")));
        log.info(result.getUploadId());
        UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName("output-bucket")
                .withKey("test.log")
                .withUploadId(result.getUploadId())
                .withInputStream(inputStream)
                .withPartNumber(1)
                .withPartSize(5 * 1024 * 1024);
        partETags.add(amazonS3.uploadPart(uploadRequest).getPartETag());
    }
    log.info(result.getUploadId());
    CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            "output-bucket",
            "test.log",
            result.getUploadId(),
            partETags);
    amazonS3.completeMultipartUpload(compRequest);
}
Oh, I see. The UploadPartRequest needs to read from an InputStream, while your transformation naturally produces output. That is a valid constraint, since in general you write to output streams and read from input streams.
You have probably heard that you can copy data from an InputStream into a ByteArrayOutputStream, take the resulting byte array and wrap it in a ByteArrayInputStream, and feed that to your request object. BUT: all of the data would sit in a single byte array at some point. Since your use case is about large files, that is not acceptable.
What you need is a custom input stream class that transforms the original input stream into another input stream. It requires you to work at the byte level of abstraction, but it offers the best performance. I suggest asking a new question if you would like to know more about that.
Is your transformation code already finished and you don't want to touch it again? There is another approach: you can also just "connect" an output stream to an input stream by using pipes: https://howtodoinjava.com/java/io/convert-outputstream-to-inputstream-example/. The catch: you are dealing with multi-threading here.
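A minimal sketch of that pipe approach, assuming the transformation works line by line (PipeExample, transformedStream and transform are illustrative names, not part of the question or of the AWS SDK):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class PipeExample {

    // Returns an InputStream that yields the transformed lines; the writing happens on a separate thread.
    public static InputStream transformedStream(BufferedReader reader) throws IOException {
        PipedInputStream in = new PipedInputStream(64 * 1024);  // consumer side, e.g. for UploadPartRequest.withInputStream(...)
        PipedOutputStream out = new PipedOutputStream(in);      // producer side, connected to 'in'
        Thread writer = new Thread(() -> {
            try {
                String row;
                while ((row = reader.readLine()) != null) {
                    out.write((transform(row) + "\n").getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException e) {
                // a real implementation should surface this, e.g. abort the multipart upload
            } finally {
                try { out.close(); } catch (IOException ignored) { }
            }
        }, "transform-writer");
        writer.start();
        return in;  // read (and upload) from this stream on the current thread
    }

    private static String transform(String row) {
        return row;  // placeholder for the actual per-line transformation
    }
}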
Related
I am currently developing a REST service which receives, in its request, a field containing a file in base64 format (a string of "n" characters). Within the service logic I convert that character string to a File and save it to a predetermined path.
The problem is that when the file is somewhat large (3 MB) the service becomes slow and takes a long time to respond.
This is the code I am using:
String filename = "TEXT.DOCX";
BufferedOutputStream stream = null;
// THE FIELD base64file IS THE BASE64 STRING THAT COMES IN THE REQUEST
byte[] fileByteArray = java.util.Base64.getDecoder().decode(base64file);
// VALIDATE THE FILE SIZE
if (1 * 1024 * 1024 < fileByteArray.length) {
    logger.info("The file [" + filename + "] is too large");
} else {
    stream = new BufferedOutputStream(new FileOutputStream(new File("C:\\" + filename)));
    stream.write(fileByteArray);
}
How can I avoid this problem, so that my service does not take so long to write the file?
Buffering does not improve your performance here, as all you are trying to do is write the file as fast as possible. Generally the code looks fine; change it to use the FileOutputStream directly and see if that improves things:
try (FileOutputStream stream = new FileOutputStream(path)) {
    stream.write(bytes);
}
Alternatively you could also try using something like Apache Commons to do the task for you:
FileUtils.writeByteArrayToFile(new File(path), bytes);
Try the following; it also works for large files.
Path outPath = Paths.get(filename);
try (InputStream in = Base64.getDecoder().wrap(base64file)) {
    Files.copy(in, outPath);
}
This keeps only a small buffer in memory. Your original code may become slow because it requires much more memory.
wrap takes an InputStream, which you should provide, not the entire String.
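For illustration, a hedged sketch of how you could supply that InputStream when all you have is the base64 String (the class Base64ToFile and method save are made-up names; base64file and filename come from the question):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64ToFile {
    public static void save(String base64file, String filename) throws IOException {
        Path outPath = Paths.get(filename);
        // wrap the encoded characters in a stream, then decode lazily while copying to disk
        InputStream encoded = new ByteArrayInputStream(base64file.getBytes(StandardCharsets.US_ASCII));
        try (InputStream in = Base64.getDecoder().wrap(encoded)) {
            Files.copy(in, outPath);
        }
    }
}

Even better would be to skip the intermediate String entirely and wrap the request's own input stream, so the encoded data never has to be held in memory either.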
From a network point of view:
Both JSON and XML can support large amounts of data exchange, and 3 MB is not really huge. But there is a limit on how much a browser can handle (if this call is made from a user interface).
Also, a web server like Tomcat only accepts 2 MB POST bodies by default (check maxPostSize, http://tomcat.apache.org/tomcat-6.0-doc/config/http.html#Common_Attributes).
You can also try chunking the request payload (although it shouldn't be required for a 3MB file)
From an implementation point of view:
The write operation on your disk could be slow; it also depends on your OS.
If your file size is really large, you can use Java FileChannel class with ByteBuffer.
To find the cause of the slowness (network delay or code), compare the performance of a simple standalone Java program against the web service call.
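As a rough sketch of that FileChannel/ByteBuffer suggestion, assuming you already have the decoded byte array (class, method and path names are illustrative):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelWrite {
    public static void write(byte[] fileByteArray, String path) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(fileByteArray);
            while (buffer.hasRemaining()) {
                channel.write(buffer); // may take several calls to drain the buffer
            }
        }
    }
}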
Consider the following code snippet. The getInputStreamForRead() method creates and returns a new input stream for reading.
InputStream is = getInputStreamForRead(); //This method creates and returns an input stream for file read
is = getDecompressedStream(is);
Since the original file content is compressed when stored, it has to be decompressed while reading. Hence the getDecompressedStream() method below provides a way to decompress the stream content:
public InputStream getDecompressedStream(InputStream is) throws Exception {
    return new GZIPInputStream(is);
}
I have the following doubts.
Which one is correct for the above snippet:
is = getDecompressedStream(is)
or
InputStream newStream = getDecompressedStream(is)
Will reusing the InputStream variable cause any trouble?
I'm completely new to streams. Kindly help me understand this.
As long as:
you're not manipulating the original InputStream between the original assignment and the new invocation
you're always closing your streams in a finally block
... you should be fine re-assigning to the original variable - it's just a new value assigned to an existing reference.
In fact, that may be the recommended way, since you get to only close one Closeable programmatically, as GZIPInputStream#close...
Closes this input stream and releases any system resources associated with the stream.
(see the GZIPInputStream#close Javadoc - I read this as "closes the underlying stream").
Since you want to close the input stream correctly, the best way is to create the input stream using chaining, and use a try-with-resources to handle the close for you.
try (InputStream is = getDecompressedStream(getInputStreamForRead())) {
    // code using the stream here
}
It seems that there are many, many ways to read text files in Java (BufferedReader, DataInputStream etc.) My personal favorite is Scanner with a File in the constructor (it's just simpler, works with mathy data processing better, and has familiar syntax).
Boris the Spider also mentioned Channel and RandomAccessFile.
Can someone explain the pros and cons of each of these methods? To be specific, when would I want to use each?
(edit) I think I should be specific and add that I have a strong preference for the Scanner method. So the real question is, when wouldn't I want to use it?
Let's start at the beginning: the question is, what do you want to do?
It's important to understand what a file actually is. A file is a collection of bytes on a disk; these bytes are your data. Java provides various levels of abstraction above that:
File(Input|Output)Stream - read these bytes as a stream of byte.
File(Reader|Writer) - read from a stream of bytes as a stream of char.
Scanner - read from a stream of char and tokenise it.
RandomAccessFile - read these bytes as a searchable byte[].
FileChannel - read these bytes in a safe multithreaded way.
On top of each of those there are the Decorators; for example, you can add buffering with BufferedXXX. You could add linebreak awareness to a FileWriter with a PrintWriter. You could turn an InputStream into a Reader with an InputStreamReader (currently the only way to specify the character encoding for a Reader).
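For example, a typical decorator chain looks like this (the file name and charset are arbitrary): wrap the raw byte stream in a Reader with an explicit encoding, then add buffering on top.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class DecoratorChain {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("myFile.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}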
So - when wouldn't I want to use it [a Scanner]?
You would not use a Scanner if you wanted to (these are some examples):
Read in data as bytes
Read in a serialized Java object
Copy bytes from one file to another, maybe with some filtering (see the sketch after this list).
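As a sketch of that last case (the paths are made up; for a plain copy Files.copy(Path, Path) would be enough, the explicit loop is where filtering would go):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteCopy {
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get("source.bin"));
             OutputStream out = Files.newOutputStream(Paths.get("target.bin"))) {
            byte[] buffer = new byte[8 * 1024];
            int length;
            while ((length = in.read(buffer)) != -1) {
                // a filter could inspect or transform buffer[0..length) here
                out.write(buffer, 0, length);
            }
        }
    }
}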
It is also worth noting that the Scanner(File file) constructor takes the File, opens a FileInputStream and decodes it with the platform default encoding - this is almost always a bad idea. It is generally recognised that you should specify the encoding explicitly to avoid nasty encoding-based bugs. Further, the stream isn't buffered.
So you may be better off with
try (final Scanner scanner = new Scanner(new BufferedInputStream(new FileInputStream("myFile.txt")), "UTF-8")) {
    //do stuff
}
Ugly, I know.
It's worth noting that Java 7 provides a further layer of abstraction to remove the need to loop over files - these are in the Files class:
byte[] Files.readAllBytes(Path path)
List<String> Files.readAllLines(Path path, Charset cs)
Both these methods read the entire file into memory, which might not be appropriate. In Java 8 this is further improved by adding support for the new Stream API:
Stream<String> Files.lines(Path path, Charset cs)
Stream<Path> Files.list(Path dir)
For example to get a Stream of words from a Path you can do:
final Stream<String> words = Files.lines(Paths.get("myFile.txt"))
        .flatMap((in) -> Arrays.stream(in.split("\\b")));
SCANNER:
can parse primitive types and strings using regular expressions.
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types. More can be read at http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
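A tiny illustration of that tokenising behaviour (the input string is made up):

import java.util.Scanner;

public class ScannerTokens {
    public static void main(String[] args) {
        // Scanner splits on whitespace by default and converts tokens on demand
        Scanner scanner = new Scanner("42 apples and 7 oranges");
        int count = scanner.nextInt();  // 42
        String fruit = scanner.next();  // "apples"
        System.out.println(count + " " + fruit);
        scanner.close();
    }
}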
DATA INPUT STREAM:
Lets an application read primitive Java data types from an underlying input stream in a machine-independent way. An application uses a data output stream to write data that can later be read by a data input stream. DataInputStream is not necessarily safe for multithreaded access; thread safety is optional and is the responsibility of users of methods in this class. More can be read at http://docs.oracle.com/javase/7/docs/api/java/io/DataInputStream.html
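A short round-trip example (the file name data.bin is arbitrary): primitives written with a DataOutputStream are read back in the same order with a DataInputStream.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class DataStreams {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("data.bin"))) {
            out.writeInt(42);
            out.writeDouble(3.14);
            out.writeUTF("hello");
        }
        try (DataInputStream in = new DataInputStream(new FileInputStream("data.bin"))) {
            System.out.println(in.readInt());     // 42
            System.out.println(in.readDouble());  // 3.14
            System.out.println(in.readUTF());     // hello
        }
    }
}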
BufferedReader:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines. The buffer size may be specified, or the default size may be used; the default is large enough for most purposes. In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient. Programs that use DataInputStreams for textual input can be localized by replacing each DataInputStream with an appropriate BufferedReader. More details are at http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
NOTE: This approach is outdated. As Boris points out in his comment. I will leave it here for history, but you should use methods available in JDK.
It depends on what kind of operation you are doing and the size of the file you are reading.
In most of the cases, I recommend using commons-io for small files.
byte[] data = FileUtils.readFileToByteArray(new File("myfile"));
You can read it as string or character array...
If, on the other hand, you are handling big files, or changing parts of a file directly on the filesystem, then the best option is to use a RandomAccessFile and potentially even a FileChannel to do it "nio" style.
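A hedged sketch of editing part of a file in place with RandomAccessFile (the offset, payload and file name are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PatchFile {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("myfile", "rw")) {
            file.seek(128);                                           // jump to byte offset 128
            file.write("PATCH".getBytes(StandardCharsets.US_ASCII));  // overwrite five bytes in place
        }
    }
}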
Using BufferedReader
BufferedReader reader;
char[] buffer = new char[10];
reader = new BufferedReader(new FileReader("FILE_PATH"));
// or
reader = Files.newBufferedReader(Paths.get("FILE_PATH"));
int length;
while ((length = reader.read(buffer)) != -1) {
    System.out.print(new String(buffer, 0, length));
}
// or, line by line
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();
Using FileInputStream - Read Binary Files into Bytes
InputStream fis;
byte[] buffer = new byte[10];
fis = new FileInputStream("FILE_PATH");
// or
fis = Files.newInputStream(Paths.get("FILE_PATH"));
int length;
while ((length = fis.read(buffer)) != -1) {
    System.out.print(new String(buffer, 0, length));
}
fis.close();
Using Files - Read Small File to List of Strings
List<String> allLines = Files.readAllLines(Paths.get("FILE_PATH"));
for (String line : allLines) {
    System.out.println(line);
}
Using Scanner – Read Text File as Iterator
Scanner scanner = new Scanner(new File("FILE_PATH"));
while (scanner.hasNextLine()) {
    System.out.println(scanner.nextLine());
}
scanner.close();
Using RandomAccessFile - Reading Files in Read-Only Mode
RandomAccessFile file = new RandomAccessFile("FILE_PATH", "r");
String str;
while ((str = file.readLine()) != null) {
    System.out.println(str);
}
file.close();
Using Files.lines - Reading lines as a stream
Stream<String> lines = Files.lines(Paths.get("FILE_PATH"));
lines.forEach(s -> System.out.println(s));
Using FileChannel - for better performance, using off-heap (direct) buffers; a MappedByteBuffer is another option
FileInputStream i = new FileInputStream("FILE_PATH");
ReadableByteChannel r = i.getChannel();
ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024);
while (r.read(buffer) != -1) {
    buffer.flip();
    while (buffer.hasRemaining()) {
        System.out.print((char) buffer.get()); // assumes a single-byte encoding
    }
    buffer.clear();
}
i.close();
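And a hedged sketch of the MappedByteBuffer variant mentioned in the heading (same FILE_PATH placeholder; note that a single mapping is limited to 2 GB, and printing each byte as a char assumes a single-byte encoding):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("FILE_PATH", "r");
             FileChannel channel = file.getChannel()) {
            // map the whole file into (off-heap) memory and read it without explicit read() calls
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            while (buffer.hasRemaining()) {
                System.out.print((char) buffer.get());
            }
        }
    }
}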
I wanted to know about the difference between these two blocks of code.
byte[] fileBytes = FileUtils.readFileToByteArray(new File(completeFilePath.toString()));
..
return new FileTransfer(errorFileName, "application/vnd.ms-excel", fileBytes);
and
File csvFile = new File(completeFilePath.toString());
InputStream is = new BufferedInputStream(new FileInputStream(csvFile));
return new FileTransfer(errorFileName, "application/vnd.ms-excel", is);
Any advantages and disadvantages of either approach are welcome, to clear up the details.
Thanks in advance.
FileTransfer has multiple constructors which expect different parameters.
Your first example calls the constructor which takes the content as a byte array (byte[]).
Your second example calls the constructor which takes an InputStream and will read the content itself from the passed InputStream.
If your file is big, obviously don't use the first one, because it requires the whole file to be read into memory.
The second approach seems better in all cases, except when you also need the file content yourself, in which case you would have to read it twice.
I'm currently working on a project that is done in Java, on google appengine.
App Engine does not allow files to be stored, so on-disk representation objects such as the File class cannot be used.
I want to write data and export it to a few csv files, and then zip it up, and allow the user to download it.
How may I do this without using any File classes? I'm not very experienced in file handling so I hope you guys can advise me.
Thanks.
You can create a zip file and add to it while the user is downloading it. If you are using a servlet, this is straightforward:
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    // ..... process request
    // ..... then respond
    response.setContentType("application/zip");
    response.setStatus(HttpServletResponse.SC_OK);
    // note: intentionally no content-length set; automatic chunked transfer if the stream is larger than the internal buffer of the response
    ZipOutputStream zipOut = new ZipOutputStream(response.getOutputStream());
    byte[] buffer = new byte[1024 * 32];
    try {
        // case 1: already have an input stream, typically a ByteArrayInputStream from a byte[] full of previously prepared csv data
        InputStream in = new BufferedInputStream(getMyFirstInputStream());
        try {
            zipOut.putNextEntry(new ZipEntry("FirstName"));
            int length;
            while ((length = in.read(buffer)) != -1) {
                zipOut.write(buffer, 0, length);
            }
            zipOut.closeEntry();
        } finally {
            in.close();
        }
        // case 2: write directly to the output stream, i.e. you have your raw data but need to create the csv representation
        zipOut.putNextEntry(new ZipEntry("SecondName"));
        // example setup; the key is to use the write methods of the 'zipOut' output stream below
        MySerializer mySerializer = new MySerializer(); // i.e. csv-writer
        Object myData = getMyData(); // the data to be processed by the serializer in order to make a csv file
        mySerializer.setOutput(zipOut);
        // write whatever you have to the zipOut
        mySerializer.write(myData);
        zipOut.closeEntry();
        // repeat for the next file.. or make a for-loop
    } finally {
        zipOut.close();
    }
}
There is no reason to store your data in files unless you have memory constraints. Files give you InputStream and OutputStream, both of which have in-memory equivalents.
Note that creating a CSV writer usually means doing something like the sketch below, where the point is to take a piece of data (an array list or map, whatever you have) and turn it into byte[] parts. Append the byte[] parts to an OutputStream using a tool like DataOutputStream (make your own if you like) or OutputStreamWriter.
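A hedged sketch of such a CSV writer (all names are illustrative): rows are written straight to any OutputStream, for example the ZipOutputStream from the servlet above.

import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SimpleCsvWriter {
    public static void write(List<String[]> rows, OutputStream out) throws IOException {
        // do not close the writer here if 'out' is a ZipOutputStream you still need; flush instead
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        for (String[] row : rows) {
            writer.write(String.join(",", row)); // naive: no quoting or escaping of commas
            writer.write("\r\n");
        }
        writer.flush();
    }
}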
If your data is not huge, meaning it can stay in memory, then exporting to CSV, zipping it up and streaming it for download can all be done on the fly. Caching can be done at any of these steps, which depends greatly on your application's business logic.