Stream content to Google Cloud Storage - java

I would like to upload a large Set<Integer> to Google Cloud Storage. I can do that with:
Blob result = storage.create(blobInfo, Joiner.on('\n').join(set).getBytes(UTF_8));
But this will create an intermediate String with all the content, which might be too large.
I found an example with WriteChannel.write():
Set<Integer> set = ...
String bucketName = "my-unique-bucket";
String blobName = "my-blob-name";
BlobId blobId = BlobId.of(bucketName, blobName);
byte[] content = Joiner.on('\n').join(set).getBytes(UTF_8);
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).setContentType("text/plain").build();
try (WriteChannel writer = storage.writer(blobInfo)) {
writer.write(ByteBuffer.wrap(content, 0, content.length));
} catch (IOException ex) {
// handle exception
}
However, if I do that, the entire set is converted to a String and then to byte[]. The String itself might be too big.
Is there an example of how to iterate over the set and transform it to a ByteBuffer? Or should I loop over chunks of the set?

The most straightforward approach I could think of would be:
try (WriteChannel writer = storage.writer(blobInfo)) {
    for (Integer val : set) {
        String valLine = val.toString() + '\n';
        writer.write(ByteBuffer.wrap(valLine.getBytes(UTF_8)));
    }
}
Mind you, this isn't very efficient. It creates a lot of small ByteBuffers. You could greatly improve on this by writing into a single larger ByteBuffer and periodically calling writer.write with it.
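For instance, a minimal sketch of that batching idea, assuming the same set, storage and blobInfo as above (the 64 KiB chunk size is an arbitrary choice, not something the API requires):
try (WriteChannel writer = storage.writer(blobInfo)) {
    ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
    for (Integer val : set) {
        byte[] line = (val.toString() + '\n').getBytes(UTF_8);
        if (buffer.remaining() < line.length) {
            // flush the accumulated chunk before it overflows
            buffer.flip();
            while (buffer.hasRemaining()) {
                writer.write(buffer);
            }
            buffer.clear();
        }
        buffer.put(line);
    }
    // flush whatever is left in the buffer
    buffer.flip();
    while (buffer.hasRemaining()) {
        writer.write(buffer);
    }
} catch (IOException ex) {
    // handle exception
}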

To avoid creating an intermediate String with all the bytes, you can upload from a file. You can find example code for uploading from a file in various languages here.
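If that fits your setup, a hedged sketch of the file-based route with the same library might look like this; it assumes a google-cloud-storage version that provides Storage#createFrom(BlobInfo, Path), which streams the file without loading it into memory:
// write the set to a temporary file first
Path tmp = Files.createTempFile("set-upload", ".txt");
try (BufferedWriter fileWriter = Files.newBufferedWriter(tmp, UTF_8)) {
    for (Integer val : set) {
        fileWriter.write(val.toString());
        fileWriter.newLine();
    }
}
// then let the client library stream the file to GCS
Blob result = storage.createFrom(blobInfo, tmp);
Files.delete(tmp);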

Related

How to upload a file to GCS only if it is of a specific version in Java?

I am using GCS to store uploaded files, and I would like to avoid race conditions by using generations and preconditions (x-goog-if-generation-match).
There is an explanation of how to do that in the docs.
However, I am using the Java API, and the docs only show the JSON/XML APIs.
The solution I found is doing something like this:
Blob object =
    storage.get(
        bucketName,
        objectName,
        Storage.BlobGetOption.fields(BlobField.GENERATION));
long generation = object.getGeneration();
// now do some other stuff that might cause race conditions...
BlobId blobId = BlobId.of(bucketName, objectName, generation);
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).setContentType("text/plain").build();
List<Storage.BlobWriteOption> blobWriteOptions = new ArrayList<>();
blobWriteOptions.add(BlobWriteOption.generationMatch());
// if the generation doesn't match, this will throw "StorageException: 412 Precondition Failed"
try (WriteChannel writer =
        storage.writer(blobInfo, blobWriteOptions.toArray(new BlobWriteOption[0]))) {
    writer.write(ByteBuffer.wrap(content)); // write the actual payload here (content is a byte[])
}
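To handle the precondition failure explicitly, one option (a sketch, assuming content is the byte[] payload being written) is to catch the StorageException and inspect its code:
try (WriteChannel writer =
        storage.writer(blobInfo, Storage.BlobWriteOption.generationMatch())) {
    writer.write(ByteBuffer.wrap(content));
} catch (StorageException e) {
    if (e.getCode() == 412) {
        // another writer changed the object since we read its generation;
        // re-read the generation and retry the whole read-modify-write cycle
    } else {
        throw e;
    }
} catch (IOException e) {
    // handle ordinary I/O failures separately from the precondition failure
}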

Upload multiple blobs to Azure Storage

I have the following code to upload a single blob to Azure Storage using azure-storage-blob 12.5.0.
Is there any way to pass a collection of byte arrays and do it in some kind of batch upload?
public void store(final String blobPath, final String originalFileName, final byte[] bytes) {
    final BlobClient blobClient = containerClient.getBlobClient(blobPath);
    final BlockBlobClient blockBlobClient = blobClient.getBlockBlobClient();
    try (ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes)) {
        blockBlobClient.upload(inputStream, bytes.length, true);
    } catch (BlobStorageException | IOException exc) {
        throw new StorageException(exc);
    }
}
Is there any way to pass a collection of byte arrays and do it in some kind of batch upload?
In the V8 SDK, I found that the uploadFromByteArray method supports a byte[] parameter.
CloudBlockBlob blob = container.getBlockBlobReference("helloV8.txt");
String str1 = "132";
String str2 = "asd";
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write(str1.getBytes());
os.write(str2.getBytes());
byte[] byteArray = os.toByteArray();
blob.uploadFromByteArray(byteArray, 0, byteArray.length);
No such method can be found in the V12 SDK; there is only the upload method you used in your question. In fact, uploadFromByteArray internally calls upload as well.
If you mean uploading multiple blobs in one batch, I'm afraid that is not supported in the official SDK except by using a for loop. For bulk writing, you could refer to the Azure CLI and AzCopy scenarios mentioned in this document.
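If the goal is simply to accept a collection of byte arrays on your side, a hedged sketch of the for-loop approach with the V12 SDK (reusing the containerClient and the StorageException wrapper from your own code) could look like this:
public void storeAll(final Map<String, byte[]> blobsByPath) {
    for (final Map.Entry<String, byte[]> entry : blobsByPath.entrySet()) {
        final BlockBlobClient blockBlobClient =
            containerClient.getBlobClient(entry.getKey()).getBlockBlobClient();
        try (ByteArrayInputStream inputStream = new ByteArrayInputStream(entry.getValue())) {
            // each byte[] still becomes its own blob; there is no true batch upload here
            blockBlobClient.upload(inputStream, entry.getValue().length, true);
        } catch (BlobStorageException | IOException exc) {
            throw new StorageException(exc);
        }
    }
}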

How to properly open a png file

I am trying to attach a png file. Currently, when I send the email, the attachment is twice as big as the file should be and is an invalid png file. Here is the code I currently have:
import com.sendgrid.*;
Attachments attachments = new Attachments();
String filePath = "/Users/david/Desktop/screenshot5.png";
String data = "";
try {
    data = new String(Files.readAllBytes(Paths.get(filePath)));
} catch (IOException e) {
}
byte[] encoded = Base64.encodeBase64(data.getBytes());
String encodedString = new String(encoded);
attachments.setContent(encodedString);
Perhaps I am encoding the data incorrectly? What would be the correct way to 'get' the data to attach it?
With respect, this is why Python presents a problem to modern developers. It abstracts away important concepts that you can't fully understand in interpreted languages.
First, and this is a relatively basic concept, but you can't convert arbitrary byte sequences to a string and hope it works out. The following line is your first problem:
data = new String(Files.readAllBytes(Paths.get(filePath)));
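A tiny illustration of why this fails, assuming a UTF-8 default charset: bytes that are not valid in the charset get replaced during decoding, so the round trip is lossy:
byte[] original = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF, (byte) 0xFE};
byte[] roundTripped = new String(original).getBytes();
// typically prints false: the invalid bytes were replaced during decoding
System.out.println(Arrays.equals(original, roundTripped));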
EDIT: It looks like the library you are using expects the file to be base64 encoded. I have no idea why. Try changing your code to this:
Attachments attachments = new Attachments();
String filePath = "/Users/david/Desktop/screenshot5.png";
try {
    byte[] encoded = Base64.encodeBase64(Files.readAllBytes(Paths.get(filePath)));
    String encodedString = new String(encoded);
    attachments.setContent(encodedString);
} catch (IOException e) {
}
The only issue you were having is that you were trying to represent arbitrary bytes as a string.
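If you'd rather not pull in commons-codec, the same fix works with the JDK's own java.util.Base64 (a sketch, not a SendGrid-specific API):
byte[] fileBytes = Files.readAllBytes(Paths.get(filePath));
// Base64 output is plain ASCII, so turning it into a String is safe here
String encodedString = java.util.Base64.getEncoder().encodeToString(fileBytes);
attachments.setContent(encodedString);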
Take a look at the Builder class in the repository here. Example:
FileInputStream fileContent = new FileInputStream(filePath);
Attachments.Builder builder = new Attachments.Builder(fileName, fileContent);
mail.addAttachments(builder.build());

How to download a large file from Google Cloud Storage using Java with checksum control

I want to download large files from Google Cloud Storage using the Google-provided Java library com.google.cloud.storage. I have working code, but I still have one question and one major concern:
My major concern is: when is the file content actually downloaded? During (referring to the code below) storage.get(blobId), during blob.reader(), or during reader.read(bytes)? This becomes very important when it comes to handling an invalid checksum: what do I need to do to actually trigger fetching the file over the network again?
The simpler question is: is there built-in functionality in the Google library to do an MD5 (or CRC32C) check on the received file? Maybe I don't need to implement it on my own.
Here is my method trying to download big files from Google Cloud Storage:
private static final int MAX_NUMBER_OF_TRIES = 3;

public Path downloadFile(String storageFileName, String bucketName) throws IOException {
    // In my real code, this is a field populated in the constructor.
    Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());
    BlobId blobId = BlobId.of(bucketName, storageFileName);
    Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
    int retryCounter = 1;
    Blob blob;
    boolean checksumOk;
    MessageDigest messageDigest;
    try {
        messageDigest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException ex) {
        throw new RuntimeException(ex);
    }
    do {
        LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
        blob = storage.get(blobId);
        if (null == blob) {
            throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
        }
        if (Files.exists(outputFile)) {
            Files.delete(outputFile);
        }
        try (ReadChannel reader = blob.reader();
             FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
            ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
            int bytesRead = reader.read(bytes);
            while (bytesRead > 0) {
                bytes.flip();
                messageDigest.update(bytes.array(), 0, bytesRead);
                channel.write(bytes);
                bytes.clear();
                bytesRead = reader.read(bytes);
            }
        }
        String checksum = Base64.encodeBase64String(messageDigest.digest());
        checksumOk = checksum.equals(blob.getMd5());
        if (!checksumOk) {
            Files.delete(outputFile);
            messageDigest.reset();
        }
    } while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
    if (!checksumOk) {
        throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
    }
    return outputFile;
}
The google-cloud-java storage library does not validate checksums on its own when reading data beyond normal HTTPS/TCP correctness checking. If it compared the MD5 of the received data to the known MD5, it would need to download the entire file before it could return any results from read(), which for very large files would be infeasible.
What you're doing is a good idea if you need the additional protection of comparing MD5s. If this is a one-off task, you could use the gsutil command-line tool, which does this same sort of additional check.
As the JavaDoc of ReadChannel says:
Implementations of this class may buffer data internally to reduce remote calls.
So the implementation you get from blob.reader() could cache the whole file, some bytes or nothing and just fetch byte for byte when you call read(). You will never know and you shouldn't care.
As only read() throws an IOException and the other methods you used do not, I'd say that only calling read() will actually download stuff. You can also see this in the sources of the lib.
Btw. despite the example in the JavaDocs of the library, you should check for >= 0, not > 0. 0 just means nothing was read, not that end of stream is reached. End of stream is signaled by returning -1.
For retrying after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself, so getting a new reader from the blob means the file will be re-downloaded from the remote.
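Putting those points together, a sketch of the download loop with a fresh reader per attempt and the >= 0 check (reusing blob, outputFile and messageDigest from the question's code):
try (ReadChannel reader = blob.reader();  // a new reader per retry forces a fresh download
     FileChannel channel = FileChannel.open(outputFile,
             StandardOpenOption.CREATE, StandardOpenOption.WRITE,
             StandardOpenOption.TRUNCATE_EXISTING)) {
    ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
    int bytesRead;
    while ((bytesRead = reader.read(bytes)) >= 0) {
        bytes.flip();
        messageDigest.update(bytes.array(), 0, bytes.limit());
        channel.write(bytes);
        bytes.clear();
    }
}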

JaxRS create and return zip file from server

I want to create and return a zip file from my server using JaxRS. I don't think that I want to create an actual file on the server, if possible I would like to create the zip on the fly and pass that back to the client. If I create a huge zip file on the fly will I run out of memory if too many files are in the zip file?
Also, I am not sure of the most efficient way to do this. Here is what I was thinking, but I am very rusty when it comes to input/output in Java.
public Response getFiles() {
    // These are the files to include in the ZIP file
    String[] filenames = // ... bunch of filenames
    byte[] buf = new byte[1024];
    try {
        // Create the ZIP file
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(baos));
        // Compress the files
        for (String filename : filenames) {
            FileInputStream in = new FileInputStream(filename);
            // Add ZIP entry to output stream.
            out.putNextEntry(new ZipEntry(filename));
            // Transfer bytes from the file to the ZIP file
            int len;
            while ((len = in.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
            // Complete the entry
            out.closeEntry();
            in.close();
        }
        // Complete the ZIP file
        out.close();
        ResponseBuilder response = Response.ok(out); // Not a 100% sure this will work
        response.type(MediaType.APPLICATION_OCTET_STREAM);
        response.header("Content-Disposition", "attachment; filename=\"files.zip\"");
        return response.build();
    } catch (IOException e) {
    }
}
Any help would be greatly appreciated.
There are two options:
1- Create the ZIP in a temporary directory and then send it to the client.
2- Use the OutputStream from the Response to send the ZIP directly to the client as you are creating it.
But never build a huge ZIP file in memory.
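A minimal sketch of option 1, assuming filenames is the same array from your question: build the ZIP in a temporary file and let JAX-RS stream that file back.
java.nio.file.Path tmpZip = java.nio.file.Files.createTempFile("files", ".zip");
try (ZipOutputStream out = new ZipOutputStream(java.nio.file.Files.newOutputStream(tmpZip))) {
    for (String filename : filenames) {
        out.putNextEntry(new ZipEntry(filename));
        // copy the file straight into the ZIP entry, no byte[] held in memory
        java.nio.file.Files.copy(java.nio.file.Paths.get(filename), out);
        out.closeEntry();
    }
}
// remember to clean up tmpZip after the response has been sent
return Response.ok(tmpZip.toFile(), MediaType.APPLICATION_OCTET_STREAM)
        .header("Content-Disposition", "attachment; filename=\"files.zip\"")
        .build();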
There's no need to create the ZIP file from the first to the last byte in the memory before serving it to the client. Also, there's no need to create such a file in temp directory in advance as well (especially because the IO might be really slow).
The key is to start streaming the "ZIP response" and to generate the content on the fly.
Let's say we have a method aMethodReturningStream(), which returns a Stream, and we want to turn each element into a file stored in the ZIP file. We also don't want to keep the bytes of each element in any intermediate representation, like a collection or an array.
Then such a pseudocode might help:
@GET
@Produces("application/zip")
public Response generateZipOnTheFly() {
    StreamingOutput output = strOut -> {
        try (ZipOutputStream zout = new ZipOutputStream(strOut)) {
            aMethodReturningStream().forEach(singleStreamElement -> {
                try {
                    ZipEntry zipEntry = new ZipEntry(createFileName(singleStreamElement));
                    FileTime fileTime = FileTime.from(singleStreamElement.getCreationTime());
                    zipEntry.setCreationTime(fileTime);
                    zipEntry.setLastModifiedTime(fileTime);
                    zout.putNextEntry(zipEntry);
                    zout.write(singleStreamElement.getBytes());
                    zout.flush();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
    };
    return Response.ok(output)
            .header("Content-Disposition", "attachment; filename=\"generated.zip\"")
            .build();
}
This concept relies on passing a StreamingOutput to the Response builder. The StreamingOutput is not a full response/entity/body generated before sending the response, but a recipe used to generate the flow of bytes on the fly (here wrapped in a ZipOutputStream). If you're not sure about this, set a breakpoint on flush() and observe the download progress using e.g. wget.
The key thing to remember is that the stream here is not a "wrapper" around pre-computed or pre-fetched items. It must be dynamic, e.g. wrapping a DB cursor or something similar, and it can be replaced by anything else that streams data. That's why it cannot be a foreach loop iterating over an Element[] elems array (with each Element holding all of its bytes "inside"), like
for (Element elem : elems)
if you'd like to avoid reading all items into the heap at once before streaming the ZIP.
(Please note this is a pseudocode and you might want to add better handling and polish other stuff as well.)
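For completeness, the same streaming pattern applied to the original question's list of file names (a sketch; filenames is assumed to be the array from the question) would look roughly like this. Iterating over names is fine here because the bytes themselves are streamed from disk, not held in memory:
StreamingOutput output = strOut -> {
    try (ZipOutputStream zout = new ZipOutputStream(strOut)) {
        for (String filename : filenames) {
            zout.putNextEntry(new ZipEntry(filename));
            // stream each file directly into the ZIP entry
            Files.copy(Paths.get(filename), zout);
            zout.closeEntry();
        }
    }
};
return Response.ok(output)
        .type(MediaType.APPLICATION_OCTET_STREAM)
        .header("Content-Disposition", "attachment; filename=\"files.zip\"")
        .build();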
