Obtain Folder size in Azure Data Lake Gen2 using Java

There is some material on the internet about computing folder size with C#, but I could not find anything for Java.
1. Is there an easy way to get the folder size in Gen2?
2. If not, how can it be computed?
There are several examples on the internet for (2) using C# and PowerShell. Is there any way to do it with Java?

As far as I am aware, there is no API that directly provides the folder size in Azure Data Lake Gen2.
To do it recursively:
DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClientBuilder()
        .credential(new StorageSharedKeyCredential(storageAccountName, secret))
        .endpoint(endpoint)
        .buildClient();
DataLakeFileSystemClient container = dataLakeServiceClient.getFileSystemClient(containerName);
/**
 * Returns the size in bytes of the given folder, computed recursively.
 *
 * @param folder path of the folder inside the file system
 * @return total size of the files under the folder, in bytes
 */
@Beta
public Long getSize(String folder) {
    DataLakeDirectoryClient directoryClient = container.getDirectoryClient(folder);
    if (directoryClient.exists()) {
        // listPaths(recursive, userPrincipleNameReturned, maxResults, timeout)
        return directoryClient.listPaths(true, false, null, null)
                .stream()
                .filter(x -> !x.isDirectory())          // skip the sub-folder entries themselves
                .mapToLong(PathItem::getContentLength)  // file size in bytes
                .sum();
    }
    throw new RuntimeException("Not a valid folder: " + folder);
}
This recursively iterates through the folder and sums the sizes of the files it contains.
The default page size is 5000 records, so if there are 12000 records (folders + files combined) it would take 3 API calls to fetch the details. From the docs:
recursive – Specifies if the call should recursively include all paths.
userPrincipleNameReturned – If "true", the user identity values returned in the x-ms-owner, x-ms-group, and x-ms-acl response headers will be transformed from Azure Active Directory Object IDs to User Principal Names. If "false", the values will be returned as Azure Active Directory Object IDs. The default value is false. Note that group and application Object IDs are not translated because they do not have unique friendly names.
maxResults – Specifies the maximum number of blobs to return per page, including all BlobPrefix elements. If the request does not specify maxResults or specifies a value greater than 5,000, the server will return up to 5,000 items per page. If iterating by page, the page size passed to byPage methods such as PagedIterable.iterableByPage(int) will be preferred over this value.
timeout – An optional timeout value beyond which a RuntimeException will be raised.
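If you want to control the page size explicitly, or count how many calls are made to the service, something along these lines should work. This is only a sketch reusing the directoryClient from the method above; the page size of 1000 and the pageCount variable are illustrative, and PagedResponse comes from com.azure.core.http.rest.
// Iterate page by page; each PagedResponse corresponds to one "list paths" call to the service.
long totalBytes = 0;
int pageCount = 0;
for (PagedResponse<PathItem> page
        : directoryClient.listPaths(true, false, null, null).iterableByPage(1000)) {
    pageCount++;
    for (PathItem item : page.getValue()) {
        if (!item.isDirectory()) {
            totalBytes += item.getContentLength();
        }
    }
}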

Related

How to use key marker in ListObjectVersions, AWS S3 Java SDK

I am working on a task to delete all versions of a PDF document given its document key (unique for each PDF), using the AWS Java SDK.
Other developers have integrated the download code like below:
final GetObjectRequest request = GetObjectRequest.builder().bucket(bucketName).key(documentKey).versionId(version).build();
return client.getObject(request);
After searching a bit, I found this code to delete a single version:
DeleteObjectRequest request = DeleteObjectRequest.builder()
        .bucket(bucketName)
        .key(documentKey)
        .versionId(version)
        .build();
DeleteObjectResponse resp = client.deleteObject(request);
Main question: How do I get all versions of a single documentKey?
I found ListObjectVersions at the URL below, but it accepts a key-marker rather than the actual key:
key-marker – Specifies the key to start with when listing objects in a bucket.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html
I am just worried that if I don't use this properly I might end up deleting something else in prod.
Edit: All the PDFs are stored at the root level of the S3 bucket.
The documentation states:
key-marker
Specifies the key to start with when listing objects in a bucket.
This means that by specifying key-marker, you're just starting the listing at the specified key; ListObjectVersions can and will continue listing objects past that key.
Further in the documentation:
prefix
Use this parameter to select only those keys that begin with the specified prefix.
In other words, if you pass a prefix of only the object name, it will return the versions for that object, along with all objects that start with that prefix.
So, you can specify either the prefix of the target object key or the key marker, but you will still need to filter in either case to ensure you don't include other objects.
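For example, with the AWS SDK for Java v2 that the question's snippets appear to use, a sketch like the following lists the versions using the key as a prefix and keeps only exact matches before deleting. Variable names mirror the question; pagination via isTruncated()/nextKeyMarker() is omitted for brevity.
ListObjectVersionsResponse response = client.listObjectVersions(
        ListObjectVersionsRequest.builder()
                .bucket(bucketName)
                .prefix(documentKey)   // narrows the listing, but may still match other keys
                .build());

for (ObjectVersion version : response.versions()) {
    // Exact-match filter so objects that merely share the prefix are never touched.
    if (version.key().equals(documentKey)) {
        client.deleteObject(DeleteObjectRequest.builder()
                .bucket(bucketName)
                .key(documentKey)
                .versionId(version.versionId())
                .build());
    }
}
Note that delete markers are returned separately (response.deleteMarkers()) and would need the same treatment if you want to remove those as well.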

Metadata, content-length for GCS objects

Two things:
I am trying to set custom metadata on a GCS object signed URL.
I am trying to set a maximum file size on a GCS object signed URL.
Using the following code:
Map<String, String> headers = new HashMap<>();
headers.put("x-goog-meta-" + usernameKey, username);
if (StringUtils.hasText(purpose)) {
    headers.put("x-goog-meta-" + purposeKey, purpose);
}
if (maxFileSizeMb != null) {
    headers.put("x-goog-content-length-range", String.format("0,%d", maxFileSizeMb * 1048576));
}
List<Storage.SignUrlOption> options = new ArrayList<>();
options.add(Storage.SignUrlOption.httpMethod(HttpMethod.POST));
options.add(Storage.SignUrlOption.withExtHeaders(headers));
String documentId = documentIdGenerator.generateDocumentId().getFormatted();
StorageDocument storageDocument =
        StorageDocument.builder().id(documentId).path(getPathByDocumentId(documentId)).build();
storageDocument.setFormattedName(documentId);
SignedUrlData.SignedUrlDataBuilder builder =
        SignedUrlData.builder()
                .signedUrl(storageInterface.signUrl(gcpStorageBucket, storageDocument, options))
                .documentId(documentId)
                .additionalHeaders(headers);
First of all, the generated signed URL works and I can upload a document.
I am now expecting to see the object metadata in the console view, but no metadata is set. The content-length-range is not respected either: I can upload a 1.3 MB file when the content-length-range is set to 0,1.
Something also happens when I upload a bigger file (~5 MB) that is within the content-length-range: I receive the error message "Metadata part is too large."
As you can see here, content-length-range requires both a minimum and a maximum size. The unit used for the range is bytes, as you can see in this example.
I also noticed that you used x-goog-content-length-range. I found this documentation for it; when using this header, take into account:
Use a PUT request, otherwise the header will be silently ignored.
If the size of the request's content is outside the specified range, the request fails and a 400 Bad Request code is returned in the response.
You have to set the minimum and maximum size in bytes.
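As a minimal sketch of that adjustment with the google-cloud-storage Java client: sign the URL for PUT instead of POST and make sure the uploader sends the exact same header. The bucketName and objectName variables are placeholders, and the 1 MiB limit is illustrative.
Map<String, String> extHeaders = new HashMap<>();
extHeaders.put("x-goog-content-length-range", "0,1048576"); // min,max in bytes (here 0 to 1 MiB)

URL signedUrl = storage.signUrl(
        BlobInfo.newBuilder(bucketName, objectName).build(),
        15, TimeUnit.MINUTES,
        Storage.SignUrlOption.httpMethod(HttpMethod.PUT),   // PUT, not POST
        Storage.SignUrlOption.withExtHeaders(extHeaders),
        Storage.SignUrlOption.withV4Signature());

// The client performing the upload must send the same x-goog-content-length-range
// header with its PUT request, otherwise the signature check fails.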

List files changed after a particular timestamp in Google Cloud Storage Bucket

I want to list the files that have changed/added in a Google Cloud Storage Bucket after a particular timestamp in node.js. I was going through documentation https://cloud.google.com/storage/docs/how-to but did not find any mechanism for it.
const {Storage} = require('@google-cloud/storage');
const storage = new Storage();
const bucketName = 'my-bucket';
const [files] = await storage.bucket(bucketName).getFiles();
How can I list the files in a bucket that were added after a given timestamp?
This doesn't appear to be possible with the given list API. The documentation doesn't say anything about filtering objects with a date.
It's common to store data about all uploaded files in a database, which is easier to query for whatever properties of the file you would like to store. You can even use a Cloud Functions trigger to automatically write a record into the database for every file upload.
There is no direct function to do so, as Doug Stevenson pointed out. However, you can find out when a file was last modified by looking at its metadata. For example, see this code snippet:
const {Storage} = require('@google-cloud/storage');
const storage = new Storage();
const bucketName = 'your-bucket-name';

storage.bucket(bucketName).getFiles(function(err, files) {
  if (!err) {
    // files is an array of File objects.
    files.forEach(function (file) {
      file.getMetadata(function (err, metadata) {
        // TODO: Keep only the files whose metadata.updated date is after a certain threshold
        console.log("File named " + metadata.name +
            " last updated on: " + metadata.updated);
      });
    });
  }
});
Then it's up to you to add a condition inside the getMetadata callback to only list/keep files whose metadata.updated date is after a certain threshold.
There's no direct API support for this. However, if you're going to need to do this query frequently, you could manually construct a workflow to keep an index. It would consist of a small application storing an index subscribed to notifications about changes to the bucket, and it'd have an API method that would retrieve objects sorted by date.
A Python version with prefix support:
import argparse
from google.cloud import storage

def list_blobs_with_prefix(bucket_name, prefix, tsAfter):
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=None)
    for blob in blobs:
        if blob.updated.timestamp() > int(tsAfter):
            print(blob.name, blob.updated)

def main(bucket, prefix, tsAfter):
    list_blobs_with_prefix(bucket, prefix, tsAfter)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('bucket', help='Your Cloud Storage bucket.')
    parser.add_argument('prefix', help='Prefix to match.')
    parser.add_argument('tsAfter', help='Timestamp after which you want to list the files.')
    args = parser.parse_args()
    main(args.bucket, args.prefix, args.tsAfter)

Listing files in a specific "folder" of an AWS S3 bucket

I need to list all files contained in a certain folder contained in my S3 bucket.
The folder structure is the following
/my-bucket/users/<user-id>/contacts/<contact-id>
I have files related to users and files related to a certain user's contact.
I need to list both.
To list files I'm using this code:
ListObjectsRequest listObjectsRequest = new ListObjectsRequest().withBucketName("my-bucket")
.withPrefix("some-prefix").withDelimiter("/");
ObjectListing objects = transferManager.getAmazonS3Client().listObjects(listObjectsRequest);
To list a certain user's files I'm using this prefix:
users/<user-id>/
and I'm correctly getting all files in the directory excluding contacts subdirectory, for example:
users/<user-id>/file1.txt
users/<user-id>/file2.txt
users/<user-id>/file3.txt
To list a certain user contact's files instead I'm using this prefix:
users/<user-id>/contacts/<contact-id>/
but in this case I'm also getting the directory itself as a returned object:
users/<user-id>/contacts/<contact-id>/file1.txt
users/<user-id>/contacts/<contact-id>/file2.txt
users/<user-id>/contacts/<contact-id>/
Why am I getting this behaviour? What's different between the two listing requests? I need to list only the files in the directory, excluding sub-directories.
While everybody says that there are no directories and files in S3, only objects (and buckets), which is absolutely true, I would suggest taking advantage of CommonPrefixes, described in this answer.
So, you can do the following to get the list of "folders" (commonPrefixes) and "files" (objectSummaries):
ListObjectsV2Request req = new ListObjectsV2Request()
        .withBucketName(bucket.getName())
        .withPrefix(prefix)
        .withDelimiter(DELIMITER);
ListObjectsV2Result listing = s3Client.listObjectsV2(req);
for (String commonPrefix : listing.getCommonPrefixes()) {
    System.out.println(commonPrefix);
}
for (S3ObjectSummary summary : listing.getObjectSummaries()) {
    System.out.println(summary.getKey());
}
In your case, given the correct prefix, it should return the following for objectSummaries (files):
users/user-id/contacts/contact-id/file1.txt
users/user-id/contacts/contact-id/file2.txt
for commonPrefixes:
users/user-id/contacts/contact-id/
Reference: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
Everything in S3 is an object. To you, they may be files and folders, but to S3 they're just objects.
Objects whose keys end with the delimiter (/ in most cases) are usually perceived as folders, but that's not always the case. It depends on the application. Again, in your case you're interpreting it as a folder; S3 is not. It's just another object.
In your case above, the object users/<user-id>/contacts/<contact-id>/ exists in S3 as a distinct object, but the object users/<user-id>/ does not. That's the difference in your responses. Why they're like that, we cannot tell you, but someone created the object in one case and didn't in the other. You don't see it in the AWS Management Console because the console interprets it as a folder and hides it from you.
Since S3 just sees these things as objects, it won't "exclude" certain things for you. It's up to the client to deal with the objects as they should be dealt with.
Your Solution
Since you're the one who doesn't want the folder objects, you can exclude them yourself by checking whether the key's last character is a /. If it is, ignore that object in the response.
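A minimal sketch of that client-side filter, applied to the ObjectListing (objects) from the code in the question:
// Keep only "real" files: drop any object whose key ends with the delimiter.
List<S3ObjectSummary> filesOnly = objects.getObjectSummaries().stream()
        .filter(summary -> !summary.getKey().endsWith("/"))
        .collect(Collectors.toList());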
If your goal is only to get the files and not the folders, the approach I took was to use the file size as a filter. This property is the current size of the object stored in AWS, and all the folder placeholders return 0 for it.
The following is C# code using LINQ, but it shouldn't be hard to translate to Java (a rough Java equivalent follows the snippet).
var amazonClient = new AmazonS3Client(key, secretKey, region);
var listObjectsRequest = new ListObjectsRequest
{
    BucketName = "someBucketName",
    Delimiter = "someDelimiter",
    Prefix = "somePrefix"
};
var objects = amazonClient.ListObjects(listObjectsRequest);
var objectsInFolder = objects.S3Objects.Where(file => file.Size > 0).ToList();
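A rough Java equivalent of the same idea, using the AWS SDK for Java v1 as elsewhere in this thread; the region, bucket, delimiter, and prefix values are placeholders:
AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName("someBucketName")
        .withDelimiter("someDelimiter")
        .withPrefix("somePrefix");
// Folder placeholder objects report a size of 0, so keep only objects with a positive size.
List<S3ObjectSummary> objectsInFolder = s3.listObjects(listObjectsRequest).getObjectSummaries().stream()
        .filter(summary -> summary.getSize() > 0)
        .collect(Collectors.toList());
Note that this also drops genuinely empty files, since they too have a size of 0.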
You can check the content type: S3 has a special application/x-directory type.
bucket.objects({:delimiter=>"/", :prefix=>"f1/"}).each { |obj| p obj.object.content_type }
As others have already said, everything in S3 is an object. To you, they may be files and folders, but to S3 they're just objects.
If you don't need the objects that end with a '/', you can safely delete them, e.g. via the REST API or the AWS Java SDK (I assume you have write access). You will not lose "nested files" (there are no files, so you will not lose objects whose names are prefixed with the key you delete).
AmazonS3 amazonS3 = AmazonS3ClientBuilder.standard()
        .withCredentials(new ProfileCredentialsProvider())
        .withRegion("region")
        .build();
amazonS3.deleteObject(new DeleteObjectRequest("my-bucket", "users/<user-id>/contacts/<contact-id>/"));
Please note that I'm using ProfileCredentialsProvider so that my requests are not anonymous; otherwise, you will not be able to delete an object. I have my AWS keys stored in the ~/.aws/credentials file.
S3 does not have directories. While you can list files in a pseudo-directory manner as you demonstrated, there is no directory "file" per se.
You may have inadvertently created a data object called users/<user-id>/contacts/<contact-id>/.
Based on @davioooh's answer, this code worked for me:
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName("your-bucket")
        .withPrefix("your/folder/path/")
        .withDelimiter("/");

Document feed limit

Is there a limit to the number of entries returned in a DocumentListFeed? I'm getting 100 results, and some of the collections in my account are missing.
How can I make sure I get all of the collections in my account?
DocsService service = new DocsService(APP_NAME);
service.setHeader("Authorization", "Bearer " + accessToken);
URL feedUrl = new URL("https://docs.google.com/feeds/default/private/full/-/folder?v=3&showfolders=true&showroot=true");
DocumentListFeed feed = service.getFeed(feedUrl, DocumentListFeed.class);
List<DocumentListEntry> entries = feed.getEntries();
The size of entries is 100.
A single request to the Documents List feed returns 100 elements by default, but you can configure that value by setting the ?max-results query parameter.
Regardless, in order to retrieve all documents and files you should always be prepared to send multiple requests, one per page, as explained in the documentation:
https://developers.google.com/google-apps/documents-list/#getting_all_pages_of_documents_and_files
Please also note that it is now recommended to switch to the newer Google Drive API, which interacts with the same resources and has complete documentation and sample code in multiple languages, including Java:
https://developers.google.com/drive/
You can call
feed.getNextLink().getHref()
to get a URL that you can use to fetch another feed. This can be done until the link is null, at which point all the entries have been fetched.
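A rough sketch of that pagination loop, assuming the same DocsService setup as in the question:
List<DocumentListEntry> allEntries = new ArrayList<>();
URL feedUrl = new URL("https://docs.google.com/feeds/default/private/full/-/folder?v=3&showfolders=true&showroot=true");
DocumentListFeed feed = service.getFeed(feedUrl, DocumentListFeed.class);
while (feed != null) {
    allEntries.addAll(feed.getEntries());
    // Follow the "next" link until no further pages are available.
    feed = (feed.getNextLink() == null)
            ? null
            : service.getFeed(new URL(feed.getNextLink().getHref()), DocumentListFeed.class);
}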
