I am trying to implement a relatively simple ETL pipeline that iterates through files in a Google Cloud Storage bucket. The bucket has two folders: /input and /output.
What I'm trying to do is write a Java/Scala script that iterates through the files in /input and applies the transformation to those that are either not present in /output or have a later timestamp than their counterpart in /output. I've been looking through the Java API docs for a function I can leverage (as opposed to just calling gsutil ls ...), but haven't had any luck so far. Any recommendations on where to look in the docs?
Edit: There is a better way to do this than using data transfer objects:
public Page<Blob> listBlobs() {
  // [START listBlobs]
  Page<Blob> blobs = bucket.list();
  for (Blob blob : blobs.iterateAll()) {
    // do something with the blob
  }
  // [END listBlobs]
  return blobs;
}
Old method:
def getBucketFolderContents(bucketName: String) = {
  val credential = getCredential
  val httpTransport = GoogleNetHttpTransport.newTrustedTransport()
  val requestFactory = httpTransport.createRequestFactory(credential)
  // List the objects whose names start with "raw/" (objects.list with a prefix)
  val uri = "https://www.googleapis.com/storage/v1/b/" +
    URLEncoder.encode(bucketName, "UTF-8") +
    "/o?prefix=raw%2F"
  val url = new GenericUrl(uri)
  // buildGetRequest expects the GenericUrl, not the raw String
  val request = requestFactory.buildGetRequest(url)
  val response = request.execute()
  response
}
You can list objects under a folder by setting the prefix string on the object listing API: https://cloud.google.com/storage/docs/json_api/v1/objects/list
The results of listing are sorted, so you should be able to list both folders and then walk through both in order and generate the diff list.
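The walk described above can be sketched in plain Java, independent of any storage client. This is only a sketch under some assumptions: the `Entry` class is a hypothetical holder for an object name (with the `input/` or `output/` prefix already stripped) and its update time, and both lists are assumed to be sorted in the same order that `String.compareTo` uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StaleInputs {

    /** Hypothetical holder for one listing entry: name relative to its folder, plus update time. */
    static final class Entry {
        final String name;
        final long updatedMillis;

        Entry(String name, long updatedMillis) {
            this.name = name;
            this.updatedMillis = updatedMillis;
        }
    }

    /**
     * Walks two name-sorted listings in lockstep and returns the input names that
     * have no counterpart in the output, or that are newer than their counterpart.
     */
    static List<String> needsProcessing(List<Entry> input, List<Entry> output) {
        List<String> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < input.size()) {
            Entry in = input.get(i);
            if (j >= output.size()) {          // output exhausted: everything left is new
                result.add(in.name);
                i++;
                continue;
            }
            Entry out = output.get(j);
            int cmp = in.name.compareTo(out.name);
            if (cmp < 0) {                     // input has no output yet
                result.add(in.name);
                i++;
            } else if (cmp > 0) {              // output without a matching input: skip it
                j++;
            } else {                           // same name: compare timestamps
                if (in.updatedMillis > out.updatedMillis) {
                    result.add(in.name);
                }
                i++;
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> input = Arrays.asList(
                new Entry("a.csv", 100), new Entry("b.csv", 300), new Entry("c.csv", 50));
        List<Entry> output = Arrays.asList(
                new Entry("a.csv", 200), new Entry("b.csv", 200));
        System.out.println(needsProcessing(input, output)); // prints [b.csv, c.csv]
    }
}
```

Because both listings are consumed once, the diff costs a single pass over each folder rather than one lookup per input file.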
I am using the akka-persistence-dynamodb plugin (https://github.com/akka/akka-persistence-dynamodb), which, unlike the Cassandra plugin, doesn't have a read journal API (Akka Persistence Query).
I can write journal data to DynamoDB, where the event column holds a serialized Java object as a string. My next task is to build the read side of CQRS, using AWS Lambda or the AWS Java API to read from DynamoDB, which means converting the event data to a human-readable format.
Event data:
rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQrGoMniq0AywIAAUwAC2JhbmtBY2NvdW50dAA6TGNvbS9jYXBvbmUvYmFuay9hY3RvcnMvUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW50O3hwc3IAOGNvbS5jYXBvbmUuYmFuay5hY3RvcnMuUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW5011CikshX3ysCAAREAAdiYWxhbmNlTAAIY3VycmVuY3l0ABJMamF2YS9sYW5nL1N0cmluZztMAAJpZHEAfgAETAAEdXNlcnEAfgAEeHBAj0AAAAAAAHQAA0VVUnQAJDM5M2M2NmRiLTJhYmItNDEwNS04NWUyLWMwZjc3MzExMDNlM3QAB3JjYXJkaW4=
I want to know how to convert the above Java object string to a human-readable format. I tried using Java's ObjectInputStream, but I think I am doing something wrong.
Scala example:
val eventData:String = "rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQrGoMniq0AywIAAUwAC2JhbmtBY2NvdW50dAA6TGNvbS9jYXBvbmUvYmFuay9hY3RvcnMvUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW50O3hwc3IAOGNvbS5jYXBvbmUuYmFuay5hY3RvcnMuUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW5011CikshX3ysCAAREAAdiYWxhbmNlTAAIY3VycmVuY3l0ABJMamF2YS9sYW5nL1N0cmluZztMAAJpZHEAfgAETAAEdXNlcnEAfgAEeHBAj0AAAAAAAHQAA0VVUnQAJDM5M2M2NmRiLTJhYmItNDEwNS04NWUyLWMwZjc3MzExMDNlM3QAB3JjYXJkaW4="
??? (and then what? How do I convert the string value above to a human-readable format?)
Thanks
Sri
OK, I was able to deserialize the object string and convert it to JSON. Below is an example:
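object DeserializeData extends App {
  import java.io.ByteArrayInputStream
  import java.io.ObjectInputStream
  import java.util.Base64
  import com.google.gson.Gson

  // Full event string from the question
  val base64encodedString = "rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQrGoMniq0AywIAAUwAC2JhbmtBY2NvdW50dAA6TGNvbS9jYXBvbmUvYmFuay9hY3RvcnMvUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW50O3hwc3IAOGNvbS5jYXBvbmUuYmFuay5hY3RvcnMuUGVyc2lzdGVudEJhbmtBY2NvdW50JEJhbmtBY2NvdW5011CikshX3ysCAAREAAdiYWxhbmNlTAAIY3VycmVuY3l0ABJMamF2YS9sYW5nL1N0cmluZztMAAJpZHEAfgAETAAEdXNlcnEAfgAEeHBAj0AAAAAAAHQAA0VVUnQAJDM5M2M2NmRiLTJhYmItNDEwNS04NWUyLWMwZjc3MzExMDNlM3QAB3JjYXJkaW4="
  println("Base64 encoded string: " + base64encodedString)

  // Decode the Base64 text back into the serialized bytes
  val base64decodedBytes = Base64.getDecoder.decode(base64encodedString)
  val in = new ByteArrayInputStream(base64decodedBytes)
  val obin = new ObjectInputStream(in)

  // readObject only works if the event class is on the classpath
  val `object` = obin.readObject
  println("Deserialised data: \n" + `object`.toString)

  // You could also try...
  println("Object class is " + `object`.getClass.toString)

  // Render the deserialized event as JSON
  val json = new Gson()
  val resp = json.toJson(`object`)
  println(resp)
}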
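If you are unsure which class a stored event needs (for example, when readObject fails with ClassNotFoundException), you can peek at the stream header without loading any classes. The following is a minimal sketch, assuming the payload is a standard Java serialization stream that starts with an ordinary object; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectStreamConstants;
import java.util.Base64;

final class SerializedHeaderPeek {

    /** Returns the class name recorded at the start of a Base64-encoded Java serialization stream. */
    static String topLevelClassName(String base64) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            // Every stream starts with the magic number 0xACED and the version 5
            if (in.readShort() != ObjectStreamConstants.STREAM_MAGIC
                    || in.readShort() != ObjectStreamConstants.STREAM_VERSION) {
                throw new IllegalArgumentException("not a Java serialization stream");
            }
            // An ordinary object is introduced by TC_OBJECT followed by TC_CLASSDESC
            if (in.readByte() != ObjectStreamConstants.TC_OBJECT
                    || in.readByte() != ObjectStreamConstants.TC_CLASSDESC) {
                throw new IllegalArgumentException("stream does not start with an ordinary object");
            }
            // The class name follows as a length-prefixed (modified UTF-8) string
            return in.readUTF();
        } catch (IOException e) {
            throw new IllegalArgumentException("truncated or malformed stream", e);
        }
    }

    public static void main(String[] args) {
        // Header portion of the event string from the question, re-padded to valid Base64
        String header = "rO0ABXNyAD9jb20uY2Fwb25lLmJhbmsuYWN0b3JzLlBlcnNpc3RlbnRCYW5rQWNjb3VudCRCYW5rQWNjb3VudENyZWF0ZWQ=";
        System.out.println(topLevelClassName(header));
    }
}
```

For the event in the question this reports com.capone.bank.actors.PersistentBankAccount$BankAccountCreated, which is the class that has to be available before the full deserialization above can succeed.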
A read journal for DynamoDB has since been implemented in the plugin, so there is no need for this kind of clunky code any more: https://github.com/akka/akka-persistence-dynamodb/pull/114/files. Thank you, Lightbend.
As I am using v3 of the Google Drive API, I have to use FileList instead of the parents and children lists. Now I want to list the files inside a specific folder.
Can someone suggest what to do?
Here is the code I am using to search for the file:
private String searchFile(String mimeType, String fileName) throws IOException {
    Drive driveService = getDriveService();
    String fileId = null;
    String pageToken = null;
    do {
        FileList result = driveService.files().list()
            .setQ(mimeType)
            .setSpaces("drive")
            .setFields("nextPageToken, files(id, name)")
            .setPageToken(pageToken)
            .execute();
        for (File f : result.getFiles()) {
            System.out.printf("Found file: %s (%s)\n",
                f.getName(), f.getId());
            if (f.getName().equals(fileName)) {
                //fileFlag++;
                fileId = f.getId();
            }
        }
        pageToken = result.getNextPageToken();
    } while (pageToken != null);
    return fileId;
}
But this method gives me all the files that exist, which I don't want. I want to create a FileList that contains only the files inside a specific folder.
It is now possible to do this with the in parents term of the q parameter in files.list. For example, to find all spreadsheets in a folder with id folder_id, you can use the following q parameter (I am using Python in my example):
q="mimeType='application/vnd.google-apps.spreadsheet' and '{}' in parents".format(folder_id)
Remember that you first need to find the id of the folder whose files you are looking for. You can do this with the same files.list method.
More information on the files.list method is in the Drive API reference, and the other terms you can use in the q parameter are described in the Search for Files guide.
To search in a specific directory you have to specify the following:
q : name = '2021' and mimeType = 'application/vnd.google-apps.folder' and '1fJ9TFZOe8G9PUMfC2Ts06sRnEPJQo7zG' in parents
This example searches for a folder called "2021" inside the folder with id 1fJ9TFZOe8G9PUMfC2Ts06sRnEPJQo7zG.
In my case, I'm writing the code in C++, and the request URL would be:
string url = "https://www.googleapis.com/drive/v3/files?q=name+%3d+%272021%27+and+mimeType+%3d+%27application/vnd.google-apps.folder%27+and+trashed+%3d+false+and+%271fJ9TFZOe8G9PUMfC2Ts06sRnEPJQo7zG%27+in+parents";
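Since the original question was in Java, here is a rough sketch of building the same kind of q expression and request URL there, using only the standard library. The class and method names are invented for this example; the only Drive-specific parts are the query terms shown above and the rule that single quotes inside a value are escaped as \'.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class DriveFolderQuery {

    /** Wraps a value in single quotes, escaping embedded single quotes as \' for the q syntax. */
    static String quote(String value) {
        return "'" + value.replace("'", "\\'") + "'";
    }

    /** Builds a q expression for non-trashed files of one MIME type directly inside a folder. */
    static String filesInFolder(String folderId, String mimeType) {
        return "mimeType = " + quote(mimeType)
                + " and " + quote(folderId) + " in parents"
                + " and trashed = false";
    }

    /** Appends the URL-encoded q expression to the Drive v3 files endpoint. */
    static String listUrl(String q) {
        try {
            return "https://www.googleapis.com/drive/v3/files?q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // "abc123" is a placeholder folder id, not a real one
        String q = filesInFolder("abc123", "application/vnd.google-apps.spreadsheet");
        System.out.println(q);
        System.out.println(listUrl(q));
    }
}
```

With the Java client library, the unencoded string returned by filesInFolder is what you would pass to setQ; the encoded URL is only needed when you issue the HTTP request yourself.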
Searching for files by folder name is not yet supported. It has been requested in this Google forum, but so far nothing has come of it. However, you can try the other search filters described in Search for Files.
Be creative. For example, make sure the files within a certain folder contain a unique keyword, which you can then query using
fullText contains 'my_unique_keyword'
You can use this approach to search for files in Google Drive:
Files.List request = this.driveService.files().list();
int noOfRecords = 100;
request.setPageSize(noOfRecords);
// nextPageToken is null for the first page, then the token from the previous response
request.setPageToken(nextPageToken);
String searchQuery = "(name contains 'Hello')";
if (StringUtils.isNotBlank(searchQuery)) {
    request.setQ(searchQuery);
}
FileList result = request.execute();
I am learning Amazon CloudSearch, but I couldn't find any sample code in either C# or Java (I am working in C#, but if I can get code in Java I can try converting it to C#).
This is the only code I found in C#: https://github.com/Sitefinity-SDK/amazon-cloud-search-sample/tree/master/SitefinityWebApp.
This is one method I found in that code:
public IResultSet Search(ISearchQuery query)
{
    AmazonCloudSearchDomainConfig config = new AmazonCloudSearchDomainConfig();
    config.ServiceURL = "http://search-index2-cdduimbipgk3rpnfgny6posyzy.eu-west-1.cloudsearch.amazonaws.com/";
    // Credentials redacted; supply your own access key and secret key
    AmazonCloudSearchDomainClient domainClient = new AmazonCloudSearchDomainClient("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY", config);
    SearchRequest searchRequest = new SearchRequest();
    List<string> suggestions = new List<string>();
    StringBuilder highlights = new StringBuilder();
    highlights.Append("{\'");
    if (query == null)
        throw new ArgumentNullException("query");
    foreach (var field in query.HighlightedFields)
    {
        if (highlights.Length > 2)
        {
            highlights.Append(", \'");
        }
        highlights.Append(field.ToUpperInvariant());
        highlights.Append("\':{} ");
        SuggestRequest suggestRequest = new SuggestRequest();
        Suggester suggester = new Suggester();
        suggester.SuggesterName = field.ToUpperInvariant() + "_suggester";
        suggestRequest.Suggester = suggester.SuggesterName;
        suggestRequest.Size = query.Take;
        suggestRequest.Query = query.Text;
        SuggestResponse suggestion = domainClient.Suggest(suggestRequest);
        foreach (var suggest in suggestion.Suggest.Suggestions)
        {
            suggestions.Add(suggest.Suggestion);
        }
    }
    highlights.Append("}");
    if (query.Filter != null)
    {
        searchRequest.FilterQuery = this.BuildQueryFilter(query.Filter);
    }
    if (query.OrderBy != null)
    {
        searchRequest.Sort = string.Join(",", query.OrderBy);
    }
    if (query.Take > 0)
    {
        searchRequest.Size = query.Take;
    }
    if (query.Skip > 0)
    {
        searchRequest.Start = query.Skip;
    }
    searchRequest.Highlight = highlights.ToString();
    searchRequest.Query = query.Text;
    searchRequest.QueryParser = QueryParser.Simple;
    var result = domainClient.Search(searchRequest).SearchResult;
    return new AmazonResultSet(result, suggestions);
}
I have already created a domain in Amazon CloudSearch using the AWS console and uploaded documents using Amazon's predefined configuration option, the IMDb movie JSON file that Amazon provides for demos.
But I don't understand how to use this method. For example, if I want to search by director name, how do I pass that in, given that the method's parameter is of type ISearchQuery?
I'd suggest using the official AWS CloudSearch .NET SDK. The library you were looking at seems fine (although I haven't looked at it in any detail), but the official version is more likely to expose new CloudSearch features as soon as they're released, will be supported if you need to talk to AWS support, and so on.
Specifically, take a look at the SearchRequest class: all its params are strings, so I think that obviates your question about ISearchQuery.
I wasn't able to find an example of a query in .NET, but there is an example of someone uploading docs using the AWS .NET SDK. It's essentially the same procedure as querying: creating and configuring a Request object and passing it to the client.
EDIT:
Since you're still having a hard time, here's an example. Bear in mind that I am unfamiliar with C# and have not attempted to run or even compile this but I think it should at least be close to working. It's based off looking at the docs at http://docs.aws.amazon.com/sdkfornet/v3/apidocs/
// Configure the Client that you'll use to make search requests
string queryUrl = @"http://search-<domainname>-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com";
AmazonCloudSearchDomainClient searchClient = new AmazonCloudSearchDomainClient(queryUrl);
// Configure a search request with your query
SearchRequest searchRequest = new SearchRequest();
searchRequest.Query = "potato";
// TODO Set your other params like parser, suggester, etc
// Submit your request via the client and get back a response containing search results
SearchResponse searchResponse = searchClient.Search(searchRequest);
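For the director-name case from the question, the query string itself can express the field restriction. Below is a hedged Java sketch (the asker mentioned Java would also do) that builds a structured-parser term query and the corresponding search URL. It assumes the director field in the sample IMDb data is named directors and that you talk to the 2013-01-01 search endpoint directly; the class and method names are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class CloudSearchQuery {

    /** Builds a structured-parser term query such as (term field=directors 'Nolan'). */
    static String termQuery(String field, String value) {
        // Backslashes and single quotes inside the string literal are escaped with a backslash
        String escaped = value.replace("\\", "\\\\").replace("'", "\\'");
        return "(term field=" + field + " '" + escaped + "')";
    }

    /** Builds the GET URL for a structured search; endpoint is assumed to have no trailing slash. */
    static String searchUrl(String endpoint, String structuredQuery, int size) {
        try {
            return endpoint + "/2013-01-01/search?q=" + URLEncoder.encode(structuredQuery, "UTF-8")
                    + "&q.parser=structured&size=" + size;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Placeholder endpoint in the same shape as the one used above
        String endpoint = "http://search-<domainname>-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com";
        String q = termQuery("directors", "Nolan");
        System.out.println(q);
        System.out.println(searchUrl(endpoint, q, 10));
    }
}
```

With the SDK, the same query string would presumably go into searchRequest.Query, with the query parser set to structured instead of simple.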
I am using AWS Java SDK to interact with S3. I want to iterate through all the objects in the storage and retrieve metadata of each object. I can iterate through the objects using lists as:
ObjectListing list= s3client.listObjects("bucket name");
But I am able to retrieve only summaries through the object in the list. Instead of summary I need metadata of each object, like the one provided by getObjectMetadata() method in S3Object class. How do I get that?
You can get four default metadata values from the S3ObjectSummary entries returned by listObjects: Last Modified, Storage Class, ETag and Size.
To get the full metadata of an object, you need to perform a HEAD request on the object, which is what getObjectMetadata does. It also accepts a request built with the following constructor:
GetObjectMetadataRequest(String bucketName, String key)
For example:
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
    .withBucketName(bucketName);
ObjectListing objectListing;
do {
    objectListing = s3client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        /** Default metadata **/
        Date dtLastModified = objectSummary.getLastModified();
        String sEtag = objectSummary.getETag();
        long lSize = objectSummary.getSize();
        String sStorageClass = objectSummary.getStorageClass();
        /** To get user-defined metadata (one extra request per object) **/
        ObjectMetadata objectMetadata = s3client.getObjectMetadata(bucketName, objectSummary.getKey());
        Map<String, String> userMetadataMap = objectMetadata.getUserMetadata();
        Map<String, Object> rawMetadataMap = objectMetadata.getRawMetadata();
    }
    // Continue from where the previous page ended
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
For more details on GetObjectMetadataRequest, see the AWS SDK for Java API reference.
According to the Google Documents List Data API there is an option to copy documents:
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#CopyingDocs
But when I look at the Java documentation of the API, this section is missing:
http://code.google.com/apis/documents/docs/3.0/developers_guide_java.html
Do you know whether there is a method for copying Google Docs documents in the Java GWT API, perhaps one that is simply not documented?
In the Python API, I do find a method for this:
http://code.google.com/apis/documents/docs/3.0/developers_guide_python.html#CopyingDocs
I have now managed to write my own copy request.
Replace t7Z3GLNuO641hOO737UH60Q with the key of the document you want to copy.
String title = "new File";
String userEmail = new CurrentUser().getUser().getEmail();
String body = "<?xml version='1.0' encoding='UTF-8'?>"
    + "<entry xmlns=\"http://www.w3.org/2005/Atom\">"
    + "<id>t7Z3GLNuO641hOO737UH60Q</id>"
    + "<title>" + title + "</title>"
    + "</entry>";
try {
    GDataRequest gdr = docsService.createRequest(Service.GDataRequest.RequestType.INSERT,
        new URL("https://docs.google.com/feeds/default/private/full/?xoauth_requestor_id=" + userEmail),
        ContentType.ATOM);
    gdr.setHeader("GData-Version", "3.0");
    OutputStream requestStream = gdr.getRequestStream();
    // Write the body in the encoding declared in the XML prolog
    requestStream.write(body.getBytes("UTF-8"));
    log.info(gdr.toString());
    gdr.execute();
}
[.. catch]
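One thing to watch with the hand-built body: a title containing &, < or > would produce invalid XML. A small sketch of building the same entry with the text escaped could look like this; the class and method names are made up for illustration, and the element layout simply mirrors the body above.

```java
final class CopyEntryXml {

    /** Escapes the characters that are not allowed verbatim in XML text content. */
    static String escapeXml(String text) {
        // '&' must be replaced first so the other entities are not double-escaped
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds the Atom entry for a copy request from a document key and the new title. */
    static String copyEntry(String documentKey, String title) {
        return "<?xml version='1.0' encoding='UTF-8'?>"
                + "<entry xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<id>" + escapeXml(documentKey) + "</id>"
                + "<title>" + escapeXml(title) + "</title>"
                + "</entry>";
    }

    public static void main(String[] args) {
        System.out.println(copyEntry("t7Z3GLNuO641hOO737UH60Q", "Q&A <draft>"));
    }
}
```

The returned string can be written to the request stream exactly like body in the snippet above.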