Java exception when creating multiple permissions on Google Drive documents - java

In our application we need to share multiple files with multiple users via the Google Drive API.
We use the batching provided by the Java client library of the Google Drive API.
This already runs in production, but we get a lot of unclear exceptions from the Google Drive API:
Internal Error. User message: "An internal error has occurred which prevented the sharing of these item(s): "
We handle the exceptions and retry with an exponential backoff, but these errors cause big delays in the flow and usability of this application.
Why do these exceptions occur, and how can we avoid them?
It would be very helpful if we knew what is going wrong when these exceptions occur, so we can avoid it.
Some extra information:
Each batch contains 100 permission operations on different files.
Every minute a batch operation is called.
The code:
String fileId = "1sTWaJ_j7PkjzaBWtNc3IzovK5hQf21FbOw9yLeeLPNQ";
JsonBatchCallback<Permission> callback = new JsonBatchCallback<Permission>()
{
@Override
public void onFailure(GoogleJsonError e, HttpHeaders responseHeaders)
throws IOException {
System.err.println(e.getMessage());
}
@Override
public void onSuccess(Permission permission, HttpHeaders responseHeaders) throws IOException {
System.out.println("Permission ID: " + permission.getId());
}
};
BatchRequest batch = driveService.batch();
for(String email : emails) {
Permission userPermission = new Permission().setType("user").setRole("reader").setEmailAddress(email);
driveService.permissions().create(fileId, userPermission).setSendNotificationEmail(false).setFields("id").queue(batch, callback);
}
batch.execute();
The variable emails contains 100 email strings.

{
"code" : 500,
"errors" : [ {
"domain" : "global",
"message" : "Internal Error. User message: "An internal error has occurred which prevented the sharing of these item(s): fileame"",
"reason" : "internalError"
} ],
"message" : "Internal Error. User message: "An internal error has occurred which prevented the sharing of these item(s): filename""
}
This is basically flood protection. The normal recommendation is to implement exponential backoff:
Exponential backoff is a standard error handling strategy for network
applications in which the client periodically retries a failed request
over an increasing amount of time. If a high volume of requests or
heavy network traffic causes the server to return errors, exponential
backoff may be a good strategy for handling those errors. Conversely,
it is not a relevant strategy for dealing with errors unrelated to
rate-limiting, network volume or response times, such as invalid
authorization credentials or file not found errors.
Used properly, exponential backoff increases the efficiency of
bandwidth usage, reduces the number of requests required to get a
successful response, and maximizes the throughput of requests in
concurrent environments.
Now here is where you are going to say, "But I am batching, I can't do that." Yup, batching falls under the same flood protection: your batch is flooding the server. Yes, I know it says you can send 100 requests, and you probably can if there is enough time between the requests not to qualify as flooding, but yours apparently isn't.
My recommendation is that you try cutting it down to, say, 10 requests per batch and slowly step it up. You're not saving yourself anything by using batching; the quota usage is the same as if you didn't batch, and you can't go any faster than the flood protection allows. A sketch of that chunked approach is below.
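Not an official fix, just a minimal sketch of the recommendation above. It assumes emails is a List<String> and reuses the driveService, fileId and callback from the question; the chunk size and pause are guesses to tune, not documented limits, and exception handling around Thread.sleep is omitted like in the original snippet.
final int CHUNK_SIZE = 10;       // start small and step up, as suggested above
final long PAUSE_MILLIS = 2000;  // breathing room between sub-batches (a guess, tune as needed)

for (int start = 0; start < emails.size(); start += CHUNK_SIZE) {
    List<String> chunk = emails.subList(start, Math.min(start + CHUNK_SIZE, emails.size()));
    BatchRequest batch = driveService.batch();
    for (String email : chunk) {
        Permission userPermission = new Permission()
                .setType("user")
                .setRole("reader")
                .setEmailAddress(email);
        driveService.permissions().create(fileId, userPermission)
                .setSendNotificationEmail(false)
                .setFields("id")
                .queue(batch, callback);
    }
    batch.execute();             // failures still arrive in the callback's onFailure
    Thread.sleep(PAUSE_MILLIS);  // throttle so consecutive sub-batches don't look like a flood
}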

Related

How do I retrieve/copy lots of AWS S3 files without getting sporadic "Server failed to send complete response" exceptions?

I am copying millions of S3 files during a data migration and want to perform a lot of parallel copying. I am using the Java SDK v2 API. I keep getting rare, sporadic exceptions when I try to copy a lot of S3 files at once.
I most commonly get the following exception:
Unable to execute HTTP request: Server failed to send complete response. The channel was closed. This may have been done by the client (e.g. because the request was aborted), by the service (e.g. because there was a handshake error, the request took too long, or the client tried to write on a read-only socket), or by an intermediary party (e.g. because the channel was idle for too long).
I also get:
Unable to execute HTTP request: The channel was closed. This may have been done by the client (e.g. because the request was aborted), by the service (e.g. because there was a handshake error, the request took too long, or the client tried to write on a read-only socket), or by an intermediary party (e.g. because the channel was idle for too long).
We encountered an internal error. Please try again. (Service: S3, Status Code: 500 ...
Unable to execute HTTP request: Channel was closed before it could be written to.
Here is sample code which seems to reliably trigger the problem for me. (The problem may depend on things like bad or busy S3 server nodes, network traffic, throttling, or race conditions, so it is not easy to reproduce with 100% reliability.)
NettyNioAsyncHttpClient.Builder asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
.maxConcurrency(300)
.maxPendingConnectionAcquires(500)
.connectionTimeout(Duration.ofSeconds(10))
.connectionAcquisitionTimeout(Duration.ofSeconds(60))
.readTimeout(Duration.ofSeconds(60));
S3AsyncClient s3Client = S3AsyncClient.builder()
.credentialsProvider(awsCredentialsProvider)
.region(Region.US_EAST_1)
// You get the same failures even with retries -- it just takes longer
.overrideConfiguration(config -> config.retryPolicy(RetryPolicy.none()).build())
.httpClientBuilder(asyncHttpClientBuilder)
.build();
List<CompletableFuture<GetObjectResponse>> futures = new ArrayList<>();
for (int i = 1; i <= 500; i++) {
String key = "zerotestfile";
Path outFile = Paths.get("/tmp/experiment/").resolve(key + "-" + i);
outFile.getParent().toFile().mkdirs();
if (outFile.toFile().exists()) {
outFile.toFile().delete();
}
log.info("Downloading: {} ({})", key, i);
GetObjectRequest request = GetObjectRequest.builder()
.bucket("my-test-bucket")
.key(key)
.build();
CompletableFuture<GetObjectResponse> future = s3Client.getObject(request, AsyncResponseTransformer.toFile(outFile))
.exceptionally(exception -> {
log.error("Error", exception);
return null;
});
futures.add(future);
}
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
500k file produced via: dd if=/dev/zero of=zerotestfile bs=1024 count=500
Using a retry condition (the default) actually appears to fix the example above. But in my actual migration, copying millions of files, I use a retry condition, which helps, yet I still encounter exactly the exceptions produced by the example.
Additional details: My actual migration logic uses cross-region CopyObject calls. In order to make the problem simpler, I switched the example to single-region GetObject requests. I can get it to produce similar errors to the above code but I have to perform 2500 copies with maxConcurrency 2000.
I simplified my S3 config and kept only what prevented the above example from dying. I fixed the following errors by making the appropriate config changes:
Error: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time.
Add: .connectionAcquisitionTimeout(Duration.ofSeconds(60))
Error: Unable to execute HTTP request: connection timed out
Add: .connectionTimeout(Duration.ofSeconds(10))
Error: ReadTimeoutException: null
Add: .readTimeout(Duration.ofSeconds(60))
Error: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate. Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
Add: .maxPendingConnectionAcquires(500)
Sources: My example is originally based on (though heavily modified from) a code snippet in a bug report in the AWS Java SDK which is apparently fixed: https://github.com/aws/aws-sdk-java-v2/issues/1122
Note that various other related problems (normally also with the AWS Java SDK v2) produce similar exceptions. I welcome any answers/comments on related problems here. If they are due to AWS SDK bugs, people open GitHub bug reports. See https://github.com/aws/aws-sdk-java-v2/issues
To make the retry-logic work properly, I had to add:
.retryCapacityCondition(null)
(see the method documentation which specifies that you should pass in "null" to disable it)
The default behaviour is to disable retries if too many errors are hit globally by the S3 client. The problem is that I am performing massive copying and regularly hit errors, and I still want to retry.
This solution seems almost obvious now, but it took me over two days to figure out, particularly because of how hard the errors are to reproduce reliably. 99.99% of the time it works, but my migration always failed on and skipped about a hundred files in a million. I made my own manual retry logic (because the S3 retry wasn't fixing the problem), which worked, but I searched a bit harder and found this better solution.
I found it helpful to use a custom retry-policy class which logs what it is doing, so I could see clearly that it was not working as I thought it should. Once I added this, I could see that in the problem cases it was not doing my 120 retries (once every 30 seconds) at all. That's when I found retryCapacityCondition. The custom logging retry policy:
.overrideConfiguration(config ->
config.retryPolicy(
RetryPolicy.builder()
.retryCondition(new AlwaysRetryCondition())
.retryCapacityCondition(null)
.build()
).build()
)
private static class AlwaysRetryCondition implements RetryCondition {
private final RetryCondition defaultRetryCondition;
public AlwaysRetryCondition() {
defaultRetryCondition = RetryCondition.defaultRetryCondition();
}
@Override
public boolean shouldRetry(RetryPolicyContext context) {
String exceptionMessage = context.exception().getMessage();
Throwable cause = context.exception().getCause();
log.debug(
"S3 retry: shouldRetry retryCount=" + context.retriesAttempted()
+ " defaultRetryCondition=" + defaultRetryCondition.shouldRetry(context)
+ " httpstatus=" + context.httpStatusCode()
+ " " + context.exception().getClass().getSimpleName()
+ (cause != null ? " cause=" + cause.getClass().getSimpleName() : "")
+ " message=" + exceptionMessage
);
return true;
}
@Override
public void requestWillNotBeRetried(RetryPolicyContext context) {
log.debug("S3 retry: requestWillNotBeRetried retryCount=" + context.retriesAttempted());
}
@Override
public void requestSucceeded(RetryPolicyContext context) {
if (context.retriesAttempted() > 0) {
log.debug("S3 retry: requestSucceeded retryCount=" + context.retriesAttempted());
}
}
}
For reference, this is the config I use:
NettyNioAsyncHttpClient.Builder asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
.maxConcurrency(500)
.maxPendingConnectionAcquires(10000)
.connectionMaxIdleTime(Duration.ofSeconds(600))
.connectionTimeout(Duration.ofSeconds(20))
.connectionAcquisitionTimeout(Duration.ofSeconds(60))
.readTimeout(Duration.ofSeconds(120));
// Add retry behaviour
final long CLIENT_TIMEOUT_MILLIS = 600000;
final int NUMBER_RETRIES = 60;
final long RETRY_BACKOFF_MILLIS = 30000;
ClientOverrideConfiguration overrideConfiguration = ClientOverrideConfiguration.builder()
.apiCallTimeout(Duration.ofMillis(CLIENT_TIMEOUT_MILLIS))
.apiCallAttemptTimeout(Duration.ofMillis(CLIENT_TIMEOUT_MILLIS))
.retryPolicy(RetryPolicy.builder()
.numRetries(NUMBER_RETRIES)
.backoffStrategy(
FixedDelayBackoffStrategy.create(Duration.of(RETRY_BACKOFF_MILLIS, ChronoUnit.MILLIS))
)
.throttlingBackoffStrategy(BackoffStrategy.none())
.retryCondition(new AlwaysRetryCondition())
// retryCapacityCondition(null) fixes the rare s3-copy-errors
// this global max-retries was kicking in and preventing individual copy-requests from retrying
.retryCapacityCondition(null)
.build()
).build();
S3AsyncClient s3Client = S3AsyncClient.builder()
.credentialsProvider(awsCredentialsProvider)
.region(Region.US_EAST_1)
.httpClientBuilder(asyncHttpClientBuilder)
.overrideConfiguration(overrideConfiguration)
.build();

Hyperledger Fabric - Java-SDK - Future completed exceptionally: sendTransaction

I'm running an HL Fabric private network and submitting transactions to the ledger from a Java Application using Fabric-Java-Sdk.
Occasionally, maybe 1 in 10,000 times, the Java application throws an exception when I submit the transaction to the ledger, like the message below:
ERROR 196664 --- [ Thread-4] org.hyperledger.fabric.sdk.Channel
: Future completed exceptionally: sendTransaction
java.lang.IllegalArgumentException: The proposal responses have 2
inconsistent groups with 0 that are invalid. Expected all to be
consistent and none to be invalid. at
org.hyperledger.fabric.sdk.Channel.doSendTransaction(Channel.java:5574)
~[fabric-sdk-java-2.1.1.jar:na] at
org.hyperledger.fabric.sdk.Channel.sendTransaction(Channel.java:5533)
~[fabric-sdk-java-2.1.1.jar:na] at
org.hyperledger.fabric.gateway.impl.TransactionImpl.commitTransaction(TransactionImpl.java:138)
~[fabric-gateway-java-2.1.1.jar:na] at
org.hyperledger.fabric.gateway.impl.TransactionImpl.submit(TransactionImpl.java:96)
~[fabric-gateway-java-2.1.1.jar:na] at
org.hyperledger.fabric.gateway.impl.ContractImpl.submitTransaction(ContractImpl.java:50)
~[fabric-gateway-java-2.1.1.jar:na] at
com.apidemoblockchain.RepositoryDao.BaseFunctions.Implementations.PairTrustBaseFunction.sendTrustTransactionMessage(PairTrustBaseFunction.java:165)
~[classes/:na] at
com.apidemoblockchain.RepositoryDao.Implementations.PairTrustDataAccessRepository.run(PairTrustDataAccessRepository.java:79)
~[classes/:na] at java.base/java.lang.Thread.run(Thread.java:834)
~[na:na]
While my submitting method goes like this:
public void sendTrustTransactionMessage(Gateway gateway, Contract trustContract, String payload) throws TimeoutException, InterruptedException, InvalidArgumentException, TransactionException, ContractException {
// Prepare
checkIfChannelIsReady(gateway);
// Execute
trustContract.submitTransaction(getCreateTrustMethod(), payload);
}
I'm using a 4-org network with 2 peers each, and I am using 3 channels, one for each chaincode DataType, in order to keep things clean.
I think the error coming from the Channel doesn't make sense, because I am using the Contract to submit it.
I open the gateway and then keep it open to continuously submit the transactions:
try (Gateway gateway = getBuilder(getTrustPeer()).connect()) {
Contract trustContract = gateway.getNetwork(getTrustChaincodeChannelName()).getContract(getTrustChaincodeId(), getTrustChaincodeName());
while (!terminateLoop) {
if (message) {
String payload = preparePayload();
sendTrustTransactionMessage(gateway, trustContract, payload);
}
...
wait();
}
...
}
EDIT:
After reading @bestbeforetoday's advice, I've managed to catch the ContractException and analyze the logs. Still, I don't fully understand where the bug might be and, therefore, how to fix it.
I'll add 3 screenshots that I've taken of the ProposalResponses received in the exception, and a comment after them.
ProposalResponses-1
ProposalResponses-2
ProposalResponses-3
So, in the first picture, I can see that 3 proposal responses were received in the exception, and the exception cause message says:
"The proposal responses have 2 inconsistent groups with 0 that are invalid. Expected all to be consistent and none to be invalid."
Pictures 2 and 3 show the content of those responses, and I notice that there are 2 fields holding a null value, namely "ProposalResponsePayload" and "timestamp_"; however, I don't know if those are the "two groups" referred to in the exception's cause message.
Thanks in advance...
It seems that, while the endorsing peers all successfully endorsed your transaction proposal, those peer responses were not all byte-for-byte identical.
There are several things that might differ, including read/write sets or the value returned from the transaction function invocation. There are several reasons why differences might occur, including non-deterministic transaction function implementation, different transaction function behaviour between peers, or different ledger state at different peers.
To figure out what caused this specific failure you probably need to look at the peer responses to identify how they differ. You should be getting a ContractException thrown back from your transaction submit call, and this should allow you to access the proposal responses by calling e.getProposalResponses():
https://hyperledger.github.io/fabric-gateway-java/release-2.2/org/hyperledger/fabric/gateway/ContractException.html#getProposalResponses()
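For example, a hedged sketch of how the submit call from the question could dump those responses for comparison, using the ProposalResponse type from the underlying fabric-sdk-java as described in the Javadoc above:
try {
    trustContract.submitTransaction(getCreateTrustMethod(), payload);
} catch (ContractException e) {
    // Log each endorsing peer's response so the two inconsistent groups can be compared.
    for (ProposalResponse response : e.getProposalResponses()) {
        System.err.println("peer=" + response.getPeer().getName()
                + " status=" + response.getStatus()
                + " message=" + response.getMessage());
    }
    throw e;
}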

Google API throwing Rate Limit Exceeded 403 exception for CDN cache invalidation

We are caching our pages and content in Google CDN.
Google has provided us an API to invalidate cache for a particular page/path.
Our website is built using a CMS called AEM (Adobe Experience Manager). This CMS supports constant page/content updates; e.g., we may update what is shown on our https://our-webpage/homepage.html twice in a day. When such an operation is done, we need to flush the cache at the Google CDN for "homepage.html".
This kind of activity is very common, meaning we need to send several thousand cache invalidation requests in a day.
We are sending so many invalidation requests that after some time we get this error:
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
"code" : 403,
"errors" : [ {
"domain" : "usageLimits",
"message" : "Rate Limit Exceeded",
"reason" : "rateLimitExceeded"
} ],
"message" : "Rate Limit Exceeded"
}
How do we solve this?
I've read this page https://developers.google.com/drive/api/v3/handle-errors
It mentions batching requests.
How do I send invalidation requests for multiple pages to Google CDN in one batch?
Or is it possible to increase or set the API flush call limit to a higher number per day?
Right now, if we have 100 pages to flush from the CDN, we make the HTTP call below 100 times (one for each page).
CacheInvalidationRule requestBody = new CacheInvalidationRule();
// IMPORTANT
requestBody.setPath(pagePath);
Compute computeService = createComputeService();
Compute.UrlMaps.InvalidateCache request =
computeService.urlMaps().invalidateCache(projectName, urlMap, requestBody);
Operation response = request.execute();
if(LOG.isDebugEnabled()) {
LOG.debug("Google CDN Flush Response JSON :: {}",response);
}
LOG.info("Google CDN Flush Invalidation for Page Path {}:: Response Status Code:: {}",pagePath,response.getStatus());
We set the page to flush in requestBody.setPath(pagePath);
Can we do this in a more efficient way, like sending all pages as an array of strings in one HTTP call?
Like :
requestBody.setPath(pagePath);
Where
pagePath="['/homepage.html','/videos.html','/sports/basketball.html','/tickets.html','/faqs.html']";
Rate Limit Exceeded is flood protection: you are going too fast, so slow down your requests.
Implement exponential backoff for retrying the requests.
You can periodically retry a failed request over an increasing amount of time to handle errors related to rate limits, network volume, or response time. For example, you might retry a failed request after one second, then after two seconds, and then after four seconds. This method is called exponential backoff and it is used to improve bandwidth usage and maximize throughput of requests in concurrent environments. When using exponential backoff, consider the following:
Start retry periods at least one second after the error.
If the attempted request introduces a change, such as a create request, add a check to make sure nothing is duplicated. Some errors, such as invalid authorization credentials or "file not found" errors, aren’t resolved by retrying the request.
Batching won't help much; you're still going to hit the same rate-limit issues, and I have even seen rate-limit errors when batching.
Kindly note that your link is from the Google Drive API; I'm not even sure Cloud CDN supports batching of requests. A sketch of retrying the invalidation call with backoff is below.
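A minimal sketch of that advice, reusing the computeService, projectName, urlMap and requestBody from the question; the attempt count and initial delay are arbitrary starting points, not documented values:
long delayMillis = 1000;
Operation response = null;
for (int attempt = 1; attempt <= 5; attempt++) {
    try {
        response = computeService.urlMaps()
                .invalidateCache(projectName, urlMap, requestBody)
                .execute();
        break; // success, stop retrying
    } catch (GoogleJsonResponseException e) {
        if (e.getStatusCode() != 403 || attempt == 5) {
            throw e; // not a rate-limit error, or out of attempts
        }
        Thread.sleep(delayMillis);
        delayMillis *= 2; // exponential backoff: 1s, 2s, 4s, ...
    }
}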
Wouldn't it be better to aggregate several updates on the AEM side and send only one request to the CDN after a maximum period of time and/or a maximum number of changes?
I mean, if you change your homepage on AEM, you would usually invalidate all the subpages as well (navigation might change, ...).
Isn't there a possibility for the Google CDN to invalidate a tree or subtree?
At least that's what I would extract from this documentation https://cloud.google.com/sdk/gcloud/reference/compute/url-maps/invalidate-cdn-cache
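The gcloud reference linked above accepts a trailing wildcard (e.g. --path "/sports/*"), so presumably the same path pattern can be passed through the Java client to invalidate a whole subtree in one call instead of one request per page. A hedged sketch, reusing the names from the question:
// One invalidation for everything under /sports/ instead of one call per page.
// Whether the wildcard fits your URL map should be verified against the Cloud CDN docs.
CacheInvalidationRule requestBody = new CacheInvalidationRule();
requestBody.setPath("/sports/*");
Operation response = computeService.urlMaps()
        .invalidateCache(projectName, urlMap, requestBody)
        .execute();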

Google calendar API v3 returns (503 backendError) on channel creation

I'm using Calendar API client library for Java to watch channels and get push notifications. Sometimes, when I try to create a channel on Google, it returns the following error response:
{
"code" : 503,
"errors" : [ {
"domain" : "global",
"message" : "Failed to create channel",
"reason" : "backendError"
} ],
"message" : "Failed to create channel"
}
There is nothing about handling this error in the documentation:
https://developers.google.com/google-apps/calendar/v3/errors
However, I guess it could happen because a high number of requests are sent to Google and it refuses the connection. Maybe I need to retry after some time.
The question is what's the correct way to handle this error and to start watching the desired channel?
The root cause of this issue is probably heavy network traffic.
The Google Calendar API suggests implementing exponential backoff for this kind of error.
An exponential backoff is an algorithm that repeatedly attempts to execute some action until that action has succeeded, waiting an amount of time that grows exponentially between each attempt, up to some maximum number of attempts.
You can find implementation ideas here; a minimal sketch follows.
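A sketch only, assuming you create the channel with calendarService.events().watch(...) (calendarId and channel are whatever you already pass in): it wraps the call with the ExponentialBackOff helper from the Google HTTP client library and retries only on 5xx responses.
ExponentialBackOff backOff = new ExponentialBackOff.Builder()
        .setInitialIntervalMillis(1000)   // start retrying after about a second
        .setMaxElapsedTimeMillis(60_000)  // give up after roughly a minute
        .build();
Channel created = null;
while (created == null) {
    try {
        created = calendarService.events().watch(calendarId, channel).execute();
    } catch (GoogleJsonResponseException e) {
        long pause = backOff.nextBackOffMillis();
        if (e.getStatusCode() < 500 || pause == BackOff.STOP) {
            throw e; // not a backend error, or backoff exhausted
        }
        Thread.sleep(pause);
    }
}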

Error 204 in a Google App Engine API in Java

We have an API on Google App Engine. The API is a search engine: when a user requests a productID, the API returns a JSON with a group of other productIDs (matching specific criteria). This is the current configuration:
<instance-class>F4_1G</instance-class>
<automatic-scaling>
<min-idle-instances>3</min-idle-instances>
<max-idle-instances>automatic</max-idle-instances>
<min-pending-latency>automatic</min-pending-latency>
<max-pending-latency>automatic</max-pending-latency>
</automatic-scaling>
We use app_engine_release=1.9.23
The process is as follows: we make two calls to the datastore and one call with urlfetch (to an external API).
The problem is that from time to time we receive an error 204 with this trace:
ms=594 cpu_ms=0 exit_code=204 app_engine_release=1.9.23
A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. (Error code 204)
This is what we got in the client:
{
"error": {
"errors": [
{
"domain": "global",
"reason": "backendError",
"message": ""
}
],
"code": 503,
"message": ""
}
}
We changed the number of resident instances from 3 to 7 and we got the same error. The errors also occur on the same instances; we see 4 errors within a very small amount of time.
We found that the problem was with the urlfetch call. If we set a high timeout, it returns a lot of errors.
Any idea why this is happening?
I believe I have found the problem. It was related to the urlfetch call. I did many tests until I isolated the problem: when I made calls only to the datastore, everything worked as expected, but when I added the urlfetch call, it produced the 204 errors. It happened every time, so I believe it could be a bug.
What I did to get rid of the error was to remove the Google Cloud Endpoints layer and use a basic servlet. I found that with the servlet plus the urlfetch call we don't get the error, so the problem might not be related to urlfetch alone but to the combination of urlfetch and Google Cloud Endpoints. A rough sketch of that workaround is below.
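A rough sketch of that workaround, with placeholder class and URL names: a plain servlet that performs the external call through the low-level URLFetchService with an explicit deadline, instead of going through a Cloud Endpoints method.
public class ProductSearchServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        // Explicit deadline on the external call; 10 seconds is a placeholder to tune.
        HTTPRequest external = new HTTPRequest(
                new URL("https://external-api.example.com/related-products"),
                HTTPMethod.GET,
                FetchOptions.Builder.withDeadline(10));
        HTTPResponse externalResponse = fetcher.fetch(external);
        resp.setContentType("application/json");
        resp.getOutputStream().write(externalResponse.getContent());
    }
}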
