Why does Datastore produce Timeout while fetching URL? - java

For a short time today, my code running in the Flexible Environment "compat" runtime and using the Google Cloud Datastore API got java.net.SocketTimeoutException: Timeout while fetching URL when inserting an Item into Datastore in another GAE project.
In addition, Dataflow failed to insert Items at that time, almost certainly due to the same problem.
This also occurred on simple key queries, so it is not a problem of heavy data.
Before and after this, a lot of other data was inserted correctly, including a rerun on this same data.
Googling the error suggests it can be caused by downtime in the Google Cloud, but the Google Cloud Status Dashboard shows green.
What caused this? How can we avoid it in the future?
com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities run: ERROR in CopyEntities(commerceDocs/RFQ)
java.lang.RuntimeException: com.google.cloud.datastore.DatastoreException: I/O error
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.retryIfAllowed(GCloudApiDSBackup.java:963)
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.putEntities(GCloudApiDSBackup.java:857)
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.retryIfAllowed(GCloudApiDSBackup.java:959)
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.putEntities(GCloudApiDSBackup.java:857)
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.putEntitiesByParts(GCloudApiDSBackup.java:991)
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.run(GCloudApiDSBackup.java:801)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.cloud.datastore.DatastoreException: I/O error
    at com.google.cloud.datastore.spi.DefaultDatastoreRpc.translate(DefaultDatastoreRpc.java:105)
    at com.google.cloud.datastore.spi.DefaultDatastoreRpc.commit(DefaultDatastoreRpc.java:133)
    at com.google.cloud.datastore.DatastoreImpl$4.call(DatastoreImpl.java:390)
    at com.google.cloud.datastore.DatastoreImpl$4.call(DatastoreImpl.java:387)
    at com.google.cloud.RetryHelper.doRetry(RetryHelper.java:179)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:244)
    at com.google.cloud.datastore.DatastoreImpl.commit(DatastoreImpl.java:386)
    at com.google.cloud.datastore.DatastoreImpl.commitMutation(DatastoreImpl.java:380)
    at com.google.cloud.datastore.DatastoreImpl.put(DatastoreImpl.java:340)
    at com.freightos.backup.datastore.gcloudapi.GCloudApiDSBackup$CopyEntities.putEntities(GCloudApiDSBackup.java:836)
    ... 9 more
Caused by: com.google.datastore.v1.client.DatastoreException: I/O error, code=UNAVAILABLE
    at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
    at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:95)
    at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
    at com.google.cloud.datastore.spi.DefaultDatastoreRpc.commit(DefaultDatastoreRpc.java:131)
    ... 17 more
Caused by: java.net.SocketTimeoutException: Timeout while fetching URL: https://datastore.googleapis.com/v1/projects/freightos-prod-backup2:commit
    at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:173)
    at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:45)
    at com.google.api.client.extensions.appengine.http.UrlFetchRequest.execute(UrlFetchRequest.java:74)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:87)
    ... 19 more

It looks like UrlFetchTransport is being used when it shouldn't be.
I've filed this issue to fix the google-cloud-java library.
In the meantime, you should be able to force it to use NetHttpTransport when you construct the DatastoreOptions object:
import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.auth.http.HttpTransportFactory;
import com.google.cloud.datastore.DatastoreOptions;

DatastoreOptions.newBuilder()
    .setHttpTransportFactory(new HttpTransportFactory() {
        @Override
        public HttpTransport create() {
            return new NetHttpTransport();
        }
    })
    ...

Related

Getting java.io.IOException: Error getting access token for service account: connect timed out while making a call to datastore

My application works fine locally and I'm able to connect to GCP Datastore from local. But when deployed to a server, I'm getting the below exception.
Caused by: com.google.datastore.v1.client.DatastoreException: I/O error
    at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:136)
    at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:105)
    at com.google.datastore.v1.client.Datastore.beginTransaction(Datastore.java:79)
    at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc.beginTransaction(HttpDatastoreRpc.java:153)
    ... 92 common frames omitted
Caused by: java.io.IOException: Error getting access token for service account: connect timed out
    at com.google.auth.oauth2.ServiceAccountCredentials.refreshAccessToken(ServiceAccountCredentials.java:444)
    at com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:157)
    at com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:145)
    at com.google.auth.oauth2.ServiceAccountCredentials.getRequestMetadata(ServiceAccountCredentials.java:603)
    at com.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:91)
    at com.google.cloud.http.HttpTransportOptions$1.initialize(HttpTransportOptions.java:159)
    at com.google.cloud.http.CensusHttpModule$CensusHttpRequestInitializer.initialize(CensusHttpModule.java:109)
    at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc$1.initialize(HttpDatastoreRpc.java:91)
    at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:91)
    ... 94 common frames omitted
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
    at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
    at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)
    at java.base/java.net.Socket.connect(Socket.java:591)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:285)
    at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
    at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265)
    at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372)
    at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1181)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1075)
    at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1356)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1331)
    at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:241)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:113)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
    at com.google.auth.oauth2.ServiceAccountCredentials.refreshAccessToken(ServiceAccountCredentials.java:441)
Edit -
Scopes under this credential
[https://www.googleapis.com/auth/pubsub,
https://www.googleapis.com/auth/spanner.admin,
https://www.googleapis.com/auth/spanner.data,
https://www.googleapis.com/auth/datastore,
https://www.googleapis.com/auth/sqlservice.admin,
https://www.googleapis.com/auth/devstorage.read_only,
https://www.googleapis.com/auth/devstorage.read_write,
https://www.googleapis.com/auth/cloudruntimeconfig,
https://www.googleapis.com/auth/trace.append,
https://www.googleapis.com/auth/cloud-platform,
https://www.googleapis.com/auth/cloud-vision,
https://www.googleapis.com/auth/bigquery,
https://www.googleapis.com/auth/monitoring.write]
Any leads would be appreciated.
Thank you in advance!
It turned out to be a connectivity issue: our server (in AWS) didn't have access to Datastore.
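Since the trace above fails while refreshing the OAuth access token, a quick way to confirm this kind of connectivity problem from the affected server is to try reaching Google's endpoints directly. This is only an illustrative plain-Java check I'm adding; the host names are the standard Datastore API and OAuth token endpoints (the token host can vary by auth library version), not something taken from the original post:

import java.net.HttpURLConnection;
import java.net.URL;

public class ConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Both the Datastore API endpoint and the OAuth token endpoint must be
        // reachable from the server; the token refresh in the trace above was
        // failing with "connect timed out".
        String[] endpoints = {
            "https://datastore.googleapis.com",
            "https://oauth2.googleapis.com"
        };
        for (String endpoint : endpoints) {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setConnectTimeout(5000); // fail fast instead of hanging
            conn.setReadTimeout(5000);
            conn.connect();
            System.out.println(endpoint + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}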

Java multi threaded application - getting "Bad File Descriptor" exception on Hive intermittently

I know this kind of question has been asked before, but I still couldn't find a solution after reading those posts, so I decided to post the question again here.
I am working on a Java multi-threaded application where I run HQL queries over JDBC against a Hive environment. I have a bunch of Hive SQL queries and I execute them on Hive in parallel with multiple threads, and I get the following exception when the query count is higher (for example, when I run more than 100 queries). Can someone please look into this and help me?
2020-06-16 06:00:45,314 ERROR [main]: Terminal exception
java.lang.Exception: Map step agg_cas_auth_reinstate_derive failed.
at com.mine.idn.magellan.ParallelExecGraph.execute(ParallelExecGraph.java:198)
at com.mine.idn.magellan.WarehouseSession.executeMap(WarehouseSession.java:332)
at com.mine.idn.magellan.StandAloneEnv.execute(StandAloneEnv.java:872)
at com.mine.idn.magellan.StandAloneEnv.execute(StandAloneEnv.java:778)
at com.mine.idn.magellan.StandAloneEnv.executeAndExit(StandAloneEnv.java:642)
at com.mine.idn.magellan.StandAloneEnv.main(StandAloneEnv.java:77)
Caused by: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. java.io.IOException: Bad file descriptor
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: Bad file descriptor
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2850)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2685)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2591)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1077)
at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:2007)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:479)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:469)
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:190)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:601)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:599)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.mapred.JobClient.getJobUsingCluster(JobClient.java:599)
at org.apache.hadoop.mapred.JobClient.getJobInner(JobClient.java:609)
at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:639)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:295)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:559)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:425)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:201)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:79)
Caused by: java.io.IOException: Bad file descriptor
at java.io.FileInputStream.close0(Native Method)
at java.io.FileInputStream.access$000(FileInputStream.java:49)
at java.io.FileInputStream$1.close(FileInputStream.java:336)
at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
at java.io.FileInputStream.close(FileInputStream.java:334)
at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2676)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2661)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2741)
... 22 more
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
at com.mine.idn.magellan.WarehouseSession.executeMapStep(WarehouseSession.java:797)
at com.mine.idn.magellan.WarehouseSession.access$000(WarehouseSession.java:23)
at com.mine.idn.magellan.WarehouseSession$ParallelExecResources.executeMapStep(WarehouseSession.java:91)
at com.mine.idn.magellan.ParallelExecGraph$Node.run(ParallelExecGraph.java:85)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
What I don't understand is why the Hadoop framework is throwing a Bad File Descriptor exception. My Java code invokes the Hadoop/Hive code, and that is what throws this exception.
Also, the issue is intermittent, not consistent. If I re-run the same application, in most cases it goes through.
Thank you.
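For reference, here is a minimal sketch of the kind of parallel JDBC setup described above, assuming the standard Hive JDBC driver and a placeholder HiveServer2 URL; it only illustrates the scenario in the question and is not a fix for the server-side Bad file descriptor error:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHiveRunner {
    // Placeholder JDBC URL; substitute the real HiveServer2 host, port, and database.
    private static final String HIVE_URL = "jdbc:hive2://hiveserver2-host:10000/default";

    public static void runAll(List<String> queries, int threads) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String hql : queries) {
            pool.submit(() -> {
                // Each task uses its own Connection/Statement; Hive JDBC objects
                // should not be shared across threads.
                try (Connection conn = DriverManager.getConnection(HIVE_URL, "user", "");
                     Statement stmt = conn.createStatement()) {
                    stmt.execute(hql);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}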

Caused by: java.lang.UnsupportedOperationException: BigQuery source must be split before being read

I am trying to read from BigQuery using the Java BigQueryIO read method, but I am getting the error below.
public POutput expand(PBegin pBegin) {
    final String queryOperation = "select query";
    return pBegin
        .apply(BigQueryIO.readTableRows().fromQuery(queryOperation));
}
2020-06-08 19:32:01.391 IST Error message from worker: java.io.IOException: Failed to start reading from source: org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter@77f0db34
    org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:792)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:361)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1320)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:151)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1053)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: BigQuery source must be split before being read
    org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.createReader(BigQuerySourceBase.java:173)
    org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.advance(UnboundedReadFromBoundedSource.java:467)
    org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.access$300(UnboundedReadFromBoundedSource.java:446)
    org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$Reader.advance(UnboundedReadFromBoundedSource.java:298)
    org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$Reader.start(UnboundedReadFromBoundedSource.java:291)
    org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:787)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:361)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1320)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:151)
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1053)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:748)
I suppose this issue might be connected with a missing tempLocation pipeline execution parameter when using the DataflowRunner for cloud execution.
According to the documentation:
If tempLocation is not specified and gcpTempLocation is, tempLocation will not be populated.
Since this is just my presumption, I would also encourage you to inspect the native Apache Beam runtime logs to gather more evidence, since the Stackdriver logs may not reflect the full picture of the problem.
A separate Jira ticket, BEAM-9043, was raised about this vaguely worded error message.
Feel free to add more specific information to your original question for any further concerns or essential updates.
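If the missing tempLocation really is the cause, the usual fix is to supply a GCS temp location when constructing the pipeline options. The following is only an illustrative sketch under that assumption; the gs://your-bucket/temp path and the class name are placeholders, and the same values can also be passed on the command line as --tempLocation / --gcpTempLocation:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PipelineLauncher {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
            .fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

        // BigQueryIO needs a GCS location to stage its temporary/export files.
        options.setTempLocation("gs://your-bucket/temp");
        options.setGcpTempLocation("gs://your-bucket/temp");

        Pipeline pipeline = Pipeline.create(options);
        // ... apply the BigQueryIO read and the rest of the transforms here ...
        pipeline.run();
    }
}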

Azure Blob Storage with Apache Flink 1.10

I am trying to use Azure Blob Storage with Apache Flink 1.10 for checkpointing purposes.
I followed all the instructions in the Flink documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/filesystems/azure.html
Step 1:
mkdir ./plugins/azure-fs-hadoop
cp ./opt/flink-azure-fs-hadoop-1.10.0.jar ./plugins/azure-fs-hadoop/
Step 2:
This is what I have in flink-conf.yaml:
# Azure Storage Key
fs.azure.account.key.<storage-account>.blob.core.windows.net: xxxxxxxxxxxxxxxxxx
Step 3:
Use Azure Blob storage for checkpointing
This is what I have in my Flink job:
final StateBackend stateBackend = new FsStateBackend("wasb://flink-blob@<storage-account>.blob.core.windows.net/checkpoint");
I am not sure if I missed anything here, but when I submit the job, I get the exception below (an AskTimeoutException from an actor).
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1741)
at org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1620)
at com.example.flink.checkpointing.CheckpointExample.main(CheckpointExample.java:78)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
... 8 more
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1736)
... 17 more
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:359)
at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1085)
at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error., <Exception on server side:
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#75666936]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
at java.base/java.lang.Thread.run(Thread.java:834)
End of exception on server side>]
at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:390)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:374)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
... 4 more
I think your problem is twofold. The true failure cause is hidden because of the AskTimeoutException. This problem has been solved with FLINK-16018, which will be released with Flink 1.10.1. The problem is that the timeout value is too aggressive, so a long-lasting job submission fails on the client side.
For the true failure cause, I would recommend taking a look at Flink's jobmanager.log. It should contain information about what went wrong. I suspect there is a misconfiguration of the Azure Blob Storage.
That's correct. You must configure the storage account key and set your checkpoint and savepoint directories in the following format:
fs.azure.account.key.<storage-account-name>.blob.core.windows.net: <key>
wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/
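Put together, a hedged sketch of what that could look like in the job code is shown below; the container, storage account, and path are placeholders, the account key still goes into flink-conf.yaml as shown above, and the checkpoint interval is an assumption of mine, not something from the question:

import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToAzureBlob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (assumed interval for illustration).
        env.enableCheckpointing(60_000);

        // Note the wasbs:// scheme and the '@' between container and storage account,
        // matching the format from the answer above.
        StateBackend stateBackend = new FsStateBackend(
            "wasbs://<container>@<storage-account-name>.blob.core.windows.net/checkpoint");
        env.setStateBackend(stateBackend);

        // ... define sources, transformations, and sinks here ...

        env.execute("checkpoint-to-azure-blob-example");
    }
}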

How to debug why the map job fails after multiple retries

I wrote a mapreduce job to scan an hbase table for a certain time range to count certain elements we need for analysis.
Mappers in the MR job keep failing, but I don't know why. Each time I run the job, a different number of mappers fail. The YARN log (see below) from Cloudera Manager isn't helpful in pointing out what the problem is, although someone said I might be running out of memory.
It seems to retry multiple times, but each time it fails. What do I need to do to stop it from failing, or how can I log things to help me better determine what is happening?
Below is a log from YARN for one of the mappers that failed.
Error: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Thu Jun 15 16:26:57 PDT 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60301: row '152_p3401.db161139.sjc102.dbi_1496271480' on table 'dbi_based_data' at region=dbi_based_data,151_p3413.db162024.iad4.dbi_1476974340,1486675565213.d83250d0682e648d165872afe5abd60e., hostname=hslave35118.ams9.mysecretdomain.com,60020,1483570489305, seqNum=19308931
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:207)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:403)
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:236)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.nextKeyValue(TableInputFormatBase.java:216)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=60301: row '152_p3401.db161139.sjc102.dbi_1496271480' on table 'dbi_based_data' at region=dbi_based_data,151_p3413.db162024.iad4.dbi_1476974340,1486675565213.d83250d0682e648d165872afe5abd60e., hostname=hslave35118.ams9.mysecretdomain.com,60020,1483570489305, seqNum=19308931
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Call to hslave35118.ams9.mysecretdomain.com/10.216.35.118:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=12, waitTime=60001, operationTimeout=60000 expired.
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:291)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:226)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:331)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:219)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:64)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
    ... 4 more
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=12, waitTime=60001, operationTimeout=60000 expired.
    at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
    ... 13 more
So it looks like in my case I needed to extend the timeout settings. In my Java program I had to add the following lines to make the exception go away:
conf.set("hbase.rpc.timeout","90000");
conf.set("hbase.client.scanner.timeout.period","90000");
The answer was found at this link on Cloudera's site.
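For context, here is a hedged sketch of where those two settings slot into a typical HBase scan MapReduce job driver; the table name and timeout values mirror the question and answer above, while the mapper, scan tuning, and job wiring are generic assumptions of mine rather than the asker's actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobDriver {

    // Hypothetical mapper that just counts scanned rows via a counter.
    public static class CountMapper extends TableMapper<ImmutableBytesWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) {
            context.getCounter("scan", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Raise the RPC and scanner timeouts before the job is created,
        // as in the answer above.
        conf.set("hbase.rpc.timeout", "90000");
        conf.set("hbase.client.scanner.timeout.period", "90000");

        Scan scan = new Scan();
        scan.setCaching(100);        // smaller batches keep each scanner RPC cheaper
        scan.setCacheBlocks(false);  // recommended for full-table MR scans

        Job job = Job.getInstance(conf, "dbi_based_data scan");
        job.setJarByClass(ScanJobDriver.class);
        TableMapReduceUtil.initTableMapperJob(
            "dbi_based_data",        // table name from the question
            scan,
            CountMapper.class,
            ImmutableBytesWritable.class,
            LongWritable.class,
            job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}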
