BigQuery insert retry policy in Apache Beam - Java

The Apache Beam API provides the following BigQuery insert retry policies.
How does a Dataflow job behave if I specify retryTransientErrors?
shouldRetry gives me the error from BigQuery so I can decide whether to retry. Where can I find the expected errors from BigQuery?
BigQuery insert retry policies
https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.html
alwaysRetry - Always retry all failures.
neverRetry - Never retry any failures.
retryTransientErrors - Retry all failures except for known persistent errors.
shouldRetry - Return true if this failure should be retried.
Background
When my Cloud Dataflow job inserted a very old timestamp (more than a year in the past) into BigQuery, I got the following error.
jsonPayload: {
exception: "java.lang.RuntimeException: java.io.IOException: Insert failed:
[{"errors":[{"debugInfo":"","location":"","message":"Value 690000000 for field
timestamp_scanned of the destination table fr-prd-datalake:rfid_raw.store_epc_transactions_cr_uqjp is outside the allowed bounds.
You can only stream to date range within 365 days in the past and 183 days in
the future relative to the current date.","reason":"invalid"}],
After the first error, Dataflow kept retrying the insert, and BigQuery kept rejecting it with the same error.
The retries did not stop, so I added retryTransientErrors to the BigQueryIO.Write step, and then they stopped.

How does a Dataflow job behave if I specify retryTransientErrors?
All errors are considered transient, except if BigQuery says the error reason is one of "invalid", "invalidQuery", or "notImplemented".
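For reference, a minimal sketch of how this policy can be attached to the write step (the table spec and the surrounding pipeline below are placeholders, not taken from the question):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.values.PCollection;

class WriteWithRetryPolicy {
    // Attaches the transient-error retry policy to a streaming-insert write step.
    // The table spec is a placeholder; schema/create disposition handling is omitted.
    static void write(PCollection<TableRow> rows) {
        rows.apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                        .to("my-project:my_dataset.my_table")
                        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                        // retries all failures except those BigQuery flags as persistent
                        // ("invalid", "invalidQuery", "notImplemented")
                        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));
    }
}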
shouldRetry gives me the error from BigQuery so I can decide whether to retry. Where can I find the expected errors from BigQuery?
You can't since the errors are not visible to the caller. I'm not sure if this was done on purpose or whether Apache Beam should expose the errors so users can write their own retry logic.

Related

What should I try next in order to minimise/eliminate java.net.SocketTimeoutException: timeout spring retry

I get lots of events to process in RabbitMQ, which are forwarded to service1 for processing; after some processing, service1 makes an internal call to micro service2. However, I frequently get java.net.SocketTimeoutException: timeout when calling service2, so as a first attempt I increased the timeout limit from 2 s to 10 s. That reduced the timeout exceptions, but many of them still occur.
The second change I made was to remove Spring's deprecated retry method and replace it with retryWhen, with backoff and a jitter factor, as shown below:
.retryWhen(Retry.backoff(ServiceUtils.NUM_RETRIES, Duration.ofSeconds(2)).jitter(0.50)
        .onRetryExhaustedThrow((retryBackoffSpec, retrySignal) -> {
            throw new ServiceException(
                    ErrorBo.builder()
                            .message("Service failed to process after max retries")
                            .build());
        }))
.onErrorResume(error -> {
    // return and log the error only once all retries have been exhausted
    log.error(error.getMessage() + ". Error occurred while generating pdf");
    return Mono.error(ServiceUtils
            .returnServiceException(ServiceErrorCodes.SERVICE_FAILURE,
                    String.format("Service failed to process after max retries, failed to generate PDF")));
})
);
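As a side note on the first change (raising the timeout from 2 s to 10 s): assuming service2 is called through Spring WebClient backed by Reactor Netty (this is not stated in the question), the response timeout is usually raised on the underlying HttpClient, roughly like this:

import java.time.Duration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

class Service2ClientConfig {
    // Builds a WebClient for service2 whose response timeout is raised to 10 seconds.
    // The base URL and the timeout value are placeholders.
    static WebClient service2Client() {
        HttpClient httpClient = HttpClient.create()
                .responseTimeout(Duration.ofSeconds(10));
        return WebClient.builder()
                .baseUrl("http://service2")
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}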
So my questions are:
I get success for some service calls and failure for others. Does this mean there is still a bottleneck in processing the requests, perhaps on the server side, so that not all requests get processed?
Do I still need to increase the timeout limit, if possible?
How do I make sure there is no java.net.SocketTimeoutException: timeout at all?
This issue started appearing recently, and there seem to have been no changes to ports or any connection-level settings.
Still, what should I check to make sure the connection-level settings are correct? Could someone please guide me on this.
Thanks in advance.

Handling exception resulting from transaction timeout when using Spring / Hibernate / Postgres

I'm attempting to provide a useful error message to users of a system where some queries may take a long time to execute. I've set a transaction timeout using Spring's @Transactional(timeout = 5).
This works as expected, but the caller of the annotated method receives a JpaSystemException: could not extract ResultSet exception (caused by GenericJDBCException: could not extract ResultSet, in turn caused by PSQLException: ERROR: canceling statement due to user request).
As far as I can tell, the exception is a result of a statement being cancelled by Hibernate and the JDBC driver once the timeout has been exceeded.
Is there any way I can determine that the exception was a result of the transaction timeout being exceeded so that I can provide an error message to the user about why their query failed?
The application is using Spring framework 4.2.9, Hibernate 4.3.11, and Postgres JDBC driver 9.4.1209. I realise these are quite old. If newer versions make handling this situation easier I would be interested to know.
How about checking the exception message of the cause of the cause against a regex pattern, or simply checking whether the message contains a certain string, like this?
exception.getCause().getCause().getMessage().equals("ERROR: canceling statement due to user request")
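A slightly more defensive variant (purely a sketch; it assumes the cause chain described above ends in a PSQLException, which is a SQLException) walks the cause chain and checks the Postgres SQLState for "query canceled" (57014) instead of matching the exact message:

import java.sql.SQLException;

class QueryCancelDetector {
    // Walks the cause chain and reports whether any cause is a SQLException whose
    // SQLState is 57014 ("query_canceled"), which Postgres uses when a statement
    // is cancelled, e.g. because the transaction timeout was exceeded.
    static boolean isStatementCancelled(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof SQLException
                    && "57014".equals(((SQLException) cause).getSQLState())) {
                return true;
            }
        }
        return false;
    }
}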

Send operation timed out when producing events to Azure Event Hubs

I have a long-running application that uses the Azure Event Hubs SDK (5.1.0) to continually publish data to Azure Event Hubs. The service threw the exception below after a few days. What could be the cause of this, and how can we overcome it?
Stack Trace:
Exception in thread "SendTimeout-timer" reactor.core.Exceptions$BubblingException: com.azure.core.amqp.exception.AmqpException: Entity(abc): Send operation timed out, errorContext[NAMESPACE: abc-eventhub.servicebus.windows.net, PATH: abc-metrics, REFERENCE_ID: 70288acf171a4614ab6dcfe2884ee9ec_G2S2, LINK_CREDIT: 210]
at reactor.core.Exceptions.bubble(Exceptions.java:173)
at reactor.core.publisher.Operators.onErrorDropped(Operators.java:612)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.onError(FluxTimeout.java:203)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:185)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:251)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onError(FluxHide.java:132)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:185)
at com.azure.core.amqp.implementation.ReactorSender$SendTimeout.run(ReactorSender.java:565)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
Caused by: com.azure.core.amqp.exception.AmqpException: Entity(abc-metrics): Send operation timed out, errorContext[NAMESPACE: abc-eventhub.servicebus.windows.net, PATH: abc-metrics, REFERENCE_ID: 70288acf171a4614ab6dcfe2884ee9ec_G2S2, LINK_CREDIT: 210]
at com.azure.core.amqp.implementation.ReactorSender$SendTimeout.run(ReactorSender.java:562)
... 2 more
I'm using the Azure Event Hubs Java SDK 5.1.0.
According to the official documentation:
A TimeoutException indicates that a user-initiated operation is taking longer than the operation timeout.
For Event Hubs, the timeout is specified either as part of the connection string, or through ServiceBusConnectionStringBuilder. The error message itself might vary, but it always contains the timeout value specified for the current operation.
Common causes
There are two common causes for this error: incorrect configuration, or a transient service error.
Incorrect configuration: The operation timeout might be too small for the operational condition. The default value for the operation timeout in the client SDK is 60 seconds. Check to see if your code has the value set to something too small. The condition of the network and CPU usage can affect the time it takes for a particular operation to complete, so the operation timeout should not be set to a small value.
Transient service error: Sometimes the Event Hubs service can experience delays in processing requests; for example, during periods of high traffic. In such cases, you can retry your operation after a delay, until the operation is successful. If the same operation still fails after multiple attempts, visit the Azure service status site to see if there are any known service outages.
If you are consistently seeing this error, I would suggest reaching out to Azure support for a deeper look.
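If an overly small operation timeout is the culprit, it can be raised through the retry options when building the client. A rough sketch against the 5.x SDK (the connection string, hub name, and the two-minute value are placeholders; depending on the exact 5.x version the builder method may be named retry(...) instead of retryOptions(...)):

import java.time.Duration;
import com.azure.core.amqp.AmqpRetryOptions;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;

class ProducerWithLongerTimeout {
    // Builds a producer whose per-try (operation) timeout is raised from the
    // 60-second default to two minutes. Timeout and retry count are placeholders.
    static EventHubProducerClient create(String connectionString, String eventHubName) {
        AmqpRetryOptions retryOptions = new AmqpRetryOptions()
                .setTryTimeout(Duration.ofMinutes(2))
                .setMaxRetries(3);
        return new EventHubClientBuilder()
                .connectionString(connectionString, eventHubName)
                .retryOptions(retryOptions) // may be retry(...) in early 5.x releases
                .buildProducerClient();
    }
}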

How to find the status of loaded records when we forcefully interrupt batch execution by stopping the MSSQL database

We are implementing connection/flush retry logic for the database.
Auto-commit=true;
RetryPolicy retryPolicy = new RetryPolicy()
        .retryOn(DataAccessException.class)
        .withMaxRetries(maxRetry)
        .withDelay(retryInterval, TimeUnit.SECONDS);

result = Failsafe.with(retryPolicy)
        .onFailure(throwable -> LOG.warn("Flush failure, will not retry. {} {}",
                throwable.getClass().getName(), throwable.getMessage()))
        .onRetry(throwable -> LOG.warn("Flush failure, will retry. {} {}",
                throwable.getClass().getName(), throwable.getMessage()))
        .get(cntx -> batch.execute());
We want to interrupt the storing, updating, inserting, and deleting of records by stopping the MSSQL database service in the backend. At some point, even though we get an org.jooq.exception.DataAccessException, some of the records in the batch (a subset of the batch) have already been loaded into the database.
Is there any way to find the failed and the successfully loaded records using the jOOQ API?
The jOOQ API cannot help you here out of the box, because such functionality is definitely out of scope for the relatively low-level jOOQ API, which helps you write type-safe embedded SQL. It does not make any assumptions about your business logic or infrastructure logic.
Ideally, you will run your own diagnostic here. For example, you already have a BATCHID column which should make it possible to detect which records were inserted/updated with which process. When you re-run the batch, you need to detect that you've already attempted this batch, remember the previous BATCHID, and fetch the IDs of the previous attempt to do whatever needs to be done prior to a re-run.
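As an illustration only (the table and column names below are hypothetical, not taken from the question), such a diagnostic might fetch the IDs already written under the previous BATCHID before re-running:

import java.util.List;
import org.jooq.DSLContext;
import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

class BatchDiagnostics {
    // Fetches the primary keys already persisted under a previous BATCHID so that a
    // re-run can skip or clean up those records. Table and column names are
    // hypothetical placeholders.
    static List<Long> idsLoadedInBatch(DSLContext ctx, long previousBatchId) {
        return ctx.select(field("id", Long.class))
                .from(table("my_table"))
                .where(field("batchid", Long.class).eq(previousBatchId))
                .fetchInto(Long.class);
    }
}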

kafka: Commit offsets failed with retriable exception. You should retry committing offsets

[o.a.k.c.c.i.ConsumerCoordinator] [Auto offset commit failed for group
consumer-group: Commit offsets failed with retriable
exception. You should retry committing offsets.] []
Why does this error occur in the Kafka consumer? What does it mean?
The consumer properties I am using are:
fetch.min.bytes:1
enable.auto.commit:true
auto.offset.reset:latest
auto.commit.interval.ms:5000
request.timeout.ms:300000
session.timeout.ms:20000
max.poll.interval.ms:600000
max.poll.records:500
max.partition.fetch.bytes:10485760
What is the reason for this error? I am guessing the consumer is currently doing duplicate work (polling the same messages again) because of it.
I am not using either consumer.commitAsync() or consumer.commitSync().
The consumer gives this error when it catches an instance of RetriableException.
The reasons for it can be various:
if coordinator is still loading the group metadata
if the group metadata topic has not been created yet
if network or disk corruption happens, or a miscellaneous disk-related or network-related IOException occurs when handling a request
if the server disconnected before a request could be completed
if the client's metadata is out of date
if there is no currently available leader for the given partition
if no brokers were available to complete a request
As you can see from the list above, all of these errors can be temporary issues, which is why it is suggested to retry the request.
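Since auto-commit retries are handled internally, one way to get more control (purely an illustration, assuming a plain Kafka consumer; topic, group, and bootstrap servers are placeholders) is to commit offsets manually: retry asynchronously during normal operation and commit synchronously on shutdown, since commitSync retries retriable errors internally.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;
import org.apache.kafka.common.serialization.StringDeserializer;

class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer-group");          // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            try {
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    // ... process records ...
                    // Asynchronous commit: retriable failures are reported to the callback
                    // and are simply superseded by the next commit attempt.
                    consumer.commitAsync((offsets, exception) -> {
                        if (exception != null) {
                            // log and continue; the next commit covers these offsets
                        }
                    });
                }
            } catch (WakeupException e) {
                // triggered by consumer.wakeup() from another thread on shutdown
            } finally {
                // Synchronous commit on shutdown: retries retriable errors internally
                // until it succeeds, times out, or hits a non-retriable error.
                consumer.commitSync();
            }
        }
    }
}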
