What does this exception mean?
I am trying to deploy a Flink cluster (v1.5.2) with 3 nodes in HA mode (ZooKeeper).
I have the following flink-conf.yaml settings:
high-availability: zookeeper
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: {node1_ip}:2181,{node2_ip}:2181,{node3_ip}:2181
high-availability.jobmanager.port: 50010
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.namespace: /default_ns
The ZooKeeper cluster is running.
After start-cluster.sh is executed, I have only one working node. The other 2 nodes return
{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]} from the web UI
and the following exception in flink-root-standalonesession-.log:
2018-08-07 18:55:22,081 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(aec7a76447f8d44131605f5c10fb4fdc, LocalRpc
...
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(aec7a76447f8d44131605f5c10fb4fdc, LocalRpcInvocation(requestRestAddress(T
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:59)
I'm launching 3 Elasticsearch nodes using the Elastic operator, and I tried to set up automated snapshots for these instances.
I followed this doc.
I minified the JSON of the service account key, created a file called gcs.client.default.credentials_file (with no file extension), and added this file to a Kubernetes secret.
I then added the secureSettings.secretName field to the spec of the Elasticsearch cluster, pointing it at the secret name, which was gcs-credentials.
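For reference, a minimal sketch of the setup described above; the resource names are assumptions based on the description and the cluster name in the log below, and the secret key must be named exactly gcs.client.default.credentials_file:
apiVersion: v1
kind: Secret
metadata:
  name: gcs-credentials
stringData:
  gcs.client.default.credentials_file: |
    { ... minified service account JSON ... }
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-cluster
spec:
  secureSettings:
    - secretName: gcs-credentials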
But I get this error in the logs:
{"#timestamp":"2022-12-26T18:45:40.037Z", "log.level":"ERROR", "message":"fatal exception while booting Elasticsearch", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.bootstrap.Elasticsearch","elasticsearch.node.name":"elasticsearch-cluster-es-node-1","elasticsearch.cluster.name":"elasticsearch-cluster","error.type":"java.lang.IllegalStateException","error.message":"failed to load plugin class [org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin]","error.stack_trace":"java.lang.IllegalStateException: failed to load plugin class [org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin]\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.plugins.PluginsService.loadPlugin(PluginsService.java:607)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.plugins.PluginsService.loadBundle(PluginsService.java:482)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:290)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:159)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.plugins.PluginsService.lambda$getPluginsServiceCtor$14(PluginsService.java:634)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.node.Node.<init>(Node.java:406)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.node.Node.<init>(Node.java:316)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:214)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:214)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:67)\nCaused by: java.lang.reflect.InvocationTargetException\n\tat java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:79)\n\tat java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)\n\tat java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:484)\n\tat org.elasticsearch.server#8.5.0/org.elasticsearch.plugins.PluginsService.loadPlugin(PluginsService.java:600)\n\t... 9 more\nCaused by: java.lang.IllegalArgumentException: failed to load GCS client credentials from [gcs.client.default.credentials_file]\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageClientSettings.loadCredential(GoogleCloudStorageClientSettings.java:265)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageClientSettings.getClientSettings(GoogleCloudStorageClientSettings.java:221)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageClientSettings.load(GoogleCloudStorageClientSettings.java:209)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin.reload(GoogleCloudStoragePlugin.java:88)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin.<init>(GoogleCloudStoragePlugin.java:36)\n\tat java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:67)\n\t... 
12 more\nCaused by: java.io.IOException: Invalid PKCS#8 data.\n\tat com.google.auth.oauth2.ServiceAccountCredentials.privateKeyFromPkcs8(ServiceAccountCredentials.java:496)\n\tat com.google.auth.oauth2.ServiceAccountCredentials.fromPkcs8(ServiceAccountCredentials.java:474)\n\tat com.google.auth.oauth2.ServiceAccountCredentials.fromJson(ServiceAccountCredentials.java:212)\n\tat com.google.auth.oauth2.ServiceAccountCredentials.fromStream(ServiceAccountCredentials.java:548)\n\tat com.google.auth.oauth2.ServiceAccountCredentials.fromStream(ServiceAccountCredentials.java:520)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageClientSettings.lambda$loadCredential$13(GoogleCloudStorageClientSettings.java:257)\n\tat java.base/java.security.AccessController.doPrivileged(AccessController.java:569)\n\tat org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:33)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageClientSettings.loadCredential(GoogleCloudStorageClientSettings.java:256)\n\t... 17 more\n"}
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/elasticsearch-cluster.log
Try adding the following lines to your configuration (on each Elasticsearch node):
elasticsearch01:
  image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
  ...
  ulimits:
    memlock:
      soft: -1
      hard: -1
Also check this link on Elasticsearch for more detailed information.
I am migrating FusionAuth search from a self-hosted Elasticsearch instance to an AWS Elasticsearch Service solution.
I have a new FusionAuth app EC2 instance, reading from the in-use database, that is configured to use the new Elasticsearch service.
On triggering a reindex from the new app instance, I see that only around 60k to 62.5k documents are written to the new index, when I am expecting roughly 6 million.
I see no errors from AWS's Elasticsearch Service, and in the app's logs I can see the following (endpoint intentionally omitted):
Feb 13, 2020 10:18:46.116 AM INFO io.fusionauth.api.service.search.ElasticSearchClientProvider - Connecting to FusionAuth Search Engine at [https://vpc-<<omitted>>.eu-west-1.es.amazonaws.com]
13-Feb-2020 11:19:55.176 INFO [http-nio-9011-exec-3] org.apache.coyote.http11.Http11Processor.service Error parsing HTTP request header
Note: further occurrences of HTTP header parsing errors will be logged at DEBUG level.
java.lang.IllegalArgumentException: Invalid character found in method name. HTTP method names must be tokens
at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:430)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:684)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:808)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
"/usr/local/fusionauth/logs/fusionauth-app.log" [readonly] 43708L, 4308629C 42183,1 96%
at io.fusionauth.api.service.search.client.domain.documents.IndexUser.<init>(IndexUser.java:79)
at io.fusionauth.api.service.search.ElasticsearchSearchEngine.lambda$index$1(ElasticsearchSearchEngine.java:140)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.fusionauth.api.service.search.ElasticsearchSearchEngine.index(ElasticsearchSearchEngine.java:140)
at io.fusionauth.api.service.user.ReindexRunner$ReindexWorker.run(ReindexRunner.java:101)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-14" java.lang.NullPointerException
at io.fusionauth.api.service.search.client.domain.documents.IndexUser.<init>(IndexUser.java:79)
at io.fusionauth.api.service.search.ElasticsearchSearchEngine.lambda$index$1(ElasticsearchSearchEngine.java:140)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.fusionauth.api.service.search.ElasticsearchSearchEngine.index(ElasticsearchSearchEngine.java:140)
at io.fusionauth.api.service.user.ReindexRunner$ReindexWorker.run(ReindexRunner.java:101)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-13" java.lang.NullPointerException
Exception in thread "Thread-11" java.lang.NullPointerException
Exception in thread "Thread-12" java.lang.NullPointerException
Feb 18, 2020 10:23:29.064 AM INFO io.fusionauth.api.service.user.ReindexRunner - Reindex completed in [86797] ms or [86] seconds.
Although there are some exceptions, there is also a "Reindex completed" INFO log at the end.
Not knowing the ins and outs of Elasticsearch, I'm also not sure where to start investigating a NullPointerException.
It looks like the re-index operation is hitting an exception, which is likely the cause of the truncated index.
Exception in thread "Thread-14" java.lang.NullPointerException
at io.fusionauth.api.service.search.client.domain.documents.IndexUser.<init>(IndexUser.java:79)
This code assumes that you have a username or an email address; this should be enforced by the FusionAuth APIs. But for this exception to occur, the email and username must both be NULL.
How did you get users into the db: using the Import API, the User API, or something else?
In theory you should find at least one user with a NULL value for the email and username.
This query - or one like it - should find the offending user; then we need to identify how this user was added to FusionAuth.
SELECT email, username FROM identities WHERE email IS NULL OR username IS NULL;
We are running an Ignite cluster with 3 nodes which pre-loads data from a 3rd-party database (using a custom cache store). When we connect to the cluster using the Java thin client and a request reaches the cluster before the data loading has completed, we get an unknown pair exception and some unstable behavior.
Is there any way we can block the client requests (TCP socket connections) until the data loading has completed?
I tried different lifecycle events (NODE_START_COMPLETED), but no luck.
Stack trace:
Caused by: org.apache.ignite.binary.BinaryInvalidTypeException: Unknown pair [platformId=0, typeId=-845247802]
at org.apache.ignite.internal.binary.BinaryContext.descriptorForTypeId(BinaryContext.java:707)
at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1757)
at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1716)
at org.apache.ignite.internal.binary.BinaryObjectImpl.deserializeValue(BinaryObjectImpl.java:798)
at org.apache.ignite.internal.binary.BinaryObjectImpl.value(BinaryObjectImpl.java:143)
at org.apache.ignite.internal.processors.cache.CacheObjectUtils.unwrapBinary(CacheObjectUtils.java:177)
at org.apache.ignite.internal.processors.cache.CacheObjectUtils.unwrapBinaryIfNeeded(CacheObjectUtils.java:67)
at org.apache.ignite.internal.processors.cache.CacheObjectContext.unwrapBinaryIfNeeded(CacheObjectContext.java:125)
at org.apache.ignite.internal.processors.cache.GridCacheContext.unwrapBinaryIfNeeded(GridCacheContext.java:1773)
at org.apache.ignite.internal.processors.cache.GridCacheContext.unwrapBinaryIfNeeded(GridCacheContext.java:1761)
at org.apache.ignite.internal.processors.cache.store.GridCacheStoreManagerAdapter.put(GridCacheStoreManagerAdapter.java:573)
at org.apache.ignite.internal.processors.cache.store.GridCacheStoreManagerAdapter.putAll(GridCacheStoreManagerAdapter.java:627)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxAdapter.batchStoreCommit(IgniteTxAdapter.java:1507)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.userCommit(IgniteTxLocalAdapter.java:589)
at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.localFinish(GridNearTxLocal.java:3646)
at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxFinishFuture.doFinish(GridNearTxFinishFuture.java:475)
... 41 common frames omitted
Caused by: java.lang.ClassNotFoundException: Unknown pair [platformId=0, typeId=-845247802]
at org.apache.ignite.internal.MarshallerContextImpl.getClassName(MarshallerContextImpl.java:394)
at org.apache.ignite.internal.MarshallerContextImpl.getClass(MarshallerContextImpl.java:344)
at org.apache.ignite.internal.binary.BinaryContext.descriptorForTypeId(BinaryContext.java:698)
... 56 common frames omitted
There is currently no way to forbid thin clients from connecting to a cluster using the Ignite API. I created a JIRA ticket for this improvement: https://issues.apache.org/jira/browse/IGNITE-12237
The unknown pair exception doesn't seem to be caused by thin clients connecting at the wrong time, though. Usually it's caused by a missing marshaller directory in the work path.
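Since the server side cannot yet reject thin-client connections, one client-side workaround is to poll a readiness marker before issuing real requests. This is a sketch only, not a supported API: the preload-status cache, the done key, and the address are assumptions, and the server-side loader would have to put the marker once the cache store finishes loading.
import org.apache.ignite.Ignition;
import org.apache.ignite.client.ClientCache;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class WaitForPreload {
    public static void main(String[] args) throws Exception {
        ClientConfiguration cfg = new ClientConfiguration().setAddresses("node1:10800"); // placeholder address
        try (IgniteClient client = Ignition.startClient(cfg)) {
            // Hypothetical marker: the loader node puts ("done", true) into this
            // cache after the 3rd-party data load completes.
            ClientCache<String, Boolean> status = client.getOrCreateCache("preload-status");
            while (!Boolean.TRUE.equals(status.get("done"))) {
                Thread.sleep(1000L); // poll until preloading is reported finished
            }
            // ... safe to run real cache operations from here
        }
    }
}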
Flink 1.5.3: when I submit a Flink job to the Flink cluster (on YARN), it always throws an AskTimeoutException. In the Flink configuration file I have configured the parameter "akka.ask.timeout: 1000s", but the exception still looks like the one below.
That means I have already increased the timeout parameter, "akka.ask.timeout: 1000s", but it does not work.
org.apache.flink.runtime.rest.handler.RestHandlerException: Job submission failed.
at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$2(JobSubmitHandler.java:116)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-1851759541]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
... 21 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-1851759541]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
... 9 more
So is there any solution to avoid this issue?
The timeout for the communication between the REST handlers and the Flink cluster is controlled by web.timeout. This timeout is specified in milliseconds; thus, you would need to set web.timeout: 1000000 in your flink-conf.yaml if you want to wait 1000s.
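For example, the corresponding flink-conf.yaml entry for a 1000s wait would be:
web.timeout: 1000000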
Moreover, it would be good to check the cluster entrypoint logs to see why the job submission takes so long; usually it should not take longer than 10 seconds.
Using Java, I am sending a JSON object to Kafka. It initially worked for 2 days; now I am getting the following exception:
Exception in thread "main" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:65)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:52)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25)
at dummy.DummySyntheticManifestProducer.main(DummySyntheticManifestProducer.java:164)
Caused by: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.
Try setting the retries property in the producer settings.
You also need to set the property max.in.flight.requests.per.connection to 1, so that retries cannot reorder records.
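A minimal sketch of those producer settings; the broker address, retry count, and serializer choice are placeholders, not taken from the question:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerRetryConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Retry transient errors such as UnknownTopicOrPartitionException.
        props.put(ProducerConfig.RETRIES_CONFIG, 5);
        // Keep record ordering when a retry happens.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... producer.send(...) as before
        }
    }
}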