After enabling checkpoint - org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException - java

I have enabled checkpointing in Flink 1.12.1 programmatically as below:
int duration = 10;
if (!environment.getCheckpointConfig().isCheckpointingEnabled()) {
    // checkpoint every 60 s with exactly-once semantics, with at least 30 s pause between checkpoints
    environment.enableCheckpointing(duration * 6 * 1000, CheckpointingMode.EXACTLY_ONCE);
    environment.getCheckpointConfig().setMinPauseBetweenCheckpoints(duration * 3 * 1000);
}
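For the state to actually survive a pod restart, the checkpoints also have to be retained and used on restore. A minimal sketch of the extra settings that would typically go with the snippet above, assuming the Flink 1.12 DataStream API and the flink-statebackend-rocksdb dependency (values and the checkpoint path are illustrative):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;

// keep completed checkpoints when the job is cancelled, so a restarted pod can resume from them
environment.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// fail a single checkpoint after 10 minutes instead of letting it hang
environment.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);
try {
    // same backend and directory as in flink-conf.yaml; the path must be reachable from all TaskManagers
    environment.setStateBackend(new RocksDBStateBackend("file:///flink/"));
} catch (java.io.IOException e) {
    throw new IllegalStateException("Cannot initialise the RocksDB state backend", e);
}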
Flink Version: 1.12.1
configuration:
state.backend: rocksdb
state.checkpoints.dir: file:///flink/
blob.server.port: 6124
jobmanager.rpc.port: 6123
parallelism.default: 2
queryable-state.proxy.ports: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.rpc.port: 6122
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
jobmanager.web.address: 0.0.0.0
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
rest.port: 8081
taskmanager.data.port: 6121
classloader.resolve-order: parent-first
execution.checkpointing.unaligned: false
execution.checkpointing.max-concurrent-checkpoints: 2
execution.checkpointing.interval: 60000
But it is failing with the following error:
Caused by: org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException: Could not fulfill slot request 44ec308e34aa86629d2034a017b8ef91. Requested resource profile (ResourceProfile{UNKNOWN}) is unfulfillable.
If I remove/disable checkpointing, everything works normally. I need checkpointing because, if my pod gets restarted, the data that was being handled by the earlier run gets reset.
Can somebody advise how this can be addressed?

Related

How can I get notified when money has been sent to a particular Bitcoin address on a local regtest network?

I want to programmatically detect whenever someone sends Bitcoin to some address. This happens on a local testnet which I start using this docker-compose.yml file.
Once the local testnet runs, I create a new address using
docker exec -it minimal-crypto-exchange_node_1 bitcoin-cli getnewaddress
Let's say it returns 2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181.
I put this address into the following Java code:
import org.bitcoinj.core.Address;
import org.bitcoinj.core.Coin;
import org.bitcoinj.core.NetworkParameters;
import org.bitcoinj.core.Transaction;
import org.bitcoinj.wallet.Wallet;
import org.bitcoinj.wallet.listeners.WalletCoinsReceivedEventListener;

public class WalletObserver {
    public void init() {
        final NetworkParameters netParams = NetworkParameters.fromID(NetworkParameters.ID_REGTEST);
        try {
            final Wallet wallet = Wallet.createBasic(netParams);
            wallet.addWatchedAddress(Address.fromString(netParams, "2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181"));
            wallet.addCoinsReceivedEventListener(new WalletCoinsReceivedEventListener() {
                @Override
                public void onCoinsReceived(final Wallet wallet, final Transaction transaction, final Coin prevBalance, final Coin newBalance) {
                    System.out.println("Heyo!");
                }
            });
        }
        catch (Exception exception) {
            exception.printStackTrace();
        }
    }
}
Then I start the Java application with this class.
Then I send some test Bitcoin to the address in question:
% docker exec -it minimal-crypto-exchange_node_1 bitcoin-cli sendtoaddress 2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181 0.5
068c377bab961356ad9a3919229a764aa929711c68aefd5dbd4c7c348eef3406
If I go to http://localhost:3002/tx/068c377bab961356ad9a3919229a764aa929711c68aefd5dbd4c7c348eef3406, I see the transaction details.
However, the breakpoint in the listener (onCoinsReceived method) never activates.
How do I need to modify my code and/or the commands I use to send test BTC so that the onCoinsReceived method is called whenever money is received by that address? Is there a place where I can tell Wallet or NetworkParameters that I want to connect to localhost?
I am using version 0.15.10 of bitcoinj-core.
Update 1:
I modified docker-compose.yml and added the following port mappings:
ports:
- "51001:50001"
- "51002:50002"
- "19001:19001"
- "19000:19000"
- "28332:28332"
Then I rewrote the init method so that I can connect to localhost and specify the port:
public class WalletObserver {
    public void init() {
        final LocalTestNetParams netParams = new LocalTestNetParams();
        netParams.setPort(50001);
        try {
            final WalletAppKit kit = new WalletAppKit(netParams, new File("."), "_minimalCryptoExchangeBtcWallet");
            kit.setAutoSave(true);
            kit.connectToLocalHost();
            kit.startAsync();
            kit.awaitRunning(); // I never get past this point
            kit.peerGroup().addPeerDiscovery(new DnsDiscovery(netParams));
            kit.wallet().addWatchedAddress(Address.fromString(netParams, "2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181"));
            kit.wallet().addCoinsReceivedEventListener(new WalletCoinsReceivedEventListener() {
                @Override
                public void onCoinsReceived(final Wallet wallet, final Transaction transaction, final Coin prevBalance, final Coin newBalance) {
                    System.out.println("Heyo!");
                }
            });
        }
        catch (Exception exception) {
            exception.printStackTrace();
        }
    }
}
LocalTestNetParams allows specifying the port:
package com.dpisarenko.minimalcryptoexchange.logic.btc;

import org.bitcoinj.params.RegTestParams;

public class LocalTestNetParams extends RegTestParams {
    public void setPort(final int newPort) {
        this.port = newPort;
    }
}
I tried each of the aforementioned ports as the argument to netParams.setPort().
In all cases I get the following messages after kit.awaitRunning():
22:16:34.245 [PeerGroup Thread] INFO org.bitcoinj.core.PeerGroup - Attempting connection to [10.10.1.218]:50001 (0 connected, 1 pending, 1 max)
22:16:34.265 [NioClientManager] WARN org.bitcoinj.net.NioClientManager - Failed to connect with exception: java.net.ConnectException: Connection refused
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:820)
at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:64)
at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:51)
at java.base/java.lang.Thread.run(Thread.java:830)
22:16:34.267 [NioClientManager] INFO org.bitcoinj.core.PeerGroup - [10.10.1.218]:50001: Peer died (0 connected, 0 pending, 1 max)
22:16:34.267 [PeerGroup Thread] INFO org.bitcoinj.core.PeerGroup - Peer discovery took 21.84 μs and returned 0 items from 0 discoverers
22:16:34.269 [PeerGroup Thread] INFO org.bitcoinj.core.PeerGroup - Waiting 1502 ms before next connect attempt to [10.10.1.218]:50001
22:16:35.776 [PeerGroup Thread] INFO org.bitcoinj.core.PeerGroup - Attempting connection to [10.10.1.218]:50001 (0 connected, 1 pending, 1 max)
22:16:35.778 [NioClientManager] WARN org.bitcoinj.net.NioClientManager - Failed to connect with exception: java.net.ConnectException: Connection refused
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:820)
at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:64)
at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:51)
at java.base/java.lang.Thread.run(Thread.java:830)
22:16:35.778 [NioClientManager] INFO org.bitcoinj.core.PeerGroup - [10.10.1.218]:50001: Peer died (0 connected, 0 pending, 1 max)
22:16:35.779 [PeerGroup Thread] INFO org.bitcoinj.core.PeerGroup - Peer discovery took 8.752 μs and returned 0 items from 0 discoverers
10.10.1.218 seems to be generated by InetAddress.getLocalHost() in org.bitcoinj.kits.WalletAppKit#connectToLocalHost:
public WalletAppKit connectToLocalHost() {
    try {
        InetAddress localHost = InetAddress.getLocalHost();
        return this.setPeerNodes(new PeerAddress(this.params, localHost, this.params.getPort()));
    } catch (UnknownHostException var2) {
        throw new RuntimeException(var2);
    }
}
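If InetAddress.getLocalHost() resolves to a LAN address (here 10.10.1.218) that the container port is not published on, one hedged workaround is to skip connectToLocalHost() and point the kit at the loopback address explicitly; a sketch, assuming the mapped port is actually reachable on 127.0.0.1:

import java.net.InetAddress;
import org.bitcoinj.core.PeerAddress;

// instead of kit.connectToLocalHost(); must be set before kit.startAsync()
kit.setPeerNodes(new PeerAddress(netParams, InetAddress.getLoopbackAddress(), netParams.getPort()));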
Update 2:
I tried to use network_mode: "host".
If I add it to node as in
node:
image: ulamlabs/bitcoind-custom-regtest:latest
network_mode: "host"
I get the following error when I run docker-compose up -d:
minimal-crypto-exchange % docker-compose up -d
Creating network "minimal-crypto-exchange_default" with the default driver
Creating minimal-crypto-exchange_postgres_1 ... done
Creating minimal-crypto-exchange_geth_1 ...
Creating minimal-crypto-exchange_node_1 ... done
Creating minimal-crypto-exchange_electrumx_1 ...
Creating minimal-crypto-exchange_electrumx_1 ... error
ERROR: for minimal-crypto-exchange_electrumx_1 Cannot start service electrumx: driver fail
Creating minimal-crypto-exchange_geth_1 ... done
f68d0f25a0512399877bc55434513def810649e4fcf31a5a88ca3292d34): Error starting userland proxy: listen tcp4 0.0.0.0:28332: bind: address already in use
Creating minimal-crypto-exchange_blockscout_1 ... done
ERROR: for electrumx Cannot start service electrumx: driver failed programming external connectivity on endpoint minimal-crypto-exchange_electrumx_1 (8eaa4f68d0f25a0512399877bc55434513def810649e4fcf31a5a88ca3292d34): Error starting userland proxy: listen tcp4 0.0.0.0:28332: bind: address already in use
ERROR: Encountered errors while bringing up the project.
If I add it to the electrumx part as in
electrumx:
image: lukechilds/electrumx:latest
network_mode: "host"
I get another error:
minimal-crypto-exchange % docker-compose up -d
minimal-crypto-exchange_postgres_1 is up-to-date
minimal-crypto-exchange_geth_1 is up-to-date
Recreating minimal-crypto-exchange_node_1 ...
Recreating minimal-crypto-exchange_node_1 ... done
Recreating minimal-crypto-exchange_electrumx_1 ...
ERROR: for minimal-crypto-exchange_electrumx_1 "host" network_mode is incompatible with port_bindings
ERROR: for electrumx "host" network_mode is incompatible with port_bindings
Traceback (most recent call last):
File "docker-compose", line 3, in <module>
File "compose/cli/main.py", line 81, in main
File "compose/cli/main.py", line 203, in perform_command
File "compose/metrics/decorator.py", line 18, in wrapper
File "compose/cli/main.py", line 1186, in up
File "compose/cli/main.py", line 1166, in up
File "compose/project.py", line 697, in up
File "compose/parallel.py", line 108, in parallel_execute
File "compose/parallel.py", line 206, in producer
File "compose/project.py", line 679, in do
File "compose/service.py", line 579, in execute_convergence_plan
File "compose/service.py", line 499, in _execute_convergence_recreate
File "compose/parallel.py", line 108, in parallel_execute
File "compose/parallel.py", line 206, in producer
File "compose/service.py", line 494, in recreate
File "compose/service.py", line 612, in recreate_container
File "compose/service.py", line 330, in create_container
File "compose/service.py", line 939, in _get_container_create_options
File "compose/service.py", line 1014, in _get_container_host_config
File "docker/api/container.py", line 598, in create_host_config
File "docker/types/containers.py", line 338, in __init__
docker.errors.InvalidArgument: "host" network_mode is incompatible with port_bindings
[44262] Failed to execute script docker-compose
Update 3:
If I comment out port bindings as in
electrumx:
image: lukechilds/electrumx:latest
network_mode: host
links:
- node
# Port settings see https://github.com/ulamlabs/bitcoind-custom-regtest
# ports:
# - "51001:50001"
# - "51002:50002"
# - "19001:19001"
# - "19000:19000"
# - "28332:28332"
and run docker-compose up -d I get
% docker-compose up -d
Creating network "minimal-crypto-exchange_default" with the default driver
Creating minimal-crypto-exchange_geth_1 ...
Creating minimal-crypto-exchange_postgres_1 ... done
Creating minimal-crypto-exchange_node_1 ... done
Creating minimal-crypto-exchange_electrumx_1 ... error
Creating minimal-crypto-exchange_geth_1 ... done
ERROR: for minimal-crypto-exchange_electrumx_1 Cannot create container for service electrumx: conflicting options: host type networking can't be used with links. This would result in undefined behavior
Creating minimal-crypto-exchange_blockscout_1 ... done
ERROR: for electrumx Cannot create container for service electrumx: conflicting options: host type networking can't be used with links. This would result in undefined behavior
ERROR: Encountered errors while bringing up the project.
Update 4: I assume that the root of the error is that in my Java code I try to connect to the ElectrumX server instead of the actual Bitcoin node (node in docker-compose.yml).
Update 5:
I changed docker-compose.yml as follows:
node:
image: ulamlabs/bitcoind-custom-regtest:latest
# For ports used by node see
# https://github.com/ulamlabs/bitcoind-custom-regtest/blob/master/bitcoin.conf
ports:
- "19001:19001"
- "19000:19000"
- "28332:28332"
electrumx:
image: lukechilds/electrumx:latest
links:
- node
# Port settings see https://github.com/ulamlabs/bitcoind-custom-regtest
ports:
- "51001:50001"
- "51002:50002"
# - "19001:19001"
# - "19000:19000"
# - "28332:28332"
Now I am getting different errors (full log available here):
11:33:51.865 [NioClientManager] INFO org.bitcoinj.core.PeerGroup - [192.168.10.208]:19000: Peer died (0 connected, 0 pending, 1 max)
11:33:51.865 [NioClientManager] INFO org.bitcoinj.core.PeerGroup - Not yet setting download peer because there is no clear candidate.
11:33:51.865 [NioClientManager] DEBUG org.bitcoinj.core.BitcoinSerializer - Received 168 byte 'alert' message: 60010000000000000000000000ffffff7f00000000ffffff7ffeffff7f01ffffff7f00000000ffffff7f00ffffff7f002f555247454e543a20416c657274206b657920636f6d70726f6d697365642c2075706772616465207265717569726564004630440220653febd6410f470f6bae11cad19c48413becb1ac2c17f908fd0fd53bdc3abd5202206d0e9c96fe88d4a0f01ed9dedae2b6f9e00da94cad0fecaae66ecf689bf71b50
11:33:51.866 [PeerGroup Thread] INFO org.bitcoinj.core.PeerGroup - Waiting 999 ms before next connect attempt to [127.0.0.1]:19000
11:33:51.866 [NioClientManager] DEBUG org.bitcoinj.core.Peer - Received alert from peer Peer{[192.168.10.208]:19000, version=70015, subVer=/Satoshi:0.19.1(bitcore)/, services=1033 (NETWORK, WITNESS, NETWORK_LIMITED), time=2021-11-06 11:33:52, height=5}: URGENT: Alert key compromised, upgrade required
11:33:51.867 [NioClientManager] WARN org.bitcoinj.net.ConnectionHandler - Error handling SelectionKey: java.nio.channels.CancelledKeyException
java.nio.channels.CancelledKeyException: null
at java.base/sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:71)
at java.base/sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:130)
at java.base/java.nio.channels.SelectionKey.isWritable(SelectionKey.java:377)
at org.bitcoinj.net.ConnectionHandler.handleKey(ConnectionHandler.java:244)
at org.bitcoinj.net.NioClientManager.handleKey(NioClientManager.java:86)
at org.bitcoinj.net.NioClientManager.run(NioClientManager.java:122)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
at org.bitcoinj.utils.ContextPropagatingThreadFactory$1.run(ContextPropagatingThreadFactory.java:51)
at java.base/java.lang.Thread.run(Thread.java:830)
Update 6:
Someone suggested (in a now removed comment) that the application output contains this "Peer does not support bloom filtering" message:
11:32:43.482 [NioClientManager] INFO org.bitcoinj.core.Peer - Peer{[127.0.0.1]:19000, version=70015, subVer=/Satoshi:0.19.1(bitcore)/, services=1033 (NETWORK, WITNESS, NETWORK_LIMITED), time=2021-11-06 11:32:43, height=4}: Peer does not support bloom filtering.
So I tried to fork the original image and change the bitcoin.conf file to enable Bloom filtering:
peerbloomfilters=1
When I run docker build -t mentiflectax/bitcoind-custom-regtest:latest . I get the following error message (part of remaining output can be found here):
#13 922.4 g++: fatal error: Killed signal terminated program cc1plus
#13 922.4 compilation terminated.
#13 922.4 make[2]: *** [Makefile:8044: libbitcoin_server_a-init.o] Error 1
#13 922.4 make[2]: *** Waiting for unfinished jobs....
#13 965.8 make[2]: Leaving directory '/bitcoin-0.19.1/src'
#13 965.8 make[1]: *** [Makefile:13765: all-recursive] Error 1
#13 965.9 make[1]: Leaving directory '/bitcoin-0.19.1/src'
#13 965.9 make: *** [Makefile:776: all-recursive] Error 1
------
executor failed running [/bin/sh -c tar -xzf *.tar.gz && cd bitcoin-${BITCOIN_VERSION} && sed -i 's/consensus.nSubsidyHalvingInterval = 150/consensus.nSubsidyHalvingInterval = 210000/g' src/chainparams.cpp && ./autogen.sh && ./configure LDFLAGS=-L`ls -d /opt/db`/lib/ CPPFLAGS=-I`ls -d /opt/db`/include/ --prefix=/opt/bitcoin --disable-man --disable-tests --disable-bench --disable-ccache --with-gui=no --enable-util-cli --with-daemon && make -j4 && make install && strip /opt/bitcoin/bin/bitcoin-cli && strip /opt/bitcoin/bin/bitcoind]: exit code: 2
Update 7: The correct port seems to be 19000.
If I use port 19001, I get the following errors after kit.awaitRunning():
INFO org.bitcoinj.core.PeerSocketHandler - [127.0.0.1]:19001: Timed out
Full log output is available here.
I haven't tested your full setup with electrumx and the Ethereum stuff present in your docker-compose file, but regarding your problem, the following steps worked properly for me, and I think they will work in your complete setup as well.
I ran a Bitcoin node with Docker, based on the ulamlabs/bitcoind-custom-regtest:latest image you provided:
docker run -p 18444:19000 -d ulamlabs/bitcoind-custom-regtest:latest
As you can see, I exposed the image's internal port 19000 as the default port for RegTestParams, 18444. From the point of view of our client, with this setup it basically looks as if we were running the bitcoin daemon on the host. Using your LocalTestNetParams class and providing port 19000, as you indicated, should do the trick as well.
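To double-check which port the client expects by default, NetworkParameters exposes it; a tiny sketch:

import org.bitcoinj.params.RegTestParams;

// bitcoinj's regtest default; this is why mapping 19000 to 18444 lets connectToLocalHost() work unchanged
System.out.println(RegTestParams.get().getPort()); // 18444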
Then, according to the feedback you provided in the question, I manually edited the daemon configuration of the bitcoin node in /root/.bitcoin/bitcoin.conf using bash and vi:
docker exec -it 0aa2e863cd9927 bash
And included the following configuration:
peerbloomfilters=1
After restarting the container, I got a new address:
docker exec -it 0aa2e863cd9927 bitcoin-cli -regtest getnewaddress
Let's assume that the new address is the one you provided in the question:
2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181
Then, as suggested in the Bitcoin documentation, in order to avoid an insufficient funds error (coinbase outputs need 100 confirmations before they can be spent), I generated 101 blocks to this address:
docker exec -it 0aa2e863cd9927 bitcoin-cli -regtest generatetoaddress 101 2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181
I used generatetoaddress and not generate because the latter is no longer available since Bitcoin 0.19.0.
Next, I prepared a simple Java program, based on the information you provided and this example from the bitcoinj documentation:
import java.io.File;

import org.bitcoinj.core.Address;
import org.bitcoinj.core.NetworkParameters;
import org.bitcoinj.kits.WalletAppKit;
import org.bitcoinj.params.RegTestParams;

public class Kit {
    public static void main(String[] args) {
        Kit kit = new Kit();
        kit.run();
    }

    private synchronized void run() {
        NetworkParameters params = RegTestParams.get();
        WalletAppKit kit = new WalletAppKit(params, new File("."), "walletappkit-example");
        kit.connectToLocalHost();
        kit.startAsync();
        kit.awaitRunning();
        kit.wallet().addWatchedAddress(Address.fromString(params, "2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181"));
        kit.wallet().addCoinsReceivedEventListener((wallet, tx, prevBalance, newBalance) -> {
            System.out.println("-----> coins resceived: " + tx.getTxId());
        });
        while (true) {
            try {
                this.wait(2000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
I used a simple while loop to keep the program running; of course, it will probably be unnecessary in an actual setup, as it seems you are using Spring Boot.
Then, if you send some bitcoins to this address:
docker exec -it 0aa2e863cd9927 bitcoin-cli -regtest sendtoaddress 2N23tWAFEtBtTgxNjBNmnwzsiPdLcNek181 0.00001
0f972642713c72ae0fe03fe51818b9ea4d483720b69b90e795f35eb80a587c26
The listener should be invoked:
2021-11-09 23:51:20.537 INFO [NioClientManager][Wallet] Received a pending transaction 0f972642713c72ae0fe03fe51818b9ea4d483720b69b90e795f35eb80a587c26 that spends 0.00 BTC from our own wallet, and sends us 0.00001 BTC
2021-11-09 23:51:20.537 INFO [NioClientManager][Wallet] commitTx of 0f972642713c72ae0fe03fe51818b9ea4d483720b69b90e795f35eb80a587c26
...
2021-11-09 23:51:20.537 INFO [NioClientManager][Wallet] ->pending: 0f972642713c72ae0fe03fe51818b9ea4d483720b69b90e795f35eb80a587c26
2021-11-09 23:51:20.537 INFO [NioClientManager][Wallet] Estimated balance is now: 0.00001 BTC
-----> coins resceived: 0f972642713c72ae0fe03fe51818b9ea4d483720b69b90e795f35eb80a587c26
2021-11-09 23:51:20.538 INFO [NioClientManager][WalletFiles] Saving wallet; last seen block is height 165, date 2021-11-09T22:50:48Z, hash 23451521947bc5ff098c088ae0fc445becca8837d39ee8f6dd88f2c47ad5ac23
2021-11-09 23:51:20.543 INFO [NioClientManager][WalletFiles] Save completed in 4.736 ms
There is still a problem you mentioned that I haven't had the opportunity to test: creating a new Docker image in which the peerbloomfilters option is configured properly, without modifying the actual container state. I think the compilation problem you indicated could be related to this issue, basically that the container didn't have enough resources to perform the build. If you are using macOS and Docker for Mac, try tweaking the amount of memory available to your containers; it may be of help. A change in the base Alpine image could also cause the problem. I will try digging into the issue as well.

SCDF: Error handling when pod failed to start

I'm working on a service that calls Spring Cloud Data Flow (SCDF) to spin off a new k8s pod for a Spring Batch job.
Map<String, String> properties = Map.of("testApp.cpu", cpu, "testApp.memory", memory);
LOGGER.info("Create task '{}' with definition '{}'", taskName, taskDefinition);
taskOperations.create(taskName, taskDefinition);
LOGGER.info("Launching task '{}' with properties {} and arguments '{}'", taskName, properties, args);
return taskOperations.launch(taskName, properties, args);
Everything works fine. The problem is, whenever we pull a non-existent image (e.g. due to some connection issue), the pod fails to start AND we end up with pending tasks (with NO batch jobs created whatsoever).
For example, we will have tasks in the task_execution table (an SCDF table) with an empty end time,
but no related jobs in the batch_job_execution table.
It seems fine at first: since no pod is created, we don't consume any resources. But once the number of "pending jobs" reaches 20, we get the famous error:
Cannot launch task testApp. The maximum concurrent task executions is at its limit [20]
I'm trying to find a way to detect that the pod spin-off has failed (and hence we should mark the task as an error), but to no avail.
Is there a way to detect that the task launch has failed when that launch creates a new k8s pod?
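One way to detect it outside of SCDF itself, sketched under the assumption that the fabric8 kubernetes-client is on the classpath and that the launcher knows the pod name (this is a hypothetical helper, not an SCDF API):

import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public final class PodLaunchChecks {
    // Hypothetical helper: true if any container of the launched pod is stuck pulling its image.
    public static boolean imagePullFailed(String namespace, String podName) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
            if (pod == null || pod.getStatus() == null) {
                return false;
            }
            for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
                if (cs.getState() != null && cs.getState().getWaiting() != null) {
                    String reason = cs.getState().getWaiting().getReason();
                    if ("ErrImagePull".equals(reason) || "ImagePullBackOff".equals(reason)) {
                        return true; // mark the task execution as failed in our own bookkeeping
                    }
                }
            }
            return false;
        }
    }
}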
UPDATE
Not sure if it is relevant, we are using SCDF 1.7.3.RELEASE
Describe the failed pod:
Name: podname-lp2nyowgmm
Namespace: my-namespace
Priority: 1000
Priority Class Name: test-cluster-default
Node: some-ip.compute.internal/XX.XXX.XXX.XX
Start Time: Thu, 14 Jan 2021 18:47:52 +0700
Labels: role=spring-app
spring-app-id=podname-lp2nyowgmm
spring-deployment-id=podname-lp2nyowgmm
task-name=podname
Annotations: iam.amazonaws.com/role: arn:aws:iam::XXXXXXXXXXXX:role/svc-XXXX-XXX-XX-XXXX-X-XXX-XXX-XXXXXXXXXXXXXXXXXXXX
kubernetes.io/psp: eks.privileged
Status: Pending
IP: XX.XXX.XXX.XXX
IPs:
IP: XX.XXX.XXX.XXX
Containers:
podname-lp2nyowgmm:
Container ID:
Image: image_host:XXX/mysystem/myapp:notExist
Image ID:
Port: <none>
Host Port: <none>
Args:
--spring.datasource.username=postgres
--spring.cloud.task.name=podname
--spring.datasource.url=jdbc:postgresql://...
--spring.datasource.driverClassName=org.postgresql.Driver
--spring.datasource.password=XXXX
--fileId=XXXXXXXXXXX
--spring.application.name=app-name
--fileName=file_name.csv
...
--spring.cloud.task.executionid=3
State: Waiting
Reason: ErrImagePull
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 8Gi
Requests:
cpu: 2
memory: 8Gi
Environment:
ELASTIC_SEARCH_PORT: 80
ELASTIC_SEARCH_PROTOCOL: http
SPRING_RABBITMQ_PORT: ${RABBITMQ_SERVICE_PORT}
ELASTIC_SEARCH_URL: elasticsearch
SPRING_PROFILES_ACTIVE: kubernetes
CLIENT_SECRET: ${CLIENT_SECRET}
SPRING_RABBITMQ_HOST: ${RABBITMQ_SERVICE_HOST}
RELEASE_ENV_NAME: QA_TEST
SPRING_CLOUD_APPLICATION_GUID: ${HOSTNAME}
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx(ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-xxxxx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xxxxx
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m22s default-scheduler Successfully assigned my-namespace/podname-lp2nyowgmm to some-ip.compute.internal
Normal Pulling 103s (x4 over 3m21s) kubelet Pulling image "image_host:XXX/mysystem/myapp:notExist"
Warning Failed 102s (x4 over 3m19s) kubelet Failed to pull image "image_host:XXX/mysystem/myapp:notExist": rpc error: code = Unknown desc = Error response from daemon: manifest for image_host:XXX/mysystem/myapp:notExist not found: manifest unknown: manifest unknown
Warning Failed 102s (x4 over 3m19s) kubelet Error: ErrImagePull
Normal BackOff 88s (x6 over 3m19s) kubelet Back-off pulling image "image_host:XXX/mysystem/myapp:notExist"
Warning Failed 73s (x7 over 3m19s) kubelet Error: ImagePullBackOff
1.7.3 is a very old release. We just released 2.7. The original logic used the task execution tables instead of the pod status. If the version you are using is subject to that, then it would explain what you are seeing. I strongly recommend an upgrade.
Thanks for the question. Looking at the source code, we don't include Pending pods when calculating the current number of executing tasks, so it may be that something else is going on. 1) Could you run kubectl describe pod on a pod when it's in this state and post the result? (status details). 2) Is the deployer configured to create a job for each task? (false by default).

(lettuce) READONLY You can't write against a read only slave

I need some help. Our service uses Lettuce 5.1.6, and a total of 22 Docker nodes are deployed.
Whenever the service is deployed, several Docker nodes show the error: READONLY You can't write against a read only slave.
After restarting the problematic Docker nodes, the error no longer appears.
redis server configuration:
8 master 8 slave
stop-writes-on-bgsave-error no
slave-serve-stale-data yes
slave-read-only yes
cluster-enabled yes
cluster-config-file "/data/server/redis-cluster/{port}/conf/node.conf"
lettuce configuration:
ClientResources res = DefaultClientResources.builder()
        .commandLatencyPublisherOptions(
                DefaultEventPublisherOptions.builder()
                        .eventEmitInterval(Duration.ofSeconds(5))
                        .build()
        )
        .build();

redisClusterClient = RedisClusterClient.create(res, REDIS_CLUSTER_URI);
redisClusterClient.setOptions(
        ClusterClientOptions.builder()
                .maxRedirects(99)
                .socketOptions(SocketOptions.builder().keepAlive(true).build())
                .topologyRefreshOptions(
                        ClusterTopologyRefreshOptions.builder()
                                .enableAllAdaptiveRefreshTriggers()
                                .build())
                .build());

RedisAdvancedClusterCommands<String, String> command = redisClusterClient.connect().sync();
command.setex("some key", 18000, "some value");
The Exception that appears:
io.lettuce.core.RedisCommandExecutionException: READONLY You can't write against a read only slave.
at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135)
at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:122)
at io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(ClusterFutureSyncInvocationHandler.java:123)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy135.setex(Unknown Source)
at com.xueqiu.infra.redis4.RedisClusterImpl.lambda$setex$164(RedisClusterImpl.java:1489)
at com.xueqiu.infra.redis4.RedisClusterImpl$$Lambda$1422/1017847781.apply(Unknown Source)
at com.xueqiu.infra.redis4.RedisClusterImpl.execute(RedisClusterImpl.java:526)
at com.xueqiu.infra.redis4.RedisClusterImpl.executeTotal(RedisClusterImpl.java:491)
at com.xueqiu.infra.redis4.RedisClusterImpl.setex(RedisClusterImpl.java:1489)
With distributed middleware, the client side manages part of the partitioning and sharding relationships itself.
For Redis Cluster, Lettuce manages the slot-to-node mapping on the client:
it keeps an array, slotCache, which caches locally, for each slot, the node that owns it.
When a key needs to be read or written, the client computes the slot with CRC16 and looks the node up in this cache.
On the server side, each Redis Cluster node records the slot-to-node mapping in its local node.conf.
This metadata is broadcast through the ping/pong exchanges of the gossip protocol, so it eventually becomes consistent across the cluster.
However, if the slot mapping is wrong on the server side, the client will consume and cache that wrong data.
That is what happened here: some server nodes mapped slots to a slave, so the client cached a slot-to-slave mapping and sent read and write requests to the slave node, resulting in the error.
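For reference, the slot that the client computes for a given key can be checked directly with Lettuce's own helper; a small sketch (the key is just an example):

import io.lettuce.core.cluster.SlotHash;

// CRC16(key) mod 16384, i.e. the hash slot Lettuce uses to look the owning node up in slotCache
int slot = SlotHash.getSlot("some key");
System.out.println("slot for 'some key' = " + slot);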
Lettuce source code investigation
1. Lettuce initialization: Partitions.java
/**
 * Update the partition cache. Updates are necessary after the partition details have changed.
 */
public void updateCache() {
    synchronized (partitions) {
        if (partitions.isEmpty()) {
            this.slotCache = EMPTY;
            this.nodeReadView = Collections.emptyList();
            return;
        }
        RedisClusterNode[] slotCache = new RedisClusterNode[SlotHash.SLOT_COUNT];
        List<RedisClusterNode> readView = new ArrayList<>(partitions.size());
        for (RedisClusterNode partition : partitions) {
            readView.add(partition);
            for (Integer integer : partition.getSlots()) {
                slotCache[integer.intValue()] = partition;
            }
        }
        this.slotCache = slotCache;
        this.nodeReadView = Collections.unmodifiableCollection(readView);
    }
}
2. Lettuce command sending: PooledClusterConnectionProvider.java
private CompletableFuture<StatefulRedisConnection<K, V>> getWriteConnection(int slot) {
    CompletableFuture<StatefulRedisConnection<K, V>> writer; // avoid races when reconfiguring partitions.
    synchronized (stateLock) {
        writer = writers[slot];
    }
    if (writer == null) {
        RedisClusterNode partition = partitions.getPartitionBySlot(slot);
        if (partition == null) {
            clusterEventListener.onUncoveredSlot(slot);
            return Futures.failed(new PartitionSelectorException("Cannot determine a partition for slot " + slot + ".",
                    partitions.clone()));
        }
        // Use always host and port for slot-oriented operations. We don't want to get reconnected on a different
        // host because the nodeId can be handled by a different host.
        RedisURI uri = partition.getUri();
        ConnectionKey key = new ConnectionKey(Intent.WRITE, uri.getHost(), uri.getPort());
        ConnectionFuture<StatefulRedisConnection<K, V>> future = getConnectionAsync(key);
        return future.thenApply(connection -> {
            synchronized (stateLock) {
                if (writers[slot] == null) {
                    writers[slot] = CompletableFuture.completedFuture(connection);
                }
            }
            return connection;
        }).toCompletableFuture();
    }
    return writer;
}
The sending principle of Lettuce:
The client loads the cluster topology at startup and stores the slot-to-node mapping locally in the slotCache array.
When sending a command, it computes CRC16 of the key, uses the slot to look up the node in slotCache, and then obtains a connection to that node.
Note that basically all middleware in this kind of cluster mode works the same way: the client obtains the server's network topology and then computes the mapping logic locally (compare, for example, performance analyses of Kafka across data centers).
Redis cluster information troubleshooting
./bin/redis-cli -h 10.10.28.2 -p 25661 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size: 3
cluster_current_epoch:8
cluster_my_epoch:6
cluster_stats_messages_ping_sent:615483
cluster_stats_messages_pong_sent:610194
cluster_stats_messages_meet_sent:3
cluster_stats_messages_fail_sent:8
cluster_stats_messages_auth-req_sent:5
cluster_stats_messages_auth-ack_sent:2
cluster_stats_messages_update_sent:4
cluster_stats_messages_sent:1225699
cluster_stats_messages_ping_received:610188
cluster_stats_messages_pong_received:603593
cluster_stats_messages_meet_received:2
cluster_stats_messages_fail_received:4
cluster_stats_messages_auth-req_received:2
cluster_stats_messages_auth-ack_received:2
cluster_stats_messages_received:1213791
./bin/redis-cli -h 10.10.28.2 -p 25661 cluster nodes
5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662#35662 slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921769000 15 connected
79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651#35651 master - 0 1595921770000 18 connected 4096-6143
2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661#35661 myself,master - 0 1595921759000 15 connected 10240-12287
6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686#35686 master - 0 1595921769000 14 connected 12288-14335
5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656#35656 master - 0 1595921771000 13 connected 14336-16383
f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652#35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921769000 18 connected
f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666#35666 master - 0 1595921769870 16 connected 8192-10239
f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687#35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921763000 14 connected
f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667#35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921770870 16 connected
ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681#35681 master - 0 1595921773876 0 connected 0-2047
19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677#35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921768000 17 connected
d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671#35671 master - 0 1595921773000 6 connected 2048-4095
068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682#35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921771872 12 connected
e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672#35672 slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921770000 6 connected
f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657#35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921762000 13 connected
5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676#35676 master - 0 1595921772873 17 connected 6144-8191
./bin/redis-cli -h 10.10.28.3 -p 25662 cluster nodes
f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652#35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921741000 18 connected
f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666#35666 master - 0 1595921744000 16 connected 8192-10239
f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657#35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921740000 13 connected
5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676#35676 master - 0 1595921743127 17 connected 6144-8191
79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651#35651 master - 0 1595921743000 18 connected 4096-6143
2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661#35661 master - 0 1595921744129 15 connected 10240-12287
f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667#35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921740000 16 connected
f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687#35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921745130 14 connected
5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662#35662 myself,slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921733000 5 connected 0-1820
068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682#35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921744000 12 connected
d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671#35671 master - 0 1595921739000 6 connected 2048-4095
5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656#35656 master - 0 1595921742000 13 connected 14336-16383
ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681#35681 master - 0 1595921746131 0 connected 1821-2047
6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686#35686 master - 0 1595921747133 14 connected 12288-14335
19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677#35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921742126 17 connected
e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672#35672 slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921745000 6 connected
./bin/redis-cli -h 10.10.49.9 -p 25672 cluster nodes
d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671#35671 master - 0 1595921829000 6 connected 2048-4095
79cb673db12199c32737b959cd82ec9963106558 10.10.25.2:25651#35651 master - 0 1595921830000 18 connected 4096-6143
ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 10.10.25.3:25681#35681 master - 0 1595921830719 0 connected 0-1820
f54cfebc12c69725f471d16133e7ca3a8567dc18 10.10.28.15:25687#35687 slave 6a9ea568d6b49360afbb650c712bd7920403ba19 0 1595921827000 14 connected
5f677e012808b09c67316f6ac5bdf0ec005cd598 10.10.50.7:25676#35676 master - 0 1595921827000 17 connected 6144-8191
2281f330d771ee682221bc6c239afd68e6f20571 10.10.28.2:25661#35661 master - 0 1595921822000 15 connected 10240-12287
5e9d0c185a2ba2fc9564495730c874bea76c15fa 10.10.28.3:25662#35662 slave 2281f330d771ee682221bc6c239afd68e6f20571 0 1595921828714 15 connected
068e3bc73c27782c49782d30b66aa8b1140666ce 10.10.27.3:25682#35682 slave ff5f5a56a7866f32e84ec89482aabd9ca1f05e20 0 1595921832721 12 connected
6a9ea568d6b49360afbb650c712bd7920403ba19 10.10.28.3:25686#35686 master - 0 1595921825000 14 connected 12288-14335
f5148dba1127bd9bada8ecc39341a0b72ef25d8e 10.10.25.3:25652#35652 slave 79cb673db12199c32737b959cd82ec9963106558 0 1595921830000 18 connected
19c57214e4293b2e37d881534dcd55318fa96a70 10.10.50.16:25677#35677 slave 5f677e012808b09c67316f6ac5bdf0ec005cd598 0 1595921829716 17 connected
e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672#35672 myself,slave d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 0 1595921832000 4 connected 1821-2047
f09ad21effff245cae23c024a8a886f883634f5c 10.10.28.15:25667#35667 slave f6788b4829e601642ed4139548153830c430b932 0 1595921826711 16 connected
f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657#35657 slave 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 0 1595921829000 13 connected
f6788b4829e601642ed4139548153830c430b932 10.10.26.3:25666#35666 master - 0 1595921831720 16 connected 8192-10239
5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656#35656 master - 0 1595921827714 13 connected 14336-16383
./bin/redis-trib.rb check 10.10.30.9:25671
>>> Performing Cluster Check (using node 10.10.30.9:25671)
M: d8b4f99e0f9961f2e866b92e7351760faa3e0f2b 10.10.30.9:25671
slots:2048-4095 (2048 slots) master
1 additional replica(s)
S: e8b0311aeec4e3d285028abc377f0c277f9a5c74 10.10.49.9:25672
slots: (0 slots) slave
········
········
S: f03bc2ca91b3012f4612ecbc8c611c9f4a0e1305 10.10.27.3:25657
slots: (0 slots) slave
replicates 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757
M: 5a12dd423370e6f4085e593f9cd0b3a4ddfa9757 10.10.27.2:25656
slots:14336-16383 (2048 slots) master
1 additional replica(s)
[ERR] Nodes don't agree about configuration!
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Be suspicious of everything and stay vigilant; diligence makes up for weaknesses.
At first I assumed the cluster was healthy: most nodes behaved normally, only a few had problems, and restarting them fixed the issue.
Initially I inspected the cluster through the healthy nodes and found nothing; even though the logs showed that the topology information of a few nodes was inconsistent, it was hard to spot the problem without the client's mapping relationship.
Comparing the outputs above shows that some slots are mapped to slave nodes.
Running the check confirms there is a problem with the cluster, which is discussed in the related open-source issues.
I saw the same issue as you and tried to investigate it.
I figured out that it is caused by Lettuce.
When we run a Redis command, Lettuce analyzes it and decides which Redis endpoint to send the command to.
If it is a READ command, it will be sent to a slave node (by setting ReadFrom.Any_Rep). Please note that the other ReadFrom options may change the behavior.
If it is a WRITE command, it will be sent to the master node.
To determine which commands are READ commands, Lettuce uses the ReadOnlyCommands class to list all read commands.
In my case, I used the EVAL command to write a key/value to Redis, but Lettuce classified it as a READ command and sent it to a slave node => the exception happened.
So please check the ReadOnlyCommands class and make sure your write commands are not included there. This was a mistake by the Lettuce team and they have already fixed it in newer versions.
In your version, ReadOnlyCommands for the cluster setup is:
class ReadOnlyCommands {

    private static final Set<CommandType> READ_ONLY_COMMANDS = EnumSet.noneOf(CommandType.class);

    static {
        for (CommandName commandNames : CommandName.values()) {
            READ_ONLY_COMMANDS.add(CommandType.valueOf(commandNames.name()));
        }
    }

    /**
     * @param protocolKeyword must not be {@literal null}.
     * @return {@literal true} if {@link ProtocolKeyword} is a read-only command.
     */
    public static boolean isReadOnlyCommand(ProtocolKeyword protocolKeyword) {
        return READ_ONLY_COMMANDS.contains(protocolKeyword);
    }

    /**
     * @return an unmodifiable {@link Set} of {@link CommandType read-only} commands.
     */
    public static Set<CommandType> getReadOnlyCommands() {
        return Collections.unmodifiableSet(READ_ONLY_COMMANDS);
    }

    enum CommandName {
        ASKING, BITCOUNT, BITPOS, CLIENT, COMMAND, DUMP, ECHO, EVAL, EVALSHA, EXISTS, //
        GEODIST, GEOPOS, GEORADIUS, GEORADIUSBYMEMBER, GEOHASH, GET, GETBIT, //
        GETRANGE, HEXISTS, HGET, HGETALL, HKEYS, HLEN, HMGET, HSCAN, HSTRLEN, //
        HVALS, INFO, KEYS, LINDEX, LLEN, LRANGE, MGET, PFCOUNT, PTTL, //
        RANDOMKEY, READWRITE, SCAN, SCARD, SCRIPT, //
        SDIFF, SINTER, SISMEMBER, SMEMBERS, SRANDMEMBER, SSCAN, STRLEN, //
        SUNION, TIME, TTL, TYPE, ZCARD, ZCOUNT, ZLEXCOUNT, ZRANGE, //
        ZRANGEBYLEX, ZRANGEBYSCORE, ZRANK, ZREVRANGE, ZREVRANGEBYLEX, ZREVRANGEBYSCORE, ZREVRANK, ZSCAN, ZSCORE, //
        // Pub/Sub commands are no key-space commands so they are safe to execute on slave nodes
        PUBLISH, PUBSUB, PSUBSCRIBE, PUNSUBSCRIBE, SUBSCRIBE, UNSUBSCRIBE
    }
}
So you can check easily.
Solution: upgrading your Lettuce version is the best way to go. Alternatively, you can try to override this behavior.
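Two client-side mitigations worth trying, sketched here under the assumption that you stay on the 5.1.x API (not verified against your exact deployment): refresh the topology periodically so a stale slot-to-slave mapping heals itself, and keep ReadFrom at MASTER so no command is ever routed to a replica:

import java.time.Duration;
import io.lettuce.core.ReadFrom;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

// refresh the slot cache on a fixed schedule, in addition to the adaptive triggers
redisClusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(ClusterTopologyRefreshOptions.builder()
                .enableAllAdaptiveRefreshTriggers()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .build())
        .build());

// only relevant if ReadFrom was changed from its default; MASTER keeps every command on master nodes
StatefulRedisClusterConnection<String, String> connection = redisClusterClient.connect();
connection.setReadFrom(ReadFrom.MASTER);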

Elastic BeanStalk loading configuration from another region failed

I have uploaded a saved configuration file from a Beanstalk application in one region to another Beanstalk application in another region.
While loading that config I got an error
Stack named 'awseb-e-sme7w3eym3-stack' aborted operation. Current state: 'CREATE_FAILED' Reason: The following resource(s) failed to create: [AWSEBLoadBalancer]
Creating load balancer failed Reason: Property Listeners cannot be empty
Any idea about this issue?
See the config file:
AWSConfigurationTemplateVersion: 1.1.0.0
EnvironmentConfigurationMetadata:
DateCreated: '1580272974000'
DateModified: '1580273310143'
Description: xxxxxxxxxxxxxxxxxxxxx
EnvironmentTier:
Name: WebServer
Type: Standard
OptionSettings:
AWSEBAutoScalingGroup.aws:autoscaling:updatepolicy:rollingupdate:
MaxBatchSize: '1'
MinInstancesInService: '1'
RollingUpdateEnabled: true
RollingUpdateType: Health
AWSEBAutoScalingLaunchConfiguration.aws:autoscaling:launchconfiguration:
EC2KeyName: xxxxxxxxxxxxxxxxxxx
AWSEBCloudwatchAlarmHigh.aws:autoscaling:trigger:
UpperThreshold: '60'
AWSEBCloudwatchAlarmLow.aws:autoscaling:trigger:
BreachDuration: '2'
LowerThreshold: '25'
MeasureName: CPUUtilization
Period: '1'
Statistic: Maximum
Unit: Percent
AWSEBLoadBalancerSecurityGroup.aws:ec2:vpc:
VPCId: vpc-xxxxxxxxxxxxxxxx
AWSEBV2LoadBalancerListener.aws:elbv2:listener:default:
ListenerEnabled: false
AWSEBV2LoadBalancerListener443.aws:elbv2:listener:443:
SSLCertificateArns: arn:aws:acm:us-east-2:xxxxxxxxxxx:certificate/xxxxxxx-xxxxx-xxxx-xxxx-xxxxxxxxxxxx
AWSEBV2LoadBalancerTargetGroup.aws:elasticbeanstalk:environment:process:default:
HealthCheckPath: /rest/account/ping
MatcherHTTPCode: '200'
Port: '80'
Protocol: HTTP
aws:autoscaling:launchconfiguration:
IamInstanceProfile: aws-elasticbeanstalk-ec2-role
SecurityGroups:
- sg-xxxxxxxxxxxxx
aws:ec2:instances:
InstanceTypes: t2.small
aws:ec2:vpc:
ELBSubnets: subnet-xxxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxx
Subnets: subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx,subnet-xxxxxxxxxxxxxxxxx
aws:elasticbeanstalk:application:environment:
JDBC_CONNECTION_STRING: jdbc:mysql://xxxxxxxxxxxxxxxxxxxxxxxxxxxx?user=xxxxxxxx&password=xxxxxxxxxxx&rewriteBatchedStatements=true&characterEncoding=UTF-8
aws.accessKeyId: xxxxxxxxxxxxxxxxxx
aws.secretKey: xxxxxxxxxxxxxxxxxxxx
com.aws.secretManger.secret.name: xxxxxxxxxxxxxxx
com.aws.secretManger.secret.region: us-east-2
com.decsond.loggly.token: xxxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx#xxxxx
com.decsond.metakey: xxxxxxxxxxxxxxxxx/XXX==
com.decsond.mode: debug
com.decsond.server.db.environment: aws
com.decsond.server.dpBinaryColumn: xxxxxxxxxxxx
com.decsond.server.environment: xxxxxxxxxx
com.decsond.server.type: pms
aws:elasticbeanstalk:container:tomcat:jvmoptions:
JVM Options: -XX:+CMSClassUnloadingEnabled -Dmvel.disable.jit=true -Ddrools.permgenThreshold=0
Xms: 512m
Xmx: 1024m
aws:elasticbeanstalk:environment:
LoadBalancerType: application
ServiceRole: arn:aws:iam::xxxxxxxxxxxxxx:role/aws-elasticbeanstalk-service-role
aws:elasticbeanstalk:healthreporting:system:
SystemType: enhanced
aws:elasticbeanstalk:managedactions:
ManagedActionsEnabled: true
PreferredStartTime: SAT:03:01
aws:elasticbeanstalk:managedactions:platformupdate:
InstanceRefreshEnabled: true
UpdateLevel: minor
aws:elasticbeanstalk:xray:
XRayEnabled: true
aws:elbv2:listener:443:
DefaultProcess: default
ListenerEnabled: true
Protocol: HTTPS
Rules: ''
SSLPolicy: ELBSecurityPolicy-2016-08
Platform:
PlatformArn: arn:aws:elasticbeanstalk:us-east-2::platform/Tomcat 8.5 with Java 8 running on 64bit Amazon Linux/3.3.1
Any idea about the issue?
The most likely reason is that you are referencing objects from the region where the config was saved.
Is this the first EB application / environment in the new region?
If it is, it's worth first creating a test application and environment using the features you want; that will give EB a chance to create all the region-specific behind-the-scenes magic it relies on.

Sonar stops working every few days: jdbc connection error

I have a Sonar server which is used once a day from Maven/Jenkins, and every few days, say every 4 or 5 days, it crashes and shows the message "We're sorry, but something went wrong".
In the log, the error is always about a JDBC connection problem. I thought it was a problem with the database, but if I restart the Sonar server everything works fine again.
So it looks like a memory leak or something in the Sonar server that makes it crash every few days until someone restarts it. Does that make sense? This is the configuration I have:
sonar.jdbc.username: xxxx
sonar.jdbc.password: xxxx
sonar.jdbc.url: jdbc:mysql://x.x.x.x:3306/sonar?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true
#----- Connection pool settings
sonar.jdbc.maxActive: 20
sonar.jdbc.maxIdle: 5
sonar.jdbc.minIdle: 2
sonar.jdbc.maxWait: 5000
sonar.jdbc.minEvictableIdleTimeMillis: 600000
sonar.jdbc.timeBetweenEvictionRunsMillis: 30000
sonar.updatecenter.activate=true
http.proxyHost=xxxx
http.proxyPort=3128
sonar.notifications.delay=60
That's it. And this is the error log:
INFO o.s.s.p.ServerImpl SonarQube Server / 3.7.3 /
INFO o.s.c.p.Database Create JDBC datasource for jdbc:mysql://x.x.x.x:3306/sonar?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true
ERROR o.s.c.p.Database Can not connect to database.
Please check connectivity and settings (see the properties prefixed by 'sonar.jdbc.').
org.apache.commons.dbcp.SQLNestedException:
Cannot create PoolableConnectionFactory (Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.)
.
.
.
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
.
.
.
Caused by: java.net.ConnectException: Connection refused
.
.
.
INFO jruby.rack An exception happened during JRuby-Rack startup
no connection available
--- System
jruby 1.6.8 (ruby-1.8.7-p357) (2012-09-18 1772b40) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_43) [linux-amd64-java]
Time: Thu Jan 02 08:04:08 -0500 2014
Server: jetty/7.6.11.v20130520
jruby.home: file:/opt/sonar/war/sonar-server/WEB-INF/lib/jruby-complete-1.6.8.jar!/META-INF/jruby.home
--- Context Init Parameters:
jruby.compat.version = 1.8
jruby.max.runtimes = 1
jruby.min.runtimes = 1
jruby.rack.logging = slf4j
public.root = /
rails.env = production
--- Backtrace
ActiveRecord::ConnectionNotEstablished: no connection available
set_native_database_types at arjdbc/jdbc/RubyJdbcConnection.java:517
initialize at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-jdbc-adapter-1.1.3/lib/arjdbc/jdbc/connection.rb:61
initialize at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-jdbc-adapter-1.1.3/lib/arjdbc/jdbc/adapter.rb:31
jdbc_connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-jdbc-adapter-1.1.3/lib/arjdbc/jdbc/connection_methods.rb:6
send at org/jruby/RubyKernel.java:2109
new_connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:223
checkout_new_connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:245
checkout at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:188
loop at org/jruby/RubyKernel.java:1439
checkout at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:184
mon_synchronize at file:/opt/sonar/war/sonar-server/WEB-INF/lib/jruby-complete-1.6.8.jar!/META-INF/jruby.home/lib/ruby/1.8/monitor.rb:191
checkout at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:183
connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:98
retrieve_connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_pool.rb:326
retrieve_connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_specification.rb:123
connection at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/connection_adapters/abstract/connection_specification.rb:115
initialize at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/migration.rb:440
up at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/migration.rb:401
migrate at /opt/sonar/war/sonar-server/WEB-INF/gems/gems/activerecord-2.3.15/lib/active_record/migration.rb:383
upgrade_and_start at /opt/sonar/war/sonar-server/WEB-INF/lib/database_version.rb:62
automatic_setup at /opt/sonar/war/sonar-server/WEB-INF/lib/database_version.rb:74
(root) at /opt/sonar/war/sonar-server/WEB-INF/config/environment.rb:213
load at org/jruby/RubyKernel.java:1087
load_environment at /opt/sonar/war/sonar-server/WEB-INF/config/environment.rb:23
load_environment at file:/opt/sonar/war/sonar-server/WEB-INF/lib/jruby-rack-1.1.10.jar!/jruby/rack/rails_booter.rb:65
(root) at <script>:1
--- RubyGems
Gem.dir: /opt/sonar/war/sonar-server/WEB-INF/gems
Gem.path:
/opt/sonar/war/sonar-server/WEB-INF/gems
Activated gems:
rake-0.9.2.2
activesupport-2.3.15
activerecord-2.3.15
rack-1.1.3
actionpack-2.3.15
actionmailer-2.3.15
activeresource-2.3.15
rails-2.3.15
color-tools-1.3.0
i18n-0.4.2
json-jruby-1.2.3-universal-java-1.6
activerecord-jdbc-adapter-1.1.3
fastercsv-1.4.0
--- Bundler
undefined method `bundle_path' for Bundler:Module
--- JRuby-Rack Config
compat_version = RUBY1_8
default_logger = org.jruby.rack.logging.StandardOutLogger#4fbbe4e1
err = java.io.PrintStream#d2284af
filter_adds_html = true
filter_verifies_resource = false
ignore_environment = false
initial_memory_buffer_size =
initial_runtimes = 1
jms_connection_factory =
jms_jndi_properties =
logger = org.jruby.rack.logging.Slf4jLogger#566dc8f0
logger_class_name = slf4j
logger_name = jruby.rack
maximum_memory_buffer_size =
maximum_runtimes = 1
num_initializer_threads =
out = java.io.PrintStream#6aeeefcf
rackup =
rackup_path =
rewindable = true
runtime_arguments =
runtime_timeout_seconds =
serial_initialization = false
servlet_context = ServletContext#o.e.j.w.WebAppContext{/,file:/opt/sonar/war/sonar-server/},file:/opt/sonar/war/sonar-server
ERROR jruby.rack unable to create shared application instance
org.jruby.rack.RackInitializationException: no connection available
.
.
.
org.jruby.exceptions.RaiseException:
(ConnectionNotEstablished) no connection available
.
.
.
ERROR jruby.rack Error: application initialization failed
org.jruby.rack.RackInitializationException: no connection available
.
.
.
org.jruby.exceptions.RaiseException:
(ConnectionNotEstablished) no connection available
Any help will be appreciated :)
This sounds familiar to me :P Try asking your operators whether they have some automated 'cleansing' job that periodically kills open database connections in order to prevent leaked connections to the database.
It happened to me with a Windows Server 2012 machine and a SQL Server 2012 database on a different server. It seems that the Sonar service keeps a connection open, created at startup time, so any disconnection (networking, database restart, etc.) causes this unrecoverable connection problem. Restarting the Sonar (SonarQube) Windows service solved the problem for me. But if this problem is frequent, as in your case, it would be a good idea to schedule a service restart task, or to find out what is taking the connection down.
