Kafka cluster zookeeper failure handling - java

I am going to implement a Kafka cluster consisting of 3 machines, one for ZooKeeper and the other 2 as brokers. I have about 6 consumer machines and about a hundred producers.
Now if one of the brokers fails, data loss is avoided thanks to the replication feature. But what if ZooKeeper fails and the same machine cannot be started again? I have several questions:
I noticed that even after the ZooKeeper failure, producers continued to push messages to the designated brokers, but the messages could no longer be retrieved by consumers, because the consumers got unregistered. So in this case, is the data lost permanently?
How do I change the ZooKeeper IP in the broker configuration at run time? Will the brokers have to be shut down to change the ZooKeeper IP?
Even if a new ZooKeeper machine is somehow brought into the cluster, would the previous data be lost?

Running only one instance of ZooKeeper is not fault-tolerant and the behavior cannot be predicted. According to the HBase reference guide, you should set up an ensemble with at least 3 servers.
Have a look at the official documentation page: ZooKeeper clustered setup.
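For illustration, a client connecting to a three-server ensemble lists every member in its connect string (brokers do the same in the zookeeper.connect property of server.properties), so losing one member does not take the coordination layer down. A minimal sketch, assuming hypothetical hostnames zk1, zk2 and zk3:

import org.apache.zookeeper.ZooKeeper;

public class EnsembleConnect {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble members; brokers would use the same list
        // in the zookeeper.connect property of server.properties.
        String connectString = "zk1:2181,zk2:2181,zk3:2181";

        // The client fails over to another listed member if one goes down,
        // which is impossible with a single-server setup.
        ZooKeeper zk = new ZooKeeper(connectString, 30_000, event ->
                System.out.println("ZooKeeper event: " + event.getState()));

        System.out.println("Session state: " + zk.getState());
        zk.close();
    }
}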

Writing to a kafka follower node

Is it at all possible to write to a Kafka follower node? I ask this because we sometimes encounter the following situation: a particular host containing a broker which is the leader node for some partitions becomes inaccessible from the producer host while actually remaining up (i.e. a network issue between the producer and the leader host), so a new leader is not elected.
To create a contrived example, one can block a host or a port using a firewall. In my contrived example, I have:
h0: 7 brokers running on ports 9092 - 9098
h1: 3 brokers running on ports 9092 - 9094
h2: 3 brokers running on ports 9092 - 9094
h3: 3 brokers running on ports 9092 - 9094
I blocked outgoing port 9092 and, as expected, approximately 25% of the messages do not get written and error out.
In the real world, I have seen a host being unreachable from the producer host for ~5 minutes.
Is there any way to ensure that the message gets written to the Kafka cluster?
It is not possible to produce messages to a follower. From the Kafka protocol documentation:
These requests to publish or fetch data must be sent to the broker that is currently acting as the leader for a given partition. This condition is enforced by the broker, so a request for a particular partition to the wrong broker will result in the NotLeaderForPartition error code (described below).
You can come up with imaginative solutions, like setting up a completely independent Kafka cluster, producing there when the main one is inaccessible, and having MirrorMaker clone the data from the secondary cluster back to the main one; or forwarding the data to some other host that has connectivity to the Kafka cluster, if your use case really requires it. But most of these options seem a bit convoluted and costly...
It may be better to just buffer the data and wait for connectivity to come back, or to investigate and invest in improving the network so there is more redundancy and it becomes harder to end up with network partitions between hosts and the Kafka cluster.
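If you go the buffering route, keep in mind that the Java producer already buffers and retries internally for a configurable window before surfacing an error. A minimal sketch with placeholder hosts and topic name, tuned to ride out roughly the 5-minute outages you describe (it does not make a follower writable, it just delays giving up):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BufferingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap servers matching the h0..h3 layout above.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "h0:9092,h1:9092,h2:9092,h3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Keep retrying internally for up to 5 minutes before failing the record,
        // so a short partition between producer and leader can be ridden out.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300_000);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // Allow unsent records to accumulate in the client-side buffer meanwhile.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 128L * 1024 * 1024);
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 300_000L);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "payload"), (metadata, exception) -> {
                if (exception != null) {
                    // Only reached after delivery.timeout.ms expires: hand the record
                    // to your own persistent buffer / retry mechanism here.
                    exception.printStackTrace();
                }
            });
        }
    }
}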

org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata for Kafka Cluster using jaas SASL config authentication

I am trying to deploy a Google Cloud Dataflow pipeline which reads from a Kafka cluster, processes its records, and then writes the results to BigQuery. However, I keep encountering the following exception when attempting to deploy:
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata for Kafka Cluster
The Kafka cluster requires the use of a JAAS configuration for authentication, and I use the code below to set the properties required for the KafkaIO.read Apache Beam method:
// Kafka properties
Map<String, Object> kafkaProperties = new HashMap<String, Object>() {{
    put("request.timeout.ms", 900000);
    put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_PLAINTEXT");
    put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
    put(SaslConfigs.SASL_JAAS_CONFIG,
        "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"USERNAME\" password=\"PASSWORD\";");
    put(CommonClientConfigs.GROUP_ID_CONFIG, GROUP_ID);
}};

// Build & execute pipeline
pipeline
    .apply(
        "ReadFromKafka",
        KafkaIO.<Long, String>read()
            .withBootstrapServers(properties.getProperty("kafka.servers"))
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withTopic(properties.getProperty("kafka.topic"))
            .withConsumerConfigUpdates(kafkaProperties));
The Dataflow pipeline is to be deployed with public IPs disabled, but there is an established VPN tunnel from our Google Cloud VPC network to the Kafka cluster, the required routes for the private IPs are configured on both sides, and their IPs are whitelisted. I am able to ping the Kafka server and connect to its socket from a Compute Engine VM in the same VPN subnetwork as the Dataflow job to be deployed.
I was thinking that there is an issue with the configuration, but I am not able to figure out whether I am missing an additional field or one of the existing ones is misconfigured. Does anyone know how I can diagnose the problem further, since the exception thrown does not really pinpoint the issue? Any help would be greatly appreciated.
Edit:
I am now able to successfully deploy the Dataflow job, however it appears that the read is not functioning correctly. After viewing the logs to check for errors in the Dataflow job, I can see that after discovering the group coordinator for the Kafka topic, there are no other log statements before a warning saying that the closing of the idle reader timed out:
Close timed out with 1 pending requests to coordinator, terminating client connections
followed by an uncaught exception with the root cause being:
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition test-partition could be determined
There is then an error stating:
Execution of work for P0 for key 0000000000000001 failed. Will retry locally.
Could this be an issue with the key definition, since the Kafka topics actually do not have keys in the messages? When I view the topic in Kafka Tool, the only columns observed in the data are Offset, Message, and Timestamp.
Based on the last comment, I assume that the issue lies more with the network stack than with any missing configuration in the Dataflow pipeline, i.e. with how the Dataflow workers connect to the Kafka brokers.
Basically, when you use a public IP address pool for the Dataflow workers, you have the simplest way to reach an external Kafka cluster, with no extra configuration on either side, as you don't need to set up a VPC network between the parties and do the routine network work to get all the routes working.
However, Cloud VPN brings some more complications: you have to implement a VPC network on both sides and then adjust the VPN gateway, forwarding rules, and the addressing pool for this VPC. In return, from the Dataflow runtime perspective, you don't need to assign public IP addresses to the Dataflow workers, which definitely reduces the price.
The problem you mention lies primarily on the Kafka cluster side. Because Apache Kafka is a distributed system, it has a core principle: when a producer/consumer starts, it requests metadata about which broker is the leader for a partition and receives metadata with the endpoints available for that partition; the client then uses those endpoints to connect to the particular broker. As far as I understand, in your case the connection to the leader is performed through the listener bound to the external network interface, as configured in the broker's server.properties.
Therefore, you might consider creating a separate listener (if it doesn't already exist) in listeners, bound to the cloud VPC network interface, and if necessary propagating advertised.listeners so that the metadata going back to the client contains the connection details for the particular broker.
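As a quick check of what the brokers actually advertise back to clients, something along these lines could be run from a VM inside the VPC (the bootstrap address and credentials are placeholders); if the returned host:port pairs are not reachable from the Dataflow workers, advertised.listeners is the likely culprit:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.config.SaslConfigs;

import java.util.Properties;

public class ShowAdvertisedEndpoints {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "KAFKA_PRIVATE_IP:9092");
        props.put(AdminClientConfig.SECURITY_PROTOCOL_CONFIG, "SASL_PLAINTEXT");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"USERNAME\" password=\"PASSWORD\";");

        try (AdminClient admin = AdminClient.create(props)) {
            // These host:port pairs come from the brokers' advertised.listeners;
            // every producer/consumer must be able to reach them directly.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("broker %d advertises %s:%d%n",
                        node.id(), node.host(), node.port());
            }
        }
    }
}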

Broker per node model?

I have gone through a couple of Kafka tutorials on Google, like this one.
Based on them I have some questions in the context of Kafka:
1. What is a broker?
As per my understanding, each Kafka instance hosting topics (zero or more partitions) is a broker.
2. Broker per node?
I believe that in a practical clustered scenario, each node will ideally have one Kafka instance, where each instance holds two partitions:
a. one partition working as leader
b. another partition working as follower for a partition on another node.
Is this correct?
1) Correct. A broker is an instance of the Kafka server software, which runs in a Java virtual machine.
2) Incorrect. A node is really the same thing as a broker. If you have three Kafka brokers running as a single cluster (for scalability and reliability), then it's said that you have a 3-node Kafka cluster. Each node is the leader for some partitions and the backup (replica) for others.
However, there are other kinds of nodes besides Kafka broker nodes. Kafka uses ZooKeeper, so you might have 3 or 5 ZooKeeper nodes as well. A cluster of ZooKeeper servers is often called an ensemble.
In later versions of Kafka there are other types of nodes as well, so it's also normal to say there are 3 broker nodes, 5 ZooKeeper nodes, 2 Kafka Connect nodes, and a 10-node (or 10-instance) Kafka Streams application.
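To make the leader/replica distribution concrete, here is a small sketch (topic name and bootstrap address are placeholders) that asks the cluster which broker is the leader for each partition and where the replicas live:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription topic = admin.describeTopics(Collections.singletonList("my-topic"))
                    .all().get().get("my-topic");
            for (TopicPartitionInfo p : topic.partitions()) {
                // Each partition has exactly one leader broker; the other brokers
                // listed as replicas act as followers (backups) for it.
                System.out.printf("partition %d: leader=%d replicas=%s%n",
                        p.partition(), p.leader().id(), p.replicas());
            }
        }
    }
}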
Each Kafka instance hosting zero or more topics is called a broker.
Each node can host multiple brokers, but in a production environment it makes sense to run one broker per node. Each broker typically hosts multiple topics/partitions though. Having only two partitions per Kafka broker is a waste of resources.
I hope this helps.

DatabaseLessLeasing has failed and Server is not in majority cluster partition

I'm facing a DatabaseLessLeasing issue. Ours is a middleware application: we don't have any database, and our application runs on WebLogic Server. We have 2 servers in one cluster. Both servers are up and running, but we use only one server to do the processing. When the primary server fails, the whole server and its services migrate to the secondary server. This is working fine.
But at the end of last year we had an issue where our secondary server's hardware was down and the secondary server was not available; we got the error below. When we went to Oracle, they suggested adding one more server, or a highly available database to hold the cluster leasing information that indicates which server is the master. As of now we don't have the option to do that, as adding a new server means a budget issue and the client is not ready for it.
Our WebLogic configuration for the cluster is:
one cluster with 2 managed servers
cluster messaging mode is Multicast
Migration Basis is Consensus
load algorithm is Round Robin
This is the log I found:
Critical Health BEA-310006 Critical Subsystem DatabaseLessLeasing has failed. Setting server state to FAILED.
Reason: Server is not in the majority cluster partition
Critical WebLogicServer BEA-000385 Server health failed. Reason: health of critical service 'DatabaseLessLeasing' failed
Notice WebLogicServer BEA-000365 Server state changed to FAILED
Note: I remember one thing: the server was not down when this happened. Both servers were running, but all of a sudden the server tried to restart and was unable to; the restart failed. I saw that the status was showing as failedToRestart and the application went down.
Can anyone please help me with this issue?
Thank you
Consensus leasing requires a majority of servers to continue functioning. Any time there is a network partition, the servers in the majority partition will continue to run, while those in the minority partition will fail, since they cannot contact the cluster leader or elect a new one without a majority of the servers. If the partition results in an equal division of servers, then the partition that contains the cluster leader survives while the other one fails.
Because of this, if automatic server migration is enabled, the servers are required to contact the cluster leader and renew their leases periodically. Servers will shut themselves down if they are unable to renew their leases, and the failed servers will then be automatically migrated to the machines in the majority partition.
The server which got partitioned (and is not part of the majority cluster) will go into the FAILED state. This behavior is in place to avoid split-brain scenarios where there are two partitions of a cluster and both think they are the real cluster. When a cluster gets segmented, the largest segment will survive and the smaller segment will shut itself down. When servers cannot reach the cluster master, they determine whether they are in the larger partition or not. If they are in the larger partition, they will elect a new cluster master. If not, they will all shut down when their leases expire. Two-node clusters are problematic in this case: when the cluster gets partitioned, which partition is the largest? When the cluster master goes down in a two-node cluster, the remaining server has no way of knowing whether it is in the majority or not. In that case, if the remaining server is the cluster master, it will continue to run; if it is not the master, it will shut down.
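The survival rule described above can be summarised in a few lines. This is not WebLogic code, just a sketch of the decision logic, which also shows why a two-node cluster is fragile:

public class ConsensusLeasingRule {

    // A partition survives with a strict majority; on an exact split,
    // the side holding the current cluster master wins.
    static boolean partitionSurvives(int serversInPartition, int totalServers, boolean hasClusterMaster) {
        if (2 * serversInPartition > totalServers) {
            return true;                 // majority partition keeps running
        }
        if (2 * serversInPartition == totalServers) {
            return hasClusterMaster;     // even split: the master's side survives
        }
        return false;                    // minority: servers fail when their leases expire
    }

    public static void main(String[] args) {
        // Two-node cluster, split, master on the other side: the remaining server
        // cannot prove it is in the majority and shuts down.
        System.out.println(partitionSurvives(1, 2, false)); // false
        // Three-node cluster losing one server: 2 of 3 is still a majority.
        System.out.println(partitionSurvives(2, 3, false)); // true
    }
}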
Usually this error shows up when there are only 2 managed servers in one cluster.
To solve this kind of issue, create another server; since the cluster has only 2 nodes, either server will fall out of the majority cluster partition if it loses connectivity or drops cluster broadcast messages, as there are no other servers in the cluster.
For consensus leasing, it is always recommended to create a cluster with at least 3 nodes; that way you can ensure some stability.
In that scenario, even if one server falls out of the cluster, the other two still function correctly as they remain in the majority cluster partition. The third one will rejoin the cluster, or will eventually be restarted.
In a scenario where you have only 2 servers in the cluster, one falling out of the cluster will result in both servers being restarted, as they are not part of the majority cluster partition; this ultimately results in a very unstable environment.
Another possible scenario is that there was a communication issue between the managed servers; look for messages like "lost .* message(s)" (in the case of unicast it is something like "Lost 2 unicast message(s)."). This may be caused by temporary network issues.
Make sure that the node manager for the secondary node in the clustered migration configuration is up and running.

JMS: local broker + HA

There is a cluster of Tomcats; each Tomcat node generates "tasks" which can be performed by any other node. I'd prefer a task to be performed by the node which created it.
I thought it would be a good idea to use an embedded broker for each Tomcat and configure them as a store-and-forward network. The problem is that a node can go down, and its tasks/messages should then be performed by another Tomcat instead of waiting for the current one to come back up.
On the other hand, when using a master/slave cluster, how do I prioritize the node which sent the message?
How do I configure this in ActiveMQ?
The priority of a local consumer should be the default. From the AMQ docs:
ActiveMQ uses Consumer Priority so that local JMS consumers are always
higher priority than remote brokers in a store and forward network.
However, you will not really achieve what you want. If one Tomcat node goes down, so will the embedded ActiveMQ (and any messages still attached to that instance). A message will not automatically get copied to all other brokers.
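For reference, an embedded broker joined into a store-and-forward network can be set up programmatically along these lines (broker names and the peer address are placeholders, not a drop-in configuration):

import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.network.NetworkConnector;

public class EmbeddedBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("tomcat-node-1");      // unique per Tomcat node
        broker.setPersistent(true);                 // keep tasks across restarts
        broker.addConnector("tcp://0.0.0.0:61616"); // transport for remote consumers/brokers

        // Store-and-forward link to the other node(s); with duplex=true a single
        // connector forwards messages in both directions.
        NetworkConnector network = broker.addNetworkConnector("static:(tcp://tomcat-node-2:61616)");
        network.setDuplex(true);
        // Lower the priority of consumers reached over the network so that a
        // local consumer, when present, gets the task first.
        network.setDecreaseNetworkConsumerPriority(true);

        broker.start();
    }
}

Even with the network consumer priority lowered, messages persisted on a node that is down stay on that node until its broker comes back, so this only biases delivery toward the local node; it does not replicate pending tasks.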
But you also asked about a master/slave cluster. Do you intend to have a network of brokers OR a master/slave setup? Or do you intend to have a combo?
