I have a typical Kafka consumer/producer app that polls for data all the time. Sometimes there might be no data for hours, but sometimes there could be thousands of messages per second. Because of this, the application is built so that it is always polling, with a 500 ms poll timeout.
However, I've noticed that if the Kafka cluster goes down, the consumer client, once started, won't throw an exception; it simply times out after 500 ms and keeps returning empty ConsumerRecords<K,V>. So, as far as the application is concerned, there is no data to consume, when in reality the whole Kafka cluster could be unreachable and the app itself has no idea.
I checked the docs, and I couldn't find a way to validate consumer health, other than maybe closing the connection and subscribing to the topic every single time, but I really don't want to do that on a long-running application.
What's the best way to validate that the consumer is active and healthy while polling, ideally from the same thread/client object, so that the app can distinguish between no data and an unreachable kafka cluster situation?
I am sure this is not the best way to achieve what you are looking for.
But one simple way that I implemented in my application is to maintain a static counter, emptyRecordSetReceived, and increment it whenever a poll operation returns an empty record set.
This counter was reported to Graphite at a periodic interval (say every minute) via the application's metric registry.
Now let's say you know the maximum time frame for which messages can legitimately be unavailable to this application, for example 6 hours. Given that you are polling every 500 milliseconds, you know that if no message arrives for 6 hours the counter will increase by
2 polls per second * 60 seconds * 60 minutes * 6 hours = 43,200.
We placed an alerting check on this counter value as reported to Graphite. The metric gave me a decent idea of whether it was a genuine lack of data in the application or whether something was down on the broker or producer side.
This is just the naive way I solved this use case to some extent. I would love to hear how it is actually done without maintaining these counters.
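For reference, here is a minimal sketch of that counter approach, assuming the Dropwizard Metrics library with its Graphite reporter and the poll(Duration) variant of the Kafka consumer API; the metric name, the Graphite host, and the process() method are placeholders. The 43,200 threshold would then live on the Graphite/alerting side, not in the app.

import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.net.InetSocketAddress;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class EmptyPollMonitor {

    public static void run(KafkaConsumer<String, String> consumer) {
        MetricRegistry registry = new MetricRegistry();
        Counter emptyRecordSetReceived = registry.counter("consumer.emptyRecordSetReceived");

        // Push all metrics to Graphite once a minute (host and port are placeholders).
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("myapp")
                .build(new Graphite(new InetSocketAddress("graphite.example.com", 2003)));
        reporter.start(1, TimeUnit.MINUTES);

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) {
                // Either genuinely no data or an unreachable cluster: count it and let alerting decide.
                emptyRecordSetReceived.inc();
            } else {
                records.forEach(r -> process(r.value()));
            }
        }
    }

    private static void process(String value) {
        // application-specific handling
    }
}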
I am using the default heartbeat interval of 30 seconds. Has anybody experienced any load issues using that default interval?
It depends on the number of reads your app does in a given window. If you are sure your app is going to keep hitting the database with queries, you can go with a longer interval or even disable the heartbeat by setting it to 0. If you have an on-and-off load pattern (for example, requests for two minutes and then no requests for the next two minutes), then the heartbeat is critical to keep the connection active; otherwise establishing a new connection would be costly.
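Since the question doesn't say which client this is, here is only a rough, library-agnostic sketch of an application-level keepalive using a ScheduledExecutorService; runLightweightQuery() is a hypothetical stand-in for whatever cheap call keeps your particular connection warm.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionKeepAlive {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Start pinging every intervalSeconds; 0 disables the keepalive entirely.
    public void start(long intervalSeconds) {
        if (intervalSeconds <= 0) {
            return; // steady query traffic already keeps the connection active
        }
        scheduler.scheduleAtFixedRate(this::runLightweightQuery,
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    // Hypothetical cheap call (e.g. a trivial read) that keeps the connection from idling out.
    private void runLightweightQuery() {
        // ... issue the lightweight request against your data store ...
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}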
We have a Java 8 application served by Apache Tomcat 8 behind an Apache server, which calls multiple web services in parallel using CXF. From time to time, one of these calls takes exactly 3 seconds longer than the rest (which should take only about 500 ms).
I've activated CXF debug logging, and I've found the place inside CXF where the 3 seconds are lost:
14/03/2018 09:20:49.061 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - No Trust Decider for Conduit '{http://ws.webapp.com/}QueryWSImplPort.http-conduit'. An affirmative Trust Decision is assumed.
14/03/2018 09:20:52.077 [pool-838-thread-1] DEBUG o.a.cxf.transport.http.HTTPConduit - Sending POST Message with Headers to http://172.16.56.10:5050/services/quertServices Conduit :{http://ws.webapp.com/}QueryWSImplPort.http-conduit
As you can see, there are three seconds between these two lines. When the request is OK, it usually takes 0 ms to get from one line to the other.
I've been looking into the CXF code, but I have no clue about the reason for these 3 seconds...
The server application (which is also maintained by us) is served by another Apache Tomcat, 6.0.49, which sits behind another Apache server. The thing is that the server-side Apache seems to receive the request only after the 3 seconds.
Could anyone help me?
EDIT:
We've monitored the packets sent and received on both servers, and it seems that the client side sends its connection-negotiation packet when it should, while the server replies only after 3 seconds.
These are the packets we captured:
481153 11:31:32 14/03/2018 2429.8542795 tomcat6.exe SOLTESTV010 SOLTESTV002 TCP TCP:Flags=CE....S., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646321, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192 {TCP:5513, IPv4:62}
481686 11:31:35 14/03/2018 2432.8608381 tomcat6.exe SOLTESTV002 SOLTESTV010 TCP TCP:Flags=...A..S., SrcPort=5050, DstPort=65160, PayloadLen=0, Seq=436586023, Ack=2858646322, Win=8192 ( Negotiated scale factor 0x8 ) = 2097152 {TCP:5513, IPv4:62}
481687 11:31:35 14/03/2018 2432.8613607 tomcat6.exe SOLTESTV010 SOLTESTV002 TCP TCP:Flags=...A...., SrcPort=65160, DstPort=5050, PayloadLen=0, Seq=2858646322, Ack=436586024, Win=256 (scale factor 0x8) = 65536 {TCP:5513, IPv4:62}
481688 11:31:35 14/03/2018 2432.8628380 tomcat6.exe SOLTESTV010 SOLTESTV002 HTTP HTTP:Request, POST /services/consultaServices {HTTP:5524, TCP:5513, IPv4:62}
So it seems the server's Tomcat is the one that is blocked on something. Any clue?
EDIT 2:
Although that happened yesterday (the first server waiting 3 s for the ACK from the second), it is not the most common scenario. What usually happens is what I described at the beginning (3 seconds between the two CXF log lines, and the server receiving the request from the first one after 3 seconds).
There have been times when the server (the one that receives the request) hangs for 3 seconds. For instance:
Server 1 sends 5 requests at (supposedly) the same time to server 2.
Server 2 receives 4 of them in that same second and starts to process them.
Server 2 finishes processing 2 of those 4 requests in 30 ms and replies to server 1.
At more or less that same second, nothing more is registered in the application logs.
After three seconds, log entries appear again and the server finishes processing the remaining 2 requests. So, although the processing itself takes only a few milliseconds, response_time - request_time is 3 seconds and a few ms.
At that same time, the remaining request (the last of the 5 that were sent) shows up in the network monitor and is processed by the application in just a few milliseconds. However, its overall processing time is again more than 3 s, as it reached the server 3 seconds after being sent.
So there is something like a hang in the middle of the process. Two requests were processed before this hang and answered in a fraction of a second. Two other requests took a little longer, the hang happened, and they ended up with a processing time of 3 seconds. The last one reached the server just as the hang happened, so it didn't get into the application until after the hang.
It sounds like a GC stop-the-world pause... but we have analyzed the GC logs and there's nothing wrong there... could there be any other reason?
Thanks!
EDIT 3:
Looking at the TCP flags like the ones I pasted last week, we've noticed that there are lots of packets with the CE flag, which is a notification of TCP congestion. We're not network experts, but we've read that this could lead to a 3-second delay before the packet retransmission...
Could anyone give us some help with that?
Thanks. Kind regards.
In the end, it was all caused by the network congestion we discovered by looking at the TCP flags. Our network admins have been looking at the problem, trying to reduce the congestion and reducing the retransmission timeout.
"The thing is that it seems that the server's Apache receives the request after the 3 seconds."
How did you figure this out? If you're looking at Apache logs, you can be misled by wrong timestamps.
I first thought that your Tomcat 6 was taking 3 seconds to answer instead of 0 to 500 ms, but from the question and the comments, that is not the case.
Hypothesis 1: Garbage Collector
The GC is known for introducing latency.
Highlight the GC activity in your logs by using the GC verbosity parameters. If it is too difficult to correlate, you can use the jstat command with the gcutil option and compare it easily with Tomcat's log.
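For a Java 8 Tomcat, the verbose GC flags and the jstat sampling could look roughly like this (the log path and process id are placeholders):

# Java 8 GC verbosity, e.g. added to CATALINA_OPTS
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/tomcat/gc.log

# Sample GC utilisation of the Tomcat process every second
jstat -gcutil <tomcat-pid> 1000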
Hypothesis 2: Network timeout
Although 3s is a very short time (in comparison with the 21s TCP default timeout on Windows for example), it could be a timeout.
To track the timeouts, you can use the netstat command. With netstat -an, look for connections in the SYN_SENT state, and with netstat -s look at the error counters. Please also check whether there is any network resource that must be resolved or accessed by this guilty web service call.
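On the Windows hosts visible in your capture, that boils down to something like:

REM connections stuck in the SYN_SENT state (handshake not completing)
netstat -an | findstr SYN_SENT

REM per-protocol statistics, including segment retransmissions
netstat -s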
The Google App Engine documentation says that tasks are limited to 10 minutes. However, when I run deferred tasks they die after 60 seconds. I couldn't find this mentioned anywhere.
Does this mean that App Engine deferred tasks are limited to 60 seconds, or am I doing something wrong?
UPDATE: The first task is triggered from a request, but I am not waiting for it to return (and how could I anyway, there are no callbacks). The subsequent ones I am triggering, kind of recursively, from within the task itself:
// withPayload is statically imported from com.google.appengine.api.taskqueue.TaskOptions.Builder
DeferredTask df = new QuoteReader(params);
QueueFactory.getDefaultQueue().add(withPayload(df));
Many of them just work, but for the ones that reach the 1-minute limit I get an ApiProxy$ApiDeadlineExceededException:
com.googlecode.objectify.cache.Pending completeAllPendingFutures: Error cleaning up pending Future: com.googlecode.objectify.cache.CachingAsyncDatastoreService$3#17f5ddc
java.util.concurrent.ExecutionException: com.google.apphosting.api.ApiProxy$ApiDeadlineExceededException: The API call datastore_v3.Get() took too long to respond and was cancelled.
Another thing I noticed: this also affects other requests to that server happening at the same time, and those fail with a DeadlineExceededException.
The error is coming from a Datastore operation that is exceeding 60 s. It's not really related to the task queue deadlines as such; you are correct that those are 10 minutes (see here).
However, as per an old related issue (maybe it has changed to 60 s since):
From Google: Even though offline requests can currently live up to 10 minutes (and background instances can live forever) datastore queries can still only live for 30 seconds.
It seems from the exception that your code completed, and it's Objectify (later, in the request filters) where the timeout actually occurs. I'd suggest you split up your data operations so that the datastore queries are quicker, and if necessary use .now() on your data operations so that exceptions occur in your code.
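A minimal sketch of that suggestion, assuming Objectify's usual static ofy() helper; Quote and quoteId are hypothetical stand-ins for your own entity and key:

import static com.googlecode.objectify.ObjectifyService.ofy;

// Synchronous variants: .now() blocks here, so a datastore timeout surfaces
// inside your task code instead of later in Objectify's request filter.
Quote quote = ofy().load().type(Quote.class).id(quoteId).now();
quote.setProcessed(true); // hypothetical application logic
ofy().save().entity(quote).now();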
I'm new to web services. I'm using Oracle JDeveloper to call a web service to request data, and I need to call the function they provided about 40,000 times. I use a while loop, and it erratically returns a "504 Gateway Time-out" or "500 Server Error". By "erratically" I mean that sometimes it catches an exception after 500 calls, but sometimes after just a few calls, or even at the beginning.
What I've tried: reducing the total number of calls while increasing the amount of data requested per call. But the result seems to be that I get the error more frequently (after 2 or 3 queries).
My question is: How does the server count time? Is the time-out server error related to "time" or to "frequency"? Is there a way to avoid this error?
Solution
I just reduced the call frequency and, when this error popped up, waited 30 seconds and then retried the call.
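In code, that "wait 30 seconds and retry" pattern looks roughly like the sketch below; callWebService(), ServerBusyException and the retry limit are placeholders for the actual generated client call and whatever exception your client throws for the 504/500 responses:

private static final int MAX_RETRIES = 5;      // placeholder limit
private static final long WAIT_MS = 30_000L;   // 30-second pause, as in the solution above

Data callWithRetry(Request request) throws Exception {
    for (int attempt = 1; ; attempt++) {
        try {
            return callWebService(request);    // hypothetical generated client call
        } catch (ServerBusyException e) {      // however the 504/500 surfaces in your client
            if (attempt >= MAX_RETRIES) {
                throw e;                       // give up after a few attempts
            }
            Thread.sleep(WAIT_MS);             // wait before recalling the service
        }
    }
}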