Application resilience impact using Logstash/Graylog log appender - java

I have a question about the gelf module (http://logging.paluch.biz/), in particular about what happens when the Graylog server is unavailable for some reason.
Will log4j cache the logs somewhere and send them once the connection to Graylog is restored?
Will an application using this module stop working while the Graylog server is down?
Thanks.

Gelf appenders are online appenders without a cache. They connect directly to a remote service and submit log events as your application produces them.
If the remote service is down, log events get lost. There are a few options with different impacts:
TCP: TCP comes with transport reliability and requires a connection. If the remote service becomes slow or unresponsive, your application is affected as soon as the I/O buffers are saturated. logstash-gelf uses NIO and writes in a non-blocking way as long as all data can be sent. If the TCP connection drops, you will run into connection timeouts if the remote side is not reachable, or connection-refused states if the remote port is closed. In any case you get reliability, but it will affect your application's performance.
UDP: UDP has no notion of a connection; it is used for fire-and-forget communication. If the remote side becomes unhealthy, your application is usually not affected, but you will encounter log event loss.
Redis: You can use Redis as an intermediate buffer if your Graylog instance is known to fail or to be taken down for maintenance. Once Graylog is available again, it should catch up, and you prevent log event loss to some degree. If your Redis service becomes unhealthy, see point 1 (TCP).
HTTP: HTTP is another option that gives you a degree of flexibility. You can put your Graylog servers behind a load-balancer to improve availability and reduce the risk of failure. Log event loss is still possible.
If you want to ensure log continuity and reduce the probability of log event loss, then write logs to disk. It is still not a 100% guarantee against loss (disk failure, disk full), but it keeps the impact on application performance low. The log file (ideally in some JSON-based format) can then be parsed and submitted to Graylog while maintaining a read offset, so the shipper can recover from a remote outage.
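A minimal sketch of that last approach, assuming a newline-delimited JSON log file and a placeholder submitToGraylog() sender (both names are illustrative, not part of logstash-gelf): the shipper persists a read offset after every successful submission, so it can pick up where it left off after a Graylog outage or a restart.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    /**
     * Sketch: tail a JSON log file and forward new lines to Graylog,
     * persisting a read offset so the shipper can resume after an outage.
     * submitToGraylog(...) is a placeholder for whatever transport is used.
     */
    public class FileLogShipper {

        private final Path logFile = Paths.get("app-log.json");
        private final Path offsetFile = Paths.get("app-log.offset");

        public void shipOnce() throws IOException {
            long offset = readOffset();
            try (RandomAccessFile raf = new RandomAccessFile(logFile.toFile(), "r")) {
                raf.seek(offset);
                String line;
                while ((line = raf.readLine()) != null) {
                    submitToGraylog(line);             // may throw if Graylog is down
                    writeOffset(raf.getFilePointer()); // only advance after success
                }
            }
        }

        private long readOffset() throws IOException {
            if (!Files.exists(offsetFile)) {
                return 0L;
            }
            String text = new String(Files.readAllBytes(offsetFile), StandardCharsets.UTF_8);
            return Long.parseLong(text.trim());
        }

        private void writeOffset(long offset) throws IOException {
            Files.write(offsetFile, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
        }

        private void submitToGraylog(String jsonLine) {
            // placeholder: send the line to Graylog over TCP/HTTP; throw an
            // unchecked exception on failure so the offset is not advanced
        }
    }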

Related

How to catch lost connection event in netty channel handler

I'm working on an old app that is using Netty to connect to a couple of remote TCP endpoints.
The app contains an implementation of IdleStateAwareChannelHandler and overrides several methods provided by it and SimpleChannelHandler (channelConnected, channelIdle, messageReceived, exceptionCaught, channelClosed).
This implementation is not able to cope with the scenario where the application server loses its connection to the remote server while my application is running.
I had hoped that introducing my own custom implementation of the channelDisconnected() method would allow me to react to connection loss, but in practice I'm seeing something different:
I simulate connection loss by removing my application server from the network, thus cutting it off from both incoming and outgoing traffic
I leave it isolated for 5-10 min and observe the logs
Then I bring the application server back onto the network
Only once I have restored the application server to the network do I start seeing my debug logs from the exceptionCaught and channelDisconnected methods
While the machine is isolated, I see from the logs that the channelIdle method is being invoked regularly
Question: Is it possible to detect and react to connection loss in my channel handler?
Additional Info:
Netty version: 3.2.7.Final
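(Not part of the original post, just a sketch for Netty 3.x.) One common approach is to treat a prolonged reader-idle period as a lost connection and close the channel from channelIdle yourself; closing the channel fires channelDisconnected/channelClosed locally, so the handler can react without waiting for the TCP stack to time out. The idle threshold is whatever readerIdleTime was configured on the IdleStateHandler in the pipeline.

    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.handler.timeout.IdleState;
    import org.jboss.netty.handler.timeout.IdleStateAwareChannelHandler;
    import org.jboss.netty.handler.timeout.IdleStateEvent;

    /**
     * Sketch for Netty 3.x: if the channel has been reader-idle for too long,
     * assume the connection is dead and close it. Closing fires
     * channelDisconnected/channelClosed, where reconnect logic can run.
     */
    public class ConnectionLossDetectingHandler extends IdleStateAwareChannelHandler {

        @Override
        public void channelIdle(ChannelHandlerContext ctx, IdleStateEvent e) {
            if (e.getState() == IdleState.READER_IDLE) {
                // nothing was read within the configured readerIdleTime:
                // treat it as a lost connection and force the close ourselves
                e.getChannel().close();
            }
        }
    }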

StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available

We use a STOMP broker relay (external broker: ActiveMQ 5.13.2) in our project; see
https://docs.spring.io/spring/docs/current/spring-framework-reference/web.html#websocket-stomp-handle-broker-relay
We use following stack:
org.springframework:spring-jms:jar:5.1.8.RELEASE
org.springframework:spring-messaging:jar:5.1.8.RELEASE
io.projectreactor:reactor-core:jar:3.2.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.8.6.RELEASE
io.netty:netty-all:jar:4.1.34.Final
From time to time (let's say once every two weeks) we observe the following error in Tomcat's catalina.out logs:
2019-08-21 13:38:58,891 [tcp-client-scheduler-5] ERROR com.*.websocket.stomp.SimpMessagingSender - BrokerAvailabilityEvent[available=false, StompBrokerRelay[ReactorNettyTcpClient[reactor.netty.tcp.TcpClientDoOn#219abb46]]]
2019-08-21 13:38:58,965 [tcp-client-scheduler-1] ERROR org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available
After that error, STOMP communication is broken (the system connection, a single TCP connection, is not available).
And it seems that everything started when we updated the stack from:
org.springframework:spring-jms:jar:5.0.8.RELEASE
org.springframework:spring-messaging:jar:5.0.8.RELEASE
io.projectreactor:reactor-core:jar:3.1.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.7.8.RELEASE
io.netty:netty-all:jar:4.1.25.Final
The ActiveMQ version was not changed.
There is a bug reported in Spring that auto-reconnect fails when the system connection is lost; see:
https://github.com/spring-projects/spring-framework/issues/22080
And now 3 questions:
How to make this problem more reproducible?
How to fix this reconnect behavior? :)
How to prevent to lose this connection? :)
EDIT 23.09.2019
After the error occurred, the TCP state for port 61613 (STOMP) is the following (please note the CLOSE_WAIT state):
netstat -an | grep 61613
tcp6 0 0 :::61613 :::* LISTEN
tcp6 2 0 127.0.0.1:49084 127.0.0.1:61613 CLOSE_WAIT
I can't say that I have enough information to answer your question although I have some input that may help you find a way forward.
ActiveMQ is typically used in an environment that is hosted/distributed, so load and scaling should always be a consideration.
Most databases/message queues/etc. will need some sort of tuning for load, even on AWS (via requesting higher limits), even though most of that is taken care of by the hosting provider.
But I digress...
In this case it appears you're using the TCP transport for your queue:
https://activemq.apache.org/tcp-transport-reference
As you can see, all of these settings can be tuned and have default values.
So in the case of issues logged from the Spring side connecting to AMQ, you'll want to narrow down the time of the error and then go look at your AMQ metrics and logs.
If you don't have monitoring for AMQ, I suggest:
Add Monitoring - https://activemq.apache.org/how-can-i-monitor-activemq
Add logging (or find out where the logs are). - Then enable detailed logging. (AMQ uses log4j, so just look at the log4j config file or add one.) Beyond this, consider sending the logs to a log aggregator. -- https://activemq.apache.org/how-can-i-enable-detailed-logging
Look at your hosting provider's metrics & downtime. For instance, if using AWS, there are very detailed incident logs for network failures or momentary issues with VPC or cross-region tunneling, network traffic in/out, etc.
Setting up the right tools for your distributed systems to enable your team to search/find errors/logs (and documenting how to do it) is extremely helpful. A step beyond this (for mature systems) is to add a layer on top of your monitoring so that your systems start telling you when there is a problem instead of the other way around (go looking for problems).
That may be a bit verbose - but that all leads up to me asking if you have logs / metrics for the AMQ system at the times of the failure. If you do, please post them!
I make these suggestions because:
There is no information provided on your load expectation, variability of load, or recognition that load is a consideration in a system (via troubleshooting steps).
Logs/errors provided are strictly from the client side.
The reproducibility of the error is infrequent and inconsistent - so it could be almost anything (memory leak, load issue, etc..) - so monitoring is necessary.
Also consider adding Spring Actuator for monitoring your message client on the Spring side, as there are frequently limitations/settings for client connection pools & advanced settings too; especially if you scale instance sizes up/down and your instance will be handling more or less load, your client libraries may need some settings tuning.
https://www.baeldung.com/spring-boot-actuators
Exposing metrics about current Websocket connections with Spring
You can also catch the exception and tear down & re-create your connection/settings - although this wouldn't be the first thing I recommend without knowing more about the situations & stats at the time of the connection failure.
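As an illustration of that last point (a sketch, not the original poster's code): Spring publishes BrokerAvailabilityEvent whenever the relay's system connection goes up or down, so an application listener is one place to hook in alerting or tear-down/re-create logic.

    import org.springframework.context.ApplicationListener;
    import org.springframework.messaging.simp.broker.BrokerAvailabilityEvent;
    import org.springframework.stereotype.Component;

    /**
     * Sketch: react to the STOMP broker relay losing/regaining its system
     * connection. What to do on "unavailable" (alerting, tearing down and
     * re-creating resources, etc.) depends on the application.
     */
    @Component
    public class BrokerAvailabilityListener implements ApplicationListener<BrokerAvailabilityEvent> {

        @Override
        public void onApplicationEvent(BrokerAvailabilityEvent event) {
            if (event.isBrokerAvailable()) {
                // relay re-established its TCP connection to ActiveMQ
            } else {
                // relay lost its connection: log, alert, or trigger recovery here
            }
        }
    }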

Elasticsearch unclosed client. Live threads after Tomcat shutdown. Memory usage impact?

I am using Elasticsearch 1.5.1 and Tomcat 7. The web application creates a TCP client instance as a singleton during server startup through the Spring Framework.
Just noticed that I failed to close the client during server shutdown.
Through analysis with various tools like VisualVM, JConsole, and MAT in Eclipse, it is evident that threads created by the Elasticsearch client are still live even after server (Tomcat) shutdown.
Note: after introducing client.close() via Context Listener destroy methods, the threads are killed gracefully.
But my queries here are:
How do I check the memory occupied by these live threads?
What is the memory-leak impact of these threads?
We have had a few Out of memory: PermGen errors in PROD. This might be a reason, but I would still like to measure and provide stats for this.
Any suggestions/help please.
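For reference, the context-listener shutdown mentioned above might look roughly like this (a sketch; how the singleton client is looked up, and the attribute name used here, are assumptions):

    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import org.elasticsearch.client.Client;

    /**
     * Sketch: close the Elasticsearch client when the web application is
     * undeployed so its I/O threads do not outlive the Tomcat shutdown.
     * How the singleton client is obtained is application-specific.
     */
    public class ElasticsearchShutdownListener implements ServletContextListener {

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            // client is created elsewhere (e.g. by Spring) at startup
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) {
            Client client = lookupClient(sce);
            if (client != null) {
                client.close(); // releases the client's transport threads
            }
        }

        private Client lookupClient(ServletContextEvent sce) {
            // placeholder: fetch the singleton client, e.g. from the Spring context
            return (Client) sce.getServletContext().getAttribute("elasticsearchClient");
        }
    }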
Typically clients run in a different process than the services they communicate with. For example, I can open a web page in a web browser, and then shutdown the webserver, and the client will remain open.
This has to do with the underlying design choices of TCP/IP. Glossing over the details, in most cases a client only detects that its server is gone during the next request to the server. (Again, generally speaking) it does not continually poll the server to see if it is alive, nor does the server generally send a "please disconnect" message on shutting down.
The reason that clients don't generally poll servers is because it allows the server to handle more clients. With a polling approach, the server is limited by the number of clients running, but without a polling approach, it is limited by the number of clients actively communicating. This allows it to support more clients because many of the running clients aren't actively communicating.
The reason that servers typically don't send an "I'm shutting down" message is because many times the server goes down uncontrollably (power outage, operating system crash, fire, short circuit, etc.). This means that a protocol which requires such a message will leave the clients in a corrupt state if the server goes down in an uncontrolled manner.
So losing a connection is really a function of a failed request to the server. The client will still typically be running until it makes the next attempt to do something.
Likewise, opening a connection to a server often does nothing most of the time too. To validate that you really have a working connection to a server, you must ask it for some data and get a reply. Most protocols do this automatically to simplify the logic; but if you ever write your own service and you don't ask for data from the server, then even if the API says you have a good "connection", you might not. The API can report a good "connection" when you have all the stuff configured on your machine successfully. To really know if it works 100% with the other machine, you need to ask for data (and get it).
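A tiny sketch of that idea (the PING request/reply protocol is made up for the example): a connection only proves itself when a request actually gets a reply within a timeout.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    /**
     * Sketch of the "ask for data to know the connection is live" idea:
     * the connection counts as alive only if a request gets an answer.
     */
    public class ConnectionProbe {

        public static boolean isAlive(String host, int port) {
            try (Socket socket = new Socket(host, port)) {
                socket.setSoTimeout(5_000); // don't wait forever for the reply
                OutputStream out = socket.getOutputStream();
                out.write("PING\n".getBytes(StandardCharsets.UTF_8));
                out.flush();

                InputStream in = socket.getInputStream();
                return in.read() != -1;     // any reply counts as "alive" here
            } catch (IOException e) {
                return false;               // connect failure, timeout, reset, ...
            }
        }
    }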
Finally servers sometimes lose their clients, but because they don't waste bandwidth chattering with clients just to see if they are there, often the servers will put a "timeout" on the client connection. Basically if the server doesn't hear from the client in 10 minutes (or the configured value) then it closes the cached connection information for the client (recreating the connection information as necessary if the client comes back).
From your description it is not clear which of the scenarios you might be seeing, but hopefully this general knowledge will help you understand why after closing one side of a connection, the other side of a connection might still think it is open for a while.
There are ways to configure the network connection to report closures more immediately, but I would avoid using them, unless you are willing to lose a lot of your network bandwidth to keep-alive messages and don't want your servers to respond as quickly as they could.

Sending messages over an unreliable network in JAVA

I need to send a continuous flow of messages (simple TextMessages with a timestamp and x/y coordinates) over a wireless network from a moving computer. There will be a lot of these short messages (like 200 per sec) and unfortunately the network connection is most likely unreliable since the sending device will leave the WLAN area from time to time... When the connection is not available, all upcoming messages should be buffered until the connection is back up again. The order of the transmitted messages does not matter, since they contain a timestamp, but ALL messages must be transferred.
What would be a simple but reliable method for sending these telegrams? Would it be possible to just use a "plain" TCP or UDP socket connection? Would messages be buffered when the connection is temporarily down and sent automatically afterwards? Or is the connection loss detected and reported directly, so that I could buffer the messages and try to reconnect periodically on my own? Do libraries like Netty help here?
I also thought about using broker-to-broker communication (e.g. an ActiveMQ network of brokers) as an alternative. Would the overhead be too big here? Would you suggest another messaging middleware in this case?
TCP guarantees delivery (while it is connected, that is). You should check whether the connection has gone down and put messages in a queue while retrying the connection. Once you see that the connection is back up, dump the queue into the TCP socket (a sketch of this follows below).
Also look into TCP Keepalive for recognition of a down connection: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
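A rough sketch of that queue-and-flush idea (my own illustration, not a complete solution): producers append messages to an in-memory queue, and a background loop keeps reconnecting and draining the queue over TCP. To satisfy the "ALL messages must be transferred" requirement you would additionally persist the queue to disk so it survives a process restart.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    /**
     * Sketch: buffer messages in memory while the TCP connection is down and
     * flush them when it comes back. Host/port are placeholders; a real version
     * should also spill the queue to disk to survive process restarts.
     */
    public class BufferingTcpSender implements Runnable {

        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final String host;
        private final int port;

        public BufferingTcpSender(String host, int port) {
            this.host = host;
            this.port = port;
        }

        public void submit(String message) {
            queue.add(message); // never blocks the producer
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try (Socket socket = new Socket(host, port);
                     OutputStream out = socket.getOutputStream()) {
                    while (true) {
                        String message = queue.take(); // waits for the next message
                        try {
                            out.write((message + "\n").getBytes(StandardCharsets.UTF_8));
                            out.flush();
                        } catch (IOException e) {
                            queue.add(message); // put it back so it is not lost
                            throw e;
                        }
                    }
                } catch (IOException e) {
                    sleepQuietly(1_000); // connection down: wait, then reconnect
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }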
It seems like you could use a message wrapper such as Java JMS with an "assured persistent" reliability mode. I have not done this myself in the context of text messages, but this idea may lead you to the right answer. Also, there may be an Apache library already written that handles what you need, such as Qpid.
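A rough sketch of the JMS route using ActiveMQ (the broker URL and queue name are placeholders): the failover: transport reconnects automatically and retries sends while the link is down, and persistent delivery keeps messages in the broker's store once they arrive.

    import javax.jms.Connection;
    import javax.jms.DeliveryMode;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    /**
     * Sketch of the JMS approach with ActiveMQ. The failover: transport
     * reconnects automatically; persistent delivery stores messages in the
     * broker once they have been handed over.
     */
    public class PositionSender {

        public static void main(String[] args) throws Exception {
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("failover:(tcp://broker-host:61616)");
            Connection connection = factory.createConnection();
            connection.start();

            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("positions"));
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);

            TextMessage message = session.createTextMessage("ts=...;x=...;y=...");
            producer.send(message); // blocks/retries via failover while the link is down

            session.close();
            connection.close();
        }
    }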

log4j: How does a Socket Appender work?

I'm not sure how SocketAppender works. I know that logging events are sent to a particular port. Then we can print the logs on a console or put them into a file.
My question is more about the way logs are sent. Is there e.g. one queue? Is it synchronous or asynchronous? Can using it slow down my program?
I've found some info here, but it isn't clear to me.
From the SocketAppender documentation
Logging events are automatically buffered by the native TCP implementation. This means that if the link to the server is slow but still faster than the rate of (log) event production by the client, the client will not be affected by the slow network connection. However, if the network connection is slower than the rate of event production, then the client can only progress at the network rate. In particular, if the network link to the server is down, the client will be blocked.
On the other hand, if the network link is up, but the server is down, the client will not be blocked when making log requests, but the log events will be lost due to server unavailability.
Since the appender uses the TCP protocol, I would say the log events are "sort of synchronous".
Basically, the appender uses TCP to send the first log event to the server. However, if the network latency is so high that the message has still not been sent by the time a second event is generated, then the second log event will have to wait (and thus block), until the first event is consumed. So yes, it would slow down your application, if the app generates log events faster than the network can pass them on.
As mentioned by @Akhil and @Nikita, JMSAppender or AsyncAppender would be better options if you don't want the performance of your application to be impacted by network latency.
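For completeness, a programmatic log4j 1.x sketch of that suggestion (host and port are placeholders; the same wiring can also be done in the log4j configuration file): wrapping the SocketAppender in an AsyncAppender hands events to a background thread so slow network I/O does not block application threads.

    import org.apache.log4j.AsyncAppender;
    import org.apache.log4j.Logger;
    import org.apache.log4j.net.SocketAppender;

    /**
     * Sketch: wrap a SocketAppender in an AsyncAppender so log events are
     * handed off to a background thread. With blocking disabled, events are
     * dropped once the async buffer fills up instead of stalling the caller.
     */
    public class AsyncSocketLoggingSetup {

        public static void configure() {
            SocketAppender socketAppender = new SocketAppender("log-server.example", 4560);

            AsyncAppender asyncAppender = new AsyncAppender();
            asyncAppender.setBufferSize(512); // events buffered before discarding
            asyncAppender.setBlocking(false); // drop events instead of blocking callers
            asyncAppender.addAppender(socketAppender);

            Logger.getRootLogger().addAppender(asyncAppender);
        }
    }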
SocketAppender sends the logs as a serialized object to a SocketNode or log server. In the appender, the connector thread with a configured reconnectionDelay checks the connection integrity and will drop the logs if the connection is not initialized or is disconnected. Hence there is no blocking of the application flow.
If you need better JMS features for sending log info across JVMs, try JMSAppender.
The Log4j JMS appender can be used to send your log messages to a JMS broker. The events are serialized and transmitted as the JMS message type ObjectMessage.
You can get a sample program HERE.
It seems to be synchronous (I checked the sources), but I may be mistaken. You can use AsyncAppender to make it asynchronous. See this.
