Java memory leak with AWS CloudWatch

I'm using the AWS CloudWatch PutLogEvents API to store client-side logs in CloudWatch, following this reference:
https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html
Below is the sample implementation I'm basing mine on; I use a for-each loop to push the list of InputLogEvents to CloudWatch.
https://docs.aws.amazon.com/code-samples/latest/catalog/javav2-cloudwatch-src-main-java-com-example-cloudwatch-PutLogEvents.java.html
After continuously pushing logs for about one or two hours, the service crashes with an out-of-memory error. When I run the application locally it runs for more than one or two hours without crashing, although I haven't run it locally long enough to pin down a crash time. I was wondering whether there is a memory leak in my AWS CloudWatch implementation, because before this change the service ran for months without any heap issues.
Here is my service implementation; the putLogEvents method is called from the controller. My only suspect area is the cwLogDto for-each loop, but there I set inputLogEvent to null, so it should be garbage collected. In any case, I have tested this without sending the list of DTOs and I still get the same OOM error.
@Override
public TransportDto putLogEvents(List<CWLogDto> cwLogDtos, UserType userType) throws Exception {
    TransportDto transportDto = new TransportDto();
    CloudWatchLogsClient logsClient = CloudWatchLogsClient.builder().region(Region.of(region))
            .build();
    PutLogEventsResponse putLogEventsResponse = putCWLogEvents(logsClient, userType, cwLogDtos);
    logsClient.close();
    transportDto.setResponse(putLogEventsResponse.sdkHttpResponse());
    return transportDto;
}
private PutLogEventsResponse putCWLogEvents(CloudWatchLogsClient logsClient, UserType userType, List<CWLogDto> cwLogDtos) throws Exception {
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(logStreamPattern);
    String streamName = LocalDateTime.now().format(formatter);
    String logGroupName = logGroupOne;
    if (userType.equals(UserType.TWO))
        logGroupName = logGroupTwo;

    log.info("Total Memory before (in bytes): {}", Runtime.getRuntime().totalMemory());
    log.info("Free Memory before (in bytes): {}", Runtime.getRuntime().freeMemory());
    log.info("Max Memory before (in bytes): {}", Runtime.getRuntime().maxMemory());

    DescribeLogStreamsRequest logStreamRequest = DescribeLogStreamsRequest.builder()
            .logGroupName(logGroupName)
            .logStreamNamePrefix(streamName)
            .build();
    DescribeLogStreamsResponse describeLogStreamsResponse = logsClient.describeLogStreams(logStreamRequest);

    // Assume that a single stream is returned, since a specific stream name was specified
    // in the previous request; if not, create a new stream.
    String sequenceToken = null;
    if (!describeLogStreamsResponse.logStreams().isEmpty()) {
        sequenceToken = describeLogStreamsResponse.logStreams().get(0).uploadSequenceToken();
        describeLogStreamsResponse = null;
    } else {
        CreateLogStreamRequest request = CreateLogStreamRequest.builder()
                .logGroupName(logGroupName)
                .logStreamName(streamName)
                .build();
        logsClient.createLogStream(request);
        request = null;
    }

    // Build the input log messages to put to CloudWatch.
    List<InputLogEvent> inputLogEventList = new ArrayList<>();
    for (CWLogDto cwLogDto : cwLogDtos) {
        InputLogEvent inputLogEvent = InputLogEvent.builder()
                .message(new ObjectMapper().writeValueAsString(cwLogDto))
                .timestamp(System.currentTimeMillis())
                .build();
        inputLogEventList.add(inputLogEvent);
        inputLogEvent = null;
    }

    log.info("Total Memory after (in bytes): {}", Runtime.getRuntime().totalMemory());
    log.info("Free Memory after (in bytes): {}", Runtime.getRuntime().freeMemory());
    log.info("Max Memory after (in bytes): {}", Runtime.getRuntime().maxMemory());

    // Specify the request parameters.
    // The sequence token is required so that the log can be written to the
    // latest location in the stream.
    PutLogEventsRequest putLogEventsRequest = PutLogEventsRequest.builder()
            .logEvents(inputLogEventList)
            .logGroupName(logGroupName)
            .logStreamName(streamName)
            .sequenceToken(sequenceToken)
            .build();
    inputLogEventList = null;
    logStreamRequest = null;
    return logsClient.putLogEvents(putLogEventsRequest);
}
CWLogDto
@Data
@JsonInclude(JsonInclude.Include.NON_NULL)
public class CWLogDto {
    private Long userId;
    private UserType userType;
    private LocalDateTime timeStamp;
    private Integer logLevel;
    private String logType;
    private String source;
    private String message;
}
Heap dump summary
Any help will be greatly appreciated.
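A side note on the garbage-collection reasoning above: setting the loop-local variable to null does not by itself make the InputLogEvent collectible, because inputLogEventList (and later the PutLogEventsRequest built from it) still references the same object. A tiny standalone sketch of that behaviour:

import java.util.ArrayList;
import java.util.List;

public class NullingLocalsDemo {
    public static void main(String[] args) {
        List<StringBuilder> list = new ArrayList<>();
        StringBuilder sb = new StringBuilder("event payload");
        list.add(sb);
        sb = null; // only clears the local reference
        // The object is still strongly reachable through the list and prints here;
        // it can only be collected once the list itself is no longer reachable.
        System.out.println(list.get(0));
    }
}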

Related

How to drain the window after a Flink join using coGroup()?

I'd like to join data coming in from two Kafka topics ("left" and "right").
Matching records are to be joined using an ID, but if a "left" or a "right" record is missing, the other one should be passed downstream after a certain timeout. Therefore I have chosen to use the coGroup function.
This works, but there is one problem: when no further messages arrive, there is always at least one record that stays in an internal buffer for good. It only gets pushed out when new messages arrive; otherwise it is stuck.
The expected behaviour is that all records should be pushed out after the configured idle timeout has been reached.
Some information which might be relevant
Flink 1.14.4
The Flink parallelism is set to 8, so is the number of partitions in both Kafka topics.
Flink checkpointing is enabled
Event-time processing is to be used
Lombok is used: So val is like final var
Some code snippets:
Relevant join settings
public static final int AUTO_WATERMARK_INTERVAL_MS = 500;
public static final Duration SOURCE_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(4000);
public static final Duration SOURCE_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Duration TRANSFORMATION_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(5000);
public static final Duration TRANSFORMATION_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Time JOIN_WINDOW_SIZE = Time.milliseconds(1500);
Create KafkaSource
private static KafkaSource<JoinRecord> createKafkaSource(Config config, String topic) {
    val properties = KafkaConfigUtils.createConsumerConfig(config);
    val deserializationSchema = new KafkaRecordDeserializationSchema<JoinRecord>() {
        @Override
        public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<JoinRecord> out) {
            val m = JsonUtils.deserialize(record.value(), JoinRecord.class);
            val copy = m.toBuilder()
                    .partition(record.partition())
                    .build();
            out.collect(copy);
        }

        @Override
        public TypeInformation<JoinRecord> getProducedType() {
            return TypeInformation.of(JoinRecord.class);
        }
    };

    return KafkaSource.<JoinRecord>builder()
            .setProperties(properties)
            .setBootstrapServers(config.kafkaBootstrapServers)
            .setTopics(topic)
            .setGroupId(config.kafkaInputGroupIdPrefix + "-" + String.join("_", topic))
            .setDeserializer(deserializationSchema)
            .setStartingOffsets(OffsetsInitializer.latest())
            .build();
}
Create DataStreamSource
Then the DataStreamSource is built on top of the KafkaSource:
Configure "max out of orderness"
Configure "idleness"
Extract timestamp from record, to be used for event time processing
private static DataStreamSource<JoinRecord> createLeftSource(Config config,
                                                             StreamExecutionEnvironment env) {
    val leftKafkaSource = createLeftKafkaSource(config);
    val leftWms = WatermarkStrategy
            .<JoinRecord>forBoundedOutOfOrderness(SOURCE_MAX_OUT_OF_ORDERNESS)
            .withIdleness(SOURCE_IDLE_TIMEOUT)
            .withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);

    return env.fromSource(leftKafkaSource, leftWms, "left-kafka-source");
}
Use keyBy
The keyed sources are created on top of the DataSource instances like this:
Again configure "out of orderness" and "idleness"
Again extract timestamp
val leftWms = WatermarkStrategy
        .<JoinRecord>forBoundedOutOfOrderness(TRANSFORMATION_MAX_OUT_OF_ORDERNESS)
        .withIdleness(TRANSFORMATION_IDLE_TIMEOUT)
        .withTimestampAssigner((joinRecord, __) -> {
            if (VERBOSE_JOIN)
                log.info("Left : " + joinRecord);
            return joinRecord.timestamp.toEpochSecond() * 1000L;
        });

val leftKeyedSource = leftSource
        .keyBy(jr -> jr.id)
        .assignTimestampsAndWatermarks(leftWms)
        .name("left-keyed-source");
Join using coGroup
The join then combines the left and the right keyed sources
val joinedStream = leftKeyedSource
        .coGroup(rightKeyedSource)
        .where(left -> left.id)
        .equalTo(right -> right.id)
        .window(TumblingEventTimeWindows.of(JOIN_WINDOW_SIZE))
        .apply(new CoGroupFunction<JoinRecord, JoinRecord, JoinRecord>() {
            @Override
            public void coGroup(Iterable<JoinRecord> leftRecords,
                                Iterable<JoinRecord> rightRecords,
                                Collector<JoinRecord> out) {
                // Transform
                val result = ...;
                out.collect(result);
            }
        });
Write stream to console
The resulting joinedStream is written to the console:
val consoleSink = new PrintSinkFunction<JoinRecord>();
joinedStream.addSink(consoleSink);
How can I configure this join operation, so that all records are pushed downstream after the configured idle timeout?
If it can't be done this way: Is there another option?
This is the expected behavior. withIdleness doesn't try to handle the case where all streams are idle. It only helps in cases where there are still events flowing from at least one source partition/shard/split.
To get the behavior you desire (in the context of a continuous streaming job), you'll have to implement a custom watermark strategy that advances the watermark based on a processing time timer. Here's an implementation that uses the legacy watermark API.
On the other hand, if the job is complete and you just want to drain the final results before shutting it down, you can use the --drain option when you stop the job. Or if you use bounded sources this will happen automatically.
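For the continuous-streaming case, a minimal sketch of such a processing-time-driven strategy using the newer WatermarkGenerator API (not the legacy-API implementation referred to above) might look like this; the class name and the maxLagMillis parameter are made up for the example, and it assumes event timestamps roughly track wall-clock time:

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// Advances the watermark from processing time, so event-time windows can still
// fire when every partition is idle. maxLagMillis bounds how far the watermark
// trails wall-clock time (i.e. how long late records are waited for).
public class ProcessingTimeTrailingWatermarkGenerator<T> implements WatermarkGenerator<T> {

    private final long maxLagMillis;
    private long maxSeenTimestamp = Long.MIN_VALUE;

    public ProcessingTimeTrailingWatermarkGenerator(long maxLagMillis) {
        this.maxLagMillis = maxLagMillis;
    }

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Emit the larger of "wall clock minus allowed lag" and the usual
        // bounded-out-of-orderness watermark derived from observed events.
        long fromProcessingTime = System.currentTimeMillis() - maxLagMillis;
        long fromEvents = maxSeenTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : maxSeenTimestamp - maxLagMillis;
        output.emitWatermark(new Watermark(Math.max(fromProcessingTime, fromEvents)));
    }
}

// Usage (replacing forBoundedOutOfOrderness in the snippets above):
// WatermarkStrategy.<JoinRecord>forGenerator(ctx -> new ProcessingTimeTrailingWatermarkGenerator<>(5000L))
//         .withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);

The emission frequency is controlled by the auto-watermark interval already configured in the question (AUTO_WATERMARK_INTERVAL_MS).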

FlinkCEP pattern detection doesn't happen in real time

I'm still new to the Flink CEP library and I don't yet understand the pattern detection behavior.
Consider the example below: I have a Flink app that consumes data from a Kafka topic, where data is produced periodically, and I want to use a Flink CEP pattern to detect when a value is bigger than a given threshold.
The code is below:
public class CEPJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "test");
        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),
                properties);
        consumer.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
        DataStream<String> stream = env.addSource(consumer);

        // Process incoming data.
        DataStream<Stock> inputEventStream = stream.map(new MapFunction<String, Stock>() {
            private static final long serialVersionUID = -491668877013085114L;

            @Override
            public Stock map(String value) {
                String[] data = value.split(":");
                System.out.println("Date: " + data[0] + ", Adj Close: " + data[1]);
                Stock stock = new Stock(data[0], Double.parseDouble(data[1]));
                return stock;
            }
        });

        // Create the pattern
        Pattern<Stock, ?> myPattern = Pattern.<Stock>begin("first").where(new SimpleCondition<Stock>() {
            private static final long serialVersionUID = -6301755149429716724L;

            @Override
            public boolean filter(Stock value) throws Exception {
                return (value.getAdj_Close() > 140.0);
            }
        });

        // Create a pattern stream from our warning pattern
        PatternStream<Stock> myPatternStream = CEP.pattern(inputEventStream, myPattern);

        // Generate alert for each matched pattern
        DataStream<Stock> warnings = myPatternStream.select((Map<String, List<Stock>> pattern) -> {
            Stock first = pattern.get("first").get(0);
            return first;
        });

        warnings.print();
        env.execute("CEP job");
    }
}
When I run the job, the pattern detection doesn't happen in real time: the warning for a pattern detected on the current record is only output after the next record is produced, as if printing the warning to the log is delayed. I don't really understand how to make it output the warning the moment the pattern is detected, without waiting for the next record. Thank you :)
The data coming from Kafka is in string format, "date:value", and a record is produced every 5 seconds.
Java version: 1.8, Scala version: 2.11.12, Flink version: 1.12.2, Kafka version: 2.3.0
The solution I found was to send a fake record (a null object, for example) to the Kafka topic every time I produce a real value, and on the Flink side (in the pattern declaration) to test whether the received record is fake or not.
It seems that FlinkCEP always waits for an upcoming event before it outputs the warning.
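A minimal sketch of that workaround inside the pattern condition, assuming the fake heartbeat record is encoded as a Stock whose date is null (the getDate() accessor is assumed, not shown in the question):

// Heartbeat records exist only to advance processing so CEP can emit the
// previously matched pattern; they never match the condition themselves.
Pattern<Stock, ?> myPattern = Pattern.<Stock>begin("first").where(new SimpleCondition<Stock>() {
    @Override
    public boolean filter(Stock value) {
        if (value.getDate() == null) {
            return false; // ignore heartbeat/fake records
        }
        return value.getAdj_Close() > 140.0;
    }
});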

DropwizardMetricServices doesn't submit the gauge metric to JMX for second time (after removing the first time)

The DropwizardMetricServices#submit() method I'm using doesn't submit the gauge metric a second time.
My use case is to remove the gauge metric from JMX after reading it, and my application can then send the same metric again (with a different value).
The first time, the gauge metric is submitted successfully (and my application removes it once it has read the metric). But the same metric is not submitted the second time.
So I'm a bit confused about what the reason could be for DropwizardMetricServices#submit() not working the second time.
Below is the code:
Submit metric:
private void submitNonSparseMetric(final String metricName, final long value) {
    validateMetricName(metricName);
    metricService.submit(metricName, value); // metricService is the DropwizardMetricServices
    log(metricName, value);
    LOGGER.debug("Submitted the metric {} to JMX", metricName);
}
Code that reads and removes the metric:
protected void collectMetrics() {
    // Create the connection
    Long currTime = System.currentTimeMillis() / 1000; // Graphite needs
    Socket connection = createConnection();
    if (connection == null) {
        return;
    }

    // Get the output stream
    DataOutputStream outputStream = getDataOutputStream(connection);
    if (outputStream == null) {
        closeConnection();
        return;
    }

    // Get metrics from JMX
    Map<String, Gauge> g = metricRegistry.getGauges(); // metricRegistry is com.codahale.metrics.MetricRegistry
    for (Entry<String, Gauge> e : g.entrySet()) {
        String key = e.getKey();
        if (p2cMetric(key)) {
            String metricName = convertToMetricStandard(key);
            String metricValue = String.valueOf(e.getValue().getValue());
            String metricToSend = String.format("%s %s %s\n", metricName, metricValue, currTime);
            try {
                writeToStream(outputStream, metricToSend);
                // Remove the metric from JMX after successfully sending metric to graphite
                removeMetricFromJMX(key);
            } catch (IOException e1) {
                LOGGER.error("Unable to send metric to Graphite - {}", e1.getMessage());
            }
        }
    }
    closeOutputStream();
    closeConnection();
}
I think I found the issue.
According to the DropwizardMetricServices documentation (https://docs.spring.io/spring-boot/docs/current/api/org/springframework/boot/actuate/metrics/dropwizard/DropwizardMetricServices.html#submit-java.lang.String-double-), the submit() method "sets the specified gauge value".
So I think DropwizardMetricServices#submit() is meant only to set the value of an existing gauge metric in JMX, not to add a new metric to JMX.
Once I replaced DropwizardMetricServices#submit() with the MetricRegistry#register() method (com.codahale.metrics.MetricRegistry) to submit all my metrics, it worked as expected and my metrics are re-added to JMX (after they have been removed by my application).
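A minimal sketch of that replacement (the wrapper class, field, and method names here are illustrative; only the MetricRegistry calls come from the Codahale API):

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class GaugeSubmitter {
    private final MetricRegistry metricRegistry;

    public GaugeSubmitter(MetricRegistry metricRegistry) {
        this.metricRegistry = metricRegistry;
    }

    public void submitGauge(String metricName, long value) {
        // remove() is a no-op if the gauge is absent; register() would throw
        // IllegalArgumentException if a metric with the same name still existed.
        metricRegistry.remove(metricName);
        metricRegistry.register(metricName, (Gauge<Long>) () -> value);
    }
}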
But I'm just wondering what makes DropwizardMetricServices#submit() only add new metrics to JMX and not re-add a metric that has already been removed (from JMX). Does DropwizardMetricServices cache (in memory) all the metrics submitted to JMX, and is that what makes submit() not resubmit the metric?

Apache Curator - Zookeeper connection loss exception, possible memory leak

I have been working on a process that continuously monitors a distributed atomic long counter. It monitors the counter every minute using the getCounter method of the ZkClient class below. In fact, I have multiple threads running, each of which monitors a different counter (distributed atomic long) stored in Zookeeper nodes. Each thread specifies the path of its counter via the parameters of the getCounter method.
public class TagserterZookeeperManager {

    public enum ZkClient {
        COUNTER("10.11.18.25:2181"); // Integration URL

        private CuratorFramework client;

        private ZkClient(String servers) {
            Properties props = TagserterConfigs.ZOOKEEPER.getProperties();
            String zkFromConfig = props.getProperty("servers", "");
            if (zkFromConfig != null && !zkFromConfig.isEmpty()) {
                servers = zkFromConfig.trim();
            }
            ExponentialBackoffRetry exponentialBackoffRetry = new ExponentialBackoffRetry(1000, 3);
            client = CuratorFrameworkFactory.newClient(servers, exponentialBackoffRetry);
            client.start();
        }

        public CuratorFramework getClient() {
            return client;
        }
    }

    public static String buildPath(String... node) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < node.length; i++) {
            if (node[i] != null && !node[i].isEmpty()) {
                sb.append("/");
                sb.append(node[i]);
            }
        }
        return sb.toString();
    }

    public static DistributedAtomicLong getCounter(String taskType, int hid, String jobId, String countType) {
        String path = buildPath(taskType, hid + "", jobId, countType);
        Builder builder = PromotedToLock.builder().lockPath(path + "/lock").retryPolicy(new ExponentialBackoffRetry(10, 10));
        DistributedAtomicLong count = new DistributedAtomicLong(ZkClient.COUNTER.getClient(), path, new RetryNTimes(5, 20), builder.build());
        return count;
    }
}
From within the threads, this is how I am calling this method:
DistributedAtomicLong counterTotal = TagserterZookeeperManager
.getCounter("testTopic", hid, jobId, "test");
Now it seems like after the threads have run for a few hours, at one stage I start getting the following org.apache.zookeeper.KeeperException$ConnectionLossException exception inside the getCounter method where it tries to read the count:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /contentTaskProd
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.curator.utils.ZKPaths.mkdirs(ZKPaths.java:215)
at org.apache.curator.utils.EnsurePath$InitialHelper$1.call(EnsurePath.java:148)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.utils.EnsurePath$InitialHelper.ensure(EnsurePath.java:141)
at org.apache.curator.utils.EnsurePath.ensure(EnsurePath.java:99)
at org.apache.curator.framework.recipes.atomic.DistributedAtomicValue.getCurrentValue(DistributedAtomicValue.java:254)
at org.apache.curator.framework.recipes.atomic.DistributedAtomicValue.get(DistributedAtomicValue.java:91)
at org.apache.curator.framework.recipes.atomic.DistributedAtomicLong.get(DistributedAtomicLong.java:72)
...
I keep getting this exception from then on for a while, and I get the feeling it is causing an internal memory leak that eventually leads to an OutOfMemory error and makes the whole process bail out. Does anybody have any idea what the reason for this could be? Why would Zookeeper suddenly start throwing the connection-loss exception? After the process bails out, I can manually connect to Zookeeper through another small console program I have written (also using Curator), and everything looks fine there.
To monitor a node in Zookeeper using Curator you can use a NodeCache. This won't solve your connection problems, but instead of polling the node once a minute you can get a push event when it changes.
In my experience, the NodeCache handles disconnections and connection recovery quite well.
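A minimal sketch of that approach, reusing the classes from the question above (hid and jobId stand in for each thread's own parameters, and the watch-and-reread logic is only illustrative):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.framework.recipes.cache.NodeCache;

// Watch the counter node and react to changes instead of polling every minute.
CuratorFramework client = TagserterZookeeperManager.ZkClient.COUNTER.getClient();
String counterPath = TagserterZookeeperManager.buildPath("testTopic", hid + "", jobId, "test");

NodeCache nodeCache = new NodeCache(client, counterPath);
nodeCache.getListenable().addListener(() -> {
    // Re-read the counter through the existing helper whenever the node changes.
    DistributedAtomicLong counter = TagserterZookeeperManager.getCounter("testTopic", hid, jobId, "test");
    System.out.println("Counter changed, current value: " + counter.get().postValue());
});
nodeCache.start();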

elasticsearch java bulk batch size

I want to use the Elasticsearch bulk API from Java and I'm wondering how I can set the batch size.
Currently I am using it as:
BulkRequestBuilder bulkRequest = getClient().prepareBulk();
while (hasMore) {
    bulkRequest.add(getClient().prepareIndex(indexName, indexType, artist.getDocId()).setSource(json));
    hasMore = checkHasMore();
}
BulkResponse bResp = bulkRequest.execute().actionGet();
// To check failures
log.info("Has failures? {}", bResp.hasFailures());
Any idea how I can set the bulk/batch size?
It mainly depends on the size of your documents, available resources on the client and the type of client (transport client or node client).
The node client is aware of the shards over the cluster and sends the documents directly to the nodes that hold the shards where they are supposed to be indexed. On the other hand the transport client is a normal client that sends its requests to a list of nodes in a round-robin fashion. The bulk request would be sent to one node then, which would become your gateway when indexing.
Since you're using the Java API, I would suggest you have a look at the BulkProcessor, which makes it much easier and more flexible to index in bulk. You can define a maximum number of actions, a maximum size, and a maximum time interval since the last bulk execution; it will execute the bulk automatically for you when needed. You can also set a maximum number of concurrent bulk requests.
After you've created the BulkProcessor like this:
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
        logger.info("Going to execute new bulk composed of {} actions", request.numberOfActions());
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        logger.info("Executed bulk composed of {} actions", request.numberOfActions());
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        logger.warn("Error executing bulk", failure);
    }
}).setBulkActions(bulkSize).setConcurrentRequests(maxConcurrentBulk).build();
You just have to add your requests to it:
bulkProcessor.add(indexRequest);
and close it at the end to flush any remaining requests that might not have been executed yet:
bulkProcessor.close();
To finally answer your question: the nice thing about the BulkProcessor is that it also has sensible defaults: 5 MB bulk size, 1000 actions, 1 concurrent request, and no flush interval (which might be useful to set).
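For example, a sketch of setting those thresholds explicitly on the builder shown above (client and listener stand for the client and BulkProcessor.Listener from the earlier snippet; package names are from the classic transport-client API and may differ in newer versions):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

BulkProcessor bulkProcessor = BulkProcessor.builder(client, listener)
        .setBulkActions(1000)                                // flush after 1000 actions...
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))  // ...or after 5 MB of requests...
        .setFlushInterval(TimeValue.timeValueSeconds(5))     // ...or at the latest every 5 seconds
        .setConcurrentRequests(1)                            // allow 1 bulk in flight while accumulating the next
        .build();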
You need to count the requests in your bulk request builder; when it hits your batch size limit, execute the bulk and then start a fresh builder for the next batch.
Here is an example:
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "MyClusterName").build();
TransportClient client = new TransportClient(settings);
String hostname = "myhost ip";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(hostname, port));

BulkRequestBuilder bulkBuilder = client.prepareBulk();
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("my_file_path"))));
long bulkBuilderLength = 0;
String readLine = "";
String index = "my_index_name";
String type = "my_type_name";
String id = "";

while ((readLine = br.readLine()) != null) {
    id = somefunction(readLine);
    String json = new ObjectMapper().writeValueAsString(readLine);
    bulkBuilder.add(client.prepareIndex(index, type, id).setSource(json));
    bulkBuilderLength++;
    if (bulkBuilderLength % 1000 == 0) {
        logger.info("##### " + bulkBuilderLength + " data indexed.");
        BulkResponse bulkRes = bulkBuilder.execute().actionGet();
        if (bulkRes.hasFailures()) {
            logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
        }
        bulkBuilder = client.prepareBulk();
    }
}
br.close();

if (bulkBuilder.numberOfActions() > 0) {
    logger.info("##### " + bulkBuilderLength + " data indexed.");
    BulkResponse bulkRes = bulkBuilder.execute().actionGet();
    if (bulkRes.hasFailures()) {
        logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
    }
    bulkBuilder = client.prepareBulk();
}
Hope this helps, thanks.
