QFJ Passes Messages in the Wrong Order - java

I am using QFJ 2.1.1 and testing my application against a FIX simulator (also running QFJ 2.1.1).
There are two FIX sessions: each side runs both an initiator and an acceptor.
The problem workflow looks like this:
1. simulator/acceptor <--- New Order Single <--- application/initiator
2. simulator/acceptor ---> ACK ---> application/initiator
3. simulator/initiator ---> New Order Single ---> application/acceptor
4. simulator/initiator <--- ACK <--- application/acceptor
The order of FIX messages processed by QFJ in the simulator is 1, 2, 3, 4.
The order of FIX messages processed by QFJ in the application is 1, 3, 2, 4: the application was called back with 3 (the NewOrderSingle) before it was called back with 2 (the ACK).
Here are the QFJ log snippets from the simulator showing 2 and 3 being sent on two sessions:
2
2019-12-12 10:23:12.820 [928630][QFJ Message Processor][INFO ] <-- OUTBOUND VENDOR: 8=FIX.4.2|9=180|35=8|34=2|52=20191212-15:23:12.820|11=287:MACGREGOR-37392703:45037843|17=BYHWG|20=0|37=SIM:287:MACGREGOR-37392703:45037843|38=10000|39=0|54=2|55=MSFT|150=0|10=132|[:]
3
2019-12-12 10:23:12.820 [928630][QFJ Message Processor][INFO ] <-- OUTBOUND ATS: 8=FIX.4.2|9=208|35=D|34=2|52=20191212-15:23:12.820|11=GSET:287:MACGREGOR-37392703:45037843|18=M|21=1|38=10000|40=P|44=153.3500|54=2|55=MSFT|60=20191212-15:23:09.205|110=0|8011=287:MACGREGOR-37392703:45037843|10=207|[:]
Here are the QFJ log snippets from the application showing 3 being received before 2 on the two sessions:
3
2019-12-12 10:23:12.824 [31181][QFIXManager][INFO ] FIX onAppReceived(): AQUA->GSET, message=[11=GSET:287:MACGREGOR-37392703:45037843 35=D 18=M 44=153.3500] {11=GSET:287:MACGREGOR-37392703:45037843, 44=153.3500, 55=MSFT, 34=2, 56=AQUA, 35=D, 8011=287:MACGREGOR-37392703:45037843, 49=GSET, 38=10000, 18=M, 110=0, 8=FIX.4.2, 9=208, 60=20191212-15:23:09.205, 40=P, 52=20191212-15:23:12.820, 21=1, 54=2, 10=207} [:]
2
2019-12-12 10:23:12.827 [31184][QFIXManager][INFO ] FIX onAppReceived(): AQUABORG->GSETBORG, message=[11=287:MACGREGOR-37392703:45037843 37=SIM:287:MACGREGOR-37392703:45037843 35=8 39=0 150=0] {11=287:MACGREGOR-37392703:45037843, 55=MSFT, 34=2, 56=AQUABORG, 35=8, 37=SIM:287:MACGREGOR-37392703:45037843, 49=GSETBORG, 38=10000, 17=BYHWG, 39=0, 150=0, 8=FIX.4.2, 9=180, 52=20191212-15:23:12.820, 20=0, 54=2, 10=132}
As you can see, the application receives messages 2 and 3 out of order.
How can I prevent this in my QFJ application?
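For what it's worth, FIX sequencing is only guaranteed within a single session; across two independent sessions (two TCP connections) neither FIX nor QFJ promises relative ordering, so this kind of requirement is usually handled with application-level correlation rather than a QFJ setting. Below is a minimal, hypothetical sketch (the class and the way the correlation key is derived are mine, not from the original application) that buffers the pass-through order until the matching ACK has been seen; the shared order identifier visible in tags 11/8011 of the two messages above could serve as such a key.
import java.util.HashMap;
import java.util.Map;
import quickfix.Message;

// Illustrative only: re-establishes the business ordering "ACK before
// pass-through order" when the two messages arrive on different sessions.
public class AckCorrelator {

    private final Map<String, Message> pendingOrders = new HashMap<>();
    private final Map<String, Boolean> ackedKeys = new HashMap<>();

    // Called from the session that receives the ExecutionReport (message 2).
    public synchronized void onAck(String correlationKey) {
        ackedKeys.put(correlationKey, Boolean.TRUE);
        Message buffered = pendingOrders.remove(correlationKey);
        if (buffered != null) {
            process(buffered);      // release an order that arrived too early
        }
    }

    // Called from the session that receives the NewOrderSingle (message 3).
    public synchronized void onNewOrder(String correlationKey, Message order) {
        if (ackedKeys.containsKey(correlationKey)) {
            process(order);         // ACK already seen, normal path
        } else {
            pendingOrders.put(correlationKey, order);   // hold until the ACK arrives
        }
    }

    private void process(Message message) {
        // application-specific handling goes here
    }
}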

Related

Understanding AWS FIFO Queue behaviour

I was playing around with an AWS SQS FIFO queue locally in LocalStack, using the AWS Java SDK v2 and Spring Boot.
I created one endpoint to send messages through one publisher, and three endpoints to receive/poll messages from the queue via three consumers in Spring Boot controller classes.
I created the FIFO queue with the following properties:
RECEIVE_MESSAGE_WAIT_TIME_SECONDS = 20 seconds (long poll)
VISIBILITY_TIMEOUT = 60 seconds
FIFO_QUEUE = true
CONTENT_BASED_DEDUPLICATION = true
Each consumer fetches at most 3 messages per poll request (at least 1 if available, up to 3), roughly as sketched below.
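A hedged sketch of what such a poll might look like with the AWS SDK v2 (the SqsClient instance and queueUrl are assumptions for illustration, not taken from the linked repo):
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

class FifoPoller {
    // Long-poll the FIFO queue: wait up to 20 s, return at most 3 messages.
    static List<Message> poll(SqsClient sqsClient, String queueUrl) {
        ReceiveMessageRequest receiveRequest = ReceiveMessageRequest.builder()
                .queueUrl(queueUrl)
                .maxNumberOfMessages(3)
                .waitTimeSeconds(20)
                .build();
        return sqsClient.receiveMessage(receiveRequest).messages();
    }
}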
I published 5 messages to the queue (in order). They are:
Message Group Id | Deduplication Id
-----------------------------------
A | A1
A | A2
A | A3
A | A4
A | A5
From the log:
2022-06-01 16:13:26.474 INFO 27918 --- [nio-9099-exec-1] c.p.sqs.service.SqsPublisherServiceImpl : sendMsgRequest SendMessageRequest(QueueUrl=http://localhost:4566/000000000000/dev-priyam-fifo-queue.fifo, MessageBody={"id":"A1"}, MessageDeduplicationId=A1, MessageGroupId=A)
2022-06-01 16:13:26.600 INFO 27918 --- [nio-9099-exec-2] c.p.sqs.service.SqsPublisherServiceImpl : sendMsgRequest SendMessageRequest(QueueUrl=http://localhost:4566/000000000000/dev-priyam-fifo-queue.fifo, MessageBody={"id":"A2"}, MessageDeduplicationId=A2, MessageGroupId=A)
2022-06-01 16:13:26.700 INFO 27918 --- [nio-9099-exec-3] c.p.sqs.service.SqsPublisherServiceImpl : sendMsgRequest SendMessageRequest(QueueUrl=http://localhost:4566/000000000000/dev-priyam-fifo-queue.fifo, MessageBody={"id":"A3"}, MessageDeduplicationId=A3, MessageGroupId=A)
2022-06-01 16:13:26.785 INFO 27918 --- [nio-9099-exec-4] c.p.sqs.service.SqsPublisherServiceImpl : sendMsgRequest SendMessageRequest(QueueUrl=http://localhost:4566/000000000000/dev-priyam-fifo-queue.fifo, MessageBody={"id":"A4"}, MessageDeduplicationId=A4, MessageGroupId=A)
2022-06-01 16:13:26.843 INFO 27918 --- [nio-9099-exec-5] c.p.sqs.service.SqsPublisherServiceImpl : sendMsgRequest SendMessageRequest(QueueUrl=http://localhost:4566/000000000000/dev-priyam-fifo-queue.fifo, MessageBody={"id":"A5"}, MessageDeduplicationId=A5, MessageGroupId=A)
I then started polling from the consumers at random. My observations are below:
1. A1, A2 and A3 were polled. They were polled but not deleted (intentionally), so they went back to the queue after the visibility timeout (60 seconds) expired.
2. In the next poll, A3 and A4 were polled. Again, they were polled but not deleted, so they went back to the queue after 60 seconds.
3. In the next poll, A4 and A5 were polled. Again, they were polled but not deleted, so they went back to the queue after 60 seconds.
4. In the next poll (and all following polls) only A5 was polled, and I kept getting only A5 from then on.
Now I want to understand why I am getting this behaviour. The whole selling point of FIFO is getting ordered messages (per message group ID). My expectation after step 1 was that the next poll (step 2) would return one of A1, A1 A2, or A1 A2 A3 - but this didn't happen.
Can anyone explain what is happening here?
My GitHub repo: https://github.com/tahniat-ashraf/java-aws-sqs-101
I believe this is a known issue in LocalStack when using both CONTENT_BASED_DEDUPLICATION=true and an explicit MessageDeduplicationId.
SQS supports either content-based deduplication or manual deduplication via a deduplication ID; it does not support both.
Try running this against an actual SQS queue, or change your configuration as described in the LocalStack issue (see the sketch below).
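For illustration, a send that relies solely on content-based deduplication might look like the following with the AWS SDK v2 (client, queue URL and group ID are placeholders); the alternative would be to keep passing MessageDeduplicationId and turn CONTENT_BASED_DEDUPLICATION off on the queue.
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

class FifoSender {
    // With CONTENT_BASED_DEDUPLICATION=true, omit MessageDeduplicationId and
    // let SQS deduplicate on a hash of the message body instead.
    static void send(SqsClient sqsClient, String queueUrl, String body) {
        SendMessageRequest request = SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(body)
                .messageGroupId("A")   // ordering is guaranteed per message group
                .build();
        sqsClient.sendMessage(request);
    }
}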

After enabling checkpoint - org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException

I have enabled checkpointing in Flink 1.12.1 programmatically, as below:
int duration = 10;
if (!environment.getCheckpointConfig().isCheckpointingEnabled()) {
    // checkpoint every 60 s with exactly-once semantics, leaving at least 30 s between checkpoints
    environment.enableCheckpointing(duration * 6 * 1000, CheckpointingMode.EXACTLY_ONCE);
    environment.getCheckpointConfig().setMinPauseBetweenCheckpoints(duration * 3 * 1000);
}
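For reference, the RocksDB backend and checkpoint directory listed in the configuration below can also be set programmatically; a hedged sketch only, assuming the flink-statebackend-rocksdb artifact is on the classpath:
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

class CheckpointSetup {
    // Mirrors state.backend: rocksdb and state.checkpoints.dir from the
    // configuration below; "true" enables incremental RocksDB checkpoints.
    static void configure(StreamExecutionEnvironment environment) throws Exception {
        environment.setStateBackend(new RocksDBStateBackend("file:///flink/", true));
    }
}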
Flink Version: 1.12.1
configuration:
state.backend: rocksdb
state.checkpoints.dir: file:///flink/
blob.server.port: 6124
jobmanager.rpc.port: 6123
parallelism.default: 2
queryable-state.proxy.ports: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.rpc.port: 6122
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
jobmanager.web.address: 0.0.0.0
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
rest.port: 8081
taskmanager.data.port: 6121
classloader.resolve-order: parent-first
execution.checkpointing.unaligned: false
execution.checkpointing.max-concurrent-checkpoints: 2
execution.checkpointing.interval: 60000
But it is failing with the following error:
Caused by: org.apache.flink.runtime.resourcemanager.exceptions.UnfulfillableSlotRequestException: Could not fulfill slot request 44ec308e34aa86629d2034a017b8ef91. Requested resource profile (ResourceProfile{UNKNOWN}) is unfulfillable.
If I remove/disable checkpointing, everything works normally. I need checkpointing because, if my pod gets restarted, the data that was being handled by the earlier run gets reset.
Can somebody advise how this can be addressed?

Error occurred in partition processor while running 2 instances of a Java SDK app listening to an Event Hub

Dependencies being used:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs</artifactId>
    <version>5.1.2</version>
</dependency>
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-eventhubs-checkpointstore-blob</artifactId>
    <version>1.1.2</version>
</dependency>
I have an Event Hub with 4 partitions.
I am producing messages continuously to the Event Hub and listening with an EventProcessorClient.
Initially, 1 instance is up and listening to all 4 partitions.
Then I start a 2nd instance with the same consumer group on the same machine. My expectation was that the partitions would automatically be rebalanced and partition ownership would be distributed across the 2 listeners (having the same consumer group).
But I am facing 2 issues:
1. Both instances are receiving the same messages from the same partition (in contrast to the expected 1 partition / 1 consumer ownership).
2. I keep getting the below error, and similar errors, continuously.
Error occurred in partition processor for partition 3, com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:70cb6fce00000f030023f5515f0f1e63_G27_B21, SystemTracker:......:test~32766|$default, Timestamp:2020-07-15T15:19:19, errorContext[NAMESPACE: ......, PATH: test/ConsumerGroups/$default/Partitions/3, REFERENCE_ID: ......, LINK_CREDIT: 499].
Can someone help me understand what is causing these issues?
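For context, a minimal EventProcessorClient setup with a shared blob checkpoint store might look like the sketch below (connection strings, names and handlers are placeholders, not taken from the original app); for ownership to be balanced, every instance in the consumer group has to point at the same storage container.
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

class ProcessorSetup {
    static EventProcessorClient build(String ehConnectionString, String storageConnectionString) {
        // The SAME container must be used by every instance of the consumer group.
        BlobContainerAsyncClient container = new BlobContainerClientBuilder()
                .connectionString(storageConnectionString)
                .containerName("eventhub-checkpoints")
                .buildAsyncClient();

        return new EventProcessorClientBuilder()
                .connectionString(ehConnectionString, "test")
                .consumerGroup("$Default")
                .checkpointStore(new BlobCheckpointStore(container))
                .processEvent(ctx -> ctx.updateCheckpoint())                 // per-event handler
                .processError(err -> System.err.println(err.getThrowable())) // per-partition error handler
                .buildEventProcessorClient();
    }
}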

The messages are not getting deleted from the file system when deleteRecords Kafka Admin Client Java API is invoked

I was trying to delete messages from my Kafka topic using the Java AdminClient API's deleteRecords method. These are the steps I tried:
1. I pushed 20000 records to my TEST-DELETE topic
2. Started a console consumer and consumed all the messages
3. Invoked my Java program to delete all those 20k messages
4. Started another console consumer with a different group id. This consumer is not receiving any of the deleted messages
When I checked the file system, I could still see all those 20k records occupying disk space. My intention is to delete those records forever from the file system too.
My topic configuration is given below, along with the relevant server.properties settings:
Topic:TEST-DELETE PartitionCount:4 ReplicationFactor:1 Configs:cleanup.policy=delete
Topic: TEST-DELETE Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Topic: TEST-DELETE Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: TEST-DELETE Partition: 2 Leader: 0 Replicas: 0 Isr: 0
Topic: TEST-DELETE Partition: 3 Leader: 0 Replicas: 0 Isr: 0
log.retention.hours=24
log.retention.check.interval.ms=60000
log.cleaner.delete.retention.ms=60000
file.delete.delay.ms=60000
delete.retention.ms=60000
offsets.retention.minutes=5
offsets.retention.check.interval.ms=60000
log.cleaner.enable=true
log.cleanup.policy=compact,delete
My delete code is given below
public void deleteRecords(Map<String, Map<Integer, Long>> allTopicPartions) {
    Map<TopicPartition, RecordsToDelete> recordsToDelete = new HashMap<>();
    allTopicPartions.entrySet().forEach(topicDetails -> {
        String topicName = topicDetails.getKey();
        Map<Integer, Long> value = topicDetails.getValue();
        value.entrySet().forEach(partitionDetails -> {
            if (partitionDetails.getValue() != 0) {
                recordsToDelete.put(new TopicPartition(topicName, partitionDetails.getKey()),
                        RecordsToDelete.beforeOffset(partitionDetails.getValue()));
            }
        });
    });
    DeleteRecordsResult deleteRecords = this.client.deleteRecords(recordsToDelete);
    Map<TopicPartition, KafkaFuture<DeletedRecords>> lowWatermarks = deleteRecords.lowWatermarks();
    lowWatermarks.entrySet().forEach(entry -> {
        try {
            logger.info(entry.getKey().topic() + " " + entry.getKey().partition() + " "
                    + entry.getValue().get().lowWatermark());
        } catch (Exception ex) {
        }
    });
}
The output of my Java program is given below:
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 1 5000
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 0 5000
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 3 5000
2019-06-25 16:21:15 INFO MyKafkaAdminClient:247 - TEST-DELETE 2 5000
My intention is to delete the consumed records from the file system, as I am working with limited storage for my Kafka broker.
I would like some help with the following doubts:
I was under the impression that deleteRecords would remove the messages from the file system too, but it looks like I got that wrong!
How long will those deleted records be present in the log directory?
Is there any specific configuration I need to use in order to remove the records from the file system once the deleteRecords API is invoked?
Appreciate your help.
Thanks
The recommended approach to handle this is to set retention.ms and related configuration values for the topics you're interested in. That way, you can define how long Kafka will store your data before deleting it, making sure all your downstream consumers have had the chance to pull down the data before it's deleted from the Kafka cluster.
If, however, you still want to force Kafka to delete based on bytes, there are the log.retention.bytes and retention.bytes configuration values. The first is a cluster-wide setting, the second is the topic-specific setting, which by default takes whatever the first one is set to, but you can still override it per topic. The retention.bytes number is enforced per partition, so you should multiply it by the total number of topic partitions.
Be aware, however, that if you have a runaway producer that suddenly starts generating a lot of data, and you have a hard byte limit set, you might wipe out entire days' worth of data in the cluster and be left with only the last few minutes of data, maybe even before valid consumers can pull it down from the cluster. This is why it's much better to give your Kafka topics time-based retention rather than byte-based.
You can find the configuration properties and their explanation in the official Kafka docs: https://kafka.apache.org/documentation/
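If you do want to tighten retention for just this topic, one option (a sketch only: it assumes an existing AdminClient, requires Kafka 2.3+ for incrementalAlterConfigs, and the one-hour value is arbitrary) is to alter the topic config with the same Admin client:
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

class RetentionTuner {
    // Sets retention.ms=3600000 (1 hour) on TEST-DELETE; the broker then
    // deletes older log segments on its normal cleanup schedule.
    static void setRetention(AdminClient client) throws Exception {
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "TEST-DELETE");
        AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("retention.ms", "3600000"), AlterConfigOp.OpType.SET);
        Map<ConfigResource, Collection<AlterConfigOp>> updates =
                Collections.singletonMap(topic, Collections.singletonList(op));
        client.incrementalAlterConfigs(updates).all().get();
    }
}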

JMeter test on Netty-based impl produces error for every second request

I've implemented an HTTP service based on the HTTP server example provided by the netty.io project.
When I execute a GET request to the service URL from the command line (wget) or from a browser, I receive a result as expected.
When I perform a load test using ApacheBench (ab -n 100000 -c 8 http://localhost:9000/path/to/service), I experience no errors (neither on the service nor on the ab side) and see fair numbers for request processing duration.
Afterwards, I set up a test plan in JMeter with a thread group of 1 thread and a loop count of 2. I added an HTTP Request sampler where I simply set the server name localhost, the port number 9000 and the path /path/to/service. Then I also added a View Results Tree and a Summary Report listener.
Finally, I executed the test plan and received one valid response and one error showing the following content:
Thread Name: Thread Group 1-1
Sample Start: 2015-06-04 09:23:12 CEST
Load time: 0
Connect Time: 0
Latency: 0
Size in bytes: 2068
Headers size in bytes: 0
Body size in bytes: 2068
Sample Count: 1
Error Count: 1
Response code: Non HTTP response code: org.apache.http.NoHttpResponseException
Response message: Non HTTP response message: The target server failed to respond
Response headers:
HTTPSampleResult fields:
ContentType:
DataEncoding: null
The associated exception found in the Response data tab showed the following content:
org.apache.http.NoHttpResponseException: The target server failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
at org.apache.jmeter.protocol.http.sampler.MeasuringConnectionManager$MeasuredConnection.receiveResponseHeader(MeasuringConnectionManager.java:201)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.executeRequest(HTTPHC4Impl.java:517)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:331)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:74)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1146)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1135)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:434)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:261)
at java.lang.Thread.run(Thread.java:745)
As I have a similar service already running (which receives and processes web-tracking data) that shows no such errors, it might be a problem with my test plan or with JMeter, but I am not sure :-(
Did anyone experience similar behavior? Thanks in advance ;-)
The issue can be related to keep-alive management.
Read these:
https://bz.apache.org/bugzilla/show_bug.cgi?id=57921
https://wiki.apache.org/jmeter/JMeterSocketClosed
So, if you're sure it's a keep-alive issue, your solution is one of the following:
1. Try a JMeter nightly build (http://jmeter.apache.org/nightly.html): download the _bin and _lib files and unpack the archives into the same directory structure (the other archives are not needed to run JMeter), then adapt the value of httpclient4.idletimeout.
2. As a workaround, increase the retry count or add a connection stale check, as per https://wiki.apache.org/jmeter/JMeterSocketClosed (see the property sketch below).
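For reference, those tweaks would typically go into JMeter's user.properties; the property names exist in stock JMeter, but the values below are illustrative only (httpclient4.idletimeout is in milliseconds):
# Idle connection timeout to apply when the server sends no Keep-Alive header
httpclient4.idletimeout=200
# Number of retries for failed requests (default 0)
httpclient4.retrycount=1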
