I am using the below jars for SQS:
aws-java-sdk-core-1.11.397.jar
aws-java-sdk-sqs-1.11.397.jar
In my scenario I will be using the same SQS queue multiple times, getting an AmazonSQS object via AmazonSQSClientBuilder each time. I am wondering if we can cache this object to help improve performance.
Would caching really help, what would be the best approach to do it, and for how long can the object be cached?
Current scenario
I will be getting messages to post to SQS on a particular event whose frequency might vary from no events at all to around 10,000 per hour. This is the reason why I am thinking of caching it.
You should most definitely reuse the AmazonSQS instance that is returned from the AmazonSQSClientBuilder. If you're posting thousands of messages, you shouldn't need to make any other calls to SQS other than sendMessage.
You could also call sendMessageBatch if you have a lot of messages to send. However, SQS is extremely scalable, and 10,000 messages per hour will not even make it sweat, so I don't think you'll have anything to worry about.
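Here is a minimal sketch of what I mean, assuming SDK v1 (the class name and batching helper below are illustrative, not part of the SDK):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageBatchRequest;
import com.amazonaws.services.sqs.model.SendMessageBatchRequestEntry;

import java.util.ArrayList;
import java.util.List;

public class SqsPublisher {
    // Build the client once; it is thread-safe and meant to be long-lived.
    private static final AmazonSQS SQS = AmazonSQSClientBuilder.defaultClient();

    // Send messages in batches; SQS accepts at most 10 entries per batch call.
    public static void sendAll(String queueUrl, List<String> bodies) {
        List<SendMessageBatchRequestEntry> entries = new ArrayList<>();
        for (int i = 0; i < bodies.size(); i++) {
            entries.add(new SendMessageBatchRequestEntry("id-" + i, bodies.get(i)));
            if (entries.size() == 10 || i == bodies.size() - 1) {
                SQS.sendMessageBatch(new SendMessageBatchRequest(queueUrl, entries));
                entries = new ArrayList<>();
            }
        }
    }
}

There is no need to expire or rebuild the cached client; keep it for the lifetime of the application.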
I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka consumer runs into a timeout while processing the records? I mean the timeout controlled by the property session.timeout.ms.
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could already have been processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method on the consumer. Therefore this doesn't help me stop the processing loop for the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long for a comment.
Kafka has a few delivery semantics that can be used to process messages:
At most once;
At least once; and
Exactly once.
You are describing that you would like to use Kafka with exactly-once semantics (which, by the way, is the least common way of using Kafka). Also, producers need to play nicely, as by default Kafka can produce the same message more than once.
It's a lot more common to build services that use the at-least-once mechanism. In this way you can receive (or process) the same message more than once, but you need a way to deduplicate them (it's the same idea behind idempotency in HTTP APIs). You'll need something in the message that is unique, plus a register recording that that id has been processed already. If the payload has nothing you can use to deduplicate, you can add a header to the message and use that.
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
I would suggest googling a bit for details on how to implement the above.
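As a rough illustration of the deduplication idea, here is a minimal sketch, assuming the unique id travels in the record key (in production the processed-id register would live in a shared database or cache, not in memory):

import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentProcessor {
    // In production this register belongs in a database or cache shared by all consumers.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    public void process(ConsumerRecord<String, String> record) {
        String id = record.key(); // or a header / unique field inside the payload
        if (!processedIds.add(id)) {
            return; // already processed: skip the duplicate
        }
        // ... actual business logic goes here ...
    }
}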
Here's a blog post from Confluent about developing exactly-once semantics, Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka, and the Kafka docs explaining the different semantics.
About the point of the ConsumerRebalanceListener: you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records but not yet committed them to Kafka.
A mini tip I give to everyone who is starting with Kafka: Kafka looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works, including having done a good amount of negative testing (unless you are OK with losing data).
I need a solution for the following scenario which is similar to a queue:
I want to write messages to a queue continuously. My message is very big, containing a lot of data, so I want to make as few requests as possible.
So my queue will contain a lot of messages at some point.
My consumer will read from the queue every hour (not whenever a new message is written), and it will read all the messages from the queue.
The problem is that I need a way to read ALL the messages from the queue using only one call (I also want the consumer to make as few requests to the queue as possible).
A close solution would be ActiveMQ, but the problem is that you can only read one message at a time, and I need to read them all in one request.
So my question is: would there be other ways of doing this more efficiently? What I actually need is to persist, in some fashion, messages created continuously by one application, and then have the same application consume them (and delete them) all at once, every hour.
The reason I thought a queue would fit is that messages are deleted as they are consumed, but I need to consume them all at once.
I think there are some important things to keep in mind as you're searching for a solution:
In what way do you need to be "more efficient" (e.g. time, monetary cost, computing resources, etc.)?
It's incredibly hard to prove that there are, in fact, no other "more efficient" ways to solve a particular problem, as that would require one to test all possible solutions. What you really need to know is, given your specific use-case, what solution is good enough. This, of course, requires knowing specifically what kind of performance numbers you need and the constraints on acquiring those numbers (e.g. time, monetary cost, computing resources, etc.).
Modern message broker clients (e.g. those shipped with either ActiveMQ 5.x or ActiveMQ Artemis) don't make a network round-trip for every message they consume as that would be extremely inefficient. Rather, they fetch blocks of messages in configurable sizes (e.g. prefetchSize for ActiveMQ 5.x, and consumerWindowSize for ActiveMQ Artemis). Those messages are stored locally in a buffer of sorts and fed to the client application when the relevant API calls are made to receive a message.
Making "as few requests as possible" is rarely a way to increase performance. Modern message brokers scale well with concurrent consumers. Consuming all the messages with a single consumer drastically limits the message throughput as compared to spinning up multiple threads which each have their own consumer. Rather than limiting the number of consumer requests you should almost certainly be maximizing them until you reach a point of diminishing returns.
I have a problem with counting responses from a response queue. Once per day we run a job which gathers some data from the DB and sends it to a queue. When we receive all responses we should shut down the connection. The problem is how we can check whether all responses have arrived. Keeping this in a global variable is risky because of concurrency issues. Any ideas? I am quite new to JMS, so maybe the solution is obvious, but I don't see it.
I don't know what your stack is or what tools you might be using to accomplish this, but I have an approach in mind that might help you out (hopefully):
Generate a hash for each job you plan on queuing and store it in a concurrent list/map (e.g. ConcurrentHashMap).
Send the job to the queue.
Once the job is done and sends back a response, reproduce the hash and store it in a separate concurrent list/map that holds all the jobs that are done.
Now you have two lists: all the jobs supposed to be executed, and the jobs that you got a response from. There are multiple ways to compare them. If you look up Java concurrency, you'll find plenty of tutorials and documentation. I like to use CyclicBarrier and CountDownLatch. If you plan on using any of these, take extra precautions to prevent your application from hanging or, worse, a filthy memory leak.
Or you could simply check how many queuing requests and responses you have and, if they are equal, drop the connection.
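For the latch-based option, here's a minimal sketch, assuming you know the expected response count up front (the queue name and timeout are illustrative):

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ResponseCollector {
    public void awaitAllResponses(Connection connection, int expectedCount) throws Exception {
        CountDownLatch latch = new CountDownLatch(expectedCount);
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("response.queue"));
        consumer.setMessageListener((Message m) -> latch.countDown());
        connection.start();

        // Always wait with a timeout so the application cannot hang forever.
        if (!latch.await(10, TimeUnit.MINUTES)) {
            // Some responses never arrived: log, alert, or retry here.
        }
        connection.close();
    }
}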
We currently have a distributed setup where we publish events to SQS, and we have an application with multiple hosts that drains messages from the queue, does some transformation on them, and transmits them to interested parties. I have a use case where the receiving endpoint has scalability concerns with the message volume, so we would like to batch these messages periodically (say, every 15 minutes) in the application before sending them.
The incoming message rate is around 200 messages per second, and each message is no more than 10 KB. This system need not be real time, though that would definitely be good to have, and order is not important (it's okay if a batch containing older messages gets sent first).
One approach that I can think of is maintaining an embedded database within the application (on each host) that batches the events, plus another thread that runs periodically and clears the data.
Another approach could be to create timestamped buckets in a distributed key-value store (S3, DynamoDB, etc.), where we write each message to the correct bucket based on its timestamp and periodically clear the buckets.
We can run into several issues here: since the messages would be out of order, a bucket might have already been cleared (this can be solved with a default bucket, though), we would need to decide accurately when to clear a bucket, etc.
The way I see it, at least two components would be required: one which does the batching into temporary storage, and another that clears it.
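To make the first approach concrete, here is a rough sketch of what I have in mind, where an in-memory buffer stands in for the embedded database and send(batch) is a hypothetical call to the receiving endpoint:

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MessageBatcher {
    private final Queue<String> buffer = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public MessageBatcher() {
        // Drain the buffer every 15 minutes, as described above.
        scheduler.scheduleAtFixedRate(this::flush, 15, 15, TimeUnit.MINUTES);
    }

    public void accept(String message) {
        buffer.add(message);
    }

    private void flush() {
        List<String> batch = new ArrayList<>();
        String m;
        while ((m = buffer.poll()) != null) {
            batch.add(m);
        }
        if (!batch.isEmpty()) {
            // send(batch) -- hypothetical call to the receiving endpoint
        }
    }
}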
Any feedback on the above approaches would help. Also, this looks like a common problem; are there any existing solutions that I can leverage?
Thanks
I am building an application that reaches out to a FHIR API that implements paging, and only gives me a maximum of 100 results per page. However, our app requires the aggregation of these pages in order to hand over metadata to the UI about the entire result set.
When I loop through the pages of a large result set, I get HTTP status 429 - Too Many Requests. I am wondering if handing off these requests to a Kafka service will help me get around this issue and maybe increase performance. I've read through the Intro and Use Cases sections of the Kafka documentation, but am still unclear as to whether implementing this tool will help.
You're getting 429 errors because you're making too many requests too quickly; you need to implement rate limiting.
As far as whether to use Kafka, a big part of that is whether your result set can fit in memory. If you can fit it in memory, then I would really suggest avoiding bringing in a separate service (KISS). If not, then yes, you can use Kafka. But I'd suggest taking a long think about whether you can use a relational datastore, because they're much more flexible. Or maybe even read/write directly to disk.
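For the rate limiting itself, here's a minimal sketch, assuming Guava's RateLimiter is on the classpath (the permits-per-second figure is illustrative and should match the API's documented limits):

import com.google.common.util.concurrent.RateLimiter;

public class PagedFetcher {
    // Allow at most 5 page requests per second.
    private static final RateLimiter LIMITER = RateLimiter.create(5.0);

    static String fetchPage(String pageUrl) {
        LIMITER.acquire(); // blocks until a permit is available
        // ... perform the HTTP GET for pageUrl and return the body ...
        return "";
    }
}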
If I were you, before looking into Kafka, I would try to resolve why you are getting a 429 error. I would not leave that unaddressed.
I would look into the following:
1) Sleep your process. The server's response usually includes a Retry-After header with the number of seconds you are supposed to wait before retrying.
2) Exponential backoff. If the server's response does not tell you how long to wait, you can retry your request with pauses you insert yourself in between, increasing them on each attempt.
Do keep in mind that implementing sleep warrants extensive testing beforehand. You would have to make sure that your existing functionality does not get impacted.
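Putting both suggestions together, here's a hedged sketch using Java 11's HttpClient: honor Retry-After when present, otherwise back off exponentially (the retry count and initial delay are illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackoffClient {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static HttpResponse<String> getWithBackoff(String url) throws Exception {
        long delayMillis = 1000; // initial pause, doubled on each retry
        for (int attempt = 0; attempt < 5; attempt++) {
            HttpResponse<String> response = CLIENT.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                return response;
            }
            // Prefer the server's hint (assumed here to be in seconds, not an
            // HTTP-date); fall back to exponential backoff if it is absent.
            long waitMillis = response.headers().firstValue("Retry-After")
                    .map(s -> Long.parseLong(s) * 1000)
                    .orElse(delayMillis);
            Thread.sleep(waitMillis);
            delayMillis *= 2;
        }
        throw new RuntimeException("Still rate-limited after retries");
    }
}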
To answer your question of whether Kafka would help you or not: it may or may not, given the limited info in your question. Do understand that implementing Kafka would change your network architecture: you are bringing a streaming platform into the equation, and you would most probably implement caching to aggregate your results. But at the moment all these concepts are at a very high level. I would suggest that you first solve the 429 error and then determine whether there is a proper technical reason to implement Kafka that would improve your application's performance.