Currently we have a use case where we want to process some messages at a later point in time, after some conditions are met.
Is it possible to unacknowledge some Pub/Sub messages in an Apache Beam pipeline so that they become available again after the acknowledgement deadline (the "visibility timeout") expires and we can process them later?
You can't unack messages with Apache Beam. Once the messages are correctly ingested into the pipeline, they are acked automatically.
You can keep them in the pipeline and reprocess them until the conditions are met, but you could end up with congestion, or with Dataflow resources being wasted for nothing. It could be better to filter the messages beforehand, for instance in a Cloud Function that nacks the messages that aren't valid and publishes the valid ones to a target Pub/Sub topic.
As an alternative to @guillaume's suggestion, you can also store the "to-be-processed-later" messages (in raw format) in a storage system such as BigQuery or Cloud Bigtable. All the messages are acked by the pipeline, and the segregation is then done inside the pipeline: the "valid" messages are processed as usual while the "invalid" messages are preserved in storage for future processing.
Once the processing conditions are satisfied, the "invalid" messages can be retrieved from storage, processed, and then deleted from storage. This can be a viable solution if the "invalid" messages will only be processed after the Pub/Sub message retention period, which is 7 days.
The above workflow is inspired by this section of the Google Cloud blog. I considered the "invalid" messages to be "bad" data.
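A rough, untested sketch of that split with the Beam Java SDK (the subscription, the BigQuery table and the isValid() check are placeholders for your own logic):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Tags for the two branches.
final TupleTag<String> validTag = new TupleTag<String>() {};
final TupleTag<String> invalidTag = new TupleTag<String>() {};

PCollection<String> messages = pipeline.apply(
    PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"));

PCollectionTuple branches = messages.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        if (isValid(c.element())) {           // isValid() stands in for your business rule
          c.output(c.element());              // valid: continue normal processing
        } else {
          c.output(invalidTag, c.element());  // invalid: keep the raw payload for later
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

// branches.get(validTag) continues through the normal processing steps.
// Every message is acked by the pipeline; the "invalid" ones are preserved in BigQuery
// so they can be queried and re-injected once the processing conditions are met.
branches.get(invalidTag)
    .apply(BigQueryIO.<String>write()
        .to("my-project:my_dataset.deferred_messages")
        .withFormatFunction(payload -> new TableRow().set("raw_message", payload))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```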
Related
The Java application posts async jobs to AWS and gets back a JobID. When the async job is finished, a message will appear in an SQS queue with that JobID. Each JobID is handled by a different thread. Each of those threads also polls SQS for messages until it finds the message which contains its JobID. Additionally, the application is distributed into multiple services so there can't be a single SQS processor.
I saw that SQS returns a maximum of 10 messages and after they are returned, a visibility timeout is applied so that they are not re-sent to other consumers. However, my consumers are the threads that want to consume only a single message and let the rest be consumed by other threads. Should I set the visibility timeout to 0? Will this make it so all consumers get the same set of 10 messages on every request? What's the best way for each consumer to sift through all the messages and find the one it wants?
TL;DR: SQS has 100 messages and there are 100 consumers, one for each message. How should I go about having each consumer find the message it wants (based on a JobID)?
EDIT: I know that this is not an appropriate usage of SQS and I'd be very glad to not use it at all but our main integration is with Amazon Textract for which it is mandatory to use SQS for its asynchronous operations. Each Textract request is processed by a different thread which means that they each need to get back a specific SQS message, consumers are not universal. Not to mention the possibility of a clustered environment for which I'd like to avoid having to do any synchronization...
EDIT 2: This is for an on-premises, Setup.exe based, dev-hands-off application where we want to minimize the amount of unneeded AWS services used (both for cost and for customer setup/maintenance reasons) as well as the use of external components, again to minimize customer deployment/maintenance/servers. I understand that we are living in the world of microservices but there are still applications that want to benefit from intelligent services without being cloud-native themselves.
This is not an appropriate architecture for using Amazon SQS. Your processes should not be trying to find a specific message from an Amazon SQS queue.
You should find a different architecture for this message-passing task. Some ideas:
Create a message in Amazon S3 with an 'expected' Key. Have each thread look for that object as a return message. (This is effectively using Amazon S3 as a Key-Value Store; see the sketch after this list.)
Have a single Lambda function retrieve messages from SQS and update a database (or S3 as above). Then, have the threads consult the database instead of SQS.
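For the S3 idea, the thread-side polling could look roughly like this (AWS SDK for Java v2; the bucket and the key convention are assumptions, whatever writes the result object would need to follow the same convention):

```java
import java.time.Duration;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;

// Each thread waits for "its" result object, keyed by JobID.
String waitForResult(S3Client s3, String bucket, String jobId) throws InterruptedException {
  String key = "textract-results/" + jobId + ".json";    // hypothetical key convention
  while (true) {
    try {
      ResponseBytes<?> bytes = s3.getObjectAsBytes(
          GetObjectRequest.builder().bucket(bucket).key(key).build());
      return bytes.asUtf8String();                        // the return message for this JobID
    } catch (NoSuchKeyException notYetThere) {
      Thread.sleep(Duration.ofSeconds(5).toMillis());     // poll until the object appears
    }
  }
}
```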
I think you need to put something in between SQS and your threads. Like a DynamoDB table. You could have a Lambda function that processes all the SQS messages and just translates them into DynamoDB records. Then your different threads could easily check for the specific records they are interested in using a DynamoDB query.
Just because Textract mandates that you use SQS doesn't mean the final step in your architecture has to read those messages directly from SQS. In this case SQS is just a message bus that can integrate with other services in AWS, and those services are your building blocks you can use to create the architecture you need.
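A rough sketch of the thread-side lookup that goes with this (AWS SDK for Java v2; the textract_jobs table and its jobId key are hypothetical, the Lambda that drains SQS would define the real schema):

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;

// Each thread looks up only the record for its own JobID.
Map<String, AttributeValue> findJobResult(DynamoDbClient dynamoDb, String jobId) {
  GetItemResponse response = dynamoDb.getItem(GetItemRequest.builder()
      .tableName("textract_jobs")                                   // hypothetical table
      .key(Map.of("jobId", AttributeValue.builder().s(jobId).build()))
      .build());
  return response.hasItem() ? response.item() : null;               // null means "not finished yet"
}
```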
My system background:
I have a Scala application that reads events from various queue technologies, transforms the data and, as a final step, sends records to Kinesis Firehose.
My Firehose currently has an S3 bucket as its destination, but I don't want to bind the solution to that, since it could change in the future.
Problem statement
AWS guarantees a monthly uptime percentage of at least 99.9% (meaning up to ~44 minutes of downtime per month) - see the Service Level Agreement.
In addition, when quota limits are reached, putRecord will return a ServiceUnavailableException - see PutRecord.
As part of the flow I can't afford any event loss, so I'm searching for some sort of solution for Firehose unavailability.
We are thinking about an error handler that will:
Publish to SQS each event that failed to be published to Firehose (sketched below).
When Firehose gets back to work, SNS will trigger an AWS Lambda function.
The Lambda function will publish all the events from SQS into Firehose.
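Sketched in code, the fallback step could look roughly like this (AWS SDK for Java v2, also usable from Scala; the stream name and queue URL are placeholders, not a definitive implementation):

```java
import java.nio.charset.StandardCharsets;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;
import software.amazon.awssdk.services.firehose.model.ServiceUnavailableException;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

void deliver(FirehoseClient firehose, SqsClient sqs, String event) {
  try {
    firehose.putRecord(PutRecordRequest.builder()
        .deliveryStreamName("my-delivery-stream")               // placeholder stream name
        .record(Record.builder()
            .data(SdkBytes.fromString(event, StandardCharsets.UTF_8))
            .build())
        .build());
  } catch (ServiceUnavailableException e) {
    // Firehose is throttling or unavailable: park the raw event in SQS so a Lambda
    // can replay it into Firehose later.
    sqs.sendMessage(SendMessageRequest.builder()
        .queueUrl("https://sqs.eu-west-1.amazonaws.com/123456789012/firehose-retry")  // placeholder
        .messageBody(event)
        .build());
  }
}
```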
My question
I'm wondering if that solution is good enough, and whether there is a best practice for handling errors when calling putRecord on Firehose?
Is it possible to somehow get the messageId field of a Pub/Sub message in a DoFn after using the PubsubIO Beam source to read the messages?
I need the default id which was assigned by the Pub/Sub service. I want to log it for debugging purposes.
Using a custom attribute for the unique id and the withIdAttribute() method is not possible for me, because I have no influence on the publisher in this case.
I use the 2.2.0 version of the Dataflow Java SDK.
Support for reading the Pub/Sub message id was added starting with Beam v2.16.0. To turn it on, replace .readMessages() with .readMessagesWithMessageId() in your pipeline setup; after that change it is as easy as calling message.getMessageId().
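For example (Beam 2.16.0 or later; the project and subscription names are placeholders):

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.LoggerFactory;

pipeline
    .apply(PubsubIO.readMessagesWithMessageId()
        .fromSubscription("projects/my-project/subscriptions/my-sub"))
    .apply(ParDo.of(new DoFn<PubsubMessage, PubsubMessage>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // The Pub/Sub-assigned id is now populated on the element; log it for debugging.
        LoggerFactory.getLogger("PubsubMessageIds")
            .info("messageId: {}", c.element().getMessageId());
        c.output(c.element());
      }
    }));
```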
For debugging purposes you can use the seek feature: create a snapshot of the subscription, and you can then replay its messages (by seeking back to the snapshot) whenever needed.
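Roughly, with the Java admin client (project, subscription and snapshot names are placeholders; this is only a sketch of the snapshot-then-seek sequence):

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.ProjectSnapshotName;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.SeekRequest;

try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
  String subscription = ProjectSubscriptionName.format("my-project", "my-sub");
  String snapshot = ProjectSnapshotName.format("my-project", "debug-snapshot");

  // Capture the subscription's backlog.
  admin.createSnapshot(snapshot, subscription);

  // ... later, replay by seeking the subscription back to the snapshot.
  admin.seek(SeekRequest.newBuilder()
      .setSubscription(subscription)
      .setSnapshot(snapshot)
      .build());
}
```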
I'm trying to implement a service which consumes a Google Pub/Sub subscription at its own pace. By that, I mean I need fine control over when I consume messages, i.e. get a batch of messages, pause for a while, not get more than X messages...
Using the Google client libraries I did not find a way to do this, as the MessageReceiver runs in its own thread and I don't have any control over what exactly happens.
Basically, being able to consume messages in a synchronous way should solve my issue.
Do you know how I can use the Google client libs synchronously? Or is there another way in the API that I missed?
You might try using setFlowControlSettings when you build your subscriber. In particular, you can use setMaxOutstandingElementCount or setMaxOutstandingRequestBytes to limit the messages sent to your MessageReceiver. When you have enough messages outstanding, i.e., messages for which you have not called Ack() or Nack(), to exceed these limits, then your MessageReceiver will not be called until messages have been acked or nacked.
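For example (Java client library; the subscription name, the limits and the process() call are placeholders):

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

ProjectSubscriptionName subscription = ProjectSubscriptionName.of("my-project", "my-sub");

MessageReceiver receiver = (message, consumer) -> {
  // Work at your own pace; more messages only flow once you ack/nack outstanding ones.
  process(message);   // hypothetical handler
  consumer.ack();
};

// Cap how many messages can be outstanding (delivered but not yet acked/nacked) at once.
FlowControlSettings flowControl = FlowControlSettings.newBuilder()
    .setMaxOutstandingElementCount(10L)                // at most 10 unacked messages
    .setMaxOutstandingRequestBytes(10L * 1024 * 1024)  // or 10 MB of unacked data
    .build();

Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
    .setFlowControlSettings(flowControl)
    .build();
subscriber.startAsync().awaitRunning();
```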
How can we keep track of every message that gets into our Java Message Queue? We need to save the messages for later reference. We already log them to an application log (log4j), but we need to query them later.
You can store them
in memory - in a collection or in an in-memory database
in a standalone database
You could create a database logging table for the messages, storing the message as-is in a BLOB column, the timestamp at which it was created/posted to the MQ, and a simple counter as the primary key. You can also add fields like message type etc. if you want to create statistical reports on the messages sent.
Cleanup of the table can be done simply by deleting all messages older than the retention period, using the timestamp column.
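A rough sketch of the insert and the cleanup with plain JDBC (the MESSAGE_LOG table and its columns are hypothetical, adapt them to your schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

// Assumes a hypothetical table:
//   MESSAGE_LOG(ID auto-increment PK, CREATED timestamp, MSG_TYPE varchar, PAYLOAD blob)
void logMessage(Connection db, String msgType, byte[] rawMessage) throws SQLException {
  try (PreparedStatement insert = db.prepareStatement(
      "INSERT INTO MESSAGE_LOG (CREATED, MSG_TYPE, PAYLOAD) VALUES (?, ?, ?)")) {
    insert.setTimestamp(1, Timestamp.from(Instant.now()));
    insert.setString(2, msgType);
    insert.setBytes(3, rawMessage);      // store the message as-is in the BLOB column
    insert.executeUpdate();
  }
}

// Periodic cleanup: delete everything older than the retention period.
void purgeOldMessages(Connection db, int retentionDays) throws SQLException {
  try (PreparedStatement delete = db.prepareStatement(
      "DELETE FROM MESSAGE_LOG WHERE CREATED < ?")) {
    delete.setTimestamp(1, Timestamp.from(Instant.now().minus(Duration.ofDays(retentionDays))));
    delete.executeUpdate();
  }
}
```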
I implemented such a solution in the past: we chose to store messages with all their characteristics in a database and developed a search, replay and cancel application on top of it. This is the Message Store pattern (diagram source: eaipatterns.com).
We also used this application for the Dead Letter Channel (diagram source: eaipatterns.com).
If you don't want to build a custom solution, have a look at the ReplayService for JMS from CodeStreet.
The best way to do this is to use whatever tracing facility your middleware provider offers. Or possibly, you could set up an intermediate listener whose only job is to log messages and forward them on to your existing application.
In most cases, you will find that the middleware provider already has the ability to do this for you with no changes or awareness by your application.
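A minimal sketch of such an intermediate listener with JMS and log4j (class and destination names are made up, and error handling is simplified):

```java
import javax.jms.Destination;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.log4j.Logger;

// A pass-through listener: it sits on the original queue, logs every message,
// then forwards it unchanged to the destination your existing application reads from.
public class AuditingForwarder implements MessageListener {
  private static final Logger LOG = Logger.getLogger(AuditingForwarder.class);
  private final MessageProducer forwardTo;

  public AuditingForwarder(Session session, Destination applicationQueue) throws JMSException {
    this.forwardTo = session.createProducer(applicationQueue);
  }

  @Override
  public void onMessage(Message message) {
    try {
      LOG.info("Audited message " + message.getJMSMessageID());
      forwardTo.send(message);   // the existing application consumes from its usual destination
    } catch (JMSException e) {
      LOG.error("Failed to audit/forward message", e);
    }
  }
}
```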
I would change the queue to a topic, and then keep the original consumer that processes the messages, and add another consumer for auditing the messages to a database.
Some JMS providers cater for topic-to-queue-bridge definitions, the consumers then receive from their own dedicated queues, and don't have to read past messages that are left on the queue due to other consumers being inactive.
Alternatively, you could write a log4j appender that writes your logged messages to a database.
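A bare-bones sketch of such an appender for log4j 1.x (the MESSAGE_LOG table is hypothetical and the connection handling is deliberately simplified; log4j 1.x also ships an org.apache.log4j.jdbc.JDBCAppender you could configure instead):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

// Every logged message ends up as a row in a hypothetical MESSAGE_LOG table.
public class MessageDbAppender extends AppenderSkeleton {
  private final Connection db;

  public MessageDbAppender(Connection db) {
    this.db = db;
  }

  @Override
  protected void append(LoggingEvent event) {
    try (PreparedStatement insert = db.prepareStatement(
        "INSERT INTO MESSAGE_LOG (CREATED, PAYLOAD) VALUES (CURRENT_TIMESTAMP, ?)")) {
      insert.setString(1, event.getRenderedMessage());
      insert.executeUpdate();
    } catch (Exception e) {
      errorHandler.error("Could not write log message to database", e, 0);
    }
  }

  @Override
  public void close() { /* connection is owned by the caller */ }

  @Override
  public boolean requiresLayout() {
    return false;
  }
}
```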