I have a problem trying to access a PubSub message's attributes.
The error message is the following:
Coder of type class org.apache.beam.sdk.coders.SerializableCoder has a #structuralValue method which does not return true when the encoding of the elements is equal.
stackTrace: [org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage.getAttribute(PubsubMessage.java:56),
transform1$3.processElement(transform1.java:37),
transform1$3$DoFnInvoker.invokeProcessElement(Unknown Source),
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218),
org.apache.beam.repackaged.direct_java.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183),
org.apache.beam.repackaged.direct_java.runners.core.SimplePushbackSideInputDoFnRunner.processElementInReadyWindows(SimplePushbackSideInputDoFnRunner.java:78),
org.apache.beam.runners.direct.ParDoEvaluator.processElement(ParDoEvaluator.java:216),
org.apache.beam.runners.direct.DoFnLifecycleManagerRemovingTransformEvaluator.processElement(DoFnLifecycleManagerRemovingTransformEvaluator.java:54),
org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:160), org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:124),
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511),
java.util.concurrent.FutureTask.run(FutureTask.java:266),
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149),
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624),
java.lang.Thread.run(Thread.java:748)]
I'm using the Dataflow Eclipse SDK to run the pipeline locally:
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>${beam.version}</version>
  <scope>runtime</scope>
</dependency>
The line of code which produces the error is this:
String fieldId = c.element().getAttribute("evId");
The full code of the PTransform is the following:
public class transform1 extends DoFn<PubsubMessage, Event> {

    public static TupleTag<ErrorHandler> failuresTag = new TupleTag<ErrorHandler>(){};
    public static TupleTag<Event> validTag = new TupleTag<Event>(){};

    public static PCollectionTuple process(PCollection<PubsubMessage> logStrings) {
        return logStrings.apply("Create PubSub objects", ParDo.of(new DoFn<PubsubMessage, Event>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                try {
                    Event event = new Event();
                    String fieldId = c.element().getAttribute("evId");
                    event.evId = "asa"; // this line is just to test setting a value
                    c.output(event);
                    <...>
I have seen a similar question, but I'm not sure how I could fix it.
The main pipeline code (if needed):
public static PipelineResult run(Options options) {
    Pipeline pipeline = Pipeline.create(options);

    /*
     * Step 1: Read from PubSub
     */
    PCollection<PubsubMessage> messages = null;
    if (options.getUseSubscription()) {
        messages = pipeline.apply("ReadPubSubSubscription", PubsubIO.readMessagesWithAttributes()
            .fromSubscription(options.getInputSubscription()).withIdAttribute("messageId"));
    } else {
        messages = pipeline.apply("ReadPubSubTopic", PubsubIO.readMessagesWithAttributes()
            .fromTopic(options.getInputTopic()).withIdAttribute("messageId"));
    }

    /*
     * Step 2: Transform PubsubMessage to Event
     */
    PCollectionTuple eventCollections = transform1.process(messages);
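For reference, downstream steps read the two outputs back out of the tuple via the tags defined on transform1 (a sketch only; the rest of the pipeline is omitted here):

// Successful conversions and failures come out of the same PCollectionTuple.
PCollection<Event> events = eventCollections.get(transform1.validTag);
PCollection<ErrorHandler> failures = eventCollections.get(transform1.failuresTag);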
PubSub message:
{ "evId":"id", "payload":"payload" }
I also tried it as:
"{ "evId":"id", "payload":"payload" }"
This is how I publish the message in PubSub (through the PubSub dashboard) to test the pipeline.
Edit: after more testing, the way I was publishing to PubSub turned out to be the source of the error: if I add "evId" as a message attribute instead of putting it in the message body, the problem disappears.
The reason is that I was trying to access an attribute here:
String fieldId = c.element().getAttribute("evId");
But when I sent the message through the PubSub dashboard I didn't add any attributes, and that caused the whole pipeline to crash.
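A sketch of how the DoFn could guard against messages published without attributes (the fallback behaviour here is just an illustration, not part of my original code; java.util.Map is assumed imported):

@ProcessElement
public void processElement(ProcessContext c) {
    // getAttributeMap() can be null (and getAttribute(...) can fail) when the
    // publisher set no attributes at all, so guard before using "evId".
    Map<String, String> attrs = c.element().getAttributeMap();
    String fieldId = (attrs == null) ? null : attrs.get("evId");
    if (fieldId == null) {
        // Hypothetical fallback: skip the element (or route it to a failures output).
        return;
    }
    Event event = new Event();
    event.evId = fieldId;
    c.output(event);
}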
Related
I'm trying to create a BigQuery table live (on the fly), before the insert process itself. Here is the code of the PTransform that I'm using -> Link
I would like to apply this transform to PubSub messages that will be inserted into a BQ table later.
Phase 1. Getting pubsub messages:
PCollection<PubsubMessage> messages =
    pipeline.apply(
        "ReadPubSubSubscription",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription(options.getInputSubscription()));
Phase 2. Convert all pubsub messages to TableRow:
PCollectionTuple convertedTableRows =
    messages.apply("ConvertMessageToTableRow", new PubsubMessageToTableRow(options));
Phase 3. Here is the problem: I need to check whether the table exists and upload the result to BQ:
### here is the schema for our BQ table
public static final Schema schema1 =
    Schema.of(
        Field.of("name", StandardSQLTypeName.STRING),
        Field.of("post_abbr", StandardSQLTypeName.STRING));
### here is the method that we are using to extract the table name from the pubsub attributes
static class PubSubAttributeExtractor implements SerializableFunction<ValueInSingleWindow<TableRow>, String> {

    private final String attribute;

    public PubSubAttributeExtractor(String attribute) {
        this.attribute = attribute;
    }

    @Override
    public String apply(ValueInSingleWindow<TableRow> input) {
        TableRow row = input.getValue();
        String tableName = (String) row.get("name");
        return "my-project:myDS.pubsub_" + tableName;
    }
}
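As a quick sanity check, the extractor can be exercised directly on a wrapped row (a sketch; the row contents, window, and pane values are just placeholders):

// Hypothetical row, only to exercise the extractor outside a pipeline.
TableRow row = new TableRow().set("name", "clicks");
String tableSpec = new PubSubAttributeExtractor("event_name").apply(
    ValueInSingleWindow.of(row, Instant.now(), GlobalWindow.INSTANCE, PaneInfo.NO_FIRING));
// tableSpec == "my-project:myDS.pubsub_clicks"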
### here is the part that doesn't work
WriteResult writeResult = convertedTableRows.get(TRANSFORM_OUT)
    .apply(new BigQueryAutoCreateTable(
        new PubSubAttributeExtractor("event_name"), schema1))
    .apply(
        "WriteSuccessfulRecords",
        BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withExtendedErrorInfo()
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .to(new ProbPartitionDestinations(options.getOutputTableSpec())));
Error logs:
cannot find symbol
symbol: method apply(java.lang.String,org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write<com.google.api.services.bigquery.model.TableRow>)
location: interface org.apache.beam.sdk.values.POutput
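For context, that compiler error appears when the result of the first apply is not a PCollection: apply(String, PTransform) is defined on PCollection, not on the plain POutput interface. A sketch of chaining that would compile, assuming BigQueryAutoCreateTable is declared as PTransform<PCollection<TableRow>, PCollection<TableRow>> (its code isn't shown here, so that is an assumption):

PCollection<TableRow> prepared = convertedTableRows.get(TRANSFORM_OUT)
    .apply(new BigQueryAutoCreateTable(
        new PubSubAttributeExtractor("event_name"), schema1));

WriteResult writeResult = prepared.apply(
    "WriteSuccessfulRecords",
    BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo()
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .to(new ProbPartitionDestinations(options.getOutputTableSpec())));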
In my code I have two Cloud Functions, cf1 and cf2. cf1 is triggered via PubSub topic t1 by a Cloud Scheduler cron job every 10 minutes; it creates a list and sends it to topic t2, which triggers cf2. When I use Google's example for cf2 I can see my message and it works. However, when I deploy my own code and log the message, this is what I see:
cf2.accept:81) - data
.accept:83) - ms {"data_":{"bytes":[],"hash":0},"messageId_":"","orderingKey_":"","memoizedIsInitialized":-1,"unknownFields":{"fields":{},"fieldsDescending":{}},"memoizedSize":-1,"memoizedHashCode":0}
My code is:
public class cf2 implements BackgroundFunction<PubsubMessage> {

    @Override
    public void accept(PubsubMessage message, Context context) throws Exception {
        if (message.getData() == null) {
            logger.info("No message provided");
            return;
        }
        String messageString = new String(
            Base64.getDecoder().decode(message.getData().toStringUtf8()),
            StandardCharsets.UTF_8);
        logger.info(messageString);

        logger.info("Starting the job");
        String data = message.getData().toStringUtf8();
        logger.info("data " + data);
        String ms = new Gson().toJson(message);
        logger.info("ms " + ms);
    }
}
But when I use Google's example code:
package com.example;

import com.example.Example.PubSubMessage;
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.util.Base64;
import java.util.Map;
import java.util.logging.Logger;

public class Example implements BackgroundFunction<PubSubMessage> {
    private static final Logger logger = Logger.getLogger(Example.class.getName());

    @Override
    public void accept(PubSubMessage message, Context context) {
        String data = message.data != null
            ? new String(Base64.getDecoder().decode(message.data))
            : "empty message";
        logger.info(data);
    }

    public static class PubSubMessage {
        String data;
        Map<String, String> attributes;
        String messageId;
        String publishTime;
    }
}
I see my message body very neatly in the logs. Can someone help me with what is wrong with my code?
Here's how I deploy my function:
gcloud --project=${PROJECT_ID} functions deploy \
cf2 \
--entry-point=path.to.cf2 \
--runtime=java11 \
--trigger-topic=t2 \
--timeout=540 \
--source=folder \
--set-env-vars="PROJECT_ID=${PROJECT_ID}" \
--vpc-connector=projects/${PROJECT_ID}/locations/us-central1/connectors/appengine-default-connect
and when I log message.getData() I get <ByteString@37c278a2 size=0 contents=""> while I know the message is not empty (I made another test subscription on the topic that lets me see the message there).
You need to define what a PubSub message is. This part is missing in your code, and it's unclear which PubsubMessage type you are using:
public static class PubSubMessage {
    String data;
    Map<String, String> attributes;
    String messageId;
    String publishTime;
}
It should solve your issue. Let me know.
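For illustration, here is a minimal sketch of cf2 built around that POJO (the decoding mirrors Google's example; treat the logging details as assumptions about what you want to see):

import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.logging.Logger;

public class cf2 implements BackgroundFunction<cf2.PubSubMessage> {

    private static final Logger logger = Logger.getLogger(cf2.class.getName());

    // Plain POJO matching the JSON envelope the Functions runtime delivers.
    public static class PubSubMessage {
        String data;
        Map<String, String> attributes;
        String messageId;
        String publishTime;
    }

    @Override
    public void accept(PubSubMessage message, Context context) {
        if (message.data == null) {
            logger.info("No message provided");
            return;
        }
        // The data field is base64-encoded by Pub/Sub; decode it once.
        String body = new String(Base64.getDecoder().decode(message.data), StandardCharsets.UTF_8);
        logger.info("data " + body);
    }
}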
I've only seen one thread containing information about the topic I've mentioned, which is:
How to Deserialising Kafka AVRO messages using Apache Beam
However, after trying a few variations of Kafka serializers, I still cannot deserialize Kafka messages. Here's my code:
public class Readkafka {
    private static final Logger LOG = LoggerFactory.getLogger(Readkafka.class);

    public static void main(String[] args) throws IOException {
        // Create the Pipeline object with the options we defined above.
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        PTransform<PBegin, PCollection<KV<action_states_pkey, String>>> kafka =
            KafkaIO.<action_states_pkey, String>read()
                .withBootstrapServers("mybootstrapserver")
                .withTopic("action_States")
                .withKeyDeserializer(MyClassKafkaAvroDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .updateConsumerProperties(ImmutableMap.of("schema.registry.url", (Object) "schemaregistryurl"))
                .withMaxNumRecords(5)
                .withoutMetadata();

        p.apply(kafka)
            .apply(Keys.<action_states_pkey>create());
    }
}
where MyClassKafkaAvroDeserializer is:
public class MyClassKafkaAvroDeserializer extends AbstractKafkaAvroDeserializer
        implements Deserializer<action_states_pkey> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        configure(new KafkaAvroDeserializerConfig(configs));
    }

    @Override
    public action_states_pkey deserialize(String s, byte[] bytes) {
        return (action_states_pkey) this.deserialize(bytes);
    }

    @Override
    public void close() {}
}
and the class action_states_pkey is code generated from avro tools using
java -jar pathtoavrotools/avro-tools-1.8.1.jar compile schema pathtoschema/action_states_pkey.avsc destination path
where the action_states_pkey.avsc is literally
{"type":"record","name":"action_states_pkey","namespace":"namespace","fields":[{"name":"ad_id","type":["null","int"]},{"name":"action_id","type":["null","int"]},{"name":"state_id","type":["null","int"]}]}
With this code I'm getting the error:
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.mudah.beam.test.action_states_pkey
at my.mudah.beam.test.MyClassKafkaAvroDeserializer.deserialize(MyClassKafkaAvroDeserializer.java:20)
at my.mudah.beam.test.MyClassKafkaAvroDeserializer.deserialize(MyClassKafkaAvroDeserializer.java:1)
at org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.advance(KafkaUnboundedReader.java:221)
at org.apache.beam.sdk.io.BoundedReadFromUnboundedSource$UnboundedToBoundedSourceAdapter$Reader.advanceWithBackoff(BoundedReadFromUnboundedSource.java:279)
at org.apache.beam.sdk.io.BoundedReadFromUnboundedSource$UnboundedToBoundedSourceAdapter$Reader.start(BoundedReadFromUnboundedSource.java:256)
at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:592)
... 14 more
It seems there's an error in trying to map the Avro data to my custom class?
Alternatively, I've tried the following code:
PTransform<PBegin, PCollection<KV<action_states_pkey, String>>> kafka =
    KafkaIO.<action_states_pkey, String>read()
        .withBootstrapServers("bootstrapserver")
        .withTopic("action_states")
        .withKeyDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(action_states_pkey.class))
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(ImmutableMap.of("schema.registry.url", (Object) "schemaregistry"))
        .withMaxNumRecords(5)
        .withoutMetadata();

p.apply(kafka)
    .apply(Keys.<action_states_pkey>create());
//  .apply("ExtractWords", ParDo.of(new DoFn<action_states_pkey, String>() {
//      @ProcessElement
//      public void processElement(ProcessContext c) {
//          action_states_pkey key = c.element();
//          c.output(key.getAdId().toString());
//      }
//  }));
which does not give me any error until I try to print out the data. I have to verify that I'm successfully reading the data one way or another, so my intent here is to log the data to the console. If I uncomment the commented section, I get the same error once again:
SEVERE: 2019-09-13T07:53:56.168Z: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.mudah.beam.test.action_states_pkey
at my.mudah.beam.test.Readkafka$1.processElement(Readkafka.java:151)
Another thing to note is that if I specify:
.updateConsumerProperties(ImmutableMap.of("specific.avro.reader", (Object) "true"))
it always gives me the error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 443
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class NAMESPACE.action_states_pkey specified in writer's schema whilst finding reader's schema for a SpecificRecord.
It seems there's something wrong with my approach?
If anyone has any experience reading AVRO data from Kafka Streams using Apache Beam, please do help me out. I greatly appreciate it.
Here's a snapshot of my package with the schema and class in it as well:
package/working path details
Thanks.
public class MyClassKafkaAvroDeserializer extends
AbstractKafkaAvroDeserializer
Your class extends AbstractKafkaAvroDeserializer, whose deserialize method returns a GenericRecord.
You need to convert that GenericRecord to your custom object.
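For example, a rough sketch of that conversion inside your deserializer (the setter names are assumptions based on the avro-tools generated class for the .avsc you posted; org.apache.avro.generic.GenericRecord is assumed imported):

@Override
public action_states_pkey deserialize(String s, byte[] bytes) {
    // The parent deserialize(...) hands back a GenericRecord unless
    // specific/reflect reading is configured, so map the fields by hand.
    GenericRecord record = (GenericRecord) this.deserialize(bytes);
    action_states_pkey key = new action_states_pkey();
    key.setAdId((Integer) record.get("ad_id"));
    key.setActionId((Integer) record.get("action_id"));
    key.setStateId((Integer) record.get("state_id"));
    return key;
}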
OR
Use SpecificRecord for this as stated in one of the following answers:
/**
 * Extends deserializer to support ReflectData.
 *
 * @param <V>
 *     value type
 */
public abstract class ReflectKafkaAvroDeserializer<V> extends KafkaAvroDeserializer {

    private Schema readerSchema;
    private DecoderFactory decoderFactory = DecoderFactory.get();

    protected ReflectKafkaAvroDeserializer(Class<V> type) {
        readerSchema = ReflectData.get().getSchema(type);
    }

    @Override
    protected Object deserialize(
            boolean includeSchemaAndVersion,
            String topic,
            Boolean isKey,
            byte[] payload,
            Schema readerSchemaIgnored) throws SerializationException {

        if (payload == null) {
            return null;
        }

        int schemaId = -1;
        try {
            ByteBuffer buffer = ByteBuffer.wrap(payload);
            if (buffer.get() != MAGIC_BYTE) {
                throw new SerializationException("Unknown magic byte!");
            }
            schemaId = buffer.getInt();
            Schema writerSchema = schemaRegistry.getByID(schemaId);

            int start = buffer.position() + buffer.arrayOffset();
            int length = buffer.limit() - 1 - idSize;
            DatumReader<Object> reader = new ReflectDatumReader(writerSchema, readerSchema);
            BinaryDecoder decoder = decoderFactory.binaryDecoder(buffer.array(), start, length, null);
            return reader.read(null, decoder);
        } catch (IOException e) {
            throw new SerializationException("Error deserializing Avro message for id " + schemaId, e);
        } catch (RestClientException e) {
            throw new SerializationException("Error retrieving Avro schema for id " + schemaId, e);
        }
    }
}
The above is copied from https://stackoverflow.com/a/39617120/2534090
https://stackoverflow.com/a/42514352/2534090
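To plug it into KafkaIO you would also need a concrete subclass with a no-arg constructor; a sketch (the class name is hypothetical, and the raw Class cast mirrors your second attempt):

// Hypothetical concrete subclass for the generated key type.
public class ActionStatesPkeyDeserializer extends ReflectKafkaAvroDeserializer<action_states_pkey> {
    public ActionStatesPkeyDeserializer() {
        super(action_states_pkey.class);
    }
}

// ...wired into KafkaIO together with an explicit coder for the key:
KafkaIO.<action_states_pkey, String>read()
    .withBootstrapServers("mybootstrapserver")
    .withTopic("action_states")
    .withKeyDeserializerAndCoder((Class) ActionStatesPkeyDeserializer.class, AvroCoder.of(action_states_pkey.class))
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(ImmutableMap.of("schema.registry.url", (Object) "schemaregistryurl"));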
While trying to send a message to an AWS SNS topic using the com.amazonaws.services.sns Java module, I am stuck on the following error:
shaded.com.amazonaws.services.sns.model.InvalidParameterException: Invalid parameter: Message too long (Service: AmazonSNS; Status Code: 400; Error Code: InvalidParameter; Request ID: 3b01ce49-a37d-5aba-bec2-9ab9d5446aea)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1257)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1029)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:741)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:715)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:697)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:665)
at shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:647)
at shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:511)
at shaded.com.amazonaws.services.sns.AmazonSNSClient.doInvoke(AmazonSNSClient.java:2270)
at shaded.com.amazonaws.services.sns.AmazonSNSClient.invoke(AmazonSNSClient.java:2246)
at shaded.com.amazonaws.services.sns.AmazonSNSClient.executePublish(AmazonSNSClient.java:1698)
at shaded.com.amazonaws.services.sns.AmazonSNSClient.publish(AmazonSNSClient.java:1675)
The following is the AmazonSNS helper class. It manages client creation and publishing messages to the SNS topic.
import java.io.Serializable;

import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.amazonaws.services.sns.model.PublishRequest;
import com.amazonaws.services.sns.model.PublishResult;

public class AWSSNS implements Serializable {

    private static final long serialVersionUID = -4175291946259141176L;

    protected AmazonSNS client;

    public AWSSNS() {
        this.client = AmazonSNSClientBuilder.standard().withRegion("us-west-2").build();
    }

    public AWSSNS(AmazonSNS client) {
        this.client = client;
    }

    public AmazonSNS getSnsClient() {
        return this.client;
    }

    public void setSqsClient(AmazonSNS client) {
        this.client = client;
    }

    public boolean sendMessages(String topicArn, String messageBody) {
        PublishRequest publishRequest = new PublishRequest(topicArn, messageBody);
        PublishResult publishResult = this.client.publish(publishRequest);
        if (publishResult != null && publishResult.getMessageId() != null) {
            return true;
        } else {
            return false;
        }
    }
}
The following is the code snippet from where the AWSSNS helper class is called. It does nothing but create a message of String data type and send it forward along with the topic ARN.
HashMap<String, String> variable_a = new HashMap<String, String>();
Gson gson = new Gson();
for (Object_a revoke : Object_a) {
    Object_a operation = someMethod1(revoke);
    String serializedOperation = gson.toJson(operation);
    variable_a.put(revoke.someMethod2(), serializedOperation);
    String message = gson.toJson(variable_a);
    LOG.info(String.format("SNS message: %s", message));
    this.awsSNS.sendMessages(topicARN, message);
}
So basically the error is thrown from inside sendMessages.
Found the solution to the problem.
An AWS SNS message has a fixed maximum size (256 KB). Publishing a message larger than that maximum results in an InvalidParameterException with the message "Message too long".
My message was larger than that limit, and that was the reason for the error. I trimmed the message until its size came under the maximum.
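A sketch of a pre-publish size check on top of the sendMessages method above (256 KB is the documented SNS maximum; how you shrink or split the payload is up to you; java.nio.charset.StandardCharsets is assumed imported):

private static final int SNS_MAX_MESSAGE_BYTES = 256 * 1024;

public boolean sendMessages(String topicArn, String messageBody) {
    int size = messageBody.getBytes(StandardCharsets.UTF_8).length;
    if (size > SNS_MAX_MESSAGE_BYTES) {
        // Too big for a single publish: trim the payload, split the map into
        // several smaller messages, or publish a pointer (e.g. an S3 key) instead.
        throw new IllegalArgumentException(
            "SNS message is " + size + " bytes, above the 256 KB limit");
    }
    PublishResult publishResult = this.client.publish(new PublishRequest(topicArn, messageBody));
    return publishResult != null && publishResult.getMessageId() != null;
}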
I have enabled log exports to a PubSub topic. I am using Dataflow to process these logs and store relevant columns in BigQuery. Can someone please help with the conversion of the PubSub message payload to a LogEntry object?
I have tried the following code:
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
    PubsubMessage pubsubMessage = c.element();
    ObjectMapper mapper = new ObjectMapper();
    byte[] payload = pubsubMessage.getPayload();
    String s = new String(payload, "UTF8");
    LogEntry logEntry = mapper.readValue(s, LogEntry.class);
}
But I got the following error:
com.fasterxml.jackson.databind.JsonMappingException: Can not find a (Map) Key deserializer for type [simple type, class com.google.protobuf.Descriptors$FieldDescriptor]
Edit:
I tried the following code:
try {
    ByteArrayInputStream stream = new ByteArrayInputStream(
        Base64.decodeBase64(pubsubMessage.getPayload()));
    LogEntry logEntry = LogEntry.parseDelimitedFrom(stream);
    System.out.println("Log Entry = " + logEntry);
} catch (InvalidProtocolBufferException e) {
    e.printStackTrace();
}
But I get the following error now:
com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag
The JSON format parser should be able to do this. Java's not my strength, but I think you're looking for something like:
@ProcessElement
public void processElement(ProcessContext c) throws Exception {
    LogEntry.Builder entryBuilder = LogEntry.newBuilder();
    JsonFormat.parser()
        .usingTypeRegistry(
            JsonFormat.TypeRegistry.newBuilder()
                .add(LogEntry.getDescriptor())
                .build())
        .ignoringUnknownFields()
        .merge(new String(c.element().getPayload(), StandardCharsets.UTF_8), entryBuilder);
    LogEntry entry = entryBuilder.build();
    ...
}
You might be able to get away without registering the type. I think in C++ the proto types are linked into a global registry.
You'll want "ignoringUnknownFields" in case the service adds new fields and exports them and you haven't updated your copy of the proto descriptor. Any "@type" fields in the exported JSON will cause problems too.
You may need special handling of the payload (i.e. strip it from the JSON and then parse it separately). If it's JSON, I'd expect the parser to try populating sub-messages that don't exist. If it's proto ... it actually might work if you register the Any type too.
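For example, if the exported entries are audit logs, the registry could be widened like this (a sketch; it assumes the proto-google-cloud-audit-log classes are on the classpath and that your payload really is an AuditLog):

// Register the payload message types alongside LogEntry so the Any-typed
// protoPayload can be resolved while parsing the exported JSON.
JsonFormat.TypeRegistry registry =
    JsonFormat.TypeRegistry.newBuilder()
        .add(com.google.logging.v2.LogEntry.getDescriptor())
        .add(com.google.cloud.audit.AuditLog.getDescriptor())
        .build();

JsonFormat.Parser parser = JsonFormat.parser()
    .usingTypeRegistry(registry)
    .ignoringUnknownFields();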