Kafka Global Table not updating from source topic - java

I am creating a Spring Boot app and struggling to understand why my GlobalKTable is not updating.
As far as I understand, the global table is supposed to update automatically when the source topic is updated. This is not the case for me.
I did notice that the global table becomes populated with new data after I manually delete the state store folder.
I also noted the following error output when the Spring Boot app is launched:
2021-11-27 23:09:18.232 ERROR 17592 --- [ main] o.a.k.s.p.internals.StateDirectory : Failed to change permissions for the directory d:\kafkastreamsdb
2021-11-27 23:09:18.233 ERROR 17592 --- [ main] o.a.k.s.p.internals.StateDirectory : Failed to change permissions for the directory d:\kafkastreamsdb\Kafka-streams
It seems to me that the reason I only see all the current data in the GlobalKTable after deleting the state store folder is that the stream is not writing to the state store while it is running, but the store is recreated from the source topic after the deletion?
So, the issue here is that when I try to use the global table as a look-up in the join below, the enriched stream returns NULL for the table values. However, when I delete the state store folder and restart Spring Boot, the enriched stream does return the table values.
Just to clarify, new events are continuously sent to the source topic, but this data is only visible in the table after the deletion.
Here is my code:
@Service
public class TopologyBuilder2 {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Register FakeAddress stream
        KStream<String, FakeAddress> streamFakeAddress =
                builder.stream("FakeAddress", Consumed.with(Serdes.String(), JsonSerdes.FakeAddress()));

        GlobalKTable<String, Greetings> globalGreetingsTable = builder.globalTable(
                "Greetings",
                Consumed.with(Serdes.String(), JsonSerdes.Greetings()),
                Materialized.<String, Greetings, KeyValueStore<Bytes, byte[]>>as("GREETINGS") /* table/store name */
                        .withKeySerde(Serdes.String())            /* key serde */
                        .withValueSerde(JsonSerdes.Greetings())); /* value serde */

        // LEFT key mapper
        KeyValueMapper<String, FakeAddress, String> keyMapperFakeAddress =
                (leftKey, fakeAddress) -> {
                    // System.out.println(String.valueOf(fakeAddress.getCountry()));
                    return String.valueOf(fakeAddress.getCountry());
                };

        // Value joiner
        ValueJoiner<FakeAddress, Greetings, EnrichedCountryGreeting> valueJoinerFakeAddressAndGreetings =
                (fakeAddress, greetings) -> new EnrichedCountryGreeting(fakeAddress, greetings);

        KStream<String, EnrichedCountryGreeting> enrichedStream =
                streamFakeAddress.join(globalGreetingsTable, keyMapperFakeAddress, valueJoinerFakeAddressAndGreetings);

        enrichedStream.print(Printed.<String, EnrichedCountryGreeting>toSysOut().withLabel("Stream-enrichedStream: "));

        return builder.build();
    }
}

Related

Why am I getting "The provided key element does not match the schema" error with Java DynamoDB AWS SDK?

I'm trying to implement, with DynamoDB, a way to allow items to only be inserted and NOT updated/replaced (to achieve some level of transaction control). Therefore, I have the following DB schema configured in AWS DynamoDB:
PartitionKey="driveUniqueId (String)"
SortKey="createTime (String)"
Now, if I run the following snippet of code, the operation works and the record is created.
String currentTime = LocalTime.now().toString();
dynamoDbClient.transactWriteItems(TransactWriteItemsRequest.builder()
        .transactItems(
                TransactWriteItem.builder().put(
                                Put.builder()
                                        .tableName(tableName)
                                        .item(Map.of(
                                                "driveUniqueId", AttributeValue.builder().s("789").build(),
                                                "createTime", AttributeValue.builder().s(currentTime).build()))
                                        .build())
                        .build())
        .build());
However, if I run the following snippet of code, which adds a conditionCheck to prevent replaces, then I get the following error:
dynamoDbClient.transactWriteItems(TransactWriteItemsRequest.builder()
        .transactItems(
                TransactWriteItem.builder().put(
                                Put.builder()
                                        .tableName(tableName)
                                        .item(Map.of(
                                                "driveUniqueId", AttributeValue.builder().s("123").build(),
                                                "createTime", AttributeValue.builder().s(currentTime).build()))
                                        .build())
                        .build(),
                TransactWriteItem.builder().conditionCheck(
                                ConditionCheck.builder()
                                        .tableName(tableName)
                                        .key(Map.of(
                                                "driveUniqueId", AttributeValue.builder().s("123").build()))
                                        .conditionExpression("attribute_not_exists(driveUniqueId)")
                                        .build())
                        .build())
        .build());
/*
Error:
CancellationReason(Code=None)
CancellationReason(Code=ValidationError, Message=The provided key element does not match the schema)
*/
I don't understand why the specified condition doesn't recognize "driveUniqueId" as part of the schema in the second write operation.
I'm running this code using Quarkus, Java 17, and AWS SDK 2.17.291.
Any ideas why my conditionCheck is wrong?
Your condition check is causing a validation error because you did not include the item's sort key, createTime.
As with any write to DynamoDB, you must include the full primary key: partition key and sort key.
Furthermore, you possibly do not need to use transactions at all; single-item operations are ACID compliant, so you could simply use a PutItem with a condition check.
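For reference, here is a minimal sketch of a ConditionCheck element carrying the full primary key, reusing the placeholder values from the question. Note that this only resolves the validation error; DynamoDB also does not allow a Put and a ConditionCheck on the same item within one transaction, which is another reason the single conditional PutItem shown below is the simpler route.
TransactWriteItem.builder().conditionCheck(
                ConditionCheck.builder()
                        .tableName(tableName)
                        .key(Map.of(
                                // full primary key: partition key AND sort key
                                "driveUniqueId", AttributeValue.builder().s("123").build(),
                                "createTime", AttributeValue.builder().s(currentTime).build()))
                        .conditionExpression("attribute_not_exists(driveUniqueId)")
                        .build())
        .build()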
This is my final solution, based on @Lee Hanningan's response.
/**
 * This method guarantees that an Item with a given {@code uniqueId} will be processed only once,
 * by the POD that grabbed its lock (i.e. was the first to insert the Item into the DynamoDB table).
 * If another POD tries to re-process the same Item, in parallel or in the future, it will fail.
 *
 * This property guarantees that the Item contributes to the process only once, even if a POD
 * suffers a network partition and never acknowledges to SQS that the Item was processed successfully.
 *
 * @param uniqueId
 * @return isItemLocked
 */
public boolean insertAndLockProcessingToItem(final String uniqueId) {
    boolean itemLockedToBeProcessed;
    try {
        String currentTime = LocalDateTime.now().toString();
        dynamoDbClient.putItem(PutItemRequest.builder()
                .tableName(TABLE_NAME)
                .item(Map.of(
                        "uniqueId", AttributeValue.builder().s(uniqueId).build(),
                        "createTime", AttributeValue.builder().s(currentTime).build()))
                .conditionExpression("attribute_not_exists(uniqueId)")
                .build());
        itemLockedToBeProcessed = true;
    } catch (final ConditionalCheckFailedException ccfe) {
        itemLockedToBeProcessed = false;
    } catch (final Exception ex) {
        throw new RuntimeException(ex);
    }
    return itemLockedToBeProcessed;
}

How to create a DataStreamSource from a Mysql Database?

I have a problem running a Flink job that basically runs a query against a MySQL database and then tries to create a temporary view that must be accessed from a different job.
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    final TypeInformation<?>[] fieldTypes =
            new TypeInformation<?>[] {
                BasicTypeInfo.INT_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO
            };
    final RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

    String selectQuery = "select * from ***";
    String driverName = "***";
    String sourceDb = "***";
    String dbUrl = "jdbc:mysql://mySqlDatabase:3306/";
    String dbPassword = "***";
    String dbUser = "***";

    JdbcInputFormat.JdbcInputFormatBuilder inputBuilder =
            JdbcInputFormat.buildJdbcInputFormat()
                    .setDrivername(driverName)
                    .setDBUrl(dbUrl + sourceDb)
                    .setQuery(selectQuery)
                    .setRowTypeInfo(rowTypeInfo)
                    .setUsername(dbUser)
                    .setPassword(dbPassword);

    DataStreamSource<Row> source = env.createInput(inputBuilder.finish());

    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    Table customerTable =
            tableEnv.fromDataStream(source).as("id", "name", "test");
    tableEnv.createTemporaryView("***", ***Table);

    Table resultTable = tableEnv.sqlQuery(
            "SELECT * FROM ***");

    DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
    resultStream.print();
    env.execute();
}
I'm quite new to Flink, and I'm currently going through the APIs provided for all of these, but I can't actually understand what I'm doing wrong. In my mind, testing this process by printing the result at the end of the job seems straightforward, but the only thing I get printed is something like this:
2022-02-14 12:22:57,702 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 (8a1cd3aa6a753c9253926027b1332680) switched from INITIALIZING to RUNNING.
2022-02-14 12:22:57,853 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 (8a1cd3aa6a753c9253926027b1332680) switched from RUNNING to FINISHED.
2022-02-14 12:22:57,853 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 (8a1cd3aa6a753c9253926027b1332680).
2022-02-14 12:22:57,856 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Un-registering task and sending final execution state FINISHED to JobManager for task Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 8a1cd3aa6a753c9253926027b1332680.
The point of this job would be to create a temporary table view used for caching some static data that will be used in other Flink jobs by querying that table view.
For more context on how to use MySQL with Flink, see https://stackoverflow.com/a/71030967/2000823. As a streaming data source, it's more common to work with MySQL's change log (the binlog) as a CDC stream, but another approach that is sometimes taken (though not encouraged by Flink's APIs) is to periodically poll MySQL with a SELECT query.
As for what you've tried, using createInput is discouraged for streaming jobs, as this doesn't work with Flink's checkpointing mechanism. Rather than using an InputFormat, it's better to choose one of the available source connectors.
A temporary view doesn't hold any data, and isn't something that can be accessed from another job. A Flink table, or a view, is metadata describing how data stored somewhere else (e.g., in mysql or kafka) is to be interpreted as a table by Flink. You can store a view in a catalog so that multiple jobs can share its definition, but the underlying data will remain in the external data store, and only the view metadata is stored in the catalog.
So in this case, the job you've written will create a temporary view that is only visible to this job and no others (since it is a temporary view, and not a persistent view stored in a persistent catalog). The output of your job won't be in the log file(s), but will instead go to stdout, or to *.out files in the logging directory of each task manager.
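As an illustration of the CDC approach mentioned above, here is a minimal sketch assuming the flink-connector-mysql-cdc library; the hostname, credentials, database and table names are placeholders, and the exact package and builder methods may differ between connector versions:
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySqlCdcSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own.
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("mySqlDatabase")
                .port(3306)
                .databaseList("sourceDb")           // database(s) to capture
                .tableList("sourceDb.customer")     // table(s) to capture
                .username("user")
                .password("password")
                .deserializer(new JsonDebeziumDeserializationSchema()) // emits rows as JSON strings
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what makes this source fault tolerant, unlike createInput.
        env.enableCheckpointing(10_000);

        DataStreamSource<String> stream =
                env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL CDC Source");

        stream.print();
        env.execute("MySQL CDC sketch");
    }
}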
First of all, test whether the data from MySQL can be read normally.
Maybe you can directly print the source result, as follows:
DataStreamSource<Row> source = env.createInput(inputBuilder.finish());
source.print();
env.execute();

Get only a subset of fields from a Kafka topic using Apache Beam

Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading the streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of its Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
        pipeline
                .apply("read streams", dataIO.readStreams(topic))
                .apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
                    @Override
                    public Boolean apply(KV<String, Person> input) {
                        return input.getValue().get("address").equals(true);
                    }
                }));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
        .withTopic(topic)
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(PersonAvroDeserializer.class)
        .withConsumerConfigUpdates(consumer)
        .withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
'topics' = 'FILTERED_STREAM',
…
);
You can also do this by creating a new TableSchema yourself with only the required fields. Later, when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();

TableFieldSchema id =
        new TableFieldSchema()
                .setName("id")
                .setType("STRING")
                .setMode("NULLABLE");
tableFields.add(id);

schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an Avro record to BigQuery's TableRow at some point, you may need to implement some checks there too.
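For illustration, here is a minimal sketch of such a conversion step that keeps only id, name and address; it assumes Person is an Avro-generated class with getId(), getName() and getAddress() accessors (those names are placeholders, adapt them to your schema):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical DoFn: projects a Person value down to the three fields destined for BigQuery.
static class PersonToTableRowFn extends DoFn<KV<String, Person>, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Person person = c.element().getValue();
        TableRow row = new TableRow()
                .set("id", String.valueOf(person.getId()))           // String.valueOf handles Avro Utf8/null
                .set("name", String.valueOf(person.getName()))
                .set("address", String.valueOf(person.getAddress()));
        c.output(row);
    }
}

// Usage (with the TableSchema built above):
// kafkaStreams.apply("to table rows", ParDo.of(new PersonToTableRowFn()))
//             .apply(BigQueryIO.writeTableRows().to(tableSpec).withSchema(schema));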

Extracting Timestamp from producer message

I really need help!
I can't extract the timestamp of a message sent by a producer. In my project I work with JSON: I have a class in which I define the keys and one in which I define the values of the message that I send via a producer to a "Raw" topic. I have two other classes that do the same thing for the output message that my consumer will read on the topic called "Tdt". In the main class KafkaStreams.java I define the stream and map the keys and values. Starting Kafka locally, I start a producer that writes a message to the "raw" topic with keys and values, then in another shell the consumer starts reading the output message on the "tdt" topic. How do I get the event timestamp? I need to know the timestamp at which the message was sent by the producer. Do I need a TimestampExtractor?
Here is my main class KafkaStreams (my application works great, I just need the timestamp):
@Bean("app1StreamTopology")
public KStream<LibAssIbanRawKey, LibAssIbanRawValue> kStream() throws ParseException {
    JsonSerde<Dwsitspr4JoinValue> Dwsitspr4JoinValueSerde = new JsonSerde<>(Dwsitspr4JoinValue.class);

    KStream<LibAssIbanRawKey, LibAssIbanRawValue> stream = defaultKafkaStreamsBuilder.stream(inputTopic);
    stream.peek((k, v) -> logger.info("Debug3 Chiave descrizione -> ({})", v.getCATRAPP()));

    GlobalKTable<Integer, Dwsitspr4JoinValue> categoriaRapporto = defaultKafkaStreamsBuilder
            .globalTable(temptiptopicname,
                    Consumed.with(Serdes.Integer(), Dwsitspr4JoinValueSerde)
                    // .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST)
            );
    logger.info("Debug3 Chiave descrizione -> ({})", categoriaRapporto.toString());
    stream.peek((k, v) -> logger.info("Debug4 Chiave descrizione -> ({})", v.getCATRAPP()));

    stream
            .join(categoriaRapporto, (k, v) -> v.getCATRAPP(), (valueStream, valueGlobalKtable) -> {
                // Value mapping
                LibAssIbanTdtValue newValue = new LibAssIbanTdtValue();
                newValue.setDescrizioneRidottaCodiceCategoriaDelRapporto(valueGlobalKtable.getDescrizioneRidotta());
                newValue.setDescrizioneEstesaCodiceCategoriaDelRapporto(valueGlobalKtable.getDescrizioneEstesa());
                newValue.setIdentificativo(valueStream.getAUD_CCID());
                .
                .
                . // Other values mapped
                .
                .
            .map((key, value) -> {
                // Key mapping
                LibAssIbanTdtKey newKey = new LibAssIbanTdtKey();
                newKey.setData(dtf.format(localDate));
                newKey.setIdentificatoreUnivocoDellaRigaDiTabella(key.getTABROWID());
                return KeyValue.pair(newKey, value);
            }).to(outputTopic, Produced.with(new JsonSerde<>(LibAssIbanTdtKey.class), new JsonSerde<>(LibAssIbanTdtValue.class)));

    return stream;
}
}
Yes, you need a TimestampExtractor.
public class YourTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> consumerRecord, long l) {
        // do whatever you want with the timestamp available via consumerRecord.timestamp()
        ...
        // return the timestamp you want to use (here the default)
        return consumerRecord.timestamp();
    }
}
You'll need to tell Kafka Streams which extractor to use, via the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG key.
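A minimal sketch of setting that key with plain Kafka Streams properties (with Spring Boot / Spring Kafka you would add the same entry to your streams configuration properties; the application id and bootstrap servers below are placeholders):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app1");              // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
// Register the custom extractor as the default timestamp extractor for all source topics.
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        YourTimestampExtractor.class.getName());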

unexpected multiple execution of mapper intended to run once

I tried to write a very simple job with only 1 mapper and no reducer to write some data to HBase. In the mapper I simply open a connection to HBase, write a few rows of data to a table, and then close the connection. In the job driver I am using JobConf.setNumMapTasks(1) and JobConf.setNumReduceTasks(0) to specify that only 1 mapper and no reducer are to be executed. I am also setting the reducer class to IdentityReducer in the jobConf. The strange behavior I am observing is that the job successfully writes the data to the HBase table, but after that the logs show it continuously opening and closing connections to HBase, which goes on for 20-30 minutes, after which the job is declared to have completed with 100% success. At the end, when I check the _success file created from the dummy data I put in OutputCollector.collect(...), I see hundreds of rows of dummy data when there should only be 1.
Following is the code for job driver
public int run(String[] arg0) throws Exception {
    Configuration config = HBaseConfiguration.create(getConf());
    ensureRequiredParametersExist(config);
    ensureOptionalParametersExist(config);

    JobConf jobConf = new JobConf(config, getClass());
    jobConf.setJobName(config.get(ETLJobConstants.ETL_JOB_NAME));

    // set map-specific configuration
    jobConf.setNumMapTasks(1);
    jobConf.setMaxMapAttempts(1);
    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setMapperClass(SingletonMapper.class);
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(Text.class);

    // set reducer-specific configuration
    jobConf.setReducerClass(IdentityReducer.class);
    jobConf.setOutputKeyClass(LongWritable.class);
    jobConf.setOutputValueClass(Text.class);
    jobConf.setOutputFormat(TextOutputFormat.class);
    jobConf.setNumReduceTasks(0);

    // set job-specific configuration details like input file name etc.
    FileInputFormat.setInputPaths(jobConf, jobConf.get(ETLJobConstants.ETL_JOB_FILE_INPUT_PATH));
    System.out.println("setting output path to : " + jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH));
    FileOutputFormat.setOutputPath(jobConf,
            new Path(jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH)));

    JobClient.runJob(jobConf);
    return 0;
}
The driver class extends Configured and implements Tool (I used the sample from the Definitive Guide).
Following is the code in my Mapper's map method, where I simply open the connection to HBase, do some preliminary checks to make sure the table exists, and then write the rows and close the table.
public void map(LongWritable arg0, Text arg1,
        OutputCollector<LongWritable, Text> arg2, Reporter arg3)
        throws IOException {
    HTable aTable = null;
    HBaseAdmin admin = null;

    try {
        arg3.setStatus("started");

        /*
         * set up hbase config
         */
        admin = new HBaseAdmin(conf);

        /*
         * open connection to table
         */
        String tableName = conf.get(ETLJobConstants.ETL_JOB_TABLE_NAME);
        HTableDescriptor htd = new HTableDescriptor(toBytes(tableName));
        String colFamilyName = conf.get(ETLJobConstants.ETL_JOB_TABLE_COLUMN_FAMILY_NAME);
        byte[] tablename = htd.getName();
        /* call function to ensure table with 'tablename' exists */

        /*
         * loop and put the file data into the table
         */
        aTable = new HTable(conf, tableName);

        DataRow row = /* logic to generate data */
        while (row != null) {
            byte[] rowKey = toBytes(row.getRowKey());
            Put put = new Put(rowKey);
            for (DataNode node : row.getRowData()) {
                put.add(toBytes(colFamilyName), toBytes(node.getNodeName()),
                        toBytes(node.getNodeValue()));
            }
            aTable.put(put);
            arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added another data row to hbase");
            row = fileParser.getNextRow();
        }
        aTable.flushCommits();
        arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo Finished adding data to hbase");
    } finally {
        if (aTable != null) {
            aTable.close();
        }
        if (admin != null) {
            admin.close();
        }
    }

    arg2.collect(new LongWritable(10), new Text("something"));
    arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added some dummy data to the collector");
}
As you can see, near the end I write some dummy data to the collector (10, 'something'), and I see hundreds of rows of this data in the _success file after the job has terminated.
I can't identify why the mapper code is run multiple times over and over instead of just once. Any help would be greatly appreciated.
Using JobConf.setNumMapTasks(1) just tells Hadoop that you wish to use 1 mapper, if possible, unlike setNumReduceTasks, which actually fixes the number you specify.
That's why more mappers are run, and why you observe all those duplicate rows; see the sketch below for one way to actually force a single map task.
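If you genuinely need a single map task, one common approach is to make the input format non-splittable, so each input file produces exactly one split and therefore one mapper. A minimal sketch using the old mapred API (to match the JobConf usage above); the class name is just a placeholder:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical input format: reports every file as non-splittable, so each input file
// yields exactly one split and hence one map task.
public class NonSplittableTextInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

// In the driver, instead of relying on setNumMapTasks(1) alone:
// jobConf.setInputFormat(NonSplittableTextInputFormat.class);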
For more details, please read this post.
