I have a use case where I need to move records from Hive to Kafka. I couldn't find a way to directly add a Kafka sink to a Flink DataSet.
As a workaround, I call a map transformation on the DataSet and, inside the map function, call kafkaProducer.send() for each record.
The problem I am facing is that I have no way to call kafkaProducer.flush() on every worker node, so the number of records written to Kafka is always slightly lower than the number of records in the DataSet.
Is there an elegant way to handle this? Is there a way to add a Kafka sink to a DataSet in Flink, or a way to call kafkaProducer.flush() as a finalizer?
You could simply create a sink that uses a KafkaProducer under the hood and writes the data to Kafka.
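With the DataSet API, one way to do that is a custom OutputFormat that wraps a KafkaProducer and flushes/closes it in close(), which addresses the missing-flush problem. Below is a minimal sketch, not an official connector; the String record type, topic name, and producer properties are assumptions:

import java.util.Properties;
import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaOutputFormat extends RichOutputFormat<String> {
    private final String topic;
    private final Properties props;   // bootstrap.servers, key/value serializers, ...
    private transient KafkaProducer<String, String> producer;

    public KafkaOutputFormat(String topic, Properties props) {
        this.topic = topic;
        this.props = props;
    }

    @Override
    public void configure(Configuration parameters) { }

    @Override
    public void open(int taskNumber, int numTasks) {
        // one producer per parallel task
        producer = new KafkaProducer<>(props);
    }

    @Override
    public void writeRecord(String record) {
        producer.send(new ProducerRecord<>(topic, record));
    }

    @Override
    public void close() {
        if (producer != null) {
            producer.flush();   // make sure buffered records reach the brokers before the task finishes
            producer.close();
        }
    }
}

You would then write the DataSet with something like dataSet.output(new KafkaOutputFormat("my-topic", props)); the topic name and properties here are placeholders.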
In my problem I need to query a database and join the query results with a Kafka data stream in Flink. Currently this is done by storing the query results in a file and then using Flink's readFile functionality to create a DataStream of the query results. What would be a better approach that bypasses the intermediate step of writing to a file and creates a DataStream directly from the query results?
My current understanding is that I would need to write a custom SourceFunction as suggested here. Is this the right and only way, or are there any alternatives?
Are there any good resources for writing custom SourceFunctions, or should I just look at existing implementations for reference and customise them for my needs?
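For context, a minimal sketch of what such a custom SourceFunction could look like; the JDBC URL, credentials, query, and the delimited String row encoding are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class JdbcQuerySource extends RichSourceFunction<String> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers")) {
            while (running && rs.next()) {
                // emit each query result row as a simple delimited string
                ctx.collect(rs.getLong("id") + "," + rs.getString("name"));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

You would attach it with env.addSource(new JdbcQuerySource()) and then join or connect the resulting DataStream with the Kafka stream.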
One straightforward solution would be to use a lookup join, perhaps with caching enabled.
Other possible solutions include Kafka Connect, or using something like Debezium to mirror the database table into Flink. Here's an example: https://github.com/ververica/flink-sql-CDC.
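A sketch of the lookup-join idea in Flink SQL, driven from Java; all table names, columns, connection settings, and connector options are placeholders, and the exact lookup-cache option keys vary by Flink version:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LookupJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Kafka-backed stream with a processing-time attribute for the lookup join
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  customer_id INT, amount DOUBLE, proc_time AS PROCTIME()" +
                ") WITH (" +
                "  'connector' = 'kafka', 'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        // JDBC dimension table; the lookup cache avoids hitting the database for every record
        tEnv.executeSql(
                "CREATE TABLE customers (" +
                "  id INT, name STRING" +
                ") WITH (" +
                "  'connector' = 'jdbc', 'url' = 'jdbc:mysql://localhost:3306/mydb'," +
                "  'table-name' = 'customers'," +
                "  'lookup.cache.max-rows' = '10000', 'lookup.cache.ttl' = '10min')");

        // each stream record looks up the database row current at its processing time
        tEnv.executeSql(
                "SELECT o.customer_id, o.amount, c.name " +
                "FROM orders AS o " +
                "JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c " +
                "  ON o.customer_id = c.id").print();
    }
}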
Let us say we have the following:
A Kafka topic X which represents events happening on an entity X.
Entity X contains a foreign key to another entity Y.
I want to enrich this topic X with data from outside Kafka (i.e., from a CSV file that contains all the entities Y).
The solution I have now is as follows:
Load the CSV into memory in a dictionary-like structure to make key-based lookups very fast.
Start consuming from topic X, enrich the data in memory, and then write the enriched records back to a new Kafka topic.
I am still evaluating whether Kafka Streams or KSQL can do the same for me.
My question is: is there an efficient way to do this with the Kafka Streams library or KSQL without losing performance?
Sure, you can do something like this:
final Map<String, String> m = new HashMap<>();           // loaded from the CSV, keyed by entity Y's id
builder.stream(topic).mapValues(v -> m.get(v)).to(out);  // replace each value with its lookup result and write it out
But Kafka Streams is ideally going to be distributed, and your CSV would therefore need to be synced across multiple machines.
Rather than building a map, use a KeyValueStore (this can also be in memory, but RocksDB is more fault tolerant) via a KTable: use the Kafka Connect Spooldir connector to load the CSV into a topic, build a table from that topic, and then join the two topics (see the sketch below).
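A minimal sketch of that approach, assuming the Spooldir connector has already written the CSV rows to a topic (called entity-y here) keyed by Y's id; it uses a GlobalKTable instead of a co-partitioned KTable so topic X does not need to be re-keyed, and the topic names, serde configuration, and key-extraction logic are placeholders:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// events about entity X; each value carries a foreign key to entity Y
KStream<String, String> events = builder.stream("topic-x");

// the CSV contents, loaded into the "entity-y" topic and backed by a key-value store
GlobalKTable<String, String> entityY = builder.globalTable("entity-y");

events
    .join(entityY,
          (eventKey, eventValue) -> eventValue.split(",")[0],  // hypothetical: foreign key sits in the first field
          (eventValue, yValue) -> eventValue + "|" + yValue)   // append Y's data to the event
    .to("topic-x-enriched");

// builder.build() is then passed to new KafkaStreams(topology, props) and started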
I need to know the partition number of the Kafka topic to which a record will go.
Before the execution of
producer.send(record);
Is there any way to know which partition that record will go to?
AFAIK it's not possible with the default round-robin partitioner (i.e., when no key is specified). If you specify a key, you could take the default algorithm from the Producer source code and try to predict the partition (it's essentially hash(key) % num.partitions).
If you use a custom partitioner, you implement the partitioning logic yourself and therefore already know.
I was also wondering about the ProducerInterceptor, but it is invoked before the partition is assigned, as you can see from the doc:
https://kafka.apache.org/26/javadoc/org/apache/kafka/clients/producer/ProducerInterceptor.html
I see three options here:
you can specify the partition explicitly in your ProducerRecord, as shown in the constructors of the ProducerRecord class,
you can define a custom partitioner, as shown in another post, or
you can make use of the AdminClient API (DescribeTopicsResult) to get the number of partitions of the topic and then re-apply the default partitioner logic used by Kafka (a sketch follows below):
org.apache.kafka.common.utils.Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
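A sketch of that third option; for brevity it reads the partition count from producer.partitionsFor() (the AdminClient's describeTopics() gives the same number), the topic and key literals are placeholders, and the prediction only holds for the default partitioner with a non-null key:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

// 'producer' is the KafkaProducer from the question
int numPartitions = producer.partitionsFor("my-topic").size();

// must match the exact bytes produced by the configured key serializer
byte[] keyBytes = "my-key".getBytes(StandardCharsets.UTF_8);

// same computation the default partitioner applies to keyed records
int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;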
I'm currently using Apache Flink for my master's thesis, and I have to partition the data multiple times over an iteration.
I would like to keep the same data on the same nodes while processing, but I don't know how I can do that. I think Flink will otherwise redistribute the data arbitrarily across the nodes.
Is there a way to call, e.g., partitionByHash(...) multiple times and keep the data on the same node?
Thanks!
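For reference, a minimal DataSet sketch of partitioning twice on the same key field (the Tuple2 type, the field index, and the identity map operators are placeholders for the real work). Hash partitioning is deterministic, so with an unchanged parallelism equal keys are routed to the same downstream subtask; which physical node that subtask runs on is still decided by the scheduler:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// placeholder input; field 0 plays the role of the key
DataSet<Tuple2<Long, String>> input = env.fromElements(Tuple2.of(1L, "a"), Tuple2.of(2L, "b"));

DataSet<Tuple2<Long, String>> step1 = input
        .partitionByHash(0)       // hash-partition on the key field
        .map(new MapFunction<Tuple2<Long, String>, Tuple2<Long, String>>() {
            public Tuple2<Long, String> map(Tuple2<Long, String> v) { return v; }  // placeholder work
        });

DataSet<Tuple2<Long, String>> step2 = step1
        .partitionByHash(0)       // same field, same hash function, same routing
        .map(new MapFunction<Tuple2<Long, String>, Tuple2<Long, String>>() {
            public Tuple2<Long, String> map(Tuple2<Long, String> v) { return v; }  // placeholder work
        });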
I am writing a Spark Streaming job that consumes data from Kafka and writes it to an RDBMS. I am currently stuck because I do not know which would be the most efficient way to store this streaming data in the RDBMS.
On searching, I found a few methods:
Using DataFrame
Using JdbcRDD
Creating a connection and PreparedStatement inside foreachPartition() of the RDD and using PreparedStatement.addBatch() / executeBatch() (see the sketch below)
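A minimal sketch of that third option, assuming 'stream' is a JavaDStream<String> of rows already read from Kafka; the JDBC URL, credentials, and table/column names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
    // one connection and one prepared statement per partition, not per record
    try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "password");
         PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO events (payload) VALUES (?)")) {
        while (records.hasNext()) {
            ps.setString(1, records.next());
            ps.addBatch();          // queue the row
        }
        ps.executeBatch();          // write the whole partition in one batch
    }
}));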
I cannot figure out which one would be the most efficient way of achieving my goal.
The same question applies to storing and retrieving data in HBase.
Can anyone help me with this?