I have a problem running a Flink job that basically runs a query against a MySQL database and then tries to create a temporary view that must be accessed from a different job.
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    final TypeInformation<?>[] fieldTypes =
            new TypeInformation<?>[] {
                BasicTypeInfo.INT_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO
            };
    final RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

    String selectQuery = "select * from ***";
    String driverName = "***";
    String sourceDb = "***";
    String dbUrl = "jdbc:mysql://mySqlDatabase:3306/";
    String dbPassword = "***";
    String dbUser = "***";

    JdbcInputFormat.JdbcInputFormatBuilder inputBuilder =
            JdbcInputFormat.buildJdbcInputFormat()
                    .setDrivername(driverName)
                    .setDBUrl(dbUrl + sourceDb)
                    .setQuery(selectQuery)
                    .setRowTypeInfo(rowTypeInfo)
                    .setUsername(dbUser)
                    .setPassword(dbPassword);

    DataStreamSource<Row> source = env.createInput(inputBuilder.finish());

    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    Table customerTable = tableEnv.fromDataStream(source).as("id", "name", "test");
    tableEnv.createTemporaryView("***", ***Table);

    Table resultTable = tableEnv.sqlQuery("SELECT * FROM ***");
    DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
    resultStream.print();
    env.execute();
}
I'm quite new to Flink, and I'm currently going through the APIs provided for all of these, but I can't actually understand what I'm doing wrong. In my mind, testing this process by printing the result at the end of the job seems straightforward, but the only thing I get printed is something like this:
2022-02-14 12:22:57,702 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 (8a1cd3aa6a753c9253926027b1332680) switched from INITIALIZING to RUNNING.
2022-02-14 12:22:57,853 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 (8a1cd3aa6a753c9253926027b1332680) switched from RUNNING to FINISHED.
2022-02-14 12:22:57,853 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 (8a1cd3aa6a753c9253926027b1332680).
2022-02-14 12:22:57,856 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Un-registering task and sending final execution state FINISHED to JobManager for task Source: Custom Source -> DataSteamToTable(stream=default_catalog.default_database.Unregistered_DataStream_Source_1, type=ROW<`f0` INT, `f1` STRING, `f2` STRING> NOT NULL, rowtime=false, watermark=false) -> Calc(select=[f0 AS id, f1 AS name, f2 AS test]) -> TableToDataSteam(type=ROW<`id` INT, `name` STRING, `test` STRING> NOT NULL, rowtime=false) -> Sink: Print to Std. Out (1/1)#0 8a1cd3aa6a753c9253926027b1332680.
The point of this job is to create a temporary table view that caches some static data, which other Flink jobs would then query.
For more context on how to use MySQL with Flink, see https://stackoverflow.com/a/71030967/2000823. As a streaming data source, it's more common to work with MySQL's change log (the binlog) as a CDC stream, but another approach that is sometimes taken (though not encouraged by Flink's APIs) is to periodically poll MySQL with a SELECT query.
As for what you've tried, using createInput is discouraged for streaming jobs, as it doesn't work with Flink's checkpointing mechanism. Rather than wrapping an InputFormat this way, it's better to choose one of the available source connectors.
A temporary view doesn't hold any data, and isn't something that can be accessed from another job. A Flink table, or a view, is metadata describing how data stored somewhere else (e.g., in mysql or kafka) is to be interpreted as a table by Flink. You can store a view in a catalog so that multiple jobs can share its definition, but the underlying data will remain in the external data store, and only the view metadata is stored in the catalog.
So in this case, the job you've written will create a temporary view that is only visible to this job and no others (since it is a temporary view, and not a persistent view stored in a persistent catalog). The output of your job won't be in the log file(s), but will instead go to stdout, or to *.out files in the logging directory of each task manager.
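If the goal is for several jobs to share the definition of this MySQL-backed table, one more idiomatic route is to declare it with SQL DDL using the JDBC connector and register that definition in a shared catalog such as a Hive catalog. Here is a minimal sketch, assuming the flink-connector-jdbc dependency is on the classpath; the table name, column names, and connection options are placeholders, not taken from the original question:
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RegisterJdbcTable {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

        // The definition below is metadata only; the rows stay in MySQL and are
        // read when a query against the table is executed.
        tableEnv.executeSql(
                "CREATE TABLE customers (" +
                "  id INT," +
                "  name STRING," +
                "  test STRING" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://mySqlDatabase:3306/sourceDb'," +
                "  'table-name' = 'customers'," +
                "  'username' = '***'," +
                "  'password' = '***'" +
                ")");

        // With the default in-memory catalog only this session sees the table;
        // registering it in a persistent catalog would make it visible to other jobs.
        tableEnv.executeSql("SELECT * FROM customers").print();
    }
}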
First of all, test whether the data can be read from MySQL normally.
Maybe you can directly print the source result as follows:
DataStreamSource<Row> source = env.createInput(inputBuilder.finish());
source.print();
env.execute();
I am creating a Spring Boot app and struggling to understand why my GlobalKTable is not updating.
As far as I understand, the global table is supposed to update automatically when the source topic is updated. This is not the case for me.
I did notice the global table becomes populated with new data after I manually delete the state store folder.
I also noted the following error output when the Spring Boot app is launched:
2021-11-27 23:09:18.232 ERROR 17592 --- [ main] o.a.k.s.p.internals.StateDirectory : Failed to change permissions for the directory d:\kafkastreamsdb
2021-11-27 23:09:18.233 ERROR 17592 --- [ main] o.a.k.s.p.internals.StateDirectory : Failed to change permissions for the directory d:\kafkastreamsdb\Kafka-streams
It seems to me that the reason I only see all the current data in the GlobalKTable after deleting the state store folder is that the stream is not writing to the state store while it is running, but recreates the state store from the source topic after the deletion?
So, the issue here is that when I try to use the global table as a look-up in the join below, the enriched stream returns NULL for the table values. However, when I delete the state store folder and restart Spring Boot, the enriched stream does return the table values.
Just to clarify, new events are continuously sent to the source topic, but this data is only visible in the table after deletion.
Here is my code:
@Service
public class TopologyBuilder2 {
public static Topology build() {
StreamsBuilder builder = new StreamsBuilder();
// Register FakeAddress stream
KStream<String, FakeAddress> streamFakeAddress =
builder.stream("FakeAddress", Consumed.with(Serdes.String(), JsonSerdes.FakeAddress()));
GlobalKTable<String, Greetings> globalGreetingsTable = builder.globalTable(
"Greetings"
, Consumed.with(Serdes.String(), JsonSerdes.Greetings())
, Materialized.<String, Greetings, KeyValueStore<Bytes, byte[]>>as(
"GREETINGS" /* table/store name */)
.withKeySerde(Serdes.String()) /* key serde */
.withValueSerde(JsonSerdes.Greetings()) /* value serde */);
// LEFT Key mapper
KeyValueMapper<String, FakeAddress, String> keyMapperFakeAddress =
( leftkey, fakeAddress) -> {
// System.out.println(String.valueOf(fakeAddress.getCountry()));
return String.valueOf(fakeAddress.getCountry());
};
// Value joiner
ValueJoiner<FakeAddress, Greetings, EnrichedCountryGreeting> valueJoinerFakeAddressAndGreetings =
(fakeAddress, greetings) -> new EnrichedCountryGreeting(fakeAddress, greetings);
KStream<String, EnrichedCountryGreeting> enrichedStream
= streamFakeAddress.join(globalGreetingsTable, keyMapperFakeAddress, valueJoinerFakeAddressAndGreetings);
enrichedStream.print(Printed.<String, EnrichedCountryGreeting>toSysOut().withLabel("Stream-enrichedStream: "));
return builder.build();
}
}
Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading the streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of the Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
pipeline
.apply("read streams", dataIO.readStreams(topic))
.apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
@Override
public Boolean apply(KV<String, Order> input) {
return input.getValue().get("address").equals(true);
}
}));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
.withTopic(topic)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(PersonAvroDeserializer.class)
.withConsumerConfigUpdates(consumer)
.withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
'topics' = 'FILTERED_STREAM',
…
);
You can also do this by creating a new TableSchema yourself with only the required fields. Later, when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
// Illustrative helper (the method name is arbitrary): build a reduced schema
// that contains only the fields you want to keep. The name and address fields
// would be added in the same way as id.
private static TableSchema buildFilteredSchema() {
    TableSchema schema = new TableSchema();
    List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();
    TableFieldSchema id =
            new TableFieldSchema()
                    .setName("id")
                    .setType("STRING")
                    .setMode("NULLABLE");
    tableFields.add(id);
    schema.setFields(tableFields);
    return schema;
}
I should also mention that if you are converting an Avro record to BigQuery's TableRow at some point, you may need to implement some checks there too.
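For illustration, here is a rough sketch of that conversion step, building on the kafkaStreams PCollection from the question (or whatever collection holds your KV<String, Person> records). The Person getter names (getId, getName, getAddress) are assumptions about the generated Avro class, and the null checks stand in for whatever validation your records actually need:
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Keep only the fields that exist in the reduced TableSchema.
PCollection<TableRow> rows =
        kafkaStreams.apply(
                "to filtered TableRow",
                MapElements.via(
                        new SimpleFunction<KV<String, Person>, TableRow>() {
                            @Override
                            public TableRow apply(KV<String, Person> input) {
                                Person person = input.getValue();
                                TableRow row = new TableRow();
                                // Avro strings may be CharSequence/Utf8, so convert explicitly.
                                row.set("id", person.getId() == null ? null : person.getId().toString());
                                row.set("name", person.getName() == null ? null : person.getName().toString());
                                row.set("address", person.getAddress() == null ? null : person.getAddress().toString());
                                return row;
                            }
                        }));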
I want to do a simple query in Flink SQL on one table, which includes a GROUP BY statement. But the results contain duplicate rows for the column specified in the GROUP BY. Is that because I use a streaming environment and it doesn't remember the state?
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
// configure Kafka consumer
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // Broker default host:port
props.setProperty("group.id", "flink-consumer"); // Consumer group ID
FlinkKafkaConsumer011<BlocksTransactions> flinkBlocksTransactionsConsumer = new FlinkKafkaConsumer011<>(args[0], new BlocksTransactionsSchema(), props);
flinkBlocksTransactionsConsumer.setStartFromEarliest();
DataStream<BlocksTransactions> blocksTransactions = env.addSource(flinkBlocksTransactionsConsumer);
tableEnv.registerDataStream("blocksTransactionsTable", blocksTransactions);
Table sqlResult
= tableEnv.sqlQuery(
"SELECT block_hash, count(tx_hash) " +
"FROM blocksTransactionsTable " +
"GROUP BY block_hash");
DataStream<Test> resultStream = tableEnv
.toRetractStream(sqlResult, Row.class)
.map(t -> {
Row r = t.f1;
String field2 = r.getField(0).toString();
long count = Long.valueOf(r.getField(1).toString());
return new Test(field2, count);
})
.returns(Test.class);
resultStream.print();
resultStream.addSink(new FlinkKafkaProducer011<>("localhost:9092", "TargetTopic", new TestSchema()));
env.execute();
I use the GROUP BY statement on the block_hash column, but I get the same block_hash several times. This is the result of the print():
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=1}
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=1}
Test{field2='0x2c4a021d514e4f8f0beb8f0ce711652304928528487dc7811d06fa77c375b5e1', count=2}
Test{field2='0x780aadc08c294da46e174fa287172038bba7afacf2dff41fdf0f6def03906e60', count=1}
Test{field2='0x182d31bd491527e1e93c4e44686057207ee90c6a8428308a2bd7b6a4d2e10e53', count=1}
Test{field2='0x182d31bd491527e1e93c4e44686057207ee90c6a8428308a2bd7b6a4d2e10e53', count=1}
How can I fix this without using BatchEnvironment?
A GROUP BY query that runs on a stream must produce updates. Consider the following example:
SELECT user, COUNT(*) FROM clicks GROUP BY user;
Every time the clicks table receives a new row, the count for the respective user needs to be incremented and updated.
When you convert a Table into a DataStream, these updates must be encoded in the stream. Flink uses retraction and add messages to do that. By calling tEnv.toRetractStream(table, Row.class), you convert the Table table into a DataStream<Tuple2<Boolean, Row>>. The Boolean flag is important and indicates whether the Row is added to or retracted from the result table.
Given the example query above and the input table clicks as
user | ...
------------
Bob | ...
Liz | ...
Bob | ...
You will receive the following retraction stream
(+, (Bob, 1)) // add first result for Bob
(+, (Liz, 1)) // add first result for Liz
(-, (Bob, 1)) // remove outdated result for Bob
(+, (Bob, 2)) // add updated result for Bob
You need to actively maintain the result yourself and add and remove rows as instructed by the Boolean flag of the retraction stream.
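To make the flag concrete, here is a minimal sketch of a downstream consumer that keeps only the current count per block_hash, reusing tableEnv and sqlResult from your code. It is in-memory, single-parallelism, and not fault tolerant; it exists purely to illustrate the add/retract semantics, not as something to use as-is:
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.types.Row;

tableEnv.toRetractStream(sqlResult, Row.class)
        .addSink(new SinkFunction<Tuple2<Boolean, Row>>() {
            private final Map<String, Long> currentCounts = new HashMap<>();

            @Override
            public void invoke(Tuple2<Boolean, Row> value) {
                String blockHash = value.f1.getField(0).toString();
                long count = Long.parseLong(value.f1.getField(1).toString());
                if (value.f0) {
                    currentCounts.put(blockHash, count);  // add / updated result
                } else {
                    currentCounts.remove(blockHash);      // retraction of the outdated result
                }
                System.out.println(currentCounts);
            }
        })
        .setParallelism(1);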
I use Spark SQL in a Spark Streaming Job to search in a Hive table.
Kafka streaming works fine. If I run hiveContext.runSqlHive(sqlQuery); outside of directKafkaStream.foreachRDD, it also works without problems. But I need the Hive table lookup inside the streaming job. Using JDBC (jdbc:hive2://) would work, but I want to use Spark SQL.
The significant parts of my source code look as follows:
// set context
SparkConf sparkConf = new SparkConf().setAppName(appName).set("spark.driver.allowMultipleContexts", "true");
SparkContext sparkSqlContext = new SparkContext(sparkConf);
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(batchDuration));
HiveContext hiveContext = new HiveContext(sparkSqlContext);
// Initialize Direct Spark Kafka Stream. Starts from top
JavaPairInputDStream<String, String> directKafkaStream =
KafkaUtils.createDirectStream(streamingContext,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
// work on stream
directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
rdd.foreachPartition(tuple2Iterator -> {
// get message
Tuple2<String, String> item = tuple2Iterator.next();
// lookup
String sqlQuery = "SELECT something FROM somewhere";
Seq<String> resultSequence = hiveContext.runSqlHive(sqlQuery);
List<String> result = scala.collection.JavaConversions.seqAsJavaList(resultSequence);
});
return null;
});
// Start the computation
streamingContext.start();
streamingContext.awaitTermination();
I get no meaningful error, even if I surround the code with try-catch.
I hope someone can help - Thanks.
//edit:
The solution looks like:
// work on stream
directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
// driver
Map<String, String> lookupMap = getResult(hiveContext); //something with hiveContext.runSqlHive(sqlQuery);
rdd.foreachPartition(tuple2Iterator -> {
// worker
while (tuple2Iterator != null && tuple2Iterator.hasNext()) {
// get message
Tuple2<String, String> item = tuple2Iterator.next();
// lookup
String result = lookupMap.get(item._2());
}
});
return null;
});
Just because you want to use Spark SQL doesn't make it possible. Spark's rule number one is no nested actions, transformations, or distributed data structures.
If you can express your query, for example as a join, you can push it one level higher up to foreachRDD, and that pretty much exhausts your options for using Spark SQL here:
directKafkaStream.foreachRDD(rdd -> {
    hiveContext.runSqlHive(sqlQuery);
    rdd.foreachPartition(...);
});
Otherwise, a direct JDBC connection can be a valid option.
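If the lookup result is large or reused by many tasks, one variant of the same pattern (just a sketch; getResult, hiveContext, and the stream are taken from the question's edit, everything else is an assumption) is to broadcast the driver-side map so each executor receives a single copy instead of one per task closure:
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

directKafkaStream.foreachRDD((Function<JavaPairRDD<String, String>, Void>) rdd -> {
    // driver: run the Hive query once per micro-batch
    Map<String, String> lookupMap = getResult(hiveContext);
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(rdd.context());
    Broadcast<Map<String, String>> broadcastLookup = jsc.broadcast(lookupMap);

    rdd.foreachPartition(tuple2Iterator -> {
        // workers: only the broadcast handle travels with the closure
        while (tuple2Iterator != null && tuple2Iterator.hasNext()) {
            Tuple2<String, String> item = tuple2Iterator.next();
            String result = broadcastLookup.value().get(item._2());
        }
    });
    return null;
});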
I have a big table in HBase named UserAction, and it has three column families (song, album, singer). I need to fetch all the data from the 'song' column family as a JavaRDD object. I tried this code, but it's not efficient. Is there a better way to do this?
static SparkConf sparkConf = new SparkConf().setAppName("test").setMaster(
"local[4]");
static JavaSparkContext jsc = new JavaSparkContext(sparkConf);
static void getRatings() {
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "UserAction");
conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song");
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
.newAPIHadoopRDD(
conf,
TableInputFormat.class,
org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);
JavaRDD<Rating> count = hBaseRDD
.map(new Function<Tuple2<ImmutableBytesWritable, Result>, JavaRDD<Rating>>() {
@Override
public JavaRDD<Rating> call(
Tuple2<ImmutableBytesWritable, Result> t)
throws Exception {
Result r = t._2;
int user = Integer.parseInt(Bytes.toString(r.getRow()));
ArrayList<Rating> ra = new ArrayList<>();
for (Cell c : r.rawCells()) {
int product = Integer.parseInt(Bytes
.toString(CellUtil.cloneQualifier(c)));
double rating = Double.parseDouble(Bytes
.toString(CellUtil.cloneValue(c)));
ra.add(new Rating(user, product, rating));
}
return jsc.parallelize(ra);
}
})
.reduce(new Function2<JavaRDD<Rating>, JavaRDD<Rating>, JavaRDD<Rating>>() {
@Override
public JavaRDD<Rating> call(JavaRDD<Rating> r1,
JavaRDD<Rating> r2) throws Exception {
return r1.union(r2);
}
});
jsc.stop();
}
The song column family schema design is:
RowKey = userID, columnQualifier = songID and value = rating.
UPDATE: OK, I see your problem now; for some crazy reason you're turning your arrays into RDDs with return jsc.parallelize(ra);. Why are you doing that?? Why are you creating an RDD of RDDs?? Why not leave them as arrays? When you do the reduce you can then concatenate the arrays. An RDD is a Resilient Distributed Dataset - it does not make logical sense to have a Distributed Dataset of Distributed Datasets. I'm surprised your job even runs and doesn't crash! Anyway, that's why your job is so slow.
Anyway, in Scala after your map, you would just do a flatMap(identity) and that would concatenate all your lists together.
I don't really understand why you need to do a reduce; maybe that is where you have something inefficient going on. Here is my code to read HBase tables (it's generalized - i.e. it works for any schema). One thing to be sure of when you read the HBase table is that the number of partitions is suitable (usually you want a lot).
type HBaseRow = java.util.NavigableMap[Array[Byte],
java.util.NavigableMap[Array[Byte], java.util.NavigableMap[java.lang.Long, Array[Byte]]]]
// Map(CF -> Map(column qualifier -> Map(timestamp -> value)))
type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]
def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
navMap.asScala.toMap.map(cf =>
(cf._1, cf._2.asScala.toMap.map(col =>
(col._1, col._2.asScala.toMap.map(elem => (elem._1.toLong, elem._2))))))
def readTableAll(table: String): RDD[(Array[Byte], CFTimeseriesRow)] = {
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, table)
sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
.map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
}
As you can see, I have no need for a reduce in my code. The methods are pretty self-explanatory. I could dig further into your code, but I lack the patience to read Java as it's so epically verbose.
I have some more code specifically for fetching the most recent elements from the row (rather than the entire history). Let me know if you want to see that.
Finally, I recommend you look into using Cassandra over HBase, as DataStax is partnering with Databricks.
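For completeness, here is a rough Java sketch of the flatMap approach described above, using the question's schema (rowkey = userID, column qualifier = songID, value = rating) and the jsc and conf objects from the question. Note that on Spark 1.x FlatMapFunction.call returns an Iterable, while on Spark 2.x+ it returns an Iterator; the 1.x form is shown to match the question:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;

JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc.newAPIHadoopRDD(
        conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

// One flat RDD of ratings; no nested RDDs and no reduce needed.
JavaRDD<Rating> ratings = hBaseRDD.flatMap(
        new FlatMapFunction<Tuple2<ImmutableBytesWritable, Result>, Rating>() {
            @Override
            public Iterable<Rating> call(Tuple2<ImmutableBytesWritable, Result> t) {
                Result r = t._2();
                int user = Integer.parseInt(Bytes.toString(r.getRow()));
                List<Rating> out = new ArrayList<>();
                for (Cell c : r.rawCells()) {
                    int product = Integer.parseInt(Bytes.toString(CellUtil.cloneQualifier(c)));
                    double rating = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(c)));
                    out.add(new Rating(user, product, rating));
                }
                return out;
            }
        });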