Context:
We have some schema files in Cloud Storage. In our Dataflow job, we need to refer to these schema files to transform our data. These schema files change on a daily/weekly basis. Our data source is Pub/Sub and we window the Pub/Sub messages into fixed windows of 1 minute. The schema files we need fit comfortably in memory; they are about 90 MB.
What I have tried:
Referring to this doc from Apache Beam, we created a side input that writes into a global window with a GenerateSequence like so:
// Creates a side input that refreshes the schema blobs once a day
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, byte[]>>() {
          @ProcessElement
          public void processElement(ProcessContext ctx) throws Exception {
            byte[] avroSchemaBlob = getAvroSchema();
            byte[] fileDescriptorSetBlob = getFileDescriptorSet();
            byte[] depsBlob = getFileDescriptorDeps();
            Map<String, byte[]> dataBlobs = ImmutableMap.of(
                "version", Longs.toByteArray(ctx.element().byteValue()),
                "avroSchemaBlob", avroSchemaBlob,
                "fileDescriptorSetBlob", fileDescriptorSetBlob,
                "depsBlob", depsBlob);
            ctx.output(dataBlobs);
          }
        }))
        .apply(View.asSingleton());
"getAvroSchema", "getFileDescriptorSet" and "getFileDescriptorDeps" read files as byte[] from Cloud Storage.
However, this approach failed with the exception:
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.
I then tried writing my own Combine Globally function like so:
static class GetLatestVersion implements SerializableFunction<Iterable<Map<String, byte[]>>, Map<String, byte[]>> {
  @Override
  public Map<String, byte[]> apply(Iterable<Map<String, byte[]>> versions) {
    Map<String, byte[]> result = Maps.newHashMap();
    Long maxVersion = Long.MIN_VALUE;
    for (Map<String, byte[]> version : versions) {
      Long currentVersion = Longs.fromByteArray(version.get("version"));
      logger.info("Side input version: " + currentVersion);
      if (currentVersion > maxVersion) {
        result = version;
        maxVersion = currentVersion;
      }
    }
    return result;
  }
}
But it still triggers the same exception.
I then came across this and this in the Beam email archives, and it seems like what's suggested in the Beam doc does not work: I have to use a multimap view to avoid the exception I ran into above. With a multimap, I will also have to iterate through the values and apply my own logic to pick the value I want (the latest).
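Roughly, the selection logic I'd need with a multimap view would look something like this (just a sketch; it assumes the values for each key are iterated in firing order, which I'm not sure is guaranteed):
// Sketch: given the Map<String, Iterable<byte[]>> returned by ctx.sideInput(multimapView)
// when the view is built with View.asMultimap(), keep only the last value seen per key.
// The "last one wins" assumption is mine, not something the docs promise.
static Map<String, byte[]> pickLatest(Map<String, Iterable<byte[]>> multimap) {
  Map<String, byte[]> latest = Maps.newHashMap();
  for (Map.Entry<String, Iterable<byte[]>> entry : multimap.entrySet()) {
    for (byte[] value : entry.getValue()) {
      latest.put(entry.getKey(), value); // last one wins
    }
  }
  return latest;
}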
My questions:
Why do I still get the exception "PCollection with more than one element accessed as a singleton view" even after I globally combine everything into 1 result?
If I go with the multimap approach, wouldn't the job eventually run out of memory? Every day we would effectively grow the multimap by another 90 MB (the size of our data blobs), unless Dataflow has some smart multimap implementation behind the scenes.
What is the recommended way to do this?
Thanks
Use .apply(View.asMap()) instead of .apply(View.asSingleton());
This is the full example:
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, KV<String, byte[]>>() {
          @ProcessElement
          public void processElement(ProcessContext ctx) throws Exception {
            byte[] avroSchemaBlob = getAvroSchema();
            byte[] fileDescriptorSetBlob = getFileDescriptorSet();
            byte[] depsBlob = getFileDescriptorDeps();
            ctx.output(KV.of("version", Longs.toByteArray(ctx.element().byteValue())));
            ctx.output(KV.of("avroSchemaBlob", avroSchemaBlob));
            ctx.output(KV.of("fileDescriptorSetBlob", fileDescriptorSetBlob));
            ctx.output(KV.of("depsBlob", depsBlob));
          }
        }))
        .apply(View.asMap());
You can then read the map from the side input in your main ParDo, as described in the documentation.
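For illustration, consuming the side input could look roughly like this (a sketch only; the main-input type and the transform logic are placeholders, not from the original pipeline):
// Sketch: a DoFn that reads the schema blobs from the map side input on every element.
// The String input/output types here are placeholders.
static class ApplySchemaFn extends DoFn<String, String> {
  private final PCollectionView<Map<String, byte[]>> dataBlobView;

  ApplySchemaFn(PCollectionView<Map<String, byte[]>> dataBlobView) {
    this.dataBlobView = dataBlobView;
  }

  @ProcessElement
  public void processElement(ProcessContext ctx) {
    Map<String, byte[]> blobs = ctx.sideInput(dataBlobView);
    byte[] avroSchemaBlob = blobs.get("avroSchemaBlob");
    // ... transform ctx.element() using the schema blobs and emit the result ...
    ctx.output(ctx.element());
  }
}

// Wiring it up: the side input must be declared on the ParDo, e.g.
// messages.apply(ParDo.of(new ApplySchemaFn(dataBlobView)).withSideInputs(dataBlobView));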
Apache Beam version 2.34.0
Related
I have two DataStreams: the first, DataStream<String> source, receives records from a message broker, and the second, SingleOutputStreamOperator<Event> events, is the result of mapping the source into Event.class.
I have use cases that need SingleOutputStreamOperator<Event> events and others that use DataStream<String> source. In one of the use cases that uses DataStream<String> source, I need to take the SingleOutputStreamOperator<String> result of applying some filters and, to avoid mapping the source into Event.class again (I already have that operation done and that stream), look up each record of the filtered stream in SingleOutputStreamOperator<Event> events, then apply another map to produce a SingleOutputStreamOperator<EventOutDto> output.
This is the idea, as an example:
DataStream<String> source = env.readFrom(source);
SingleOutputStreamOperator<Event> events = source.map(s -> mapper.readValue(s, Event.class));

public void filterAndJoin(DataStream<String> source, SingleOutputStreamOperator<Event> events) {
  SingleOutputStreamOperator<String> filtered = source.filter(new FilterFunction());
  // result = (this will be the result of searching each record, by id, from the filtered
  // stream in the events stream; the ids must match and the event is returned if found)
  SingleOutputStreamOperator<EventOutDto> result = ...
      .map(event -> new EventOutDto(event)).addSink(new RichSinkFunction());
}
I have this code:
filtered.join(events)
    .where(k -> {
      JsonNode tree = mapper.readTree(k);
      String id = "";
      if (tree.get("Id") != null) {
        id = tree.get("Id").asText();
      }
      return id;
    })
    .equalTo(e -> {
      return e.Id;
    })
    .window(TumblingEventTimeWindows.of(Time.seconds(1)))
    .apply(new JoinFunction<String, Event, EventOutDto>() {
      @Override
      public EventOutDto join(String s, Event event) throws Exception {
        return new EventOutDto(event);
      }
    })
    .addSink(new SinkFunction());
In the above code everything runs and the ids are the same, so the where(id).equalTo(id) should match, but the process never reaches the apply function.
Observation: watermarks are assigned with the same timestamp.
Questions:
Any idea why?
Have I explained myself clearly?
I solved the join by doing this:
SingleOutputStreamOperator<ObjectDTO> triggers = candidates
    .keyBy(new KeySelector())
    .intervalJoin(keyedStream.keyBy(e -> e.Id))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessFunctionOne())
    .keyBy(k -> k.otherId)
    .process(new ProcessFunctionTwo());
Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading the streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of the Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
    pipeline
        .apply("read streams", dataIO.readStreams(topic))
        .apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
          @Override
          public Boolean apply(KV<String, Person> input) {
            return input.getValue().get("address").equals(true);
          }
        }));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
    .withTopic(topic)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(PersonAvroDeserializer.class)
    .withConsumerConfigUpdates(consumer)
    .withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
  'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
  'topics' = 'FILTERED_STREAM',
  …
);
You can also do this by creating a new TableSchema yourself with only the required fields. Later, when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();

TableFieldSchema id =
    new TableFieldSchema()
        .setName("id")
        .setType("STRING")
        .setMode("NULLABLE");

tableFields.add(id);
schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an Avro record to BigQuery's TableRow at some point, you may need to implement some checks there too.
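For example, such a conversion with checks might look roughly like this (a sketch; the field names follow the question, but reading the values from an Avro GenericRecord is an assumption about how the records are handled):
import com.google.api.services.bigquery.model.TableRow;
import org.apache.avro.generic.GenericRecord;

// Sketch: keep only the fields the reduced TableSchema declares, guarding against missing values.
static TableRow toTableRow(GenericRecord person) {
  TableRow row = new TableRow();
  if (person.get("id") != null) {
    row.set("id", person.get("id").toString());
  }
  if (person.get("name") != null) {
    row.set("name", person.get("name").toString());
  }
  if (person.get("address") != null) {
    row.set("address", person.get("address").toString());
  }
  return row;
}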
I've developed a couple of C++ apps that produce and consume Kafka messages (using cppkafka) embedding Protobuf3 messages. Both work fine. The producer's relevant code is:
std::string kafkaString;
cppkafka::MessageBuilder *builder;
...
solidList->SerializeToString(&kafkaString);
builder->payload(kafkaString);
Protobuf objects are serialized to string and inserted as Kafka payload. Everything works fine up to this point. Now, I'm trying to develop a consumer for that in Java. The relevant code should be:
KafkaConsumer<Long, String> consumer=new KafkaConsumer<Long, String>(properties);
....
ConsumerRecords<Long, String> records = consumer.poll(100);
for (ConsumerRecord<Long, String> record : records) {
SolidList solidList = SolidList.parseFrom(record.value());
...
but that fails at compile time: parseFrom complains: The method parseFrom(ByteBuffer) in the type Solidlist.SolidList is not applicable for the arguments (String). So, I try using a ByteBuffer:
KafkaConsumer<Long, ByteBuffer> consumer=new KafkaConsumer<Long, ByteBuffer>(properties);
....
ConsumerRecords<Long, ByteBuffer> records = consumer.poll(100);
for (ConsumerRecord<Long, ByteBuffer> record : records) {
SolidList solidList = SolidList.parseFrom(record.value());
...
Now, the error is on execution time, still on parseFrom(): Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.nio.ByteBuffer. I know it is a java.lang.String!!! So, I get back to the original, and try using it as a byte array:
SolidList solidList = SolidList.parseFrom(record.value().getBytes());
Now, the error is on execution time: Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
The protobuf documentation states for the C++ serialization: bool SerializeToString(string* output) const; serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
TL;DR: In consequence, how should I interpret the protobuf C++ "binary bytes" in Java?
This seems related (it is the opposite) but doesn't help: Protobuf Java To C++ Serialization [Binary]
Thanks in advance.
Try implementing a Deserializer and passing it to the KafkaConsumer constructor as the value deserializer. It could look like this:
class SolidListDeserializer implements Deserializer<SolidList> {
  @Override
  public SolidList deserialize(final String topic, final byte[] data) {
    try {
      return SolidList.parseFrom(data);
    } catch (InvalidProtocolBufferException e) {
      throw new SerializationException("Failed to parse SolidList from record value", e);
    }
  }
  ...
}
...
KafkaConsumer<Long, SolidList> consumer = new KafkaConsumer<>(props, new LongDeserializer(), new SolidListDeserializer());
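Usage would then be along these lines (a sketch; the topic name is a placeholder):
// The value comes back already parsed; no manual parseFrom needed.
consumer.subscribe(Collections.singletonList("solid-topic"));
ConsumerRecords<Long, SolidList> records = consumer.poll(100);
for (ConsumerRecord<Long, SolidList> record : records) {
  SolidList solidList = record.value();
}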
You can read Kafka as ConsumerRecords<Long, String> and then call SolidList.parseFrom(ByteBuffer.wrap(record.value().getBytes("UTF-8")));
I have a simple program because I'm trying to receive data using Kafka. When I start a Kafka producer and send data, for example "Hello", I get this when I print the message: (null, Hello). I don't know why this null appears. Is there any way to avoid it? I think it's due to Tuple2<String, String>, the first parameter, but I only want to print the second parameter. Another thing: when I print using System.out.println("inside map " + message); no message appears at all. Does anyone know why? Thanks.
public static void main(String[] args) {
  SparkConf sparkConf = new SparkConf().setAppName("org.kakfa.spark.ConsumerData").setMaster("local[4]");
  // Substitute 127.0.0.1 with the actual address of your Spark Master (or use "local" to run in local mode)
  sparkConf.set("spark.cassandra.connection.host", "127.0.0.1");

  // Create the context with 2 seconds batch size
  JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

  Map<String, Integer> topicMap = new HashMap<>();
  String[] topics = KafkaProperties.TOPIC.split(",");
  for (String topic : topics) {
    topicMap.put(topic, KafkaProperties.NUM_THREADS);
  }

  /* connection to cassandra */
  CassandraConnector connector = CassandraConnector.apply(sparkConf);
  System.out.println("+++++++++++ cassandra connector created ++++++++++++++++++++++++++++");

  /* Receive kafka inputs */
  JavaPairReceiverInputDStream<String, String> messages =
      KafkaUtils.createStream(jssc, KafkaProperties.ZOOKEEPER, KafkaProperties.GROUP_CONSUMER, topicMap);
  System.out.println("+++++++++++++ streaming-kafka connection done +++++++++++++++++++++++++++");

  JavaDStream<String> lines = messages.map(
      new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> message) {
          System.out.println("inside map " + message);
          return message._2();
        }
      }
  );

  messages.print();

  jssc.start();
  jssc.awaitTermination();
}
Q1) Null values:
Messages in Kafka are keyed, which means they all have a (key, value) structure.
When you see (null, Hello), it is because the producer published a (null, "Hello") record to the topic.
If you want to omit the key in your process, map the original DStream to remove the key: kafkaDStream.map(new Function<Tuple2<String, String>, String>() {...})
Q2) System.out.println("inside map "+ message); does not print. A couple of classic reasons:
Transformations are applied in the executors, so when running in a cluster, that output will appear in the executors and not on the master.
Operations are lazy and DStreams need to be materialized for operations to be applied.
In this specific case, the JavaDStream<String> lines is never materialized, i.e. it is never used in an output operation, so the map is never executed.
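A minimal sketch of a fix, using the existing lines stream (any output operation would do):
// Sketch: adding an output operation on `lines` forces the map to run, so the
// "inside map" println will show up (on the executors when running on a cluster).
lines.print();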
I have a big table in HBase named UserAction, and it has three column families (song, album, singer). I need to fetch all the data from the 'song' column family as a JavaRDD object. I tried this code, but it's not efficient. Is there a better solution?
static SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local[4]");
static JavaSparkContext jsc = new JavaSparkContext(sparkConf);

static void getRatings() {
  Configuration conf = HBaseConfiguration.create();
  conf.set(TableInputFormat.INPUT_TABLE, "UserAction");
  conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song");

  JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
      .newAPIHadoopRDD(
          conf,
          TableInputFormat.class,
          org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
          org.apache.hadoop.hbase.client.Result.class);

  JavaRDD<Rating> count = hBaseRDD
      .map(new Function<Tuple2<ImmutableBytesWritable, Result>, JavaRDD<Rating>>() {
        @Override
        public JavaRDD<Rating> call(Tuple2<ImmutableBytesWritable, Result> t) throws Exception {
          Result r = t._2;
          int user = Integer.parseInt(Bytes.toString(r.getRow()));
          ArrayList<Rating> ra = new ArrayList<>();
          for (Cell c : r.rawCells()) {
            int product = Integer.parseInt(Bytes.toString(CellUtil.cloneQualifier(c)));
            double rating = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(c)));
            ra.add(new Rating(user, product, rating));
          }
          return jsc.parallelize(ra);
        }
      })
      .reduce(new Function2<JavaRDD<Rating>, JavaRDD<Rating>, JavaRDD<Rating>>() {
        @Override
        public JavaRDD<Rating> call(JavaRDD<Rating> r1, JavaRDD<Rating> r2) throws Exception {
          return r1.union(r2);
        }
      });

  jsc.stop();
}
The song column family schema design is:
RowKey = userID, columnQualifier = songID, and value = rating.
UPDATE: OK, I see your problem now: for some crazy reason you're turning your arrays into RDDs with return jsc.parallelize(ra);. Why are you doing that? Why are you creating an RDD of RDDs? Why not leave them as arrays? When you do the reduce you can then concatenate the arrays. An RDD is a Resilient Distributed Dataset; it does not make logical sense to have a Distributed Dataset of Distributed Datasets. I'm surprised your job even runs and doesn't crash! Anyway, that's why your job is so slow.
Anyway, in Scala after your map, you would just do a flatMap(identity) and that would concatenate all your lists together.
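In the Java API the equivalent is a flatMap that emits the Ratings directly; a rough sketch based on the parsing code in the question (note the FlatMapFunction return type is Iterable in Spark 1.x and Iterator in Spark 2.x):
// Sketch: emit Ratings straight from each HBase row instead of wrapping them in an inner RDD.
JavaRDD<Rating> ratings = hBaseRDD.flatMap(
    new FlatMapFunction<Tuple2<ImmutableBytesWritable, Result>, Rating>() {
      @Override
      public Iterable<Rating> call(Tuple2<ImmutableBytesWritable, Result> t) throws Exception {
        Result r = t._2;
        int user = Integer.parseInt(Bytes.toString(r.getRow()));
        List<Rating> ra = new ArrayList<>();
        for (Cell c : r.rawCells()) {
          int product = Integer.parseInt(Bytes.toString(CellUtil.cloneQualifier(c)));
          double rating = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(c)));
          ra.add(new Rating(user, product, rating));
        }
        return ra; // no reduce needed afterwards
      }
    });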
I don't really understand why you need to do a reduce; maybe that is where something inefficient is going on. Here is my code to read HBase tables (it's generalized, i.e. it works for any schema). One thing to check when you read the HBase table is that the number of partitions is suitable (usually you want a lot).
import scala.collection.JavaConverters._  // needed for .asScala below

type HBaseRow = java.util.NavigableMap[Array[Byte],
  java.util.NavigableMap[Array[Byte], java.util.NavigableMap[java.lang.Long, Array[Byte]]]]

// Map(CF -> Map(column qualifier -> Map(timestamp -> value)))
type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]

def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
  navMap.asScala.toMap.map(cf =>
    (cf._1, cf._2.asScala.toMap.map(col =>
      (col._1, col._2.asScala.toMap.map(elem => (elem._1.toLong, elem._2))))))

def readTableAll(table: String): RDD[(Array[Byte], CFTimeseriesRow)] = {
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, table)
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result])
    .map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
}
As you can see, I have no need for a reduce in my code. The methods are pretty self-explanatory. I could dig further into your code, but I lack the patience to read Java as it's so epically verbose.
I have some more code specifically for fetching the most recent elements from the row (rather than the entire history). Let me know if you want to see that.
Finally, I recommend you look into using Cassandra over HBase, as DataStax is partnering with Databricks.