I am new to Cloud Dataflow / Apache Beam, so the concept/programming is still hazy to me.
What I want is for Dataflow to listen to Pubsub and get messages in this JSON format:
{
  "productId": "...",
  "productName": "..."
}
And transform that to:
{
  "productId": "...",
  "productName": "...",
  "sku": "...",
  "inventory": {
    "revenue": <some Double>,
    "stocks": <some Integer>
  }
}
So the steps needed are:
(IngestFromPubsub) Get records from Pubsub by listening to a topic (1 Pubsub message = 1 record)
(EnrichDataFromAPI)
a. Deserialize the payload's JSON string into Java object
b. By calling an external API, using the sku, I can enrich the data of each record by adding the inventory attribute.
c. Serialize the records again.
(WriteToGCS) Then, every x records (x can be parameterized), write them to Cloud Storage.
Please also consider the trivial case where x=1.
(Is x=1 a good idea? I am afraid there will be too many Cloud Storage writes.)
Even though I am a Python guy, I am already having difficulty doing this in Python, and now I need to write it in Java. I am getting a headache reading Beam's Java examples; they are too verbose and difficult to follow. All I understand is that each step is an .apply on the PCollection.
So far, here is the result of my puny effort:
public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("IngestFromPubsub", PubsubIO.readStrings().fromTopic(options.getTopic()))
        // I don't really understand the next part; I just copied it from the official documentation and filled in some values
        .apply(Window.<String>into(FixedWindows.of(Duration.millis(5000)))
            .withAllowedLateness(Duration.millis(5000))
            .triggering(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.millis(1000)))
            .discardingFiredPanes()
        )
        .apply("EnrichDataFromAPI", ParDo.of(
            new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.element();
                    // help on this part: I heard I need to use Jackson but I'm not sure, and for the API call HttpClient should be sufficient
                    // ... deserialize, call API, serialize again ...
                    c.output(enrichedJSONString);
                }
            }
        ))
        .apply("WriteToGCS",
            TextIO.write().withWindowedWrites().withNumShards(1).to(options.getOutput()));

    PipelineResult result = pipeline.run();
}
Please fill in the missing parts, and also give me a tip on windowing (e.g. what the appropriate configuration is) and in which steps I should insert/apply it.
I don't think you need any of the windowing in your IngestFromPubsub and EnrichDataFromAPI. The purpose of windowing is to group records that are nearby in time into windows so you can run aggregate computations over them. Since you are not doing any aggregations and want to handle each record independently, you don't need windows.
Since you are always converting one input record to one output record, your EnrichDataFromAPI should be a MapElements. This should make the code easier.
There are resources out there for processing JSON in Apache Beam Java: Apache Beam stream processing of json data
You don't necessarily need to use Jackson to map the JSON to a Java object. You might be able to manipulate the JSON directly. You can use Java's native JSON API to parse/manipulate/serialize.
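To make the MapElements suggestion concrete, here is a minimal, untested sketch. Product, EnrichedProduct, and callInventoryApi() are hypothetical names, and Jackson's ObjectMapper is just one option for the JSON handling. Note that TextIO still needs windowed writes on an unbounded source, so a plain fixed window is applied only just before the sink; no triggers or lateness tuning are needed for this use case.

// Assumes imports for MapElements, TypeDescriptors, Window, FixedWindows, Duration,
// PubsubIO, TextIO, and com.fasterxml.jackson.databind.ObjectMapper.
pipeline
    .apply("IngestFromPubsub", PubsubIO.readStrings().fromTopic(options.getTopic()))
    .apply("EnrichDataFromAPI", MapElements
        .into(TypeDescriptors.strings())
        .via((String json) -> {
            try {
                ObjectMapper mapper = new ObjectMapper();
                Product product = mapper.readValue(json, Product.class);   // deserialize
                EnrichedProduct enriched = callInventoryApi(product);      // call the external API (hypothetical helper)
                return mapper.writeValueAsString(enriched);                // serialize again
            } catch (IOException e) {
                throw new RuntimeException("Failed to enrich record", e);
            }
        }))
    // fixed window applied only so the unbounded write to GCS can be windowed
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply("WriteToGCS", TextIO.write().withWindowedWrites().withNumShards(1).to(options.getOutput()));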
Related
I'm a complete newbie to Flink and cluster computing. I spent all day trying to correctly parse a simple stream from Kafka in Flink with NO results: it's a bit frustrating...
I have a stream of JSON-LD messages in Kafka, each identified by a string key. I would simply like to retrieve them in Flink and then separate messages with different keys.
1)
Initially I considered sending the messages as String instead of JSON-LD. I thought that would be easier...
I tried every deserialiser but none works. The simple string deserialiser obviously works, but it completely ignores the keys.
I believed I had to use the following (Flink apparently has just two deserialisers that support keys):
DataStream<Object> stream = env
.addSource(new FlinkKafkaConsumer010<>("topicTest", new TypeInformationKeyValueSerializationSchema(String.class, String.class, env.getConfig()), properties))
.rebalance();
stream.print();
But I obtain:
06/12/2017 02:09:12 Source: Custom Source(4/4) switched to FAILED
java.io.EOFException
at org.apache.flink.runtime.util.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:306)
How can I receive the stream messages without losing the keys?
2)
My Kafka producer is implemented in JavaScript. Since Flink supports JSON deserialization, I thought about sending JSON objects directly to Kafka.
I'm not sure that works correctly with JSON-LD, but I've used:
json.parse(jsonld_message)
to serialize the message as JSON. Then I sent this with the usual string key.
But in Flink this code doesn't work:
DataStream<ObjectNode> stream = env
.addSource(new FlinkKafkaConsumer010<>("topicTest", new JSONKeyValueDeserializationSchema(false), properties))
.rebalance();
stream.print();
raising a JsonParserException.
I think the first approach is simpler, and I prefer it because it allows me to consider one problem at a time (first: receive the data; second: convert the string back into JSON-LD with an external library, I guess).
SOLVED:
Finally I decided to implement a custom deserializer implementing the KeyedDeserializationSchema interface.
In order to use Flink's TypeInformationKeyValueSerializationSchema to read data from Kafka, the data must be written in a compatible way. Assuming that your key and value are of type String, the key and value must be written in a way that Flink's StringSerializer understands.
Consequently, you have to make sure that your Kafka producer writes the data in a compatible way. Otherwise Flink won't be able to read the data.
I faced a similar issue. Ideally, TypeInformationKeyValueSerializationSchema with String types for keys and values should have been able to read my Kafka records, which have both keys and values as Strings, but it was not able to and hit an EOF exception as pointed out in the post above. So this issue is easily reproducible and needs to be fixed. Please let me know if I can be of any help in this process. In the meantime I implemented a custom deserializer using Flink's KafkaDeserializationSchema. Here is the code, since there is little documentation on reading keys/values and other record details:
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;
public class CustomKafkaSerializer implements KafkaDeserializationSchema<Tuple2<String, String>> {

    @Override
    public boolean isEndOfStream(Tuple2<String, String> stringStringPair) {
        return false;
    }

    @Override
    public Tuple2<String, String> deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception {
        String key = new String(consumerRecord.key());
        String value = new String(consumerRecord.value());
        return new Tuple2<>(key, value);
    }

    @Override
    public TypeInformation<Tuple2<String, String>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple2<String, String>>(){});
    }
}
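For completeness, here is a hedged usage sketch of the custom schema above. It assumes a Flink version whose FlinkKafkaConsumer010 constructor accepts a KafkaDeserializationSchema (older releases expose the equivalent KeyedDeserializationSchema instead):

// Plug the custom schema into the consumer from the question; env and properties as before.
DataStream<Tuple2<String, String>> stream = env
        .addSource(new FlinkKafkaConsumer010<>("topicTest", new CustomKafkaSerializer(), properties))
        .rebalance();
stream.print();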
I have a scheduler that gets our cluster metrics and writes the data to an HDFS file using an older version of the Cloudera API. Recently we updated our JARs, and the original code now fails with an exception.
java.lang.ClassCastException: org.apache.hadoop.io.ArrayWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)
I need help using the ParquetHiveRecord class to write the data (which are POJOs) in Parquet format.
Code sample below:
Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable value = new ArrayWritable(Writable.class, values);
writer.write(value); // <-- Getting exception here
Details of "writer" (of type ParquetWriter):
MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA is a string with our schema definition
ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(fileName,
    new DataWritableWriteSupport() {
        @Override
        public WriteContext init(Configuration conf) {
            if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
                conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
            return super.init(conf); // the override still has to return the base class's WriteContext
        }
    });
Also, we were using CDH and CM 5.5.1 before; now we're using 5.8.3.
Thanks!
I think you need to use DataWritableWriter rather than ParquetWriter. The class cast exception indicates the write support class is expecting an instance of ParquetHiveRecord instead of ArrayWritable. DataWritableWriter likely breaks down the individual records in ArrayWritable to individual messages in the form of ParquetHiveRecord and sends each to the write support.
Parquet is sort of mind bending at times. :)
Looking at the code of the DataWritableWriteSupport class:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
You can see it is already using DataWritableWriter, so you do not need to create an instance of DataWritableWriter yourself; the idea of write support is that you can write different formats to Parquet.
What you do need is to wrap your writables in a ParquetHiveRecord.
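A rough, untested sketch of that wrapping idea is below. The column names, their inspectors, and the row layout are assumptions for illustration; they must mirror the SCHEMA the writer was created with, and the writer's generic type then becomes ParquetWriter<ParquetHiveRecord>.

// Build an ObjectInspector matching a hypothetical two-string-column schema.
// Uses org.apache.hadoop.hive.serde2.objectinspector.* and org.apache.hadoop.io.Text.
List<String> fieldNames = Arrays.asList("metric_name", "host");
List<ObjectInspector> fieldInspectors = Arrays.<ObjectInspector>asList(
        PrimitiveObjectInspectorFactory.writableStringObjectInspector,
        PrimitiveObjectInspectorFactory.writableStringObjectInspector);
StructObjectInspector rowInspector =
        ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldInspectors);

// The standard struct inspector reads field values from a List, so represent the row that way
// and hand the write support a ParquetHiveRecord (row + inspector) instead of a bare ArrayWritable.
List<Object> row = Arrays.<Object>asList(new Text("cpu_load"), new Text("node-01"));
writer.write(new ParquetHiveRecord(row, rowInspector));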
I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I want to do something more than a word count on a text file/stream/Kafka queue, which is the only thing the docs seem to cover.
I want to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);

groupByKeyList.foreachRDD(rdd -> {
    List<MyThing> myThingsList = new ArrayList<>();
    MyCalculationCode myCalc = new MyCalculationCode();

    rdd.foreachPartition(partition -> {
        while (partition.hasNext()) {
            Tuple2<String, byte[]> keyAndMessage = partition.next();
            MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); // parse from protobuf format
            myThingsList.add(aSingleMyThing);
        }
    });

    List<MyResult> results = myCalc.doTheStuff(myThingsList);
    // other code here to write results to file
});
When debugging I see that in the while (partition.hasNext()) the myThingsList has a different memory address than the declared List<MyThing> myThingsList in the outer forEachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called there are no results because the myThingsList is a different instance of the List.
I'd like a solution to this problem, but I would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but a section/paragraph or, better still, a link to 'JavaDoc' that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the Executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? It means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
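To make the /* implement logic */ part concrete, one hedged possibility is to accumulate each key's values into a list (the HashPartitioner and its partition count of 4 are arbitrary assumptions):

// Collect every byte[] seen for a key into a List instead of discarding all but the first.
// Uses org.apache.spark.HashPartitioner plus java.util.ArrayList/Collections.
JavaPairDStream<String, List<byte[]>> grouped = kafkaStream.combineByKey(
        v -> new ArrayList<>(Collections.singletonList(v)),  // createCombiner: start a list for a new key
        (list, v) -> { list.add(v); return list; },           // mergeValue: append within a partition
        (l1, l2) -> { l1.addAll(l2); return l1; },            // mergeCombiners: merge lists across partitions
        new HashPartitioner(4));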
I need to use a non-serialisable 3rd party class in my functions on all executors in Spark, for example:
JavaRDD<String> resRdd = origRdd
    .flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String t) throws Exception {
            // A DynamoDB mapper I don't want to initialise every time
            DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient(credentials));
            Set<String> userFav = mapper.load(userDataDocument.class, userId).getFav();
            return userFav;
        }
    });
I would like to have a static DynamoDBMapper mapper which I initialise once for every executor and be able to use it over and over again.
Since it's not serialisable, I can't initialise it once in the driver and broadcast it.
Note: there is an answer for this (What is the right way to have a static object on all workers), but it's only for Scala.
You can use mapPartitions or foreachPartition. Here is a snippet taken from Learning Spark:
By using partition-based operations, we can share a connection pool to this database to avoid setting up many connections, and reuse our JSON parser. As Examples 6-10 through 6-12 show, we use the mapPartitions() function, which gives us an iterator of the elements in each partition of the input RDD and expects us to return an iterator of our results.
This allows us to initialize one connection per executor, then iterate over the elements in the partition however you would like. This is very useful for saving data into some external database or for expensive reusable object creation.
Here is a simple Scala example taken from the linked book. It can be translated to Java if needed; it's just here to show a simple use case of mapPartitions and foreachPartition.
ipAddressRequestCount.foreachRDD { rdd => rdd.foreachPartition { partition =>
// Open connection to storage system (e.g. a database connection)
partition.foreach { item =>
// Use connection to push item to system
}
// Close connection
}
}
Here is a link to a java example.
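For the DynamoDBMapper case in the question, a hedged Java sketch of the same per-partition pattern could look like this. It assumes Spark 2.x, where the function passed to mapPartitions returns an Iterator, and it reuses credentials and userDataDocument from the question:

// Create the non-serialisable DynamoDBMapper once per partition (i.e. once per task on the
// executor), then reuse it for every element of that partition.
JavaRDD<String> resRdd = origRdd.mapPartitions(userIds -> {
    DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient(credentials));
    List<String> favourites = new ArrayList<>();
    while (userIds.hasNext()) {
        String userId = userIds.next();
        favourites.addAll(mapper.load(userDataDocument.class, userId).getFav());
    }
    return favourites.iterator(); // Spark 1.x versions of mapPartitions expect an Iterable instead
});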
data: [
{
type: "earnings"
info: {
earnings: 45.6
dividends: 4052.94
gains: 0
expenses: 3935.24
shares_bought: 0
shares_bought_user_count: 0
shares_sold: 0
shares_sold_user_count: 0
}
created: "2011-07-04 11:46:17"
}
{
type: "mentions"
info: [
{
type_id: "twitter"
mentioner_ticker: "LOANS"
mentioner_full_name: "ERICK STROBEL"
}
]
created: "2011-06-10 23:03:02"
}
]
Here's my problem: as you can see, the "info" element is different in each one; one is a JSON object and the other is a JSON array. I usually choose Gson to parse the data, but with Gson we can't do this kind of thing. How can I make it work?
If you want to use Gson, then to handle the issue where the same JSON element value is sometimes an array and sometimes an object, custom deserialization processing is necessary. I posted an example of this in the Parsing JSON with GSON, object sometimes contains list sometimes contains object post.
If the "info" element object has different elements based on type, and so you want polymorphic deserialization behavior to deserialize to the correct type of object, then with Gson you'll also need to implement custom deserialization processing. How to do that has been covered in other StackOverflow.com posts. I posted a link to four different such questions and answers (some with code examples) in the Can I instantiate a superclass and have a particular subclass be instantiated based on the parameters supplied thread. In that thread, the particular structure of the JSON objects to deserialize differs from the examples I just linked, because the element that indicates the type is external to the object to be deserialized, but if you can understand the other examples, then handling the problem here should be easy.
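As an illustration of the custom-deserialization idea (not taken from the linked posts), here is a rough sketch of a Gson JsonDeserializer that normalizes "info" so that both a single object and an array end up as a List. The Info class is a hypothetical placeholder for whatever fields you map, and it assumes your wrapper class declares the field as List<Info>:

import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.List;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonDeserializationContext;
import com.google.gson.JsonDeserializer;
import com.google.gson.JsonElement;
import com.google.gson.JsonParseException;
import com.google.gson.reflect.TypeToken;

class Info { /* hypothetical holder for the "info" attributes */ }

class InfoListDeserializer implements JsonDeserializer<List<Info>> {
    @Override
    public List<Info> deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context)
            throws JsonParseException {
        List<Info> result = new ArrayList<>();
        if (json.isJsonArray()) {
            // "mentions" style: an array of info objects
            for (JsonElement element : json.getAsJsonArray()) {
                result.add(context.deserialize(element, Info.class));
            }
        } else {
            // "earnings" style: a single info object
            result.add(context.deserialize(json, Info.class));
        }
        return result;
    }
}

// Registration, so that any field declared as List<Info> goes through the adapter:
Gson gson = new GsonBuilder()
        .registerTypeAdapter(new TypeToken<List<Info>>() {}.getType(), new InfoListDeserializer())
        .create();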
Both key and value have to be within quotes, and you need to separate definitions with commas:
{
"key0": "value0",
"key1": "value1",
"key2": [ "value2_0", "value2_1" ]
}
That should do the trick!
The info element should have the same structure for each given type.
So check the type first. Pseudocode:
if (data.get("type").equals("mentions")) {
    json_arr = data.get("info");
}
else if (data.get("type").equals("earnings")) {
    json_obj = data.get("info");
}
I'm not sure that helps, cause I'm not sure I understand the question.
Simply use the org.json classes that are available in Android: http://developer.android.com/reference/org/json/package-summary.html
You will get a dynamic structure that you will be able to traverse, without the limitations of strong typing.
This is not a "usual" way of doing things in Java (where strong typing is the default), but IMHO in many situations, even in Java, it is OK to do some dynamic processing. Flexibility is gained, but the price to pay is the lack of compile-time type verification, which in many cases is acceptable.
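To illustrate, here is a rough sketch using those org.json classes to traverse the structure dynamically, branching on whether "info" arrives as an object or an array. It assumes the payload is valid JSON (quoted keys and commas, as the other answer points out) and that the "data" array sits inside a top-level object:

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

public class DynamicInfoReader {
    public static void readInfo(String payload) throws JSONException {
        JSONArray data = new JSONObject(payload).getJSONArray("data");
        for (int i = 0; i < data.length(); i++) {
            JSONObject entry = data.getJSONObject(i);
            Object info = entry.get("info");
            if (info instanceof JSONArray) {
                // "mentions": an array of mention objects
                JSONArray mentions = (JSONArray) info;
                for (int j = 0; j < mentions.length(); j++) {
                    System.out.println(mentions.getJSONObject(j).getString("mentioner_ticker"));
                }
            } else {
                // "earnings": a single object with numeric fields
                System.out.println(((JSONObject) info).getDouble("earnings"));
            }
        }
    }
}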
If changing libraries is an option, you could have a look at Jackson; its Simple Data Binding mode should allow you to deserialize an object like the one you describe. A part of the doc that is probably quite important is this: your example would already need JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES to work.
Clarification for Bruce: true, in Jackson's Full Data Binding mode, but not in Simple Data Binding mode. This is simple data binding:
public static void main(String[] args) throws IOException {
    File src = new File("test.json");
    ObjectMapper mapper = new ObjectMapper();
    mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
    mapper.configure(JsonParser.Feature.ALLOW_COMMENTS, true);
    Object root = mapper.readValue(src, Object.class);
    Map<?, ?> rootAsMap = mapper.readValue(src, Map.class);
    System.out.println(rootAsMap);
}
which with the OP's slightly corrected sample JSON data gives:
{data=[{type=earnings, info={earnings=45.6, dividends=4052.94, gains=0,
expenses=3935.24, shares_bought=0, shares_bought_user_count=0, shares_sold=0,
shares_sold_user_count=0}, created=2011-07-04 11:46:17}, {type=mentions,
info=[{type_id=twitter, mentioner_ticker=LOANS, mentioner_full_name=ERICK STROBEL}],
created=2011-06-10 23:03:02}]}
OK, some hand-coding is needed to wire this Map up to the original data, but quite often less is more, and such mapping code, being dead simple, has the advantage of being very easy to read/maintain later on.