I'm very new to Flink and cluster computing. I spent all day trying to correctly parse a simple stream from Kafka in Flink, with no results; it's a bit frustrating...
I have a stream of JSON-LD messages in Kafka, each identified by a string key. I would simply like to retrieve them in Flink and then separate the messages by key.
1)
Initially I considered sending the messages as String instead of JSON-LD. I thought it would be easier...
I tried every deserializer, but none works. The simple deserializer obviously works, but it completely ignores keys.
I believed I had to use the following (Flink apparently has just two deserializers that support keys):
DataStream<Object> stream = env
.addSource(new FlinkKafkaConsumer010<>("topicTest", new TypeInformationKeyValueSerializationSchema(String.class, String.class, env.getConfig()), properties))
.rebalance();
stream.print();
But I obtain:
06/12/2017 02:09:12 Source: Custom Source(4/4) switched to FAILED
java.io.EOFException
at org.apache.flink.runtime.util.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:306)
How can I receive the stream messages without losing the keys?
2)
My Kafka producer is implemented in JavaScript. Since Flink supports JSON deserialization, I thought I would send JSON objects to Kafka directly.
I'm not sure this works correctly with JSON-LD, but I've used:
json.parse(jsonld_message)
to serialize the message as JSON. Then I sent it with the usual string key.
But in Flink this code doesn't work:
DataStream<ObjectNode> stream = env
.addSource(new FlinkKafkaConsumer010<>("topicTest", new JSONKeyValueDeserializationSchema(false), properties))
.rebalance();
stream.print();
raising a
JsonParserException.
I think the first approach is simpler, and I prefer it because it lets me tackle one problem at a time (first: receive the data; second: convert the string back into JSON-LD, with an external library I guess).
SOLVED:
Finally I decided to implement a custom deserializer implementing the KeyedDeserializationSchema interface.
In order to use Flink's TypeInformationKeyValueSerializationSchema to read data from Kafka, the data must be written in a compatible way. Assuming that your key and value are of type String, both must be written in a format that Flink's StringSerializer understands.
Consequently, you have to make sure that your Kafka producer writes the data in a compatible way. Otherwise Flink won't be able to read the data.
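To illustrate why a format mismatch can surface as an EOFException, here is a minimal sketch in plain Java. It uses java.io's writeUTF/readUTF as a stand-in for a length-prefixed binary format; this is not Flink's actual StringSerializer layout, and the class and method names are made up for the example. A reader that expects a length prefix interprets the first bytes of a plain UTF-8 string as a length and then runs off the end of the record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Flink or Kafka.
final class FramingMismatchDemo {

    // What a typical string producer sends: raw UTF-8 bytes, no length prefix.
    static byte[] rawUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A length-prefixed encoding (stand-in for a framework-specific format).
    static byte[] lengthPrefixed(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the decoded string, or null if the reader hit the end of the input early.
    static String readLengthPrefixed(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (EOFException e) {
            return null; // the "length" read from the first bytes exceeds the available data
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readLengthPrefixed(lengthPrefixed("hello"))); // prints "hello"
        System.out.println(readLengthPrefixed(rawUtf8("hello")));        // prints "null"
    }
}
```

The practical consequence is the same as in the answer above: either the producer writes the format the consumer expects, or the consumer uses a deserializer that matches what the producer actually writes.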
I faced a similar issue. Ideally, TypeInformationKeyValueSerializationSchema with String types for keys and values should have been able to read my Kafka records, which have both keys and values as Strings, but it could not and failed with an EOFException, as pointed out in the post above. So this issue is easily reproducible and needs to be fixed. Please let me know if I can be of any help in this process. In the meantime I implemented a custom deserializer using KafkaDeserializationSchema. Here is the code, as there is little documentation on reading keys, values, and additional record fields:
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;
public class CustomKafkaSerializer implements KafkaDeserializationSchema<Tuple2<String,String>> {
@Override
public boolean isEndOfStream(Tuple2<String,String> stringStringPair) {
return false;
}
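@Override
public Tuple2<String,String> deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception {
// Assumes every record has a non-null key and value; decode explicitly as UTF-8 instead of the platform default charset.
String key = new String(consumerRecord.key(), java.nio.charset.StandardCharsets.UTF_8);
String value = new String(consumerRecord.value(), java.nio.charset.StandardCharsets.UTF_8);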
return new Tuple2<>(key,value);
}
@Override
public TypeInformation<Tuple2<String,String>> getProducedType() {
return TypeInformation.of(new TypeHint<Tuple2<String, String>>(){});
}
}
We need to read data from our checkpoints manually for various reasons (say we need to change the structure of our state class, so we want to read the state, restore it, and copy the data into a new type of object).
Reading works fine, but when we try to keep the data in memory and deploy to a Flink cluster, we get an empty list/map. In the log we see that we are reading and adding all our data to the list/map properly, but as soon as our method completes its work the data is lost and the list/map is empty :(
val env = ExecutionEnvironment.getExecutionEnvironment();
val savepoint = Savepoint.load(env, checkpointSavepointLocation, new HashMapStateBackend());
private List<KeyedAssetTagWithConfig> keyedAssetsTagWithConfigs = new ArrayList<>();
val keyedStateReaderFunction = new KeyedStateReaderFunctionImpl();
savepoint.readKeyedState("my-uuid", keyedStateReaderFunction)
.setParallelism(1)
.output(new MyLocalCollectionOutputFormat<>(keyedAssetsTagWithConfigs));
env.execute("MyJobName");
private static class KeyedStateReaderFunctionImpl extends KeyedStateReaderFunction<String, KeyedAssetTagWithConfig> {
private MapState<String, KeyedAssetTagWithConfig> liveTagsValues;
private Map<String, KeyedAssetTagWithConfig> keyToValues = new ConcurrentHashMap<>();
@Override
public void open(final Configuration parameters) throws Exception {
liveTagsValues = getRuntimeContext().getMapState(ExpressionsProcessor.liveTagsValuesStateDescriptor);
}
@Override
public void readKey(final String key, final Context ctx, final Collector<KeyedAssetTagWithConfig> out) throws Exception {
liveTagsValues.iterator().forEachRemaining(entry -> {
keyToValues.put(entry.getKey(), entry.getValue());
log.info("key {} -> {} val", entry.getKey(), entry.getValue());
out.collect(entry.getValue());
});
}
public Map<String, KeyedAssetTagWithConfig> getKeyToValues() {
return keyToValues;
}
}
As soon as this code executes, I expect keyedStateReaderFunction.getKeyToValues() to return a map containing all the values. But it returns an empty map, even though the logs show that we are reading all of them properly. The keyedAssetsTagWithConfigs list, which receives the output, is empty as well.
If anyone has any idea, it would be very helpful, because I'm lost; I have never before put data into a map and then lost it :) When I serialize my map or list to a text file and then deserialize it from there (using Jackson), my data is there, but that is a workaround rather than a solution.
Thanks in advance
The code you show creates and submits a Flink job to be executed in its own environment orchestrated by the Flink framework: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#flink-application-execution
The job runs independently of the code that builds and submits it, so when you call keyedStateReaderFunction.getKeyToValues(), you are calling the method on the object that was used to build the job, not on the object that actually ran in the Flink execution environment.
Your workaround seems like a valid option to me. You can then submit the file with your savepoint contents to your new job to recreate its state as you'd like.
You have an instance of KeyedStateReaderFunctionImpl in the Flink client which gets serialized and sent to each task manager. Each task manager then deserializes a copy of that KeyedStateReaderFunctionImpl and calls its open and readKey methods, and gradually builds up a private Map containing its share of the data extracted from the savepoint/checkpoint.
Meanwhile the original KeyedStateReaderFunctionImpl back in the Flink client has never had its open or readKey methods called, and doesn't hold any data.
In your case the parallelism is one, so there is only one task manager, but in general you will need to collect the output from each task manager and assemble the complete results from these pieces. These results are not available in the Flink client process because the work wasn't done there.
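The effect described above can be reproduced without Flink. The following is only a sketch in plain Java: it uses standard Java serialization as a stand-in for shipping a function to a task manager, and the class names are invented for the example. The deserialized copy fills its map, while the original object on the "client" side stays empty.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; not Flink API.
final class ShippedFunctionDemo {

    // Stand-in for a user function that accumulates data in a private field.
    static final class ReaderFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<String, String> seen = new HashMap<>();

        void readKey(String key, String value) {
            seen.put(key, value);
        }

        Map<String, String> getSeen() {
            return seen;
        }
    }

    // Serialize and deserialize: roughly what happens when a function is sent to a worker.
    static ReaderFunction shipToWorker(ReaderFunction function) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (ReaderFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {size of the client's map, size of the worker copy's map}.
    static int[] run() {
        ReaderFunction client = new ReaderFunction();
        ReaderFunction worker = shipToWorker(client);
        worker.readKey("k1", "v1");
        worker.readKey("k2", "v2");
        return new int[] {client.getSeen().size(), worker.getSeen().size()};
    }

    public static void main(String[] args) {
        int[] sizes = run();
        System.out.println("client=" + sizes[0] + ", worker=" + sizes[1]); // prints "client=0, worker=2"
    }
}
```

This is why results have to travel back through the job's output (a sink, a file, or a collected result) rather than through fields of the function object.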
I found a solution: I started the job in attached mode and collected the results in the main thread.
val env = ExecutionEnvironment.getExecutionEnvironment();
val configuration = env.getConfiguration();
configuration
.setBoolean(DeploymentOptions.ATTACHED, true);
...
val myresults = dataSource.collect();
Hope this helps somebody else, because I wasted a couple of days trying to find a solution.
I am new to Cloud Dataflow / Apache Beam, so the concepts and the programming model are still hazy to me.
What I want is for Dataflow to listen to Pubsub and receive JSON messages of this format:
{
"productId": "...",
"productName": "..."
}
And transform that to:
{
"productId": "...",
"productName": "...",
"sku": "...",
"inventory": {
"revenue": <some Double>,
"stocks": <some Integer>
}
}
So the steps needed are:
(IngestFromPubsub) Get records from Pubsub by listening to a topic (1 Pubsub message = 1 record)
(EnrichDataFromAPI)
a. Deserialize the payload's JSON string into Java object
b. By calling an external API, using the sku, I can enrich the data of each record by adding the inventory attribute.
c. Serialize the records again.
(WriteToGCS) Then, every x records (x can be parameterized), I need to write them to Cloud Storage.
Please also consider the trivial case x=1.
(Is x=1 a good idea? I am afraid there will be too many Cloud Storage writes.)
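As a rough illustration of the "every x records" step above, here is a minimal batching sketch in plain Java. It is not Beam code: RecordBatcher and its sink callback are hypothetical names, and in a real pipeline the buffering would have to live inside a transform that the runner manages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper; buffers records and hands them to a sink in groups of batchSize.
final class RecordBatcher {
    private final int batchSize;
    private final Consumer<List<String>> sink;
    private final List<String> buffer = new ArrayList<>();

    RecordBatcher(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Adds one record and writes a batch as soon as batchSize records are buffered.
    void add(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Writes whatever is buffered (for example at shutdown); does nothing if the buffer is empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // hand over a copy so the buffer can be reused
        buffer.clear();
    }

    public static void main(String[] args) {
        RecordBatcher batcher = new RecordBatcher(2, batch -> System.out.println("write " + batch));
        batcher.add("a");
        batcher.add("b"); // prints "write [a, b]"
        batcher.add("c");
        batcher.flush();  // prints "write [c]"
    }
}
```

With batchSize set to 1 every record triggers its own write, which is the case the question worries about. Beam's GroupIntoBatches transform may be worth a look for doing this inside a pipeline.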
Even though I am a Python guy, I am already having difficulty doing this in Python, and even more so now that I need to write it in Java. I am getting a headache reading Beam's Java examples; they are too verbose and difficult to follow. All I understand is that each step is an .apply on the PCollection.
So far, here is the result of my puny effort:
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("IngestFromPubsub", PubsubIO.readStrings().fromTopic(options.getTopic()))
// I don't really understand the next part, I just copied from official documentation and filled in some values
.apply(Window.<String>into(FixedWindows.of(Duration.millis(5000)))
.withAllowedLateness(Duration.millis(5000))
.triggering(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.millis(1000)))
.discardingFiredPanes()
)
.apply("EnrichDataFromAPI", ParDo.of(
new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
c.element();
// help on this part, I heard I need to use Jackson but I don't know, for API HttpClient is sufficient
// ... deserialize, call API, serialize again ...
c.output(enrichedJSONString);
}
}
))
.apply("WriteToGCS",
TextIO.write().withWindowedWrites().withNumShards(1).to(options.getOutput()))
;
PipelineResult result = pipeline.run();
}
Please fill in the missing parts, and also give me a tip on Windowing (e.g. what's the appropriate configuration etc.) and in which steps should I insert/apply it.
I don't think you need any windowing in your IngestFromPubsub and EnrichDataFromAPI steps. The purpose of windowing is to group records that are close together in time into windows so that you can compute aggregates over them. But since you are not computing any aggregates and want to deal with each record independently, you don't need windows.
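To make "grouping records that are close together in time" concrete, here is a small sketch of fixed-window assignment in plain Java. It is an illustration of the idea only, not Beam's implementation; the class and method names are invented, and timestamps are plain epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of fixed (tumbling) windows; not Beam API.
final class FixedWindowSketch {

    // Start (inclusive) of the fixed window that contains the given timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // Groups timestamps by the window they fall into, keyed by window start.
    static TreeMap<Long, List<Long>> groupByWindow(List<Long> timestamps, long windowSizeMillis) {
        TreeMap<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            windows.computeIfAbsent(windowStart(ts, windowSizeMillis), k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Long> timestamps = new ArrayList<>();
        timestamps.add(100L);
        timestamps.add(4999L);
        timestamps.add(5000L);
        timestamps.add(7200L);
        timestamps.add(12000L);
        // 5-second windows, as in the question's FixedWindows.of(Duration.millis(5000))
        System.out.println(groupByWindow(timestamps, 5000L));
        // prints "{0=[100, 4999], 5000=[5000, 7200], 10000=[12000]}"
    }
}
```

An aggregate such as a count or a sum would then be computed per map entry, which is the situation windowing is designed for.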
Since you are always converting one input record to one output record, your EnrichDataFromAPI should be a MapElements. This should make the code easier.
There are resources out there for processing JSON in Apache Beam Java: Apache Beam stream processing of json data
You don't necessarily need to use Jackson to map the JSON to a Java object. You might be able to manipulate the JSON directly. You can use Java's native JSON API to parse/manipulate/serialize.
I have used the Apache Thrift protocol for tablet-server and inter-language integration, and it has worked fine for a few years.
The integration is between languages (C#/C++/PC Java/Dalvik Java), and Thrift is probably one of the simplest and safest options. So I want to pack and repack sophisticated data structures (which have changed over the years) with the Thrift library; in Thrift terms, say, a kind of OfflineTransport or OfflineProtocol.
Scenario:
I want to build a backup solution, for example to process data in offline mode during an internet provider failure: serialize, store, and try to process it in a few ways, for example by sending the serialized data by normal email over a poor backup connection.
The question is: where in the Thrift philosophy is the best extension point for this?
I understand that only part of the online protocol can be backed up offline, i.e. a real-time return value is not possible; that is OK.
Look at the serializers. There are various implementations, but they all share the same concept: a buffer or a file/stream is used as the transport medium.
Writing data in C#
E.g. we plan to store the bits into a byte[] buffer. So one could write:
var trans = new TMemoryBuffer();
var prot = new TCompactProtocol( trans);
var instance = GetMeSomeDataInstanceToSerialize();
instance.Write(prot);
Now we can get a hold of the data:
var data = trans.GetBuffer();
Reading data in C#
Reading works similarly, except that you need to know from somewhere what root instance to construct:
var trans = new TMemoryBuffer( serializedBytes);
var prot = new TCompactProtocol( trans);
var instance = new MyCoolClass();
instance.Read(prot);
Additional Tweaks
One solution to the chicken-egg problem during load could be to use a union as an extra serialization container:
union GenericFileDataContainer {
1 : MyCoolClass coolclass;
2 : FooBar foobar
// more to come later
}
By always using this container as the root instance during serialization it is easy to add more classes w/o breaking compatibility and there is no need to know up front what exactly is in a file - you just read it and check what element is set in the union.
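The idea behind the container can be shown without Thrift. The sketch below is plain Java and only a stand-in: a Thrift union identifies the stored member by its field ID, whereas this example uses a hand-written one-byte tag, and all names in it are hypothetical. The point is that the reader inspects the tag first and only then decides how to interpret the rest.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a tagged container; this is not the Thrift wire format.
final class TaggedContainerSketch {
    static final byte TAG_COOLCLASS = 1;
    static final byte TAG_FOOBAR = 2;

    // Prefixes the payload with a one-byte tag that says what kind of data follows.
    static byte[] wrap(byte tag, String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 1];
        out[0] = tag;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    // Reads the tag first, then interprets the payload accordingly.
    static String describe(byte[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("empty container");
        }
        String body = new String(data, 1, data.length - 1, StandardCharsets.UTF_8);
        switch (data[0]) {
            case TAG_COOLCLASS:
                return "coolclass: " + body;
            case TAG_FOOBAR:
                return "foobar: " + body;
            default:
                // A tag added by a newer writer; an old reader can detect and skip it.
                return "unknown tag " + data[0];
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(wrap(TAG_FOOBAR, "x")));  // prints "foobar: x"
        System.out.println(describe(wrap((byte) 9, "x")));    // prints "unknown tag 9"
    }
}
```

New tags can be added later without changing how existing files are read, which mirrors adding new members to the union.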
There is an RPC framework named "Thrifty" that uses the standard Thrift protocol. It has the same effect as defining the service with Thrift IDL; that is, Thrifty is compatible with code that uses Thrift IDL, which is very helpful for cross-platform work. It also has a ThriftSerializer class:
[ThriftStruct]
public class LogEntry
{
[ThriftConstructor]
public LogEntry([ThriftField(1)]String category, [ThriftField(2)]String message)
{
this.Category = category;
this.Message = message;
}
[ThriftField(1)]
public String Category { get; }
[ThriftField(2)]
public String Message { get; }
}
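ThriftSerializer serializer = new ThriftSerializer(ThriftSerializer.SerializeProtocol.Binary);
// "entry" is assumed to be an existing LogEntry instance; the original snippet omitted the argument.
byte[] bytes = serializer.Serialize<LogEntry>(entry);
LogEntry copy = serializer.Deserialize<LogEntry>(bytes);
You can try it: https://github.com/endink/Thrifty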
I have a scheduler that collects our cluster metrics and writes the data to an HDFS file using an older version of the Cloudera API. But we recently updated our JARs, and the original code now fails with an exception:
java.lang.ClassCastException: org.apache.hadoop.io.ArrayWritable cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:324)
I need help using the ParquetHiveRecord class to write the data (which are POJOs) in Parquet format.
Code sample below:
Writable[] values = new Writable[20];
... // populate values with all values
ArrayWritable value = new ArrayWritable(Writable.class, values);
writer.write(value); // <-- Getting exception here
Details of "writer" (of type ParquetWriter):
MessageType schema = MessageTypeParser.parseMessageType(SCHEMA); // SCHEMA is a string with our schema definition
ParquetWriter<ArrayWritable> writer = new ParquetWriter<ArrayWritable>(fileName, new DataWritableWriteSupport() {
@Override
public WriteContext init(Configuration conf) {
if (conf.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
conf.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
return super.init(conf); // the snippet as posted had no return statement
}
});
Also, we were using CDH and CM 5.5.1 before; we are now using 5.8.3.
Thanks!
I think you need to use DataWritableWriter rather than ParquetWriter. The class cast exception indicates the write support class is expecting an instance of ParquetHiveRecord instead of ArrayWritable. DataWritableWriter likely breaks down the individual records in ArrayWritable to individual messages in the form of ParquetHiveRecord and sends each to the write support.
Parquet is sort of mind bending at times. :)
Looking at the code of the DataWritableWriteSupport class:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
You can see that it uses DataWritableWriter internally, hence you do not need to create an instance of DataWritableWriter yourself. The idea of write support is that it lets you write different formats to Parquet.
What you do need is to wrap your writables in ParquetHiveRecord.
I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples I wish to do something more than a word count on a text file/stream/Kafka queue which is the only thing we're allowed to understand from the docs.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process; get the stream of messages from Kafka, reduce by key to group messages by message key then to process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging, I see that inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results because that myThingsList is a different instance of the List.
I'd like a solution to this problem but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation but also section/paragraph or preferably still, a link to 'JavaDoc' that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it to the executor that processes the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
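The behaviour can be reproduced in plain Java, without Spark. This sketch uses standard Java serialization of a lambda as a stand-in for what happens when a closure is shipped to an executor; the class and interface names are made up for the example. The deserialized lambda carries its own copy of the captured list, so the list declared on the "driver" side never receives anything.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical demo class; not Spark API.
final class ClosureCopyDemo {

    // A function type whose lambdas can be serialized.
    interface SerializableFunction<A, B> extends Function<A, B>, Serializable {}

    // Serialize and deserialize a value: a stand-in for sending a closure to an executor.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {size of the driver-side list, size of the list seen by the shipped copy}.
    static int[] run() {
        List<String> driverList = new ArrayList<>();
        SerializableFunction<String, Integer> addAndCount = item -> {
            driverList.add(item);      // captured variable
            return driverList.size();
        };
        SerializableFunction<String, Integer> executorCopy = roundTrip(addAndCount);
        executorCopy.apply("a");
        int copySize = executorCopy.apply("b");
        return new int[] {driverList.size(), copySize};
    }

    public static void main(String[] args) {
        int[] sizes = run();
        System.out.println("driver=" + sizes[0] + ", executor copy=" + sizes[1]); // prints "driver=0, executor copy=2"
    }
}
```

Results therefore need to come back as the output of a transformation or action, not through a collection captured from the enclosing scope.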
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? It means you're effectively dropping parts of the data. Perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
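The difference between "keep the first value per key" and "keep all values per key" can be seen on an ordinary in-memory list. This is a sketch with Java streams rather than Spark, using made-up class and method names: keepFirst mirrors the reduceByKey((a, b) -> a) from the question, keepAll mirrors what a combineByKey that builds lists would produce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class; uses java.util.stream, not Spark.
final class KeepAllPerKeyDemo {

    // Equivalent in spirit to reduceByKey((a, b) -> a): later values for a key are dropped.
    static Map<String, String> keepFirst(List<Map.Entry<String, String>> records) {
        return records.stream().collect(Collectors.toMap(
            Map.Entry::getKey, Map.Entry::getValue, (first, second) -> first, LinkedHashMap::new));
    }

    // Equivalent in spirit to combining values into a list per key: nothing is dropped.
    static Map<String, List<String>> keepAll(List<Map.Entry<String, String>> records) {
        return records.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey, LinkedHashMap::new,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    static List<Map.Entry<String, String>> sample() {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("k1", "a"));
        records.add(new SimpleEntry<>("k1", "b"));
        records.add(new SimpleEntry<>("k2", "c"));
        return records;
    }

    public static void main(String[] args) {
        System.out.println(keepFirst(sample())); // prints "{k1=a, k2=c}"
        System.out.println(keepAll(sample()));   // prints "{k1=[a, b], k2=[c]}"
    }
}
```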
Regarding parsing of your protobuf, looks to me like you don't want foreachRDD, you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)