Spark Streaming provides the ability to create a custom receiver, as detailed here. To store the data received by the receiver into Spark, the store(data) method needs to be used.
The data I am storing in Spark has certain properties associated with it. The Spark Receiver class, which the custom receiver extends, provides several store methods of the form store(data, metadata), which imply that metadata/properties can be stored along with the data. The code extract below shows how I used this method to store the data and its metadata/properties.
public class CustomReceiver extends Receiver<String> {

    public CustomReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        new Thread() {
            @Override
            public void run() {
                try {
                    receive();
                } catch (IOException e) {
                    restart("Error connecting: ", e);
                }
            }
        }.start();
    }

    @Override
    public void onStop() {
        // Not needed as receive() method closes resources when stopped
    }

    private void receive() throws IOException {
        String str = getData();
        Map<String, String> metadata = getMetadata();
        Iterator<String> it = Arrays.asList(str.split("\n\r")).iterator();
        store(it, metadata);
        if (isStopped()) {
            closeConnections();
        }
    }
}
This stored data is accessed, from another class, as shown in the following code extract:
private void testCustomReceiver() {
    JavaDStream<String> custom = ssc.receiverStream(new CustomReceiver());
    JavaDStream<String> processedInput = custom.flatMap(row -> {
        return Arrays.asList(row.split("\\r?\\n"));
    });
    processedInput.print();
}
Which now brings us to my question: How can the metadata/properties stored with the data in the custom receiver be accessed from the testCustomReceiver() method shown above?
I have tried searching through the documentation and exploring the JavaDStream object in the debugger to search for the metadata, but to no avail. Any help or advice on this matter would be greatly appreciated, thank you.
I've been digging around in the Spark code, and I've come to the belief that you can't ever access it again. In fact, I do not believe it is ever used.
The supervisor for your Receiver takes the metadataOption and drops it into a ReceivedBlockInfo (which is private to org.apache.spark.streaming). From there, it goes... nowhere. I can find no reference to ReceivedBlockInfo.metadataOption in the streaming codebase. It's hypothetically possible that ReceivedBlockInfo is serialized then deserialized into a different class, or some funky reflection retrieves the metadata, but both of those are such antipatterns that I wouldn't count on it happening.
Why is it there? I believe the intention was for it to be part of the Metadata Checkpointing system, but it either was never hooked up, or the connection between Receiver metadata and stream checkpointing was severed.
Either way, block metadata is gone once the block is dropped into the stream.
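If the properties really do need to reach the DStream, one workaround (a sketch only; this is not something the store(data, metadata) API does for you) is to fold the metadata into the stored records themselves, so it travels with the data and can be split back out in the flatMap:

    // Hedged workaround sketch: serialize the metadata into each stored record.
    // The "key=value;..." header and the '|' delimiter are illustrative choices.
    private void receive() throws IOException {
        String str = getData();
        Map<String, String> metadata = getMetadata();
        String header = metadata.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
        List<String> records = new ArrayList<>();
        for (String line : str.split("\\r?\\n")) {
            records.add(header + "|" + line); // downstream code splits on '|'
        }
        store(records.iterator());
        if (isStopped()) {
            closeConnections();
        }
    }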
I'm trying to implement (I'm just starting to work with Java and Flink) non-keyed state in a KafkaConsumer object, since at this stage no keyBy() is called. This object is the front end and the first module that handles messages from Kafka.
SourceOutput is a proto file representing the message.
I have the KafkaConsumer object :
public class KafkaSourceFunction extends ProcessFunction<byte[], SourceOutput> implements Serializable {

    @Override
    public void processElement(byte[] bytes, ProcessFunction<byte[], SourceOutput>.Context context,
                               Collector<SourceOutput> collector) throws Exception {
        // Here, I want to call the sorting method
        collector.collect(output);
    }
}
I have an object (KafkaSourceSort) that does all the sorting; it should keep the unordered messages in a priority queue in the state and is also responsible for delivering a message through the collector once it arrives in the right order.
class SessionInfo {

    public PriorityQueue<SourceOutput> orderedMessages = null;

    public void putMessage(SourceOutput Msg) {
        if (orderedMessages == null) {
            orderedMessages = new PriorityQueue<SourceOutput>(new SequenceComparator());
        }
        orderedMessages.add(Msg);
    }
}

public class KafkaSourceState implements Serializable {
    public TreeMap<String, SessionInfo> Sessions = new TreeMap<>();
}
I read that I need to use non-keyed state (ListState), which should contain a map of sessions, where each session holds a priority queue of all messages related to that session.
I found an example, so I implemented this:
public class KafkaSourceSort implements SinkFunction<KafkaSourceState>, CheckpointedFunction {

    private transient ListState<KafkaSourceState> checkpointedState;
    private KafkaSourceState state;

    @Override
    public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception {
        checkpointedState.clear();
        checkpointedState.add(state);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<KafkaSourceState> descriptor =
                new ListStateDescriptor<KafkaSourceState>(
                        "KafkaSourceState",
                        TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored()) {
            state = (KafkaSourceState) checkpointedState.get();
        }
    }

    @Override
    public void invoke(KafkaSourceState value, SinkFunction.Context context) throws Exception {
        state = value;
        // ...
    }
}
I see that I need to implement an invoke() method, which presumably would be called from processElement(), but the signature of invoke() doesn't contain the collector, and I don't understand how to do this or even whether what I have done so far is OK.
Any help will be appreciated.
Thanks.
A SinkFunction is a terminal node in the DAG that is your job graph. It doesn't have a Collector in its interface because it cannot emit anything downstream. It is expected to connect to an external service or data store and send data there.
If you share more about what you are trying to accomplish perhaps we can offer more assistance. There may be an easier way to go about this.
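As one possible direction, here is a sketch (only a sketch, assuming SourceOutput is the protobuf-generated class and sortAndRelease() is a hypothetical helper wrapping the KafkaSourceSort logic) that keeps the non-keyed sorting state inside the existing ProcessFunction, which has both a Collector and, via CheckpointedFunction, access to operator state:

public class KafkaSourceFunction
        extends ProcessFunction<byte[], SourceOutput>
        implements CheckpointedFunction {

    private transient ListState<KafkaSourceState> checkpointedState;
    private KafkaSourceState state = new KafkaSourceState();

    @Override
    public void processElement(byte[] bytes,
                               ProcessFunction<byte[], SourceOutput>.Context context,
                               Collector<SourceOutput> collector) throws Exception {
        SourceOutput message = SourceOutput.parseFrom(bytes); // assumes a protobuf-generated parseFrom
        // sortAndRelease is a hypothetical helper that buffers the message in `state`
        // and returns whatever can now be emitted in order
        for (SourceOutput ready : sortAndRelease(state, message)) {
            collector.collect(ready);
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedState.clear();
        checkpointedState.add(state);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        ListStateDescriptor<KafkaSourceState> descriptor =
                new ListStateDescriptor<>("KafkaSourceState",
                        TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
        checkpointedState = ctx.getOperatorStateStore().getListState(descriptor);
        if (ctx.isRestored()) {
            for (KafkaSourceState restored : checkpointedState.get()) {
                state = restored; // the list holds a single entry in this sketch
            }
        }
    }

    private List<SourceOutput> sortAndRelease(KafkaSourceState state, SourceOutput message) {
        // the buffering/ordering logic from KafkaSourceSort would live here
        return java.util.Collections.singletonList(message);
    }
}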
I'm trying to implement the following logic with help of Kafka Streams:
Listen to reference data from a topic, e.g. ref-data-topic, and create a global StateStore from it.
Listen to messages from another topic, data-topic, which must be validated against the reference data and sent to either the success or the errors topic.
Here is example pseudocode:
class SomeProcessor implements Processor<String, String> {

    private KeyValueStore<String, String> refDataStore;

    @Override
    public void init(final ProcessorContext context) {
        refDataStore = (KeyValueStore) context.getStateStore("ref-data-store");
    }

    @Override
    public void process(String key, String value) {
        Object refData = refDataStore.get("some_key");
        // business logic here
        if (ok) {
            sendValueToTopic("success");
        } else {
            sendValueToTopic("errors");
        }
    }
}
Or, what would be the canonical way to achieve the desired behavior?
An alternative that I have in mind is to enrich the data within the Processor with validation info and then send everything to a single topic, making the client deal with e.g. a validationStatus field in the received message.
However, I would really like a solution with two topics because, in that case, I could use Kafka Connect to link the success topic directly with some datastore and deal with the error topic differently. With only one topic, I have no idea how to achieve this "store_only_successfully_validated_entities" use case.
Any ideas and suggestions?
If you use the Processor API, you can forward data to different downstream nodes by name:
class SomeProcessor implements Processor<String, String> {

    private KeyValueStore<String, String> refDataStore;
    private ProcessorContext processorContext;

    @Override
    public void init(final ProcessorContext context) {
        refDataStore = (KeyValueStore) context.getStateStore("ref-data-store");
        processorContext = context;
    }

    @Override
    public void process(String key, String value) {
        Object refData = refDataStore.get("some_key");
        // business logic here
        if (ok) {
            processorContext.forward(key, value, To.child("success"));
        } else {
            processorContext.forward(key, value, To.child("error"));
        }
    }
}
When you wire up your topology, you add two sink nodes, named "success" and "error", that write to the success and error topics respectively (see the sketch below).
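A rough sketch of that wiring with the Topology API (node and topic names are illustrative, and the global store backing "ref-data-store" is omitted):

Topology topology = new Topology();
topology.addSource("source", "data-topic");
// the global store fed from ref-data-topic would be added here (omitted)
topology.addProcessor("validator", SomeProcessor::new, "source");
// the child names used in forward(..., To.child(...)) must match these sink names
topology.addSink("success", "success-topic", "validator");
topology.addSink("error", "errors-topic", "validator");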
Or you can forward data to a single sink node and add the sink with a TopicNameExtractor instead of a hard-coded topic name (requires version 2.0).
If you use the DSL, you can use KStream#branch() to split a stream and pipe different data to different topics via KStream#to(...), or you can use dynamic routing via KStream#to(TopicNameExtractor), which requires version 2.0.
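A minimal DSL sketch of the branch-and-to variant (isValid() stands in for the actual reference-data check):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("data-topic");

// the first matching predicate wins, so the catch-all goes last
KStream<String, String>[] branches = input.branch(
        (key, value) -> isValid(value),   // placeholder for the validation logic
        (key, value) -> true);

branches[0].to("success");
branches[1].to("errors");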
I was recently asked in a coding interview to write a simple Java console app that does some file I/O and displays the data. I was going to go to town with a DAO, but since I never manipulate the data past a read, the entire idea of a DAO seems like overkill.
Does anyone know a clean way to ensure separation of concerns without the weight of full CRUD when you don't need it?
Looks like the standard MVC pattern. Your console is the view, the code that reads the file is the controller, and the code that holds a file line or the whole file content is your model.
You can further simplify it to just View and Model, where the model encapsulates both reading the file and wrapping its content in a Java class.
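A minimal sketch of that split, assuming the file is plain text and using java.nio.file (class names are illustrative):

// Model: holds the file content.
class FileContent {
    final List<String> lines;
    FileContent(List<String> lines) { this.lines = lines; }
}

// Controller: reads the file and builds the model.
class FileController {
    FileContent load(Path path) throws IOException {
        return new FileContent(Files.readAllLines(path));
    }
}

// View: writes the model to the console.
class ConsoleView {
    void render(FileContent content) {
        content.lines.forEach(System.out::println);
    }
}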
How about Martin Fowler's Table Data Gateway pattern, explained here? Just include the find (read) methods and omit create, insert, and update.
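A read-only sketch of that idea, keeping just the find side (the class name, line-based records, and use of java.nio.file are illustrative):

// Table Data Gateway reduced to its read methods; no create/insert/update.
public class RecordGateway {

    private final Path file;

    public RecordGateway(Path file) {
        this.file = file;
    }

    public List<String> findAll() throws IOException {
        return Files.readAllLines(file);
    }

    public Optional<String> findByLineNumber(int lineNumber) throws IOException {
        List<String> lines = Files.readAllLines(file);
        return lineNumber < lines.size() ? Optional.of(lines.get(lineNumber)) : Optional.empty();
    }
}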
You can simply refer to the Command/Query pattern, where commands perform the create, update, and delete operations and queries exist for read-only purposes.
Hence you implement what you need and leave out the others.
This question was asked in an interview, so there was not much time for a detailed design. As a minimal answer to the above concerns, the following structure provides flexibility; details could be filled in as per the requirements.
public interface IODevice {
    String read();
    void write(String data);
}

class FileIO implements IODevice {

    @Override
    public String read() {
        return null;
    }

    @Override
    public void write(String data) {
        //...
    }
}

class ConsoleIO implements IODevice {

    @Override
    public String read() {
        return null;
    }

    @Override
    public void write(String data) {
        //...
    }
}

public class DataConverter {
    public static void main(String[] args) {
        FileIO fData1 = null;   // ... appropriately obtained instance
        FileIO fData2 = null;   // ... appropriately obtained instance
        ConsoleIO cData = null; // ... appropriately obtained instance
        cData.write(fData2.read());
        fData1.write(cData.read());
    }
}
The client class uses only the APIs of the devices. This keeps open the option of extending the interface to implement new device wrappers (e.g. XML, stream, etc.).
OK, so I'm trying to implement RxJava2 with Retrofit2. The goal is to make a call only once and broadcast the results to different classes. For example: I have a list of geofences in my backend. I need that list in my MapFragment to display them on the map, but I also need that data to set up the PendingIntent service for the actual trigger.
I tried following this answer, but I get all sorts of errors:
Single Observable with Multiple Subscribers
The current situation is as follows:
GeofenceRetrofitEndpoint:
public interface GeofenceEndpoint {

    @GET("geofences")
    Observable<List<Point>> getGeofenceAreas();
}
GeofenceDAO:
public class GeofenceDao {

    @Inject
    Retrofit retrofit;

    private final GeofenceEndpoint geofenceEndpoint;

    public GeofenceDao() {
        InjectHelper.getRootComponent().inject(this);
        geofenceEndpoint = retrofit.create(GeofenceEndpoint.class);
    }

    public Observable<List<Point>> loadGeofences() {
        return geofenceEndpoint.getGeofenceAreas()
                .subscribeOn(Schedulers.io())
                .observeOn(AndroidSchedulers.mainThread())
                .share();
    }
}
MapFragment / any other class where I need the results
private void getGeofences() {
    new GeofenceDao().loadGeofences().subscribe(this::handleGeoResponse, this::handleGeoError);
}

private void handleGeoResponse(List<Point> points) {
    // handle response
}

private void handleGeoError(Throwable error) {
    // handle error
}
What am I doing wrong? When I call new GeofenceDao().loadGeofences().subscribe(this::handleGeoResponse, this::handleGeoError); it makes a separate call each time. Thanks.
Each call to new GeofenceDao().loadGeofences() returns a different instance of the Observable. share() only applies to an instance, not to the method. If you want to actually share the observable, you have to subscribe to the same instance. You could share it via a (static) member, e.g. loadGeofences:
private void getGeofences() {
    if (loadGeofences == null) {
        loadGeofences = new GeofenceDao().loadGeofences();
    }
    loadGeofences.subscribe(this::handleGeoResponse, this::handleGeoError);
}
But be careful not to leak the Observable.
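One way to avoid that leak, sketched here under the assumption that this code lives in a Fragment such as the MapFragment, is to keep the Disposable returned by subscribe() and dispose of it when the view goes away:

private Disposable geofenceDisposable;

private void getGeofences() {
    if (loadGeofences == null) {
        loadGeofences = new GeofenceDao().loadGeofences();
    }
    geofenceDisposable = loadGeofences.subscribe(this::handleGeoResponse, this::handleGeoError);
}

@Override
public void onDestroyView() {
    super.onDestroyView();
    if (geofenceDisposable != null && !geofenceDisposable.isDisposed()) {
        geofenceDisposable.dispose(); // stop listening so the Fragment can be garbage collected
    }
}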
Maybe this doesn't answer your question directly; however, I'd like to suggest a slightly different approach:
Create a BehaviorSubject in your GeofenceDao and subscribe your Retrofit request to this subject. The subject acts as a bridge between your clients and the API (a minimal sketch follows the list below). By doing this you will achieve:
Response cache - handy for screen rotations
Replaying response for every interested observer
The subscription between the clients and the subject doesn't rely on the subscription between the subject and the API, so you can break one without breaking the other.
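A minimal sketch of that idea (the method names refreshGeofences() and geofences() are illustrative, and errors are deliberately kept out of the subject so a failed call does not terminate it):

public class GeofenceDao {

    @Inject
    Retrofit retrofit;

    private final GeofenceEndpoint geofenceEndpoint;
    private final BehaviorSubject<List<Point>> geofencesSubject = BehaviorSubject.create();

    public GeofenceDao() {
        InjectHelper.getRootComponent().inject(this);
        geofenceEndpoint = retrofit.create(GeofenceEndpoint.class);
    }

    // trigger the network call once; the subject caches the latest result
    public void refreshGeofences() {
        geofenceEndpoint.getGeofenceAreas()
                .subscribeOn(Schedulers.io())
                .subscribe(geofencesSubject::onNext,
                        throwable -> { /* surface the error separately, e.g. log it */ });
    }

    // every subscriber immediately receives the last cached list, if any
    public Observable<List<Point>> geofences() {
        return geofencesSubject.observeOn(AndroidSchedulers.mainThread());
    }
}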
I'm using Flink to read data from Kafka and convert it to protobuf. The problem I'm facing is that when I run the Java application I get the error below. If I modify the unknownFields variable name to something else it works, but it's hard to make this change on all protobuf classes.
I also tried to deserialize directly when reading from Kafka, but I'm not sure what TypeInformation should be returned from the getProducedType() method.
public static class ProtoDeserializer implements DeserializationSchema {

    @Override
    public TypeInformation getProducedType() {
        // TODO Auto-generated method stub
        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
    }
Appreciate all the help. Thanks.
java.lang.RuntimeException: The field protected com.google.protobuf.UnknownFieldSet com.google.protobuf.GeneratedMessage.unknownFields is already contained in the hierarchy of the class com.google.protobuf.GeneratedMessage.Please use unique field names through your classes hierarchy
at org.apache.flink.api.java.typeutils.TypeExtractor.getAllDeclaredFields(TypeExtractor.java:1594)
at org.apache.flink.api.java.typeutils.TypeExtractor.analyzePojo(TypeExtractor.java:1515)
at org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1412)
at org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1319)
at org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfoWithTypeHierarchy(TypeExtractor.java:609)
at org.apache.flink.api.java.typeutils.TypeExtractor.privateCreateTypeInfo(TypeExtractor.java:437)
at org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:306)
at org.apache.flink.api.java.typeutils.TypeExtractor.getFlatMapReturnTypes(TypeExtractor.java:133)
at org.apache.flink.streaming.api.datastream.DataStream.flatMap(DataStream.java:529)
Code:
FlinkKafkaConsumer09<byte[]> kafkaConsumer = new FlinkKafkaConsumer09<>("testArr", new ByteDes(), p);
DataStream<byte[]> input = env.addSource(kafkaConsumer);
DataStream<PBAddress> protoData = input.map(new RichMapFunction<byte[], PBAddress>() {
    @Override
    public PBAddress map(byte[] value) throws Exception {
        PBAddress addr = PBAddress.parseFrom(value);
        return addr;
    }
});
Maybe you should try the following:
env.getConfig().registerTypeWithKryoSerializer(PBAddress.class, ProtobufSerializer.class);
or
env.getConfig().registerTypeWithKryoSerializer(PBAddress.class, PBAddressSerializer.class);
public class PBAddressSerializer extends Serializer<Message> {

    private final Map<Class, Method> hashMap = new HashMap<Class, Method>();

    protected Method getParse(Class cls) throws NoSuchMethodException {
        Method method = hashMap.get(cls);
        if (method == null) {
            method = cls.getMethod("parseFrom", new Class[]{byte[].class});
            hashMap.put(cls, method);
        }
        return method;
    }

    @Override
    public void write(Kryo kryo, Output output, Message message) {
        byte[] ser = message.toByteArray();
        output.writeInt(ser.length, true);
        output.writeBytes(ser);
    }

    @Override
    public Message read(Kryo kryo, Input input, Class<Message> pbClass) {
        try {
            int size = input.readInt(true);
            byte[] barr = new byte[size];
            input.read(barr);
            return (Message) getParse(pbClass).invoke(null, barr);
        } catch (Exception e) {
            throw new RuntimeException("Could not create " + pbClass, e);
        }
    }
}
try this:
public class ProtoDeserializer implements DeserializationSchema<PBAddress> {

    @Override
    public TypeInformation<PBAddress> getProducedType() {
        return TypeInformation.of(PBAddress.class);
    }
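    // Hedged sketch of the remaining DeserializationSchema methods, assuming
    // PBAddress is the protobuf-generated class used in the question.
    @Override
    public PBAddress deserialize(byte[] message) throws IOException {
        return PBAddress.parseFrom(message);
    }

    @Override
    public boolean isEndOfStream(PBAddress nextElement) {
        return false; // the Kafka topic is unbounded
    }
}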
https://issues.apache.org/jira/browse/FLINK-11333 is the JIRA ticket tracking the issue of implementing first-class support for Protobuf types with evolvable schema. You'll see that there was a pull request quite some time ago, which hasn't been merged. I believe the problem was that there is no support there for handling state migration in cases where Protobuf was previously being used by registering it with Kryo.
Meanwhile, the Stateful Functions project (statefun is a new API that runs on top of Flink) is based entirely on Protobuf, and it includes support for using Protobuf with Flink: https://github.com/apache/flink-statefun/tree/master/statefun-flink/statefun-flink-common/src/main/java/org/apache/flink/statefun/flink/common/protobuf. (The entry point to that package is ProtobufTypeInformation.java.) I suggest exploring this package (which includes nothing statefun specific); however, it doesn't concern itself with migrations from Kryo either.