I'm having trouble setting variables in Apache Flink. I would like to use one stream to fetch data with which I will initialize the variables I need for a second stream. The problem is that the streams execute in parallel, so the value is missing when the second stream is initialized. Sample code:
KafkaSource<Object> mainSource1 = KafkaSource.<Object>builder()
.setBootstrapServers(...)
.setTopicPattern(Pattern.compile(...))
.setGroupId(...)
.setStartingOffsets(OffsetsInitializer.earliest())
.setDeserializer(new ObjectDeserializer())
.build();
DataStream<Market> mainStream1 = env.fromSource(mainSource1, WatermarkStrategy.forMonotonousTimestamps(), "mainSource1");
// fetching data from the stream and setting variables
Map<TopicPartition, Long> endOffset = new HashMap<>();
endOffset.put(new TopicPartition("topicName", 0), offsetFromMainStream1);
KafkaSource<Object> mainSource2 = KafkaSource.<Object>builder()
.setBootstrapServers(...)
.setTopicPattern(Pattern.compile(...))
.setGroupId(...)
.setStartingOffsets(OffsetsInitializer.earliest())
.setBounded(OffsetsInitializer.offsets(endOffset))
.setDeserializer(new ObjectDeserializer())
.build();
DataStream<Market> mainStream2 = env.fromSource(mainSource2, WatermarkStrategy.forMonotonousTimestamps(), "mainSource2");
// further stream operations
I would like to run the first stream, fetch the data from it, and store it locally; then I can use it in operations on the second stream.
You want to use one stream's data to control another stream's behavior. The best way to do this is the broadcast state pattern.
This involves creating a BroadcastStream from mainStream1 and then connecting mainStream2 to it. The process function on the connected stream can then access the data from mainStream1.
Here is a high-level example based on your code. I am assuming that the key is a String.
// Broadcast Stream
MapStateDescriptor<String, Market> stateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<Market>() {}));
// broadcast the rules and create the broadcast state
BroadcastStream<Market> mainStream1BroadcastStream = mainStream1
        .broadcast(stateDescriptor);
DataStream<Market> yourOutput = mainStream2
        .keyBy(/* key by id */)
        .connect(mainStream1BroadcastStream)
        .process(
            new KeyedBroadcastProcessFunction<String, Market, Market, Market>() {
                // You can access mainStream1 output and mainStream2 data here.
            }
        );
This concept is explained in detail in the Flink documentation; the code above is a modified version of the example shown there:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/broadcast_state/#the-broadcast-state-pattern
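For completeness, here is a minimal sketch of what the body of that KeyedBroadcastProcessFunction could look like, assuming mainStream2 is keyed by a String id and that Market exposes a getId() getter (that getter is an assumption, not something from your code):
new KeyedBroadcastProcessFunction<String, Market, Market, Market>() {

    @Override
    public void processBroadcastElement(Market value, Context ctx, Collector<Market> out) throws Exception {
        // Called for every mainStream1 element: store it in the broadcast state,
        // so every parallel instance sees the same data.
        ctx.getBroadcastState(stateDescriptor).put(value.getId(), value);
    }

    @Override
    public void processElement(Market value, ReadOnlyContext ctx, Collector<Market> out) throws Exception {
        // Called for every mainStream2 element: look up the data broadcast from mainStream1.
        Market fromStream1 = ctx.getBroadcastState(stateDescriptor).get(value.getId());
        if (fromStream1 != null) {
            // combine the two records however you need and emit the result
            out.collect(value);
        }
        // else: the broadcast data has not arrived yet; you could buffer the element in keyed state
    }
}
Keep in mind that the two streams are not synchronized, so processElement() can fire before the corresponding broadcast data has arrived.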
I want to use an RTree like https://github.com/davidmoten/rtree2 as a Spark broadcast variable.
However, that is not supported: it fails with java.io.NotSerializableException: com.github.davidmoten.rtree2.RTree.
Is there a workaround? https://github.com/davidmoten/rtree2 suggests:
// deserialize the entries from disk (for example)
List<Entry<Thing, Point>> entries = ...
// bulk load
RTree<Thing, Point> tree = RTree.maxChildren(28).star().create(entries);
But I do not know how to fit this into the context of a broadcast variable. I.e., I certainly could broadcast the list of entries, but I do not know where to hook in the initialization of the RTree on all the executors when using a UDF.
Certainly it should be possible via mapPartitions, but I would by far prefer the UDF approach.
import com.github.davidmoten.grumpy.core.Position
import com.github.davidmoten.rtree2.geometry.{Geometries, Point}
import com.github.davidmoten.rtree2.{Entry, Iterables, RTree}
val sydney = Geometries.point(151.2094, -33.86)
val canberra = Geometries.point(149.1244, -35.3075)
val brisbane = Geometries.point(153.0278, -27.4679)
val bungendore = Geometries.point(149.4500, -35.2500)
var tree = RTree.star.create[String, Point]
tree = tree.add("Sydney", sydney)
tree = tree.add("Brisbane", brisbane)
val broadcastVar = spark.sparkContext.broadcast(tree)
This fails with the aforementioned exception.
By the way, this also applies to:
https://github.com/davidmoten/rtree
RTree tree = ...;
OutputStream os = ...;
Serializer serializer =
Serializers.flatBuffers().utf8();
serializer.write(tree, os);
Apparently this should work, but at least with Spark it fails with the same exception.
Edit 2
Workaround:
Use something like https://github.com/plokhotnyuk/rtree2d, which properly supports serialization. Nonetheless, it would be interesting to know how to retrofit this onto the first example.
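For the retrofit, one approach that should work is to broadcast only the raw, serializable data and rebuild the (non-serializable) RTree lazily, once per executor JVM. A minimal Java sketch of that idea, where the City carrier class and its field names are just illustrative assumptions:
import java.io.Serializable;
import java.util.List;

import org.apache.spark.broadcast.Broadcast;

import com.github.davidmoten.rtree2.RTree;
import com.github.davidmoten.rtree2.geometry.Geometries;
import com.github.davidmoten.rtree2.geometry.Point;

// Hypothetical serializable carrier for the raw data behind each entry.
class City implements Serializable {
    final String name;
    final double lon;
    final double lat;
    City(String name, double lon, double lat) {
        this.name = name;
        this.lon = lon;
        this.lat = lat;
    }
}

// Rebuilds the RTree at most once per executor JVM from the broadcast raw data,
// so the tree itself never has to be serialized.
final class RTreeHolder {
    private static volatile RTree<String, Point> tree;

    static RTree<String, Point> get(Broadcast<List<City>> cities) {
        if (tree == null) {
            synchronized (RTreeHolder.class) {
                if (tree == null) {
                    RTree<String, Point> t = RTree.star().create();
                    for (City c : cities.value()) {
                        t = t.add(c.name, Geometries.point(c.lon, c.lat));
                    }
                    tree = t;
                }
            }
        }
        return tree;
    }
}

// Inside a UDF or map function you would then call:
//   RTree<String, Point> tree = RTreeHolder.get(broadcastCities);
The double-checked locking only ensures the tree is built once per JVM; if the entry list is large, bulk loading (as in the README snippet above) inside the holder would likely be faster than adding entries one by one.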
I am using Akka's FSM framework via its Java API to manage state transitions. Here is the relevant portion of the state machine:
when(QUEUED,
matchEvent(Exception.class, Service.class,
(exception, dservice) -> goTo(ERROR)
.replying(ERROR)));
// TODO: It seems to be missing from the docs that, to transition out of a state,
// every state must be listed
// a service is in an errored state
when(ERROR,
matchAnyEvent((state, data) -> stay().replying("Staying in Errored state")));
onTransition(matchState(QUEUED, ERROR, () -> {
// Update the Service object and save it to the database
}));
This works as expected and the correct state changes happen in the actor. In the onTransition() block, I want to update the Service object, which is the finite state machine data in this case, something like the following:
service.setProperty(someProperty);
dbActor.tell(saveService, getSelf());
Is this possible? Am I using this framework in the right way?
I think I was able to do something like the following
onTransition(matchState(QUEUED,ERROR, () -> {
nextStateData().setServiceStatus(ERROR);
// Get the actual exception message here to save to the database
databaseWriter.tell(nextStateData(), getSelf());
}));
How do I now actually test the data that's changed as a result of this transition?
The test looks like this:
@Test
public void testErrorState() {
new TestKit(system) {
{
TestProbe probe = new TestProbe(system);
final ActorRef underTest = system.actorOf(ServiceFSMActor.props(dbWriter));
underTest.tell(new Exception(), getRef());
expectMsgEquals(ERROR); // This works
// How do I make sure the data is updated here as part of the OnTransition declaration??
}
};
}
You've defined a probe in the test but you're not using it. Since the FSM actor sends the updated state to the database writer actor, you could test the updated state by replacing the database writer actor with the probe:
new TestKit(system) {{
final TestProbe probe = new TestProbe(system);
final ActorRef underTest = system.actorOf(ServiceFSMActor.props(probe.ref()));
underTest.tell(new Exception(), getRef());
expectMsgEquals(ERROR);
final Service state = probe.expectMsgClass(Service.class);
assertEquals(ERROR, state.getServiceStatus());
}};
I'm a newbie with Storm. I want to use one bolt named 'tileClean' to emit a single stream, and five other bolts to receive that stream at the same time.
like this:
flow image
As you can see, the "one, two, three, four, five" bolts should receive the same data at the same time. But in reality, those bolts do not receive any data.
Here is my code:
@Override
public void execute(TupleWindow inputWindow) {
logger.debug("clean");
List<Tuple> tuples = inputWindow.get();
//logger.debug("clean phrase. tuple size is : {}", tuples.size());
for (Tuple input : tuples) {
// some other code..
//this._collector.emit(input, new Values(nal));
this._collector.emit("stream_id_one", input, new Values(nal));
this._collector.emit("stream_id_two", input, new Values(nal));
this._collector.emit("stream_id_three", input, new Values(nal));
this._collector.emit("stream_id_four", input, new Values(nal));
this._collector.emit("stream_id_five", input, new Values(nal));
this._collector.ack(input);
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields(BoltConstant.EMIT_LOGOBJ));
declarer.declareStream("stream_id_one", new Fields(BoltConstant.EMIT_LOGOBJ));
declarer.declareStream("stream_id_two", new Fields(BoltConstant.EMIT_LOGOBJ));
declarer.declareStream("stream_id_three", new Fields(BoltConstant.EMIT_LOGOBJ));
declarer.declareStream("stream_id_four", new Fields(BoltConstant.EMIT_LOGOBJ));
declarer.declareStream("stream_id_five", new Fields(BoltConstant.EMIT_LOGOBJ));
}
And the topology setup is:
builder.setBolt("tileClean", cleanBolt, 1).shuffleGrouping("assembly");
builder.setBolt("OneBolt", OneBolt, 1).shuffleGrouping("tileClean", "stream_id_one");
builder.setBolt("TwoBolt", TwoBolt, 1).shuffleGrouping("tileClean", "stream_id_two");
builder.setBolt("ThreeBolt", ThreeBolt, 1).shuffleGrouping("tileClean", "stream_id_three");
builder.setBolt("FourBolt", FourBolt, 1).shuffleGrouping("tileClean", "stream_id_four");
builder.setBolt("FiveBolt", FiveBolt, 1).shuffleGrouping("tileClean", "stream_id_five");
tileClean can receive the tuples emitted from "assembly", but the other bolts receive nothing.
Is there anything incorrect in my code?
Since you have omitted the code between the for loop and the first collector.emit statement, one possibility is that the messages never get emitted because an error occurs in that omitted code. You can put a try-catch block around it, or log just before the collector.emit statements, to check whether your code actually reaches them.
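For example, the execute() method could be made defensive like this (just a sketch; nal is assumed to come from your omitted per-tuple code):
@Override
public void execute(TupleWindow inputWindow) {
    List<Tuple> tuples = inputWindow.get();
    logger.debug("clean phase, window size: {}", tuples.size());
    for (Tuple input : tuples) {
        try {
            // ... your omitted per-tuple code that produces `nal` ...
            logger.debug("emitting tuple to the five downstream streams");
            this._collector.emit("stream_id_one", input, new Values(nal));
            this._collector.emit("stream_id_two", input, new Values(nal));
            this._collector.emit("stream_id_three", input, new Values(nal));
            this._collector.emit("stream_id_four", input, new Values(nal));
            this._collector.emit("stream_id_five", input, new Values(nal));
            this._collector.ack(input);
        } catch (Exception e) {
            // log and fail the tuple so the problem shows up in the logs and Storm UI
            logger.error("failed to process tuple {}", input, e);
            this._collector.fail(input);
        }
    }
}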
Whether the tuples are actually flowing can also be checked in the Storm UI, which shows topology metrics for the messages transmitted between the spouts/bolts. It also reports any error messages that occurred during task execution.
The other possibility applies if you are using a multi-node cluster: if your tasks are spread across nodes (i.e., if you have assigned more than one worker in the topology config), ensure that the machines can communicate with each other over the network on the ports configured in their storm.yaml files.
I want to know how to process and manage a tabular data stream in Java.
Consider a table of records with the schema (name, age, zip-code, disease), whose records are to be read and processed tuple by tuple, over time, as a stream. I want to process these stream tuples and save them with the schema (age, zip-code, disease) (the name attribute is supposed to be deleted).
For example:
read Tuple 1 (han, 25, 12548, flue) at time t1
publish Tuple 1* (25, 12548, flue)
read Tuple 2 (alex, 27, 12544, cancer) at time t2
output Tuple 2* (27, 12544, cancer)
... and so on. Can anyone help me?
Here are some suggestions for a framework you can base your final application on.
First, make classes to represent your input and output records. We'll call them InRecord and OutRecord for the sake of discussion, but you can give them whatever names make sense for you. Give them private fields to hold the necessary data and public getter/setter methods to access the data.
Second, define an interface for an input supplier; let's call it InputSupplier for this discussion. It will need setup (open()) and tear-down (close()) methods to be called at the start and end of processing, and a getNext() method that returns the next available InRecord. You'll need to decide how it indicates end-of-input: either define that getNext() returns null when there are no more input records, or provide a hasNext() method that returns true or false to indicate whether another input record is available.
Third, define an interface for an output consumer (OutputConsumer). You'll want to have open() and close() methods, as well as an accept(OutRecord) method.
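Concretely, these types might look something like the following (a minimal sketch; the field types, e.g. String for the zip code, are just assumptions):
// Input record with the full schema (name, age, zip-code, disease).
public class InRecord {
    private final String name;
    private final int age;
    private final String zipCode;
    private final String disease;

    public InRecord(String name, int age, String zipCode, String disease) {
        this.name = name;
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }

    public String getName() { return name; }
    public int getAge() { return age; }
    public String getZipCode() { return zipCode; }
    public String getDisease() { return disease; }
}

// Output record with the reduced schema (age, zip-code, disease).
public class OutRecord {
    private final int age;
    private final String zipCode;
    private final String disease;

    public OutRecord(int age, String zipCode, String disease) {
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }

    public int getAge() { return age; }
    public String getZipCode() { return zipCode; }
    public String getDisease() { return disease; }
}

public interface InputSupplier {
    void open();
    void close();
    InRecord getNext(); // returns null when there are no more input records
}

public interface OutputConsumer {
    void open();
    void close();
    void accept(OutRecord record);
}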
With this infrastructure in place, you can write your processing method:
public static void process(InputSupplier in, OutputConsumer out){
in.open();
out.open();
InRecord inrec;
while ((inrec = in.getNext()) != null){
OutRecord outrec = new OutRecord(inrec.getAge(), inrec.getZipCode(), inrec.getDisease());
out.accept(outrec);
}
out.close();
in.close();
}
Finally, write some "dummy" I/O classes, one that implements InputSupplier and another that implements OutputConsumer. For test purposes, your input supplier can just return a few hand-created records and your output consumer can just print the output records it receives to the console.
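For instance, the dummy classes could be as simple as this (a sketch that reuses the sample records from the question):
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Dummy supplier that hands out a couple of hard-coded records.
public class TestInput implements InputSupplier {
    private final List<InRecord> records = Arrays.asList(
            new InRecord("han", 25, "12548", "flue"),
            new InRecord("alex", 27, "12544", "cancer"));
    private Iterator<InRecord> it;

    @Override public void open() { it = records.iterator(); }
    @Override public void close() { }
    @Override public InRecord getNext() { return it.hasNext() ? it.next() : null; }
}

// Dummy consumer that prints each output record to the console.
public class TestOutput implements OutputConsumer {
    @Override public void open() { }
    @Override public void close() { }
    @Override public void accept(OutRecord r) {
        System.out.println("(" + r.getAge() + ", " + r.getZipCode() + ", " + r.getDisease() + ")");
    }
}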
Then all you need is a main method to tie it all together:
public static void main(String[] args){
InputSupplier in = new TestInput();// our "dummy" input supplier class
OutputConsumer out = new TestOutput(); // our "dummy" output consumer
process(in, out);
}
For the "real" application you'd write a "real" input supplier class, still implementing the InputSupplier interface, that can read from from a database or an Excel file or whatever input source, and an new output consumer class, still implementing the OutputConsumer interface, that can take output records and store them into whatever appropriate format. Your processing logic won't have to change, because you coded it in terms of InputSupplier and OutputConsumer interfaces. Now just tweak main a bit and you've got your final app:
public static void main(String[] args){
InputSupplier in = new RealInput();// our "real" input supplier class
OutputConsumer out = new RealOutput(); // our "real" output consumer
process(in, out);
}
I can think of two ways to get the value from a Single:
Single<HotelResult> singleHotelResult =
apiObservables.getHotelInfoObservable(requestBody);
final HotelResult[] hotelResults = new HotelResult[1];
singleHotelResult
.subscribe(hotelResult -> {
hotelResults[0] = hotelResult;
});
Or
final HotelResult hotelResult = singleHotelResult
.toBlocking()
.value();
It's written in the documentation that we should avoid using the toBlocking() method.
So is there any better way to get the value?
Even though blocking is not recommended (you should subscribe instead), in RxJava 2 the method for blocking is blockingGet(); it returns the object immediately.
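A minimal sketch of both options, assuming RxJava 2 and the singleHotelResult from the question:
// Blocking: waits for the Single to complete and returns its value
// (or rethrows its error). Use sparingly, e.g. in tests.
HotelResult hotelResult = singleHotelResult.blockingGet();

// Non-blocking (preferred): react to the value when it arrives.
singleHotelResult.subscribe(
        result -> { /* use the result, e.g. update the UI */ },
        error -> { /* handle the failure */ });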
When we use toBlocking we get the result synchronously, right away. When we use subscribe, the result is obtained asynchronously.
Single<HotelResult> singleHotelResult =
apiObservables.getHotelInfoObservable(requestBody);
final HotelResult[] hotelResults = new HotelResult[1];
singleHotelResult.subscribe(hotelResult -> {
hotelResults[0] = hotelResult;
});
// hotelResults[0] may not be initialized here yet
// println may not show the result yet (if fetching the hotel info takes long)
System.out.println(hotelResults[0]);
For blocking case:
final HotelResult hotelResult = singleHotelResult.toBlocking().value();
// hotelResult has a value here, but the program will block here until the API call completes.
toBlocking helps in cases where you are using Observables in "normal" code and need the result in place.
subscribe helps, for example, in an Android application, where you can set actions in subscribe such as showing the result on the page, disabling a button, etc.