How do I extract the real-time FileIO state from Akka? - java

I am building a file transfer system with Akka and have been reading the documentation for a while. The current state of progress is that Actor2 receives the file sent by Actor1 and writes it to Actor2's local filesystem (Actor1 = sender, Actor2 = receiver).
However, I couldn't find a way to know how many bytes have been received in real time while writing.
I tested it, and it turns out that with the runWith API the file can be written locally, and with the runForeach API I can see how many bytes were delivered in real time. However, if both are run at the same time, the file cannot be written.
Here's my simple source. Please give me some advice.
public static Behavior<Command> create() {
    return Behaviors.setup(context -> {
        context.getLog().info("Registering myself with receptionist");
        context.getSystem().receptionist().tell(Receptionist.register(RECEIVER_SERVICE_KEY, context.getSelf().narrow()));
        Materializer mat = Materializer.createMaterializer(context);
        return Behaviors.receive(Command.class)
                .onMessage(TransferFile.class, command -> {
                    command.sourceRef.getSource().runWith(FileIO.toPath(Paths.get("test.pptx")), mat);
                    //command.replyTo.tell(new FileTransfered("filename", 1024));
                    command.sourceRef.getSource().runForeach(f -> System.out.println(f.size()), mat);
                    return Behaviors.same();
                }).build();
    });
}

Use a BroadcastHub to allow multiple consumers of your Source:
Source<ByteString, NotUsed> fileSource = command.sourceRef.getSource();

RunnableGraph<Source<ByteString, NotUsed>> runnableGraph =
        fileSource.toMat(BroadcastHub.of(ByteString.class, 256), Keep.right());
// adjust the buffer size (256) as needed

Source<ByteString, NotUsed> fromFileSource = runnableGraph.run(mat);
fromFileSource.runWith(FileIO.toPath(Paths.get("test.pptx")), mat);
fromFileSource.runForeach(f -> System.out.println(f.size()), mat);

BroadcastHub, as suggested by Jeffrey, allows a single running stream to be connected to multiple other streams that are started and stopped over time.
Having a stream that dynamically connects to others requires quite a lot of extra hoops internally, so if you don't need that, it is better to avoid the overhead.
If your use case is rather that you want to consume a single source with two sinks, that is better done with source.alsoTo(sink1).to(sink2).
alsoTo in the flow API is backed by the Broadcast operator, but using that directly requires that you use the Graph DSL.
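For example, here is a minimal sketch of the alsoTo approach, reusing the TransferFile command and materializer from the question (the file name and the progress print are placeholders):

Source<ByteString, NotUsed> fileSource = command.sourceRef.getSource();
fileSource
        // side channel: report how many bytes arrived in each chunk
        .alsoTo(Sink.foreach(chunk -> System.out.println(chunk.size())))
        // primary sink: write the bytes to disk
        .runWith(FileIO.toPath(Paths.get("test.pptx")), mat);

This materializes the SourceRef only once, which avoids the original problem of running it twice.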

Related

Why does my watermark not advance in my Apache Flink keyed stream?

I am currently using Apache Flink 1.13.2 with Java for my streaming application. I am using a keyed function with no window function. I have implemented a watermark strategy and autoWatermarkInterval config per the documentation, although my watermark is not advancing.
I have double-checked this by using the Flink web UI and by printing the current watermark in my EventProcessor KeyedProcessFunction, but the watermark is constantly set to -9223372036854775808 (Long.MIN_VALUE, the lowest possible watermark).
env.getConfig().setAutoWatermarkInterval(1000);

WatermarkStrategy<EventPayload> watermarkStrategy = WatermarkStrategy
        .<EventPayload>forMonotonousTimestamps()
        .withTimestampAssigner((event, timestamp) -> event.getTimestamp());

DataStream<EventPayload> deserialized = input
        .assignTimestampsAndWatermarks(watermarkStrategy)
        .flatMap(new Deserializer());

DataStream<EnrichedEventPayload> resultStream =
        AsyncDataStream.orderedWait(deserialized, new Enrichment(), 5, TimeUnit.SECONDS, 100);

DataStream<Session> eventsStream = resultStream
        .filter(EnrichedEventPayload::getIsEnriched)
        .keyBy(EnrichedEventPayload::getId)
        .process(new EventProcessor());
I even tried to add the WatermarkStrategy to the stream where it is using keyBy (and adjusting the types to match) but still no luck.
DataStream<Session> eventsStream = resultStream
        .filter(EnrichedEventPayload::getIsEnriched)
        .keyBy(EnrichedEventPayload::getId)
        .assignTimestampsAndWatermarks(watermarkStrategy)
        .process(new EventProcessor());
I have also tried using my own class implementing WatermarkStrategy and set breakpoints on the onEvent function to ensure the new watermark was being emitted, although it still did not advance (and any associated timers did not fire).
Any help would be greatly appreciated!
This will happen if one of the parallel instances of the watermark strategy is idle (i.e., if there are no events flowing through it). Using the withIdleness(...) option on the watermark strategy would be one way to solve this.
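For example, a minimal sketch of the strategy with idleness handling, assuming the same EventPayload type as in the question (the one-minute timeout is an arbitrary example value):

WatermarkStrategy<EventPayload> watermarkStrategy = WatermarkStrategy
        .<EventPayload>forMonotonousTimestamps()
        .withTimestampAssigner((event, timestamp) -> event.getTimestamp())
        // java.time.Duration: mark a partition idle after one minute without
        // events, so it no longer holds back the overall watermark
        .withIdleness(Duration.ofMinutes(1));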

Is it safe for a Flink application to have multiple data/key streams in a job all sharing the same Kafka source and sink?

(Goal Updated)
My goal for each data stream is to:
filter different messages
use different event-time session window gaps
consume from one topic and produce to another topic
A fan-out -> fan-in like DAG.
var fanoutStreamOne = new StreamComponents(/* filter, flatmap, etc. */);
var fanoutStreamTwo = new StreamComponents(/* filter, flatmap, etc. */);
var fanoutStreamThree = new StreamComponents(/* filter, flatmap, etc. */);
var fanoutStreams = Set.of(fanoutStreamOne, fanoutStreamTwo, fanoutStreamThree);

var source = new FlinkKafkaConsumer<>(...);
var sink = new FlinkKafkaProducer<>(...);

// creates streams from the same source to the same sink (using union())
new StreamingJob(source, sink, fanoutStreams).execute();
I am just curious if this affects recovery/checkpoints or performance of the Flink application.
Has anyone had success with this implementation?
And should I have the watermark strategy up front before filtering?
Thanks in advance!
Okay, I think the different session gaps are not possible. I tried it a year ago, with Flink 1.7, and I couldn't do it; the watermark is global to the application.
For the other problems: if you are using Kafka, you can read from several topics using a regex and get the topic name using the proper deserialization schema (here).
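For instance, a minimal sketch of the regex subscription with the FlinkKafkaConsumer class used in the question; the topic pattern, the schema class, and the properties object are assumptions for illustration:

var source = new FlinkKafkaConsumer<>(
        java.util.regex.Pattern.compile("events-.*"),    // hypothetical topic pattern
        new EventPayloadDeserializationSchema(),         // hypothetical KafkaDeserializationSchema exposing the topic name
        kafkaProperties);                                // hypothetical consumer properties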
To filter the messages, I think you can use filter functions with the side output streams :) (here)
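And a hedged sketch of splitting one stream with a side output; the input stream, the OutputTag name, and the isTypeB() predicate are made up for illustration:

final OutputTag<EventPayload> typeB = new OutputTag<EventPayload>("type-b") {};
SingleOutputStreamOperator<EventPayload> mainStream = input.process(
        new ProcessFunction<EventPayload, EventPayload>() {
            @Override
            public void processElement(EventPayload value, Context ctx, Collector<EventPayload> out) {
                if (value.isTypeB()) {
                    ctx.output(typeB, value); // route to the side output
                } else {
                    out.collect(value);       // keep on the main output
                }
            }
        });
DataStream<EventPayload> typeBStream = mainStream.getSideOutput(typeB);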

Crash during stream materialization with OpenCV BackgroundSubtractorMOG2 with Akka streams in Java

For my project I want to create an application that does some video analysis with the OpenCV libs in Java while using Akka Streams.
I tried using BackgroundSubtractorMOG2 in a separate project that doesn't use Akka Streams and everything works fine, but now, when I materialize my stream with a stage containing the MOG2 function, my program crashes. I am sure the problem is in MOG2, because if I remove it and just capture and show frames on video, everything works fine.
Here is some of the code inside an Akka actor:
private final Materializer materializer = ActorMaterializer.create(this.getContext());
private final BackgroundSubtractorMOG2 mog2 = Video.createBackgroundSubtractorMOG2();
This gets executed in a preStart() method.
The following creates an Akka Streams source that generates frames with OpenCV's VideoCapture:
this.frameSource = Source.fromGraph(new CameraFrameSource(capture));
This is the middle part of the stream, where I want the video analysis to be done. It makes a copy of each frame: one copy goes through the video analysis while the other goes untouched into a zip and waits for its counterpart to be processed by MOG2:
this.videoAnalysisPartialGraph = GraphDSL.create(builder -> {
    final UniformFanOutShape<Mat, Mat> A = builder.add(Broadcast.create(2));
    final FlowShape<Mat, Mat> bgs = builder.add(
            Flow.of(Mat.class).map(f -> subtractBackground(f)).async());
    final FanInShape2<Mat, Mat, Pair<Mat, Mat>> zip = builder.add(
            ZipWith.create((Mat left, Mat right) -> new Pair<Mat, Mat>(left, right)));
    builder.from(A).toInlet(zip.in1());
    builder.from(A).via(bgs).toInlet(zip.in0());
    return new FlowShape<Mat, Pair<Mat, Mat>>(A.in(), zip.out());
});
This is the subtraction method that makes the program crash once the stream is materialized and starts running:
private Mat subtractBackground(Mat frame) {
    Mat fgmask = new Mat();
    this.mog2.apply(frame, fgmask);
    return fgmask;
}
This is the closed graph, created for debugging purposes, that gets materialized once the actor receives a message. It picks 33 frames each second and processes them through the stages into a Pair of Mat; one of the two Mat in the pair is then picked and shown in a window. The KillSwitch is there to let me turn off the stream. I'm not sure it's actually needed, but it doesn't get in the way, since the stream works when there is no MOG2 involved:
this.stream = frameSource
        .throttle(33, FiniteDuration.create(1, TimeUnit.SECONDS), 1, ThrottleMode.shaping())
        .via(this.videoAnalysisPartialGraph)
        .map(p -> p.first())
        .viaMat(KillSwitches.single(), Keep.right())
        .toMat(Sink.foreach(f -> showFrame(f)), Keep.left());
This is the method that runs the stream:
private void startVideoCapture() {
    this.capture.open(cameraId);
    if (capture.isOpened()) {
        this.cameraActive = true;
        killswitch = this.stream.run(materializer);
    } else {
        System.err.println("Can't open camera connection.");
    }
}
As I said, the stream works perfectly when I don't do background subtraction and just show the captured video on screen, going through the same stream graph (modified to not include MOG2, of course).
Does it have something to do with dispatchers or the materializer? I have no idea; does anyone have any suggestions?
Thank you
EDIT:
I tried applying MOG2 outside of the stream but still inside the Akka actor, and the program still crashes. So now I think it could be something related to how an Akka actor deals with the MOG2.apply call.
WORKAROUND
I solved the problem after I found that there was a conflict between the OpenCV libs and the akka-remote libs. For the moment I removed the latter and everything works correctly.
I may need to use both libs together in the future, so at the moment I still don't know how to solve this completely.

Does JCODEC Support MPEG-TS or MPEG-PS

I am trying to be able to pick out frames (video and metadata) from MPEG, MPEG-TS and MPEG-PS files and live streams (network / UDP / RTP streams). I was looking into using JCODEC to do this and I started off by trying to use the FrameGrab / FrameGrab8Bit classes, and ran into an error that those formats are "temporarily unsupported". I looked into going back some commits to see if I could just use older code, but it looks like both of those files have had those formats "temporarily unsupported" since 2013 / 2015, respectively.
I then tried to plug things back into the FrameGrab8Bit class by putting in the below code...
public static FrameGrab8Bit createFrameGrab8Bit(SeekableByteChannel in) throws IOException, JCodecException {
    ...
    SeekableDemuxerTrack videoTrack = null;
    ...
    case MPEG_PS:
        MPSDemuxer psd = new MPSDemuxer(in);
        List<MPEGDemuxerTrack> tracks = psd.getVideoTracks();
        videoTrack = (SeekableDemuxerTrack) tracks.get(0);
        break;
    case MPEG_TS:
        in.setPosition(0);
        MTSDemuxer tsd = new MTSDemuxer(in);
        ReadableByteChannel program = tsd.getProgram(481);
        MPSDemuxer ptsd = new MPSDemuxer(program);
        List<MPEGDemuxerTrack> tstracks = ptsd.getVideoTracks();
        MPEGDemuxerTrack muxtrack = tstracks.get(0);
        videoTrack = (SeekableDemuxerTrack) tstracks.get(0);
        break;
    ...
but I ran into a packet header assertion failure in the MTSDemuxer.java class in the parsePacket function:
public static MTSPacket parsePacket(ByteBuffer buffer) {
    int marker = buffer.get() & 0xff;
    Assert.assertEquals(0x47, marker);
    ...
I found that when I reset the position of the seekable byte channel (i.e. in.setPosition(0)), the code makes it past the assert, but then fails at videoTrack = (SeekableDemuxerTrack) tstracks.get(0) (tstracks.get(0) cannot be converted to a SeekableDemuxerTrack).
Am I wasting my time? Are these formats supported somewhere in the library and I am just not able to find them?
Also, after going around in the code and making quick test applications, it seems like all you get out of the demuxers are video frames. Is there no way to get the metadata frames associated with the video frames?
For reference, I am using the test files from: http://samples.ffmpeg.org/MPEG2/mpegts-klv/
In case anyone in the future also has this question: I got a response from a developer on the project's GitHub page. Response:
Yeah, MPEG TS is not supported to the extent MP4 is. You can't really seek in TS streams (unless you index the entire stream before hand).
I also asked about how to implement the feature. I thought that it could be done by reworking the MTSDemuxer class to be built off of the SeekableDemuxerTrack so that things would be compatible with the FrameGrab8Bit class, and got the following response:
So it doesn't look like there's much sense to implement TS demuxer on top of SeekableDemuxerTrack. We haven't given much attention to TS demuxer actually, so any input is very welcome.
I think this (building the MTSDemuxer class on top of the SeekableDemuxerTrack interface) would work for files, since you already have everything there. But without fully fleshing out that thought, I could not say for sure (it definitely makes sense that this solution would not work for a live MPEG-TS/PS connection).

Can Spark Streaming do Anything Other Than Word Count?

I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I want to do something more than a word count on a text file/stream/Kafka queue, which is all we seem to be allowed to understand from the docs.
I want to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);

groupByKeyList.foreachRDD(rdd -> {
    List<MyThing> myThingsList = new ArrayList<>();
    MyCalculationCode myCalc = new MyCalculationCode();

    rdd.foreachPartition(partition -> {
        while (partition.hasNext()) {
            Tuple2<String, byte[]> keyAndMessage = partition.next();
            MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); // parse from protobuf format
            myThingsList.add(aSingleMyThing);
        }
    });

    List<MyResult> results = myCalc.doTheStuff(myThingsList);
    // other code here to write results to file
});
When debugging I see that, inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results, because myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but a section/paragraph, or preferably a link to JavaDoc that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? It means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
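For example, here is a hedged sketch of grouping values per key with combineByKey, assuming the kafkaStream from the question; the HashPartitioner and its partition count are arbitrary example choices:

JavaPairDStream<String, List<byte[]>> grouped = kafkaStream.combineByKey(
        v -> { List<byte[]> l = new ArrayList<>(); l.add(v); return l; }, // createCombiner: start a list per key
        (l, v) -> { l.add(v); return l; },                                // mergeValue: append within a partition
        (l1, l2) -> { l1.addAll(l2); return l1; },                        // mergeCombiners: merge partial lists
        new HashPartitioner(4));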
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
        .combineByKey(/* implement logic */)
        .flatMap(x -> x._2)
        .map(proto -> MyThing.parseFrom(proto))
        .map(myThing -> myCalc.doStuff(myThing))
        .foreachRDD(/* after all the processing, do stuff with the result */);
