I am new to the Flink Streaming API and I want to complete the following simple (IMO) task. I have two streams, and I want to join them using count-based windows. The code I have so far is the following:
public class BaselineCategoryEquiJoin {
private static final String recordFile = "some_file.txt";
private static class ParseRecordFunction implements MapFunction<String, Tuple2<String[], MyRecord>> {
public Tuple2<String[], MyRecord> map(String s) throws Exception {
MyRecord myRecord = parse(s);
return new Tuple2<String[], MyRecord>(myRecord.attributes, myRecord);
}
}
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment();
ExecutionConfig config = environment.getConfig();
config.setParallelism(8);
DataStream<Tuple2<String[], MyRecord>> dataStream = environment.readTextFile(recordFile)
.map(new ParseRecordFunction());
DataStream<Tuple2<String[], MyRecord>> dataStream1 = environment.readTextFile(recordFile)
.map(new ParseRecordFunction());
DataStreamSink<Tuple2<String[], String[]>> joinedStream = dataStream1
.join(dataStream)
.where(new KeySelector<Tuple2<String[],MyRecord>, String[]>() {
public String[] getKey(Tuple2<String[], MyRecord> recordTuple2) throws Exception {
return recordTuple2.f0;
}
}).equalTo(new KeySelector<Tuple2<String[], MyRecord>, String[]>() {
public String[] getKey(Tuple2<String[], MyRecord> recordTuple2) throws Exception {
return recordTuple2.f0;
}
}).window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
.apply(new JoinFunction<Tuple2<String[],MyRecord>, Tuple2<String[],MyRecord>, Tuple2<String[], String[]>>() {
public Tuple2<String[], String[]> join(Tuple2<String[], MyRecord> tuple1, Tuple2<String[], MyRecord> tuple2) throws Exception {
return new Tuple2<String[], String[]>(tuple1.f0, tuple1.f0);
}
}).print();
environment.execute();
}
}
My code works without errors, but it does not produce any results. In fact, the apply method is never called (verified by adding a breakpoint in debug mode). I think the main reason for this is that my data do not have a time attribute, so windowing (materialized through window) is not done properly. My question, therefore, is how I can indicate that I want my join to take place over count-based windows. For instance, I want the join to materialize every 100 tuples from each stream. Is this feasible in Flink? If yes, what should I change in my code to achieve it?
I should also mention that I tried to call the countWindow() method, but for some reason it is not offered by Flink's JoinedStreams.
Thank you
Count-based joins are not supported. You could emulate count-based windows by using event-time semantics and applying a unique sequence id as the timestamp of each record. Thus, a time window of "5" would effectively be a count window of 5.
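If it helps, here is a minimal sketch of that workaround, assuming the same pre-1.12 Flink API as the code above (AssignerWithPunctuatedWatermarks and an explicit event-time characteristic); the names are illustrative, and the sequence counter is per parallel subtask, so strict count semantics assume parallelism 1:

environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Tuple2<String[], MyRecord>> withSeqIds = dataStream
    .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<Tuple2<String[], MyRecord>>() {
        private long seq = 0;

        @Override
        public long extractTimestamp(Tuple2<String[], MyRecord> element, long previousTimestamp) {
            return seq++; // the sequence id becomes the "event time" of the record
        }

        @Override
        public Watermark checkAndGetNextWatermark(Tuple2<String[], MyRecord> element, long extractedTimestamp) {
            return new Watermark(extractedTimestamp); // advance the watermark after every record
        }
    });

The join would then use .window(TumblingEventTimeWindows.of(Time.milliseconds(100))) instead of the processing-time window, which effectively gives a count window of 100 per stream.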
I'm essentially asking the same as this old question, but for Java 14 instead of Java 8. To spare answerers the trouble of navigating to the old question, I'll rephrase it here.
I want to get the name of a function from a referenced method. The following Java code should give you the idea:
public class Main
{
public static void main(String[] args)
{
printMethodName(Main::main);
}
private static void printMethodName(Consumer<String[]> theFunc)
{
String funcName = // somehow get name from theFunc
System.out.println(funcName);
}
}
The equivalent in C# would be:
public class Main
{
public static void Main()
{
var method = Main.Main;
PrintMethodName(method);
}
private static void PrintMethodName(Action action)
{
Console.WriteLine(action.GetMethodInfo().Name);
}
}
According to the accepted answer of the old question, this was not possible in Java 8 without considerable work, such as this solution. Is there a more elegant solution in Java 14?
Getting method info from a method reference was never a goal on the JDK developers' side, so no effort was made to change the situation.
However, the approach shown in your link can be simplified. Instead of serializing the information, patching the serialized data, and restoring the information using a replacement object, you can simply intercept the original SerializedLambda object during serialization.
E.g.
public class GetSerializedLambda extends ObjectOutputStream {
public static void main(String[] args) { // example case
var lambda = (Consumer<String[]>&Serializable)GetSerializedLambda::main;
SerializedLambda sl = GetSerializedLambda.get(lambda);
System.out.println(sl.getImplClass() + " " + sl.getImplMethodName());
}
private SerializedLambda info;
GetSerializedLambda() throws IOException {
super(OutputStream.nullOutputStream());
super.enableReplaceObject(true);
}
@Override protected Object replaceObject(Object obj) throws IOException {
if(obj instanceof SerializedLambda) {
info = (SerializedLambda)obj;
obj = null;
}
return obj;
}
public static SerializedLambda get(Object obj) {
try {
GetSerializedLambda getter = new GetSerializedLambda();
getter.writeObject(obj);
return getter.info;
} catch(IOException ex) {
throw new IllegalArgumentException("not a serializable lambda", ex);
}
}
}
which will print GetSerializedLambda main. The only newer feature used here is OutputStream.nullOutputStream(), to drop the written information immediately. Prior to JDK 11, you could write into a ByteArrayOutputStream and drop the information after the operation, which is only slightly less efficient. The example also uses var, but this is irrelevant to the actual operation of getting the method information.
The limitations are the same as in JDK 8. It requires a serializable method reference. Further, there is no guarantee that the implementation will map to a method directly. E.g., if you change the example's declaration to public static void main(String... args), it will print something like lambda$1 when compiled with Eclipse. When also changing the next line to var lambda = (Consumer<String>&Serializable)GetSerializedLambda::main;, the code will always print a synthetic method name, as using a helper method is unavoidable. But in the case of javac, the name is rather something like lambda$main$f23f6912$1 instead of Eclipse's lambda$1.
In other words, you can expect encountering surprising implementation details. Do not write applications relying on the availability of such information.
I'm trying to implement (I'm just starting to work with Java and Flink) non-keyed state in a KafkaConsumer object, since at this stage no keyBy() is called. This object is the front end and the first module that handles messages from Kafka.
SourceOutput is a proto file representing the message.
I have the KafkaConsumer object:
public class KafkaSourceFunction extends ProcessFunction<byte[], SourceOutput> implements Serializable
{
@Override
public void processElement(byte[] bytes, ProcessFunction<byte[], SourceOutput>.Context
context, Collector<SourceOutput> collector) throws Exception
{
// Here, I want to call to sorting method
collector.collect(output);
}
}
I have an object (KafkaSourceSort) that does all the sorting, keeps the unordered messages in a priority queue in the state, and is also responsible for delivering a message through the collector when it arrives in the right order.
class SessionInfo
{
public PriorityQueue<SourceOutput> orderedMessages = null;
public void putMessage(SourceOutput Msg)
{
if(orderedMessages == null)
orderedMessages = new PriorityQueue<SourceOutput>(new SequenceComparator());
orderedMessages.add(Msg);
}
}
public class KafkaSourceState implements Serializable
{
public TreeMap<String, SessionInfo> Sessions = new TreeMap<>();
}
I read that I need to use non-keyed state (ListState), which should contain a map of sessions, where each session contains a priority queue holding all messages related to that session.
I found an example, so I implemented this:
public class KafkaSourceSort implements SinkFunction<KafkaSourceState>,
CheckpointedFunction
{
private transient ListState<KafkaSourceState> checkpointedState;
private KafkaSourceState state;
@Override
public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception
{
checkpointedState.clear();
checkpointedState.add(state);
}
@Override
public void initializeState(FunctionInitializationContext context) throws Exception
{
ListStateDescriptor<KafkaSourceState> descriptor =
new ListStateDescriptor<KafkaSourceState>(
"KafkaSourceState",
TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
checkpointedState = context.getOperatorStateStore().getListState(descriptor);
if (context.isRestored())
{
for (KafkaSourceState restored : checkpointedState.get())
{
state = restored;
}
}
}
@Override
public void invoke(KafkaSourceState value, SinkFunction.Context context) throws Exception
{
state = value;
// ...
}
}
I see that I need to implement an invoke method, which will probably be called from processElement(), but the signature of invoke() doesn't contain a collector, and I don't understand how to do this or even whether what I have done so far is correct.
Any help will be appreciated.
Thanks.
A SinkFunction is a terminal node in the DAG that is your job graph. It doesn't have a Collector in its interface because it cannot emit anything downstream. It is expected to connect to an external service or data store and send data there.
If you share more about what you are trying to accomplish perhaps we can offer more assistance. There may be an easier way to go about this.
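For example, one possible direction (only a sketch, with illustrative names and placeholder helpers, not a drop-in solution): keep the sorting logic and the non-keyed ListState inside the ProcessFunction itself by having it implement CheckpointedFunction, so the operator that owns the state is the same one that has a Collector to emit the in-order messages.

public class SortingKafkaSourceFunction extends ProcessFunction<byte[], SourceOutput>
        implements CheckpointedFunction
{
    private transient ListState<KafkaSourceState> checkpointedState;
    private KafkaSourceState state = new KafkaSourceState();

    @Override
    public void processElement(byte[] bytes, Context context, Collector<SourceOutput> collector) throws Exception
    {
        SourceOutput message = SourceOutput.parseFrom(bytes); // assumes a proto-generated parser
        state.Sessions
             .computeIfAbsent(sessionIdOf(message), k -> new SessionInfo())
             .putMessage(message);
        for (SourceOutput ready : drainInOrder(message))      // messages that are now in order
        {
            collector.collect(ready);
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception
    {
        checkpointedState.clear();
        checkpointedState.add(state);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception
    {
        checkpointedState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("KafkaSourceState",
                        TypeInformation.of(new TypeHint<KafkaSourceState>() {})));
        if (context.isRestored())
        {
            for (KafkaSourceState restored : checkpointedState.get())
            {
                state = restored; // the ListState holds a single entry here
            }
        }
    }

    // placeholders for your own session-id extraction and release logic
    private String sessionIdOf(SourceOutput message) { return ""; }
    private List<SourceOutput> drainInOrder(SourceOutput message) { return new ArrayList<>(); }
}

That keeps everything in the source-side operator without needing a SinkFunction at all.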
I have the following StateMachineConfigurerAdapter class (currently using spring-statemachine 2.1.3.RELEASE)
@lombok.extern.slf4j.Slf4j
@EnableStateMachineFactory(name = "defaultSMF", contextEvents = false)
class MyStateMachineConfiguration extends StateMachineConfigurerAdapter<String, String> {
@Override
public void configure(StateMachineConfigurationConfigurer<String, String> config) throws Exception {
config.withConfiguration()
.machineId("defaultSMF")
.autoStartup(true)
.listener(new ContextStateMachineListener<>(log));
}
@Override
public void configure(StateMachineStateConfigurer<String, String> states) throws Exception {
states.withStates()
.initial("STATE_CREATED")
.end("STATE_ENDED")
.states(new HashSet<>(Arrays.asList("STATE_CREATED", "STATE_UPSERTED", "STATE_ENDED")));
}
@Override
public void configure(StateMachineTransitionConfigurer<String, String> transitions) throws Exception {
var t = transitions.withExternal()
.source("STATE_CREATED")
.target("STATE_UPSERTED")
.event("UPSERT_STATE");
t.and().withExternal()
.source("STATE_UPSERTED")
.target("STATE_ENDED")
.event("END_STATE");
}
}
I tried using the Context Integration Approach mentioned in the reference docs. I created the following class.
@WithStateMachine(id = "defaultSMF")
class MyStateManagement {
#OnTransitionStart(source = "STATE_CREATED", targets = "STATE_UPSERTED")
public void onTransitionUpsertState() {
System.out.println("Transition UPSERT_STATE");
throw new RuntimeException("Abort transition");
}
#OnTransitionStart(source = "STATE_UPSERTED", targets = "STATE_ENDED")
public void onTransitionEndState() {
System.out.println("Transition END_STATE");
}
}
When the state machine is in the STATE_CREATED state and I send an UPSERT_STATE event, the state machine still changes to STATE_UPSERTED, even though an exception is thrown in the method that is triggered. The behaviour I'm expecting is that the state should not change, but I'm unable to get that result.
The only way I can get the behaviour I want is to use the following approach inside the MyStateMachineConfiguration class:
...
@Override
public void configure(StateMachineTransitionConfigurer<String, String> transitions) throws Exception {
var t = transitions.withExternal()
.source("STATE_CREATED")
.target("STATE_UPSERTED")
.event("UPSERT_STATE")
.action(context -> {
throw new RuntimeException("Abort transition!");
});
t.and().withExternal()
.source("STATE_UPSERTED")
.target("STATE_ENDED")
.event("END_STATE");
}
...
where I have added an action to the transition. But this approach is quite cumbersome. My preference would be to separate the flow handling and the business logic into separate classes. The Spring Context Integration approach would be the best fit, but I can't figure out how to get the behaviour I want with it. Any help is appreciated. Thanks in advance.
I am trying to build a reactive pipeline using Java and Project Reactor, where the use case is that the application generates flow statuses (INIT, PROCESSING, SAVED, DONE) at different levels. The statuses must be emitted asynchronously to a Flux that needs to be handled independently and separately from the main flow. I came across this link:
Spring WebFlux (Flux): how to publish dynamically
My sample flow is something like this:
public class StatusEmitterImpl implements StatusEmitter {
private final FluxProcessor<String, String> processor;
private final FluxSink<String> sink;
public StatusEmitterImpl() {
this.processor = DirectProcessor.<String>create().serialize();
this.sink = processor.sink();
}
@Override
public Flux<String> publisher() {
return this.processor.map(x -> x);
}
public void publishStatus(String status) {
sink.next(status);
}
}
public class Try {
public static void main(String[] args) {
StatusEmitterImpl statusEmitter = new StatusEmitterImpl();
Flux.fromIterable(Arrays.asList("INIT", "DONE")).subscribe(x ->
statusEmitter.publishStatus(x));
statusEmitter.publisher().subscribe(x -> System.out.println(x));
}
}
The problem is that nothing is getting printed on the console. I cannot understand what I am missing.
DirectProcessor passes values to its registered Subscribers directly, without caching the signals. If there is no Subscriber, then the value is "forgotten". If a Subscriber comes in late, then it will only receive signals emitted after it subscribed.
That's what is happening here: because fromIterable works on an in-memory collection, it has time to push all values to the DirectProcessor, which by that time doesn't have a registered Subscriber yet.
If you invert the last two lines you should see something.
DirectProcessor is a hot publisher and does not buffer elements, so you should produce elements after it has a subscriber, like this:
public static void main(String[] args) {
StatusEmitterImpl statusEmitter = new StatusEmitterImpl();
statusEmitter.publisher().subscribe(x -> System.out.println(x));
Flux.fromIterable(Arrays.asList("INIT", "DONE")).subscribe(x -> statusEmitter.publishStatus(x));
}
Alternatively, use EmitterProcessor or UnicastProcessor instead of DirectProcessor.
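For illustration, here is a minimal sketch of that alternative, reusing the FluxSink setup from the question (note that the processor API is deprecated in newer Reactor versions in favour of Sinks): EmitterProcessor buffers signals until its first subscriber arrives, so even the original ordering of the two calls would print the statuses.

EmitterProcessor<String> processor = EmitterProcessor.create();
FluxSink<String> sink = processor.sink();
// signals emitted before anyone subscribes are buffered (up to the internal buffer size)
sink.next("INIT");
sink.next("DONE");
processor.subscribe(System.out::println); // prints INIT then DONE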
There's some domain knowledge/business logic baked into the problem I'm trying to solve but I'll try to boil it down to the basics as much as possible.
Say I have an interface defined as follows:
public interface Stage<I, O> {
StageResult<O> process(StageResult<I> input) throws StageException;
}
This represents a stage in a multi-stage data processing pipeline, my idea is to break the data processing steps into sequential (non-branching) independent steps (such as read from file, parse network headers, parse message payloads, convert format, write to file) represented by individual Stage implementations. Ideally I'd implement a FileInputStage, a NetworkHeaderParseStage, a ParseMessageStage, a FormatStage, and a FileOutputStage, then have some sort of
Stage<A, C> compose(Stage<A, B> stage1, Stage<B, C> stage2);
method such that I can eventually compose a bunch of stages into a final stage that looks like FileInput -> FileOutput.
Is this something (specifically the compose method, or a similar mechanism for aggregating many stages into one stage) even supported by the Java type system? I'm hacking away at it now and I'm ending up in a very ugly place involving reflection and lots of unchecked generic types.
Am I heading off in the wrong direction or is this even a reasonable thing to try to do in Java? Thanks so much in advance!
You didn't post enough implementation details to show where the type-safety issues are, but here is my take on how you could address the problem:
First, don't make the whole thing too generic; make your stages specific regarding their inputs and outputs.
Then create a composite stage which implements Stage and combines two stages into one final result.
Here is a very simple implementation:
public class StageComposit<A, B, C> implements Stage<A, C> {
final Stage<A, B> stage1;
final Stage<B, C> stage2;
public StageComposit(Stage<A, B> stage1, Stage<B, C> stage2) {
this.stage1 = stage1;
this.stage2 = stage2;
}
@Override
public StageResult<C> process(StageResult<A> input) throws StageException {
return stage2.process(stage1.process(input));
}
}
Stage result
public class StageResult<O> {
final O result;
public StageResult(O result) {
this.result = result;
}
public O get() {
return result;
}
}
Example specific Stages:
public class EpochInputStage implements Stage<Long, Date> {
@Override
public StageResult<Date> process(StageResult<Long> input) {
return new StageResult<Date>(new Date(input.get()));
}
}
public class DateFormatStage implements Stage<Date, String> {
@Override
public StageResult<String> process(StageResult<Date> input) {
return new StageResult<String>(
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
.format(input.get()));
}
}
public class InputSplitStage implements Stage<String, List<String>> {
@Override
public StageResult<List<String>> process(StageResult<String> input) {
return new StageResult<List<String>>(
Arrays.asList(input.get().split("[-:\\s]")));
}
}
And finally, a small test demonstrating how to combine them all:
public class StageTest {
@Test
public void process() throws StageException {
EpochInputStage efis = new EpochInputStage();
DateFormatStage dfs = new DateFormatStage();
InputSplitStage iss = new InputSplitStage();
Stage<Long, String> sc1 =
new StageComposit<Long, Date, String>(efis, dfs);
Stage<Long, List<String>> sc2 =
new StageComposit<Long, String, List<String>>(sc1, iss);
StageResult<List<String>> result =
sc2.process(new StageResult<Long>(System.currentTimeMillis()));
System.out.print(result.get());
}
}
The output for the current time would be a list of strings:
[2015, 06, 24, 16, 27, 55]
As you can see, there are no type-safety issues or any type casts. When you need to handle other types of inputs and outputs, or convert them to suit the next stage, just write a new Stage and hook it up into your stage processing chain.
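For completeness, the compose method asked for in the question is then just a thin factory around this composite class (a sketch, not part of the original code above):

static <A, B, C> Stage<A, C> compose(Stage<A, B> first, Stage<B, C> second) {
    return new StageComposit<>(first, second);
}

// usage:
// Stage<Long, List<String>> pipeline =
//     compose(compose(new EpochInputStage(), new DateFormatStage()), new InputSplitStage());

The compiler infers A, B and C from the arguments, so chaining stays fully type-safe.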
You may want to consider using a composite pattern or a decorator pattern. For the decorator, each stage will wrap or decorate the previous stage. To do this, you have each stage implement the interface as you are doing and allow a stage to contain another stage.
The process() method does not need to accept a StageResult parameter anymore since it can call the contained Stage's process() method itself, get the StageResult and perform its own processing, returning another StageResult.
One advantage is that you can restructure your pipeline at run time.
Each Stage that may contain another can extend ComposableStage, and each stage that is an endpoint of the process can extend LeafStage. Note that I just used those terms to name the classes by function, but you can create more imaginative names.
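A small sketch of that decorator idea, reusing StageResult and StageException from above (the names here are illustrative, one possible shape rather than the definitive design):

// each stage wraps the stage before it, so process() takes no argument
interface PullStage<O> {
    StageResult<O> process() throws StageException;
}

// a "LeafStage": the start of the chain, produces a value on its own
class EpochSourceStage implements PullStage<Long> {
    @Override
    public StageResult<Long> process() {
        return new StageResult<>(System.currentTimeMillis());
    }
}

// a "ComposableStage": decorates another stage and transforms its result
class DateFormatDecorator implements PullStage<String> {
    private final PullStage<Long> inner;

    DateFormatDecorator(PullStage<Long> inner) {
        this.inner = inner;
    }

    @Override
    public StageResult<String> process() throws StageException {
        long epoch = inner.process().get(); // pull from the wrapped stage first
        return new StageResult<>(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(epoch)));
    }
}

// usage: String formatted = new DateFormatDecorator(new EpochSourceStage()).process().get();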