I have a stream of objects with an address and a list of organizations:
@Data
class TaggedObject {
    String address;
    List<String> organizations;
}
Is there a way to do the following using Apache Flink:
Merge organization lists for objects with the same address
Send all results to a Sink when some event occurs, e.g. when a user sends a control message to a Kafka topic or another DataSource
Keep all objects for future accumulations
I tried using a global window and a custom trigger:
public class MyTrigger extends Trigger<TaggedObject, GlobalWindow> {

    @Override
    public TriggerResult onElement(TaggedObject element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
        if (element instanceof Control) return TriggerResult.FIRE;
        else return TriggerResult.CONTINUE;
    }

    // onProcessingTime, onEventTime and clear omitted here
}
But it seems to give only the Control element as a result. Other elements were ignored.
If you want a generic control signal that triggers output for ALL addresses, then you'll need to use a broadcast stream. You combine your stream of addresses with your control stream and then perform the appropriate logic (merging organizations for an address, or triggering output) inside of your custom implementation of a KeyedBroadcastProcessFunction.
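A rough sketch of that broadcast approach could look like the following (a minimal sketch against Flink's DataStream API, using the Control class from the question; the taggedStream, controlStream and sink variables are assumed):
import java.util.ArrayList;

import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class MergeOnControl
        extends KeyedBroadcastProcessFunction<String, TaggedObject, Control, TaggedObject> {

    // organizations accumulated per address; the state survives each firing,
    // so objects are kept for future accumulations
    private final ListStateDescriptor<String> orgsDesc =
            new ListStateDescriptor<>("organizations", Types.STRING);

    @Override
    public void processElement(TaggedObject value, ReadOnlyContext ctx,
                               Collector<TaggedObject> out) throws Exception {
        // merge the organization list for this address
        getRuntimeContext().getListState(orgsDesc).addAll(value.organizations);
    }

    @Override
    public void processBroadcastElement(Control control, Context ctx,
                                        Collector<TaggedObject> out) throws Exception {
        // a control message emits the accumulated result for every address seen so far
        ctx.applyToKeyedState(orgsDesc, (address, state) -> {
            TaggedObject result = new TaggedObject();
            result.address = address;
            result.organizations = new ArrayList<>();
            state.get().forEach(result.organizations::add);
            out.collect(result);
        });
    }
}

// wiring: key by address and connect with the broadcast control stream
MapStateDescriptor<Void, Void> controlDesc =
        new MapStateDescriptor<>("control", Types.VOID, Types.VOID); // required by broadcast(), not read here

taggedStream
        .keyBy(t -> t.address)
        .connect(controlStream.broadcast(controlDesc))
        .process(new MergeOnControl())
        .addSink(sink);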
It seems like you should just key the stream by address and then use a KeyedProcessFunction (with a List- or MapState) to store the different organizations. Then as soon as an event comes in, you can just output the entries of the State.
Kind Regards
Dominik
Is it possible to use interactive queries (InteractiveQueryService) within Spring Cloud Stream, i.e. in the class with the @EnableBinding annotation or within the method with @StreamListener? I tried instantiating ReadOnlyKeyValueStore within the provided KStreamMusicSampleApplication class and the process method, but it's always null.
My @StreamListener method is listening to a bunch of KTables and KStreams, and during the processor topology (e.g. filtering) I have to check whether the key from a KStream already exists in a particular KTable.
I tried to figure out how to scan an incoming KTable to check if a key already exists, but no luck. Then I came across InteractiveQueryService, whose get() method could be used to check if a key exists inside a state store materializedAs from a KTable. The problem is that I can't access it from within the processor topology (@EnableBinding or @StreamListener). It can only be accessed from outside these annotations, e.g. in a RestController.
Is there a way to scan an incoming KTable to check for the existence of a key or value? If not, can we access InteractiveQueryService within the processor topology?
InteractiveQueryService in Spring Cloud Stream is not available to be used within the actual topology in your StreamListener. As you mentioned, it is supposed to be used outside of your main topology. However, with the use case you described, you can still use the state store from your main flow. For example, if you have an incoming KStream and a KTable which is materialized as a state store, then you can call process on the KStream and access the state store that way. Here is some rough code to achieve that; you need to adapt it to your specific use case, but here is the idea.
ReadOnlyKeyValueStore<Object, String> store;

input.process(() -> new Processor<Object, Product>() {

    @Override
    public void init(ProcessorContext processorContext) {
        store = (ReadOnlyKeyValueStore) processorContext.getStateStore("my-store");
    }

    @Override
    public void process(Object key, Product value) {
        // look up the key in the state store that backs the KTable
        store.get(key);
    }

    @Override
    public void close() {
        // nothing to close here; the store's lifecycle is managed by Kafka Streams
    }
}, "my-store");
I am developing an application that creates some Akka actors to manage and process messages coming from a Kafka topic. Messages with the same key are processed by the same actor. I use the message key also to name the corresponding actor.
When a new message is read from the topic, I don't know if the actor with the id equal to the message key was already created by the actor system or not. Therefore, I try to resolve the actor using its name, and if it does not exist yet, I create it. I need to manage concurrency in regard to actor resolution. So it is possible that more than one client asks the actor system if an actor exists.
The code I am using right now is the following:
private CompletableFuture<ActorRef> getActor(String uuid) {
return system.actorSelection(String.format("/user/%s", uuid))
.resolveOne(Duration.ofMillis(1000))
.toCompletableFuture()
.exceptionally(ex ->
system.actorOf(Props.create(MyActor.class, uuid), uuid))
.exceptionally(ex -> {
try {
return system.actorSelection(String.format("/user/%s",uuid)).resolveOne(Duration.ofMillis(1000)).toCompletableFuture().get();
} catch (InterruptedException | ExecutionException e) {
throw new RuntimeException(e);
}
});
}
The above code is not optimised, and the exception handling can be made better.
However, is there in Akka a more idiomatic way to resolve an actor, or to create it if it does not exist? Am I missing something?
Consider creating an actor that maintains as its state a map of message IDs to ActorRefs. This "receptionist" actor would handle all requests to obtain a message processing actor. When the receptionist receives a request for an actor (the request would include the message ID), it tries to look up an associated actor in its map: if such an actor is found, it returns the ActorRef to the sender; otherwise it creates a new processing actor, adds that actor to its map, and returns that actor reference to the sender.
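A minimal sketch of such a receptionist (Akka classic Java API; the GetProcessor request message is made up, and MyActor is the processing actor from the question):
import java.util.HashMap;
import java.util.Map;

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;

public class Receptionist extends AbstractActor {

    // request message carrying the message key / actor name
    public static class GetProcessor {
        public final String id;
        public GetProcessor(String id) { this.id = id; }
    }

    // message id -> processing actor
    private final Map<String, ActorRef> processors = new HashMap<>();

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(GetProcessor.class, msg -> {
                    // look up the actor, or create it as a child if it does not exist yet
                    ActorRef ref = processors.computeIfAbsent(
                            msg.id,
                            id -> getContext().actorOf(Props.create(MyActor.class, id), id));
                    getSender().tell(ref, getSelf());
                })
                .build();
    }
}
Because the receptionist processes its mailbox sequentially, concurrent requests for the same id can never create the processing actor twice.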
I would consider using akka-cluster and akka-cluster-sharding. First, this gives you throughput as well as reliability. It will also make the system manage the creation of the 'entity' actors.
But you have to change the way you talk to those actors. You create a ShardRegion actor which handles all the messages:
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.sharding.ClusterSharding;
import akka.cluster.sharding.ClusterShardingSettings;
import akka.cluster.sharding.ShardRegion;
import akka.event.Logging;
import akka.event.LoggingAdapter;

public class MyEventReceiver extends AbstractActor {

    private final LoggingAdapter log = Logging.getLogger(getContext().getSystem(), this);
    private final ActorRef shardRegion;

    public static Props props() {
        return Props.create(MyEventReceiver.class, MyEventReceiver::new);
    }

    static ShardRegion.MessageExtractor messageExtractor
            = new ShardRegion.HashCodeMessageExtractor(100) {

        // using the supplied hash code extractor to shard
        // the actors based on the hashcode of the entityId
        @Override
        public String entityId(Object message) {
            if (message instanceof EventInput) {
                return ((EventInput) message).uuid().toString();
            }
            return null;
        }

        @Override
        public Object entityMessage(Object message) {
            // the whole message is forwarded to the entity actor unchanged
            return message;
        }
    };

    public MyEventReceiver() {
        ActorSystem system = getContext().getSystem();
        ClusterShardingSettings settings =
                ClusterShardingSettings.create(system);

        // this is the setup for the money shot below
        shardRegion = ClusterSharding.get(system)
                .start("EventShardingSystem",
                        Props.create(EventActor.class),
                        settings,
                        messageExtractor);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder().match(
                EventInput.class,
                e -> {
                    log.info("Got an event with UUID {}, forwarding ... ",
                            e.uuid());
                    // the money shot
                    shardRegion.tell(e, getSender());
                }
        ).build();
    }
}
So this Actor MyEventReceiver runs on all nodes of your cluster and encapsulates the shardRegion Actor. You no longer message your EventActors directly; instead, by going through MyEventReceiver and the shardRegion Actor, you let the sharding system keep track of which node in the cluster a particular EventActor lives on. It will create one if none has been created before, or route messages to it if it has. Every EventActor must have a unique id, which is extracted from the message (so a UUID is pretty good for that, but it could be some other id, like a customerID or an orderID, as long as it's unique to the Actor instance you want to process it with).
(I'm omitting the EventActor code; it's otherwise a pretty normal Actor, depending on what you are doing with it. The 'magic' is in the code above.)
The sharding system automatically knows to create the EventActor and allocate it to a shard, based on the algorithm you've chosen (in this particular case, it's based on the hashCode of the unique ID, which is all I've ever used). Furthermore, you're guaranteed only one Actor for any given unique ID. The message is transparently routed to the correct Node and Shard wherever it is; from whichever Node and Shard it's being sent.
There's more info and sample code in the Akka site & documentation.
This is a pretty rad way to make sure that the same Entity/Actor always processes messages meant for it. The cluster and sharding take automatic care of distributing the Actors properly, failover, and the like (you would have to add akka-persistence to get passivation, rehydration, and failover if the Actor has a bunch of strict state associated with it that must be restored).
The answer by Jeffrey Chung is indeed the Akka way. The downside of such an approach is its low performance. The most performant solution is to use Java's ConcurrentHashMap.computeIfAbsent() method.
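A minimal sketch of that approach (assuming the same system field and MyActor class as in the question; the ActorRegistry wrapper is made up):
import java.util.concurrent.ConcurrentHashMap;

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class ActorRegistry {

    private final ActorSystem system;
    private final ConcurrentHashMap<String, ActorRef> actors = new ConcurrentHashMap<>();

    public ActorRegistry(ActorSystem system) {
        this.system = system;
    }

    public ActorRef getOrCreate(String uuid) {
        // computeIfAbsent runs the creation function at most once per key, so
        // concurrent callers for the same uuid all receive the same ActorRef
        return actors.computeIfAbsent(
                uuid,
                id -> system.actorOf(Props.create(MyActor.class, id), id));
    }
}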
I am building an application in Play Framework that has to do some intense file parsing. This parsing involves parsing multiple files, preferably in parallel.
A user uploads an archive that gets unziped and the files are stored on the drive.
In that archive there is a file (let's call it main.csv) that has multiple columns. One such column is the name of another file from the archive (like subPage1.csv). This column can be empty, so that not all rows from the main.csv have subpages.
Now, I start an Akka Actor to parse the main.csv file. In this actor, using @Inject, I have another ActorRef:
public class MainParser extends AbstractActor {

    @Inject
    @Named("subPageParser")
    private ActorRef subPageParser;

    public Receive createReceive() {
        ...
        if (column[3] != null) {
            subPageParser.tell(column[3], getSelf());
        }
    }
}
SubPageParser Props:
public static Props getProps(JPAApi jpaApi) {
    return new RoundRobinPool(3).props(Props.create(SubPageParser.class, jpaApi));
}
Now, my question is this: considering that a subPage may take 5 seconds to be parsed, will I be using a single instance of SubPageParser, or will there be multiple instances doing the processing in parallel?
Also, consider another scenario, where the names are stored in the DB, and I use something like this:
List<String> names = dao.getNames();
for (String name: names) {
subPageParser.tell(name, null);
}
In this case, considering that the subPageParser ActorRef is obtained using Guice @Inject as before, will I get parallel processing?
If I am doing processing in parallel, how do I control the number of Actors that are being spawned? If I have 1000 subPages, I don't want 1000 Actors. Also, their lifetime may be an issue.
NOTE:
I have an ActorsModule like this, so that I can use @Inject and not Props:
public class ActorsModule extends AbstractModule implements AkkaGuiceSupport {

    @Override
    protected void configure() {
        bindActor(MainParser.class, "mainparser");
        Function<Props, Props> props = p -> SubPageParser.getProps();
        bindActor(SubPageParser.class, "subPageParser", props);
    }
}
UPDATE: I have modified the code to use a RoundRobinPool. However, this does not work as intended: I specified 3 as the number of instances, but I get a new object for each parse request in the if.
Injecting an actor like you did will lead to one SubPageParser per MainParser. While you might send 1000 messages to it (using tell), they will be processed one by one while the others wait in the mailbox.
With regards to your design, you need to be aware that injecting an actor like that will create another top-level actor rather than create the SubPageParser as a child actor, which would allow the parent actor to control and monitor it. The playframework has support for injecting child actors, as described in their documentation: https://www.playframework.com/documentation/2.6.x/JavaAkka#Dependency-injecting-child-actors
While you could get Akka to use a certain number of child actors to distribute the load, I think you should question why you are using actors in the first place. Most problems can be solved with plain Futures. For example, you can configure a custom thread pool to run your Futures on and have them do the work at whatever parallelization level you wish: https://www.playframework.com/documentation/2.6.x/ThreadPools#Using-other-thread-pools
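A rough sketch of that Futures-based alternative (the "sub-page-parser-context" dispatcher name, the SubPageParsingService class and the parseSubPage method are all assumptions):
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.stream.Collectors;

import javax.inject.Inject;

import akka.actor.ActorSystem;

public class SubPageParsingService {

    private final Executor parserPool;

    @Inject
    public SubPageParsingService(ActorSystem system) {
        // custom thread pool defined in application.conf, e.g.:
        // sub-page-parser-context.fork-join-executor { parallelism-min = 3, parallelism-max = 3 }
        this.parserPool = system.dispatchers().lookup("sub-page-parser-context");
    }

    public CompletableFuture<Void> parseAll(List<String> subPageNames) {
        List<CompletableFuture<Void>> parses = subPageNames.stream()
                .map(name -> CompletableFuture.runAsync(() -> parseSubPage(name), parserPool))
                .collect(Collectors.toList());
        // completes when all sub-pages are parsed; parallelism is capped by the pool size
        return CompletableFuture.allOf(parses.toArray(new CompletableFuture[0]));
    }

    private void parseSubPage(String name) {
        // hypothetical parsing logic for a single sub-page file
    }
}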
I have a spout class that has several integer and string attributes, which are serialized/deserialized as expected. The class also has 1 LinkedList holding byte arrays. This LinkedList is always empty when an object is deserialized.
I've added log statements into all of the spout methods and can see the spout's 'activate' method being called, after which, the LinkedList is empty. I do not see any logs when this happens for the 'deactivate' method.
It seems odd that the spout 'activate' method is being called without the 'deactivate' method having been called. When the 'activate' method is called, there has not been any resubmission of the topology.
I also have a log statement in the spout constructor, which is not called prior to the LinkedList being emptied.
I've also verified repeatedly that there are no calls anywhere within the spout class to any method that would completely empty the LinkedList. There is 1 spot that uses the poll method, which is immediately followed by a log statement to log the new LinkedList size.
I found this reference, which points to Kryo being used for Serialization, but it may just be for serializing tuple data.
http://storm.apache.org/documentation/Serialization.html
Storm uses Kryo for serialization. Kryo is a flexible and fast
serialization library that produces small serializations.
By default, Storm can serialize primitive types, strings, byte arrays,
ArrayList, HashMap, HashSet, and the Clojure collection types. If you
want to use another type in your tuples, you'll need to register a
custom serializer.
The article makes it sound like Kryo may be just for serializing and passing tuples, but if it is used for the Spout object as well, I can't figure out how to use a LinkedList, since ArrayLists and HashMaps aren't really a good alternative for a FIFO queue. Will I have to roll my own LinkedList?
public class MySpout extends BaseRichSpout
{
    private SpoutOutputCollector _collector;
    private LinkedList<byte[]> messages = new LinkedList<byte[]>();

    public MySpout()
    {
        messages = new LinkedList<byte[]>();
    }

    public void add(byte[] message)
    {
        messages.add(message);
    }

    @Override
    public void open( Map conf, TopologyContext context,
            SpoutOutputCollector collector )
    {
        _collector = collector;

        try
        {
            Logger.getInstance().addMessage("Opening Spout");
            // ####### Open client connection here to read messages
        }
        catch (MqttException e)
        {
            e.printStackTrace();
        }
    }

    @Override
    public void close()
    {
        Logger.getInstance().addMessage("Close Method Called!!!!!!!!!!!!!!!!!");
    }

    @Override
    public void activate()
    {
        Logger.getInstance().addMessage("Activate Method Called!!!!!!!!!!!!!!!!!");
    }

    @Override
    public void nextTuple()
    {
        if (!messages.isEmpty())
        {
            System.out.println("Tuple emitted from spout");
            _collector.emit(new Values(messages.poll()));
            Logger.getInstance().addMessage("Tuple emitted from spout. Remaining in queue: " + messages.size());

            try
            {
                Thread.sleep(1);
            }
            catch (InterruptedException e)
            {
                Logger.getInstance().addMessage("Sleep thread interrupted in nextTuple(). " + Logger.convertStacktraceToString(e));
                e.printStackTrace();
            }
        }
    }

    // declareOutputFields omitted in the original excerpt
}
EDIT:
Java Serialization of referenced objects is "losing values"?
http://www.javaspecialists.eu/archive/Issue088.html
The above SO link and the Java Specialists article call out examples similar to what I am seeing, and the issue there is due to the serialization/deserialization cache. But because Storm is doing that work, I'm not sure what can be done about it.
At the end of the day, the bigger issue seems to be that Storm is serializing/deserializing the data at all in the first place.
EDIT:
Just prior to the Spout being activated, a good number of log messages come through in less than a second that read:
Executor MyTopology-1-1447093098:[X Y] not alive
After those messages, there is a log of:
Setting new assignment for topology id MyTopology-1-1447093098: #backtype.storm.daemon.common.Assignment{:master-code-dir ...
If I understand your problem correctly, you instantiate your spout on the client side, add messages via add(...), give the spout to the TopologyBuilder via setSpout(...), and afterwards submit the topology to your cluster? When the topology is started, you expect the spout's message list to contain the messages you added? If this is correct, your usage pattern is quite odd...
I guess the problem is related to Thrift, which is used to submit the topology to the cluster. Java serialization is not used, and I assume that the Thrift code does not serialize the actual object. As far as I understand the code, the topology jar is shipped as a binary, and the topology structure is shipped via Thrift. On the workers that execute the topology, new spout/bolt objects are created via new. Thus, no Java serialization/deserialization happens and your LinkedList is empty. Due to the call of new, it is of course not null either.
Btw: you are right about Kryo; it is only used to ship data (i.e., tuples).
As a workaround, you could add the LinkedList to the Map that is given to StormSubmitter.submitTopology(...). In Spout.open(...) you should get a correct copy of your messages from the Map parameter. However, as I mentioned already, your usage pattern is quite odd; you might want to rethink it. In general, a spout should be implemented in such a way that it can fetch the data in nextTuple() from an external data source.
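A rough sketch of that workaround (the messages are Base64-encoded as strings here so they survive the JSON-serialized topology configuration; the config key and the loadInitialMessages helper are made up):
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class Submit
{
    public static void main(String[] args) throws Exception
    {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("my-spout", new MySpout());
        // ... bolts ...

        // ship the initial messages inside the topology configuration
        List<String> encoded = new ArrayList<>();
        for (byte[] msg : loadInitialMessages())
        {
            encoded.add(Base64.getEncoder().encodeToString(msg));
        }
        Config conf = new Config();
        conf.put("my.spout.initial.messages", encoded);

        StormSubmitter.submitTopology("MyTopology", conf, builder.createTopology());
    }

    private static List<byte[]> loadInitialMessages()
    {
        // hypothetical source of the messages you currently add() on the client side
        return new ArrayList<>();
    }
}
and then, inside MySpout.open(...), read them back:
List<String> encoded = (List<String>) conf.get("my.spout.initial.messages");
if (encoded != null)
{
    for (String s : encoded)
    {
        messages.add(Base64.getDecoder().decode(s));
    }
}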
The server I'm developing has different tasks to perform based on messages received from clients; some tasks are very simple and require little time to perform, but others may take a while.
Adding an ExecutionHandler to the pipeline seems like a good solution for the complicated tasks but I would like to avoid threading simple tasks.
My pipeline looks like this:
pipeline.addLast("decoder", new MessageDecoder());
pipeline.addLast("encoder", new MessageEncoder());
pipeline.addLast("executor", this.executionHandler);
pipeline.addLast("handler", new ServerHandler(this.networkingListener));
Where MessageDecoder returns a Message object (on decode) which defines the requested task.
Is there a way to skip the execution handler based on the decoded message?
The question can be generalized to: is there a way to condition whether or not the next handler will be used?
Thanks.
Instead of using ExecutionHandler as is, you can extend it and override its handleUpstream() method to intercept the upstream events and call ctx.sendUpstream(e) for the MessageEvents whose message meets your criteria. All other events can be handled by the ExecutionHandler via super.handleUpstream(ctx, e). That is:
public class MyExecutionHandler extends ExecutionHandler {

    @Override
    public void handleUpstream(ChannelHandlerContext ctx, ChannelEvent evt) throws Exception {
        if (evt instanceof MessageEvent) {
            Object msg = ((MessageEvent) evt).getMessage();
            if (msg instanceof ExecutionSkippable) {
                // bypass the thread pool and hand the event straight to the next handler
                ctx.sendUpstream(evt);
                return;
            }
        }

        super.handleUpstream(ctx, evt);
    }

    // constructor passing the Executor, etc. omitted
    ...
}
Alternatively, you can remove the execution handler from the pipeline (or add it on demand) inside your MessageDecoder before you send the message upstream. You can also check the message inside your executionHandler and just pass it upstream.
In case you cannot modify these two classes, you can create another handler which removes the executionHandler based on the message type.
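A rough sketch of that last idea (Netty 3 API; the handler names "executor" and "handler" follow the pipeline from the question, while the Message.isSimple() flag and the toggle handler itself are assumptions):
import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.MessageEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
import org.jboss.netty.handler.execution.ExecutionHandler;

public class ExecutionToggleHandler extends SimpleChannelUpstreamHandler {

    private final ExecutionHandler executionHandler;

    public ExecutionToggleHandler(ExecutionHandler executionHandler) {
        this.executionHandler = executionHandler;
    }

    @Override
    public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) throws Exception {
        Message msg = (Message) e.getMessage();
        ChannelPipeline pipeline = ctx.getPipeline();
        if (msg.isSimple()) {
            // simple tasks stay on the I/O thread: drop the executor if it is present
            if (pipeline.get("executor") != null) {
                pipeline.remove("executor");
            }
        } else if (pipeline.get("executor") == null) {
            // complicated tasks go back through the thread pool
            pipeline.addBefore("handler", "executor", executionHandler);
        }
        super.messageReceived(ctx, e);
    }
}
It would be added to the pipeline right after the decoder, e.g. pipeline.addLast("executionToggle", new ExecutionToggleHandler(this.executionHandler)); before the "executor" entry, so the executor is skipped or re-inserted per message.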