I have a bolt whose input file keeps being updated, but I can't pick up the updated content because I read the file in the prepare() method. I want to load the updated file without stopping or killing the topology. Is there anything like a watch service in Storm for this, or is there a different approach?
One approach to your problem is defining a Spout that would periodically check if the file changed. Once it does, it would send a tuple notifying your bolt about a change. The bolt would in turn reload the file. Here are a few hints about implementation:
The topology will contain the new monitoring spout. Your bolt will subscribe to its stream and to any other stream it needs (bolts can consume multiple streams):
topologyBuilder.setSpout("file_checking_spout", new FileCheckingSpout(myMonitoredFile));
topologyBuilder.setBolt("my_bolt", new MyBolt())
.shuffleGrouping("file_checking_spout")
.shuffleGrouping("whatever other grouping you need");
Spout will do the monitoring. If there is only one file to monitor, you can just emit empty tuples as notification:
public class FileCheckingSpout extends BaseRichSpout {
    @Override
    public void nextTuple() {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and skip this check
            return;
        }
        if (fileChanged()) { // check e.g. file modified timestamp
            collector.emit(new Values());
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields());
    }
    // ...
}
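The fileChanged() check is left open above. A minimal sketch of one way to do it, assuming a comparison of last-modified timestamps is good enough for your file, could look like this. FileChangeDetector is a hypothetical helper name, not part of Storm, and it uses only java.nio:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical helper (not part of Storm): remembers the last-modified
// timestamp seen so far and reports whether it has changed since.
class FileChangeDetector {
    private final Path file;
    private FileTime lastSeen;

    FileChangeDetector(Path file) {
        this.file = file;
        this.lastSeen = readTimestamp(); // baseline taken at construction
    }

    // Returns true once per observed modification, then false again
    // until the timestamp changes the next time.
    boolean fileChanged() {
        FileTime current = readTimestamp();
        if (current.equals(lastSeen)) {
            return false;
        }
        lastSeen = current;
        return true;
    }

    private FileTime readTimestamp() {
        try {
            return Files.getLastModifiedTime(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The spout could create one of these in open() and delegate to it from nextTuple(). Keep in mind that timestamp granularity depends on the file system, so two writes in very quick succession may be reported as a single change.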
Your bolt will now have to accept the notifications about file reload. It can distinguish notification tuples e.g. using tuple.getSourceComponent():
class MyBolt implements IRichBolt {
    @Override
    public void execute(Tuple tuple) {
        if ("file_checking_spout".equals(tuple.getSourceComponent())) {
            reloadFile();
            return;
        }
        // normal processing
    }
    //...
}
You could also simply check whether the file changed in your bolt's execute() method. The approach described above is more "the Storm way", as it separates concerns and reloading does not depend on any other stream.
PS: Naturally, this will work as long as the file is accessible from both spout and bolt, i.e., if you are running in a cluster, it should be on a shared file system.
I'm trying to implement a reactive, in-memory repository. How should this be accomplished?
This is a blocking version of what I'm trying to do
@Repository
@AllArgsConstructor
public class InMemEventRepository implements EventRepository {
    private final List<Event> events;

    @Override
    public void save(final Mono<Event> event) {
        events.add(event.block());
        // event.subscribe(events::add); <- does not do anything
    }

    @Override
    public Flux<Event> findAll() {
        return Flux.fromIterable(events);
    }
}
I tried using event.subscribe(events::add); but the event was not added to the list (perhaps I'm missing something there?)
Perhaps events should be of type Flux<Event> and there is some way to add the Mono<Event> to Flux<Event>?
I suggest using a Sink for this purpose.
public static class InMemEventRepository {
private final Scheduler serializerScheduler = Schedulers.single();
private final Sinks.Many<Event> events = Sinks.many().replay().all();
public void save(Mono<Event> event) {
event
.publishOn(serializerScheduler) // If event will be published on multiple threads you need to serialize them
.subscribe(x -> events.emitNext(x, EmitFailureHandler.FAIL_FAST));
}
public Flux<Event> findAll() {
return events.asFlux();
}
}
This is with reactor 3.4. With older versions you could have used a Processor but they are now deprecated. Sinks in general are easier to use but they do not serialize emission from multiple threads. That's why I use the Scheduler.
See also this answer for an alternative approach to serialize emission from the Sink
If you go for Flux.fromIterable, you'll only get a subscription for the previous events, but you'll lose the future ones.
I did a PoC some time in the past trying to get a similar effect, you can check it in https://github.com/AlbertoSH/KeepMeUpdated
The main idea is to have a central point in which the events happen and to which the repository is subscribed. Whenever you subscribe to findAll, you'll get an infinite stream of List<Item>. Any saved item will trigger a new event, and anyone subscribed to findAll will get it.
Beware that this repo is using RxJava, so some port to reactor might be needed
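To make the difference between "replay what is already stored" and "replay, then keep delivering" concrete, here is a minimal plain-Java sketch of the behaviour both answers are after. It deliberately uses neither Reactor nor RxJava. ReplayingEventLog is a hypothetical name, and callbacks stand in for reactive subscribers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical plain-Java stand-in for a replaying sink: new subscribers
// first receive every stored event, then every event saved afterwards.
class ReplayingEventLog<T> {
    private final List<T> events = new ArrayList<>();
    private final List<Consumer<T>> subscribers = new ArrayList<>();

    // synchronized serializes concurrent save() calls, the same job the
    // single-threaded Scheduler does in the Reactor version.
    synchronized void save(T event) {
        events.add(event);
        for (Consumer<T> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }

    synchronized void subscribe(Consumer<T> subscriber) {
        for (T event : events) {
            subscriber.accept(event); // replay the history first
        }
        subscribers.add(subscriber); // then deliver future events
    }
}
```

A subscriber that joins after "a" was saved and before "b" is saved sees both, in that order. A view built with Flux.fromIterable over the list would complete after the stored events and never see "b".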
I have a use case where I read newline-delimited JSON elements stored in Google Cloud Storage and process each JSON element. While processing each element, I have to call an external API for de-duplication, to check whether that JSON element was discovered previously. I'm doing a ParDo with a DoFn on each JSON element.
I haven't seen any online tutorial saying how to call an external API endpoint from apache beam DoFn Dataflow.
I'm using the Java SDK of Beam. Some of the tutorials I studied explained this using startBundle and finishBundle, but I'm not clear on how to use them.
If you need to check for duplicates in external storage for every JSON record, then you can still use a DoFn for that. There are several annotations, like @Setup, @StartBundle, @FinishBundle, etc., that can be used to annotate methods in your DoFn.
For example, if you need to instantiate a client object to send requests to your external database, then you might want to do this in a @Setup method (like a POJO constructor) and then use this client object in your @ProcessElement method.
Let's consider a simple example:
static class MyDoFn extends DoFn<Record, Record> {
    static transient MyClient client;

    @Setup
    public void setup() {
        client = new MyClient("host");
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // process your records
        Record r = c.element();
        // check record ID for duplicates
        if (!client.isRecordExist(r.id())) {
            c.output(r);
        }
    }

    @Teardown
    public void teardown() {
        if (client != null) {
            client.close();
            client = null;
        }
    }
}
Also, to avoid making a remote call for every record, you can collect the records of a bundle in an internal buffer (Beam splits input data into bundles) and check for duplicates in batch mode (if your client supports this). For this purpose, you can use @StartBundle and @FinishBundle annotated methods, which are called right before and right after a bundle is processed, respectively.
For more complicated examples, I'd recommend taking a look at the Sink implementations in different Beam IOs, like KinesisIO, for instance.
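The buffering idea can be sketched independently of Beam. This is only an illustration of the control flow, assuming your client offers some batch lookup. BatchingDeduplicator, the findExisting function and the output callback are hypothetical stand-ins. In a real DoFn the three methods would be the ones annotated with @StartBundle, @ProcessElement and @FinishBundle:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of the buffering logic only, with no Beam dependency.
class BatchingDeduplicator {
    private final int maxBatchSize;
    // Assumed batch lookup: given ids, returns the subset already known remotely.
    private final Function<Collection<String>, Set<String>> findExisting;
    private final Consumer<String> output;
    private final List<String> buffer = new ArrayList<>();

    BatchingDeduplicator(int maxBatchSize,
                         Function<Collection<String>, Set<String>> findExisting,
                         Consumer<String> output) {
        this.maxBatchSize = maxBatchSize;
        this.findExisting = findExisting;
        this.output = output;
    }

    // Would be the @StartBundle method: start each bundle with an empty buffer.
    void startBundle() {
        buffer.clear();
    }

    // Would be the @ProcessElement method: buffer, and flush when the batch is full.
    void processElement(String id) {
        buffer.add(id);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Would be the @FinishBundle method: flush whatever is left in the buffer.
    void finishBundle() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // One remote call for the whole batch instead of one per record.
        Set<String> seen = new HashSet<>(findExisting.apply(buffer));
        for (String id : buffer) {
            if (seen.add(id)) { // also drops duplicates inside the same batch
                output.accept(id);
            }
        }
        buffer.clear();
    }
}
```

One Beam-specific detail this sketch glosses over: output emitted from a @FinishBundle method goes through FinishBundleContext, which needs an explicit timestamp and window, so the buffer would have to remember those alongside each record.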
There is an example of calling external system in batches using a stateful DoFn in the following blog post: https://beam.apache.org/blog/2017/08/28/timely-processing.html, might be helpful.
Is it possible to use interactive query (InteractiveQueryService) in Spring Cloud Stream within the class annotated with @EnableBinding or within the method annotated with @StreamListener? I tried instantiating ReadOnlyKeyValueStore within the provided KStreamMusicSampleApplication class and its process method, but it's always null.
My @StreamListener method is listening to a bunch of KTables and KStreams, and during the processing topology, e.g. when filtering, I have to check whether the key from a KStream already exists in a particular KTable.
I tried to figure out how to scan an incoming KTable to check whether a key already exists, but had no luck. Then I came across InteractiveQueryService, whose get() method could be used to check whether a key exists inside a state store that a KTable is materialized as. The problem is that I can't access it from within the processing topology (@EnableBinding or @StreamListener). It can only be accessed from outside these annotations, e.g. from a RestController.
Is there a way to scan an incoming KTable to check for the existence of a key or value? If not, can we access InteractiveQueryService within the processing topology?
InteractiveQueryService in Spring Cloud Stream is not available for use within the actual topology in your StreamListener. As you mentioned, it is supposed to be used outside of your main topology. However, with the use case you described, you can still use the state store from your main flow. For example, if you have an incoming KStream and a KTable which is materialized as a state store, then you can call process on the KStream and access the state store that way. Here is some rough code to achieve that. You need to adapt it to your specific use case, but here is the idea.
input.process(() -> new Processor<Object, Product>() {
    // kept as a field: a local variable assigned from init() would not compile
    private ReadOnlyKeyValueStore<Object, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext processorContext) {
        store = (ReadOnlyKeyValueStore<Object, String>) processorContext.getStateStore("my-store");
    }

    @Override
    public void process(Object key, Product value) {
        // find the key
        store.get(key);
    }

    @Override
    public void close() {
        // nothing to close here: Kafka Streams manages the lifecycle of its state stores
    }
}, "my-store");
I am building an application in Play Framework that has to do some intense file parsing. This parsing involves parsing multiple files, preferably in parallel.
A user uploads an archive that gets unzipped and the files are stored on the drive.
In that archive there is a file (let's call it main.csv) that has multiple columns. One such column is the name of another file from the archive (like subPage1.csv). This column can be empty, so that not all rows from the main.csv have subpages.
Now, I start an Akka Actor to parse the main.csv file. In this actor, using @Inject, I have another ActorRef
public class MainParser extends AbstractActor {
    @Inject
    @Named("subPageParser")
    private ActorRef subPageParser;

    public Receive createReceive() {
        ...
        if (column[3] != null) {
            subPageParser.tell(column[3], getSelf());
        }
    }
}
SubPageParser Props:
public static Props getProps(JPAApi jpaApi) {
return new RoundRobinPool(3).props(Props.create((Class<?>) SubPageParser.class, jpaApi));
}
Now, my question is this: considering that a subPage may take 5 seconds to be parsed, will I be using a single instance of SubPageParser, or will there be multiple instances that do the processing in parallel?
Also, consider another scenario, where the names are stored in the DB, and I use something like this:
List<String> names = dao.getNames();
for (String name: names) {
subPageParser.tell(name, null);
}
In this case, considering that the subPageParser ActorRef is obtained using Guice @Inject as before, will I do parallel processing?
If I am doing processing in parallel, how do I control the number of Actors that are being spawned? If I have 1000 subPages, I don't want 1000 Actors. Also, their lifetime may be an issue.
NOTE:
I have an ActorsModule like this, so that I can use @Inject and not Props:
public class ActorsModule extends AbstractModule implements AkkaGuiceSupport {
    @Override
    protected void configure() {
        bindActor(MainParser.class, "mainparser");
        Function<Props, Props> props = p -> SubPageParser.getProps();
        bindActor(SubPageParser.class, "subPageParser", props);
    }
}
UPDATE: I have modified it to use a RoundRobinPool. However, this does not work as intended. I specified 3 as the number of instances, but I get a new object for each parse request in the if.
Injecting an actor like you did will lead to one SubPageParser per MainParser. While you might send 1000 messages to it (using tell), they will get processed one by one while the others are waiting in the mailbox to be processed.
With regard to your design, you need to be aware that injecting an actor like that will create another top-level actor rather than create the SubPageParser as a child actor, which would allow the parent actor to control and monitor it. The Play Framework has support for injecting child actors, as described in its documentation: https://www.playframework.com/documentation/2.6.x/JavaAkka#Dependency-injecting-child-actors
While you could get Akka to use a certain number of child actors to distribute the load, I think you should question why you have used actors in the first place. Most problems can be solved with simple Futures. For example, you can configure a custom thread pool to run your Futures with and have them do the work at whatever level of parallelism you wish: https://www.playframework.com/documentation/2.6.x/ThreadPools#Using-other-thread-pools
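As a rough illustration of the Futures approach, here is a minimal sketch using only java.util.concurrent. ParallelSubPageParsing and parseAll are hypothetical names, and the parse function stands in for whatever parses one sub-page file. The pool size is what caps the parallelism, so 1000 sub-pages still run on at most three threads when parallelism is 3:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper: parses many files with a bounded number of threads.
class ParallelSubPageParsing {
    // Runs parse on every name using at most `parallelism` threads and
    // returns the results in the same order as the input names.
    static <R> List<R> parseAll(List<String> names, int parallelism, Function<String, R> parse) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (String name : names) {
                // Each task is queued on the pool; only `parallelism` run at once.
                futures.add(CompletableFuture.supplyAsync(() -> parse.apply(name), pool));
            }
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> future : futures) {
                results.add(future.join()); // blocks until that file is parsed
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a Play application you would more likely inject a custom execution context configured as described in the linked documentation rather than create a pool per call, and return the combined future instead of blocking with join(). The sketch blocks only to stay self-contained.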
I have a spout class that has several integer and string attributes, which are serialized/deserialized as expected. The class also has 1 LinkedList holding byte arrays. This LinkedList is always empty when an object is deserialized.
I've added log statements into all of the spout methods and can see the spout's 'activate' method being called, after which, the LinkedList is empty. I do not see any logs when this happens for the 'deactivate' method.
It seems odd that the spout 'activate' method is being called without the 'deactivate' method having been called. When the 'activate' method is called, there has not been any resubmission of the topology.
I also have a log statement in the spout constructor, which is not called prior to the LinkedList being emptied.
I've also verified repeatedly that there are no calls anywhere within the spout class to any method that would completely empty the LinkedList. There is 1 spot that uses the poll method, which is immediately followed by a log statement to log the new LinkedList size.
I found this reference, which points to Kryo being used for Serialization, but it may just be for serializing tuple data.
http://storm.apache.org/documentation/Serialization.html
Storm uses Kryo for serialization. Kryo is a flexible and fast
serialization library that produces small serializations.
By default, Storm can serialize primitive types, strings, byte arrays,
ArrayList, HashMap, HashSet, and the Clojure collection types. If you
want to use another type in your tuples, you'll need to register a
custom serializer.
The article makes it sound like Kryo may be just for serializing and passing tuples, but if it is for the Spout object as well, I can't figure out how to then use a LinkedList as ArrayLists and HashMaps aren't really a good alternative for a FIFO queue. Will I have to roll my own LinkedList?
public class MySpout extends BaseRichSpout
{
private SpoutOutputCollector _collector;
private LinkedList<byte[]> messages = new LinkedList<byte[]>();
public MySpout()
{
messages = new LinkedList<byte[]>();
}
public void add(byte[] message)
{
messages.add(message);
}
@Override
public void open( Map conf, TopologyContext context,
SpoutOutputCollector collector )
{
_collector = collector;
try
{
Logger.getInstance().addMessage("Opening Spout");
// ####### Open client connection here to read messages
}
catch (MqttException e)
{
e.printStackTrace();
}
}
@Override
public void close()
{
Logger.getInstance().addMessage("Close Method Called!!!!!!!!!!!!!!!!!");
}
@Override
public void activate()
{
Logger.getInstance().addMessage("Activate Method Called!!!!!!!!!!!!!!!!!");
}
@Override
public void nextTuple()
{
if (!messages.isEmpty())
{
System.out.println("Tuple emitted from spout");
_collector.emit(new Values(messages.poll()));
Logger.getInstance().addMessage("Tuple emitted from spout. Remaining in queue: " + messages.size());
try
{
Thread.sleep(1);
}
catch (InterruptedException e)
{
// TODO Auto-generated catch block
Logger.getInstance().addMessage("Sleep thread interrupted in nextTuple(). " + Logger.convertStacktraceToString(e));
e.printStackTrace();
}
}
}
}
EDIT:
Java Serialization of referenced objects is "losing values"?
http://www.javaspecialists.eu/archive/Issue088.html
The above SO link and the Java Specialists article call out specific examples similar to what I am seeing, where the issue is due to the serialization/deserialization cache. But because Storm is doing that work, I'm not sure what can be done about the issue.
At the end of the day, it also seems like the bigger issue is that Storm is suddenly serializing/deserializing the data in the first place.
EDIT:
Just prior to the Spout being activated, a good number of log messages come through in less than a second that read:
Executor MyTopology-1-1447093098:[X Y] not alive
After those messages, there is a log of:
Setting new assignment for topology id MyTopology-1-1447093098: #backtype.storm.daemon.common.Assignment{:master-code-dir ...
If I understand your problem correctly, you instantiate your spout on the client side, add messages via add(), give the spout to the TopologyBuilder via setSpout(), and then submit the topology to your cluster? And when the topology is started, you expect the spout's message list to contain the messages you added? If this is correct, your usage pattern is quite odd...
I guess the problem is related to Thrift, which is used to submit the topology to the cluster. Java serialization is not used and I assume that the Thrift code does not serialize the actual object. As far as I understand the code, the topology jar is shipped as a binary, and the topology structure is shipped via Thrift. On the workers that execute the topology, new spout/bolt objects are created via new. Thus, no Java serialization/deserialization happens and your LinkedList is empty. Due to the call of new, it is of course not null either.
Btw: you are right about Kryo, it is only used to ship data (ie, tuples).
As a workaround, you could add the LinkedList to the Map that is given to StormSubmitter.submitTopology(...). In Spout.open(...) you should get a correct copy of your messages from the Map parameter. However, as I mentioned already, your usage pattern is quite odd, and you might want to rethink it. In general, a spout should be implemented in such a way that it can fetch the data in nextTuple() from an external data source.
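One common shape for "fetch in nextTuple() from an external source" is a small thread-safe buffer that the messaging client fills and the spout drains. The sketch below is plain Java and only shows that buffer. MessageBuffer, onMessage and pollNext are hypothetical names, and how your client delivers messages (for example an MQTT callback) is an assumption:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical buffer between a client callback thread and nextTuple().
class MessageBuffer {
    // Thread-safe FIFO queue: the client thread adds, the spout thread polls.
    private final Queue<byte[]> pending = new ConcurrentLinkedQueue<>();

    // Called by the messaging client's callback thread for each arrival.
    void onMessage(byte[] payload) {
        pending.add(payload);
    }

    // Called from nextTuple(); returns null instead of blocking when empty.
    byte[] pollNext() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }
}
```

If the buffer and the client connection are both created inside open(), they come into existence on the worker that actually runs the spout, so nothing has to survive the trip from the submitting client.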