Apache Storm bolt cannot receive any tuples emitted by another bolt - java

I'm new to Storm. I want to use one bolt named 'tileClean' to emit a single stream, and five other bolts to receive that stream at the same time.
like this:
flow image
As you can see, the "one, two, three, four, five" bolts should all receive the same data at the same time. In practice, however, none of these five bolts receives any data.
Here is my code:
@Override
public void execute(TupleWindow inputWindow) {
    logger.debug("clean");
    List<Tuple> tuples = inputWindow.get();
    //logger.debug("clean phrase. tuple size is : {}", tuples.size());
    for (Tuple input : tuples) {
        // some other code..
        //this._collector.emit(input, new Values(nal));
        this._collector.emit("stream_id_one", input, new Values(nal));
        this._collector.emit("stream_id_two", input, new Values(nal));
        this._collector.emit("stream_id_three", input, new Values(nal));
        this._collector.emit("stream_id_four", input, new Values(nal));
        this._collector.emit("stream_id_five", input, new Values(nal));
        this._collector.ack(input);
    }
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields(BoltConstant.EMIT_LOGOBJ));
    declarer.declareStream("stream_id_one", new Fields(BoltConstant.EMIT_LOGOBJ));
    declarer.declareStream("stream_id_two", new Fields(BoltConstant.EMIT_LOGOBJ));
    declarer.declareStream("stream_id_three", new Fields(BoltConstant.EMIT_LOGOBJ));
    declarer.declareStream("stream_id_four", new Fields(BoltConstant.EMIT_LOGOBJ));
    declarer.declareStream("stream_id_five", new Fields(BoltConstant.EMIT_LOGOBJ));
}
and the topology setup is:
builder.setBolt("tileClean", cleanBolt, 1).shuffleGrouping("assembly");
builder.setBolt("OneBolt", OneBolt, 1).shuffleGrouping("tileClean", "stream_id_one");
builder.setBolt("TwoBolt", TwoBolt, 1).shuffleGrouping("tileClean", "stream_id_two");
builder.setBolt("ThreeBolt", ThreeBolt, 1).shuffleGrouping("tileClean", "stream_id_three");
builder.setBolt("FourBolt", FourBolt, 1).shuffleGrouping("tileClean", "stream_id_four");
builder.setBolt("FiveBolt", FiveBolt, 1).shuffleGrouping("tileClean", "stream_id_five");
tileClean receives the tuples emitted from "assembly", but the other bolts do not receive anything.
Is there anything incorrect in my code?

Since you have omitted the code between the for loop and the first collector.emit statement, one possibility is that an exception is thrown in that omitted code and the tuples never reach the emit calls. Wrap that code in a try-catch block, or log just before the collector.emit statement, to confirm that execution actually reaches that point.
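For example, a minimal sketch of that check, reusing your execute loop:

for (Tuple input : tuples) {
    try {
        // ... the omitted processing that produces `nal` ...
        logger.debug("about to emit nal: {}", nal); // confirms this point is reached
        this._collector.emit("stream_id_one", input, new Values(nal));
        // ... emits for the other four streams ...
        this._collector.ack(input);
    } catch (Exception e) {
        logger.error("tileClean failed for tuple {}", input, e);
        this._collector.fail(input); // fail the tuple instead of losing it silently
    }
}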
You can also check this in the Storm UI, which shows per-topology metrics for the tuples transferred between spouts and bolts and reports any errors that occurred during task execution.
The other possibility, if you are running a multi-node cluster and your tasks are spread across nodes (i.e., you have assigned more than one worker in the topology config), is that the machines cannot communicate with each other over the ports configured in their storm.yaml files.

Related

setting variables in apache flink

I'm asking this question because I'm having trouble setting variables in Apache Flink. I would like to use one stream to fetch data with which I will initialize the variables I need for a second stream. The problem is that the streams execute in parallel, which results in a missing value when the second stream is initialized. Sample code:
KafkaSource<Object> mainSource1 = KafkaSource.<Object>builder()
        .setBootstrapServers(...)
        .setTopicPattern(Pattern.compile(...))
        .setGroupId(...)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setDeserializer(new ObjectDeserializer())
        .build();
DataStream<Market> mainStream1 = env.fromSource(mainSource1, WatermarkStrategy.forMonotonousTimestamps(), "mainSource1");
// fetching data from the stream and setting variables
Map<TopicPartition, Long> endOffset = new HashMap<>();
endOffset.put(new TopicPartition("topicName", 0), offsetFromMainStream1);
KafkaSource<Object> mainSource2 = KafkaSource.<Object>builder()
        .setBootstrapServers(...)
        .setTopicPattern(Pattern.compile(...))
        .setGroupId(...)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setBounded(OffsetsInitializer.offsets(endOffset))
        .setDeserializer(new ObjectDeserializer())
        .build();
DataStream<Market> mainStream2 = env.fromSource(mainSource2, WatermarkStrategy.forMonotonousTimestamps(), "mainSource2");
// further stream operations
I would like to consume the first stream, fetch the data from it and store it locally, and then use that data in operations on the second stream.
You want to use one stream's data to control another stream's behavior. The best way to do this is the broadcast state pattern.
This involves creating a BroadcastStream from mainStream1 and then connecting mainStream2 to that broadcast stream; mainStream2 can then access the data from mainStream1.
Here is a high-level example based on your code. I am assuming that the key is a String.
// Describe the broadcast state that will hold the data from mainStream1
MapStateDescriptor<String, Market> stateDescriptor = new MapStateDescriptor<>(
        "RulesBroadcastState",
        BasicTypeInfo.STRING_TYPE_INFO,
        TypeInformation.of(new TypeHint<Market>() {}));

// Broadcast mainStream1 so every parallel instance can see its data
BroadcastStream<Market> mainStream1BroadcastStream =
        mainStream1.broadcast(stateDescriptor);

// Key mainStream2 (here by its id), connect it to the broadcast stream and process both
DataStream<Market> yourOutput = mainStream2
        .keyBy(market -> market.getId()) // key by Id
        .connect(mainStream1BroadcastStream)
        .process(
                new KeyedBroadcastProcessFunction<String, Market, Market, Market>() {
                    // You can access mainStream1 data (via the broadcast state)
                    // and mainStream2 elements here.
                });
This concept is explained in detail in the Flink documentation on the broadcast state pattern; the code above is a modified version of the example shown there:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/broadcast_state/#the-broadcast-state-pattern
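For completeness, a rough sketch of the process function itself could look like this (untested; it assumes Market exposes a String getId() used as the key, and that stateDescriptor is the descriptor declared above):

new KeyedBroadcastProcessFunction<String, Market, Market, Market>() {

    @Override
    public void processBroadcastElement(Market value, Context ctx, Collector<Market> out) throws Exception {
        // Each element from mainStream1 is stored in the broadcast state,
        // so every parallel instance can look it up later.
        ctx.getBroadcastState(stateDescriptor).put(value.getId(), value);
    }

    @Override
    public void processElement(Market value, ReadOnlyContext ctx, Collector<Market> out) throws Exception {
        // Elements from mainStream2 get read-only access to the broadcast state.
        Market fromStream1 = ctx.getBroadcastState(stateDescriptor).get(value.getId());
        if (fromStream1 != null) {
            // combine the two records as needed, then emit
            out.collect(value);
        }
    }
}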

Dataflow Splittable ReadFn not using multiple workers

I have a particularly simple Dataflow pipeline where I want to read a file and write its parsed records to Avro. This works in most cases, except when the source file is particularly large (20+ GB), which causes an OOM even on machines with very large memory. I am fairly sure this happens because the non-splittable source is read in its entirety by Beam, so I implemented a splittable DoFn<FileIO.ReadableFile, GenericRecord>.
This works functionally, in that the pipeline now succeeds, which seems to validate my assumption that the single large batch from a non-splittable file is the cause. However, the work does not seem to spread across multiple workers. I tried the following:
Disabled autoscaling (autoscalingAlgorithm=NONE) and set numWorkers to 10. This had the same throughput as numWorkers=1.
Left autoscaling on with a high maxWorkers. It briefly went up to 2 workers, then came back down to 1.
Added a shuffle (Reshuffle.viaRandomKey) after the DoFn, but before the Avro write (wired roughly as in the sketch below).
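The overall wiring looks something like this (simplified; `pipeline`, `inputPattern`, `schema` and `outputPrefix` are illustrative placeholders, not the real names):

// Simplified pipeline wiring (illustrative names only)
PCollection<FileIO.ReadableFile> files = pipeline
        .apply(FileIO.match().filepattern(inputPattern))
        .apply(FileIO.readMatches());

PCollection<GenericRecord> records = files
        .apply("ParseFile", ParDo.of(new SplittableReadFn()))
        .apply(Reshuffle.viaRandomKey()); // the shuffle mentioned above

records.apply(AvroIO.writeGenericRecords(schema).to(outputPrefix).withSuffix(".avro"));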
Any ideas? The exact code is difficult to share because of company policy, but overall it is pretty simple. I implemented the following:
public class SplittableReadFn extends DoFn<FileIO.ReadableFile, GenericRecord> {
    // ...

    @ProcessElement
    public void process(final ProcessContext c, final OffsetRangeTracker tracker) {
        final FileIO.ReadableFile file = c.element();
        // Followed by something like
        ReadableByteStream in = file.open();
        in.seek(tracker.from());
        Parser parser = new Parser(in);
        while (parser.next()) {
            if (parser.getOffset() > tracker.to()) {
                break;
            }
            tracker.tryClaim(parser.getOffset());
            c.output(parser.item());
        }
        tracker.markDone();
    }

    @GetInitialRestriction
    public OffsetRange getInitialRestriction(final FileIO.ReadableFile file) {
        return new OffsetRange(0, getSize(file) - 1);
    }

    @SplitRestriction
    public void splitRestriction(final FileIO.ReadableFile file, final OffsetRange restriction, final DoFn.OutputReceiver<OffsetRange> receiver) {
        // chunkRange for test purposes just breaks the restriction into at most 500 MB chunks
        for (final OffsetRange chunk : chunkRange(restriction)) {
            receiver.output(chunk);
        }
    }
}
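For reference, chunkRange is just something along these lines (an illustrative version matching the description above, not the exact code):

// Illustrative only: splits the restriction into sub-ranges of at most 500 MB.
private static List<OffsetRange> chunkRange(final OffsetRange restriction) {
    final long maxChunkBytes = 500L * 1024 * 1024;
    final List<OffsetRange> chunks = new ArrayList<>();
    for (long start = restriction.getFrom(); start < restriction.getTo(); start += maxChunkBytes) {
        chunks.add(new OffsetRange(start, Math.min(start + maxChunkBytes, restriction.getTo())));
    }
    return chunks;
}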

process and manage Tabular data stream in java programming

I want to know how to process and manage a tabular data stream in Java.
Consider a table of records with the schema (name, age, zip-code, disease), whose records are read and processed tuple by tuple, over time, as a stream. I want to manage these stream tuples so that each processed tuple is saved with the schema (age, zip-code, disease), i.e. with the name attribute removed.
For example:
read Tuple 1 (han, 25, 12548, flue) at time t1
publish Tuple 1* (25, 12548, flue)
read Tuple 2 (alex, 27, 12544, cancer) at time t2
output Tuple 2* (27, 12544, cancer)
... and so on. Can anyone help me?
Here are some suggestions for a framework you can base your final application on.
First, make classes to represent your input and output records. We'll call them InRecord and OutRecord for the sake of discussion, but you can give them whatever names make sense for you. Give them private fields to hold the necessary data and public getter/setter methods to access the data.
Second, define an interface for an input supplier; let's call it InputSupplier for this discussion. It will need setup (open()) and tear-down (close()) methods, to be called at the start and end of processing, and a getNext() method that returns the next available InRecord. You'll need to decide how it indicates end-of-input: either define that getNext() returns null when there are no more input records, or provide a hasNext() method that returns true or false to indicate whether another input record is available.
Third, define an interface for an output consumer (OutputConsumer). You'll want to have open() and close() methods, as well as an accept(OutRecord) method.
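For instance, the record classes and the two interfaces might be sketched like this (the field names and types are just suggestions):

// Input record with all four attributes.
public class InRecord {
    private String name;
    private int age;
    private String zipCode;
    private String disease;
    // constructor, getters and setters ...
}

// Output record without the name attribute.
public class OutRecord {
    private final int age;
    private final String zipCode;
    private final String disease;

    public OutRecord(int age, String zipCode, String disease) {
        this.age = age;
        this.zipCode = zipCode;
        this.disease = disease;
    }
    // getters ...
}

public interface InputSupplier {
    void open();
    InRecord getNext(); // returns null when there is no more input
    void close();
}

public interface OutputConsumer {
    void open();
    void accept(OutRecord record);
    void close();
}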
With this infrastructure in place, you can write your processing method:
public void process(InputSupplier in, OutputConsumer out) {
    in.open();
    out.open();
    InRecord inrec;
    while ((inrec = in.getNext()) != null) {
        OutRecord outrec = new OutRecord(inrec.getAge(), inrec.getZipCode(), inrec.getDisease());
        out.accept(outrec);
    }
    out.close();
    in.close();
}
Finally, write some "dummy" I/O classes: one that implements InputSupplier and another that implements OutputConsumer. For test purposes, your input supplier can just return a few hand-created records, and your output consumer can just print the output records it receives to the console.
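A minimal version of those test classes might look like this (it assumes the InRecord constructor and getters sketched earlier; the sample values are arbitrary):

public class TestInput implements InputSupplier {
    private final List<InRecord> records = new ArrayList<>();
    private int next = 0;

    @Override
    public void open() {
        // hand-created sample records
        records.add(new InRecord("han", 25, "12548", "flue"));
        records.add(new InRecord("alex", 27, "12544", "cancer"));
    }

    @Override
    public InRecord getNext() {
        return next < records.size() ? records.get(next++) : null;
    }

    @Override
    public void close() { }
}

public class TestOutput implements OutputConsumer {
    @Override
    public void open() { }

    @Override
    public void accept(OutRecord record) {
        // just print the processed tuple to the console
        System.out.println(record.getAge() + ", " + record.getZipCode() + ", " + record.getDisease());
    }

    @Override
    public void close() { }
}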
Then all you need is a main method to tie it all together:
public static void main(String[] args) {
    InputSupplier in = new TestInput();    // our "dummy" input supplier class
    OutputConsumer out = new TestOutput(); // our "dummy" output consumer
    process(in, out);
}
For the "real" application you'd write a "real" input supplier class, still implementing the InputSupplier interface, that can read from from a database or an Excel file or whatever input source, and an new output consumer class, still implementing the OutputConsumer interface, that can take output records and store them into whatever appropriate format. Your processing logic won't have to change, because you coded it in terms of InputSupplier and OutputConsumer interfaces. Now just tweak main a bit and you've got your final app:
public static void main(String[] args) {
    InputSupplier in = new RealInput();    // our "real" input supplier class
    OutputConsumer out = new RealOutput(); // our "real" output consumer
    process(in, out);
}

Count filtered statuses from Twitter

What is the easiest way to count the number of filtered statuses that come in from a twitter stream? I know I can filter statuses using the FilterQuery like so:
FilterQuery fq = new FilterQuery();
String[] array = { "twitter" };
fq.track(array);
twitterStream.filter(fq);
But how can I count the number of statuses that come in containing the word "twitter"? I have tried numerous different approaches, which have all failed and only led to all statuses showing up. I even tried to parse the JSON and extract the "text" field in order to count, but it became too confusing and did not work.
As you are already filtering for statuses that contain 'twitter' all you need to do is increment a count in the StatusListener#onStatus(Status) method, e.g.:
final AtomicInteger count = new AtomicInteger();
StatusListener listener = new StatusListener() {
    @Override
    public void onStatus(Status status) {
        count.getAndIncrement();
    }
    // remaining StatusListener methods omitted...
};
twitterStream.addListener(listener);
twitterStream.filter(fq);
// wait (to allow statuses to be received), then halt the stream...
System.out.println("received " + count.get() + " statuses in total");
Alternatively you could create a CountingStatusListener that provided you with the count when you were done processing the stream.
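A sketch of that alternative, extending Twitter4J's StatusAdapter so the remaining callbacks can stay empty:

import java.util.concurrent.atomic.AtomicInteger;
import twitter4j.Status;
import twitter4j.StatusAdapter;

public class CountingStatusListener extends StatusAdapter {

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void onStatus(Status status) {
        count.incrementAndGet();
    }

    public int getCount() {
        return count.get();
    }
}

Register it with twitterStream.addListener(...) as above and call getCount() once you stop the stream.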
Regarding your comment:
For example I want to run the streamer and have it tell me that 7 tweets or whatever with my filtered word in them have come in since the time I ran the streamer.
You probably already know this, but the streaming API provides a real-time view of statuses flowing through Twitter (albeit a sample), so when you stop processing a stream you will miss any statuses sent between the time you stop and the time you start processing again.
I hope that helps.

Twitter Storm, why is the bolt not receiving what the spout sends

I am using Twitter Storm in a project and I have a strange problem.
I have a spout with the following nextTuple code:
public void nextTuple()
{
    HashMap cluster = this.databaseManager.getCluster();
    System.out.println("emitted: " + cluster);
    this.collector.emit(new Values(cluster));
}
And a bolt which is connected to this spout with the following execute:
public void execute(Tuple tuple)
{
    HashMap<String, List<String>> unigrams = (HashMap) tuple.getValueByField("unigrams");
    System.out.println("received: " + unigrams);
}
What is emitted should be the same as what is received, right?
So at first the output shows this:
emitted: {218460=[04ef110987074dc6b3e3174b9f57d980], 1702472=[04ef110987074dc6b3e3174b9f57d980]}
received: {218460=[04ef110987074dc6b3e3174b9f57d980], 1702472=[04ef110987074dc6b3e3174b9f57d980]}
(It is irrelevant what the data means; the point is that it is emitted and received.)
But then, when the emitted data changes, the received data stays the same:
emitted: {13788873=[aa2ec732b5b64b25be81abe79d2176bb], 2293158=[aa2ec732b5b64b25be81abe79d2176bb], 218460=[04ef110987074dc6b3e3174b9f57d980], 1702472=[04ef110987074dc6b3e3174b9f57d980]}
received: {218460=[04ef110987074dc6b3e3174b9f57d980], 1702472=[04ef110987074dc6b3e3174b9f57d980]}
I am banging my head as to why it doesn't work in the second case.
Furthermore, the output is printing a lot more from nextTuple than from execute.
Any ideas why this is?
The only solution that I found was to convert the HashMap to a String before emitting, and then convert it back into a HashMap in the bolt.
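For example, that workaround could look roughly like this (a sketch, assuming Gson from com.google.gson is on the classpath; any JSON or string serialization would work the same way):

// In the spout: serialize the map to a JSON string before emitting.
Gson gson = new Gson();
HashMap<String, List<String>> cluster = this.databaseManager.getCluster();
this.collector.emit(new Values(gson.toJson(cluster)));

// In the bolt: parse the string back into a HashMap.
// Type is java.lang.reflect.Type; TypeToken is com.google.gson.reflect.TypeToken.
Type mapType = new TypeToken<HashMap<String, List<String>>>() {}.getType();
HashMap<String, List<String>> unigrams =
        gson.fromJson(tuple.getStringByField("unigrams"), mapType);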
