Java Apache Storm Spout empty deserialized LinkedList object attribute

I have a spout class that has several integer and String attributes, which are serialized/deserialized as expected. The class also has one LinkedList holding byte arrays, and this LinkedList is always empty when an object is deserialized.
I've added log statements to all of the spout methods and can see the spout's 'activate' method being called, after which the LinkedList is empty. I never see a corresponding log from the 'deactivate' method.
It seems odd that the spout's 'activate' method is called without 'deactivate' ever having been called, and the topology has not been resubmitted when 'activate' fires.
I also have a log statement in the spout constructor, and it is not called prior to the LinkedList being emptied.
I've also verified repeatedly that nothing in the spout class calls any method that would completely empty the LinkedList. There is one spot that uses the poll method, and it is immediately followed by a log statement recording the new LinkedList size.
I found this reference, which points to Kryo being used for Serialization, but it may just be for serializing tuple data.
http://storm.apache.org/documentation/Serialization.html
Storm uses Kryo for serialization. Kryo is a flexible and fast
serialization library that produces small serializations.
By default, Storm can serialize primitive types, strings, byte arrays,
ArrayList, HashMap, HashSet, and the Clojure collection types. If you
want to use another type in your tuples, you'll need to register a
custom serializer.
The article makes it sound like Kryo may be just for serializing and passing tuples, but if it applies to the Spout object as well, I can't figure out how I could then use a LinkedList, since ArrayLists and HashMaps aren't really good alternatives for a FIFO queue. Will I have to roll my own LinkedList?
import java.util.LinkedList;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import org.eclipse.paho.client.mqttv3.MqttException; // assuming the Eclipse Paho MQTT client

public class MySpout extends BaseRichSpout
{
    private SpoutOutputCollector _collector;
    private LinkedList<byte[]> messages = new LinkedList<byte[]>();

    public MySpout()
    {
        messages = new LinkedList<byte[]>();
    }

    public void add(byte[] message)
    {
        messages.add(message);
    }

    @Override
    public void open( Map conf, TopologyContext context,
                      SpoutOutputCollector collector )
    {
        _collector = collector;
        try
        {
            // Logger is the application's own logging helper
            Logger.getInstance().addMessage("Opening Spout");
            // ####### Open client connection here to read messages
        }
        catch (MqttException e)
        {
            e.printStackTrace();
        }
    }

    @Override
    public void close()
    {
        Logger.getInstance().addMessage("Close Method Called!!!!!!!!!!!!!!!!!");
    }

    @Override
    public void activate()
    {
        Logger.getInstance().addMessage("Activate Method Called!!!!!!!!!!!!!!!!!");
    }

    // declareOutputFields was omitted in the original snippet; an illustrative version:
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer)
    {
        declarer.declare(new Fields("message"));
    }

    @Override
    public void nextTuple()
    {
        if (!messages.isEmpty())
        {
            System.out.println("Tuple emitted from spout");
            _collector.emit(new Values(messages.poll()));
            Logger.getInstance().addMessage("Tuple emitted from spout. Remaining in queue: " + messages.size());
            try
            {
                Thread.sleep(1);
            }
            catch (InterruptedException e)
            {
                Logger.getInstance().addMessage("Sleep thread interrupted in nextTuple(). " + Logger.convertStacktraceToString(e));
                e.printStackTrace();
            }
        }
    }
}
EDIT:
Java Serialization of referenced objects is "losing values"?
http://www.javaspecialists.eu/archive/Issue088.html
The above SO link and the Java Specialists article call out specific examples similar to what I am seeing, and the issue is due to the serialization/deserialization cache. But because Storm is doing that work, I'm not sure what can be done about it.
At the end of the day, it also seems like the bigger issue is that Storm is serializing/deserializing the data at all in the first place.
EDIT:
Just prior to the spout being activated, a good number of log messages come through in less than a second that read:
Executor MyTopology-1-1447093098:[X Y] not alive
After those messages, there is a log of:
Setting new assignment for topology id MyTopology-1-1447093098: #backtype.storm.daemon.common.Assignment{:master-code-dir ...

If I understand your problem correctly, you instantiate your spout on the client side, add messages via add(), hand the spout to the TopologyBuilder via setSpout(), and afterwards submit the topology to your cluster? When the topology is started, you expect the spout's message list to contain the messages you added? If this is correct, your usage pattern is quite odd...
I guess the problem is related to Thrift, which is used to submit the topology to the cluster. Java serialization is not used, and I assume that the Thrift code does not serialize the actual object. As far as I understand the code, the topology jar is shipped as binary, and the topology structure is shipped via Thrift. On the workers that execute the topology, new spout/bolt objects are created via new. Thus, no Java serialization/deserialization happens and your LinkedList is empty. Due to the call of new, it is of course not null either.
Btw: you are right about Kryo; it is only used to ship data (i.e., tuples).
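(For completeness: if a custom type ever did need to go inside tuples, registering it with Kryo is a one-liner against Storm's Config. A minimal sketch, with an illustrative topology name; this affects only tuple data, not the spout object itself:)
Config conf = new Config();
conf.registerSerialization(java.util.LinkedList.class); // Kryo registration for tuple data only
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());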
As a workaround, you could add the LinkedList to the Map that is given to StormSubmitter.submitTopology(...). In Spout.open(...) you should get a correct copy of your messages from the Map parameter; a sketch follows below. However, as I mentioned already, your usage pattern is quite odd -- you might want to rethink this. In general, a spout should be implemented in such a way that it can fetch the data in nextTuple() from an external data source.
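A minimal sketch of that workaround; the config key is illustrative, and since Storm serializes the config as JSON, raw byte arrays may need to be encoded (e.g. Base64) first:
// Client side: stash the initial messages in the topology config.
Config conf = new Config();
conf.put("my.initial.messages", encodedMessages); // illustrative key; a List<String> of Base64-encoded payloads
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());

// Spout side: restore the messages in open(...).
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    List<String> encoded = (List<String>) conf.get("my.initial.messages");
    if (encoded != null) {
        for (String s : encoded) {
            messages.add(java.util.Base64.getDecoder().decode(s));
        }
    }
}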

Related

Flink pipeline with firing results on event

I have a stream of objects with an address and a list of organizations:
@Data
class TaggedObject {
    String address;
    List<String> organizations;
}
Is there a way to do the following using Apache Flink:
Merge the organization lists of objects with the same address
Send all results to the sink when some event occurs, e.g. when a user sends a control message to a Kafka topic or another DataSource
Keep all objects for future accumulations
I tried using a global window and a custom trigger:
public class MyTrigger extends Trigger<TaggedObject, GlobalWindow> {
    @Override
    public TriggerResult onElement(TaggedObject element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
        if (element instanceof Control) return TriggerResult.FIRE;
        else return TriggerResult.CONTINUE;
    }
    // onEventTime, onProcessingTime, and clear omitted
}
But it seems to emit only the Control element as a result; the other elements were ignored.
If you want a generic control signal that triggers output for ALL addresses, then you'll need to use a broadcast stream. You combine your stream of addresses with your control stream and then perform the appropriate logic (merging organizations for an address, or triggering output) inside your custom implementation of a KeyedBroadcastProcessFunction; a sketch follows below.
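A minimal sketch of such a function, assuming a Control marker class, Lombok accessors plus an all-args constructor on TaggedObject, and an illustrative state name "orgs"; the broadcast wiring is indicated at the end:
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class MergeAndFire
        extends KeyedBroadcastProcessFunction<String, TaggedObject, Control, TaggedObject> {

    private transient ListState<String> orgs;

    @Override
    public void open(Configuration parameters) {
        orgs = getRuntimeContext().getListState(
                new ListStateDescriptor<>("orgs", String.class));
    }

    @Override
    public void processElement(TaggedObject value, ReadOnlyContext ctx,
                               Collector<TaggedObject> out) throws Exception {
        // Accumulate organizations for this address; the state is kept for future events.
        for (String org : value.getOrganizations()) {
            orgs.add(org);
        }
    }

    @Override
    public void processBroadcastElement(Control control, Context ctx,
                                        Collector<TaggedObject> out) throws Exception {
        // On a control message, emit the merged organizations for every known address.
        ctx.applyToKeyedState(
                new ListStateDescriptor<>("orgs", String.class),
                (address, state) -> {
                    List<String> merged = new ArrayList<>();
                    state.get().forEach(merged::add);
                    out.collect(new TaggedObject(address, merged)); // assumes an all-args constructor
                });
    }
}

// Wiring (sketch):
// events.keyBy(TaggedObject::getAddress)
//       .connect(controls.broadcast(/* MapStateDescriptor for broadcast state */))
//       .process(new MergeAndFire());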
It seems like you should just key the stream by address and then use a KeyedProcessFunction (with a List- or MapState) to store the different organizations. Then as soon as an event comes in, you can just output the entries of the State.
Kind Regards
Dominik

How to use interactive query within kafka process topology in spring-cloud-stream?

Is it possible to use interactive queries (InteractiveQueryService) within a Spring Cloud Stream class annotated with @EnableBinding, or within a method annotated with @StreamListener? I tried instantiating a ReadOnlyKeyValueStore within the provided KStreamMusicSampleApplication class and its process method, but it's always null.
My @StreamListener method listens to a bunch of KTables and KStreams, and during the processing topology, e.g. filtering, I have to check whether the key from a KStream already exists in a particular KTable.
I tried to figure out how to scan an incoming KTable to check if a key already exists, but had no luck. Then I came across InteractiveQueryService, whose get() method could be used to check if a key exists inside a state store materialized from a KTable. The problem is that I can't access it from within the processing topology (@EnableBinding or @StreamListener). It can only be accessed from outside these annotations, e.g. in a RestController.
Is there a way to scan an incoming KTable to check for the existence of a key or value? If not, can we access InteractiveQueryService within the processing topology?
InteractiveQueryService in Spring Cloud Stream is not available for use within the actual topology in your StreamListener. As you mentioned, it is meant to be used outside of your main topology. However, for the use case you described, you can still use the state store from your main flow. For example, if you have an incoming KStream and a KTable which is materialized as a state store, then you can call process on the KStream and access the state store that way. Here is some rough code to achieve that. You will need to adapt this to your specific use case, but here is the idea.
input.process(() -> new Processor<Object, Object>() {

    private ReadOnlyKeyValueStore<Object, String> store;

    @Override
    public void init(ProcessorContext processorContext) {
        store = (ReadOnlyKeyValueStore) processorContext.getStateStore("my-store");
    }

    @Override
    public void process(Object key, Object value) {
        // find the key
        String existing = store.get(key);
    }

    @Override
    public void close() {
        // nothing to close here; the store is owned by the Streams runtime
    }
}, "my-store");

Subscribe to an Observable without triggering it and then passing it on

This could get a little bit complicated, and I'm not that experienced with Observables and the Rx pattern, so bear with me:
Suppose you've got some arbitrary SDK method which returns an Observable. You consume the method from a class which is - among other things - responsible for retrieving data and, while doing so, does some caching, so let's call it DataProvider. Then you've got another class which wants to access the data provided by DataProvider. Let's call it Consumer for now. So there we've got our setup.
Side note for all the pattern friends out there: I'm aware that this is not MVP, it's just an example for an analogous, but much more complex problem I'm facing in my application.
That being said, in Kotlin-like pseudo code the described situation would look like this:
class Consumer(val provider: DataProvider) {
    fun logic() {
        provider.getData().subscribe(...)
    }
}

class DataProvider(val sdk: SDK) {
    fun getData(): Observable {
        val observable = sdk.getData()
        observable.subscribe(/* cache data as it passes through */)
        return observable
    }
}

class SDK {
    fun getData(): Observable {
        return fetchDataFromNetwork()
    }
}
The problem is that upon calling subscribe() in the DataProvider I'm already triggering the Observable, which I don't want. I want the DataProvider to just silently listen - in this example the triggering should be done by the Consumer.
So what's the best Rx-compatible solution for this problem? The one outlined in the pseudo code above definitely isn't, for various reasons, one of which is the premature triggering of the network request before the Consumer has subscribed to the Observable. I've experimented with publish().autoConnect(2) before calling subscribe() in the DataProvider, but that doesn't seem to be the canonical way to do this kind of thing. It just feels hacky.
Edit: Through SO's excellent "related" feature I've just stumbled across another question pointing in a different direction, but with a solution which could also be applicable here, namely flatMap(). I knew that one before, but never actually had to use it. Seems like a viable way to me - what's your opinion regarding that?
If the caching step is not supposed to modify events in the chain, the doOnNext() operator can be used:
class DataProvider(val sdk: SDK) {
    fun getData(): Observable<*> = sdk.getData().doOnNext(/* cache data as it passes through */)
}
Yes, flatMap could be a solution. Moreover, you could split your stream into a chain of small Observables:
public class DataProvider {
    private Api api;
    private Parser parser;
    private Cache cache;

    public Observable<List<User>> getUsers() {
        return api.getUsersFromNetwork()
                .flatMap(parser::parseUsers)
                .map(cache::cacheUsers);
    }
}

public class Api {
    public Observable<Response> getUsersFromNetwork() {
        // makes https request or whatever
    }
}

public class Parser {
    public Observable<List<User>> parseUsers(Response response) {
        // parse users
    }
}

public class Cache {
    public List<User> cacheUsers(List<User> users) {
        // cache users
    }
}
It's easy to test, maintain, and replace implementations (with usage of interfaces). You could also easily insert an additional step into your stream (for instance to log/convert/change data which you receive from the server).
The other quite convenient operator is map. Basically, where flatMap's function returns an Observable<Data>, map's function returns just Data. It could make your code even simpler.
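To illustrate the difference with the example above (a sketch; parseUsersList is a hypothetical synchronous variant of the parser that returns List<User> directly):
// flatMap: the supplied function itself returns an Observable.
Observable<List<User>> viaFlatMap =
        api.getUsersFromNetwork().flatMap(parser::parseUsers);

// map: the supplied function returns a plain value; the stream stays wrapped.
Observable<List<User>> viaMap =
        api.getUsersFromNetwork().map(response -> parser.parseUsersList(response));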

Is there anything like a watch service in Storm?

I have a bolt whose input file keeps being updated, but I can't pick up the updated content because I read the file in the prepare() method. I want to pick up the updated file without stopping or killing the topology. Is there anything like a watch service in Storm to do this? Or any different approach?
One approach to your problem is defining a spout that periodically checks whether the file has changed. Once it has, the spout sends a tuple notifying your bolt about the change, and the bolt in turn reloads the file. Here are a few hints about the implementation:
The topology will contain the new monitoring spout. Your bolt will subscribe to its stream and to any other stream it needs (bolts can consume multiple streams):
topologyBuilder.setSpout("file_checking_spout", new FileCheckingSpout(myMonitoredFile));
topologyBuilder.setBolt("my_bolt", new MyBolt())
        .shuffleGrouping("file_checking_spout")
        .shuffleGrouping("whatever other grouping you need");
The spout will do the monitoring. If there is only one file to monitor, you can just emit empty tuples as notification:
public class FileCheckingSpout extends BaseRichSpout {
    @Override
    public void nextTuple() {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (fileChanged()) { // check e.g. the file's last-modified timestamp
            collector.emit(new Values());
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields());
    }
    // ...
}
Your bolt will now have to accept the notifications about file reload. It can distinguish notification tuples e.g. using tuple.getSourceComponent():
class MyBolt implements IRichBolt {
    @Override
    public void execute(Tuple tuple) {
        if ("file_checking_spout".equals(tuple.getSourceComponent())) {
            reloadFile();
            return;
        }
        // normal processing
    }
    // ...
}
You could also simply check whether the file changed directly in your bolt's execute(). The way described above is more "the Storm way", as it separates concerns and reloading is not dependent on any other streams.
PS: Naturally, this will work as long as the file is accessible from both spout and bolt, i.e., if you are running in a cluster, it should be on a shared file system.

Record method calls in one session for replaying in future test sessions?

I have a backend system which we use a third-party Java API to access from our own applications. I can access the system as a normal user along with other users, but I do not have godly powers over it.
Hence to simplify testing I would like to run a real session and record the API calls, and persist them (preferably as editable code), so we can do dry test runs later with API calls just returning the corresponding response from the recording session - and this is the important part - without needing to talk to the above mentioned backend system.
So if my application contains a line of the form:
Object b = callBackend(a);
I would like the framework to first capture that callBackend() returned b given the argument a, and then when I do the dry run at any later time say "hey, given a this call should return b". The values of a and b will be the same (if not, we will rerun the recording step).
I can override the class providing the API so all the method calls to capture will go through my code (i.e. byte code instrumentation to alter behavior of classes outside my control is not necessary).
What framework should I look into to do this?
EDIT: Please note that bounty hunters should provide actual code demonstrating the behavior I look for.
Actually, you can build such a framework or template by using the proxy pattern. Here I explain how you can do it using the dynamic proxy pattern. The idea is to:
Write a proxy manager to get recorder and replayer proxies of the API on demand!
Write a wrapper class to store your collected information, and also implement the hashCode and equals methods of that wrapper class for efficient lookup in a Map-like data structure.
And finally use the recorder proxy to record and the replayer proxy for replaying purposes.
How recorder works:
invokes the real API
collects the invocation information
persists data in expected persistence context
How replayer works:
Collect the method information (method name, parameters)
If collected information matches with previously recorded information then return the previously collected return value.
If returned value does not match, persist the collected information (As you wanted).
Now, let's look at the implementation. If your API is MyApi, like below:
public interface MyApi {
    public String getMySpouse(String myName);
    public int getMyAge(String myName);
    ...
}
Now we will record and replay the invocation of public String getMySpouse(String myName). To do that, we can use a class to store the invocation information, like below:
public class RecordedInformation {
    private String methodName;
    private Object[] args;
    private Object returnValue;

    public String getMethodName() {
        return methodName;
    }

    public void setMethodName(String methodName) {
        this.methodName = methodName;
    }

    public Object[] getArgs() {
        return args;
    }

    public void setArgs(Object[] args) {
        this.args = args;
    }

    public Object getReturnValue() {
        return returnValue;
    }

    public void setReturnValue(Object returnValue) {
        this.returnValue = returnValue;
    }

    @Override
    public int hashCode() {
        return super.hashCode(); // change your implementation as you like!
    }

    @Override
    public boolean equals(Object obj) {
        return super.equals(obj); // change your implementation as you like!
    }
}
Now here comes the main part: the RecordReplyManager. It gives you a proxy object of your API, depending on your need for recording or replaying.
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class RecordReplyManager implements java.lang.reflect.InvocationHandler {

    private Object objOfApi;
    private boolean isForRecording;

    public static Object newInstance(Object obj, boolean isForRecording) {
        return java.lang.reflect.Proxy.newProxyInstance(
                obj.getClass().getClassLoader(),
                obj.getClass().getInterfaces(),
                new RecordReplyManager(obj, isForRecording));
    }

    private RecordReplyManager(Object obj, boolean isForRecording) {
        this.objOfApi = obj;
        this.isForRecording = isForRecording;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        Object result = null;
        if (isForRecording) {
            try {
                System.out.println("recording...");
                System.out.println("method name: " + method.getName());
                System.out.print("method arguments:");
                for (Object arg : args) {
                    System.out.print(" " + arg);
                }
                System.out.println();
                result = method.invoke(objOfApi, args);
                System.out.println("result: " + result);
                RecordedInformation recordedInformation = new RecordedInformation();
                recordedInformation.setMethodName(method.getName());
                recordedInformation.setArgs(args);
                recordedInformation.setReturnValue(result);
                // persist your information
            } catch (InvocationTargetException e) {
                throw e.getTargetException();
            } catch (Exception e) {
                throw new RuntimeException("unexpected invocation exception: " +
                        e.getMessage());
            }
            return result;
        } else {
            try {
                System.out.println("replaying...");
                System.out.println("method name: " + method.getName());
                System.out.print("method arguments:");
                for (Object arg : args) {
                    System.out.print(" " + arg);
                }
                RecordedInformation recordedInformation = new RecordedInformation();
                recordedInformation.setMethodName(method.getName());
                recordedInformation.setArgs(args);
                // if this invocation information (RecordedInformation) is found in the previously
                // collected map, then return the returnValue from that RecordedInformation.
                // if no corresponding RecordedInformation exists, then invoke the real method
                // (as in the recording step), wrap the collected information into a
                // RecordedInformation, and persist it as you like!
            } catch (InvocationTargetException e) {
                throw e.getTargetException();
            } catch (Exception e) {
                throw new RuntimeException("unexpected invocation exception: " +
                        e.getMessage());
            }
            return result;
        }
    }
}
If you want to record the method invocations, all you need is to get an API proxy like below:
MyApi realApi = new RealApi(); // use new or whatever way you get your service implementation (API implementation)
MyApi myApiWithRecorder = (MyApi) RecordReplyManager.newInstance(realApi, true); // true for recording
myApiWithRecorder.getMySpouse("richard"); // to record getMySpouse
myApiWithRecorder.getMyAge("parker"); // to record getMyAge
...
And to replay, all you need is:
MyApi realApi = new RealApi(); // use new or whatever way you get your service implementation (API implementation)
MyApi myApiWithReplayer = (MyApi) RecordReplyManager.newInstance(realApi, false); // false for replaying
myApiWithReplayer.getMySpouse("richard"); // to replay getMySpouse
myApiWithReplayer.getMyAge("parker"); // to replay getMyAge
...
And you are done!
Edit:
The basic steps of the recorder and replayer can be done in the above-mentioned way. Now it's up to you how you want to use or perform those steps. You can do whatever you want and whatever you like in the recorder and replayer code blocks; just choose your implementation!
I should prefix this by saying I share some of the concerns in Yves Martin's answer: that such a system may prove frustrating to work with and ultimately less helpful than it would seem at first blush.
That said, from a technical standpoint, this is an interesting problem, and I couldn't not take a go at it. I put together a gist to log method calls in a fairly general way. The CallLoggingProxy class defined there allows usage such as the following.
Calendar original = CallLoggingProxy.create(Calendar.class, Calendar.getInstance());
original.getTimeInMillis(); // 1368311282470
CallLoggingProxy.ReplayInfo replayInfo = CallLoggingProxy.getReplayInfo(original);
// Persist the replay info to disk, serialize to a DB, whatever floats your boat.
// Come back and load it up later...
Calendar replay = CallLoggingProxy.replay(Calendar.class, replayInfo);
replay.getTimeInMillis(); // 1368311282470
You could imagine wrapping your API object with CallLoggingProxy.create prior to passing it into your testing methods, capturing the data afterwards, and persisting it using whatever your favorite serialization system happens to be. Later, when you want to run your tests, you can load the data back up, create a new instance based on the data with CallLoggingProxy.replay, and pass that into your methods instead.
The CallLoggingProxy is written using Javassist, as Java's native Proxy is limited to working against interfaces. This should cover the general use case, but there are a few limitations to keep in mind:
Classes declared final can't be proxied by this method. (Not easily fixable; this is a system limitation)
The gist assumes the same input to a method will always produce the same output. (More easily fixable; the ReplayInfo would need to keep track of sequences of calls for each input instead of single input/output pairs.)
The gist is not even remotely threadsafe (Fairly easily fixable; just requires a little thought and effort)
Obviously the gist is simply a proof of concept, so it's also not been very thoroughly tested, but I believe the general principle is sound. It's also possible there's a more fully baked framework out there to achieve this sort of goal, but if such a thing does exist, I'm not aware of it.
If you do decide to continue with the replay approach, then hopefully this will be enough to give you a possible direction to work in.
I had the same needs some months ago for non-regression testing when planning a heavy technical refactoring of a large application and... I found nothing available as a framework.
In fact, replaying may be particularly difficult and may only work in a specific context - no (or few) applications of standard complexity can really be considered stateless. It is a common problem when testing persistence code with a relational database. To be relevant, the complete initial system state must be restored, and each replay step must impact the global state in the same way. It becomes a challenge when the system state is distributed across pieces like databases, files, memory... Just guess what happens if a timestamp taken from the system clock is used somewhere!
So a more practical option is to only record... and then do a clever comparison on subsequent runs.
Depending on the number of runs you plan, a human-driven session on the application may be enough, or you may have to invest in an automated scenario, i.e. a robot playing with your application's user interface.
First, to record: you can use a dynamic proxy interface or aspect programming to intercept method calls and capture state before and after invocation. That may mean dumping the concerned database tables, copying some files, or serializing Java objects in a text format like XML.
Then compare this reference capture against a new run. This comparison should be tuned to exclude any irrelevant elements from each piece of state, like row identifiers, timestamps, file names... so that you only compare data where your backend's added value shines.
Finally, nothing is really standard here, and often a few specific scripts and bits of code may be enough to achieve the aim: detect as many errors as possible and try to prevent unexpected side-effects.
This can be done with AOP, aspect-oriented programming. It allows you to intercept method calls through byte code manipulation. Do a bit of searching for examples.
In one case this can do the recording, in the other the replaying.
Pointers: wikipedia, AspectJ, Spring AOP.
Unfortunately one moves a bit outside the Java syntax, and a simple example is better sought elsewhere, with explanation; a hedged sketch of the recording side follows below.
Maybe combine this with unit tests / some mocking test framework for offline testing with recorded data.
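To make that concrete, a hedged sketch using Spring AOP annotations; the pointcut expression, the BackendApi type, and the CallRecordStore helper are illustrative assumptions:
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class RecordingAspect {

    // Intercepts every call to the (hypothetical) backend API.
    @Around("execution(* com.example.backend.BackendApi.*(..))")
    public Object record(ProceedingJoinPoint pjp) throws Throwable {
        Object result = pjp.proceed(); // invoke the real backend
        // Persist method name, arguments, and result for later replay.
        CallRecordStore.save(pjp.getSignature().getName(), pjp.getArgs(), result);
        return result;
    }
}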
You could look into Mockito.
Example:
//You can mock concrete classes, not only interfaces
LinkedList mockedList = mock(LinkedList.class);
//stubbing
when(mockedList.get(0)).thenReturn("first");
when(mockedList.get(1)).thenThrow(new RuntimeException());
//following prints "first"
System.out.println(mockedList.get(0));
//following throws runtime exception
System.out.println(mockedList.get(1));
//following prints "null" because get(999) was not stubbed
System.out.println(mockedList.get(999));
Afterwards you can replay each test multiple times, and it will return the data that you put in.
// pseudocode
class LogMethod {
    List<String> parameters;
    String method;

    void addCallTo(String method, List<String> params) {
        this.method = method;
        this.parameters = params;
    }
}
Have a list of LogMethods and call new LogMethod().addCallTo() before every call in your test method.
The idea of playing back the API calls sounds like a use case for the event sourcing pattern. Martin Fowler has a good article on it here. This is a nice pattern that records events as a sequence of objects which are then stored; you can then replay the sequence of events as required.
There is an implementation of this pattern using Akka called Eventsourced, which may help you build the type of system you require.
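A minimal plain-Java sketch of the pattern's core idea (the ApiCallEvent shape is illustrative, not the Eventsourced API):
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Each backend interaction is captured as an immutable event.
final class ApiCallEvent {
    final String method;
    final Object[] args;
    final Object result;

    ApiCallEvent(String method, Object[] args, Object result) {
        this.method = method;
        this.args = args;
        this.result = result;
    }
}

// Events are appended to a log and can later be replayed in order.
final class EventLog {
    private final List<ApiCallEvent> events = new ArrayList<>();

    void append(ApiCallEvent event) {
        events.add(event);
    }

    void replay(Consumer<ApiCallEvent> handler) {
        events.forEach(handler); // re-apply events in their original order
    }
}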
I had a similar problem some years ago. None of the above solutions would have worked for methods that are not pure functions (side-effect free). The major tasks are, in my opinion:
how to extract a snapshot of the recorded object(s) (not only restricted to objects implementing Serializable)
how to generate test code of a serialized representation in a readable way (not only restricted to beans, primitives and collections)
So I had to go my own way - with testrecorder.
For example, given:
ResultObject b = callBackend(a);
...
ResultObject callBackend(SourceObject source) {
...
}
you will only have to annotate the method like this:
@Recorded
ResultObject callBackend(SourceObject source) {
...
}
and start your application (the one that should be recorded) with the testrecorder agent. Testrecorder will manage all tasks for you, such as:
serializing arguments, results, state of this, exceptions (complete object graph!)
finding a readable representation for object construction and object matching
generating a test from the serialized data
you can extend recordings to global variables, input and output with annotations
An example of the generated test will look like this:
void testCallBackend() {
    //arrange
    SourceObject sourceObject1 = new SourceObject();
    sourceObject1.setState(...); // testrecorder can use setters but is not limited to them
    ... // setting up backend
    ... // setting up globals, mocking inputs

    //act
    ResultObject resultObject1 = backend.callBackend(sourceObject1);

    //assert
    assertThat(resultObject1, new GenericMatcher() {
        ... // property matchers
    }.matching(ResultObject.class));
    ... // assertions on backend and sourceObject1 for potential side effects
    ... // assertions on outputs and globals
}
If I understood your question correctly, you should try db4o.
You would store the objects with db4o and restore them later for mocks and JUnit tests.
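A rough sketch of that idea, reusing the RecordedInformation wrapper from the proxy answer above (the file name is illustrative; db4o's query-by-example matches on the template's non-default fields):
import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;

public class Db4oRecordingDemo {
    public static void main(String[] args) {
        ObjectContainer db = Db4oEmbedded.openFile(
                Db4oEmbedded.newConfiguration(), "recorded-calls.db4o");
        try {
            // Record phase: persist a captured call.
            RecordedInformation info = new RecordedInformation();
            info.setMethodName("getMySpouse");
            info.setArgs(new Object[] { "richard" });
            info.setReturnValue("mary");
            db.store(info);

            // Replay phase: query by example to find the stored call again.
            RecordedInformation template = new RecordedInformation();
            template.setMethodName("getMySpouse");
            ObjectSet<RecordedInformation> matches = db.queryByExample(template);
            while (matches.hasNext()) {
                System.out.println(matches.next().getReturnValue());
            }
        } finally {
            db.close();
        }
    }
}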
