How to use non-keyed state with Kafka Consumer in Flink? - java

I'm trying to implement (I'm just starting to work with Java and Flink) non-keyed state in a KafkaConsumer object, since at this stage no keyBy() is called. This object is the front end and the first module to handle messages from Kafka.
SourceOutput is a proto-generated class representing the message.
I have the KafkaConsumer object:
public class KafkaSourceFunction extends ProcessFunction<byte[], SourceOutput> implements Serializable
{
    @Override
    public void processElement(byte[] bytes, ProcessFunction<byte[], SourceOutput>.Context context,
                               Collector<SourceOutput> collector) throws Exception
    {
        // Here, I want to call the sorting method
        collector.collect(output);
    }
}
I have an object (KafkaSourceSort) that does all the sorting. It should keep unordered messages in a PriorityQueue in state, and it is also responsible for delivering a message through the collector once it arrives in the right order.
class SessionInfo
{
    public PriorityQueue<SourceOutput> orderedMessages = null;

    public void putMessage(SourceOutput msg)
    {
        if (orderedMessages == null)
            orderedMessages = new PriorityQueue<SourceOutput>(new SequenceComparator());
        orderedMessages.add(msg);
    }
}
public class KafkaSourceState implements Serializable
{
    public TreeMap<String, SessionInfo> Sessions = new TreeMap<>();
}
I read that I need to use non-keyed state (ListState), which should contain a map of sessions, where each session holds a PriorityQueue of all messages related to that session.
I found an example, so I implemented this:
public class KafkaSourceSort implements SinkFunction<KafkaSourceSort>, CheckpointedFunction
{
    private transient ListState<KafkaSourceState> checkpointedState;
    private KafkaSourceState state;

    @Override
    public void snapshotState(FunctionSnapshotContext functionSnapshotContext) throws Exception
    {
        checkpointedState.clear();
        checkpointedState.add(state);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception
    {
        ListStateDescriptor<KafkaSourceState> descriptor =
            new ListStateDescriptor<KafkaSourceState>(
                "KafkaSourceState",
                TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored())
        {
            state = (KafkaSourceState) checkpointedState.get();
        }
    }

    @Override
    public void invoke(KafkaSourceState value, SinkFunction.Context context) throws Exception
    {
        state = value;
        // ...
    }
}
I understand that I need to implement an invoke() method, which will probably be called from processElement(), but the signature of invoke() doesn't contain a collector, and I don't understand how to do this, or even whether what I've done so far is correct.
Any help would be appreciated.
Thanks.

A SinkFunction is a terminal node in the DAG that is your job graph. It doesn't have a Collector in its interface because it cannot emit anything downstream. It is expected to connect to an external service or data store and send data there.
If you share more about what you are trying to accomplish perhaps we can offer more assistance. There may be an easier way to go about this.
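To make that concrete, here is a minimal sketch of keeping the sorting inside the ProcessFunction itself, which does have a Collector, and checkpointing the buffered sessions as operator (non-keyed) ListState. SourceOutput.parseFrom(), getSessionId(), getSequence() and SequenceComparator are assumptions based on the question, so treat this as a direction rather than a drop-in implementation:
public class SortingSourceFunction extends ProcessFunction<byte[], SourceOutput>
        implements CheckpointedFunction {

    // Per-session buffers of out-of-order messages, plus the next sequence we expect to emit.
    private final Map<String, PriorityQueue<SourceOutput>> buffers = new TreeMap<>();
    private final Map<String, Long> nextExpected = new TreeMap<>();

    private transient ListState<KafkaSourceState> checkpointedState;

    @Override
    public void processElement(byte[] bytes, Context ctx, Collector<SourceOutput> out) throws Exception {
        SourceOutput msg = SourceOutput.parseFrom(bytes);   // assumption: protobuf-generated parser
        String session = msg.getSessionId();                // assumption: session accessor
        buffers.computeIfAbsent(session, k -> new PriorityQueue<>(new SequenceComparator())).add(msg);

        // Emit everything that is now in order for this session.
        PriorityQueue<SourceOutput> queue = buffers.get(session);
        long expected = nextExpected.getOrDefault(session, 0L);
        while (!queue.isEmpty() && queue.peek().getSequence() == expected) { // assumption: sequence accessor
            out.collect(queue.poll());
            expected++;
        }
        nextExpected.put(session, expected);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Copy the in-memory buffers into the serializable state object and store it in operator ListState.
        KafkaSourceState snapshot = new KafkaSourceState();
        buffers.forEach((session, queue) -> queue.forEach(m ->
            snapshot.Sessions.computeIfAbsent(session, k -> new SessionInfo()).putMessage(m)));
        checkpointedState.clear();
        checkpointedState.add(snapshot);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        ListStateDescriptor<KafkaSourceState> descriptor = new ListStateDescriptor<>(
                "KafkaSourceState", TypeInformation.of(new TypeHint<KafkaSourceState>() {}));
        checkpointedState = ctx.getOperatorStateStore().getListState(descriptor);

        // Note: get() returns an Iterable, not a single KafkaSourceState.
        for (KafkaSourceState restored : checkpointedState.get()) {
            restored.Sessions.forEach((session, info) -> {
                if (info.orderedMessages != null) {
                    buffers.put(session, info.orderedMessages);
                }
            });
            // Restoring nextExpected is omitted here for brevity.
        }
    }
}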

Related

Implementing the Observer Pattern to get notified about every new Account

I am developing an application where I would like to use the observer pattern in the following way:
I have 2 classes:
public abstract class Storage<V> {
    private Set<V> values;
    private String filename;

    protected Storage(String filename) throws ClassNotFoundException, IOException {
        values = new HashSet<>();
        this.filename = filename;
        load();
    }
    ...
    public boolean add(V v) throws IllegalArgumentException {
        if (values.contains(v))
            throw new IllegalArgumentException("L'elemento è già presente"); // "The element is already present"
        return values.add(v);
    }
    ...
}
Storage acts as a repository class for saving a collection of objects. Below is a subclass that implements the singleton pattern (the others are practically the same; only the specified generic type changes):
public class AccountStorage extends Storage<Account> {
    private static AccountStorage instance = null;

    private AccountStorage(String filename) throws ClassNotFoundException, IOException {
        super(filename);
    }

    public static synchronized AccountStorage getInstance() throws ClassNotFoundException, IOException {
        if (instance == null) {
            String savefile = "accounts.ob";
            instance = new AccountStorage(savefile);
        }
        return instance;
    }
}
After that I have a controller class (a Spring MVC controller) which, through a POST request, receives an Account in JSON format, deserializes it, and adds it to the collection (via the AccountStorage class), like this:
@PostMapping(value = "new/user", consumes = MediaType.APPLICATION_JSON_VALUE, produces = MediaType.APPLICATION_JSON_VALUE)
@ResponseBody
public ResponseEntity<String> newAccount(@RequestBody Account a) {
    synchronized (accounts) {
        try {
            accounts.add(a);
            // accounts.save()
        } catch (IllegalArgumentException e) {
            return new ResponseEntity<String>(e.getMessage(), HttpStatus.BAD_REQUEST);
        } catch (IOException e) {
            return new ResponseEntity<String>(e.getMessage(), HttpStatus.INTERNAL_SERVER_ERROR);
        }
    }
    return new ResponseEntity<String>(HttpStatus.OK);
}
where accounts is: AccountStorage accounts = AccountStorage.getInstance();
I would like to ensure that, after each addition (or any other method that modifies the collection), it is saved to the file without explicitly calling a dedicated save function after every modification.
My idea is to use the Observer pattern, but I don't know which class must be the Observer and which the Observable (assuming this approach is the correct solution).
The common practice for implementing the Observer pattern is to define an Observer interface (Listener) which declares a general contract, and each observer implementation provides an action that is triggered whenever an event occurs.
A subject maintains a collection of observers (listeners) and exposes methods that allow observers to be added and removed (subscribed/unsubscribed). Event-related behavior resides in the subject, and when a new event happens, every subscribed observer (i.e. each observer currently present in the collection) is notified.
The event we are going to listen for is the case when a new Account gets added into an AccountStorage, and AccountStorage would be the subject. That implies that AccountStorage should hold a reference to a collection of observers, provide functionality to subscribe/unsubscribe, and override the add() method of the Storage class in order to notify all the observers when a new account is added.
Why can't we add a collection of observers and all the related functionality into the Storage class so that every implementation inherits it? It's a valid question. The answer is that in such a scenario we can't be specific about the nature of the event, because we don't even know its type: method add(V) expects a mysterious V. Hence, the observer interface and its method would be faceless. That was the downside of the standard Observer interface and Observable class, which are deprecated since JDK 9: their names, as well as the method name update(), tell nothing about the event being observed. It's only slightly better than defining an interface MyInterface with a method myMethod() - no clue where you can use it and what actions should follow when myMethod() is fired.
It's good practice for observer names to be descriptive, so that it's clear without looking at the code what they are meant to do. And it's not only related to the Observer pattern; it's a general practice called self-documenting code.
Let's start by defining an observer interface. I'll call it a listener just because AccountAddedListener sounds a bit smoother, and it's quite common to use the terms listener and observer interchangeably.
public interface AccountAddedListener {
    void onAccountAdded(Account account);
}
Now let's proceed with an implementation of the observer, let's say we need a notification manager:
public class NotificationManager implements AccountAddedListener {
    @Override
    public void onAccountAdded(Account account) {
        // send a notification message
    }
}
Now it's time to turn the AccountStorage into a subject. It should maintain a collection of observers; Set is a good choice because it won't allow adding the same observer twice (which would be pointless) and it can add and remove elements in constant time.
Whenever a new account gets added, the subject iterates over the collection of observers and invokes the onAccountAdded() method on each of them.
We need to define a method to add a new observer, and it's also good practice to add another one so that an observer can be unregistered when it's no longer needed.
public class AccountStorage extends Storage<Account> {
    private Set<AccountAddedListener> listeners = new HashSet<>(); // collection of observers

    @Override
    public boolean add(Account account) throws IllegalArgumentException {
        listeners.forEach(listener -> listener.onAccountAdded(account)); // notifying observers
        return super.add(account);
    }

    public boolean registerAccountAddedListener(AccountAddedListener listener) {
        return listeners.add(listener);
    }

    public boolean unregisterAccountAddedListener(AccountAddedListener listener) {
        return listeners.remove(listener);
    }

    // all other functionality of the AccountStorage
}
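To tie this back to the original goal (persisting after every addition), you can register a listener that saves the storage. A minimal sketch, assuming AccountStorage exposes the save() method that the question references as accounts.save():
// accounts is the AccountStorage.getInstance() singleton from the question.
// AccountAddedListener has a single abstract method, so a lambda works here.
accounts.registerAccountAddedListener(account -> {
    try {
        accounts.save(); // assumption: the save-to-file method hinted at in the question
    } catch (IOException e) {
        // handle or log the failure
    }
});
Note that in the add() override above the listeners are notified before super.add() stores the element, so for a persistence listener you may prefer to call super.add() first and notify only if it succeeds.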

Spring batch: reader gave one item, processor have to extract many from it [duplicate]

I'm writing a spring batch job and in one of my step I have the following code for the processor:
@Component
public class SubscriberProcessor implements ItemProcessor<NewsletterSubscriber, Account>, InitializingBean {

    @Autowired
    private AccountService service;

    @Override public Account process(NewsletterSubscriber item) throws Exception {
        if (!Strings.isNullOrEmpty(item.getId())) {
            return service.getAccount(item.getId());
        }
        // search with email address
        List<Account> accounts = service.findByEmail(item.getEmail());
        checkState(accounts.size() <= 1, "Found more than one account with email %s", item.getEmail());
        return accounts.isEmpty() ? null : accounts.get(0);
    }

    @Override public void afterPropertiesSet() throws Exception {
        Assert.notNull(service, "account service must be set");
    }
}
The above code works, but I've found out that there are some edge cases where having more than one Account per NewsletterSubscriber is allowed. So I need to remove the state check and pass more than one Account to the item writer.
One solution I found is to change both the ItemProcessor and the ItemWriter to deal with the List<Account> type instead of Account, but this has two drawbacks:
Code and tests are uglier and harder to write and maintain because of the nested lists in the writer.
More importantly, more than one Account object may be written in the same transaction, because a list given to the writer may contain multiple accounts, and I'd like to avoid this.
Is there any way, maybe using a listener, or replacing some internal component used by Spring Batch, to avoid lists in the processor?
Update
I've opened an issue on the Spring Jira for this problem.
I'm looking into the isComplete and getAdjustedOutputs methods in FaultTolerantChunkProcessor, which are marked as extension points in SimpleChunkProcessor, to see if I can use them in some way to achieve my goal.
Any hint is welcome.
The ItemProcessor takes one thing in and returns a list:
public class MyItemProcessor implements ItemProcessor<SingleThing, List<ExtractedThingFromSingleThing>> {
    public List<ExtractedThingFromSingleThing> process(SingleThing thing) {
        // parse and convert to a list
    }
}
Wrap the downstream writer to iron things out. This way, components downstream of this writer don't have to work with lists.
@StepScope
public class ItemListWriter<T> implements ItemWriter<List<T>> {

    private ItemWriter<T> wrapped;

    public ItemListWriter(ItemWriter<T> wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public void write(List<? extends List<T>> items) throws Exception {
        for (List<T> subList : items) {
            wrapped.write(subList);
        }
    }
}
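For context, here is a sketch of how the step might be wired so that the chunk types line up; the bean names, chunk size, and the use of StepBuilderFactory are illustrative and depend on your Spring Batch version and configuration style:
@Bean
public Step enrichStep(StepBuilderFactory stepBuilderFactory,
                       ItemReader<NewsletterSubscriber> reader,
                       ItemProcessor<NewsletterSubscriber, List<Account>> processor,
                       ItemWriter<Account> accountWriter) {
    return stepBuilderFactory.get("enrichStep")
            .<NewsletterSubscriber, List<Account>>chunk(10)
            .reader(reader)
            .processor(processor)
            .writer(new ItemListWriter<>(accountWriter)) // unwraps each List<Account> for the real writer
            .build();
}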
There isn't a way to return more than one item per call to an ItemProcessor in Spring Batch without getting pretty far into the weeds. If you really want to know where the relationship between an ItemProcessor and an ItemWriter exists (not recommended), take a look at the implementations of the ChunkProcessor interface. While the simple case (SimpleChunkProcessor) isn't that bad, if you use any of the fault tolerant logic (skip/retry via FaultTolerantChunkProcessor), it gets very unwieldy very quickly.
A much simpler option would be to move this logic to an ItemReader that does this enrichment before returning the item. Wrap whatever ItemReader you're using in a custom ItemReader implementation that does the service lookup before returning the item. In this case, instead of returning a NewsletterSubscriber from the reader, you'd be returning an Account based on the previous information.
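One way to read that suggestion, sketched below, is a delegating reader that flattens each NewsletterSubscriber into its Account(s) and hands them out one per read() call; AccountEnrichingReader is a made-up name, and the lookup logic simply mirrors the processor from the question:
public class AccountEnrichingReader implements ItemReader<Account> {

    private final ItemReader<NewsletterSubscriber> delegate;
    private final AccountService service;
    private final Deque<Account> pending = new ArrayDeque<>();

    public AccountEnrichingReader(ItemReader<NewsletterSubscriber> delegate, AccountService service) {
        this.delegate = delegate;
        this.service = service;
    }

    @Override
    public Account read() throws Exception {
        while (pending.isEmpty()) {
            NewsletterSubscriber subscriber = delegate.read();
            if (subscriber == null) {
                return null; // end of input
            }
            if (!Strings.isNullOrEmpty(subscriber.getId())) {
                Account byId = service.getAccount(subscriber.getId());
                if (byId != null) {
                    pending.add(byId);
                }
            } else {
                pending.addAll(service.findByEmail(subscriber.getEmail()));
            }
        }
        return pending.poll(); // one Account per read() call, so each is written in its own item slot
    }
}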
Instead of returning an Account, you could return an AccountWrapper or a Collection. The writer obviously must take this into account :)
You can make a transformer to transform your POJO (the POJO object from the file) into your entity by writing the following code:
public class Intializer {
    public static LGInfo initializeEntity() throws Exception {
        Constructor<LGInfo> constr1 = LGInfo.class.getConstructor();
        LGInfo info = constr1.newInstance();
        return info;
    }
}
And in your item processor:
public class LgItemProcessor implements ItemProcessor<LgBulkLine, LGInfo> {

    private static final Log log = LogFactory.getLog(LgItemProcessor.class);

    @Override
    public LGInfo process(LgBulkLine item) throws Exception {
        log.info(item);
        return Intializer.initializeEntity();
    }
}

Is there a way to send data to a Kafka topic directly from within Processor?

I'm trying to implement the following logic with the help of Kafka Streams:
Listen to reference data from a topic, e.g. ref-data-topic, and create a global StateStore from it.
Listen to messages from another topic, data-topic, which must be validated against the ref data and sent either to a success or to an errors topic.
Here is example pseudocode:
class SomeProcessor implements Processor<String, String> {

    private KeyValueStore<String, String> refDataStore;

    @Override
    public void init(final ProcessorContext context) {
        refDataStore = (KeyValueStore) context.getStateStore("ref-data-store");
    }

    @Override
    public void process(String key, String value) {
        Object refData = refDataStore.get("some_key");
        // business logic here
        if (ok) {
            sendValueToTopic("success");
        } else {
            sendValueToTopic("errors");
        }
    }
}
Or what would be the canonical way to achieve such behavior?
An alternative I currently have in mind is to enrich the data within the Processor with validation info and then send everything to a single topic, making the client deal with e.g. a validationStatus field in the received message.
However, I would really like a solution with two topics, because in that case I could, for example, use Kafka Connect to link the success topic directly to some datastore and deal with the error topic differently. With only one topic, again, I have no idea how to achieve this "store only successfully validated entities" use case.
Any ideas and suggestions?
If you use the Processor API, you can forward data to different processors by name:
class SomeProcessor implements Processor<String, String> {

    private KeyValueStore<String, String> refDataStore;
    private ProcessorContext processorContext;

    @Override
    public void init(final ProcessorContext context) {
        refDataStore = (KeyValueStore) context.getStateStore("ref-data-store");
        processorContext = context;
    }

    @Override
    public void process(String key, String value) {
        Object refData = refDataStore.get("some_key");
        // business logic here
        if (ok) {
            processorContext.forward(key, value, To.child("success"));
        } else {
            processorContext.forward(key, value, To.child("error"));
        }
    }
}
When you build your topology, you add two sink nodes, named "success" and "error", that write to the success and error topics respectively.
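A minimal wiring sketch of that idea (node and topic names are illustrative, and the setup of the ref-data store is omitted):
Topology topology = new Topology();

topology.addSource("source", "data-topic");
topology.addProcessor("validator", SomeProcessor::new, "source");
topology.addSink("success", "success-topic", "validator"); // target of To.child("success")
topology.addSink("error", "errors-topic", "validator");    // target of To.child("error")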
Or you forward data to a single sink node and add that sink with a TopicNameExtractor instead of a hard-coded topic name. (Requires version 2.0.)
If you use the DSL, you can use KStream#branch() to split a stream and pipe different data to different topics via KStream#to(...) (or you use dynamic routing via KStream#to(TopicNameExtractor) -- requires version 2.0).
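For completeness, a sketch of the DSL variant with branch(); the isValid() check stands in for whatever validation against the reference data you need:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("data-topic");

// branch() returns one stream per predicate, evaluated in order
KStream<String, String>[] branches = input.branch(
        (key, value) -> isValid(value), // assumption: your validation logic
        (key, value) -> true);          // everything else

branches[0].to("success");
branches[1].to("errors");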

Why am I getting a NotSerializableException here?

I'm trying to map a function across a JavaRDD in Spark, and I keep getting a NotSerializableException on the map call.
public class SparkPrunedSet extends AbstractSparkSet {

    private final ColumnPruner pruner;

    public SparkPrunedSet(@JsonProperty("parent") SparkSet parent, @JsonProperty("pruner") ColumnPruner pruner) {
        super(parent);
        this.pruner = pruner;
    }

    public JavaRDD<Record> getRdd(SparkContext context) {
        JavaRDD<Record> rdd = getParent().getRdd(context);
        Function<Record, Record> mappingFunction = makeRecordTransformer(pruner);
        // The line below throws the error
        JavaRDD<Record> mappedRdd = rdd.map(mappingFunction);
        return mappedRdd;
    }

    private Function<Record, Record> makeRecordTransformer(ColumnPruner pruner) {
        return new Function<Record, Record>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Record call(Record record) throws Exception {
                // Obviously I'd like to do something more useful in here, but this is enough
                // to throw the error
                return record;
            }
        };
    }
}
When it runs, I get:
java.io.NotSerializableException: com.package.SparkPrunedSet
Record is an interface that extends Serializable, and MapRecord is an implementation of it. Similar code exists and works in the codebase, except it uses rdd.filter instead. I've read through most of the other Stack Overflow entries on this, and none of them seem to help. I thought it might have to do with trouble serializing SparkPrunedSet (although I don't understand why it would even need to do this), so I set all of its fields to transient, but that didn't help either. Does anyone have any ideas?
The Function you are creating for the transformation is, in fact, an (anonymous) inner class of SparkPrunedSet. Therefore every instance of that function has an implicit reference to the SparkPrunedSet object that created it.
Therefore, serialization of it will require serialization of SparkPrunedSet.
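One common way around this, sketched under the assumption that ColumnPruner itself is serializable, is to make the function a static nested class (or a top-level class) so it no longer captures the enclosing SparkPrunedSet instance:
// Static nested class: no implicit reference to the enclosing SparkPrunedSet,
// so only the pruner field has to be serialized along with the function.
private static class PruningFunction implements Function<Record, Record> {
    private static final long serialVersionUID = 1L;
    private final ColumnPruner pruner; // assumption: ColumnPruner implements Serializable

    PruningFunction(ColumnPruner pruner) {
        this.pruner = pruner;
    }

    @Override
    public Record call(Record record) throws Exception {
        // apply the pruner to the record here
        return record;
    }
}

// and in getRdd(...):
// JavaRDD<Record> mappedRdd = rdd.map(new PruningFunction(pruner));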

How to properly test with mocks Akka actors in Java?

I'm very new with Akka and I'm trying to write some unit tests in Java. Consider the following actor:
public class Worker extends UntypedActor {
    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof Work) {
            Work work = (Work) message;
            Result result = new Helper().processWork(work);
            getSender().tell(result, getSelf());
        } else {
            unhandled(message);
        }
    }
}
What is the proper way to intercept the call new Helper().processWork(work)? On a side note, is there any recommended way to achieve dependency injection within Akka actors with Java?
Thanks in advance.
Your code is already properly testable:
you can test your business logic separately, since you can just instantiate your Helper outside of the actor
once you are sure that the Helper does what it is supposed to do, just send some inputs to the actor and observe that the right replies come back
Now if you need to have a “mocked” Worker to test some other component, just don’t use a Worker at all, use a TestProbe instead. Where you would normally get the ActorRef of the Worker, just inject probe.getRef().
So, how to inject that?
I’ll assume that your other component is an Actor (because otherwise you won’t have trouble applying whatever injection technique you normally use). Then there are three basic choices:
pass it in as constructor argument
send it within a message
if the actor creates the ref as its child, pass in the Props, possibly in an alternative constructor
The third case is probably what you are looking at (I’m guessing based on the actor class’ name):
public class MyParent extends UntypedActor {
    final Props workerProps;

    public MyParent() {
        workerProps = new Props(...);
    }

    public MyParent(Props p) {
        workerProps = p;
    }
    ...
    getContext().actorOf(workerProps, "worker");
}
And then you can inject a TestProbe like this:
final TestProbe probe = new TestProbe(system);
final Props workerMock = new Props(new UntypedActorFactory() {
    public UntypedActor create() {
        return new UntypedActor() {
            @Override
            public void onReceive(Object msg) {
                probe.getRef().tell(msg, getSender());
            }
        };
    }
});
final ActorRef parent = system.actorOf(new Props(new UntypedActorFactory() {
    public UntypedActor create() {
        return new MyParent(workerMock);
    }
}), "parent");
