I would like to have a parallel Flink source that consumes from an in-memory blocking queue. My idea is to have the application push elements into this queue while the Flink pipeline consumes and processes them.
What is the best pattern to follow for this? I've looked at some Flink source implementations (like Kafka, RabbitMQ, etc.) and all of them initialise the required connections from within the source instance. I cannot do this (i.e., initialise the queue from within each source instance), since each source instance would create its own queue, and I need a reference to the queue from outside of Flink to push elements to it.
Currently, I have come up with the following, but the use of static queues doesn't feel right to me.
1. A queue from which each Flink source instance gets its elements.
public class TheQueue implements Serializable {

    private static final Logger LOGGER = LoggerFactory.getLogger(TheQueue.class);

    // Static so that every (deserialized) source instance in this JVM sees the same queue.
    private static final BlockingQueue<Object> OBJECT_QUEUE = new LinkedBlockingQueue<>();

    public static SerializableSupplier<Object> getObjectConsumer() {
        return () -> {
            try {
                return OBJECT_QUEUE.take();
            } catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        };
    }
}
2. My Flink pipeline excerpt.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setParallelism(10);
env.addSource(TestParallelSourceFunction.getInstance(TheQueue.getObjectConsumer()))
3. The Flink source function.
public class TestParallelSourceFunction<T> extends RichParallelSourceFunction<T> {

    private static final Logger LOGGER = LoggerFactory.getLogger(TestParallelSourceFunction.class);

    private SerializableSupplier<T> supplier;
    private volatile boolean isRunning;

    // initialisation code

    @Override
    public void run(final SourceContext<T> ctx) throws Exception {
        LOGGER.info("Starting Flink source.");
        isRunning = true;
        while (isRunning) {
            final T t = supplier.get();
            if (t != null) {
                ctx.collect(t);
            }
        }
        LOGGER.info("Stopped Flink source.");
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
Your understanding of message queue systems like Kafka and RabbitMQ and their role in streaming applications is flawed, I think. They are standalone services that exist outside of Flink. Flink doesn't start or configure them, it just opens connections to read from them.
So the idea would be that you start a Kafka cluster and give the necessary connection details and topic names to both your Flink job and whatever application is pushing elements into Kafka. The application pushing elements onto the queue talks to the Kafka cluster over TCP/IP, and so does Flink.
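As a rough illustration, a minimal sketch of both sides, assuming the flink-connector-kafka dependency and a plain Kafka producer (the topic name "events" is illustrative, env is the StreamExecutionEnvironment, and the exact consumer class varies slightly between Flink versions):

// Application side: push elements to Kafka (assumes kafka-clients on the classpath).
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    producer.send(new ProducerRecord<>("events", "some element"));
}

// Flink side: read the same topic as a source.
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "my-flink-job");
DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), consumerProps));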
The problem (from my understanding) is that Flink takes all the operators, serializes them, and sends them to a "worker", which deserializes them.
This is why sources usually create their connection inside themselves rather than receiving an external one.
What you can do, if you run the Flink pipeline inside your process (local execution environment), is to create a class which extends RichSourceFunction, has an ID as a serializable field and a static map between the ID and the blocking queue. It will look something like this (written without an IDE, so the syntax might be slightly off):
public class BlockingQueueSource<T> extends RichSourceFunction<T> {

    // Static, so every deserialized copy of this source in the same JVM shares the map.
    // It cannot be typed with T, because static members cannot use the class type parameter.
    private static final Map<String, BlockingQueue<Object>> idToQueue = new ConcurrentHashMap<>();

    private final String id;
    private volatile boolean isRunning;

    public BlockingQueueSource(String id) {
        this.id = id;
        this.isRunning = true;
    }

    @Override
    public void open(Configuration parameters) {
        idToQueue.put(id, new LinkedBlockingQueue<>());
    }

    @Override
    public void close() {
        isRunning = false;
        idToQueue.remove(id);
    }

    @Override
    public void run(SourceContext<T> context) throws Exception {
        BlockingQueue<Object> queue = idToQueue.get(id);
        while (isRunning) {
            @SuppressWarnings("unchecked")
            T item = (T) queue.take();
            context.collect(item);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    public void addItem(T item) throws InterruptedException {
        idToQueue.get(id).put(item);
    }
}
Again, this will work only if the source is located in the same process where you created all the Flink pipeline, meaning you run it with local execution environment.
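For completeness, a minimal usage sketch under that same assumption (local environment, single JVM); the "events" id and the executeAsync() call (Flink 1.10+) are illustrative, exceptions are omitted for brevity, and items should only be added once the job has started so that open() has registered the queue:

BlockingQueueSource<String> source = new BlockingQueueSource<>("events");

StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.addSource(source).print();

// executeAsync() submits the job without blocking the caller,
// so this thread can keep pushing items into the shared static queue.
env.executeAsync();

source.addItem("hello");
source.addItem("world");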
Related
My application starts a couple of clients which communicate with Steam. There are two types of task I can ask clients to perform. For the first type I don't care about blocking, for example asking a client about its friends. For the second type, I can only submit one task to a client at a time and I need to wait for it to finish asynchronously. I'm not sure whether there is an existing design pattern for this, but you can see what I've already tried. When I ask for the second type of task I remove the client from the queue and return it here after the task is done. But I don't know if this is a good solution, because I can 'lose' some clients if I do something wrong.
@Component
public class SteamClientWrapper {
private Queue<DotaClientImpl> clients = new LinkedList<>();
private final Object clientLock = new Object();
public SteamClientWrapper() {
}
@PostConstruct
public void init() {
// starting clients here clients.add();
}
public DotaClientImpl getClient() {
return getClient(false);
}
public DotaClientImpl getClient(boolean freeLast) {
synchronized (clients) {
if (!clients.isEmpty()) {
return freeLast ? clients.poll() : clients.peek();
}
}
return null;
}
public void postClient(DotaClientImpl client) {
if (client == null) {
return;
}
synchronized (clientLock) {
clients.offer(client);
clientLock.notify();
}
}
public void doSomethingBlocking() {
DotaClientImpl client = getClient(true);
client.doSomething();
}
}
Sounds like you could use Spring's ThreadPoolTaskExecutor to do that.
An Executor is basically what you tried to do - store tasks in a queue and process the next as soon as the previous has finished.
Often this is used to run tasks in parallel, but it can also reduce overhead for serial processing.
A sample doing it this way can be found at
https://dzone.com/articles/spring-and-threads-taskexecutor
To ensure only one client task runs at a time, simply set the configuration to
executor.setCorePoolSize(1);
executor.setMaxPoolSize(1);
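A minimal sketch of that setup, assuming Spring's ThreadPoolTaskExecutor and that the executor lives inside the question's SteamClientWrapper (so getClient/postClient are in scope; the queue capacity is illustrative):

// Single-threaded executor: blocking client tasks are queued and run strictly one at a time.
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(1);
executor.setMaxPoolSize(1);
executor.setQueueCapacity(100); // pending tasks wait here until the worker thread is free
executor.initialize();

// Instead of calling doSomethingBlocking() directly, submit it as a task:
executor.submit(() -> {
    DotaClientImpl client = getClient(true);
    client.doSomething();
    postClient(client); // hand the client back once the blocking work is done
});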
I ran into this problem when I was trying to create a custom event source. It contains a queue that allows my other process to add items into it, and I expect my CEP pattern to print some debug messages when there is a match.
But there is no match, no matter what I add to the queue. Then I noticed that the queue inside mySource.run() is always empty, which means the queue I used to create the mySource instance is not the same as the one inside the StreamExecutionEnvironment. If I change the queue to static, forcing all instances to share the same queue, everything works as expected.
DummySource.java
public class DummySource implements SourceFunction<String> {
private static final long serialVersionUID = 3978123556403297086L;
// private static Queue<String> queue = new LinkedBlockingQueue<String>();
private Queue<String> queue;
private boolean cancel = false;
public void setQueue(Queue<String> q){
queue = q;
}
@Override
public void run(org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext<String> ctx)
throws Exception {
System.out.println("run");
synchronized (queue) {
while (!cancel) {
if (queue.peek() != null) {
String e = queue.poll();
if (e.equals("exit")) {
cancel();
}
System.out.println("collect "+e);
ctx.collectWithTimestamp(e, System.currentTimeMillis());
}
}
}
}
@Override
public void cancel() {
System.out.println("canceled");
cancel = true;
}
}
So I dug into the source code of StreamExecutionEnvironment. Inside the addSource() method there is a clean() method which looks like it replaces the instance with a new one. Its documentation says:
Returns a "closure-cleaned" version of the given function.
Why is that? And why does it need to be serialized?
I've also tried to turn off closure cleaning using getConfig(). The result is still the same: my queue instance is not the same one that env is using.
How do I solve this problem?
The clean() method used on functions in Flink is mainly there to ensure the Function (like SourceFunction, MapFunction) is serialisable. Flink serialises those functions and distributes them onto the task nodes to execute them.
For simple variables in your Flink main code, like an int, you can simply reference them in your function. But for large or non-serialisable ones, it is better to use broadcast variables and a rich source function. Please refer to https://cwiki.apache.org/confluence/display/FLINK/Variables+Closures+vs.+Broadcast+Variables
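To make the distinction concrete, a small sketch (assuming a DataStream<Integer> called stream; not taken from the linked page):

// A small, serializable value captured by the lambda is serialized together with
// the function and shipped to every task node, so each parallel task sees the same value.
final int threshold = 5;
DataStream<Integer> large = stream.filter(value -> value > threshold);

// A local, non-serializable object such as an in-memory queue cannot be shipped this way;
// after serialization each parallel task would hold its own copy,
// which is exactly why the queue inside mySource.run() stays empty.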
I have the following use case:
From an application I am consuming messages with X threads, where I have a Consumer implementation defined like this:
public interface Consumer {
    void onMessage(Object message);
}
The problem is that Consumer is not a different instance per thread, but a single instance, as it is a Spring bean and we also expect it not to have side effects per single call of onMessage.
However, what I want to build is a duplicate message detection mechanism, which kind of looks like this:
public static <T> Flux<OcurrenceCache<T>> getExceedingRates(Flux<T> values, int maxHits, int bufferSize, Duration bufferTimeout) {
return values.bufferTimeout(bufferSize, bufferTimeout)
.map(vals -> {
OcurrenceCache<T> occurrenceCache = new OcurrenceCache<>(maxHits);
for (T value : vals) {
occurrenceCache.incrementNrOccurrences(value);
}
return occurrenceCache;
});
}
Where basically from a Flux of values I am returning an occurrence cache with the elements that are encountered more than the max desired number of hits.
Naively, I can implement things like that:
public class MyConsumer implements Consumer {
private final EmitterProcessor<Object> emitterProcessor;
public MyConsumer(Integer maxHits, Integer bufferSize, Long timeoutMillis){
this.emitterProcessor = EmitterProcessor.create();
this.emitterProcessor
.bufferTimeout(bufferSize, Duration.ofMillis(timeoutMillis))
.subscribe(integers -> {
getExceedingRates(Flux.fromIterable(integers), maxHits, bufferSize, Duration.ofMillis(timeoutMillis))
.subscribe(integerOcurrenceCache -> {
System.out.println(integerOcurrenceCache.getExceedingValues());
});
});
}
@Override
public void onMessage(Object message){
emitterProcessor.onNext(message);
}
}
However, this is far from optimal, because I know that my messages from a specific thread will NEVER contain any of the messages that came from another thread (they are pre-grouped as we use JMS grouping and Kinesis sharding). So, in a way, I'd like to use a Processor that will:
use the very same thread on which onMessage was called, so that the values of its flux are isolated and never mixed up with the values pushed from another thread.
You can use thread local processors:
private final ThreadLocal<EmitterProcessor<Object>> emitterProcessorHolder = ThreadLocal.withInitial(() -> {
EmitterProcessor<Object> processor = ...
return processor;
});
...
@Override
public void onMessage(Object message){
emitterProcessorHolder.get().onNext(message);
}
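One possible way to fill in the initializer, reusing the buffering logic from the question's own constructor (a sketch only; maxHits, bufferSize and timeoutMillis are assumed to be fields of the consumer):

private final ThreadLocal<EmitterProcessor<Object>> emitterProcessorHolder = ThreadLocal.withInitial(() -> {
    // One processor per consuming thread, each with its own buffering pipeline,
    // so values from different threads never land in the same buffer.
    EmitterProcessor<Object> processor = EmitterProcessor.create();
    processor
            .bufferTimeout(bufferSize, Duration.ofMillis(timeoutMillis))
            .subscribe(batch ->
                    getExceedingRates(Flux.fromIterable(batch), maxHits, bufferSize, Duration.ofMillis(timeoutMillis))
                            .subscribe(cache -> System.out.println(cache.getExceedingValues())));
    return processor;
});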
I have a javax.jms.Queue and a listener listening to this queue. I get the message (a String) and execute a process, passing the string as an input parameter to that process.
I want to run only 10 instances of that process at one time. Only once those have finished should the next messages be processed.
How can this be achieved? Currently it reads all the messages at once and starts as many instances of the process as there are messages, causing the server to hang.
// using javax.jms.MessageListener
message = consumer.receive(5000);
if (message != null) {
    try {
        handler.onMessage(message); // handler is a MessageListener instance
    } catch (Exception e) {
        // handle/log the failure
    }
}
Try putting this annotation on your MDB listener:
@ActivationConfigProperty(propertyName = "maxSession", propertyValue = "10")
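In context, that property sits in the MDB's activation config. A minimal sketch (the destination name is illustrative, and maxSession is a container-specific property, e.g. on JBoss/WildFly):

@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination", propertyValue = "queue/MyQueue"),
    // limits the number of concurrent sessions, i.e. at most 10 onMessage calls at a time
    @ActivationConfigProperty(propertyName = "maxSession", propertyValue = "10")
})
public class MyQueueListener implements MessageListener {

    @Override
    public void onMessage(Message message) {
        // start one of the (at most 10) processes here
    }
}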
I am assuming that you have a way of accepting hasTerminated messages from your external processes. This controller thread will communicate with the JMS listener using a Semaphore. The Semaphore is initialized with 10 permits, and every time an external process calls TerminationController#terminate (or however the external processes communicate with your listener process) it adds a permit to the Semaphore. The JMSListener must first acquire a permit before it can call messageConsumer.receive(), which ensures that no more than ten processes are active at a time.
// created in parent class
private final Semaphore semaphore = new Semaphore(10);
@Controller
public class TerminationController {

    private final Semaphore semaphore;

    public TerminationController(Semaphore semaphore) {
        this.semaphore = semaphore;
    }

    // Called from external processes when they terminate
    public void terminate() {
        semaphore.release();
    }
}
public class JMSListener implements Runnable {

    private final MessageConsumer messageConsumer;
    private final Semaphore semaphore;

    public JMSListener(MessageConsumer messageConsumer, Semaphore semaphore) {
        this.messageConsumer = messageConsumer;
        this.semaphore = semaphore;
    }

    @Override
    public void run() {
        try {
            while (true) {
                semaphore.acquire();                          // blocks until a process slot is free
                Message message = messageConsumer.receive();
                // create process from message
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (JMSException e) {
            // handle/log the JMS failure
        }
    }
}
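A possible wiring of the two pieces (the JMS Session and Queue setup is assumed, not shown in the answer):

Semaphore semaphore = new Semaphore(10);                  // at most 10 external processes at once
TerminationController controller = new TerminationController(semaphore);

MessageConsumer messageConsumer = session.createConsumer(queue); // assumes an existing JMS Session and Queue
new Thread(new JMSListener(messageConsumer, semaphore)).start();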
I think a simple while check would suffice. Here's some pseudocode.
While (running processes are less than 10) {
add one to the running processes list
do something with the message
}
and in the code for onMessage:
function declaration of on Message(Parameters) {
do something
subtract 1 from the running processes list
}
Make sure that the variable you're using to count the amount of running processes is declared as volatile.
Example as requested:
public static volatile int numOfProcesses = 0;
while (true) {
if (numOfProcesses < 10) {
// read a message and make a new process, etc
// probably put your receive code here
numOfProcesses++;
}
}
Wherever the code for your processes is written:
// do stuff, do stuff, do more stuff
// finished stuff
numOfProcesses--;
I'm currently in a situation where I'm actually making things more complicated by using Actors than if I didn't. I need to execute a lot of HTTP requests without blocking the main thread. Since this is concurrency and I wanted to try something other than locks, I decided to go with Akka. Now I'm torn between two approaches.
Approach one (create new Actors when needed):
public class Main {
public void start() {
ActorSystem system = ActorSystem.create();
// Create 5 Manager Actors (currently the same Actor for all, but this is different in actual practice)
ActorRef managers = system.actorOf(new BroadcastPool(5).props(Props.create(Actor.class)));
managers.tell(new Message(), ActorRef.noSender());
}
}
public class Actor extends UntypedActor {
@Override
public void onReceive(Object message) throws Exception {
if (message instanceof Message) {
ActorRef ref = getContext().actorOf(new SmallestMailboxPool(10).props(Props.create(Actor.class)));
// Repeat the below 10 times
ref.tell(new Message2(), getSelf());
} else if (message instanceof Message2) {
// Execute long running Http Request
}
}
}
public final class Message {
public Message() {
}
}
public final class Message2 {
public Message2() {
}
}
Approach two (create a whole lot of Actors beforehand and hope it's enough):
public class Main {
public void start() {
ActorSystem system = ActorSystem.create();
ActorRef actors = system.actorOf(new SmallestMailboxPool(100).props(Props.create(Actor.class)));
ActorRef managers = system.actorOf(new BroadcastPool(5).props(Props.create(() -> new Manager(actors))));
managers.tell(new Message(), ActorRef.noSender());
}
}
public class Manager extends UntypedActor {
private ActorRef actors;
public Manager(ActorRef actors) {
this.actors = actors;
}
@Override
public void onReceive(Object message) throws Exception {
if (message instanceof Message) {
// Repeat 10 times
actors.tell(new Message2(), getSelf());
}
}
}
public class Actor extends UntypedActor {
@Override
public void onReceive(Object message) throws Exception {
if (message instanceof Message2) {
// Http request
}
}
}
public final class Message {
public Message() {
}
}
public final class Message2 {
public Message2() {
}
}
So both approaches have upsides and downsides. One makes sure it can always handle new requests coming in; those never have to wait. But it leaves behind a lot of Actors that are never going to be used. Two, on the other hand, reuses Actors, but with the downside that it might not have enough of them at some point and will have to queue the messages.
What is the best approach to solving this, and what is the most common way people deal with it?
If you think I could be doing this sort of stuff a lot better (with or without Akka) please tell me! I'm pretty new to Akka and would love to learn more about it.
Based on the given information, it looks like a typical example of task-based concurrency, not actor-based concurrency. Imagine you have a method for doing the HTTP request. The method fetches the given URL and returns an object without causing any data races on shared memory:
private static Page loadPage(String url) {
// ...
}
You can easily fetch the pages concurrently with an Executor. There are different kinds of Executors, e.g. you can use one with a fixed number of threads.
public static void main(String... args) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(5);
List<Future<Page>> futures = new ArrayList<>();
// submit tasks
for (String url : args) {
futures.add(executor.submit(() -> loadPage(url)));
}
// access result of tasks (or wait until it is available)
for (Future<Page> future : futures) {
Page page = future.get();
// ...
}
executor.shutdown();
}
There is no further synchronization required. The Executor framework takes care of that.
I'd use a mixed approach: create a relatively small pool of actors beforehand, increase it when needed, but keep the pool's size limited (deny requests when there are too many connections, to avoid crashing due to running out of memory).
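A minimal sketch of such a bounded, elastic pool, assuming Akka's DefaultResizer (the bounds 5 and 50 and the actor name are illustrative):

// Start with 5 routees; the router may grow the pool under load, but never beyond 50.
DefaultResizer resizer = new DefaultResizer(5, 50);
ActorRef workers = system.actorOf(
        new SmallestMailboxPool(5).withResizer(resizer).props(Props.create(Actor.class)),
        "workers");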