flink SourceFunction<> is being replaced in StreamExecutionEnvironment.addSource()? - java

I ran into this problem while trying to create a custom event source. It contains a queue that allows my other process to add items into it, and I expect my CEP pattern to print some debug messages when there is a match.
But there is no match, no matter what I add to the queue. I then noticed that the queue inside mySource.run() is always empty, which means the queue I used to create the mySource instance is not the same as the one inside StreamExecutionEnvironment. If I change the queue to static, forcing all instances to share the same queue, everything works as expected.
DummySource.java
public class DummySource implements SourceFunction<String> {

    private static final long serialVersionUID = 3978123556403297086L;
    // private static Queue<String> queue = new LinkedBlockingQueue<String>();
    private Queue<String> queue;
    private boolean cancel = false;

    public void setQueue(Queue<String> q) {
        queue = q;
    }

    @Override
    public void run(org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext<String> ctx)
            throws Exception {
        System.out.println("run");
        synchronized (queue) {
            while (!cancel) {
                if (queue.peek() != null) {
                    String e = queue.poll();
                    if (e.equals("exit")) {
                        cancel();
                    }
                    System.out.println("collect " + e);
                    ctx.collectWithTimestamp(e, System.currentTimeMillis());
                }
            }
        }
    }

    @Override
    public void cancel() {
        System.out.println("canceled");
        cancel = true;
    }
}
So I dug into the source code of StreamExecutionEnvironment. Inside the addSource() method there is a clean() method which looks like it replaces the instance with a new one:
Returns a "closure-cleaned" version of the given function.
Why is that, and why does the function need to be serialized?
I've also tried to turn off the closure cleaner using getConfig(). The result is still the same: my queue instance is not the same one the environment is using.
How do I solve this problem?

The clean() method applied to functions in Flink mainly ensures that the function (such as a SourceFunction or MapFunction) is serializable. Flink serializes those functions and distributes them to the task nodes that execute them.
For simple variables in your Flink main code, like an int, you can simply reference them in your function. But for large or non-serializable ones, it is better to use broadcast variables and rich source functions. Please refer to https://cwiki.apache.org/confluence/display/FLINK/Variables+Closures+vs.+Broadcast+Variables
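To make the copying concrete, here is a minimal sketch using plain Java serialization (no Flink API) of what effectively happens to the source instance before run() is called; the deserialized copy ends up with its own queue, which is why only a static field appears to "work". QueueHolder and the class name are illustrative stand-ins, not Flink types.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedList;
import java.util.Queue;

public class SerializationCopyDemo {

    // Stand-in for a SourceFunction that keeps its queue in an instance field.
    static class QueueHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        Queue<String> queue = new LinkedList<>();
    }

    public static void main(String[] args) throws Exception {
        QueueHolder original = new QueueHolder();

        // Serialize, roughly what Flink does when it ships the function to a task...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // ...and deserialize on the "worker" side.
        QueueHolder shipped;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            shipped = (QueueHolder) in.readObject();
        }

        // Items added to the original queue after shipping never reach the copy.
        original.queue.add("hello");
        System.out.println(shipped.queue.isEmpty()); // prints true: a different queue object
    }
}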

Related

Java Flink External Source

I would like to have a parallel Flink source that consumes from an in-memory blocking queue. My idea is to have the application push elements into this queue while the Flink pipeline consumes and processes them.
What is the best pattern to follow for this? I've looked at some Flink source implementations (like Kafka, RabbitMQ, etc.) and all of them initialise the required connections from within the source instance. I cannot do this (i.e., initialise the queue from within each source instance), since each source instance would create its own queue, and I need a reference to the queue from outside of Flink to push elements to it.
Currently, I have come up with the following, but the use of static queues doesn't feel right to me.
1. A queue from where each Flink source instance is getting its elements.
public class TheQueue implements Serializable {

    private static final Logger LOGGER = LoggerFactory.getLogger(TheQueue.class);

    private transient static final BlockingQueue<Object> OBJECT_QUEUE = new LinkedBlockingQueue<>();

    public static SerializableSupplier<Object> getObjectConsumer() {
        return () -> {
            return OBJECT_QUEUE.take();
        };
    }
}
2. My Flink pipeline excerpt.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setParallelism(10);
env.addSource(TestParallelSourceFunction.getInstance(TheQueue.getObjectConsumer()))
3. The Flink source function.
public class TestParallelSourceFunction<T> extends RichParallelSourceFunction<T> {

    private static final Logger LOGGER = LoggerFactory.getLogger(TestParallelSourceFunction.class);

    private SerializableSupplier<T> supplier;
    private volatile boolean isRunning;

    // initialisation code

    @Override
    public void run(final SourceContext<T> ctx) throws Exception {
        LOGGER.info("Starting Flink source.");
        isRunning = true;
        while (isRunning) {
            final T t = supplier.get();
            if (t != null) {
                ctx.collect(t);
            }
        }
        LOGGER.info("Stopped Flink source.");
    }
}
Your understanding of message queue systems like Kafka and RabbitMQ and their role in streaming applications is flawed, I think. They are standalone services that exist outside of Flink. Flink doesn't start or configure them, it just opens connections to read from them.
So the idea would be that you start a Kafka cluster and give the necessary connection details and topic names to both your Flink job and whatever application is pushing elements into Kafka. The application pushing elements onto the queue talks to the Kafka cluster over TCP/IP, and so does Flink.
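For illustration, a rough sketch of that setup using the pre-unified-Source FlinkKafkaConsumer connector; the topic name, bootstrap servers and group id are placeholders, and the exact connector class and artifact depend on your Flink version.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaBackedPipeline {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Connection details shared with the producing application.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-flink-job");            // placeholder

        // Flink only opens a connection to the externally managed Kafka cluster.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("work-items", new SimpleStringSchema(), props));

        events.print();
        env.execute("kafka-backed-pipeline");
    }
}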
The problem (from my understanding) is that Flink takes all the operators, serializes them, and sends them to a "worker", which deserializes them.
This is why sources usually create a connection inside themselves rather than receive an external one.
What you can do, if you run the Flink pipeline inside your process (local execution environment), is to create a class which extends RichSourceFunction, has an ID as a serializable field, and keeps a static map between the ID and the blocking queue. It will look something like this (written without an IDE, so the syntax might be slightly off):
public class BlockingQueueSource<T> extends RichSourceFunction<T> {

    // A static field cannot use the class type parameter T, so store Object and cast.
    // The map is shared within the JVM, which is what lets code outside Flink reach
    // the queue used by the (deserialized) source instance.
    private static final Map<String, BlockingQueue<Object>> idToQueue = new ConcurrentHashMap<>();

    private final String id;
    private volatile boolean isRunning;

    public BlockingQueueSource(String id) {
        this.id = id;
        this.isRunning = true;
    }

    @Override
    public void open(Configuration parameters) {
        idToQueue.put(id, new LinkedBlockingQueue<>());
    }

    @Override
    public void close() {
        isRunning = false;
        idToQueue.remove(id);
    }

    @Override
    public void run(SourceContext<T> context) throws Exception {
        BlockingQueue<Object> queue = idToQueue.get(id);
        while (isRunning) {
            @SuppressWarnings("unchecked")
            T item = (T) queue.take();   // blocks until something is added
            context.collect(item);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    public void addItem(T item) throws InterruptedException {
        idToQueue.get(id).put(item);
    }
}
Again, this will work only if the source is located in the same process where you created the Flink pipeline, meaning you run it with the local execution environment.
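A rough usage sketch for the class above, assuming the local execution environment (so the static map is shared between your code and the running source) and a Flink version that has executeAsync(); the id string and the sleep-based wait are illustrative only.

StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

BlockingQueueSource<String> source = new BlockingQueueSource<>("my-source"); // arbitrary id
env.addSource(source)
   .returns(Types.STRING)   // help Flink with the erased generic type
   .print();

env.executeAsync("blocking-queue-demo"); // start the job without blocking this thread

Thread.sleep(1000); // crude wait until the source's open() has registered its queue
source.addItem("hello");
source.addItem("world");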

Consumer-Producer with Threads and BlockingQueues

I wrote a class 'Producer' which continuously parses files from a specific folder. The parsed result is stored in a queue for the Consumer.
public class Producer extends Thread
{
    private BlockingQueue<MyObject> queue;
    ...

    public void run()
    {
        while (true)
        {
            // Store email attachments into directory
            ...
            // Fill the queue
            queue.put(myObject);
            sleep(5 * 60 * 1000);
        }
    }
}
My Consumer Class is continuously checking if there is something available in the queue. If so, it's performing some work on the parsed result.
public class Consumer extends Thread
{
    private BlockingQueue<MyObject> queue;
    ...

    public void run()
    {
        while (true)
        {
            MyObject o = queue.poll();
            // Work on MyObject 'o'
            ...
            sleep(5 * 60 * 1000);
        }
    }
}
When I run my program, 'top' shows that the Java process is always at 100%. I guess it's because of the infinite loops.
Is this a good way to implement this, or is there a more resource-saving way of doing it?
Instead of
MyObject o = queue.poll();
try
MyObject o = queue.take();
The latter will block until there is something available in the queue, whereas the former will always return immediately, whether or not something is available.
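For illustration, here is the consumer loop rewritten around take(); with a blocking take() the fixed sleep is no longer needed, and the InterruptedException handling shown is one reasonable choice rather than part of the original answer.

public void run()
{
    while (!isInterrupted())
    {
        try
        {
            // Blocks until the producer has put something into the queue,
            // so the thread consumes no CPU while waiting.
            MyObject o = queue.take();
            // Work on MyObject 'o'
        }
        catch (InterruptedException e)
        {
            // Restore the interrupt status and stop the loop.
            interrupt();
            return;
        }
    }
}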

Manually trigger a #Scheduled method

I need advice on the following:
I have a @Scheduled service method with a fixedDelay of a couple of seconds, in which it scans a work queue and processes the appropriate work if it finds any. In the same service I have a method which puts work into the work queue, and I would like this method to immediately trigger a scan of the queue after it's done (since I'm sure there will now be some work for the scanner), in order to avoid the delay before the scheduled run kicks in (this can be seconds, and time is somewhat critical).
A "trigger now" feature of the Task Execution and Scheduling subsystem would be ideal, one that would also reset the fixedDelay after execution was initiated manually (since I don't want my manual execution to collide with the scheduled one). Note: work in the queue can come from an external source, hence the requirement to do periodic scanning.
Any advice is welcome
Edit:
The queue is stored in a document-based db so local queue-based solutions are not appropriate.
A solution I am not quite happy with (I don't really like the use of raw threads) would go something like this:
@Service
public class MyProcessingService implements ProcessingService {

    private final class Worker extends Thread {
        boolean ready = false;

        private boolean sleep() {
            synchronized (this) {
                if (ready) {
                    ready = false;
                } else {
                    try {
                        wait(2000);
                    } catch (InterruptedException e) {
                        return false;
                    }
                }
            }
            return true;
        }

        public void tickle() {
            synchronized (this) {
                ready = true;
                notify();
            }
        }

        public void run() {
            while (!interrupted()) {
                if (!sleep()) continue;
                scan();
            }
        }
    }

    private Worker worker;

    @PostConstruct
    public void init() {
        worker = new Worker();
        worker.start();
    }

    @PreDestroy
    public void uninit() {
        worker.interrupt();
    }

    public void addWork(Work work) {
        db.store(work);
        worker.tickle();
    }

    public void scan() {
        List<Work> work = db.getMyWork();
        for (Work w : work) {
            process(w);
        }
    }

    public void process(Work work) {
        // work processing here
    }
}
The @Scheduled method wouldn't have any work to do if there are no items in the work queue, that is, if no one put any work in the queue between the execution cycles. On the same note, if a work item is inserted into the work queue (probably by an external source) immediately after the scheduled execution completed, the work won't be attended to until the next execution.
In this scenario, what you need is a consumer-producer queue: a queue in which one or more producers put work items and a consumer takes items off the queue and processes them. What you want here is a BlockingQueue. They can be used to solve the consumer-producer problem in a thread-safe manner.
You can have one Runnable that performs the tasks currently performed by your @Scheduled method.
public class SomeClass {

    private final BlockingQueue<Work> workQueue = new LinkedBlockingQueue<Work>();

    public BlockingQueue<Work> getWorkQueue() {
        return workQueue;
    }

    private final class WorkExecutor implements Runnable {

        @Override
        public void run() {
            while (true) {
                try {
                    // take() retrieves and removes the head of this queue,
                    // waiting if necessary until an element becomes available.
                    Work work = workQueue.take();
                    // do processing
                } catch (InterruptedException e) {
                    continue;
                }
            }
        }
    }

    // The work-producer may be anything, even a @Scheduled method
    @Scheduled
    public void createWork() {
        Work work = new Work();
        workQueue.offer(work);
    }
}
And some other Runnable or another class might put in items as following:
public class WorkCreator implements Runnable {

    @Autowired
    private SomeClass workerClass;

    @Override
    public void run() {
        // produce work
        Work work = new Work();
        workerClass.getWorkQueue().offer(work);
    }
}
I guess that's the right way to solve the problem you have at hand. There are several variations/configurations that you can have, just look at the java.util.concurrent package.
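For completeness, something still has to run WorkExecutor; one possible wiring inside SomeClass, where the single-thread executor is an assumption rather than part of the original answer.

// Inside SomeClass: start one consumer thread that drains workQueue.
private final ExecutorService consumerExecutor = Executors.newSingleThreadExecutor();

public void startConsumer() {
    consumerExecutor.submit(new WorkExecutor());
}

public void stopConsumer() {
    // Note: WorkExecutor above swallows InterruptedException and continues,
    // so a real shutdown would also need a stop flag checked in its loop.
    consumerExecutor.shutdownNow();
}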
Update after question edited
Even if the external source is a db, it is still a producer-consumer problem. You can probably call the scan() method whenever you store data in the db, and the scan() method can put the data retrieved from the db into the BlockingQueue.
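A rough sketch of that suggestion, assuming the db accessor from the question and the workQueue from the example above live in the same class; db.store() and db.getMyWork() are the question's placeholders, not a real API.

// Called right after storing new work, and also usable from the @Scheduled method.
public void addWork(Work work) {
    db.store(work);  // placeholder persistence call from the question
    scan();          // feed the queue immediately instead of waiting for the next scheduled run
}

public void scan() {
    for (Work w : db.getMyWork()) {   // placeholder query from the question
        workQueue.offer(w);           // picked up by WorkExecutor via take()
    }
}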
To address the actual question about resetting the fixedDelay:
That is not actually possible, either with Java or with Spring, unless you handle the scheduling part yourself. There is no trigger-now functionality either. If you have access to the Runnable that's doing the task, you can probably call the run() method yourself. But that would be the same as calling the processing method yourself from anywhere, and you don't really need the Runnable.
Another possible workaround
private Lock queueLock = new ReentrantLock();

@Scheduled
public void findNewWorkAndProcess() {
    if (!queueLock.tryLock()) {
        return;
    }
    try {
        doWork();
    } finally {
        queueLock.unlock();
    }
}

void doWork() {
    List<Work> work = getWorkFromDb();
    // process work
}

// To be called when new data is inserted into the db.
public void newDataInserted() {
    queueLock.lock();
    try {
        doWork();
    } finally {
        queueLock.unlock();
    }
}
newDataInserted() is called whenever you insert new data. If the scheduled execution is in progress, it will wait until that is finished and then do the work. The call to lock() here is blocking, since we know there is some work in the database and the scheduled call might have started before the work was inserted. The call to acquire the lock in findNewWorkAndProcess() is non-blocking: if the lock has already been acquired by the newDataInserted method, the scheduled method shouldn't be executed.
Well, you can fine tune as you like.

How often is a thread executed? My Observer pattern gone wrong?

The following is a simplified version of my current code. I am pretty sure I am not doing anything wrong syntax-wise, and I can't locate my conceptual mistake.
This is sort of an observer pattern I tried to implement. I could not afford to inherit from java.util.Observable as my class is already complicated and inherits from another class.
There are two parts here:
There's a Notifier class implementing Runnable :
public class Notifier implements Runnable {
    public void run()
    {
        while (true)
        {
            MyDataType data = getData();
            if (data.isChanged() == true)
            {
                refresh();
            }
        }
    }
}
And then there is my main class which needs to respond to changes to MyDataType data.
public class abc {
    private MyDataType data;

    public void abc() {
        Notifier notifier = new Notifier();
        Thread thread = new Thread(notifier);
        thread.start();
    }

    public MyDataType getData() {
        return this.data;
    }

    public void refresh() {
        MyDataType data = getData();
        // Do something with data
    }
}
The problem: what's happening is that the notifier is calling refresh() when 'data' changes. However, inside refresh(), when I do getData(), I am getting the old version of 'data'!
I should mention that there are other parts of the code which are calling the refresh() function too.
What am I overlooking?
Any other better solutions to this problem?
How should I approach designing Subject-Observer systems if I can't apply the default Java implementation out of the box?
when I do getData(), I am getting the old version of 'data'!
Your data field is shared among more than one thread so it must be marked with the volatile keyword.
private volatile MyDataType data;
This causes a "memory barrier" around the read and the the write that keeps the value visible to all threads. Even though the notifier thread is calling getData(), the value for data is being retrieved out if its memory cache. Without the memory barrier, the data value will be updated randomly or never.
As #JB mentioned in the comments, the volatile protects you against a re-assignment of the data field. If you update one of the fields within the current data value, the memory barrier will not be crossed that the notifier's memory will not be updated.
Looking back at your code, it looks like this is the case:
if(data.isChanged()==true)
{
refresh();
}
If data is not being assigned to a new object, then making data volatile won't help you. You will have to either (a minimal sketch of the first option follows this list):
set some sort of volatile boolean dirty field whenever data has been updated, or
update and read data within a synchronized block each and every time.
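A minimal sketch of the first option, using an AtomicBoolean instead of a bare volatile flag so the check-and-reset is atomic; DataHolder stands in for the question's abc class and updateData() is an assumed mutator, not code from the question.

import java.util.concurrent.atomic.AtomicBoolean;

public class DataHolder {
    private final MyDataType data = new MyDataType();
    // Flipped by whoever mutates fields inside 'data'; consumed by the notifier thread.
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    public void updateData(/* new values */) {
        // ... mutate fields of 'data' ...
        dirty.set(true); // publish "something changed" to the notifier thread
    }

    // Called from the notifier thread instead of data.isChanged();
    // getAndSet avoids losing an update between the check and the reset.
    public boolean isDirty() {
        return dirty.getAndSet(false);
    }

    public MyDataType getData() {
        return data;
    }
}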
First, your data variable might be cached, so you will always need to get the latest value by making it volatile.
Second, what you are doing here is a producer / consumer pattern. This pattern is usually best implemented with messages. When you receive new data, you could create an immutable object and post it to the consumer thread (via a thread safe queue like a BlockingQueue) instead of having a shared variable.
Something along these lines:
public class Notifier extends Thread {
    private BlockingQueue<MyDataType> consumerQueue = null;

    public void setConsumerQueue(BlockingQueue<MyDataType> val) {
        consumerQueue = val;
    }

    // main method where data is received from socket...
    public void run() {
        while (!interrupted()) {
            MyDataType data = ... // got new data here
            if (!data.isChanged()) continue;
            // Post new data only when it has changed
            if (consumerQueue != null) consumerQueue.offer(data);
        }
    }
}

public class Consumer extends Thread {
    private BlockingQueue<MyDataType> consumerQueue = new LinkedBlockingQueue<>();

    public Consumer(Notifier val) {
        val.setConsumerQueue(consumerQueue);
    }

    public void run() {
        while (!interrupted()) {
            try {
                MyDataType data = consumerQueue.take(); // block until there is data from the producer
                if (data != null) processData(data);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

Queue with notifications on isEmpty() changes

I have a BlockingQueue<Runnable> (taken from ScheduledThreadPoolExecutor) in a producer-consumer environment. There is one thread adding tasks to the queue, and a thread pool executing them.
I need notifications on two events:
First item added to empty queue
Last item removed from queue
Notification = writing a message to database.
Is there any sensible way to implement that?
A simple and naïve approach would be to decorate your BlockingQueue with an implementation that simply checks the underlying queue and then posts a task to do the notification.
public class NotifyingQueue<T> extends ForwardingBlockingQueue<T> implements BlockingQueue<T> {
    private final Notifier notifier; // injected not null
    …

    @Override public void put(T element) throws InterruptedException {
        if (getDelegate().isEmpty()) {
            notifier.notEmptyAnymore();
        }
        super.put(element);
    }

    @Override public T poll() {
        final T result = super.poll();
        if ((result != null) && getDelegate().isEmpty())
            notifier.nowEmpty();
        return result;
    }

    … etc
}
This approach, though, has a couple of problems. While the empty -> notEmpty transition is pretty straightforward, particularly for the single-producer case, it would be easy for two consumers to run concurrently and both see the queue go from non-empty -> empty.
If, though, all you want is to be notified that the queue became empty at some time, then this will be enough, as long as your notifier is your state machine, tracking emptiness and non-emptiness and notifying when it changes from one to the other:
public class AtomicStateNotifier implements Notifier {
    private final AtomicBoolean empty = new AtomicBoolean(true); // assume it starts empty
    private final Notifier delegate; // injected not null

    public void notEmptyAnymore() {
        if (empty.get() && empty.compareAndSet(true, false))
            delegate.notEmptyAnymore();
    }

    public void nowEmpty() {
        if (!empty.get() && empty.compareAndSet(false, true))
            delegate.nowEmpty();
    }
}
This is now a thread-safe guard around an actual Notifier implementation that perhaps posts tasks to an Executor to asynchronously write the events to the database.
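One possible shape for that asynchronous delegate; AsyncDatabaseNotifier and writeEventToDatabase() are assumed names, with the actual persistence call left as a placeholder.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncDatabaseNotifier implements Notifier {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    @Override
    public void notEmptyAnymore() {
        executor.submit(() -> writeEventToDatabase("QUEUE_NOT_EMPTY"));
    }

    @Override
    public void nowEmpty() {
        executor.submit(() -> writeEventToDatabase("QUEUE_EMPTY"));
    }

    private void writeEventToDatabase(String event) {
        // placeholder: insert a row describing the queue state change
    }
}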
The design is most likely flawed, but you can do it relatively simply:
You have a single thread adding, so you can check before adding, i.e. pool.getQueue().isEmpty() - with one producer this is safe.
"Last item removed" cannot be guaranteed, but you can override beforeExecute and check the queue again, possibly with a small timeout after isEmpty() returns true. The code below would probably be better off executed in afterExecute instead.
protected void beforeExecute(Thread t, Runnable r) {
    if (getQueue().isEmpty()) {
        try {
            Runnable next = getQueue().poll(200, TimeUnit.MILLISECONDS);
            if (next != null) {
                execute(next);
            } else {
                // last message - or handle in afterExecute by setting a ThreadLocal and checking it there
                // alternatively you may need to do so ONLY in afterExecute, depending on your needs
            }
        } catch (InterruptedException _ie) {
            Thread.currentThread().interrupt();
        }
    }
}
Something like that.
I can explain why doing notifications with the queue itself won't work well: imagine you add a task to be executed by the pool, the task is scheduled immediately, the queue is empty again, and you would need a notification.
