I have written a Spark Streaming job which reads data from S3.
The job has a series of mapWithState calls, each followed by mapToPair, like below:
JavaDStream<String> cdrLines = ssc.textFileStream(cdrInputFile);
JavaDStream<CDR> cdrRecords = cdrLines.map(x -> cdrStreamParser.parse(x));
JavaDStream<CDR> cdrRecordsFiltered = cdrRecords.filter(t -> t != null);

JavaPairDStream<String, CDR> sTripletStream = cdrRecordsFiltered
        .mapToPair(s -> new Tuple2<String, CDR>(s.gettNumber(), s));

JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream1 = sTripletStream
        .mapWithState(StateSpec.function(hsMappingFunc).initialState(tripletRDD))
        .mapToPair(s -> s);

JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream2 = stateDstream1
        .mapWithState(StateSpec.function(cfMappingFunc).initialState(cfHistoryRDD))
        .mapToPair(s -> s);

JavaPairDStream<String, Tuple2<CDR, List<StatusCode>>> stateDstream3 = stateDstream2
        .mapWithState(StateSpec.function(imeiMappingFunc).initialState(imeiRDD))
        .mapToPair(s -> s);
I have spark.default.parallelism set to 6. The first and last mapToPair stages are fast enough, but the second and third mapToPair stages are very slow.
Each of these stages runs 6 tasks. In the second and third mapToPair stages, 5 of the tasks finish in about 2 s, but one task takes a very long time, around 3-4 minutes. The shuffle data for that task is much larger than for the other tasks, and that causes the bottleneck.
Is there a way to distribute the load among all tasks more uniformly?
This is a use case for CDR processing. Each CDR event has these fields: telno, imei, imsi, callforward, timestamp.
I maintain three kinds of info in Spark state: 1. the last known CDR event (record) for a given telephone number, 2. the callforward number list for each telephone number, 3. the list of all known IMEIs.
The three mapWithState calls correspond to the functionality below:
Step 1: As CDR events come in, I need to do some field comparisons against the last known CDR event with the same telephone number. I keep the latest event for a given telno in Spark state so that I can do these comparisons as new CDR events arrive.
Step 2: For a given telno, I want to check whether the callforward number is a known number or not, so I need to maintain a history of telno -> list of callforward numbers in the state.
Step 3: I need to maintain the list of all IMEI numbers seen so far in the state, so that for each IMEI in a CDR event we can say whether it is a known or a new IMEI (a rough sketch of one such function is below).
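For reference, here is a rough sketch of the shape one of these mapping functions can take (using step 3 as the example); the CDR getters, the StatusCode constant and the exact state type are illustrative only, not the real code:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;

import scala.Tuple2;

// Hypothetical shape of imeiMappingFunc: the state for a key holds the IMEIs already
// seen, and each incoming CDR is tagged with a status code if its IMEI is new.
Function3<String, Optional<CDR>, State<Set<String>>,
          Tuple2<String, Tuple2<CDR, List<StatusCode>>>> imeiMappingFunc =
        (telno, cdrOpt, state) -> {
            Set<String> knownImeis = state.exists() ? state.get() : new HashSet<String>();
            CDR cdr = cdrOpt.get();
            List<StatusCode> codes = new ArrayList<>();
            if (!knownImeis.contains(cdr.getImei())) {      // getImei() is assumed
                codes.add(StatusCode.NEW_IMEI);             // assumed enum constant
                knownImeis.add(cdr.getImei());
                state.update(knownImeis);                   // persist the updated set
            }
            return new Tuple2<>(telno, new Tuple2<>(cdr, codes));
        };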
I am trying to make a service that will calculate statistics for each month.
I did something like this:
public Map<String, BigDecimal> getStatistic() {
    List<Order> orders = orderService.findAll(Sort.by(Sort.Direction.ASC, "creationDate")).toList();
    SortedMap<String, BigDecimal> statisticsMap = new TreeMap<>();
    MathContext mc = new MathContext(3);
    for (Order order : orders) {
        List<FraudDishV1Response> dishesOfOrder = order.getDishIds()
                .stream()
                .map(dishId -> dishV1Client.getDishById(dishId))
                .collect(Collectors.toList());
        BigDecimal total = calculateTotal(dishesOfOrder);
        String date = order.getCreatedDate().format(DateTimeFormatter.ofPattern("yyyy-MM"));
        statisticsMap.merge(date, total, (a, b) -> a.add(b, mc));
    }
    return statisticsMap;
}
But it takes a long time if there are lots of entries in the database.
Are there any best practices for working with statistics in REST API applications?
I'd also like to know whether it is a good idea to save the statistics in a separate repository. It would save time when calculating the statistics, but when creating a record in the database you would also have to update the statistics store.
With your approach you'll eventually run out of memory while trying to load a huge amount of data from the database. You could do the processing in batches, but then again that will only get you so far. Ideally, any kind of statistical data or on-demand reporting would be served by long-running scheduled jobs which periodically do the processing in the background and generate the desired data for you. You could dump the result into a table and then serve it from there via an API.
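For example, with Spring a minimal sketch of such a scheduled job could look like this (repository and entity names are placeholders, and you'd want to read the orders in pages rather than all at once):

import java.math.BigDecimal;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.TreeMap;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class MonthlyStatisticsJob {

    private final OrderRepository orderRepository;                  // placeholder
    private final MonthlyStatisticRepository statisticRepository;   // placeholder

    public MonthlyStatisticsJob(OrderRepository orderRepository,
                                MonthlyStatisticRepository statisticRepository) {
        this.orderRepository = orderRepository;
        this.statisticRepository = statisticRepository;
    }

    // Recompute the per-month totals once an hour in the background and persist them;
    // the REST endpoint then just reads the precomputed rows instead of scanning orders.
    @Scheduled(cron = "0 0 * * * *")
    public void recomputeMonthlyTotals() {
        Map<String, BigDecimal> totals = new TreeMap<>();
        for (Order order : orderRepository.findAll()) {
            String month = order.getCreatedDate()
                    .format(DateTimeFormatter.ofPattern("yyyy-MM"));
            totals.merge(month, order.getTotal(), BigDecimal::add);  // getTotal() assumed
        }
        totals.forEach((month, total) ->
                statisticRepository.save(new MonthlyStatistic(month, total)));
    }
}

You also need @EnableScheduling on a configuration class for the cron trigger to fire.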
Another approach is to do real-time processing. If you could develop a streaming pipeline in your application, then I would highly suggest exploring the Apache Flink project.
Well, I didn't stop and built several solutions step by step...
Step 1: Use streams. Before that, calculating statistics for 10,000 OrderEntity records took 18 seconds. Now it has come down to 14 seconds.
Step 2: Use parallelStream instead of plain streams. Parallel streams sped the statistics calculation up to 6 seconds! I was even surprised.
public SortedMap<String, BigDecimal> getStatisticsByParallelStreams() {
    List<OrderEntity> orders = new ArrayList<>();
    orderService.findAll(Sort.by(Sort.Direction.ASC, "createdDate")).forEach(orders::add);
    MathContext mc = new MathContext(3);
    return orders.stream().collect(Collectors.toMap(
            order -> order.getCreatedDate().format(DateTimeFormatter.ofPattern("yyyy-MM")),
            order -> calculateTotal(order.getDishIds()
                    .parallelStream()
                    .map(dishId -> dishV1Client.getDishById(dishId))
                    .collect(Collectors.toList())),
            (a, b) -> a.add(b, mc),
            TreeMap::new
    ));
}
Step 3: Optimize requests to the other microservice. I connected JProfiler to the app and found out that I was often making redundant requests to the other microservice. After that, I first make a single request to receive all Dishes, and then use the received List of Dishes while calculating the statistics.
And thus I sped it up to 1.5 seconds!:
public SortedMap<String, BigDecimal> getStatisticsByParallelStreams() {
    List<OrderEntity> orders = new ArrayList<>();
    orderService.findAll(Sort.by(Sort.Direction.ASC, "createdDate")).forEach(orders::add);
    List<FraudDishV1Response> dishes = dishV1Client.getDishes();
    MathContext mc = new MathContext(3);
    return orders.stream().collect(Collectors.toMap(
            order -> order.getCreatedDate().format(DateTimeFormatter.ofPattern("yyyy-MM")),
            order -> calculateTotal(order.getDishIds()
                    .parallelStream()
                    .map(dishId -> getDishResponseById(dishes, dishId))
                    .collect(Collectors.toList())),
            (a, b) -> a.add(b, mc),
            TreeMap::new
    ));
}
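The getDishResponseById helper is not shown above; it is presumably just a lookup into the pre-fetched list, along these lines (getId() and the Long id type are guesses):

private FraudDishV1Response getDishResponseById(List<FraudDishV1Response> dishes, Long dishId) {
    return dishes.stream()
            .filter(dish -> dish.getId().equals(dishId))   // getId() is assumed
            .findFirst()
            .orElseThrow(() -> new IllegalStateException("Unknown dish id: " + dishId));
}

If the dish list is large, building a Map keyed by id once would avoid the repeated linear scan.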
I am new to the reactive world, so this might sound like a newbie question. I have a Flux of products of size 20-30, and for each product I need to fetch the below from different microservices:
average review count
totalCommentCount
wishlistedCount
variants
... (6 such calls in total)
What I have tried:
1. doOnNext
Flux<Product> products = ...
products
.doOnNext(product -> updateReviewCount)
.doOnNext(product -> updateTotalCommentCount)
.doOnNext(product -> updateWishlistedCount)
.doOnNext(product -> updateVariants)
...
This turns out to block the chain for each call for each product, e.g.:
Total records (20) * No. of service calls (5) * Time per service call (30 ms) = 3000 ms
But the time will grow with the number of records or the number of service calls.
2. map
Using map I updated and returned the same reference, but the results were the same.
3. Collected everything as a list and executed aggregate queries against the downstream services
Flux<Product> products = ...
products
.collectList() // Mono<List<Product>>
.doOnNext(productList -> updateReviewCountOfAllInList)
.doOnNext(productList -> updateFieldB_ForAllInList)
.doOnNext(productList -> updateFieldC_ForAllInList)
.doOnNext(productList -> updateFieldD_ForAllInList)
...
This did improve performance. The downstream application now has to return more data per query, so the time on the downstream side increased a little, but that's okay.
With this, I was able to achieve a time like:
Total records (combined as a list, so 1) * No. of service calls (5) * Time per service call (50 ms, as the time increased) = 250 ms
But the time will still grow with the number of service calls.
Now I need to parallelize these service calls, execute them in parallel, and update their respective fields on the same product instance (same reference).
Something like below:
Flux<Product> products = ... // of 10 products
products
.collectList() // Mono<List<Product>>
.doAllInParallel(serviceCall1, serviceCall2, serviceCall3...)
. // get all updated products // flux size of 10
With that I want to achieve a time of roughly 250/5 = 50 ms.
How can I achieve that?
I have found different articles, but I am not sure what the best way to do it is. Can someone please help me with this?
It worked using:
products // Flux<Product>
    .collectList() // Mono<List<Product>>
    .flatMap(list -> Mono.zip(this.call1(list), this.call2(list))) // returns a Tuple
    .map(tuple -> tuple.getT1())
    .flatMapIterable(list -> list)

Mono<List<Product>> call1(List<Product> productList) {
    // some logic
}

Mono<List<Product>> call2(List<Product> productList) {
    // some logic
}

Actually, zip and flatMapIterable could all be done in a single step as well; here it is split up just for the demo.
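For completeness, a compacted single-step version could look roughly like this, assuming each enrichment call returns the same (mutated) list as a Mono<List<Product>>; the service names are placeholders:

Flux<Product> enriched = products            // Flux<Product>
        .collectList()                       // Mono<List<Product>>
        .flatMap(list -> Mono.zip(
                        reviewService.updateReviewCounts(list),     // Mono<List<Product>>, assumed
                        commentService.updateCommentCounts(list),   // Mono<List<Product>>, assumed
                        wishlistService.updateWishlistCounts(list)) // Mono<List<Product>>, assumed
                .map(tuple -> tuple.getT1()))  // every tuple member is the same mutated list
        .flatMapIterable(list -> list);        // back to Flux<Product>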
I'm developing a Java application with Cassandra. Here is my table:
id      | registration | name
1       | 1            | xxx
1       | 2            | xxx
1       | 3            | xxx
2       | 1            | xxx
2       | 2            | xxx
...     | ...          | ...
...     | ...          | ...
100,000 | 34           | xxx
My table has a very large number of rows (more than 50,000,000). I have a myListIds list of String ids to iterate over. I could use:
SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
// imagine more than 10,000,000 ids in the 'IN' clause
But this is a bad pattern, so instead I'm issuing async requests this way:
List<ResultSetFuture> futures = new ArrayList<>();
Map<String, ResultSetFuture> mapFutures = new HashMap<>();
// mapFutures : key = id & value = future for the data from Cassandra
for (String id : myListIds) {
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
    mapFutures.put(id, resultSetFuture);
}
Then I process my data with the getUninterruptibly() method.
Here is my problem: I'm making maybe more than 10,000,000 Cassandra requests (one request for each id), and I'm putting all of the results into a Map.
Can this cause a heap memory error? What's the best way to deal with that?
Thank you
Note: your question is really "is this a good design pattern?".
If you are having to perform 10,000,000 Cassandra requests, then you have structured your data incorrectly. Ultimately you should design your database from the ground up so that you only ever have to perform one or two fetches.
Now, granted, if you have 5000 Cassandra nodes this might not be a huge problem (it probably still is), but it still reeks of bad database design. I think the solution is to take a look at your schema.
I see the following problems with your code:
An overloaded Cassandra cluster: it won't be able to process so many async requests, and your requests will fail with NoHostAvailableException.
An overloaded Cassandra driver: your client app will fail with IO exceptions, because the system will not be able to process so many async requests (see the details about connection tuning at https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/).
And yes, memory issues are possible. It depends on the data size.
A possible solution is to limit the number of in-flight async requests and process the data in chunks (e.g., see this answer).
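A rough sketch of that chunked/limited approach with the 3.x driver; the permit count, chunk size and processChunk helper are made up for illustration:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

import com.datastax.driver.core.ResultSetFuture;
import com.google.common.util.concurrent.MoreExecutors;

Semaphore inFlight = new Semaphore(256);            // cap concurrent queries, tune for your cluster
Map<String, ResultSetFuture> futures = new HashMap<>();

for (String id : myListIds) {
    inFlight.acquireUninterruptibly();              // blocks once 256 queries are pending
    ResultSetFuture future = session.executeAsync(statement.bind(id));
    future.addListener(inFlight::release, MoreExecutors.directExecutor());
    futures.put(id, future);

    // Drain completed futures in chunks instead of holding millions of entries in the map.
    if (futures.size() >= 10_000) {
        processChunk(futures);                      // hypothetical: getUninterruptibly() each future
        futures.clear();
    }
}
processChunk(futures);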
I've written a stream that takes in messages and sends out a table of the keys that have appeared. If something appears, it will show a count of 1. This is a simplified version of my production code in order to demonstrate the bug. In a live run, a message is sent out for each message received.
However, when I run it in a unit test using ProcessorTopologyTestDriver, I get a different behavior. If a key that has already been seen before is received, I get an extra message.
If I send messages with keys "key1", then "key2", then "key1", I get the following output.
key1 - 1
key2 - 1
key1 - 0
key1 - 1
For some reason, it decrements the value before adding it back in. This only happens when using ProcessorTopologyTestDriver. Is this expected? Is there a workaround? Or is this a bug?
Here's my topology:
final StreamsBuilder builder = new StreamsBuilder();
KGroupedTable<String, String> groupedTable = builder
        .table(applicationConfig.sourceTopic(), Consumed.with(Serdes.String(), Serdes.String()))
        .groupBy((key, value) -> KeyValue.pair(key, value), Serialized.with(Serdes.String(), Serdes.String()));
KTable<String, Long> countTable = groupedTable.count();
KStream<String, Long> countTableAsStream = countTable.toStream();
countTableAsStream.to(applicationConfig.outputTopic(), Produced.with(Serdes.String(), Serdes.Long()));
Here's my unit test code:
TopologyWithGroupedTable top = new TopologyWithGroupedTable(appConfig, map);
Topology topology = top.get();
ProcessorTopologyTestDriver driver = new ProcessorTopologyTestDriver(config, topology);
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key2", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
ProducerRecord<String, Long> outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key2", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value()); //this fails, I get 0. If I pull another message, it shows key1 with a count of 1
Here's a repo of the full code:
https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/
Stream topology: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/main/java/com/nick/kstreams/TopologyWithGroupedTable.java
Test code: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/test/java/com/nick/kstreams/TopologyWithGroupedTableTests.java
It's not a bug, but behavior by design (cf. the explanation below).
The difference in behavior is due to KTable state store caching (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html). When you run the unit test, the cache is flushed after each record, while in your production run, this is not the case. If you disable caching in your production run, I assume that it behaves the same as in your unit test.
Side remark: ProcessorTopologyTestDriver is an internal class and not part of public API. Thus, there is no compatibility guarantee. You should use the official unit-test packages instead: https://docs.confluent.io/current/streams/developer-guide/test-streams.html
Why do you see two records:
In your code, you are using KTable#groupBy(), and in your specific use case you don't change the key. However, in general the key might be changed (depending on the value of the input KTable). Thus, if the input KTable is changed, the downstream aggregation needs to remove/subtract the old key-value pair from the aggregation result and add the new key-value pair to the aggregation result. In general, the keys of the old and new pair are different, so it's required to generate two records, because the subtraction and addition could happen on different instances as different keys might be hashed differently. Does this make sense?
Thus, for each update of the input KTable, two updates to the result KTable, usually on two different key-value pairs, need to be computed. For your specific case, in which the key does not change, Kafka Streams does the same thing (there is no check/optimization to "merge" both operations into one if the key is actually the same).
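If you want to verify this, caching can be turned off by setting the record cache size to zero when configuring the application; a minimal sketch (application id and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "grouped-table-app");   // placeholder
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
// A cache of 0 bytes disables KTable caching, so every input record produces the
// subtract/add pair of output records downstream.
config.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);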
From the wouldn't-it-be-cool-if category of questions ...
By "queue-like-thing" I mean supports the following operations:
append(entry:Entry) - add entry to tail of queue
take(): Entry - remove entry from head of queue and return it
promote(entry_id) - move the entry one position closer to the head; the entry that currently occupies that position is moved to the old position
demote(entry_id) - the opposite of promote(entry_id)
Optional operations would be something like:
promote(entry_id, amount) - like promote(entry_id) except you specify the number of positions
demote(entry_id, amount) - opposite of promote(entry_id, amount)
Of course, if we allow amount to be positive or negative, we can consolidate the promote/demote methods into a single move(entry_id, amount) method
It would be ideal if the following operations could be performed on the queue in a distributed fashion (multiple clients interacting with the queue):
queue = ...
queue.append( a )
queue.append( b )
queue.append( c )
print queue
"a b c"
queue.promote( b.id )
print queue
"b a c"
queue.demote( a.id )
"b c a"
x = queue.take()
print x
"b"
print queue
"c a"
Are there any data stores that are particularly apt for this use case? The queue should always be in a consistent state even if multiple users are modifying the queue simultaneously.
If it weren't for the promote/demote/move requirement, there wouldn't be much of a problem.
Edit:
Bonus points if there are Java and/or Python libraries to accomplish the task outlined above.
Solution should scale extremely well.
Redis supports lists and ordered sets: http://redis.io/topics/data-types#lists
It also supports transactions and publish/subscribe messaging. So, yes, I would say this can be easily done on Redis.
Update: in fact, about 80% of it has been done many times: http://www.google.co.uk/search?q=python+redis+queue
Several of those hits could be upgraded to add what you want. You would have to use transactions to implement the promote/demote operations.
It might be possible to use Lua on the server side to create that functionality, rather than having it in client code. Alternatively, you could create a thin wrapper around Redis on the server that implements just the operations you want.
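To make that concrete, here is one rough way append and promote could sit on a Redis sorted set from Java with the Jedis client (lower score = closer to the head; key and member names are placeholders, and a real implementation would retry when the WATCHed transaction is aborted):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;
import redis.clients.jedis.Tuple;

try (Jedis jedis = new Jedis("localhost", 6379)) {
    // append: give the new entry the next-highest score (tail of the queue)
    long tail = jedis.zcard("pq");
    jedis.zadd("pq", tail + 1, "entry-42");

    // promote("entry-42"): swap scores with the member immediately ahead of it,
    // inside WATCH/MULTI so concurrent clients can't interleave between read and write
    jedis.watch("pq");
    Double score = jedis.zscore("pq", "entry-42");
    String ahead = null;
    Double aheadScore = null;
    for (Tuple t : jedis.zrevrangeByScoreWithScores("pq", score - 0.5, 0)) {
        ahead = t.getElement();        // closest member with a lower score
        aheadScore = t.getScore();
        break;
    }
    if (ahead != null) {
        Transaction tx = jedis.multi();
        tx.zadd("pq", aheadScore, "entry-42");
        tx.zadd("pq", score, ahead);
        tx.exec();                     // returns null if "pq" changed under us
    } else {
        jedis.unwatch();
    }
}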
Python: "Batteries Included"
Rather than looking to a data store like RabbitMQ, Redis, or an RDBMS, I think python and a couple libraries have more than enough to solve this problem. Some may complain that this do-it-yourself approach is re-inventing the wheel but I prefer running a hundred lines of python code over managing another data store.
Implementing a Priority Queue
The operations that you define: append, take, promote, and demote, describe a priority queue. Unfortunately python doesn't have a built-in priority queue data type. But it does have a heap library called heapq and priority queues are often implemented as heaps. Here's my implementation of a priority queue meeting your requirements:
import heapq  # the queue below relies only on lists, dicts, and heapq

class PQueue:
    """
    Implements a priority queue with append, take, promote, and demote
    operations.
    """
    def __init__(self):
        """
        Initialize empty priority queue.
        self.toll is max(priority) and max(rowid) in the queue
        self.heap is the heap maintained for take command
        self.rows is a mapping from rowid to items
        self.pris is a mapping from priority to items
        """
        self.toll = 0
        self.heap = list()
        self.rows = dict()
        self.pris = dict()

    def append(self, value):
        """
        Append value to our priority queue.
        The new value is added with lowest priority as an item. Items are
        threeple lists consisting of [priority, rowid, value]. The rowid
        is used by the promote/demote commands.
        Returns the new rowid corresponding to the new item.
        """
        self.toll += 1
        item = [self.toll, self.toll, value]
        self.heap.append(item)
        self.rows[self.toll] = item
        self.pris[self.toll] = item
        return self.toll

    def take(self):
        """
        Take the highest priority item out of the queue.
        Returns the value of the item.
        """
        item = heapq.heappop(self.heap)
        del self.pris[item[0]]
        del self.rows[item[1]]
        return item[2]

    def promote(self, rowid):
        """
        Promote an item in the queue.
        The promoted item swaps position with the next highest item.
        Returns the number of affected rows.
        """
        if rowid not in self.rows: return 0
        item = self.rows[rowid]
        item_pri, item_row, item_val = item
        next = item_pri - 1
        if next in self.pris:
            iota = self.pris[next]
            iota_pri, iota_row, iota_val = iota
            iota[1], iota[2] = item_row, item_val
            item[1], item[2] = iota_row, iota_val
            self.rows[item_row] = iota
            self.rows[iota_row] = item
            return 2
        return 0
The demote command is nearly identical to the promote command so I'll omit it for brevity. Note that this depends only on python's lists, dicts, and heapq library.
Serving our Priority Queue
Now with the PQueue data type, we'd like to allow distributed interactions with an instance. A great library for this is gevent. Though gevent is relatively new and still in beta, it's wonderfully fast and well tested. With gevent, we can set up a socket server listening on localhost:4040 pretty easily. Here's my server code:
pqueue = PQueue()

def pqueue_server(sock, addr):
    text = sock.recv(1024)
    cmds = text.split(' ')
    if cmds[0] == 'append':
        result = pqueue.append(cmds[1])
    elif cmds[0] == 'take':
        result = pqueue.take()
    elif cmds[0] == 'promote':
        result = pqueue.promote(int(cmds[1]))
    elif cmds[0] == 'demote':
        result = pqueue.demote(int(cmds[1]))
    else:
        result = ''
    sock.sendall(str(result))
    print 'Request:', text, '; Response:', str(result)

if args.listen:
    server = StreamServer(('127.0.0.1', 4040), pqueue_server)
    print 'Starting pqueue server on port 4040...'
    server.serve_forever()
Before that runs in production, you'll of course want to do some better error/buffer handling. But it'll work just fine for rapid prototyping. Notice that this doesn't require any locking around the pqueue object. Gevent doesn't actually run code in parallel, it just gives that impression. The drawback is that more cores won't help, but the benefit is lock-free code.
Don't get me wrong, the gevent StreamServer will process multiple requests at the same time. But it switches between answering requests through cooperative multitasking. This means you have to yield the coroutine's time slice. While gevent's socket I/O functions are designed to yield, our pqueue implementation is not. Fortunately, the pqueue completes its tasks really quickly.
A Client Too
While prototyping, I found it useful to have a client as well. It took some googling to write a client so I'll share that code too:
if args.client:
    while True:
        msg = raw_input('> ')
        sock = gsocket.socket(gsocket.AF_INET, gsocket.SOCK_STREAM)
        sock.connect(('127.0.0.1', 4040))
        sock.sendall(msg)
        text = sock.recv(1024)
        sock.close()
        print text
To use the new data store, first start the server and then start the client. At the client prompt you ought to be able to do:
> append one
1
> append two
2
> append three
3
> promote 2
2
> promote 2
0
> take
two
Scaling Extremely Well
Given your thinking about a data store, it seems you're really concerned with throughput and durability. But "scale extremely well" doesn't quantify your needs. So I decided to benchmark the above with a test function. Here's the test function:
def test():
    import time
    import urllib2
    import subprocess
    import random
    random = random.Random(0)
    from progressbar import ProgressBar, Percentage, Bar, ETA
    widgets = [Percentage(), Bar(), ETA()]

    def make_name():
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
        return ''.join(random.choice(alphabet)
                       for rpt in xrange(random.randrange(3, 20)))

    def make_request(cmds):
        sock = gsocket.socket(gsocket.AF_INET, gsocket.SOCK_STREAM)
        sock.connect(('127.0.0.1', 4040))
        sock.sendall(cmds)
        text = sock.recv(1024)
        sock.close()

    print 'Starting server and waiting 3 seconds.'
    subprocess.call('start cmd.exe /c python.exe queue_thing_gevent.py -l',
                    shell=True)
    time.sleep(3)

    tests = []

    def wrap_test(name, limit=10000):
        def wrap(func):
            def wrapped():
                progress = ProgressBar(widgets=widgets)
                for rpt in progress(xrange(limit)):
                    func()
                secs = progress.seconds_elapsed
                print '{0} {1} records in {2:.3f} s at {3:.3f} r/s'.format(
                    name, limit, secs, limit / secs)
            tests.append(wrapped)
            return wrapped
        return wrap

    def direct_append():
        name = make_name()
        pqueue.append(name)

    count = 1000000

    @wrap_test('Loaded', count)
    def direct_append_test(): direct_append()

    def append():
        name = make_name()
        make_request('append ' + name)

    @wrap_test('Appended')
    def append_test(): append()

    ...

    print 'Running speed tests.'
    for tst in tests: tst()
Benchmark Results
I ran 6 tests against the server running on my laptop. I think the results scale extremely well. Here's the output:
Starting server and waiting 3 seconds.
Running speed tests.
100%|############################################################|Time: 0:00:21
Loaded 1000000 records in 21.770 s at 45934.773 r/s
100%|############################################################|Time: 0:00:06
Appended 10000 records in 6.825 s at 1465.201 r/s
100%|############################################################|Time: 0:00:06
Promoted 10000 records in 6.270 s at 1594.896 r/s
100%|############################################################|Time: 0:00:05
Demoted 10000 records in 5.686 s at 1758.706 r/s
100%|############################################################|Time: 0:00:05
Took 10000 records in 5.950 s at 1680.672 r/s
100%|############################################################|Time: 0:00:07
Mixed load processed 10000 records in 7.410 s at 1349.528 r/s
Final Frontier: Durability
Finally, durability is the only problem I didn't completely prototype. But I don't think it's that hard either. In our priority queue, the heap (list) of items has all the information we need to persist the data type to disk. Since, with gevent, we can also spawn functions in a multi-processing way, I imagined using a function like this:
def save_heap(heap, toll):
    name = 'heap-{0}.txt'.format(toll)
    with open(name, 'w') as temp:
        for val in heap:
            temp.write(str(val))
            gevent.sleep(0)
and adding a save function to our priority queue:
def save(self):
    heap_copy = tuple(self.heap)
    toll = self.toll
    gevent.spawn(save_heap, heap_copy, toll)
You could now copy the Redis model of forking and writing the data store to disk every few minutes. If you need even greater durability, then couple the above with a system that logs commands to disk. Together, those are the AOF and RDB persistence methods that Redis uses.
WebSphere MQ can do almost all of this.
Promote/demote is almost possible, by removing the message from the queue and putting it back with a higher/lower priority, or by using the CORRELID as a sequence number.
What's wrong with RabbitMQ? It sounds exactly like what you need.
We extensively use Redis as well in our production environment, but it doesn't have some of the functionality queues usually have, like marking a task as complete, or re-sending the task if it isn't completed within some TTL. On the other hand, it has features a queue doesn't have, like being a generic store, and it is REALLY fast.
Use Redisson: it implements the familiar List, Queue, BlockingQueue and Deque Java interfaces in a distributed fashion backed by Redis. Example with a Deque:
Redisson redisson = Redisson.create();
RDeque<SomeObject> queue = redisson.getDeque("anyDeque");
queue.addFirst(new SomeObject());
queue.addLast(new SomeObject());
SomeObject obj = queue.removeFirst();
SomeObject someObj = queue.removeLast();
redisson.shutdown();
Other samples:
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#77-list
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#78-queue
https://github.com/mrniko/redisson/wiki/7.-distributed-collections/#710-blocking-queue
If you for some reason decide to use an SQL database as a backend, I would not use MySQL, as it requires polling (and I would not use it for lots of other reasons), but PostgreSQL supports LISTEN/NOTIFY for signalling other clients so that they do not have to poll for changes. However, it signals all listening clients at once, so you would still need a mechanism for choosing the winning listener.
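For reference, the LISTEN/NOTIFY flow from Java with the PostgreSQL JDBC driver looks roughly like this; the channel name and connection details are placeholders, and actually claiming the winning row would still be done separately (e.g. SELECT ... FOR UPDATE SKIP LOCKED):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

public class QueueListener {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/queuedb", "user", "password");
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("LISTEN queue_events");   // producers run NOTIFY queue_events after INSERT
        }
        PGConnection pgConn = conn.unwrap(PGConnection.class);
        while (true) {
            // The driver buffers notifications; poll for them (a real worker would also
            // issue a trivial query so pending notifications are read off the socket).
            PGNotification[] notifications = pgConn.getNotifications();
            if (notifications != null) {
                for (PGNotification n : notifications) {
                    System.out.println("Notified on channel: " + n.getName());
                    // ... try to claim the head of the queue here ...
                }
            }
            Thread.sleep(500);
        }
    }
}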
As a side note, I am not sure a promote/demote mechanism would be useful; it would be better to schedule the jobs appropriately when inserting...