Jedis pipeline executes commands before calling the sync method? - java

I am using the Jedis pipeline to add multiple records to Redis at once. But when I am debugging the code, I can see the records appear in Redis even before the pipeline.sync() method is called. Aren't all commands in the pipeline expected to be executed only after that? Or is it just batching them into chunks of a fixed size?
var pipeline = jedis.pipelined();
all.forEach(value -> pipeline.sadd(allPrefix, value));
grouped.forEach((key, value) -> pipeline.hset(groupedPrefix, String.valueOf(key), value));
pipeline.sync();
Am I doing it the right way and what is the reason for this behavior?

A Jedis pipeline writes the commands into the socket's output buffer. If the combined size of the commands exceeds the size of that buffer, it is flushed to make space for more commands. In the meantime, those flushed commands reach the Redis server, and the server may start processing them even though pipeline.sync() has not been called yet.
pipeline.sync() ensures that all commands have been sent to the server; it does not guarantee that commands are kept in the buffer until sync() is called.
If you want none of your commands to be executed before a trigger, consider (any variant of) Transaction. All the commands in a transaction get executed only after exec() is called.
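For illustration, a minimal sketch of the same writes wrapped in a Jedis Transaction (the element types of all and grouped are assumptions here; the point is swapping pipelined() for multi()):
import java.util.List;
import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

class AtomicWrite {
    // Same writes as the pipelined version, but nothing is executed on the
    // server until exec() is called.
    static void save(Jedis jedis, String allPrefix, List<String> all,
                     String groupedPrefix, Map<Integer, String> grouped) {
        Transaction tx = jedis.multi();
        all.forEach(value -> tx.sadd(allPrefix, value));
        grouped.forEach((key, value) -> tx.hset(groupedPrefix, String.valueOf(key), value));
        tx.exec(); // the server queues the commands and runs them all here
    }
}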

Related

Redis Lua script doesn't respect atomicity in some cases

I'm trying to implement a simple producer-consumer flow using Java and Redis.
The flow is this: the producer pushes items into Redis, and the consumer reads the items from Redis. To make the consumer more efficient, it will not read each item individually, but will instead read a batch of items. For that, I'm using this flow:
The producer pushes the item into a pending set in Redis.
When the set count exceeds a given threshold, the items are packed into a batch (using JSON) and saved into a "ready hashmap".
The consumer reads the ready hashmap and evaluates its content.
The consumer removes the items it consumed from the hashmap.
As these operations might cause race conditions, I looked into transactional operations. I understood that the best way to achieve this is with Lua, as Redis Lua scripting is atomic.
The script I wrote is this:
local toJsonList = function(items)
    local jsonList = '[' .. table.concat(items, ', ') .. ']'
    return cjson.encode(jsonList)
end

-- Get the batch size (the threshold)
local batchSize = tonumber(ARGV[1])

-- Add the new item
redis.call('SADD', KEYS[1], ARGV[2])

-- Get the number of pending items in Redis
local currentPendingQueueItems = redis.call('SCARD', KEYS[1])

-- Should we move the items from the pending queue to the ready queue?
if currentPendingQueueItems < batchSize then
    return 1
end

-- Fetch the items stored in the pending queue
local pendingItems = redis.call('SMEMBERS', KEYS[1])

-- Store the items in the ready queue hash map
redis.call('HSET', KEYS[3], cjson.encode(KEYS[4]), toJsonList(pendingItems))

-- Remove the pending queue
redis.call('DEL', KEYS[1])

return 1
The execution is like so:
$ redis-cli --eval addAndSync.lua "pending-queue-key" "ready-queue-key" "unique-key-for-batch" , $THRESHOLD "item to add"
I started by testing it out individually, and it indeed works fine. The ready queue is synced correctly. I even wrote this script:
#!/bin/bash
END=$1
for i in $(seq 1 $END); do
    redis-cli --eval syncReadyQueue.lua "pending" "ready" "ready${i}" , "3" "${i}" &
done
Which I ran with END=100 to test some insertions at once.
My issue is that after integrating it with Java, I started to stress test it. During the stress test, a few threads fired at the same time to handle the produced content, and each of them ran addAndSync.lua. I inspected the ready queue after inserting only 30 records and noticed that there are duplicates in the ready queue.
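(The Java integration itself isn't shown above; purely as an illustration, a concurrent invocation of addAndSync.lua via Jedis eval could look like the sketch below, with a hypothetical pool size and item values, and the same three keys as the redis-cli call.)
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class AddAndSyncStressTest {
    public static void main(String[] args) throws Exception {
        // Load the Lua script from disk (path is an assumption).
        String script = new String(Files.readAllBytes(Paths.get("addAndSync.lua")));
        JedisPool pool = new JedisPool("localhost", 6379);
        ExecutorService workers = Executors.newFixedThreadPool(8);

        for (int i = 1; i <= 30; i++) {
            final String item = "item-" + i;
            final String batchKey = "ready" + i; // unique batch key per call, mirroring the bash test
            workers.submit(() -> {
                try (Jedis jedis = pool.getResource()) {
                    // KEYS: pending set, ready hash, batch field; ARGV: threshold, item
                    jedis.eval(script,
                            List.of("pending", "ready", batchKey),
                            List.of("3", item));
                }
            });
        }
        workers.shutdown();
    }
}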
These duplicates are unexpected to me, as Redis guarantees that a Lua script blocks any other client call while it runs.
My expected behaviour is that every Lua call will block access to Redis until it commits its results, and thus the ready queue will only contain unique items.
I'd love to get any help in understanding what I am missing.
Thanks!!

Is there any way we can pause a Kafka stream for a certain period and resume later?

We have a requirement where we are using Kafka Streams to read from a Kafka topic and then send the data over the network through a pool of sessions. However, the network calls are sometimes a bit slow and we need to frequently pause the stream to make sure we are not overloading the network. Currently, we capture the data into a stream, load it into an executor service, and then send it over the network through the session pool.
If the amount of data in the executor service is too high, we need to pause the stream for some time and then resume it once the backlog in the executor service is cleared up. To achieve this pause mechanism, we are currently closing the stream and starting it again once the backlog is cleared up.
Is there any way we can pause the Kafka stream?
If I understand you correctly, there is nothing special you need to do. You are talking about "back pressure" and Kafka Streams can handle it out of the box.
What can be done is to put this data into a queue with some max size and use this queue to feed the executor service (a sketch of this approach follows below). Whenever the queue reaches its threshold, there are two options:
If your call to put data into the queue blocks with no time-out, there is nothing more you need to do. Just wait until the system is back online; your call returns, and processing resumes.
If your call to put data into the queue blocks with a time-out, just check the size of the queue and retry. Repeat this until the system is back online and your call succeeds.
The only caveat is that as long as your Streams application blocks, the internally used Kafka consumer client will not send any heartbeats to Kafka and might time out. Thus, you need to set the time-out configuration parameter higher than the expected maximum downtime of your external system.
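A minimal sketch of the bounded-queue approach described above, assuming a plain String topic named input-topic and a hypothetical sendOverNetwork call standing in for the session pool; the max.poll.interval.ms override reflects the caveat about blocking:
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class BackPressureSketch {
    public static void main(String[] args) {
        // Bounded hand-off queue: put() blocks once 1000 records are waiting,
        // which stalls the Streams thread and gives natural back pressure.
        BlockingQueue<String> handoff = new ArrayBlockingQueue<>(1000);

        // Workers that drain the queue and perform the slow network call.
        ExecutorService senders = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            senders.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        sendOverNetwork(handoff.take()); // hypothetical slow call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic") // topic name is an assumption
               .foreach((key, value) -> {
                   try {
                       handoff.put(value); // blocks when the backlog is full
                   } catch (InterruptedException e) {
                       Thread.currentThread().interrupt();
                   }
               });

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "back-pressure-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Give the consumer more slack while the Streams thread is blocked.
        props.put(StreamsConfig.consumerPrefix("max.poll.interval.ms"), "600000");

        new KafkaStreams(builder.build(), props).start();
    }

    private static void sendOverNetwork(String record) {
        // slow network call through the session pool (not shown)
    }
}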
Another approach is to use the Processor API available in Kafka Streams, though it is not usually the recommended pattern.
Let me know if it helps!!

Could I use igniteQueue in another cache.invoke() function?

Could I use igniteQueue in another cache.invoke() function?
In Ignite Service A's execute function:
cacheA.invoke(record) { // process the record
    igniteQueue.put(processed_record);
}
In Ignite Service B's execute function:
saved_processed_record = igniteQueue.take();
It runs smoothly when TPS is low, but when I run with high TPS, I sometimes get "Possible starvation in striped pool" in the log.
See my previous post:
Ignite service hangs when call cache remove in another cache's invoke processor, " Possible starvation in striped pool"?
It seems that using igniteQueue inside cache.invoke() is just as incorrect as using an Ignite cache inside cache.invoke()?
So if I cannot use an Ignite queue in cache.invoke(), is there a better way to do this? I have tried to use another message queue (Kafka or Redis) instead of the Ignite queue inside the invoke, but Ignite describes itself as a message queue too, and using Kafka inside an Ignite invoke seems very strange. How could I achieve this with pure Ignite?
You should not issue any blocking operations from "invoke(..)" method, as it executes within a lock on the key. Instead, why not create another thread pool and have it be responsible for adding and taking objects from the IgniteQueue. Then you can simply submit a task to that thread pool from the "invoke(..)" method and inside that task enqueue an object.
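A rough sketch of that suggestion, assuming an embedded single-node setup with hypothetical cache, queue and key names (in a real cluster the entry processor class, and therefore the static executor, must be available on the server nodes):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.configuration.CollectionConfiguration;

public class InvokeWithQueueSketch {
    // Separate pool so the blocking queue.put() never runs inside the entry lock.
    private static final ExecutorService QUEUE_WORKER = Executors.newSingleThreadExecutor();

    public static void main(String[] args) {
        Ignite ignite = Ignition.start();
        IgniteCache<String, String> cacheA = ignite.getOrCreateCache("cacheA");
        IgniteQueue<String> igniteQueue =
                ignite.queue("processedRecords", 0, new CollectionConfiguration());

        cacheA.put("record-1", "payload");

        // Service A side: process the entry inside invoke(), but hand the result
        // off to the worker pool instead of calling put() under the key lock.
        cacheA.invoke("record-1", (CacheEntryProcessor<String, String, Void>) (entry, arguments) -> {
            String processed = entry.getValue() + "-processed";
            entry.setValue(processed);
            QUEUE_WORKER.submit(() -> igniteQueue.put(processed)); // enqueue outside the lock
            return null;
        });

        // Service B side: consume processed records.
        String savedProcessedRecord = igniteQueue.take();
        System.out.println("Consumed: " + savedProcessedRecord);
    }
}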

How to implement transaction with rollback in Redis

My program needs to add data to two lists in Redis as a transaction. The data should be consistent in both lists. If there is an exception or system failure and the program has only added data to one list, the system should be able to recover and roll back. But based on the Redis docs, it doesn't support rollback. How can I implement this? The language I use is Java.
If you need transaction rollback, I recommend using something other than Redis. Redis transactions are not the same as in other datastores. Even MULTI/EXEC doesn't work for what you want - first because there is no rollback. If you want rollback you will have to pull down both lists so you can restore them - and hope that between your error condition and the "rollback" no other client modified either of the lists. Doing this in a sane and reliable way is neither trivial nor simple. It would also probably not be a good question for SO, as it would be very broad and not Redis specific.
Now as to why EXEC doesn't do what one might think. In your proposed scenario MULTI/EXEC only handles the cases of:
You set up WATCHes to ensure no other changes happened
Your client dies before issuing EXEC
Redis is out of memory
It is entirely possible to get errors as a result of issuing the EXEC command. When you issue EXEC, Redis will execute all commands in the queue and return a list of results, any of which may be an error. It will not prevent the case of the add-to-list-1 command working and the add-to-list-2 command failing. You would still have your two lists out of sync. When you issue, say, an LPUSH after issuing MULTI, you will always get back an OK unless you:
a) previously added a watch and something in that list changed or
b) Redis returns an OOM condition in response to a queued push command
DISCARD does not work like some might think. DISCARD is used instead of EXEC, not as a rollback mechanism. Once you issue EXEC, your transaction is completed. Redis does not have any rollback mechanism at all - that isn't what Redis transactions are about.
The key to understanding what Redis calls transactions is to realize they are essentially a command queue at the client connection level. They are not a database state machine.
Redis transactions are different. They guarantee two things:
All or none of the commands are executed
Commands are executed sequentially and without interruption
Having said that, if you have control over your code and know when the system failure happens (by catching the exception, for example), you can achieve your requirement this way:
MULTI -> Start transaction
LPUSH queue1 1 -> pushing in queue 1
LPUSH queue2 1 -> pushing in queue 2
EXEC/DISCARD
In the 4th step, do EXEC if there is no error; if you encounter an error or exception and you want to roll back, do DISCARD.
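As the first answer points out, DISCARD only drops commands that have not been executed yet; it is not a rollback after EXEC. With that caveat, a minimal Jedis sketch of the four steps above (host and key names are assumptions):
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class TwoListWrite {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            Transaction tx = jedis.multi();      // 1. MULTI
            try {
                tx.lpush("queue1", "1");         // 2. queued, not yet executed
                tx.lpush("queue2", "1");         // 3. queued, not yet executed
                tx.exec();                       // 4. EXEC: both pushes run together
            } catch (RuntimeException e) {
                tx.discard();                    // 4. DISCARD: drop the queued commands
                throw e;
            }
        }
    }
}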
Hope it makes sense.

Writing Spark Streaming Output to a Socket

I have a DStream "Crowd" and I want to write each element in "Crowd" to a socket. When I try to read from that socket, it doesn't print anything. I am using the following lines of code:
val server = new ServerSocket(4000,200);
val conn = server.accept()
val out = new PrintStream(conn.getOutputStream());
crowd.foreachRDD(rdd => {rdd.foreach(record=>{out.println(record)})})
But if I use this (which is not what I want though):
crowd.foreachRDD(rdd => out.println(rdd))
It does write something to the socket.
I suspect there is a problem with using rdd.foreach(), although it should work. I am not sure what I am missing.
The code outside the DStream closure is executed in the driver, while the rdd.foreach(...) will be executed on each distributed partition of the RDD.
So, there's a socket created on the driver's machine and the job tries to write to it on another machine - that will not work for the obvious reasons.
DStream.foreachRDD is executed on the driver, so in that instance, the socket and the computation are performed in the same host. Therefore it works.
With the distributed nature of an RDD computation, this Server Socket approach will be hard to make work as dynamic service discovery becomes a challenge i.e. "where is my server socket open?". Look into some system that will allow you to have centralized access to distributed data. Kafka is a good alternative for this kind of streaming process.
Here in the official documentation you have the answer!
You have to create the connection inside the foreachRDD function. If you want to do it optimally, you should create a "pool" of connections, then borrow the connection you want inside the foreachPartition function, and call the foreach function to send the elements through that connection. This is the example code for doing it the best way:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
In any case, check the other comments as they provide good knowledge about the context of the problem.
crowd.foreachRDD(rdd => {rdd.collect.foreach(record=>{out.println(record)})})
The code suggested in your comments will work fine, but in that case you have to collect all records of the RDD in the driver. If the number of records is small that will be OK, but if the number of records is larger than the driver's memory it will become a bottleneck. Your first attempt should always process the data on the client. Remember the RDD is distributed on worker machines, which means you first need to bring all records of the RDD to the driver, resulting in increased communication, which is a killer in distributed computing. So, as stated, your code will only be OK when there are a limited number of records in the RDD.
I am working on similar problems and I have been searching for how to pool connections and serialize them to client machines. If somebody has any answers to that, it would be great.
