I have one unit test in which I am writing to and reading from Cassandra multiple times.
future = function(Cassandra update with value x)  // async: write and read the updated value
value = future.get();  // blocking read
print value;
assert value == x;
// the above operation is repeated multiple times with different values of x
Running the same code multiple times shows different results, i.e. it prints a different value for 'value' each time.
I am using Cassandra on localhost with
replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
It's worth noting that I am writing to and reading from the same row in the table (in every read and write, the primary key is the same).
And although I am modifying the same object multiple times, the operations are supposed to run sequentially, since I call the blocking future.get() after every update statement.
I am using Cassandra 2.0.14 with the DataStax driver and JDK 1.8.
Any ideas why I might be seeing such behaviour?
Figured out the reason.
In my code (not the test code), I wasn't actually writing and reading sequentially: the read wasn't waiting for the write to complete.
What I was doing:
CompletionStage<Void> Function() {
    return someOperation
        .thenAccept(variable -> AsyncWriteInDb(variable));
}
// AsyncWriteInDb returns a CompletionStage<Void> that completes when the write is done.
// I was reading just after executing this function.
What I should be doing:
CompletionStage<Void> Function() {
    return someOperation
        .thenCompose(variable -> AsyncWriteInDb(variable));
}
// AsyncWriteInDb returns a CompletionStage<Void> that completes when the write is done.
It's easier to understand if I write the earlier (wrong) code as below:
CompletionStage<Void> Function() {
    return someOperation
        .thenAccept(variable -> {
            AsyncWriteInDb(variable);
            return;
        });
}
// thenAccept's lambda returns right after initiating the async DB write.
// Reading just after this doesn't ensure the read happens after the write has completed.
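For illustration, here is a minimal runnable sketch of the corrected chaining. someOperation and AsyncWriteInDb are stand-ins mirroring the snippets above, and the in-memory store is purely for demonstration; the point is that thenCompose makes the outer stage complete only after the inner write stage does, so blocking on it really waits for the write.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class SequentialWriteReadSketch {

    // Stand-in for the database; a real implementation would write to Cassandra.
    private static final int[] store = new int[1];

    // Stand-in for AsyncWriteInDb: the returned stage completes once the write has happened.
    static CompletionStage<Void> asyncWriteInDb(int value) {
        return CompletableFuture.runAsync(() -> store[0] = value);
    }

    public static void main(String[] args) {
        // "someOperation" produces the value to persist.
        CompletionStage<Void> writeDone = CompletableFuture
                .supplyAsync(() -> 42)
                .thenCompose(SequentialWriteReadSketch::asyncWriteInDb); // waits for the inner write stage

        writeDone.toCompletableFuture().join(); // blocking here now guarantees the write finished
        System.out.println("read after write: " + store[0]); // prints 42
    }
}

With thenAccept instead of thenCompose, writeDone would complete as soon as asyncWriteInDb had merely been started, which is exactly the race described above.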
Related
Recently I've been working on a project where I have to make 2 asynchronous calls at the same time. Since I'm working with Quarkus, I ended up trying to make use of Mutiny and the Vert.x library. However, I cannot get my code working with Unis. In the code below, I would imagine that both Unis would be called and the Uni that returns fastest would be returned. However, it seems that when combining Unis it simply returns the first one in the list, even though the first Uni should take longer.
The code below prints out 'one one' when it should print out 'two two', since uniFast should finish first. How do I combine Unis and have the faster one return first?
@Test
public void testUniJoin() {
    var uniSlow = Uni.createFrom().item(() -> {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return "one";
    });
    var uniFast = Uni.createFrom().item(() -> {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return "two";
    });
    var resp = Uni.join().first(uniSlow, uniFast).withItem().await().indefinitely();
    System.out.println(resp);
    var resp2 = Uni.combine().any().of(uniSlow, uniFast).await().indefinitely();
    System.out.println(resp2);
}
Note: this is not the actual code I am trying to implement. In my real code I am fetching from 2 different databases, and one database often has a lot more latency than the other, yet Uni always seems to wait for the slower database. I'm simply trying to understand Mutiny and Unis better, so I made this code example.
The problem is that you are not telling Mutiny which thread each Uni should run on. If I add a System.out to your example:
// Slow and Fast for the different Uni
System.out.println( "Slow - " + Thread.currentThread().getId() + ":" + Thread.currentThread().getName() );
I get the following output:
Slow - 1:Test worker
one
Slow - 1:Test worker
Fast - 1:Test worker
one
The output shows that everything runs on the same thread and therefore when we block the first one, the second one is blocked too.
That's why the output is one one.
One way to run the Unis in parallel is to use a different executor at subscription time:
ExecutorService executorService = Executors.newFixedThreadPool( 5 );
uniSlow = uniSlow.runSubscriptionOn( executorService );
uniFast = uniFast.runSubscriptionOn( executorService );
Now, when I run the test, I have the expected output:
Slow - 16:pool-3-thread-1
Fast - 17:pool-3-thread-2
two
Slow - 18:pool-3-thread-3
Fast - 19:pool-3-thread-4
two
Note that this time Slow and Fast are running on different threads.
The Mutiny guide has a section about the difference between emitOn vs. runSubscriptionOn and some examples on how to change the emission thread.
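Putting it together, here is a minimal sketch of the same race as a plain main method, assuming only the Mutiny dependency; the Thread.sleep calls stand in for the slow and fast database calls:

import io.smallrye.mutiny.Uni;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UniRaceSketch {

    public static void main(String[] args) {
        ExecutorService executorService = Executors.newFixedThreadPool(5);

        // Each Uni is subscribed on its own worker thread instead of the caller's thread.
        Uni<String> uniSlow = Uni.createFrom().item(() -> {
            sleep(1000); // stand-in for the slow database call
            return "one";
        }).runSubscriptionOn(executorService);

        Uni<String> uniFast = Uni.createFrom().item(() -> {
            sleep(100); // stand-in for the fast database call
            return "two";
        }).runSubscriptionOn(executorService);

        // Both race in parallel; the first item to arrive wins.
        String resp = Uni.join().first(uniSlow, uniFast).withItem().await().indefinitely();
        System.out.println(resp); // prints "two"

        executorService.shutdown();
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}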
I want to use StepVerifier for integration testing of save operations in a Mongo repository.
I prepared a method for inserting multiple UserItems for further verification:
Flux<UserItem> saveMultiple(int numberOfItems) {
    return Flux.range(0, numberOfItems)
        .flatMap { userItemsRepository.save(new UserItem(it)) }
}
userItemsRepository.save returns Mono<UserItem>
I prepared a test method:
def "Should save all UserItems"() {
given:
def numberOfItems = 3
when:
def saveResult = saveMultiple(numberOfItems)
then:
StepVerifier.create(saveResult)
.expectNextMatches {it.itemNo == 0 }
.expectNextMatches {it.itemNo == 1 }
.expectNextMatches {it.itemNo == 2 }
.expectComplete()
.verify()
}
I expect the items to emerge in the order {0,1,2}. Unfortunately, the test fails with a java.lang.AssertionError in a non-deterministic way, at a different step each time. I cannot figure out how to do it properly. It's my first attempt at testing a Reactor flow. Does anyone have an idea how to handle such situations?
The flatMap operator doesn't preserve the ordering of the source and lets values from different inner publishers interleave.
So depending on userItemsRepository.save you can have something like:
1--2--3--4
flatMap
UserItem2--UserItem4--UserItem1--UserItem3
If interleaving doesn't bother you but you want to keep the original order, you can use flatMapSequential; if you don't want any interleaving at all, use concatMap.
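For illustration, here is a self-contained sketch in plain Java (with a stubbed repository, since the real userItemsRepository isn't shown; UserItem and save are placeholders mirroring the code above) showing that concatMap keeps the 0, 1, 2 order:

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.test.StepVerifier;

public class SaveOrderSketch {

    record UserItem(int itemNo) {}

    // Stand-in for userItemsRepository.save(...)
    static Mono<UserItem> save(UserItem item) {
        return Mono.just(item);
    }

    static Flux<UserItem> saveMultiple(int numberOfItems) {
        return Flux.range(0, numberOfItems)
                .concatMap(i -> save(new UserItem(i))); // inner Monos are subscribed one after another
    }

    public static void main(String[] args) {
        StepVerifier.create(saveMultiple(3))
                .expectNextMatches(it -> it.itemNo() == 0)
                .expectNextMatches(it -> it.itemNo() == 1)
                .expectNextMatches(it -> it.itemNo() == 2)
                .verifyComplete();
        System.out.println("order preserved");
    }
}

flatMapSequential would also pass this verification: it still subscribes to the inner publishers eagerly, but buffers their results so the downstream sees them in source order.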
I'm currently trying to run a batch processing job in Groovy with the GMongo driver. The collection is about 8 GB; my problem is that my script tries to load everything into memory. Ideally I'd like to be able to process it in batches, similar to what Spring Boot Batch does, but in Groovy scripts.
I've tried batchSize(), but this function still retrieves the entire collection into memory, only to then apply my logic to it batch by batch.
Here's my example:
mongoDb.collection.find().collect { it ->
    // logic
}
According to the official documentation:
https://docs.mongodb.com/manual/tutorial/iterate-a-cursor/#read-operations-cursors
def myCursor = db.collection.find()
while (myCursor.hasNext()) {
    print(myCursor.next())
}
After deliberation, I found the following solution to work best, for these reasons:
Unlike the cursor, it doesn't retrieve documents one at a time for processing (which can be terribly slow).
Unlike the GMongo batchSize function, it also doesn't try to load the entire collection into memory only to cut it up into batches for processing, which tends to be heavy on machine resources.
The code below is efficient and light on resources, depending on your batch size.
def skipSize = 0
def limitSize = 1000 // batch size (if you're going to hard-code it, you don't need an Integer.valueOf conversion)
def dbSize = Db.collectionName.count()
def dbRunCount = (dbSize + limitSize - 1).intdiv(limitSize) // round up so the last partial batch is included
dbRunCount.times {
    Db.collectionName.find()
        .skip(skipSize)
        .limit(limitSize)
        .collect { event ->
            // run your business logic processing
        }
    // calculate the next skipSize
    skipSize += limitSize
}
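For comparison, here is a rough sketch of the same skip/limit batching written against the plain MongoDB Java driver; the connection string, database and collection names are placeholders:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

public class BatchProcessSketch {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection =
                    client.getDatabase("myDb").getCollection("myCollection");

            int limitSize = 1000;                                   // batch size
            long dbSize = collection.countDocuments();
            long dbRunCount = (dbSize + limitSize - 1) / limitSize; // round up to include the last partial batch

            int skipSize = 0;
            for (long run = 0; run < dbRunCount; run++) {
                try (MongoCursor<Document> cursor = collection.find()
                        .skip(skipSize)
                        .limit(limitSize)
                        .iterator()) {
                    while (cursor.hasNext()) {
                        Document event = cursor.next();
                        // run your business logic processing on 'event'
                    }
                }
                skipSize += limitSize; // move the window to the next batch
            }
        }
    }
}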
I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I wish to do something more than a word count on a text file/stream/Kafka queue, which is about all the documentation walks you through.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, and then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging I see that, inside the while (partition.hasNext()) loop, myThingsList has a different memory address from the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called there are no results because the myThingsList is a different instance of the List.
I'd like a solution to this problem, but I'd prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean just a link to the single page of Spark documentation, but also the section/paragraph, or better still a link to the Javadoc, not Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver: it has to serialize the function and send it over to the executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first: is that really what you want to do? It means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
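To make the /* implement logic */ part concrete, here is a hedged sketch of one way to fill in combineByKey so that each key collects all of its payloads into a list. It is written as a fragment against the Spark 1.x Java API the question appears to use (where FlatMapFunction returns an Iterable; on Spark 2.x you would return pair._2.iterator() instead), and MyThing, MyResult, MyCalculationCode and kafkaStream are the placeholders from the question:

// needs java.util.ArrayList / java.util.List and org.apache.spark.HashPartitioner
JavaPairDStream<String, List<byte[]>> grouped = kafkaStream.combineByKey(
        bytes -> {                                           // createCombiner: start a new list for the key
            List<byte[]> list = new ArrayList<>();
            list.add(bytes);
            return list;
        },
        (list, bytes) -> { list.add(bytes); return list; },  // mergeValue: append to an existing list
        (l1, l2) -> { l1.addAll(l2); return l1; },           // mergeCombiners: merge partial lists
        new HashPartitioner(4));                             // partitioner; 4 is just an example value

JavaDStream<MyResult> results = grouped
        .flatMap(pair -> pair._2)                            // flatten back to individual payloads
        .map(proto -> MyThing.parseFrom(proto))              // parse from protobuffer format
        .map(myThing -> new MyCalculationCode().doStuff(myThing)); // per-item calculation

results.foreachRDD(rdd -> {
    // After all the processing, write the results to a file or other sink.
});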
I have a Spark computation that I want to persist into a simple leveldb database - once all the heavy lifting is done by Spark (in Scala here).
So my code goes like this:
private def saveRddToLevelDb(rdd: RDD[(String, Int)], target: File) = {
import resource._
val options = new Options()
options.createIfMissing(true)
options.compressionType(CompressionType.SNAPPY)
for (db <- managed(factory.open(target, options))) { // scala-arm
rdd.map { case (key, score) =>
(bytes(key), bytes(score.toString))
}.toLocalIterator.foreach { case (key, value) =>
db.put(key, value)
}
}
}
And all is right with the world. But then if I try to open the created database and do a get on it:
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: .../leveldb-data/000081.sst: Invalid argument
org.fusesource.leveldbjni.internal.NativeDB.get(NativeDB.java:316)
org.fusesource.leveldbjni.internal.NativeDB.get(NativeDB.java:300)
org.fusesource.leveldbjni.internal.NativeDB.get(NativeDB.java:293)
org.fusesource.leveldbjni.internal.JniDB.get(JniDB.java:85)
org.fusesource.leveldbjni.internal.JniDB.get(JniDB.java:77)
I managed, however, to make it work by not simply opening the created leveldb database, but repairing it beforehand (in Java this time):
factory.repair(new File(levelDbDirectory, "leveldb-data"), options);
DB db = factory.open(new File(levelDbDirectory, "leveldb-data"), options);
So, everything's all right then?!
Yes, but my only question is why ?
What am I doing wrong when I put all my data into leveldb:
the open stream to the database is managed by scala-arm, and is therefore closed properly afterwards
my JVM is not killed or anything
there's only one process, and in fact only one thread (the driver one) accessing the database (via the toLocalIterator method)
and finally, if I open the database in paranoid mode, leveldb doesn't complain until I try to do a get on it, so the database is not exactly corrupted in its eyes.
I've read that the put write is actually asynchronous. I did not, however, try changing the WriteOptions to synced, but wouldn't the close method wait for the process to flush everything?
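For reference, this is roughly what the synced-put variant I haven't tried yet would look like, as a sketch against the org.iq80.leveldb API exposed by leveldbjni (the key and value here are placeholders):

import java.io.File;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.WriteOptions;

import static org.fusesource.leveldbjni.JniDBFactory.bytes;
import static org.fusesource.leveldbjni.JniDBFactory.factory;

public class SyncedPutSketch {
    public static void main(String[] args) throws Exception {
        Options options = new Options().createIfMissing(true);
        try (DB db = factory.open(new File("leveldb-data"), options)) {
            // sync(true) forces each write to be flushed to disk before put() returns.
            WriteOptions syncWrite = new WriteOptions().sync(true);
            db.put(bytes("some-key"), bytes("some-value"), syncWrite);
        } // try-with-resources closes the DB, which should also flush pending writes
    }
}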