Groovy gmongo batch processing - java

I'm currently trying to run a batch processing job in Groovy with the GMongo driver. The collection is about 8 GB, and my problem is that my script tries to load everything in memory. Ideally I'd like to process it in batches, similar to what Spring Boot Batch does, but in a Groovy script.
I've tried batchSize(), but it still retrieves the entire collection into memory before applying my logic to it in batches.
Here's my example:
mongoDb.collection.find().collect { it ->
    // logic
}

According to the official documentation:
https://docs.mongodb.com/manual/tutorial/iterate-a-cursor/#read-operations-cursors
def myCursor = db.collection.find()
while (myCursor.hasNext()) {
    print(myCursor.next())
}

After deliberation I found this solution works best, for the following reasons:
Unlike the cursor, it doesn't retrieve documents one at a time for processing (which can be terribly slow).
Unlike the GMongo batch function, it doesn't try to load the entire collection into memory only to cut it up into batches for processing, which tends to be heavy on machine resources.
The code below is efficient and light on resources, depending on your batch size.
def skipSize = 0
def limitSize = Integer.valueOf(1000) // batch size (if you hard-code the batch size you don't need the int conversion)
def dbSize = dstvoDsEpgDb.schedule.count()
def dbRunCount = Math.ceil(dbSize / limitSize) as int // round up so the last partial batch is processed
dbRunCount.times {
    dstvoDsEpgDb.schedule.find()
        .skip(skipSize)
        .limit(limitSize)
        .collect { event ->
            // run your business logic processing
        }
    // calculate the next skipSize
    skipSize += limitSize
}

Related

How to measure execution time of an async query/request inside Kotlin coroutines

I have a microservice in which I use Kotlin coroutines to perform a bunch of DB queries asynchronously, and I want to monitor the execution time of each of those queries for potential performance optimization.
The implementation I have is like this:
val requestSemaphore = Semaphore(5)
val baseProductsNos = productRepository.getAllBaseProductsNos()
runBlocking {
    baseProductsNos
        .chunked(500)
        .map { batchOfProductNos ->
            launch {
                requestSemaphore.withPermit {
                    val rawBaseProducts = async {
                        productRepository.getBaseProducts(batchOfProductNos)
                    }
                    val mediaCall = async {
                        productRepository.getProductMedia(batchOfProductNos)
                    }
                    val productDimensions = async {
                        productRepository.getProductDimensions(batchOfProductNos)
                    }
                    val allowedCountries = async {
                        productRepository.getProductNosInCountries(batchOfProductNos, countriesList)
                    }
                    val variants = async {
                        productRepository.getProductVariants(batchOfProductNos)
                    }
                    // here I wait for all the results and then do some processing on them
                }
            }
        }.joinAll()
}
As you can see, I use a Semaphore to limit the number of parallel jobs. All the repository methods are suspending functions, and they are the ones I want to measure the execution time of. Here is an example of an implementation inside ProductRepository:
suspend fun getBaseProducts(baseProductNos: List<String>): List<RawBaseProduct> =
    withContext(Dispatchers.IO) {
        namedParameterJdbcTemplateMercator.query(
            getSqlFromResource(baseProductSql),
            getNamedParametersForBaseProductNos(baseProductNos),
            RawBaseProductRowMapper()
        )
    }
To do that, I tried this:
val rawBaseProductsCall = async {
    val startTime = System.currentTimeMillis()
    val result = productRepository.getBaseProducts(productNos)
    val endTime = System.currentTimeMillis()
    logger.info("${TemporaryLog("call-duration", "rawBaseProductsCall", endTime - startTime)}")
    result
}
But this measurement always returns inconsistent averages compared to the sequential implementation (without coroutines). The only explanation I can come up with is that it includes suspension time, and I am only interested in the time the queries take to execute, excluding any time spent suspended.
I don't know if what I am trying to do is possible in Kotlin, but it looks like Python supports it, so I would appreciate any help doing something similar in Kotlin.
UPDATE:
In my case I am using a regular Java library to query the DB, so my queries are just regular blocking calls, which means the way I am measuring time right now is correct.
The assumption I made in the question would only have been valid if I were using some implementation of R2DBC to query my DB.
You do not want to measure coroutine startup or suspension time, so you need to measure over a block of code that will not suspend, i.e. your database calls made through a Java library.
The standard library, for example, provides a few nice functions such as measureTimedValue:
val (result, duration) = measureTimedValue {
    doWork()
    // eg: productRepository.getBaseProducts(batchOfProductNos)
}
logger.info("operation took $duration")
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.time/measure-timed-value.html
I don't know if this is intentional or a mistake, but you use only a single thread here. You start tens or even hundreds of coroutines and they all fight each other for this single thread. If you perform any CPU-intensive processing at the point marked "here I wait for all the results and then do some processing on them", then while that is running, all other coroutines have to wait to be resumed from withContext(Dispatchers.IO). If you want to utilize multiple threads, replace runBlocking {} with runBlocking(Dispatchers.Default) {}.
Still, that doesn't fix the problem, it only lessens its impact. Regarding the proper fix: if you need to measure the time spent in IO only, then... measure the time in IO only. Just move your measurements inside withContext(Dispatchers.IO) and I think the results will be closer to what you expect. Otherwise, it is like measuring the size of a room by standing outside the building.

Using a PMML model in Spark

I have a PMML model that was exported from Python, and I'm using it in Spark for downstream processing. Since the JPMML Evaluator isn't serializable, I'm using it inside mapPartitions. This works fine but takes a while to complete, as mapPartitions has to materialize the iterator and collect/build the new RDD. I'm wondering if there's a more optimal way to run the Evaluator.
I've noticed that when Spark is executing this RDD, my CPU is under-utilized (drops to ~30%). Also, in the Spark UI, the Task Time (GC Time) is red at 53 s / 15 s.
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(r -> {
    // initialize the JPMML evaluator here, once per partition
    List<ClassifiedPojo> list = new ArrayList<>();
    while (r.hasNext()) {
        Object row = r.next(); // classify `row` with the evaluator
        list.add(new ClassifiedPojo());
    }
    return list.iterator();
});
Finally! I had to do 2 things.
First, I had to fix the SAX Locator by running this:
LocatorNullifier locatorNullifier = new LocatorNullifier();
locatorNullifier.applyTo(pmml);
Second, I refactored my mapPartitions to use Streams, details here.
This gave me a big boost. Hope it helps
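A minimal sketch of a Streams-based mapPartitions refactor, assuming Spark 2.x's Java API (where the mapPartitions function returns an Iterator), could look something like this. ToClassifyPojo stands in for the element type of toBeClassifiedRdd, and buildEvaluator()/classify() are hypothetical helpers around the JPMML Evaluator:
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(rows -> {
    // Build the evaluator once per partition, then classify lazily as Spark
    // consumes the returned iterator, instead of materializing a List first.
    Evaluator evaluator = buildEvaluator();
    Iterable<ToClassifyPojo> iterable = () -> rows;
    return StreamSupport.stream(iterable.spliterator(), false) // java.util.stream.StreamSupport
        .map(row -> classify(evaluator, row))
        .iterator();
});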

How to increase performance of Groovy?

I'm using Groovy to execute some pieces of Java code.
For my purposes Groovy is easy to use, since the code I have to execute has an arbitrary number of parameters that I cannot predict, because it depends on the user input.
The input I'm talking about is OWL axioms, which are nested.
This is my code:
// The reflection
static void reflectionToOwl() {
    Binding binding = new Binding(); // 155 ms
    GroovyShell shell = new GroovyShell(binding);
    while (!OWLMapping.axiomStack.isEmpty()) {
        String s = OWLMapping.axiomStack.pop();
        shell.evaluate(s); // 350 ms
    }
}
The only bottleneck in my program is exactly here: the more data I have to process, the more milliseconds I have to wait.
Do you have any suggestions?
If you need to increase Groovy performance, you can use the @CompileStatic annotation.
It lets the Groovy compiler use compile-time checks in the style of Java and then perform static compilation, thus bypassing the Groovy meta-object protocol.
Just annotate the specific method with it, but be sure that you don't use any dynamic features in that scope.
As an example:
import groovy.transform.CompileStatic

@CompileStatic
class Static {
}

class Dynamic {
}

println Static.declaredMethods.length
Static.declaredMethods.collect { it.name }.each { println it }

println('-' * 100)

println Dynamic.declaredMethods.length
Dynamic.declaredMethods.collect { it.name }.each { println it }
The statically compiled class won't generate the extra call-site methods:
6
invokeMethod
getMetaClass
setMetaClass
$getStaticMetaClass
setProperty
getProperty
8
invokeMethod
getMetaClass
setMetaClass
$getStaticMetaClass
$getCallSiteArray
$createCallSiteArray
setProperty
getProperty
Like the first answer indicated, @CompileStatic would have been the first option on my list of tricks as well.
Depending on your use case, pre-parsing the script expressions and calling run() on them at execution time might be an option. The following code demonstrates the idea:
def exprs = [
    "(1..10).sum()",
    "[1,2,3].max()"
]

def shell = new GroovyShell()

def scripts = time("parse exprs") {
    exprs.collect { expr ->
        shell.parse(expr) // here we pre-parse the strings to groovy Script instances
    }
}

def standardBindings = [someKey: 'someValue', someOtherKey: 'someOtherValue']

scripts.eachWithIndex { script, i ->
    time("run $i") {
        script.binding = new Binding(standardBindings)
        def result = script.run() // execute the pre-parsed Script instance
    }
}

// just a small method for timing operations
def time(str, closure) {
    def start = System.currentTimeMillis()
    def result = closure()
    def delta = System.currentTimeMillis() - start
    println "$str took $delta ms -> result $result"
    result
}
which prints:
parse exprs took 23 ms -> result [Script1#1165b38, Script2#4c12331b]
run 0 took 7 ms -> result 55
run 1 took 1 ms -> result 3
on my admittedly aging laptop.
The above code operates in two steps:
Parse the String expressions into Script instances using shell.parse(). This can be done in a background thread, on startup, or otherwise while the user is not waiting for results.
At execution time, call script.run() on the pre-parsed Script instances. This should be faster than calling shell.evaluate.
The takeaway here is that if your use case allows for pre-parsing and has a need for runtime execution speed, it's possible to get quite decent performance with this pattern.
An example application I have used this in is a generic feed file import process where the expressions were customer editable data mapping expressions and the data was millions of lines of product data. You parse the expressions once and call script.run millions of times. In this kind of scenario pre-parsing saves a lot of cycles.
Instead of Groovy you can also use BeanShell.
It is super easy to use and it is very lightweight:
Website
Probably not all Java features are supported, but just give it a try.
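For reference, evaluating an expression with BeanShell from Java looks roughly like this (a minimal sketch; the bound variable and the expression are just illustrative):
import bsh.Interpreter;

Interpreter interpreter = new Interpreter();
interpreter.set("x", 21);                  // bind a variable into the script scope
Object result = interpreter.eval("x * 2"); // evaluate a Java-like expression string
System.out.println(result);                // prints 42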

Spark RDD loaded into LevelDB via toLocalIterator creates corrupted database

I have a Spark computation that I want to persist into a simple LevelDB database once all the heavy lifting has been done by Spark (in Scala here).
So my code goes like this:
private def saveRddToLevelDb(rdd: RDD[(String, Int)], target: File) = {
  import resource._
  val options = new Options()
  options.createIfMissing(true)
  options.compressionType(CompressionType.SNAPPY)
  for (db <- managed(factory.open(target, options))) { // scala-arm
    rdd.map { case (key, score) =>
      (bytes(key), bytes(score.toString))
    }.toLocalIterator.foreach { case (key, value) =>
      db.put(key, value)
    }
  }
}
And all is right with the world. But then if I try to open the created database and do a get on it:
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: .../leveldb-data/000081.sst: Invalid argument
org.fusesource.leveldbjni.internal.NativeDB.get(NativeDB.java:316)
org.fusesource.leveldbjni.internal.NativeDB.get(NativeDB.java:300)
org.fusesource.leveldbjni.internal.NativeDB.get(NativeDB.java:293)
org.fusesource.leveldbjni.internal.JniDB.get(JniDB.java:85)
org.fusesource.leveldbjni.internal.JniDB.get(JniDB.java:77)
However, I managed to make it work by not simply opening the created LevelDB database, but repairing it beforehand (in Java this time):
factory.repair(new File(levelDbDirectory, "leveldb-data"), options);
DB db = factory.open(new File(levelDbDirectory, "leveldb-data"), options);
So, everything's all right then?!
Yes, but my only question is: why?
What am I doing wrong when I put all my data into LevelDB, given that:
the open handle to the database is managed by scala-arm, and therefore closed properly afterwards
my JVM is not killed or anything
there's only one process, heck even only one thread (the driver, via the toLocalIterator method) accessing the database
and finally, if I open the database in paranoid mode, LevelDB doesn't complain until I try to do a get on it, so the database is not exactly corrupted in its eyes.
I've read that the put write is actually asynchronous. I did not, however, try changing the WriteOptions to synced, but wouldn't the close method wait for everything to be flushed?
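For anyone who wants to test the synced-write theory, a sketch of what a synchronous put looks like with the org.iq80.leveldb API (not tested against this exact setup):
// Force each write to be flushed to disk before put() returns.
WriteOptions writeOptions = new WriteOptions().sync(true);
db.put(key, value, writeOptions);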

Mapreduce - sequence jobs?

I am using MapReduce (just map, really) to do a data processing task in four phases. Each phase is one MapReduce job. I need them to run in sequence, that is, don't start phase 2 until phase 1 is done, etc. Does anyone who has experience doing this care to share?
Ideally we'd run this 4-job sequence overnight, so making it cron-able would be a fine thing as well.
Thank you.
As Daniel mentions, the appengine-pipeline library is meant to solve this problem. I go over chaining mapreduce jobs together in this blog post, under the section "Implementing your own Pipeline jobs".
For convenience, I'll paste the relevant section here:
Now that we know how to launch the predefined MapreducePipeline, let’s take a look at implementing and running our own custom pipeline jobs. The pipeline library provides a low-level library for launching arbitrary distributed computing jobs within appengine, but, for now, we’ll talk specifically about how we can use this to help us chain mapreduce jobs together. Let’s extend our previous example to also output a reverse index of characters and IDs.
First, we define the parent pipeline job.
class ChainMapReducePipeline(mapreduce.base_handler.PipelineBase):
    def run(self):
        deduped_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_combiner",
                "main.map",
                "main.reduce",
                "mapreduce.input_readers.RandomStringInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                combiner_spec="main.combine",
                mapper_params={
                    "string_length": 1,
                    "count": 500,
                },
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=16))

        char_to_id_index_blob_key = (
            yield mapreduce.mapreduce_pipeline.MapreducePipeline(
                "test_chain",
                "main.map2",
                "main.reduce2",
                "mapreduce.input_readers.BlobstoreLineInputReader",
                "mapreduce.output_writers.BlobstoreOutputWriter",
                # Pass output from first job as input to second job
                mapper_params=(yield BlobKeys(deduped_blob_key)),
                reducer_params={
                    "mime_type": "text/plain",
                },
                shards=4))
This launches the same job as the first example, takes the output from that job, and feeds it into the second job, which reverses each entry. Notice that the result of the first pipeline yield is passed in to mapper_params of the second job. The pipeline library uses magic to detect that the second pipeline depends on the first one finishing and does not launch it until the deduped_blob_key has resolved.
Next, I had to create the BlobKeys helper class. At first, I didn’t think this was necessary, since I could just do:
mapper_params={"blob_keys": deduped_blob_key},
But, this didn’t work for two reasons. The first is that “generator pipelines cannot directly access the outputs of the child Pipelines that it yields”. The code above would require the generator pipeline to create a temporary dict object with the output of the first job, which is not allowed. The second is that the string returned by BlobstoreOutputWriter is of the format “/blobstore/<key>”, but BlobstoreLineInputReader expects simply “<key>”. To solve these problems, I made a little helper BlobKeys class. You’ll find yourself doing this for many jobs, and the pipeline library even includes a set of common wrappers, but they do not work within the MapreducePipeline framework, which I discuss at the bottom of this section.
class BlobKeys(third_party.mapreduce.base_handler.PipelineBase):
    """Returns a dictionary with the supplied keyword arguments."""

    def run(self, keys):
        # Remove the key from a string in this format:
        # /blobstore/<key>
        return {
            "blob_keys": [k.split("/")[-1] for k in keys]
        }
Here is the code for the map2 and reduce2 functions:
def map2(data):
    # BlobstoreLineInputReader.next() returns a tuple
    start_position, line = data
    # Split input based on previous reduce() output format
    elements = line.split(" - ")
    random_id = elements[0]
    char = elements[1]
    # Swap 'em
    yield (char, random_id)

def reduce2(key, values):
    # Create the reverse index entry
    yield "%s - %s\n" % (key, ",".join(values))
I'm unfamiliar with Google App Engine, but couldn't you put all of the job configurations in a single main program and then run them in sequence, something like the following? I think this works in normal MapReduce programs, so if Google App Engine code isn't too different it should work fine.
Configuration conf1 = getConf();
Configuration conf2 = getConf();
Configuration conf3 = getConf();
Configuration conf4 = getConf();
//whatever configuration you do for the jobs
Job job1 = new Job(conf1,"name1");
Job job2 = new Job(conf2,"name2");
Job job3 = new Job(conf3,"name3");
Job job4 = new Job(conf4,"name4");
//setup for the jobs here
job1.waitForCompletion(true);
job2.waitForCompletion(true);
job3.waitForCompletion(true);
job4.waitForCompletion(true);
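Since the question also asks about making the sequence cron-able, a small addition to the sketch above (plain Hadoop API, nothing App Engine specific) is to check each waitForCompletion result and exit non-zero when a phase fails, so later phases never run on bad input and cron can surface the failure:
// Stop the chain as soon as a phase fails and report it through the exit code.
if (!job1.waitForCompletion(true)) System.exit(1);
if (!job2.waitForCompletion(true)) System.exit(2);
if (!job3.waitForCompletion(true)) System.exit(3);
System.exit(job4.waitForCompletion(true) ? 0 : 4);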
You need the appengine-pipeline project, which is meant for exactly this.
