I have a PMML model that was exported from Python, and I'm using it in Spark for downstream processing. Since the JPMML Evaluator isn't serializable, I'm instantiating it inside mapPartitions. This works fine but takes a while to complete, as mapPartitions has to materialize the iterator and collect/build the new RDD. I'm wondering if there's a more optimal way to execute the Evaluator.
I've noticed that while Spark is executing this RDD, my CPU is under-utilized (it drops to ~30%). Also, in the Spark UI, the Task Time (GC Time) is red at 53s/15s.
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(r -> {
    // initialize the JPMML evaluator here
    List<ClassifiedPojo> list = new ArrayList<>();
    while (r.hasNext()) {
        // classify r.next() with the evaluator
        list.add(new ClassifiedPojo());
    }
    return list.iterator();
});
Finally! I had to do 2 things.
First, I had to fix the SAX Locator by running this:
// strip the SAX Locator information from the unmarshalled PMML object tree
LocatorNullifier locatorNullifier = new LocatorNullifier();
locatorNullifier.applyTo(pmml);
Second, I refactored my mapPartitions to use Streams, details here.
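For reference, a rough sketch of what that Streams-based mapPartitions can look like, so the partition is consumed lazily instead of materializing a List first (buildEvaluator and classify are hypothetical placeholders for the JPMML setup and scoring code):
// needs java.util.Spliterator, java.util.Spliterators and java.util.stream.StreamSupport
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(r -> {
    // build the JPMML Evaluator once per partition (hypothetical helper)
    Evaluator evaluator = buildEvaluator();
    // wrap the partition iterator in a lazy Stream; nothing is buffered up front
    return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(r, Spliterator.ORDERED), false)
        .map(record -> classify(evaluator, record)) // hypothetical scoring helper returning a ClassifiedPojo
        .iterator();
});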
This gave me a big boost. Hope it helps
(Goal Updated)
My goal for each data stream is to:
filter different messages
have different event-time session window gaps per stream
consume from one topic and produce to another topic
A fan-out -> fan-in like DAG.
var fanoutStreamOne = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamTwo = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamThree = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreams = Set.of(fanoutStreamOne, fanoutStreamTwo, fanoutStreamThree);
var source = new FlinkKafkaConsumer<>(...);
var sink = new FlinkKafkaProducer<>(...);
// creates streams from same source to same sink (Using union())
new StreamingJob(source, sink, fanoutStreams).execute();
I am just curious if this affects recovery/checkpoints or performance of the Flink application.
Has anyone had success with this implementation?
And should I have the watermark strategy up front before filtering?
Thanks in advance!
Okay, I don't think the different session gaps are possible. I tried it a year ago, with Flink 1.7, and I couldn't do it; the watermark is global to the application.
For the other problems: if you are using Kafka, you can read from several topics using a regex and get the topic with the proper deserialization schema (here).
To filter the messages, I think you can use filter functions together with side output streams :) (here)
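A minimal sketch of the side-output idea, assuming a DataStream<String> called source (the tag name and the filtering predicate are made up for illustration):
final OutputTag<String> otherMsgsTag = new OutputTag<String>("other-msgs") {};

SingleOutputStreamOperator<String> mainMsgs = source
        .process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String msg, Context ctx, Collector<String> out) {
                if (msg.startsWith("A")) {
                    out.collect(msg);              // kept in the main stream
                } else {
                    ctx.output(otherMsgsTag, msg); // routed to the side output
                }
            }
        });

DataStream<String> otherMsgs = mainMsgs.getSideOutput(otherMsgsTag);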
I'm currently trying to run a batch-processing job in Groovy with the GMongo driver. The collection is about 8 GB. My problem is that my script tries to load everything in memory; ideally I'd like to be able to process it in batches, similar to what Spring Boot Batch does, but in Groovy scripts.
I've tried batchSize(), but this function still retrieves the entire collection into memory, only to apply my logic to it batch by batch.
Here's my example:
mongoDb.collection.find().collect { it ->
    //logic
}
According to the official docs:
https://docs.mongodb.com/manual/tutorial/iterate-a-cursor/#read-operations-cursors
def myCursor = db.collection.find()
while (myCursor.hasNext()) {
    print(myCursor.next())
}
After deliberation I found this solution to work best, for the following reasons:
Unlike the cursor, it doesn't retrieve documents one at a time for processing (which can be terribly slow).
Unlike the GMongo batch function, it also doesn't try to load the entire collection into memory only to cut it up into batches for processing, which tends to be heavy on machine resources.
The code below is efficient and light on resources, depending on your batch size.
def skipSize = 0
def limitSize = 1000 // batch size; if you're going to hard-code it, you don't need an Integer.valueOf() conversion
def dbSize = Db.collectionName.count()
def dbRunCount = Math.ceil(dbSize / limitSize) as Integer // round up so the final partial batch isn't dropped
dbRunCount.times { it ->
Db.collectionName.find()
.skip(skipSize)
.limit(limitSize)
.collect { event ->
//run your business logic processing
}
//calculate the next skipSize
skipSize += limitSize
}
Given some code using streams to process a large number of items, what's the best way to instrument the various steps for logging and performance/profiling?
Actual example:
ReactiveSeq.fromStream(pairs)
.filter(this::satisfiesThreshold)
.filter(this::satisfiesPersistConditions)
.map((pair) -> convertToResult(pair, jobId))
.flatMap(Option::toJavaStream)
.grouped(CHUNK_SIZE)
.forEach((chunk) ->
{
repository.save(chunk);
incrementAndReport();
});
reportProcessingTime();
Logging progress is important so I can trigger progress events in another thread that update a user interface.
Tracking the performance characteristics of the filtering and mapping steps in this stream is desirable, to see where optimizations can be made to speed it up.
I see three options:
put logging/profiling code in each function
use peek around each step without actually using the value
some sort of annotation based or AOP solution (no idea what)
Which is the best? Any ideas on what #3 would look like? Is there another solution?
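For context, option 2 would look roughly like this; the AtomicLong holder and the slf4j-style log calls are only for illustration:
AtomicLong lastTick = new AtomicLong(System.nanoTime());

ReactiveSeq.fromStream(pairs)
        .peek(p -> lastTick.set(System.nanoTime()))
        .filter(this::satisfiesThreshold)
        .peek(p -> log.debug("satisfiesThreshold took {} ns", System.nanoTime() - lastTick.getAndSet(System.nanoTime())))
        .filter(this::satisfiesPersistConditions)
        .peek(p -> log.debug("satisfiesPersistConditions took {} ns", System.nanoTime() - lastTick.getAndSet(System.nanoTime())))
        .map((pair) -> convertToResult(pair, jobId));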
You have a couple of options here (if I have understood correctly) :-
We can make use of the elapsed operator to track the elapsed time between element emissions e.g.
ReactiveSeq.fromStream(Stream.of(1,2))
.filter(this::include)
.elapsed()
.map(this::logAndUnwrap)
Long[] filterTimeTakenMillis = new Long[maxSize];
int filterIndex = 0;
private <T> T logAndUnwrap(Tuple2<T, Long> t) {
//capture the elapsed time (t.v2) and then unwrap the tuple
filterTimeTakenMillis[filterIndex++]=t.v2;
return t.v1;
}
This will only work on cyclops-react Streams.
We can make use of the AOP-like functionality in FluentFunctions
e.g.
ReactiveSeq.fromStream(Stream.of(1,2))
.filter(this::include)
.elapsed()
.map(this::logAndUnwrap)
.map(FluentFunctions.of(this::convertToResult)
.around(a->{
SimpleTimer timer = new SimpleTimer();
String r = a.proceed();
mapTimeTakenNanos[mapIndex++]=timer.getElapsedNanos();
return r;
}));

Long[] mapTimeTakenNanos = new Long[maxSize];
int mapIndex = 0;
This will also work on vanilla Java 8 Streams.
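For example, the same around() advice can wrap a function used in a plain java.util.stream pipeline; this is just a sketch, with a doubling function standing in for convertToResult:
List<Integer> doubled = Stream.of(1, 2, 3)
        .map(FluentFunctions.of((Integer x) -> x * 2)
                .around(a -> {
                    // time a single invocation of the wrapped function
                    SimpleTimer timer = new SimpleTimer();
                    Integer result = a.proceed();
                    System.out.println("call took " + timer.getElapsedNanos() + " ns");
                    return result;
                }))
        .collect(Collectors.toList());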
I'm trying to get to grips with Spark Streaming, but I'm having difficulty. Despite reading the documentation and analysing the examples, I want to do something more than a word count on a text file/stream/Kafka queue, which is about the only thing the docs walk you through.
I want to listen to an incoming Kafka message stream, group the messages by key, and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group the messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging, I see that inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called there are no results because the myThingsList is a different instance of the List.
I'd like a solution to this problem, but I'd prefer a reference to documentation that helps me understand why this is not working (as anticipated) and how I can solve it for myself. I don't mean a link to the single page of Spark documentation, but the section/paragraph, or better still a link to the JavaDoc, rather than Scala examples with non-functional commented-out code.
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and ship it to the executor handling that partition. You have to remember that, although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code is your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? It means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which will let you return a JavaPairDStream<String, List<byte[]>>.
Regarding parsing your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
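The /* implement logic */ part could be sketched roughly like this, collecting every byte[] payload per key into a List<byte[]> (the HashPartitioner and its partition count are assumptions, not taken from your setup):
JavaPairDStream<String, List<byte[]>> grouped = kafkaStream.combineByKey(
        value -> {                                               // createCombiner: start a list for the first value
            List<byte[]> list = new ArrayList<>();
            list.add(value);
            return list;
        },
        (list, value) -> { list.add(value); return list; },      // mergeValue: append to the existing list
        (left, right) -> { left.addAll(right); return left; },   // mergeCombiners: merge partial lists
        new HashPartitioner(4));                                 // partition count is arbitrary here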
I'm hoping that someone familiar with Spark can give me a "gut check" on whether I'm likely abusing the SparkML framework or if the performance I'm seeing is understandable given the context (#rows, #features).
Briefly, I have a small dataset (~150 rows) that is fairly wide (~180 features). I have coded up analogous Lasso training code in Spark and scikit-learn, which results in identical models (same model coefficients and LOOCVE). However, the Spark code takes roughly 100x longer (sklearn finishes in about 5 seconds, Spark in close to 600 seconds).
I understand that Spark is optimized for large distributed datasets, and that this difference can reasonably be attributed to overhead latency that would be hidden by data parallelism, but this still feels extremely sluggish.
The spark code is essentially:
//... code to add a number of PipelineStages to a List<PipelineStage> (~90 UnaryTransformer stages), ending in a StandardScaler
// Add Lasso model
LinearRegression lasso = new LinearRegression()
.setLabelCol(response)
.setFeaturesCol("normed_features")
.setMaxIter(100000)
.setPredictionCol(response+"_prediction")
.setElasticNetParam(1.0)
.setFitIntercept(true)
.setRegParam(0.2);
// stages is the List<PipelineStage> loaded with 90 or so UnaryTransformer steps
stages.add(lasso);
Pipeline pipeline = new Pipeline().setStages(stages.toArray(new PipelineStage[0]));
DataFrame df = getTrainingData(trainingData, response);
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol(response)
.setMetricName("mae")
.setPredictionCol(response+"_prediction");
df.cache();
ParamMap[] paramGrid = new ParamGridBuilder().build();
CrossValidator cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(20);
double cve = cv.fit(df).avgMetrics()[0];
The Python code uses Lasso and GridSearchCV with the same number of folds (20).
Unfortunately, I can't really provide an MWE, as we use a custom Transformer that I'd have to paste in, but I'm wondering if anyone would be willing to weigh in on whether this runtime difference between sklearn and Spark implies user error. The only good practice I'm knowingly applying is caching the training DataFrame before fitting the CrossValidator.