I'm hoping that someone familiar with Spark can give me a "gut check" on whether I'm likely abusing the SparkML framework or if the performance I'm seeing is understandable given the context (#rows, #features).
Briefly, I have a small dataset (~150 rows) that is fairly wide (~180 features). I have coded up analogous Lasso training codes in Spark and Scikit-learn, which result in identical models (same model coefficients and LOOCVE). However, the Spark code takes roughly 100x longer (sklearn takes about 5 seconds, Spark close to 600 seconds).
I understand that Spark is optimized for large distributed datasets and that this difference can reasonably be attributed to overhead and latency that would be hidden by data parallelism, but this still feels extremely sluggish.
The Spark code is essentially:
//... code to add a number of PipelineStages to a List<PipelineStage> (~90 UnaryTransformer stages), ending in a StandardScaler
// Add Lasso model
LinearRegression lasso = new LinearRegression()
.setLabelCol(response)
.setFeaturesCol("normed_features")
.setMaxIter(100000)
.setPredictionCol(response+"_prediction")
.setElasticNetParam(1.0)
.setFitIntercept(true)
.setRegParam(0.2);
// stages is the List<PipelineStage> loaded with 90 or so UnaryTransformer steps
stages.add(lasso);
Pipeline pipeline = new Pipeline().setStages(stages.toArray(new PipelineStage[0]));
DataFrame df = getTrainingData(trainingData, response);
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol(response)
.setMetricName("mae")
.setPredictionCol(response+"_prediction");
df.cache();
ParamMap[] paramGrid = new ParamGridBuilder().build();
CrossValidator cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(20);
double cve = cv.fit(df).avgMetrics()[0];
The Python code uses Lasso and GridSearchCV with the same number of folds (20).
Unfortunately, I can't really provide an MWE as we use a custom Transformer that I'd have to paste in, but I'm wondering if anyone would be willing to weigh in on whether this runtime difference between sklearn and Spark implies user error. The only good practice I am knowingly applying is caching the training DataFrame before fitting the CrossValidator.
Related
I have a PMML model that was exported from Python and I'm using it in Spark for downstream processing. Since the jpmml Evaluator isn't serializable, I'm using it inside mapPartitions. This works fine but takes a while to complete, as the mapPartitions function has to materialize the iterator and collect/build the new RDD. I'm wondering if there's a more optimal way to execute the Evaluator.
I've noticed that when Spark is executing this RDD, my CPU is underutilized (drops to ~30%). Also, from the Spark UI, the Task Time (GC Time) is red at 53s/15s.
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(r -> {
    // initialize the JPMML evaluator here, once per partition
    List<ClassifiedPojo> list = new ArrayList<>();
    while (r.hasNext()) {
        Object record = r.next();       // advance the partition iterator
        // classify 'record' with the evaluator (details omitted)
        list.add(new ClassifiedPojo());
    }
    return list.iterator();             // the whole partition is materialized before returning
});
Finally! I had to do 2 things.
First, I had to fix the SAX Locator by running this:
LocatorNullifier locatorNullifier = new LocatorNullifier();
locatorNullifier.applyTo(pmml);
Second, I refactored my mapPartitions to use Streams, details here; a rough sketch is below.
This gave me a big boost. Hope it helps.
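For reference, here is a minimal sketch of the stream-based mapPartitions (assumptions: Spark 2.x lambda-style mapPartitions returning an Iterator, plus two hypothetical helpers, buildEvaluator() for the JPMML setup and classify() for evaluating a single record):
import java.util.Spliterators;
import java.util.stream.StreamSupport;
import org.jpmml.evaluator.Evaluator;
JavaRDD<ClassifiedPojo> classifiedRdd = toBeClassifiedRdd.mapPartitions(r -> {
    // hypothetical helper: build the (non-serializable) JPMML evaluator once per partition
    Evaluator evaluator = buildEvaluator();
    return StreamSupport.stream(Spliterators.spliteratorUnknownSize(r, 0), false)
            .map(record -> classify(evaluator, record)) // hypothetical helper returning a ClassifiedPojo
            .iterator();                                // consumed lazily; no intermediate List is built
});
The important part is returning the mapped iterator directly, so Spark pulls results lazily instead of waiting for a full List to be built per partition.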
I am trying to use Mahout for recommendations but am getting none.
My dataset:
0,102,5.0
1,101,5.0
1,102,5.0
Code:
DataModel datamodel = new FileDataModel(new File("dataset.csv"));
// Creating UserSimilarity object.
UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
// Creating UserNeighborhood object.
UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(0.1, usersimilarity, datamodel);
// Create UserRecommender
UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);
List<RecommendedItem> recommendations = recommender.recommend(0, 1);
for (RecommendedItem recommendation : recommendations) {
System.out.println(recommendation);
}
I am using Mahout version 0.13.0.
Ideally, it should recommend item_id = 101 to user_id = 0: since user = 0 and user = 1 have item 102 in common, it should recommend item_id = 101 to user_id = 0.
Logs:
18:08:11.669 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file dataset.csv
18:08:11.700 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
18:08:11.702 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 3
18:08:11.722 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel - Processed 2 users
18:08:11.738 [main] DEBUG org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender - Recommending items for user ID '0'
The Hadoop MapReduce code in Mahout is being deprecated. The new recommender code starts with #rawkintrevo's examples. If you are a Scala programmer, follow them.
Most engineers would like a system that works with no modification. The Mahout algorithm is encapsulated in The Universal Recommender, built on top of Apache PredictionIO. It has a server to accept events (like the ones in your example), internal event storage, and a query server for results. There are numerous improvements over the old MapReduce code, including using real-time user behavior to make recommendations. Neither the new Mahout nor the old includes servers for input and query; the Universal Recommender has REST endpoints for both.
Given that the code you are using will be deprecated, I strongly suggest that you dive into the Mahout code (#rawkintrevo's examples) or look at The Universal Recommender, which is an entire end-to-end system.
Install PredictionIO with a "single machine" setup here, or to really shortcut setup use our prepackaged AWS AMI here; it includes PIO and The Universal Recommender pre-installed.
Add the UR Template here.
There is a Java SDK for sending events to the recommender here (a rough sketch follows below).
Once you have this set up, you deal with config, REST or the Java SDK, and the PIO CLI. No Scala coding required.
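To give a feel for the Java SDK route, here is a rough sketch of sending one of the rating events from your dataset to the EventServer. The class and method names are taken from memory of the PredictionIO Java SDK quickstart, so treat the exact signatures (EventClient, Event, createEvent) as assumptions:
import io.prediction.Event;
import io.prediction.EventClient;
// access key and EventServer URL come from your PIO app setup
EventClient client = new EventClient("YOUR_ACCESS_KEY", "http://localhost:7070");
// roughly the "1,101,5.0" line of the dataset above, expressed as a user-to-item event
Event event = new Event()
        .event("rate")
        .entityType("user").entityId("1")
        .targetEntityType("item").targetEntityId("101");
client.createEvent(event);
client.close();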
I have three examples based on version 0.13.0 (and Scala, which is required for Samsara, the R-like Scala DSL Mahout has utilized since v0.10).
Walk
The first example is a very slow walk through:
https://gist.github.com/rawkintrevo/3869030ff1a731d43c5e77979a5bf4a8
and is meant as a companion to Pat Ferrel's blog post/slide deck found here.
http://actionml.com/blog/cco
Crawl
The second example is a little more "real" in that it utilizes SimilarityAnalysis.cooccurrencesIDSs(...), which is the proper interface for the CCO algorithm.
https://gist.github.com/rawkintrevo/c1bb00896263bdc067ddcd8299f4794c
Run
Here we use 'real' data. The MovieLens data set doesn't have enough going on to showcase CCO's multi-modal power (the ability to recommend on multiple user behaviors), so here we load 'real' data and generate recommendations.
https://gist.github.com/rawkintrevo/f87cc89f4d337d7ffea80a6af3bee83e
Conclusion
I know you specifically asked for Java; however, Apache Mahout isn't geared for Java at the moment. In theory you could import the Scala into your Java, or maybe wrap the functions in another, more Java-friendly function... I've heard rumors late at night (or possibly in a dream) that some grad students somewhere were working on a Java API, but it's not in the trunk at the moment, nor is there a PR, nor is there a bullet in the roadmap.
Hope the above provides some insight.
Appendix
The most trivial example for Stack Overflow (you can run this interactively in the Mahout Spark shell by typing $MAHOUT_HOME/bin/mahout spark-shell, assuming SPARK_HOME, JAVA_HOME and MAHOUT_HOME are set):
val inputRDD = sc.parallelize(Array( ("u1", "purchase", "iphone"),
("u1","purchase","ipad"),
("u2","purchase","nexus"),
("u2","purchase","galaxy"),
("u3","purchase","surface"),
("u4","purchase","iphone"),
("u4","purchase","galaxy"),
("u1","category-browse","phones"),
("u1","category-browse","electronics"),
("u1","category-browse","service"),
("u2","category-browse","accessories"),
("u2","category-browse","tablets"),
("u3","category-browse","accessories"),
("u3","category-browse","service"),
("u4","category-browse","phones"),
("u4","category-browse","tablets")) )
import org.apache.mahout.math.indexeddataset.{IndexedDataset, BiDictionary}
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
val purchasesIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "purchase").map(o => (o._1, o._3)))(sc)
val browseIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "category-browse").map(o => (o._1, o._3)))(sc)
import org.apache.mahout.math.cf.SimilarityAnalysis
val llrDrmList = SimilarityAnalysis.cooccurrencesIDSs(Array(purchasesIDS, browseIDS),
randomSeed = 1234,
maxInterestingItemsPerThing = 3,
maxNumInteractions = 4)
val llrAtA = llrDrmList(0).matrix.collect
IndexedDatasetSpark.apply(...) requires an RDD[(String, String)] where the first string is the 'row' (e.g. users) and the second string is the 'behavior'; so for the 'buy matrix' the columns would be 'products', but this could also be a 'gender' matrix with two columns (male/female).
Then you pass an array of IndexedDatasets to SimilarityAnalysis.cooccurrencesIDSs(...).
I'm trying to use findAllPaths to find shortest paths between two random nodes, but the program just gets stuck with no output (it works fine when I run the unit test with a toy DB, which is relatively small). The CPU usage is quite low, so I do not think it is doing the computation while stuck.
So I tried to use the ExecutionEngine instead, and that works perfectly fine and returns the result in several microseconds. I'm just curious why the algorithm API does not work with a big database (8 million nodes and 300 million directed edges). Or did I make some mistake? I basically follow the tutorial on their website http://docs.neo4j.org/chunked/stable/tutorials-java-embedded-graph-algo.html; all I change is that I do not specify any particular relationship type in my search (see the sketch after the code below).
By the way, my Neo4j version is 1.9.4.
Here is the code :
while(startNode==null){
startNode = nodeFinder.getSingleNodeByIndex("wikipage", "id", random.nextInt(MAX_TUPLE));
}
while(stopNode==null){
stopNode = nodeFinder.getSingleNodeByIndex("wikipage", "id", random.nextInt(MAX_TUPLE));
}
This part works
ExecutionEngine engine = new ExecutionEngine(GraphDataBase.getInstance().getGraphDB());
ExecutionResult result = engine.execute("start source=node:wikipage(id=\""+startNode.getId()+"\"), dest=node:wikipage(id=\""+stopNode.getId()+"\") match p=allShortestPaths(source-[r:WIKI_LINK*..8]->dest) return nodes(p);");
But this does not
PathFinder<Path> finder = GraphAlgoFactory.shortestPath(Traversal.expanderForAllTypes(Direction.OUTGOING), 8 );
Iterable<Path> paths = finder.findAllPaths( startNode, stopNode );
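For reference, here is roughly what I think the type-restricted variant would look like, mirroring the WIKI_LINK restriction used in the Cypher query above (a sketch only, using the Neo4j 1.9 API as I understand it):
import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Path;
import org.neo4j.kernel.Traversal;
// Expand only outgoing WIKI_LINK relationships (as the Cypher query does),
// instead of expanding every relationship type.
PathFinder<Path> typedFinder = GraphAlgoFactory.shortestPath(
        Traversal.expanderForTypes(DynamicRelationshipType.withName("WIKI_LINK"), Direction.OUTGOING),
        8);
Iterable<Path> typedPaths = typedFinder.findAllPaths(startNode, stopNode);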
I downloaded Calabash XML a couple of days back and got it working easily enough from the command prompt. When I then tried to run it from Java code, I noticed there was no API (e.g. the Calabash main method is massive, with calls to everywhere). Getting it working was very messy, as I had to copy huge chunks from the main method into a wrapper class and divert System.out to a byte array output stream (and eventually into a String), i.e.:
...
ByteArrayOutputStream baos = new ByteArrayOutputStream (); // declare at top
...
WritableDocument wd = null;
if (uri != null) {
URI furi = new URI(uri);
String filename = furi.getPath();
FileOutputStream outfile = new FileOutputStream(filename);
wd = new WritableDocument(runtime,filename,serial,outfile);
} else {
wd = new WritableDocument(runtime,uri,serial, baos); // new "baos" parameter
}
The performance seems really, really slow, e.g. I ran a simple filter 1000 times ...
<p:filter>
<p:with-option name="select" select="'/result/meta-data/neighbors/document/title'" />
</p:filter>
On average each run took 17 ms, which doesn't seem like much, but my Spring REST controller with calls to MongoDB, encryption calls, etc. takes on average 3/4 ms.
Has anyone encountered this when running Calabash from code? Is there something I can do to speed things up?
For example, this is being called each time:
XProcRuntime runtime = new XProcRuntime(config);
Can this be created once and reused? Any help is appreciated, as I don't want to have to pay money to use Calamet but really want to get XProc working from code with acceptable performance.
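For illustration, this is the kind of reuse I have in mind: build the XProcRuntime once and keep it around rather than constructing it per request. This is only a sketch; XProcConfiguration is assumed to be the config object passed to new XProcRuntime(config) above, and the body of run(...) stands in for the code copied out of Calabash's main method:
import com.xmlcalabash.core.XProcConfiguration;
import com.xmlcalabash.core.XProcRuntime;
public class CalabashWrapper {
    private final XProcRuntime runtime;   // created once, reused for every run
    public CalabashWrapper(XProcConfiguration config) {
        this.runtime = new XProcRuntime(config);
    }
    public String run(String pipelineUri, String inputXml) throws Exception {
        // The load/run/serialize chunks copied from Calabash's main method go here,
        // writing into a ByteArrayOutputStream as in the snippet above.
        return null; // placeholder
    }
}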
For examples on how you could integrate XMLCalabash in a framework, I can mention Servlex by Florent Georges. You'd have to browse the code to find the relevant bit, but last time I looked it shouldn't be too hard to find:
http://servlex.net/
XMLCalabash wasn't built for speed, unfortunately. I am sure that if you run a profiler and can find some hotspots, Norm Walsh would be interested to hear about them.
An alternative is to look into QuiXProc, which is derived from XMLCalabash:
https://code.google.com/p/quixproc/
I am also very sure that if you can send Norm a patch to improve the main class for better integration, he'd be interested to hear about it. In fact, the code should be on GitHub; just fork it, fix it, and do a pull request.
HTH!
I am rolling my own simple web-based perfmon, and I am not happy with some of the data I can get, like CPU usage, which I get via a SQL query. I am able to get memory usage just fine. I will attach a screenshot so you can see what I currently have for my main/home/dashboard page.
I am currently using webcharts3d, and I am loving being able to use AJAX to update the chart, so I have a dynamically updating dashboard. Of course I have to collect only a few performance counters so that, in my desire to have a web-based performance dashboard, I do not kill the server.
DECLARE @CPU_BUSY int, @IDLE int
SELECT @CPU_BUSY = @@CPU_BUSY, @IDLE = @@IDLE WAITFOR DELAY '000:00:01'
SELECT (@@CPU_BUSY - @CPU_BUSY)/((@@IDLE - @IDLE + @@CPU_BUSY - @CPU_BUSY) * 1.00) * 100 AS 'CPU'
And all I get in the results is 0.0000, so either the query is wrong or I have very little CPU activity going on, whereas Windows Task Manager shows more activity.
Here is the code I am using for gathering memory; I do not claim credit for any of this code, I found it somewhere.
<cfscript>
jRuntime = CreateObject("java","java.lang.Runtime").getRuntime();
memory = StructNew();
memory.freeAllocated = jRuntime.freeMemory() / 1024^2;
memory.allocated = jRuntime.totalMemory() / 1024^2;
memory.used = memory.allocated - memory.freeAllocated;
memory.percentUsedAllo = (memory.used / memory.allocated) * 100;
</cfscript>
SysAdmin http://a.imageshack.us/img826/2575/sysadminscreenshot.png
So I am looking for more WMI or Java approaches, or scripts, to get CPU usage and perhaps any other important server stats.
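For reference, the Java-level counterpart of the memory snippet above would be the JVM's OperatingSystemMXBean, which CFML can reach via CreateObject in the same way. A minimal sketch, assuming a Java 7+ HotSpot/Oracle JVM where the com.sun.management extension exposes getSystemCpuLoad() and getProcessCpuLoad():
import java.lang.management.ManagementFactory;
// Cast the platform bean to the com.sun.management extension to get CPU load figures.
com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
double systemCpu = os.getSystemCpuLoad() * 100;   // whole-machine CPU usage in percent (negative if unavailable)
double processCpu = os.getProcessCpuLoad() * 100; // CPU used by this JVM (i.e. the ColdFusion server)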
How about using the ColdFusion built-in function GetMetricData? It can help you monitor your server performance like the ColdFusion Administrator does. I've done it with a cfchart bar chart. If you want to integrate it with Web3Dcharts, you can.
http://ppshein.wordpress.com/2010/08/04/getmetricdata-for-server-monitor/
<cfset pmData = GetMetricData("PERF_MONITOR")>
<cfchart chartheight="500" chartwidth="700" format="PNG" showlegend="yes">
<cfchartseries type="bar" seriescolor="##639526" paintstyle="light" colorlist="##ff8080,##ffff80,##80ff80,##0080ff,##ff80c0,##ff80ff,##ff8040,##008000,##0080c0,##808000">
<cfchartdata item="Page Hits" value="#pmData.PageHits#">
<cfchartdata item="Request Queued" value="#pmData.ReqQueued#">
<cfchartdata item="Database Hits" value="#pmData.DBHits#">
<cfchartdata item="Request Running" value="#pmData.ReqRunning#">
<cfchartdata item="Request TimedOut" value="#pmData.ReqTimedOut#">
<cfchartdata item="Bytes In" value="#pmData.BytesIn#">
<cfchartdata item="Bytes Out" value="#pmData.BytesOut#">
<cfchartdata item="Avg Queue Time" value="#pmData.AvgQueueTime#">
<cfchartdata item="Avg Request Time" value="#pmData.AvgReqTime#">
<cfchartdata item="Avg Database Time" value="#pmData.AvgDBTime#">
</cfchartseries>
</cfchart>
Another solution:
Using the Reliability and Performance Monitor (i.e. perfmon), create a counter for CPU (Total) - it should be in the long list of Windows counters.
You can save this data to file or to a database. If you save it to a database you can then use CF to query that data and get pretty accurate performance info. You can of course display this on a graph over time which is a massive benefit in my opinion.
When you have that done you can then enable performance monitoring in CF admin, and you will then have CF performance metrics available to pick up in perfmon.
We have successfully implemented this solution across a CF cluster of 10+ machines, and it gives an excellent idea of server performance at a given point in time and historically.
CfTracker probably has the code you need, and since it uses the Apache License you can simply grab any relevant stuff from it, once you attribute appropriately.
It would be even better if you could go a step further and talk to Dave Boyet about combining your two tools - or at least collaborating on the common bits.
To more directly answer your question, here's a blog article explaining how to use WMI from ColdFusion.