Pyspark Create Model with Coefficient & Intercept - java

I am wondering if it is possible to construct a model (linear regression / logistic regression) from only a coefficient vector and an intercept. For scikit-learn this worked smoothly -- I could just set those attributes on the model, and predict worked.
For pyspark I am having more trouble: I couldn't set those variables, which live on the Scala side. Since the pyspark model takes a java_model as a parameter, I am trying to create a java_model with pyspark/py4j and use it to create a pyspark model.
Here is what I am trying to do as a test.
from pyspark import SparkContext, SQLContext
from pyspark.mllib.linalg import DenseVector
sc = SparkContext.getOrCreate()
sql_ctx = SQLContext(sc)
vect = DenseVector([1.0, 2.0])
test = sc._jvm.org.apache.spark.ml.regression.LinearRegressionModel(vect, 1.0)
But then I get this error:
AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'
It seems vect stores its data as self.array, a numpy ndarray, which py4j cannot convert to a Java DenseVector. Has anyone tried a similar attempt?
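For what it's worth, here is a minimal, untested Java-side sketch of the object the py4j call ultimately has to construct. It assumes the Scala package-private constructor is reachable from Java bytecode (Scala's private[ml] compiles to public, which is presumably why py4j can see it at all), and the "lr-uid" string is just an arbitrary uid. Note that the constructor wants an org.apache.spark.ml.linalg.Vector, while the Python code imports pyspark.mllib.linalg.DenseVector, which may be the root of the conversion error:

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.regression.LinearRegressionModel;

public class BuildModel {
    public static void main(String[] args) {
        // Build the JVM-side vector type the constructor expects
        Vector coefficients = Vectors.dense(1.0, 2.0);
        double intercept = 1.0;
        // Package-private in Scala, but visible in bytecode (assumption; untested)
        LinearRegressionModel model = new LinearRegressionModel("lr-uid", coefficients, intercept);
        System.out.println(model.coefficients() + " intercept: " + model.intercept());
    }
}

If that holds, constructing the vector on the JVM side from pyspark (e.g. through sc._jvm.org.apache.spark.ml.linalg.Vectors.dense) might sidestep the py4j conversion entirely.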

Related

How to use MLeap DenseTensor in Java

I am using MLeap to run a PySpark logistic regression model in a Java program. Once I run the pipeline I am able to get a DefaultLeapFrame object with one row, Stream(Row(1.3,12,3.6,DenseTensor([D@538613b3,List(2)),1.0), ?).
But I am not sure how to actually inspect the DenseTensor object. When I use getTensor(3) on this row I get an Object. I am not familiar with Scala, but that seems to be how this is meant to be interacted with. In Java, how can I get the values within this DenseTensor?
Here is roughly what I am doing. I'm guessing Object is not the right type...
DefaultLeapFrame df = leapFrameSupport.select(frame2, Arrays.asList("feat1", "feat2", "feat3", "probability", "prediction"));
Tensor<Object> tensor = df.dataset().head().getTensor(3);
Thanks
The MLeap documentation for the Java DSL is not great, but I was able to look over some unit tests (link) that pointed me to the right thing to use. In case anyone else is interested, this is what I did:
DefaultLeapFrame df = leapFrameSupport.select(frame, Arrays.asList("feat1", "feat2", "feat3", "probability", "prediction"));
TensorSupport tensorSupport = new TensorSupport();
List<Double> tensor_vals = tensorSupport.toArray(df.dataset().head().getTensor(3));
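From Java the extracted values can then be used like any other List<Double>, for example:

// print the probability values pulled out of the tensor
for (Double value : tensor_vals) {
    System.out.println(value);
}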

Optimize Random Forest parameters in weka?

I am trying to optimize random forest parameters using Weka; the Java class is as follows:
package pkg10foldcrossvalidation;
import weka.core.*;
import weka.classifiers.meta.*;
import weka.classifiers.trees.RandomForest;
import java.io.*;
public class RF_Optimizer {
    public static void main(String[] args) throws Exception {
        // load data
        BufferedReader reader = new BufferedReader(new FileReader("C:\\Prediction Results on the testing set\\Dataset.arff"));
        Instances data = new Instances(reader);
        reader.close();
        data.setClassIndex(data.numAttributes() - 1);
        // set up the classifier
        CVParameterSelection ps = new CVParameterSelection();
        ps.setClassifier(new RandomForest());
        ps.setNumFolds(10); // using 10-fold CV
        ps.addCVParameter("C 0.1 0.5 5");
        // build and output best options
        ps.buildClassifier(data);
        System.out.println(Utils.joinOptions(ps.getBestClassifierOptions()));
    }
}
But I am having difficulty understanding which parameter should replace the "C", and how the range for each one can be determined. And is it possible to call .addCVParameter several times for several parameters at once?
I tried to search for YouTube or website tutorials that explain how to change random forest parameters in Java, but found nothing.
Thank you
I think the "C" you are describing is the placeholder from the CVParameterSelection example, not a RandomForest parameter.
Can't you just use the Explorer GUI? Open a sample dataset such as glass.arff, right-click on the bold RandomForest string at the top of the window, choose "copy configuration to clipboard" from the context menu, and paste that string into your Java code.
Doing this just now, I copied this string to the clipboard:
weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
These are the default parameters for Weka's RandomForest learner. What each of these parameters means, which of them is most suitable for optimization, and which range of values to use, I really can't tell. Most likely a very important parameter is numIterations, the -I parameter. Maybe vary it from 100, 200, ... up to 1000, plot numIterations vs. accuracy, and check whether the curve has already smoothed out.
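To make that concrete, here is an untested sketch of what the grid search over -I could look like. The parameter string follows the "<option> <min> <max> <steps>" scheme of the "C 0.1 0.5 5" example; the -depth option and the repeated-call behavior are assumptions worth checking against your Weka version's Javadoc:

// tune RandomForest's number of iterations (-I) over 100..1000 in 10 steps
CVParameterSelection ps = new CVParameterSelection();
ps.setClassifier(new RandomForest());
ps.setNumFolds(10);
ps.addCVParameter("I 100 1000 10");
// repeated calls should add further options to the search, e.g. maximum
// tree depth (-depth); note each extra parameter multiplies the grid size
ps.addCVParameter("depth 0 20 5");
ps.buildClassifier(data); // 'data' loaded as in the question
System.out.println(Utils.joinOptions(ps.getBestClassifierOptions()));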

H2O : NullPointerException error while building ensemble model using deep learning grid

I am trying to build a stacked ensemble model to predict merchant churn using R (version 3.3.3) and deep learning in h2o (version 3.10.5.1). The response variable is binary. At the moment I am trying to run the code to build a stacked ensemble model using the top 5 models developed by the grid search. However, when the code is run, I get a java.lang.NullPointerException with the following output:
java.lang.NullPointerException
at hex.StackedEnsembleModel.checkAndInheritModelProperties(StackedEnsembleModel.java:265)
at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(StackedEnsemble.java:115)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Below is the code that I've used to do the hyper-parameter grid search and build the ensemble model:
hyper_params <- list(
activation=c("Rectifier","Tanh","Maxout","RectifierWithDropout","TanhWithDropout","MaxoutWithDropout"),
hidden=list(c(50,50),c(30,30,30),c(32,32,32,32,32),c(64,64,64,64,64),c(100,100,100,100,100)),
input_dropout_ratio=seq(0,0.2,0.05),
l1=seq(0,1e-4,1e-6),
l2=seq(0,1e-4,1e-6),
rho = c(0.9,0.95,0.99,0.999),
epsilon=c(1e-10,1e-09,1e-08,1e-07,1e-06,1e-05,1e-04)
)
search_criteria <- list(
strategy = "RandomDiscrete",
max_runtime_secs = 3600,
max_models = 100,
seed=1234,
stopping_metric="misclassification",
stopping_tolerance=0.01,
stopping_rounds=5
)
dl_ensemble_grid <- h2o.grid(
hyper_params = hyper_params,
search_criteria = search_criteria,
algorithm="deeplearning",
grid_id = "final_grid_ensemble_dl",
x=predictors,
y=response,
training_frame = h2o.rbind(train, valid, test),
nfolds=5,
fold_assignment="Modulo",
keep_cross_validation_predictions = TRUE,
keep_cross_validation_fold_assignment = TRUE,
epochs=12,
max_runtime_secs = 3600,
stopping_metric="misclassification",
stopping_tolerance=0.01,
stopping_rounds=5,
seed = 1234,
max_w2=10
)
DLsortedGridEnsemble_logloss <- h2o.getGrid("final_grid_ensemble_dl",sort_by="logloss",decreasing=FALSE)
ensemble <- h2o.stackedEnsemble(x = predictors,
                                y = response,
                                training_frame = h2o.rbind(train, valid, test),
                                base_models = list(
                                  DLsortedGridEnsemble_logloss@model_ids[[1]],
                                  DLsortedGridEnsemble_logloss@model_ids[[2]],
                                  DLsortedGridEnsemble_logloss@model_ids[[3]],
                                  DLsortedGridEnsemble_logloss@model_ids[[4]],
                                  DLsortedGridEnsemble_logloss@model_ids[[5]]
                                ))
Note: what I have realised so far is that the h2o.stackedEnsemble function works when there is only one base model, and it gives the Java error as soon as there are two or more base models.
I would really appreciate some feedback on how this could be resolved.
The error refers to a line of the StackedEnsembleModel.java code that checks that the training_frame in the base models and the training_frame in h2o.stackedEnsemble() have the same checksum. I think the problem is caused because you dynamically created the training frame, rather than defining it explicitly (even though that should work since it's the same data in the end). So, rather than setting training_frame = h2o.rbind(train, valid, test) in the grid and ensemble functions, set the following at the top of your code:
df <- h2o.rbind(train, valid, test)
And then set training_frame = df in the grid and ensemble functions.
As a side note, you may get better DL models if you use a validation frame (for early stopping), rather than using all your data for the training frame. Also, if you want to use all the models in your grid (which might lead to better performance, but not always), you can set base_models = DLsortedGridEnsemble_logloss@model_ids in the h2o.stackedEnsemble() function.

Apache Mahout not giving any recommendation

I am trying to use Mahout for recommendations but am getting none.
My dataset:
0,102,5.0
1,101,5.0
1,102,5.0
Code:
DataModel datamodel = new FileDataModel(new File("dataset.csv"));
// Create the UserSimilarity object.
UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
// Create the UserNeighborhood object.
UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(0.1, usersimilarity, datamodel);
// Create the UserBasedRecommender.
UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);
List<RecommendedItem> recommendations = recommender.recommend(0, 1);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
I am using Mahout version 0.13.0.
Ideally, since user = 0 and user = 1 have item 102 in common, it should recommend item_id = 101 to user_id = 0.
Logs:
18:08:11.669 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file dataset.csv
18:08:11.700 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
18:08:11.702 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 3
18:08:11.722 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel - Processed 2 users
18:08:11.738 [main] DEBUG org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender - Recommending items for user ID '0'
The Hadoop MapReduce code in Mahout is being deprecated. The new recommender code starts with #rawkintrevo's examples; if you are a Scala programmer, follow them.
Most engineers would like a system that works with no modification. The Mahout algorithm is encapsulated in The Universal Recommender, built on top of Apache PredictionIO. It has a server to accept events (like the ones in your example), internal event storage, and a query server for results. There are numerous improvements over the old MapReduce code, including using real-time user behavior to make recommendations. Neither the new Mahout nor the old one included servers for input and query; the Universal Recommender has REST endpoints for both.
Given that the code you are using will be deprecated, I strongly suggest that you dive into the Mahout code (#rawkintrevo's examples) or look at The Universal Recommender, which is an entire end-to-end system.
Install PredictionIO with a "single machine" setup here, or to really shortcut setup use our prepackaged AWS AMI here, which includes PIO and The Universal Recommender pre-installed.
Add the UR Template here.
A Java SDK for sending events to the recommender is here (see the sketch below).
Once you have this setup, you deal with config, REST or the Java SDK, and the PIO CLI. No Scala coding required.
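For a rough idea of what sending events through the Java SDK looks like, here is an unverified sketch; the package (io.prediction), class names, and builder methods are recalled from the SDK's README, so check them against the SDK version you actually install:

// unverified sketch based on the PredictionIO Java SDK README;
// verify package, class, and method names before use
import io.prediction.Event;
import io.prediction.EventClient;

public class SendRatings {
    public static void main(String[] args) throws Exception {
        // access key and endpoint are placeholders for your PIO event server
        EventClient client = new EventClient("YOUR_ACCESS_KEY", "http://localhost:7070");
        // one event per (user, item, rating) row of the dataset in the question
        Event event = new Event()
                .event("rate")
                .entityType("user").entityId("0")
                .targetEntityType("item").targetEntityId("102");
        client.createEvent(event);
        client.close();
    }
}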
I have three examples that are based on version 0.13.0 (and Scala, which is required for Samsara, the R-like Scala DSL that Mahout has utilized since v0.10).
Walk
The first example is a very slow walk through:
https://gist.github.com/rawkintrevo/3869030ff1a731d43c5e77979a5bf4a8
and is meant as a companion to Pat Ferrel's blog post/slide deck found here.
http://actionml.com/blog/cco
Crawl
The second example is a little more "real" in that it utilizes SimilarityAnalysis.cooccurrencesIDSs(...), which is the proper interface for the CCO algorithm.
https://gist.github.com/rawkintrevo/c1bb00896263bdc067ddcd8299f4794c
Run
Here we use 'real' data. The MovieLens data set doesn't have enough going on to showcase CCO's multi-modal power (the ability to recommend on multiple user behaviors). Here we load 'real' data and generate recommendations.
https://gist.github.com/rawkintrevo/f87cc89f4d337d7ffea80a6af3bee83e
Conclusion
I know you specifically asked for Java; however, Apache Mahout isn't geared toward Java at the moment. In theory you could import the Scala into your Java, or maybe wrap the functions in a more Java-friendly wrapper... I've heard rumors late at night (or possibly in a dream) that some grad students somewhere were working on a Java API, but it's not in trunk at the moment, nor is there a PR, nor is there a bullet in the roadmap.
Hope the above provides some insight.
Appendix
The most trivial example for Stack Overflow (you can run this interactively in the Mahout Spark shell by typing $MAHOUT_HOME/bin/mahout spark-shell, assuming SPARK_HOME, JAVA_HOME, and MAHOUT_HOME are set):
val inputRDD = sc.parallelize(Array( ("u1", "purchase", "iphone"),
("u1","purchase","ipad"),
("u2","purchase","nexus"),
("u2","purchase","galaxy"),
("u3","purchase","surface"),
("u4","purchase","iphone"),
("u4","purchase","galaxy"),
("u1","category-browse","phones"),
("u1","category-browse","electronics"),
("u1","category-browse","service"),
("u2","category-browse","accessories"),
("u2","category-browse","tablets"),
("u3","category-browse","accessories"),
("u3","category-browse","service"),
("u4","category-browse","phones"),
("u4","category-browse","tablets")) )
import org.apache.mahout.math.indexeddataset.{IndexedDataset, BiDictionary}
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
val purchasesIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "purchase").map(o => (o._1, o._3)))(sc)
val browseIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "category-browse").map(o => (o._1, o._3)))(sc)
import org.apache.mahout.math.cf.SimilarityAnalysis
val llrDrmList = SimilarityAnalysis.cooccurrencesIDSs(Array(purchasesIDS, browseIDS),
randomSeed = 1234,
maxInterestingItemsPerThing = 3,
maxNumInteractions = 4)
val llrAtA = llrDrmList(0).matrix.collect
IndexedDatasetSpark.apply() requires an RDD[(String, String)] where the first string is the 'row' (e.g. users) and the second string is the 'behavior'. So for the 'buy matrix' the columns would be 'products', but this could also be a 'gender' matrix with two columns (male/female).
You then pass an array of IndexedDatasets to SimilarityAnalysis.cooccurrencesIDSs().

Constructor error while creating an empty dataset in weka

I am trying to classify an instance using the classifyInstance method (described in Weka's documentation here) from the Matlab environment.
This method requires the instance to be linked to a dataset. I am trying to use this constructor to create an empty dataset with the following Matlab code:
import java.util.ArrayList.*;
import weka.core.*;
import weka.core.Instances.*;
attInfo = java.util.ArrayList;
attInfo.add(weka.core.Attribute('att1'));
attInfo.add(weka.core.Attribute('att2'));
attInfo.add(weka.core.Attribute('att3'));
dataset= weka.core.Instances(java.lang.String('relation'), attInfo, 2);
When I try to run this code, Matlab returns the following error:
No constructor 'weka.core.Instances' with matching signature found.
Error in file_name (line 109) dataset =
weka.core.Instances(java.lang.String('relation'), attInfo, 5);
What is wrong with the parameters of my constructor?
I ended up finding the solution to the problem. The constructor accepts a signature that uses the deprecated class FastVector. I have added a snapshot of my code below in case it helps someone.
attInfo = FastVector();
attInfo.addElement(weka.core.Attribute('att1'));
attInfo.addElement(weka.core.Attribute('att2'));
attInfo.addElement(weka.core.Attribute('att3'));
% build the class attribute
classValues = FastVector();
classValues.addElement(java.lang.String('0'));
classValues.addElement(java.lang.String('1'));
attInfo.addElement(Attribute('Class', classValues));
% create the dataset and define the class attribute
dataset = Instances('relation', attInfo, 1);
dataset.setClassIndex(dataset.numAttributes() -1);
% build the instance
Inst = weka.core.Instance(dataset.numAttributes()); % one value slot per attribute
for ii = 1:dataset.numAttributes()
    Inst.setValue(dataset.attribute(ii-1), 1)
end
Inst.setDataset(dataset)
% classify the instance
classifier.classifyInstance(Inst)
Using Java objects such as java.lang.String() also led to an error.
I am still curious why this happens, but I suspect it is because of the Weka version I am using (3.6.11), while the documentation might be for version 3.7.12.
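For comparison, here is a sketch of the Java-side constructor in Weka 3.7+, which takes a java.util.ArrayList<Attribute>; in 3.6.x only the FastVector signature exists, which would explain the "no matching signature" error:

// sketch of the ArrayList-based constructor available in Weka >= 3.7
// (in 3.6.x only the FastVector signature exists)
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.Instances;

public class EmptyDataset {
    public static void main(String[] args) {
        ArrayList<Attribute> attInfo = new ArrayList<Attribute>();
        attInfo.add(new Attribute("att1"));
        attInfo.add(new Attribute("att2"));
        attInfo.add(new Attribute("att3"));
        // relation name, attribute info, initial capacity
        Instances dataset = new Instances("relation", attInfo, 2);
        dataset.setClassIndex(dataset.numAttributes() - 1);
        System.out.println(dataset);
    }
}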
