H2O : NullPointerException error while building ensemble model using deep learning grid - java

I am trying to build a stacked ensemble model to predict merchant churn using R (version 3.3.3) and deep learning in H2O (version 3.10.5.1). The response variable is binary. At the moment I am trying to run the code that builds a stacked ensemble model from the top 5 models produced by the grid search. However, when the code is run, I get a java.lang.NullPointerException error with the following output:
java.lang.NullPointerException
at hex.StackedEnsembleModel.checkAndInheritModelProperties(StackedEnsembleModel.java:265)
at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(StackedEnsemble.java:115)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Below is the code that I've used to do the hyper-parameter grid search and build the ensemble model:
hyper_params <- list(
  activation = c("Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"),
  hidden = list(c(50,50), c(30,30,30), c(32,32,32,32,32), c(64,64,64,64,64), c(100,100,100,100,100)),
  input_dropout_ratio = seq(0, 0.2, 0.05),
  l1 = seq(0, 1e-4, 1e-6),
  l2 = seq(0, 1e-4, 1e-6),
  rho = c(0.9, 0.95, 0.99, 0.999),
  epsilon = c(1e-10, 1e-09, 1e-08, 1e-07, 1e-06, 1e-05, 1e-04)
)

search_criteria <- list(
  strategy = "RandomDiscrete",
  max_runtime_secs = 3600,
  max_models = 100,
  seed = 1234,
  stopping_metric = "misclassification",
  stopping_tolerance = 0.01,
  stopping_rounds = 5
)

dl_ensemble_grid <- h2o.grid(
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  algorithm = "deeplearning",
  grid_id = "final_grid_ensemble_dl",
  x = predictors,
  y = response,
  training_frame = h2o.rbind(train, valid, test),
  nfolds = 5,
  fold_assignment = "Modulo",
  keep_cross_validation_predictions = TRUE,
  keep_cross_validation_fold_assignment = TRUE,
  epochs = 12,
  max_runtime_secs = 3600,
  stopping_metric = "misclassification",
  stopping_tolerance = 0.01,
  stopping_rounds = 5,
  seed = 1234,
  max_w2 = 10
)
DLsortedGridEnsemble_logloss <- h2o.getGrid("final_grid_ensemble_dl", sort_by = "logloss", decreasing = FALSE)

ensemble <- h2o.stackedEnsemble(
  x = predictors,
  y = response,
  training_frame = h2o.rbind(train, valid, test),
  base_models = list(
    DLsortedGridEnsemble_logloss@model_ids[[1]],
    DLsortedGridEnsemble_logloss@model_ids[[2]],
    DLsortedGridEnsemble_logloss@model_ids[[3]],
    DLsortedGridEnsemble_logloss@model_ids[[4]],
    DLsortedGridEnsemble_logloss@model_ids[[5]]
  )
)
Note: what I have realised so far is that the h2o.stackedEnsemble() function works when there is only one base model, and it throws the Java error as soon as there are two or more base models.
I would really appreciate some feedback on how this could be resolved.

The error refers to a line of the StackedEnsembleModel.java code that checks that the training_frame in the base models and the training_frame in h2o.stackedEnsemble() have the same checksum. I think the problem is that you created the training frame dynamically, rather than defining it explicitly (even though that should work, since it's the same data in the end). So, rather than setting training_frame = h2o.rbind(train, valid, test) in the grid and ensemble functions, set the following at the top of your code:
df <- h2o.rbind(train, valid, test)
And then set training_frame = df in the grid and ensemble functions.
As a side note, you may get better DL models if you use a validation frame (for early stopping) rather than using all your data for the training frame. Also, if you want to use all the models in your grid (which might lead to better performance, but not always), you can set base_models = DLsortedGridEnsemble_logloss@model_ids in the h2o.stackedEnsemble() function.

Related

How to use MLeap DenseTensor in Java

I am using MLeap to run a PySpark logistic regression model in a Java program. Once I run the pipeline I am able to get a DefaultLeapFrame object with one row, Stream(Row(1.3,12,3.6,DenseTensor([D@538613b3,List(2)),1.0), ?).
But I am not sure how to actually inspect the DenseTensor object. When I use getTensor(3) on this row I get an object. I am not familiar with Scala, but that seems to be how this is meant to be interacted with. In Java, how can I get the values within this DenseVector?
Here is roughly what I am doing. I'm guessing using Object is not right for the type...
DefaultLeapFrame df = leapFrameSupport.select(frame2, Arrays.asList("feat1", "feat2", "feat3", "probability", "prediction"));
Tensor<Object> tensor = df.dataset().head().getTensor(3);
Thanks
So the MLeap documentation for the Java DSL is not great, but I was able to look over some unit tests (link) that pointed me to the right thing to use. In case anyone else is interested, this is what I did.
DefaultLeapFrame df = leapFrameSupport.select(frame, Arrays.asList("feat1", "feat2", "feat3", "probability", "prediction"));
TensorSupport tensorSupport = new TensorSupport();
List<Double> tensor_vals = tensorSupport.toArray(df.dataset().head().getTensor(3));
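In case it helps, here is a quick usage sketch (the tensor index 3 and the probability layout are assumptions carried over from the snippet above, not something confirmed by the thread) showing that the extracted list can then be read as plain doubles:
// tensor_vals is the flattened tensor returned by toArray(), e.g. the class probabilities
for (int i = 0; i < tensor_vals.size(); i++) {
    System.out.println("probability[" + i + "] = " + tensor_vals.get(i));
}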

Apache Mahout not giving any recommendation

I am trying to use Mahout for recommendations but am getting none.
My dataset:
0,102,5.0
1,101,5.0
1,102,5.0
Code:
DataModel datamodel = new FileDataModel(new File("dataset.csv"));
// Creating UserSimilarity object.
UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
// Creating UserNeighbourHHood object.
UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(0.1, usersimilarity, datamodel);
// Create UserRecomender
UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);
List<RecommendedItem> recommendations = recommender.recommend(0, 1);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
I am using Mahout version : 0.13.0
Ideally, it should recommend item_id = 101 to user_id = 0, since user = 0 and user = 1 have item 102 in common.
Logs:
18:08:11.669 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file dataset.csv
18:08:11.700 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
18:08:11.702 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 3
18:08:11.722 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel - Processed 2 users
18:08:11.738 [main] DEBUG org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender - Recommending items for user ID '0'
The Hadoop MapReduce code in Mahout is being deprecated. The new recommender code starts with @rawkintrevo's examples; if you are a Scala programmer, follow them.
Most engineers would like a system that works with no modification. The Mahout algorithm is encapsulated in The Universal Recommender, which is built on top of Apache PredictionIO. It has a server to accept events (like the ones in your example), internal event storage, and a query server for results. There are numerous improvements over the old MapReduce code, including the use of real-time user behavior to make recommendations. Neither the new Mahout nor the old includes servers for input and query; the Universal Recommender has REST endpoints for both.
Given that the code you are using will be deprecated, I strongly suggest that you dive into the Mahout code (@rawkintrevo's example) or look at The Universal Recommender, which is an entire end-to-end system.
Install PredictionIO with a "single machine" setup here, or to really shortcut setup, use our prepackaged AWS AMI here; it includes PIO and The Universal Recommender pre-installed.
Add the UR Template here
A Java SDK for sending events to the recommender here
Once you have this set up, you deal with config, REST or the Java SDK, and the PIO CLI. No Scala coding required.
I have three examples that are based on version 0.13.0 (and Scala, which is required for Samsara, the R-like Scala DSL that Mahout utilizes as of v0.10+).
Walk
The first example is a very slow walk through:
https://gist.github.com/rawkintrevo/3869030ff1a731d43c5e77979a5bf4a8
and is meant as a companion to Pat Ferrel's blog post/slide deck, found here:
http://actionml.com/blog/cco
Crawl
The second example is a little more "real" in that it utilizes SimilarityAnalysis.cooccurrencesIDSs(...), which is the proper interface for the CCO algorithm.
https://gist.github.com/rawkintrevo/c1bb00896263bdc067ddcd8299f4794c
Run
Here we use 'real' data. The MovieLens data set doesn't have enough going on to showcase CCO's multi-modal power (the ability to recommend on multiple user behaviors), so here we load 'real' data and generate recommendations.
https://gist.github.com/rawkintrevo/f87cc89f4d337d7ffea80a6af3bee83e
Conclusion
I know you specifically asked for Java; however, Apache Mahout isn't geared for Java at the moment. In theory you could import the Scala into your Java, or maybe wrap the functions in another, more Java-friendly function... I've heard rumors late at night (or possibly in a dream) that some grad students somewhere were working on a Java API, but it's not in the trunk at the moment, nor is there a PR, nor is there a bullet in the road map.
Hope the above provides some insight.
Appendix
The most trivial example for Stack Overflow (you can run this interactively in the Mahout Spark shell by typing $MAHOUT_HOME/bin/mahout spark-shell, assuming SPARK_HOME, JAVA_HOME and MAHOUT_HOME are set):
val inputRDD = sc.parallelize(Array( ("u1", "purchase", "iphone"),
("u1","purchase","ipad"),
("u2","purchase","nexus"),
("u2","purchase","galaxy"),
("u3","purchase","surface"),
("u4","purchase","iphone"),
("u4","purchase","galaxy"),
("u1","category-browse","phones"),
("u1","category-browse","electronics"),
("u1","category-browse","service"),
("u2","category-browse","accessories"),
("u2","category-browse","tablets"),
("u3","category-browse","accessories"),
("u3","category-browse","service"),
("u4","category-browse","phones"),
("u4","category-browse","tablets")) )
import org.apache.mahout.math.indexeddataset.{IndexedDataset, BiDictionary}
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
val purchasesIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "purchase").map(o => (o._1, o._3)))(sc)
val browseIDS = IndexedDatasetSpark.apply(inputRDD.filter(_._2 == "category-browse").map(o => (o._1, o._3)))(sc)
import org.apache.mahout.math.cf.SimilarityAnalysis
val llrDrmList = SimilarityAnalysis.cooccurrencesIDSs(Array(purchasesIDS, browseIDS),
randomSeed = 1234,
maxInterestingItemsPerThing = 3,
maxNumInteractions = 4)
val llrAtA = llrDrmList(0).matrix.collect
IndexedDatasetSpark.apply(...) requires an RDD[(String, String)] where the first string is the 'row' (e.g. users) and the second string is the 'behavior'. So for the 'buy' matrix the columns would be 'products', but this could also be, for example, a 'gender' matrix with two columns (male/female).
You then pass an array of IndexedDatasets to SimilarityAnalysis.cooccurrencesIDSs(...).

Getting All Workitems from Team Area

I have the following objects:
ITeamRepository repo;
IProjectArea projArea;
ITeamArea teamArea;
The process of obtaining the projArea and the teamArea is quite straightforward (despite the quantity of objects involved). However, I can't seem to find a way to obtain a list of all the work items associated with these objects directly. Is this possible, perhaps via the IQueryClient objects?
This 2012 thread (so it might have changed since) suggests:
I used the following code to get the work items associated with each project area:
auditableClient = (IAuditableClient) repository.getClientLibrary(IAuditableClient.class);
IQueryClient queryClient = (IQueryClient) repository.getClientLibrary(IQueryClient.class);
IQueryableAttribute attribute = QueryableAttributes.getFactory(IWorkItem.ITEM_TYPE).findAttribute(currProject, IWorkItem.PROJECT_AREA_PROPERTY, auditableClient, null);
Expression expression = new AttributeExpression(attribute, AttributeOperation.EQUALS, currProject);
IQueryResult<IResolvedResult<IWorkItem>> results = queryClient.getResolvedExpressionResults(currProject, expression, IWorkItem.FULL_PROFILE);
In my code, currProject would be the IProjectArea pointer to the current project as you loop through the List of project areas p in your code.
The IQueryResult object 'results' then contains a list of IResolvedResult records with all of the work items for that project, which you can iterate through to read the properties of each work item.
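For completeness, a minimal iteration sketch over those results; the NullProgressMonitor and the getId()/getHTMLSummary() calls are assumptions based on the standard RTC plain Java client API, not part of the original answer:
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.NullProgressMonitor;
import com.ibm.team.workitem.common.model.IWorkItem;

IProgressMonitor monitor = new NullProgressMonitor();
while (results.hasNext(monitor)) {
    // each IResolvedResult wraps one fully resolved work item
    IWorkItem workItem = results.next(monitor).getItem();
    System.out.println(workItem.getId() + " - " + workItem.getHTMLSummary().getPlainText());
}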

Neo4j's Java Algorithm binding does not work with big dataset but Cypher does

I'm trying to use findAllPaths to find the shortest paths between two random nodes, but the program just gets stuck with no output (it works fine when I run the unit test against a toy database, which is relatively small). The CPU usage is quite low, so I don't think it is doing the computation while it is stuck.
So I tried to use the ExecutionEngine instead, and that works perfectly fine and returns the result in several microseconds. I am just curious why the graph algorithm API does not work with a big database (8 million nodes and 300 million directed edges). Or have I made some mistake? I closely followed the tutorial on their website http://docs.neo4j.org/chunked/stable/tutorials-java-embedded-graph-algo.html; all I changed is that I do not specify any particular relationship type in my search.
By the way, my Neo4j version is 1.9.4.
Here is the code:
while(startNode==null){
startNode = nodeFinder.getSingleNodeByIndex("wikipage", "id", random.nextInt(MAX_TUPLE));
}
while(stopNode==null){
stopNode = nodeFinder.getSingleNodeByIndex("wikipage", "id", random.nextInt(MAX_TUPLE));
}
This part works:
ExecutionEngine engine = new ExecutionEngine(GraphDataBase.getInstance().getGraphDB());
ExecutionResult result = engine.execute("start source=node:wikipage(id=\""+startNode.getId()+"\"), dest=node:wikipage(id=\""+stopNode.getId()+"\") match p=allShortestPaths(source-[r:WIKI_LINK*..8]->dest) return nodes(p);");
But this does not:
PathFinder<Path> finder = GraphAlgoFactory.shortestPath(Traversal.expanderForAllTypes(Direction.OUTGOING), 8 );
Iterable<Path> paths = finder.findAllPaths( startNode, stopNode );
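For comparison, here is a sketch (not from the original post) of constraining the Java path finder to the same WIKI_LINK relationship type and OUTGOING direction that the working Cypher query uses, written against the Neo4j 1.9 API:
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Path;
import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.kernel.Traversal;

// mirror the -[r:WIKI_LINK*..8]-> pattern from the Cypher query above
PathFinder<Path> typedFinder = GraphAlgoFactory.shortestPath(
        Traversal.expanderForTypes(DynamicRelationshipType.withName("WIKI_LINK"), Direction.OUTGOING),
        8);
Iterable<Path> typedPaths = typedFinder.findAllPaths(startNode, stopNode);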

opencv face-detection in java : conception steps

I'm working on a face-detection project via webcam using OpenCV.
This approach (Viola-Jones) to detecting objects in images combines four key concepts:
1. Simple rectangular features called Haar features (I can find these in the haarcascade_frontalface_alt.xml file).
2. An integral image for rapid feature detection.
3. The AdaBoost machine-learning method.
4. A cascaded classifier to combine many features efficiently.
My questions are:
- Does haarcascade_frontalface_alt.xml contain the cascaded classifier as well as the Haar features?
- How can I add the integral image and AdaBoost to my project, and how do I use them? Or is that already done automatically?
It seems you've read a lot of papers and pondered the ideas, but have not found the OpenCV implementation ;)
Using it is actually quite easy:
// setup a cascade classifier:
CascadeClassifier cascade;
// load a pretrained cascadefile(and PLEASE CHECK!):
bool ok = cascade.load("haarcascade_frontalface_alt.xml");
if ( ! ok )
{
...
}
// later, search for stuff in your img:
Mat gray; // uchar grayscale!
vector<Rect> faces; // the result vec
cascade.detectMultiScale( gray, faces, 1.1, 3,
CV_HAAR_FIND_BIGGEST_OBJECT | CV_HAAR_DO_ROUGH_SEARCH ,
cv::Size(20, 20) );
for ( size_t i=0; i<faces.size(); i++ )
{
// gray( faces[i] ); is the img portion that contains the detected object
}
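Since the question asks about Java, here is a rough equivalent sketch using the OpenCV Java bindings (3.x package names; the image path "input.jpg" and the cascade file location are placeholder assumptions):
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.core.Rect;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.objdetect.CascadeClassifier;

public class FaceDetect {
    public static void main(String[] args) {
        // load the native OpenCV library before touching any OpenCV class
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // the XML file bundles the trained cascade (Haar features + boosted stages)
        CascadeClassifier cascade = new CascadeClassifier("haarcascade_frontalface_alt.xml");
        if (cascade.empty()) {
            throw new IllegalStateException("could not load cascade file");
        }

        // single-channel grayscale input, as in the C++ version above
        Mat gray = Imgcodecs.imread("input.jpg", Imgcodecs.IMREAD_GRAYSCALE);

        MatOfRect faces = new MatOfRect();
        cascade.detectMultiScale(gray, faces);

        for (Rect face : faces.toArray()) {
            // face is the image region that contains a detected object
            System.out.println("face at " + face.x + "," + face.y
                    + " size " + face.width + "x" + face.height);
        }
    }
}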
