Saving an H2O model directly from Java

I'm trying to create and save a generated model directly from Java. The documentation explains how to do this in R and Python, but not in Java. A similar question was asked before, but no real answer was provided (beyond a link to the H2O docs, which don't contain a code example).
For my present purpose it would be sufficient to get some pointers for translating the following reference code to Java. I'm mainly looking for guidance on the relevant JAR(s) to import from the Maven repository.
import h2o
h2o.init()
path = h2o.system_file("prostate.csv")
h2o_df = h2o.import_file(path)
h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
model = h2o.glm(y = "CAPSULE",
                x = ["AGE", "RACE", "PSA", "GLEASON"],
                training_frame = h2o_df,
                family = "binomial")
h2o.download_pojo(model)

I think I've figured out an answer to my question. A self-contained code sample follows. However, I'd still appreciate an answer from the community, since I don't know whether this is the best/idiomatic way to do it.
package org.name.company;
import hex.glm.GLMModel;
import water.H2O;
import water.Key;
import water.api.StreamWriter;
import water.api.StreamingSchema;
import water.fvec.Frame;
import water.fvec.NFSFileVec;
import hex.glm.GLMModel.GLMParameters.Family;
import hex.glm.GLMModel.GLMParameters;
import hex.glm.GLM;
import water.util.JCodeGen;
import java.io.*;
import java.util.Map;
public class Launcher {

    public static void initCloud() {
        String[] args = new String[] {"-name", "h2o_test_cloud"};
        H2O.main(args);
        H2O.waitForCloudSize(1, 10 * 1000);
    }

    public static void main(String[] args) throws Exception {
        // Initialize the cloud
        initCloud();

        // Create a Frame object from CSV
        File f = new File("/path/to/data.csv");
        NFSFileVec nfs = NFSFileVec.make(f);
        Key frameKey = Key.make("frameKey");
        Frame fr = water.parser.ParseDataset.parse(frameKey, nfs._key);

        // Create a GLM and output coefficients
        Key modelKey = Key.make("modelKey");
        try {
            GLMParameters params = new GLMParameters();
            params._train = frameKey;
            params._response_column = fr.names()[1];
            params._intercept = true;
            params._lambda = new double[]{0};
            params._family = Family.gaussian;

            GLMModel model = new GLM(params).trainModel().get();
            Map<String, Double> coefs = model.coefficients();
            for (Map.Entry<String, Double> entry : coefs.entrySet()) {
                System.out.format("%s: %f\n", entry.getKey(), entry.getValue());
            }

            String filename = JCodeGen.toJavaId(model._key.toString()) + ".java";
            StreamingSchema ss = new StreamingSchema(model.new JavaModelStreamWriter(false), filename);
            StreamWriter sw = ss.getStreamWriter();
            OutputStream os = new FileOutputStream("/base/path/" + filename);
            sw.writeTo(os);
        } finally {
            if (fr != null) {
                fr.remove();
            }
        }
    }
}

Would something like this do the trick?
public void saveModel(URI uri, Keyed<Frame> model) {
    Persist p = H2O.getPM().getPersistForURI(uri);
    OutputStream os = p.create(uri.toString(), true);
    model.writeAll(new AutoBuffer(os, true)).close();
}
Make sure the URI is well formed, otherwise H2O will fail with an NPE. As for Maven, you should be able to get away with just h2o-core:
<dependency>
    <groupId>ai.h2o</groupId>
    <artifactId>h2o-core</artifactId>
    <version>3.14.0.2</version>
</dependency>
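If it helps, here is a hypothetical usage of the helper above (the target path is only an example, and since the parameter is declared as Keyed<Frame>, passing the trained GLMModel from the first snippet may require widening the parameter type to something like Keyed<?>):

// Illustrative only: save the trained model to a local file.
URI target = URI.create("file:///base/path/glm_model.bin");
saveModel(target, model); // model is the trained GLMModel from the snippet above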

Related

How to avoid getting a memory leak while copying a VectorSchemaRoot

I need to copy all of the contents of a stream of VectorSchemaRoots into a single object:
Stream<VectorSchemaRoot> data = fetchStream();
VectorSchemaRoot finalResult = VectorSchemaRoot.create(schema, allocator);
VectorLoader loader = new VectorLoader(finalResult);
data.forEach(current -> {
    VectorUnloader unloader = new VectorUnloader(current);
    ArrowRecordBatch batch = unloader.getRecordBatch();
    loader.load(batch);
    current.close();
});
However, I am getting the following error:
java.lang.IllegalStateException: Memory was leaked by query. Memory was leaked.
Further down the stack trace I also get:
Could not load buffers for field date: Timestamp(MILLISECOND, null) not null. error message: A buffer can only be associated between two allocators that share the same root
I use the same allocator for everything; does anyone know why I am getting this issue?
The "leak" is probably just a side effect of the exception, because the code as written is not exception-safe. Use try-with-resources to manage the ArrowRecordBatch instead of manually calling close():
try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
    loader.load(batch);
}
(though, depending on what load does, this may not be enough).
I can't say much else about why you're getting the exception without seeing more code and the full stack trace.
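Putting that together, here is a minimal sketch of the loop from the question with the record batch in try-with-resources and the source root closed in a finally block (variable names follow the question; whether this alone fixes the leak depends on what else holds buffers):

VectorSchemaRoot finalResult = VectorSchemaRoot.create(schema, allocator);
VectorLoader loader = new VectorLoader(finalResult);
data.forEach(current -> {
    VectorUnloader unloader = new VectorUnloader(current);
    // try-with-resources releases the batch even if load() throws
    try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
        loader.load(batch);
    } finally {
        current.close(); // release the source root as well
    }
});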
Could you try with something like this:
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.util.Arrays;
import java.util.Collections;
import java.util.stream.Stream;
public class StackOverFlowSolved {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator()) {
            // load data
            IntVector ageColumn = new IntVector("age", allocator);
            ageColumn.allocateNew();
            ageColumn.set(0, 1);
            ageColumn.set(1, 2);
            ageColumn.set(2, 3);
            ageColumn.setValueCount(3);
            Stream<VectorSchemaRoot> streamOfVSR =
                    Collections.singletonList(VectorSchemaRoot.of(ageColumn)).stream();

            // transfer data
            streamOfVSR.forEach(current -> {
                Field ageLoad = new Field("age",
                        FieldType.nullable(new ArrowType.Int(32, true)), null);
                Schema schema = new Schema(Arrays.asList(ageLoad));
                try (VectorSchemaRoot root = VectorSchemaRoot.create(schema,
                        allocator.newChildAllocator("loaddata", 0, Integer.MAX_VALUE))) {
                    VectorUnloader unload = new VectorUnloader(current);
                    try (ArrowRecordBatch recordBatch = unload.getRecordBatch()) {
                        VectorLoader loader = new VectorLoader(root);
                        loader.load(recordBatch);
                    }
                    System.out.println(root.contentToTSVString());
                }
                current.close();
            });
        }
    }
}
age
1
2
3

Not able to load shapefile using GeoTools

I am trying to use GeoTools to load a shapefile into Java and then check whether a point is located within one of the polygons in the shapefile.
The problem is that I am not able to load the shapefile, and therefore cannot continue.
Here is my code so far:
public static void main(String[] args) {
    // create sample coordinate
    double lon = -105.0;
    double lat = 40.0;
    GeometryFactory geometryFactory =
            new GeometryFactory(new PrecisionModel(PrecisionModel.maximumPreciseValue), 8307);
    Geometry point = geometryFactory.createPoint(new Coordinate(lon, lat));

    String path = System.getProperty("user.dir") + "/continent_shp/continent_shp.shp";
    File file = new File(path);
    try {
        Map<String, Serializable> connectParameters = new HashMap<String, Serializable>();

        // load shapefile ---- does not work !!!!!!!!
        connectParameters.put("url", file.toURI().toURL());
        connectParameters.put("create spatial index", true);
        DataStore dataStore = DataStoreFinder.getDataStore(connectParameters);

        FeatureSource featureSource = dataStore.getFeatureSource("POLYGON");
        FeatureCollection collection = (FeatureCollection) featureSource.getFeatures();
        FeatureIterator iterator = collection.features();
        while (iterator.hasNext()) {
            Feature feature = iterator.next();
            Geometry sourceGeometry = feature.getDefaultGeometry();
            boolean isContained = sourceGeometry.contains(point);
            System.out.println(isContained);
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The problem is that the dataStore variable is null after I try to load the shapefile.
Here are my imports:
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;
import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.FeatureSource;
import org.geotools.feature.Feature;
import org.geotools.feature.FeatureCollection;
import org.geotools.feature.FeatureIterator;
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.PrecisionModel;
Can anyone shed some light on this issue?
Any help would be appreciated.
Thank you.
The most likely problem is that you don't have a shapefile DataStore implementation available on your classpath. Try the following method to check which stores are available:
public Map<String, DataStoreFactorySpi> fetchAvailableDataStores() {
    Map<String, DataStoreFactorySpi> available = new HashMap<String, DataStoreFactorySpi>();
    Iterator<DataStoreFactorySpi> it = DataStoreFinder.getAllDataStores();
    while (it.hasNext()) {
        DataStoreFactorySpi fac = it.next();
        System.out.println(fac.getDisplayName());
        available.put(fac.getDisplayName(), fac);
    }
    return available;
}
Another thing that can go wrong is the File-to-URL conversion, especially if there are spaces in the file name or path. Try using DataUtilities.fileToURL(file) instead.
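For example, a sketch of the connection parameters built with DataUtilities.fileToURL (import org.geotools.data.DataUtilities; this goes inside the existing try block):

Map<String, Serializable> connectParameters = new HashMap<String, Serializable>();
// DataUtilities.fileToURL copes with spaces and special characters better than file.toURI().toURL()
connectParameters.put("url", DataUtilities.fileToURL(file));
connectParameters.put("create spatial index", Boolean.TRUE);
DataStore dataStore = DataStoreFinder.getDataStore(connectParameters);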
This worked for me:
// load the shapefile
connectParameters.put("url", file.toURI().toURL());
connectParameters.put("create spatial index", Boolean.TRUE);
ShapefileDataStoreFactory dataStoreFactory = new ShapefileDataStoreFactory();
ShapefileDataStore store = (ShapefileDataStore) dataStoreFactory.createNewDataStore(connectParameters);
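One more thing worth checking once the store is created (an observation about the question's code, not part of the answer above): a shapefile exposes a single feature type named after the file, so getFeatureSource("POLYGON") will not resolve. A sketch:

// The shapefile's only type is named after the file (e.g. "continent_shp"),
// so look the name up instead of hard-coding "POLYGON".
String typeName = store.getTypeNames()[0];
FeatureSource featureSource = store.getFeatureSource(typeName);
FeatureCollection collection = (FeatureCollection) featureSource.getFeatures();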

SlopeOneRecommender not working

I am following the book Apache Mahout Cookbook by Piero Giacomelli. When I download the Maven sources using NetBeans as my IDE, I guess the sources are from Mahout version 1.0 and not 0.8, since an error is shown only on the SlopeOneRecommender import.
Here is the complete code:
package com.packtpub.mahout.cookbook.chapter01;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import org.apache.commons.cli2.OptionException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
public class App {

    static final String inputFile = "/home/hadoop/ml-1m/ratings.dat";
    static final String outputFile = "/home/hadoop/ml-1m/ratings.csv";

    public static void main(String[] args) throws IOException, TasteException, OptionException {
        CreateCsvRatingsFile();

        // create data source (model) - from the csv file
        File ratingsFile = new File(outputFile);
        DataModel model = new FileDataModel(ratingsFile);

        // create a simple recommender on our data
        CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));

        // for all users
        for (LongPrimitiveIterator it = model.getUserIDs(); it.hasNext();) {
            long userId = it.nextLong();

            // get the recommendations for the user
            List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, 10);

            // if empty write something
            if (recommendations.size() == 0) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.println(": no recommendations");
            }

            // print the list of recommendations for each
            for (RecommendedItem recommendedItem : recommendations) {
                System.out.print("User ");
                System.out.print(userId);
                System.out.print(": ");
                System.out.println(recommendedItem);
            }
        }
    }

    private static void CreateCsvRatingsFile() throws FileNotFoundException, IOException {
        BufferedReader br = new BufferedReader(new FileReader(inputFile));
        BufferedWriter bw = new BufferedWriter(new FileWriter(outputFile));
        String line = null;
        String line2write = null;
        String[] temp;
        int i = 0;
        while ((line = br.readLine()) != null && i < 1000) {
            i++;
            temp = line.split("::");
            line2write = temp[0] + "," + temp[1];
            bw.write(line2write);
            bw.newLine();
            bw.flush();
        }
        br.close();
        bw.close();
    }
}
The error is shown only on import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender; and hence on the line where I create an object of that class. The error says the package does not exist.
Please help. Is it because I am using a newer version of Mahout? I am not even certain whether I am using version 0.8 or a higher version, as I followed all the links given in the book.
SlopeOneRecommender was removed from Mahout as of v0.8. If you want to use it, you can switch to an earlier version such as 0.7:
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.7</version>
</dependency>
See http://permalink.gmane.org/gmane.comp.apache.mahout.user/20282
Exactly. SlopeOneRecommender was removed from Mahout as of v0.8, so either go back to version 0.7, or, if your purpose is only to try Mahout, use another recommender such as ItemAverageRecommender (see the sketch below).
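If you stay on a newer Mahout and just want something that compiles, here is a minimal sketch swapping in ItemAverageRecommender, a very simple baseline recommender that slots into the same CachingRecommender wrapper as in the question's code:

import org.apache.mahout.cf.taste.impl.recommender.ItemAverageRecommender;

// replaces the SlopeOneRecommender line in the question's main method
DataModel model = new FileDataModel(new File(outputFile));
CachingRecommender cachingRecommender =
        new CachingRecommender(new ItemAverageRecommender(model));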

Learning Java Compiler APIs, why does trees.getElement(treepath) return null?

I'm trying to parse a Java file with the Java Compiler APIs.
The documentation is very sparse. After hours of digging I still cannot get Trees#getElement to work for me. Here's my code:
import com.sun.source.tree.*;
import com.sun.source.util.*;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
class CodeAnalyzerTreeVisitor extends TreePathScanner<Object, Trees> {

    @Override
    public Object visitClass(ClassTree classTree, Trees trees) {
        // prints name of class
        System.out.println("className " + classTree.getSimpleName());

        // prints the original source code
        TreePath path = getCurrentPath();
        printLocationAndSource(trees, path, classTree);

        // it prints several nulls here -- why?
        while (path != null) {
            System.out.println("treepath");
            System.out.println(trees.getElement(path));
            path = path.getParentPath();
        }
        return super.visitClass(classTree, trees);
    }

    public static void printLocationAndSource(Trees trees, TreePath path, Tree tree) {
        SourcePositions sourcePosition = trees.getSourcePositions();
        long startPosition = sourcePosition.getStartPosition(path.getCompilationUnit(), tree);
        long endPosition = sourcePosition.getEndPosition(path.getCompilationUnit(), tree);
        JavaFileObject file = path.getCompilationUnit().getSourceFile();
        CharBuffer sourceContent = null;
        try {
            sourceContent = CharBuffer.wrap(file.getCharContent(true).toString().toCharArray());
        } catch (IOException e) {
            e.printStackTrace();
        }
        CharBuffer relatedSource = null;
        if (sourceContent != null) {
            relatedSource = sourceContent.subSequence((int) startPosition, (int) endPosition);
        }
        System.out.println("start: " + startPosition + " end: " + endPosition);
        // System.out.println("source: " + relatedSource);
        System.out.println();
    }
}

public class JavaParser {

    private static final JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
    private static final String filePath = "/home/pinyin/Source/hadoop-common/hadoop-yarn-project/hadoop-ya" +
            "rn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/ya" +
            "rn/server/resourcemanager/ResourceManager.java";

    public static void main(String[] args) throws IOException {
        StandardJavaFileManager jfm = javac.getStandardFileManager(null, null, null);
        Iterable<? extends javax.tools.JavaFileObject> javaFileObjects = jfm.getJavaFileObjects(filePath);
        String[] sourcePathParam = {
                "-sourcepath",
                "/home/pinyin/Source/hadoop-common/hadoop-yarn-project/hadoop-yarn/" +
                        "hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/"
        };
        List<String> params = new ArrayList<String>();
        params.addAll(Arrays.asList(sourcePathParam));

        JavacTask task = (JavacTask) javac.getTask(null, jfm, null, params, null, javaFileObjects);
        Iterable<? extends CompilationUnitTree> asts = task.parse();
        Trees trees = Trees.instance(task);
        for (CompilationUnitTree ast : asts) {
            new CodeAnalyzerTreeVisitor().scan(ast, trees);
        }
    }
}
The lines about params and -sourcepath were added because I thought the compiler was trying to find the source files in the wrong place; they didn't help.
I'm still trying to understand how Trees, javac and the related JSRs work together; are there any recommended documents for beginners?
Thanks for your help.
edit:
The java file I'm trying to analyze is:
https://github.com/apache/hadoop-common/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
The file can be compiled without errors in its Maven project, but its dependencies are not passed to javac in my situation. I'm not sure if this is the problem.
trees.getElement returns null in the middle part of the code above, while the other parts seem to work well.
According to this answer, it seems that the Elements' information is not usable until the compilation is completed.
So calling task.analyze() solved my problem, although javac complains about the missing dependencies.
Please correct me if I'm wrong, thanks.
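Concretely, the only change to the main method above is to run the analysis phase before scanning; a sketch (error handling omitted):

JavacTask task = (JavacTask) javac.getTask(null, jfm, null, params, null, javaFileObjects);
Iterable<? extends CompilationUnitTree> asts = task.parse();
// analyze() resolves symbols so that trees.getElement(path) can return non-null Elements;
// missing dependencies are reported as errors but attribution still runs.
task.analyze();
Trees trees = Trees.instance(task);
for (CompilationUnitTree ast : asts) {
    new CodeAnalyzerTreeVisitor().scan(ast, trees);
}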

Creating graph with Neo4j graph database takes too long

I use the following code to create a graph with Neo4j Graph Database:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;
public class Neo4jMassiveInsertion implements Insertion {

    private BatchInserter inserter = null;
    private BatchInserterIndexProvider indexProvider = null;
    private BatchInserterIndex nodes = null;

    private static enum RelTypes implements RelationshipType {
        SIMILAR
    }

    public static void main(String args[]) {
        Neo4jMassiveInsertion test = new Neo4jMassiveInsertion();
        test.startup("data/neo4j");
        test.createGraph("data/enronEdges.txt");
        test.shutdown();
    }

    /**
     * Start neo4j database and configure for massive insertion
     * @param neo4jDBDir
     */
    public void startup(String neo4jDBDir) {
        System.out.println("The Neo4j database is now starting . . . .");
        Map<String, String> config = new HashMap<String, String>();
        inserter = BatchInserters.inserter(neo4jDBDir, config);
        indexProvider = new LuceneBatchInserterIndexProvider(inserter);
        nodes = indexProvider.nodeIndex("nodes", MapUtil.stringMap("type", "exact"));
    }

    public void shutdown() {
        System.out.println("The Neo4j database is now shutting down . . . .");
        if (inserter != null) {
            indexProvider.shutdown();
            inserter.shutdown();
            indexProvider = null;
            inserter = null;
        }
    }

    public void createGraph(String datasetDir) {
        System.out.println("Creating the Neo4j database . . . .");
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
            String line;
            int lineCounter = 1;
            Map<String, Object> properties;
            IndexHits<Long> cache;
            long srcNode, dstNode;
            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) {
                    String[] parts = line.split("\t");
                    cache = nodes.get("nodeId", parts[0]);
                    if (cache.hasNext()) {
                        srcNode = cache.next();
                    } else {
                        properties = MapUtil.map("nodeId", parts[0]);
                        srcNode = inserter.createNode(properties);
                        nodes.add(srcNode, properties);
                        nodes.flush();
                    }
                    cache = nodes.get("nodeId", parts[1]);
                    if (cache.hasNext()) {
                        dstNode = cache.next();
                    } else {
                        properties = MapUtil.map("nodeId", parts[1]);
                        dstNode = inserter.createNode(properties);
                        nodes.add(dstNode, properties);
                        nodes.flush();
                    }
                    inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
                }
                lineCounter++;
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Compared with other graph database technologies (Titan, OrientDB) it takes far too long, so maybe I am doing something wrong. Is there a way to speed up the procedure?
I use Neo4j 1.9.5; my machine has a 2.3 GHz CPU (i5), 4 GB RAM and a 320 GB disk, and I am running Mac OS X Mavericks (10.9). My heap size is 2 GB.
Usually I can import about 1M nodes and 200k relationships per second on my macbook.
Flush & Search
Please don't flush & search on every insert; that totally kills performance.
Keep your node IDs in a HashMap from your data to node-id, and only write to Lucene during the import (see the sketch below).
(If you care about memory usage you can also go with something like gnu-trove.)
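A minimal sketch of that change, assuming the inserter and nodes fields from the class above (only the relevant fragments are shown):

// in-memory lookup instead of querying the Lucene index per line
private final Map<String, Long> nodeCache = new HashMap<String, Long>();

private long getOrCreate(String nodeId) {
    Long id = nodeCache.get(nodeId);
    if (id == null) {
        Map<String, Object> properties = MapUtil.map("nodeId", nodeId);
        id = inserter.createNode(properties);
        nodes.add(id, properties); // index write only -- no flush, no lookup
        nodeCache.put(nodeId, id);
    }
    return id;
}

// inside the while loop of createGraph():
String[] parts = line.split("\t");
long srcNode = getOrCreate(parts[0]);
long dstNode = getOrCreate(parts[1]);
inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);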
RAM
Memory Mapping
You also use too little RAM (I usually use heaps between 4 and 60 GB depending on the data set size) and you don't have any config set.
As a sensible config, check something like the following; depending on your data volume I'd raise these numbers.
cache_type=none
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
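These settings can be passed straight into the startup() method above instead of the empty map; a sketch using the values suggested here (tune them to your data volume):

Map<String, String> config = new HashMap<String, String>();
config.put("cache_type", "none");
config.put("use_memory_mapped_buffers", "true");
config.put("neostore.nodestore.db.mapped_memory", "200M");
config.put("neostore.relationshipstore.db.mapped_memory", "1000M");
config.put("neostore.propertystore.db.mapped_memory", "250M");
config.put("neostore.propertystore.db.strings.mapped_memory", "250M");
inserter = BatchInserters.inserter(neo4jDBDir, config);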
Heap
And make sure to give it enough heap; you might also have a disk that is not the fastest. Try to increase your heap to at least 3 GB. Also make sure to have the latest JDK; 1.7.._b25 had a memory allocation issue (it allocated only a tiny bit of memory for the
