Get actual field name from JPMML model's InputField - java

I have a scikit-learn model that I'm using in my Java app via JPMML. I'm trying to set the InputFields using the name of the column that was used during training, but "inField.getName().getValue()" is obfuscated to "x{#}". Is there any way I could map "x{#}" back to the original feature/attribute name?
Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
for (InputField inField : patternEvaluator.getInputFields()) {
    int value = activeFeatures.contains(inField.getName().getValue()) ? 1 : 0;
    FieldValue inputFieldValue = inField.prepare(value);
    arguments.put(inField.getName(), inputFieldValue);
}
Map<FieldName, ?> results = patternEvaluator.evaluate(arguments);
Here's how I'm generating the model:
from sklearn2pmml import PMMLPipeline, sklearn2pmml
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np

data = pd.read_csv('/pydata/training.csv')
X = data[data.keys()[:-1]].as_matrix()
y = data['classname'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
estimators = [("read", RandomForestClassifier(n_jobs=5, n_estimators=200, max_features='auto'))]
pipe = PMMLPipeline(estimators)
pipe.fit(X_train, y_train)
pipe.active_fields = np.array(data.columns)
sklearn2pmml(pipe, "/pydata/model.pmml", with_repr=True)
Thanks

Does the PMML document contain actual field names at all? Open it in a text editor, and see what the values of the /PMML/DataDictionary/DataField@name attributes are.
Your question indicates that the conversion from Scikit-Learn to PMML was incomplete, because it didn't include information about active field (aka input field) names. In that case they are assumed to be x1, x2, ..., xn.
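For instance, a minimal sketch with the JPMML-Model API (the path is the one from your script, and the class name is just for illustration) that prints the declared field names programmatically:
import java.io.FileInputStream;
import java.io.InputStream;

import org.dmg.pmml.DataField;
import org.dmg.pmml.PMML;
import org.jpmml.model.PMMLUtil;

public class ListDataFields {
    public static void main(String[] args) throws Exception {
        try (InputStream is = new FileInputStream("/pydata/model.pmml")) {
            PMML pmml = PMMLUtil.unmarshal(is);
            for (DataField dataField : pmml.getDataDictionary().getDataFields()) {
                // With a complete conversion these are the original column names;
                // otherwise you will only see the generic x1, x2, ..., xn placeholders.
                System.out.println(dataField.getName().getValue());
            }
        }
    }
}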

Your pipeline only includes the estimator; that is why the names are lost. You have to include all the preprocessing steps as well in order to get them into the PMML.
Let's assume you do not do any preprocessing at all; then this is probably what you need (parts of your code that this snippet relies on are not repeated here):
from sklearn_pandas import DataFrameMapper

nones = [(d, None) for d in data.columns]
mapper = DataFrameMapper(nones, df_out=True)
lm = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", RandomForestClassifier(n_jobs=5, n_estimators=200, max_features='auto'))
])
lm.fit(X_train, y_train)
sklearn2pmml(lm, "ScikitLearnNew.pmml", with_repr=True)
In case you do require some preprocessing of your data, you can use any other transformer instead of None (e.g. LabelBinarizer). But the preprocessing has to happen inside the pipeline in order to be included in the PMML.
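Once the mapper is part of the pipeline and the PMML is regenerated, the Java loop from the question should report the original column headers. A quick hedged check:
// With the DataFrameMapper included, these should be the CSV column names
// instead of the generic x1, x2, ..., xn placeholders.
for (InputField inField : patternEvaluator.getInputFields()) {
    System.out.println(inField.getName().getValue());
}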

Related

Using trained TensorFlow model in Java

I have trained a TensorFlow model in Python and would like to use it in Java code. Training the model is done via something like this code:
import tensorflow as tf

def input_fn():
    features = {'a': tf.constant([[1], [2]]),
                'b': tf.constant([[3], [4]])}
    labels = tf.constant([0, 1])
    return features, labels

feature_a = tf.contrib.layers.sparse_column_with_integerized_feature("a", bucket_size=10)
feature_b = tf.contrib.layers.sparse_column_with_integerized_feature("b", bucket_size=10)
feature_columns = [feature_a, feature_b]
model = tf.contrib.learn.LinearClassifier(feature_columns=feature_columns)
model.fit(input_fn=input_fn, steps=10)
Now I want to save this model to use it in Java. It seems that export_savedmodel is the new/preferred way of saving, so I tried:
from tensorflow.contrib.learn.python.learn.utils import input_fn_utils

feature_spec = tf.contrib.layers.create_feature_spec_for_parsing(feature_columns)
serving_input_fn = input_fn_utils.build_parsing_serving_input_fn(feature_spec)
model.export_savedmodel('export', serving_input_fn, as_text=True)
This results in a saved model, which can be loaded from Java with
model = SavedModelBundle.load(dir, "serve");
model.session().runner()
.feed("input_example_tensor", input)
.fetch("linear/binary_logistic_head/predictions/probabilities")
.run();
There is now a problem though: the input_example_tensor should be a Tensor containing Strings/byte[]s, but this is not supported in Java yet (see: Tensor.java#88 "throw new UnsupportedOperationException"). As far as I understand it, the reason that it wants a String is that build_parsing_serving_input_fn wants to parse serialized Example protocol buffers.
Maybe a different serving_input_fn would do better. input_fn_utils.build_default_serving_input_fn looks promising, but I didn't get that to work.
If I call it like:
features_dict = {'a': feature_a, 'b': feature_b}
serving_input_fn = input_fn_utils.build_default_serving_input_fn(features_dict)
I get "AttributeError: '_SparseColumnIntegerized' object has no attribute 'get_shape'"
If I call it like:
features = {'a': tf.constant([[1],[2]]),
'b': tf.constant([[3],[4]]) }
serving_input_fn = input_fn_utils.build_default_serving_input_fn(features)
I get "ValueError: 'Const:0' is not a valid scope name".
What is the proper way to use input_fn_utils.build_default_serving_input_fn? I can't find any example that uses it.

Cast from GrammaticalStructure to Tree

I am trying out the new NN Dependency Parser from Stanford. According to the demo they have provided, this is how the parsing is done:
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.parser.nndep.DependencyParser;
...
GrammaticalStructure gs = null;
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(sentence));
for (List<HasWord> sent : tokenizer) {
List<TaggedWord> tagged = tagger.tagSentence(sent);
gs = parser.predict(tagged);
// Print typed dependencies
System.out.println("Grammatical structure: " + gs);
}
Now, what I want is to cast this object gs, which is of class GrammaticalStructure, to a Tree object from edu.stanford.nlp.trees.Tree.
I naively tried out with simple casting:
Tree t = (Tree) gs;
but, this is not possible (the IDE gives an error: Cannot cast from GrammaticalStructure to Tree).
How do I do this?
You should be able to get the Tree using gs.root().
According to the documentation, that method returns a Tree (actually, a TreeGraphNode) which represents the grammatical structure.
You could print that tree in a human-friendly way with gs.root().pennPrint().
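Putting it together with the loop from the question (a small sketch; since TreeGraphNode is a subclass of Tree, no cast is needed):
Tree tree = gs.root();   // actually a TreeGraphNode, which extends Tree
tree.pennPrint();        // prints the structure in Penn Treebank style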

How to add RDF triples to an OWLOntology?

I have some data coming in from a RabbitMQ. The data is formatted as triples, so a message from the queue could look something like this:
:Tom foaf:knows :Anna
where : is the standard namespace of the ontology into which I want to import the data, but other prefixes from imports are also possible. The triples consist of subject, property/predicate and object and I know in each message which is which.
On the receiving side, I have a Java program with an OWLOntology object that represents the ontology where the newly arriving triples should be stored temporarily for reasoning and other stuff.
I kind of managed to get the triples into a Jena OntModel but that's where it ends. I tried to use OWLRDFConsumer but I could not find anything about how to apply it.
My function looks something like this:
public void addTriple(RDFTriple triple) {
//OntModel model = ModelFactory.createOntologyModel();
String subject = triple.getSubject().toString();
subject = subject.substring(1,subject.length()-1);
Resource s = ResourceFactory.createResource(subject);
String predicate = triple.getPredicate().toString();
predicate = predicate.substring(1,predicate.length()-1);
Property p = ResourceFactory.createProperty(predicate);
String object = triple.getObject().toString();
object = object.substring(1,object.length()-1);
RDFNode o = ResourceFactory.createResource(object);
Statement statement = ResourceFactory.createStatement(s, p, o);
//model.add(statement);
System.out.println(statement.toString());
}
I did the substring operations because the RDFTriple class adds <> around the arguments of the triple and the constructor of Statement fails as a consequence.
If anybody could point me to an example that would be great. Maybe there's a much better way that I haven't thought of to achieve the same thing?
It seems like the OWLRDFConsumer is generally used to connect the RDF parsers with OWL-aware processors. The following code seems to work, though, as I've noted in the comments, there are a couple of places where I needed an argument and put in the only available thing I could.
The following code: creates an ontology; declares two named individuals, Tom and Anna; declares an object property, likes; and declares a data property, age. Once these are declared we print the ontology just to make sure that it's what we expect.
Then it creates an OWLRDFConsumer. The consumer constructor needs an ontology, an AnonymousNodeChecker, and an OWLOntologyLoaderConfiguration. For the configuration, I just used one created by the no-argument constructor, and I think that's OK. For the node checker, the only convenient implementer is the TurtleParser, so I created one of those, passing null as the Reader. I think this will be OK, since the parser won't be called to read anything.
Then the consumer's handle(IRI,IRI,IRI) and handle(IRI,IRI,OWLLiteral) methods are used to process triples one at a time. We add the triples
:Tom :likes :Anna
:Tom :age 35
and then print out the ontology again to ensure that the assertions got added. Since you've already been getting the RDFTriples, you should be able to pull out the arguments that handle() needs. Before processing the triples, the ontology contained:
<NamedIndividual rdf:about="http://example.org/Tom"/>
and afterward this:
<NamedIndividual rdf:about="http://example.org/Tom">
<example:age rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">35</example:age>
<example:likes rdf:resource="http://example.org/Anna"/>
</NamedIndividual>
Here's the code:
import java.io.Reader;
import org.coode.owlapi.rdfxml.parser.OWLRDFConsumer;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.model.OWLDataProperty;
import org.semanticweb.owlapi.model.OWLEntity;
import org.semanticweb.owlapi.model.OWLNamedIndividual;
import org.semanticweb.owlapi.model.OWLObjectProperty;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyLoaderConfiguration;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.model.OWLOntologyStorageException;
import uk.ac.manchester.cs.owl.owlapi.turtle.parser.TurtleParser;
public class ExampleOWLRDFConsumer {
public static void main(String[] args) throws OWLOntologyCreationException, OWLOntologyStorageException {
// Create an ontology.
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
OWLDataFactory factory = manager.getOWLDataFactory();
OWLOntology ontology = manager.createOntology();
// Create some named individuals and an object property.
String ns = "http://example.org/";
OWLNamedIndividual tom = factory.getOWLNamedIndividual( IRI.create( ns+"Tom" ));
OWLObjectProperty likes = factory.getOWLObjectProperty( IRI.create( ns+"likes" ));
OWLDataProperty age = factory.getOWLDataProperty( IRI.create( ns+"age" ));
OWLNamedIndividual anna = factory.getOWLNamedIndividual( IRI.create( ns+"Anna" ));
// Add the declarations axioms to the ontology so that the triples involving
// these are understood (otherwise the triples will be ignored).
for ( OWLEntity entity : new OWLEntity[] { tom, likes, age, anna } ) {
manager.addAxiom( ontology, factory.getOWLDeclarationAxiom( entity ));
}
// Print the ontology to see that the entities are declared.
// The important result is
// <NamedIndividual rdf:about="http://example.org/Tom"/>
// with no properties
manager.saveOntology( ontology, System.out );
// Create an OWLRDFConsumer for the ontology. TurtleParser implements AnonymousNodeChecker, so
// it was a candidate for use here (but I make no guarantees about whether it's appropriate to
// do this). Since it won't be reading anything, we pass it a null Reader, and this doesn't
// *seem* to cause any problem. Hopefully the default OWLOntologyLoaderConfiguration is OK, too.
OWLRDFConsumer consumer = new OWLRDFConsumer( ontology, new TurtleParser((Reader) null), new OWLOntologyLoaderConfiguration() );
// The consumer handles (IRI,IRI,IRI) and (IRI,IRI,OWLLiteral) triples.
consumer.handle( tom.getIRI(), likes.getIRI(), anna.getIRI() );
consumer.handle( tom.getIRI(), age.getIRI(), factory.getOWLLiteral( 35 ));
// Print the ontology to see the new object and data property assertions. The important content is
// still Tom:
// <NamedIndividual rdf:about="http://example.org/Tom">
// <example:age rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">35</example:age>
// <example:likes rdf:resource="http://example.org/Anna"/>
// </NamedIndividual>
manager.saveOntology( ontology, System.out );
}
}
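Since you are already receiving RDFTriple objects, the glue between your addTriple(...) and the consumer could look roughly like this (a hedged sketch: it assumes a consumer field set up as above, that all three positions are named resources, i.e. no blank nodes or literals, and the toIRI helper is purely illustrative):
public void addTriple(RDFTriple triple) {
    // Strip the surrounding <> that RDFTriple's toString() adds, as noted in the question.
    IRI s = toIRI(triple.getSubject().toString());
    IRI p = toIRI(triple.getPredicate().toString());
    IRI o = toIRI(triple.getObject().toString());
    consumer.handle(s, p, o);
}

private static IRI toIRI(String bracketed) {
    return IRI.create(bracketed.substring(1, bracketed.length() - 1));
}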
In ONT-API, which is an extended Jena-based implementation of OWL-API, it is quite simple:
OWLOntologyManager manager = OntManagers.createONT();
OWLOntology ontology = manager.createOntology(IRI.create("http://example.com#test"));
((Ontology)ontology).asGraphModel().createResource("http://example.com#clazz1").addProperty(RDF.type, OWL.Class);
ontology.axioms(AxiomType.DECLARATION).forEach(System.out::println);
For more information, see the ONT-API wiki and examples.
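And since asGraphModel() exposes a Jena model, the Statement built in the question's addTriple(...) could presumably be added to it directly; a sketch under that assumption:
// 'statement' is the Jena Statement from the question's addTriple(...) method.
((Ontology) ontology).asGraphModel().add(statement);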

Scala macros and the JVM's method size limit

I'm replacing some code generation components in a Java program with Scala macros, and am running into the Java Virtual Machine's limit on the size of the generated byte code for individual methods (64 kilobytes).
For example, suppose we have a large-ish XML file that represents a mapping from integers to integers that we want to use in our program. We want to avoid parsing this file at run time, so we'll write a macro that will do the parsing at compile time and use the contents of the file to create the body of our method:
import scala.language.experimental.macros
import scala.reflect.macros.Context
object BigMethod {
// For this simplified example we'll just make some data up.
val mapping = List.tabulate(7000)(i => (i, i + 1))
def lookup(i: Int): Int = macro lookup_impl
def lookup_impl(c: Context)(i: c.Expr[Int]): c.Expr[Int] = {
import c.universe._
val switch = reify(new scala.annotation.switch).tree
val cases = mapping map {
case (k, v) => CaseDef(c.literal(k).tree, EmptyTree, c.literal(v).tree)
}
c.Expr(Match(Annotated(switch, i.tree), cases))
}
}
In this case the compiled method would be just over the size limit, but instead of a nice error saying that, we're given a giant stack trace with a lot of calls to TreePrinter.printSeq and are told that we've slain the compiler.
I have a solution that involves splitting the cases into fixed-sized groups, creating a separate method for each group, and adding a top-level match that dispatches the input value to the appropriate group's method. It works, but it's unpleasant, and I'd prefer not to have to use this approach every time I write a macro where the size of the generated code depends on some external resource.
Is there a cleaner way to tackle this problem? More importantly, is there a way to deal with this kind of compiler error more gracefully? I don't like the idea of a library user getting an unintelligible "That entry seems to have slain the compiler" error message just because some XML file that's being processed by a macro has crossed some (fairly low) size threshold.
IMO, putting data into .class files isn't really a good idea.
They are parsed as well; they're just binary. But storing them in the JVM may have a negative impact on the performance of the garbage collector and the JIT compiler.
In your situation, I would pre-compile the XML into a binary file of a proper format and parse that. Eligible formats with existing tooling are e.g. FastRPC or good old DBF. Or maybe pre-fill an ElasticSearch repository if you need quick advanced lookups and searches. Some implementations of the latter may also provide basic indexing, which could even leave the parsing out - the app would just read from the respective offset.
Since somebody has to say something, I followed the instructions at Importers to try to compile the tree before returning it.
If you give the compiler plenty of stack, it will correctly report the error.
(It didn't seem to know what to do with the switch annotation, left as a future exercise.)
apm@mara:~/tmp/bigmethod$ skalac bigmethod.scala ; skalac -J-Xss2m biguser.scala ; skala bigmethod.Test
Error is java.lang.RuntimeException: Method code too large!
Error is java.lang.RuntimeException: Method code too large!
biguser.scala:5: error: You ask too much of me.
Console println s"5 => ${BigMethod.lookup(5)}"
^
one error found
as opposed to
apm@mara:~/tmp/bigmethod$ skalac -J-Xss1m biguser.scala
Error is java.lang.StackOverflowError
Error is java.lang.StackOverflowError
biguser.scala:5: error: You ask too much of me.
Console println s"5 => ${BigMethod.lookup(5)}"
^
where the client code is just that:
package bigmethod
object Test extends App {
Console println s"5 => ${BigMethod.lookup(5)}"
}
My first time using this API, but not my last. Thanks for getting me kickstarted.
package bigmethod
import scala.language.experimental.macros
import scala.reflect.macros.Context
object BigMethod {
// For this simplified example we'll just make some data up.
//final val size = 700
final val size = 7000
val mapping = List.tabulate(size)(i => (i, i + 1))
def lookup(i: Int): Int = macro lookup_impl
def lookup_impl(c: Context)(i: c.Expr[Int]): c.Expr[Int] = {
def compilable[T](x: c.Expr[T]): Boolean = {
import scala.reflect.runtime.{ universe => ru }
import scala.tools.reflect._
//val mirror = ru.runtimeMirror(c.libraryClassLoader)
val mirror = ru.runtimeMirror(getClass.getClassLoader)
val toolbox = mirror.mkToolBox()
val importer0 = ru.mkImporter(c.universe)
type ruImporter = ru.Importer { val from: c.universe.type }
val importer = importer0.asInstanceOf[ruImporter]
val imported = importer.importTree(x.tree)
val tree = toolbox.resetAllAttrs(imported.duplicate)
try {
toolbox.compile(tree)
true
} catch {
case t: Throwable =>
Console println s"Error is $t"
false
}
}
import c.universe._
val switch = reify(new scala.annotation.switch).tree
val cases = mapping map {
case (k, v) => CaseDef(c.literal(k).tree, EmptyTree, c.literal(v).tree)
}
//val res = c.Expr(Match(Annotated(switch, i.tree), cases))
val res = c.Expr(Match(i.tree, cases))
// before returning a potentially huge tree, try compiling it
//import scala.tools.reflect._
//val x = c.Expr[Int](c.resetAllAttrs(res.tree.duplicate))
//val y = c.eval(x)
if (!compilable(res)) c.abort(c.enclosingPosition, "You ask too much of me.")
res
}
}

Mahout : To read a custom input file

I was playing with Mahout and found that the FileDataModel accepts data in the format
userId,itemId,pref(long,long,Double).
I have some data which is of the format
String,long,double
What is the best/easiest method to work with this dataset on Mahout?
One way to do this is by creating an extension of FileDataModel. You'll need to override the readUserIDFromString(String value) method to use some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.
For example, assuming you have an initialized MemoryIDMigrator, you could do this:
@Override
protected long readUserIDFromString(String stringID) {
    long result = memoryIDMigrator.toLongID(stringID);
    memoryIDMigrator.storeMapping(result, stringID);
    return result;
}
This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
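For example, after the recommender returns numeric IDs, the same migrator can do the reverse lookup (a small sketch, assuming the mapping was stored while reading the file as above):
// userID is one of the long IDs produced by readUserIDFromString(...)
String originalUserID = memoryIDMigrator.toStringID(userID);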
userId and itemId can be strings, so this is what the CustomFileDataModel does: it converts your strings into integers and keeps the (String, id) map in memory; after getting recommendations you can get the string back from the id.
Assuming that your input fits in memory, loop through it. Track the ID for each string in a dictionary. If it does not fit in memory, use sort and then group by to accomplish the same idea.
In python:
import sys

next_id = 0
str_to_id = {}
for line in sys.stdin:
    fields = line.strip().split(',')
    this_id = str_to_id.get(fields[0])
    if this_id is None:
        next_id += 1
        this_id = next_id
        str_to_id[fields[0]] = this_id
    fields[0] = str(this_id)
    print(','.join(fields))
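If you would rather do that preprocessing on the Java side, a plain-JDK equivalent of the script above could look like this (a sketch only; the class name is illustrative, and it reads the CSV from stdin and writes the remapped CSV to stdout):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class RemapStringIds {
    public static void main(String[] args) throws Exception {
        Map<String, Long> strToId = new HashMap<>();
        long nextId = 0;
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split(",");
            Long thisId = strToId.get(fields[0]);
            if (thisId == null) {
                thisId = ++nextId;              // assign the next numeric id to an unseen string
                strToId.put(fields[0], thisId);
            }
            fields[0] = String.valueOf(thisId); // replace the string id with its numeric id
            System.out.println(String.join(",", fields));
        }
    }
}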
