WEKA - Classifying New Data from Java - IDF Transform - java

We are trying to implement a WEKA classifier from inside a Java program. So far so good, everything works well however when building the classifier from the training set in Weka GUI we used the StringToWordVector IDF transform to help improve classification accuracy.
How, from within Java for new instances do I calculate the IDF transform to set for each token value in the new instance before passing the instance to the classifier?
The basic code looks like this:
Instances ins = vectorize(msg);
Instances unlabeled = new Instances(train,1);
Instance inst = new Instance(unlabeled.numAttributes());
String tmp = "";
for(int i=0; i < ins.numAttributes(); i++) {
tmp = ins.attribute(i).name();
if(unlabeled.attribute(tmp)!=null)
inst.setValue(unlabeled.attribute(tmp), 1.0); //TODO: Need to figure out the IDF transformed value to put here NOT 1!!
}
unlabeled.add(inst);
unlabeled.setClassIndex(classIdx);
.....cl.distributionForInstance(unlabeled.instance(i));
So how do I go about coding this so that I put the correct value in the new instance I want to classify?
Just to be clear the line inst.setValue(unlabeled.attribute(tmp), 1.0); needs to be changed from 1.0 to the IDF transformed number...

You need to use FilteredClassifier for this purpose. The code snippet is :
StringToWordVector strWVector = new StringToWordVector();
filteredClassifier fcls = new FilteredClassifier();
fcls.setFilter(strWVector);
fcls.setClassifier(new SMO());
fcls.buildClassifier(yourdata)
//rest of your code
This is much easier as you can pass your instances all at once.FilteredClassifier takes care of all other details. The code is not tested but it will get you started.
Edit : You can do in the following way too. This is code snippet from weka tutorial See http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Filter-Filtering%20on-the-fly Batch Mode for details
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train); // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter); // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter); // create new test se
HTH

Related

Returning value from persisted Groovy code without using java.io.File

In my end product, I provide the ability to extend the application code at runtime using small Groovy scripts that are edited via a form and whose code are persisted in the SQL database.
The scheme that these "custom code" snippets follow is generally returning a value based on input parameters. For example, during the invoicing of a service, the rating system might use a published schedule of predetermined rates, or values defined in a contract in the application, through custom groovy code, if an "overridden" value is returned, then it should be used.
In the logic that would determine the "override" value of the rate, I've incorporated something like these groovy code snippets that return a value, or if they return null, then the default value is used. E.g.
class GroovyRunner {
static final GroovyClassLoader classLoader = new GroovyClassLoader()
static final String GROOVY_CODE = MyDatabase().loadCustomCode()
static final String GROOVY_CLASS = MyDatabase().loadCustomClassName()
static final String TEMPDIR = System.getProperty("java.io.tmpdir")
double getOverrideRate(Object inParameters) {
def file = new File(TEMPDIR+GROOVY_CLASS+".groovy")
BufferedWriter bw = new BufferedWriter(new FileWriter(file))
bw.write(GROOVY_CODE)
bw.close()
Class gvy = classLoader.parseClass(file)
GroovyObject obj = (GroovyObject) gvy.getDeclaredConstructor().newInstance()
return Double.valueOf(obj.invokeMethod("getRate",inParameters)
}
}
And then, in the user-created custom groovy code:
class RateInterceptor {
def getRate(Object inParameters) {
def businessEntity = (SomeClass) inParameters
return businessEntity.getDiscount() == .5 ? .5 : null
}
}
The problem with this is that these "custom code" bits in GROOVY_CODE above, are pulled from a database during runtime, and contain an intricate groovy class. Since this method will be called numerous times in succession, it is impractical to create a new File object each time it is run.
Whether I use GroovyScriptEngine, or the GroovyClassLoader, these both involve the need of a java.io.File object. This makes the code execute extremely slowly, as the File will have to be created after the custom groovy code is retrieved from the database. Is there any way to run groovy code that can return a value without creating a temporary file to execute it?
The straight solution for your case would be using GroovyClassLoader.parseClass​(String text)
http://docs.groovy-lang.org/latest/html/api/groovy/lang/GroovyClassLoader.html#parseClass(java.lang.String)
The class caching should not be a problem because you are creating each time a new GroovyClassLoader
However think about using groovy scripts instead of classes
your rate interceptor code could be like this:
def businessEntity = (SomeClass) context
return businessEntity.getDiscount() == .5 ? .5 : null
or even like this:
context.getDiscount() == .5 ? .5 : null
in script you could declare functions, inner classes, etc
so, if you need the following script will work also:
class RateInterceptor {
def getRate(SomeClass businessEntity) {
return businessEntity.getDiscount() == .5 ? .5 : null
}
}
return new RateInterceptor().getRate(context)
The java code to execute those kind of scripts:
import groovy.lang.*;
...
GroovyShell gs = new GroovyShell();
Script script = gs.parse(GROOVY_CODE);
// bind variables
Binding binding = new Binding();
binding.setVariable("context", inParams);
script.setBinding(binding);
// run script
Object ret = script.run();
Note that parsing of groovy code (class or script) is a heavy operation. And if you need to speedup your code think about caching of parsed class into some in-memory cache or even into a map
Map<String, Class<groovy.lang.Script>>
Straight-forward would be also:
GroovyShell groovyShell = new GroovyShell()
Closure groovy = { String name, String code ->
String script = "{ Map params -> $code }"
groovyShell.evaluate( script, name ) as Closure
}
def closure = groovy( 'SomeName', 'params.someVal.toFloat() * 2' )
def res = closure someVal:21
assert 42.0f == res

Using trained TensorFlow model in Java

I have trained a TensorFlow model in Python and would like to use it in Java code. Training the model is done via something like this code:
def input_fn():
features = {'a': tf.constant([[1],[2]]),
'b': tf.constant([[3],[4]]) }
labels = tf.constant([0, 1])
return features, labels
feature_a = tf.contrib.layers.sparse_column_with_integerized_feature("a", bucket_size=10)
feature_b = tf.contrib.layers.sparse_column_with_integerized_feature("b", bucket_size=10)
feature_columns = [feature_a, feature_b]
model = tf.contrib.learn.LinearClassifier(feature_columns=feature_columns)
model.fit(input_fn=input_fn, steps=10)
Now I want to save this model to use it in Java. It seems that export_savedmodel is the new/preferred way of saving, so I tried:
feature_spec = tf.contrib.layers.create_feature_spec_for_parsing(feature_columns)
serving_input_fn = input_fn_utils.build_parsing_serving_input_fn(feature_spec)
model.export_savedmodel('export', serving_input_fn, as_text=True)
This results in a saved model, which can be loaded from Java with
model = SavedModelBundle.load(dir, "serve");
model.session().runner()
.feed("input_example_tensor", input)
.fetch("linear/binary_logistic_head/predictions/probabilities")
.run();
There is now a problem though: the input_example_tensor should be a Tensor containing Strings/byte[]s, but this is not supported in Java yet (see: Tensor.java#88 "throw new UnsupportedOperationException"). As far as I understand it, the reason that it wants a String is that build_parsing_serving_input_fn wants to parse serialized Example protocol buffers.
Maybe a different serving_input_fn would do better. input_fn_utils.build_default_serving_input_fn looks promising, but I didn't get that to work.
If I call it like:
features_dict = {'a':feature_a, 'b':feature_b}
serving_input_fn = input_fn_utils.build_default_serving_input_fn(features)
I get "AttributeError: '_SparseColumnIntegerized' object has no attribute 'get_shape'"
If I call it like:
features = {'a': tf.constant([[1],[2]]),
'b': tf.constant([[3],[4]]) }
serving_input_fn = input_fn_utils.build_default_serving_input_fn(features)
I get "ValueError: 'Const:0' is not a valid scope name".
What is the proper way to use input_fn_utils.build_default_serving_input_fn? I can't find any example that uses it.

Array of associate arrays in JAVA

$postfields["pricing[1][annually]"] = "50.00";
$postfields["pricing[1][monthly]"] = "50.00";
$postfields["pricing[2][monthly]"] = "8.00";
$postfields["pricing[2][annually]"] = "80.00";
I want something similar to the above variable in java. I am not talking about creating a class with required variables.
I have used List<Map<String,String>> pricing = new ArrayList<Map<String,String>>();
but that doesn't seem to work with WHMCS api.
I debugged and came across this value on the back-end
"pricing" -> "[{monthly=5.00, annually=50.00}]"
That is how it is done in the api:
http://docs.whmcs.com/API:Add_Product
Do we have anything similar in java that can cater this issue?
I am integrating a billing solution with WHMCS using their api.
You can use a simple 2D double array. You'll create some int constants for annually = 0, monthly = 1
double pricing = new double[][]{50.0, 50.0};
and so on

Different types of Groovy execution in Java

I have 3 questions regarding using Groovy in java. They are all related together so I'm only creating one question here.
1) There are: GroovyClassLoader, GroovyShell, GroovyScriptEngine. But what is the difference between using them?
For example for this code:
static void runWithGroovyShell() throws Exception {
new GroovyShell().parse(new File("test.groovy")).invokeMethod("hello_world", null);
}
static void runWithGroovyClassLoader() throws Exception {
Class scriptClass = new GroovyClassLoader().parseClass(new File("test.groovy"));
Object scriptInstance = scriptClass.newInstance();
scriptClass.getDeclaredMethod("hello_world", new Class[]{}).invoke(scriptInstance, new Object[]{});
}
static void runWithGroovyScriptEngine() throws Exception {
Class scriptClass = new GroovyScriptEngine(".").loadScriptByName("test.groovy");
Object scriptInstance = scriptClass.newInstance();
scriptClass.getDeclaredMethod("hello_world", new Class[]{}).invoke(scriptInstance, new Object[]{});
}
2) What is the best way to load groovy script so it remains in memory in compiled form, and then I can call a function in that script when I need to.
3) How do I expose my java methods/classes to groovy script so that it can call them when needed?
Methods 2 and 3 both return the parsed class in return. So you can use a map to keep them in memory once they are parsed and successfully loaded.
Class scriptClass = new GroovyClassLoader().parseClass(new File("test.groovy"));
map.put("test.groovy",scriptClass);
UPDATE:
GroovyObject link to the groovy object docs.
Also this is possible to cast the object directly as GroovyObject and other java classes are indistinguishable.
Object aScript = clazz.newInstance();
MyInterface myObject = (MyInterface) aScript;
myObject.interfaceMethod();
//now here you can also cache the object if you want to
Cannot comment on efficiency. But I guess if you keep the loaded classes in memory, one time parsing would not hurt much.
UPDATE For efficiency:
You should use GroovyScriptEngine, it uses script caching internally.
Here is the link: Groovy Script Engine
Otherwise you can always test it using some performance benchmarks yourself and you would get rough idea. For example: Compiling groovy scripts with all three methods in three different loops and see which performs better. Try using same and different scripts, to see if caching kicks in, in some way.
UPDATE FOR PASSING PARAMS TO AND FROM SCRIPT
Binding class will help you sending the params to and from the script.
Example Link
// setup binding
def binding = new Binding()
binding.a = 1
binding.setVariable('b', 2)
binding.c = 3
println binding.variables
// setup to capture standard out
def content = new StringWriter()
binding.out = new PrintWriter(content)
// evaluate the script
def ret = new GroovyShell(binding).evaluate('''
def c = 9
println 'a='+a
println 'b='+b
println 'c='+c
retVal = a+b+c
a=3
b=2
c=1
''')
// validate the values
assert binding.a == 3
assert binding.getVariable('b') == 2
assert binding.c == 3 // binding does NOT apply to def'd variable
assert binding.retVal == 12 // local def of c applied NOT the binding!
println 'retVal='+binding.retVal
println binding.variables
println content.toString()

How to merge two sets of weka Instances together

Currently, I'm copying one instance at a time from one dataset to the other. Is there a way to do this so that string mappings remain intact? The mergeInstances works horizontally, is there an equivalent vertical merge?
This is one step of a loop I use to read datasets of the same structure from multiple arff files into one large dataset. There has got to be a simpler way.
Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
for (int i = 0; i < iNew.numInstances(); i++) {
Instance nInst = iNew.instance(i);
inst.add(nInst);
}
If you want a totally fully automated method that also copy properly string and nominal attributes, you can use the following function:
public static Instances merge(Instances data1, Instances data2)
throws Exception
{
// Check where are the string attributes
int asize = data1.numAttributes();
boolean strings_pos[] = new boolean[asize];
for(int i=0; i<asize; i++)
{
Attribute att = data1.attribute(i);
strings_pos[i] = ((att.type() == Attribute.STRING) ||
(att.type() == Attribute.NOMINAL));
}
// Create a new dataset
Instances dest = new Instances(data1);
dest.setRelationName(data1.relationName() + "+" + data2.relationName());
DataSource source = new DataSource(data2);
Instances instances = source.getStructure();
Instance instance = null;
while (source.hasMoreElements(instances)) {
instance = source.nextElement(instances);
dest.add(instance);
// Copy string attributes
for(int i=0; i<asize; i++) {
if(strings_pos[i]) {
dest.instance(dest.numInstances()-1)
.setValue(i,instance.stringValue(i));
}
}
}
return dest;
}
Please note that the following conditions should hold (there are not checked in the function):
Datasets must have the same attributes structure (number of attributes, type of attributes)
Class index has to be the same
Nominal values have to exactly correspond
To modify on the fly the values of the nominal attributes of data2 to match the ones of data1, you can use:
data2.renameAttributeValue(
data2.attribute("att_name_in_data2"),
"att_value_in_data2",
"att_value_in_data1");
Why not make a new ARFF file which has the data from both of the originals? A simple
cat 1.arff > tmp.arff
tail -n+20 2.arff >> tmp.arff
where 20 is replaced by however many lines long your arff header is. This would then produce a new arff file with all of the desired instances, and you could read this new file with your existing code:
Instances iNew = new ConverterUtils.DataSource(name).getDataSet();
You could also invoke weka on the command line using this documentation: http://old.nabble.com/how-to-merge-two-data-file-a.arff-and-b.arff-into-one-data-list--td22890856.html
java weka.core.Instances append filename1 filename2 > output-file
However, there is no function in the documentation http://weka.sourceforge.net/doc.dev/weka/core/Instances.html#main%28java.lang.String which will allow you to append multiple arff files natively within your java code. As of Weka 3.7.6, the code that appends two arff files is this:
// read two files, append them and print result to stdout
else if ((args.length == 3) && (args[0].toLowerCase().equals("append"))) {
DataSource source1 = new DataSource(args[1]);
DataSource source2 = new DataSource(args[2]);
String msg = source1.getStructure().equalHeadersMsg(source2.getStructure());
if (msg != null)
throw new Exception("The two datasets have different headers:\n" + msg);
Instances structure = source1.getStructure();
System.out.println(source1.getStructure());
while (source1.hasMoreElements(structure))
System.out.println(source1.nextElement(structure));
structure = source2.getStructure();
while (source2.hasMoreElements(structure))
System.out.println(source2.nextElement(structure));
}
Thus it looks like Weka itself simply iterates through all of the instances in a data set and prints them, the same process your code uses.
Another possible solution is to use addAll from java.util.AbstractCollection, since Instances implement it.
instances1.addAll(instances2);
I've just shared an extended weka.core.Instaces class with methods like innerJoin, leftJoin, fullJoin, update and union.
table1.makeIndex(table1.attribute("Continent_ID");
table2.makeIndex(table2.attribute("Continent_ID");
Instances result = table1.leftJoin(table2);
Instances can have different number of attributes, levels of NOMINAL and STRING variables are merged together if neccesary.
Sources and some examples are here on GitHub: weka.join.

Categories