How to get results from WEKA - java

I understand how to use the WEKA APIs: I first load the ARFF file into the program, which creates Instances. These are then given to a Classifier that has been trained on this dataset. Now I want to give it a new test dataset without labels and have the WEKA API tell me what the label for each instance is or may be. How is that done?

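Use Classifier.classifyInstance(Instance). It returns the prediction as a double; for a nominal class this is the index of the predicted label, which data.classAttribute().value((int) prediction) turns back into the label string (data being your Instances object).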
http://weka.sourceforge.net/doc/weka/classifiers/Classifier.html

Your training and test instances should look exactly the same.
feature value 1, feature value 2......., feature value n, class value
feature value 1, feature value 2......., feature value n, class value
When you apply your model to your test set, Weka does not show the model the class values of the test instances. Rather, it asks, "hey, classifier, let me see how you assign classes to each of the test instances, based on what you learned from the training set". The classifier then assigns each test instance a class from what it learned. Weka compares these predictions with the actual class values and reports the result in terms of precision, recall, F-score, ROC, AUC, errors, etc. So, in summary, your test instances must still have the class attribute. Don't exclude it. Otherwise, you will get an error like "training and test sets are incompatible". If the true label is unknown, keep the attribute and mark its value as missing (? in ARFF).

Related

DMN - matching a Java enum by a FEEL expression

I have a Java enum as an input in a DMN decision table. The DMN call is embedded directly in the Java app. So take some enum:
public enum Foo {
ONE, TWO
}
I pass an instance of this enum as an input - dmnContext.set("Foo", foo);
I hoped to be able to set a decision table input for foo of type string, and have a rule that matched "ONE". However, this doesn't work, because there is no POJO-String conversion. In the Java code, I could store foo as a String and validate it against the enumerated values (i.e. check foo is in the set ["ONE", "TWO"]), but this will complicate other parts of the application.
How can I achieve this while still using an enum type?
Please refer to the comment section of this existing JIRA record, which explains:
why you are experiencing that behaviour
and why you should convert your Java enum to the expected DMN type (which I guess is a FEEL string) instead of passing the enum itself
You can use Jackson to achieve this, instead of resorting to custom code or DMN model modification.
Don't hesitate to subscribe to the JIRA linked above, as we hope to make this work out of the box. It is not trivial, though: the DMN RTF is considering introducing enumerations directly in DMN eventually, so we need to take into account today what might happen tomorrow.
Since you are linking to Red Hat Product documentation, a reminder that you are strongly encouraged to open a Customer Portal ticket at https://access.redhat.com/support/cases/#/ if you have a Subscription.
I would appreciate your feedback after following these references, and I hope they help.
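As a minimal sketch of the conversion recommended above, the enum can be replaced by its name before it reaches the decision table. Note the assumptions: this uses plain Enum.name() rather than the Jackson route mentioned in the answer, a Map<String, Object> stands in for the DMN context, and the helper toFeelInputs is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

enum Foo { ONE, TWO }

class FeelInputs {
    // Hypothetical helper: copies the inputs, replacing every enum value by its
    // name so that a decision table input typed as string can match "ONE".
    static Map<String, Object> toFeelInputs(Map<String, Object> inputs) {
        Map<String, Object> converted = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : inputs.entrySet()) {
            Object v = e.getValue();
            converted.put(e.getKey(), v instanceof Enum ? ((Enum<?>) v).name() : v);
        }
        return converted;
    }
}
```

Each converted entry would then be handed over with dmnContext.set(key, value) as in the question, so the rest of the application can keep using the enum type.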

Perform Linear Regression on data (from .arff file) - JAVA, Weka

I want to perform linear regression on a collection of data using Java. I have a couple of questions:
What data types does the linear regression method accept?
I have tried to load the data in purely nominal format as well as numeric, but when I try to pass that 'data' (an Instances variable created in the program) to LinearRegression, it throws this exception: Cannot handle Multi-Valued nominal class
How can I print the linear regression output to the console in Java? I have been unable to write the code to do so. After going through the predefined LinearRegression.java class, I learned that buildClassifier() is the method that takes 'data' as input, but I am unable to move forward from there. Can anyone help me understand the sequence of steps needed to get output on the console?
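import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

protected static void useLinearRegression() throws Exception {
    // Backslashes in a Java string literal must be escaped
    BufferedReader reader = new BufferedReader(new FileReader("c:\\somePath\\healthCare.arff"));
    Instances data = new Instances(reader);
    reader.close();
    data.setClassIndex(data.numAttributes() - 1); // last attribute is the class
    LinearRegression2 rl = new LinearRegression2();
    rl.buildClassifier(data); // What after this? or before
}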
Linear regression accepts both nominal and numeric attributes. It is only the class attribute (the target) that cannot be nominal; it has to be numeric, which is why you get the exception above.
The model's toString() method prints the fitted model (other classifier options may also be required depending on your needs). If you are also after predictions and summaries, you need an Evaluation object as well. There, toSummaryString() gives further statistics about the generated model; toMatrixString() prints a confusion matrix, so it only applies when the class is nominal.
Hope this helps!

Convert a string into a variable in java

I am building a DAQ on a Java-based platform called KMax. This platform has a design interface for using objects like histograms. Each histogram has a name, which is declared in the design interface.
To call the histogram in the code you have to use
hist = tlsh.getKmaxHist("DATA");
The string DATA is the name that the user gives in the design interface, and hist is the variable that refers to the object. Every histogram object has certain methods it can use. For instance, hist.getSum() gives the total sum of the histogram.
In my DAQ I have many histograms. My plan is to create a slider box that picks the histogram to which the user wants to apply some function (such as getSum()). The slider box has a method (String getProperty("VALUE")) that returns the value the user has selected.
The plan is to use something like sliderBox.getProperty("VALUE").getSum(). Of course something like that is not valid, therefore I was wondering if there is a way to "convert" the string that the getProperty() returns, into a variable already defined in the code.
Sounds like a Map will do what you need. You can put the histograms in a Map keyed by whatever the property value is.
Map<String,Histogram> histograms = new HashMap<String,Histogram>();
histograms.put("PropertyValue1", histogram1);
histograms.put("PropertyValue2", histogram2);
String desiredHistogram = sliderBox.getProperty("VALUE");
Histogram histogramToUse = histograms.get(desiredHistogram);
histogramToUse.getSum(); // do whatever you need to with this
You'll want to check for nulls and all that stuff too.
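The null check mentioned above could look like the following minimal sketch. The Histogram class here is a hypothetical stand-in for the KMax histogram type (only getSum() is modelled), and HistogramRegistry, register, and sumOf are illustrative names rather than part of KMax.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the KMax histogram type; only getSum() is modelled.
class Histogram {
    private final double[] bins;

    Histogram(double... bins) { this.bins = bins; }

    double getSum() {
        double sum = 0;
        for (double b : bins) sum += b;
        return sum;
    }
}

class HistogramRegistry {
    private final Map<String, Histogram> histograms = new HashMap<>();

    void register(String name, Histogram h) { histograms.put(name, h); }

    // Returns the sum of the named histogram, or fails loudly when the
    // selected name was never registered (Map.get would return null).
    double sumOf(String selectedName) {
        Histogram h = histograms.get(selectedName);
        if (h == null) {
            throw new IllegalArgumentException("No histogram named " + selectedName);
        }
        return h.getSum();
    }
}
```

In the DAQ, the registered objects would come from tlsh.getKmaxHist(...) and the selected name from sliderBox.getProperty("VALUE").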
It looks to me like you need a Map<String, Histogram>. Variable names are lost when java code gets compiled.
You can use the *BeanInfo class mechanism. For instance, for a class Hist one can write a HistBeanInfo with a "sum" property. Though these classes were intended for GUI builders, with components on palettes listing heterogeneous properties, one can use them independently.
The BeanInfo classes might be generated.
This is still a long way from actually using that information, which would probably involve reflection.
An alternative to BeanInfo would be using home-brew annotations, but with BeanInfo you have an API supported by some IDEs.
Store it in a map:
yourHistogramsMap.get(sliderBox.getProperty("VALUE")).getSum();
Of course, you have to store your histograms there first.

Structural design pattern

I'm working with three separate classes: Group, Segment and Field. Each group is a collection of one or more segments, and each segment is a collection of one or more fields. There are different types of fields that subclass the Field base class. There are also different types of segments that are all subclasses of the Segment base class. The subclasses define the types of fields expected in the segment. In any segment, some of the fields defined must have values inputted, while some can be left out. I'm not sure where to store this metadata (whether a given field in a segment is optional or mandatory.)
What is the cleanest way to store this metadata?
I'm not sure you are giving enough information about the complete application to get the best answer. However here are some possible approaches:
Define an isValid() method in your base class, which by default returns true. In your subclasses, you can code specific logic for each Segment or FieldType to return false if any requirements are missing. If you want to report an error message to say which fields are missing, you could add a List argument to the isValid method to allow each type to report the list of missing values.
Use Annotations (as AlexR said above).
The benefit of the two approaches above is that the metadata lives in the code, tied directly to the objects that require it. The disadvantage is that if you want to change the required fields, you will need to update the code and deploy a new build.
If you need something that can be changed on the fly, then Gangus's suggestion of XML is a good start, because your application could reload the XML definition at run time and produce different validation results.
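The first approach above, an isValid method that reports missing values through a List argument, might be sketched as follows. Everything beyond the isValid idea is an assumption for illustration: fields are modelled as name/value strings, and HeaderSegment with its field names is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical base class: by default a segment has no mandatory fields.
abstract class Segment {
    protected final Map<String, String> values = new LinkedHashMap<>();

    void set(String fieldName, String value) { values.put(fieldName, value); }

    // Names of the fields that must have a value; subclasses override this.
    protected List<String> mandatoryFields() { return new ArrayList<>(); }

    // Returns true when nothing is missing; names of missing mandatory
    // fields are appended to the 'missing' list for error reporting.
    boolean isValid(List<String> missing) {
        boolean ok = true;
        for (String name : mandatoryFields()) {
            String v = values.get(name);
            if (v == null || v.isEmpty()) {
                missing.add(name);
                ok = false;
            }
        }
        return ok;
    }
}

// Hypothetical segment type with two mandatory fields; any other field is optional.
class HeaderSegment extends Segment {
    @Override
    protected List<String> mandatoryFields() {
        return Arrays.asList("id", "date");
    }
}
```

A Group could then call isValid on each of its segments and merge the reported names into a single error message.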
I think the best place for such data is a plain XML file, and the best structure for working with it is an XML DOM with XPath. Working with classes would be too complicated.
Since Java 5, this kind of metadata can be stored using annotations. Define your own annotation @MandatoryField and mark all mandatory fields with it. Then you can inspect the object field by field using reflection, check whether any uninitialized field is mandatory, and throw an exception in that case.
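A minimal sketch of the annotation approach described above, assuming that "uninitialized" means a null reference. NameSegment and MandatoryChecker are hypothetical names, and the checker returns the missing field names rather than throwing, so the caller can decide how to report them.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Marker annotation; RUNTIME retention is needed so reflection can see it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface MandatoryField {}

// Hypothetical segment with one mandatory and one optional field.
class NameSegment {
    @MandatoryField String lastName;
    String middleName;
}

class MandatoryChecker {
    // Collects the names of all @MandatoryField fields that are still null.
    static List<String> missingFields(Object target) {
        List<String> missing = new ArrayList<>();
        for (Field f : target.getClass().getDeclaredFields()) {
            if (!f.isAnnotationPresent(MandatoryField.class)) {
                continue;
            }
            f.setAccessible(true);
            try {
                if (f.get(target) == null) {
                    missing.add(f.getName());
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + f.getName(), e);
            }
        }
        return missing;
    }
}
```

Note that getDeclaredFields() only sees fields declared on the object's own class, so a real Segment hierarchy would also need to walk the superclasses.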

Persisting data suited for enums

Most projects have some sort of data that are essentially static between releases and well-suited for use as an enum, like statuses, transaction types, error codes, etc. For example's sake, I'll just use a common status enum:
public enum Status {
ACTIVE(10, "Active"),
EXPIRED(11, "Expired");
/* other statuses... */
/* constructors, getters, etc. */
}
I'd like to know what others do in terms of persistence regarding data like these. I see a few options, each of which has some obvious advantages and disadvantages:
Persist the possible statuses in a status table and keep all of the possible status domain objects cached for use throughout the application
Only use an enum and don't persist the list of available statuses, creating a data consistency holy war between me and my DBA
Persist the statuses and maintain an enum in the code, but don't tie them together, creating duplicated data
My preference is the second option, although my DBA claims that our end users might want to access the raw data to generate reports, and not persisting the statuses would lead to an incomplete data model (counter-argument: this could be solved with documentation).
Is there a convention that most people use here? What are people's experiences with each, and are there other alternatives?
Edit:
After thinking about it for a while, my real persistence struggle comes with handling the id values that are tied to the statuses in the database. These values would be inserted as default data when installing the application. At this point they'd have ids that are usable as foreign keys in other tables. I feel like my code needs to know about these ids so that I can easily retrieve the status objects and assign them to other objects. What do I do about this? I could add another field, like "code", to look stuff up by, or just look up statuses by name, which is icky.
We store enum values using some explicit string or character value in the database. Then to go from database value back to enum we write a static method on the enum class to iterate and find the right one.
If you expect a lot of enum values, you could create a static mapping HashMap<String,MyEnum> to translate quickly.
Don't store the actual enum name (e.g. "ACTIVE" in your example), because a developer can easily rename it in a refactoring and silently break the mapping.
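The approach above, an explicit stored value plus a static lookup map, could be sketched like this. AccountStatus, the single-letter codes, and the method names toDb and fromDb are illustrative assumptions, not taken from the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum; the single-letter codes are illustrative only.
enum AccountStatus {
    ACTIVE("A"),
    EXPIRED("E");

    // Built once when the enum class is initialised, after all constants exist.
    private static final Map<String, AccountStatus> BY_CODE = new HashMap<>();
    static {
        for (AccountStatus s : values()) {
            BY_CODE.put(s.dbCode, s);
        }
    }

    private final String dbCode;

    AccountStatus(String dbCode) { this.dbCode = dbCode; }

    // Value written to the database; independent of the constant's name and order.
    String toDb() { return dbCode; }

    // Maps a stored value back to the enum, rejecting unknown codes.
    static AccountStatus fromDb(String code) {
        AccountStatus s = BY_CODE.get(code);
        if (s == null) {
            throw new IllegalArgumentException("Unknown status code: " + code);
        }
        return s;
    }
}
```

Because the stored code is a separate field, renaming or reordering the constants leaves existing rows readable.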
I'm using a blend of the three approaches you have documented...
Use the database as the authoritative source for the Enum values. Store the values in a 'code' table of some sort. Each time you build, generate a class file for the Enum to be included in your project.
This way, if the enum changes value in the database, your code will be properly invalidated and you will receive appropriate compile errors from your Continuous Integration server. You have a strongly typed binding to your enumerated values in the database, and you don't have to worry about manually syncing the values between code and the data.
Joshua Bloch gives an excellent explanation of enums and how to use them in his book "Effective Java, Second Edition" (p.147)
There you can find all sorts of tricks how to define your enums, persist them and how to quickly map them between the database and your code (p.154).
During a talk at Jazoon 2007, Bloch gave the following reasons to use an extra attribute to map enums to DB fields and back: An enum is a constant but code isn't. To make sure that a developer editing the source can't accidentally break the DB mapping by reordering the enums or renaming them, you should add a specific attribute (like "dbName") to the enum and use that to map it.
Enums have an intrinsic id (the ordinal(), which is used in the switch() statement), but this id changes when you change the order of elements (for example by sorting them or by adding elements in the middle).
So the best solution is to add a toDB() and a fromDB() method and an additional field. I suggest using short, readable strings for this new field, so you can decode a database dump without having to look up the enums.
While I am not familiar with the idea of "attributes" in Java (and I don't know what language you're using), I've generally used the idea of a code table (or domain specific tables) and I've attributed my enum values with more specific data, such as human readable strings (for instance, if my enum value is NewStudent, I would attribute it with "New Student" as a display value). I then use Reflection to examine the data in the database and insert or update records in order to bring them in line with my code, using the actual enum value as the key ID.
What I have done on several occasions is to define the enum in the code and a storage representation in the persistence layer (DB, file, etc.) and then have conversion methods to map them to each other. These conversion methods need only be used when reading from or writing to the persistent store, and the application can use the type-safe enums everywhere else. In the conversion methods I used switch statements to do the mapping. This also makes it possible to throw an exception if a new or unknown state is to be converted (usually because either the app or the data is newer than the other and new or additional states have been declared).
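A minimal sketch of such switch-based conversion methods, assuming an integer storage representation. PaymentState, PaymentStateConverter, and the numeric codes are hypothetical and only serve to illustrate the pattern.

```java
// Hypothetical application-side enum.
enum PaymentState { OPEN, SETTLED }

class PaymentStateConverter {
    // Enum -> stored value. The numbers are illustrative storage codes.
    static int toStored(PaymentState state) {
        switch (state) {
            case OPEN:    return 10;
            case SETTLED: return 11;
            default:
                // Reached when a constant is added without updating this mapping.
                throw new IllegalArgumentException("No storage code for " + state);
        }
    }

    // Stored value -> enum; an unknown code usually means app and data versions differ.
    static PaymentState fromStored(int stored) {
        switch (stored) {
            case 10: return PaymentState.OPEN;
            case 11: return PaymentState.SETTLED;
            default:
                throw new IllegalArgumentException("Unknown stored state: " + stored);
        }
    }
}
```

Keeping the mapping outside the enum means only the persistence layer knows the stored codes.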
If there's at least a minor chance that the list of values will need to be updated, then it's option 1. Otherwise, it's option 3.
Well we don't have a DBA to answer to, so our preference is for option 2).
We simply save the Enum value into the database, and when we are loading data out of the database and into our Domain Objects, we just cast the integer value to the enum type.
This avoids any of the synchronisation headaches with options 1) and 3). The list is defined once - in the code.
However, we have a policy that nobody else accesses the database directly; they must come through our web services to access any data. So this is why it works well for us.
In your database, the primary key of this "domain" table doesn't have to be a number. Just use a varchar PK and a description column (for the purposes your DBA is concerned with). If you need to guarantee the ordering of your values without relying on alphabetical sort, just add a numeric column named "order" or "sequence".
In your code, create a static class with constants whose name (camel-cased or not) maps to the description and value maps to the pk. If you need more than this, create a class with the necessary structure and comparison operators and use instances of it as the value of the constants.
If you do this a lot, build a script to generate the instantiation / declaration code.
