Perform Linear Regression on data (from .arff file) - Java, Weka

I want to perform Linear Regression on a collection of data using Java. I have a couple of questions:
What data types does the LinearRegression classifier accept?
I have tried loading the data in pure nominal format as well as numeric, but when I pass that 'data' (an Instances variable created in the program) to LinearRegression it throws this exception: Cannot handle multi-valued nominal class!
How can I print the Linear Regression output to the console in Java? I'm unable to produce the code to do so. After going through the predefined LinearRegression.java class, I learned that buildClassifier() is the method that takes the 'data' as input, but I'm unable to move forward from there. Can anyone help me understand the sequence of steps to follow to get the output to the console?
protected static void useLinearRegression() throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader("c:\\somePath\\healthCare.arff"));
    Instances data = new Instances(reader);
    reader.close();
    data.setClassIndex(data.numAttributes() - 1);
    LinearRegression rl = new LinearRegression();
    rl.buildClassifier(data); // What after this? or before?
}

LinearRegression accepts both nominal and numeric attributes; it is only the target class that cannot be a nominal data type.
The model's toString() method should be able to spit out the model (other classifier options may also be required depending on your needs), but if you are also after the predictions and summaries, you may also need an Evaluation object. There, you could use toSummaryString() or toMatrixString() to obtain some other statistics about the model that was generated.
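For example, a minimal sketch along those lines (the file path and the 10-fold cross-validation are illustrative; it assumes the last attribute is the numeric target):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;

public class LinearRegressionDemo {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader("c:\\somePath\\healthCare.arff"));
        Instances data = new Instances(reader);
        reader.close();
        data.setClassIndex(data.numAttributes() - 1); // numeric target must be the class

        LinearRegression model = new LinearRegression();
        model.buildClassifier(data);
        System.out.println(model); // toString() prints the fitted regression equation

        // Optional: 10-fold cross-validation for summary statistics
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}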
Hope this helps!

Related

Feeding a Spark Dataset as a Java Reader in CSV format

I am trying to implement my own org.apache.spark.ml.Transformer, and I need to pass the contents of my org.apache.spark.sql.Dataset in CSV format to my Java library, which accepts a java.io.Reader. I am struggling here because these really seem to be two different worlds. Ideally I don't want to create a String out of it; I would want to stream it. At this specific step the data shouldn't be larger than about a gigabyte, though, so I guess I could make do with a String solution if it is absolutely needed.
In an attempt to get a string I tried something like:
class Bla (override val uid: String) extends Transformer {
[...]
def transform(df: Dataset[_]): DataFrame = {
df.rdd.map(x=>x.mkString(","))
[...]
But I get several errors:
value mkString is not a member of _$1
polymorphic expression cannot be instantiated to expected type; found :
[U]org.apache.spark.rdd.RDD[U]
required: org.apache.spark.sql.DataFrame (which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
So any suggestions?
Edit: I have made a little outline of what I need to do at https://github.com/jonalv/spark_java_problem
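As a hedged sketch of the String-based fallback mentioned in the question (written in Java against the public Dataset API; note that it materializes all rows on the driver and does no CSV quoting or escaping of values):

import java.io.Reader;
import java.io.StringReader;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DatasetCsvReader {
    // Collects the Dataset on the driver and exposes it as a Reader over CSV lines.
    public static Reader toCsvReader(Dataset<Row> df) {
        String csv = df.collectAsList().stream()
                .map(DatasetCsvReader::toCsvLine)
                .collect(Collectors.joining("\n"));
        return new StringReader(csv);
    }

    private static String toCsvLine(Row row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(row.get(i)); // naive: values are not quoted or escaped
        }
        return sb.toString();
    }
}

Incidentally, the Scala compile errors stem from Dataset[_] having an unknown element type; calling df.toDF() first yields a Dataset[Row], and Row does have mkString(",").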

'distinct' MongoDB function in a Java program

I am pretty new to MongoDB, and I have a simple question regarding a problem I can't solve in my Java program (3.0.2 client version). My aim is to perform a distinct on the "cars" test database, and I am trying this code:
DistinctIterable<Object> classification = collection.distinct("classification", null);
I can't figure out what I should put in the second parameter. Could you help me please?
If you are using the legacy DBCollection Java API, I think that you can pass just the first argument, which is the field on which to do the distinct. The second parameter would be the query to filter on, and it can be omitted according to the documentation:
public List distinct(String fieldName)
Find the distinct values for a specified field across a collection and returns the results in an array.
Parameters:
fieldName - Specifies the field for which to return the distinct values.
Returns:
a List of the distinct values
With the newer MongoCollection API you are calling, you need to provide the class to map the results to; see http://api.mongodb.org/java/current/com/mongodb/client/MongoCollection.html
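A minimal sketch of that call (assuming the classification field holds strings; database and collection names are taken from the question):

import com.mongodb.MongoClient;
import com.mongodb.client.DistinctIterable;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DistinctDemo {
    public static void main(String[] args) {
        MongoClient client = new MongoClient(); // connects to localhost:27017
        MongoCollection<Document> collection =
                client.getDatabase("test").getCollection("cars");

        // With the 3.x API the second argument is the Java type to map results to
        DistinctIterable<String> classification =
                collection.distinct("classification", String.class);
        for (String value : classification) {
            System.out.println(value);
        }
        client.close();
    }
}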

How to get results from WEKA

I understand how to use the WEKA API: I first load the ARFF into the program, which creates Instances. These are then given to a Classifier that has been trained on this dataset. Now I want to give it a new test dataset without a label and have the WEKA API tell me what the label for each instance is, or may be. How is that done?
You use Classifier.classifyInstance(Instance)
http://weka.sourceforge.net/doc/weka/classifiers/Classifier.html
Your training and test instances should look exactly the same.
feature value 1, feature value 2......., feature value n, class value
feature value 1, feature value 2......., feature value n, class value
When you apply your model to your test set, Weka will not provide your model with the class values of the instances. Rather it asks: "hey, classifier, let me see how you assign classes to each of the test instances, based on what you learned from the training set". The classifier model then assigns each test instance a class from what it learned. Weka then compares the predictions and provides results in terms of precision, recall, f-score, ROC, AUC, errors, etc. So, in summary, your test instances will still have the class values; don't exclude that column, otherwise you will get an error like "training and test sets are incompatible".
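For genuinely unlabeled data you can keep the class column but fill it with '?'. A minimal sketch of scoring such a test set (file names and the J48 classifier are illustrative; any trained classifier works):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class PredictDemo {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("train.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        Instances test = new Instances(new BufferedReader(new FileReader("test.arff")));
        test.setClassIndex(test.numAttributes() - 1); // class column present, values may be '?'

        Classifier classifier = new J48();
        classifier.buildClassifier(train);

        for (int i = 0; i < test.numInstances(); i++) {
            double prediction = classifier.classifyInstance(test.instance(i));
            String label = test.classAttribute().value((int) prediction);
            System.out.println("Instance " + i + " -> " + label);
        }
    }
}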

Is it possible to send an object to a file in a format that is human readable?

I am working on a project for my advanced Java class, and the assignment says he wants us to send an object to a file, which is easy enough, but that he also wants the file to be human-readable and editable. I sent him an e-mail 3 days ago and he hasn't responded, so I am kind of stuck between a rock and a hard place since the project is due in 3 days.
Would any of you clever programmers be able to fill me in on the secret that I am apparently being left out of? How do you send an object to a file that reads like English?
I want to have the ability to both read and write a to-do item to a file. I see our application looking like this:
When it first starts, the program asks the user if there is a file containing to-do items. If so, the user will name the file, and the program will read it in and continue.
When the user decides to exit, the program will prompt the user to ask if the to-do items should be saved to a file. If so, the user will name the file and the program will write them out in such a fashion that they can be read in again.
I want these files to be human-readable (and editable). No binary data. No counting. My advice to you would be to have a method somewhere that looks like:
public ToDoItem getToDoItem(FileInputStream fis) {
// ...
}
and
public void writeToDoItem(FileOutputStream fos) {
// ...
}
Think of your serialization model. The ObjectOutputStream might write bytes, but is there another way you could represent the object and write it through some other output stream that writes human-readable text?
This is going to depend on the type of object you have. You will have to tailor it to a particular type of data.
For example, if you have an object with the fields
String title;
List<Integer> ids;
then you could save it as JSON:
{
  "title": "aaaa",
  "ids": [1, 2, 3, 4, 5]
}
which is equivalent, but much more readable than a binary ObjectOutputStream.
Again, this won't work for all kinds of data.
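A minimal sketch of that idea using Gson (the library choice is an assumption; Jackson or any other JSON mapper would do equally well):

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Arrays;
import java.util.List;

class Item {
    String title;
    List<Integer> ids;
}

public class JsonDemo {
    public static void main(String[] args) throws Exception {
        Gson gson = new GsonBuilder().setPrettyPrinting().create();

        Item item = new Item();
        item.title = "aaaa";
        item.ids = Arrays.asList(1, 2, 3, 4, 5);

        try (FileWriter fw = new FileWriter("item.json")) {
            gson.toJson(item, fw); // writes human-readable, editable text
        }
        try (FileReader fr = new FileReader("item.json")) {
            Item back = gson.fromJson(fr, Item.class); // reads it back in
            System.out.println(back.title + " " + back.ids);
        }
    }
}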
There is an XML-based bean serialization, too, which also works with almost all data, but I would not call that human-readable.
Think how you would represent an object on paper in such a way that it could be reconstructed unambiguously. You'd probably list the class name, then you'd list each field name and its current value. If the field was a primitive, the value would be just the primitive value. If it was a reference type, you'd represent the object recursively using this procedure. If it was an array, you'd list each element value.
There are various standard ways of formatting such a representation (XML and JSON to name a couple). The key is to make it a text-only representation so it is human-readable.
You can have a try with JAXB (Java Architecture for XML Binding).
It can write a JAXB-styled object to an XML file, but you should define an XML Schema file first.
For more: http://jaxb.java.net/tutorial/
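A minimal annotation-based sketch (an assumption here: generating the classes from an XML Schema, as the answer suggests, would produce something similar):

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
class ToDo {
    public String title;
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        ToDo todo = new ToDo();
        todo.title = "buy milk";

        Marshaller m = JAXBContext.newInstance(ToDo.class).createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE); // pretty-print
        m.marshal(todo, new File("todo.xml")); // human-readable, editable XML
    }
}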
The human-readable format that you desire could be XML or JSON. My answer to "How to create object tree from xsd in Java?" might help by giving you pointers to the approach you can follow to achieve what you want.

Performance-effective way to transform XML data represented as Writable

I'm working on a utility method that converts XML data into a formatted String, and before you conclude it's a trivial task for javax.xml.transform.Transformer, let me explain the specific constraints I'm facing.
The input data does not exist at the moment the conversion starts. It is actually represented as a groovy.lang.Writable (javadoc) instance that I can output into any java.io.Writer instance. The signature of the method looks like this:
static String serializeToString(Writable source)
My current solution involves a few steps and does produce the expected result:
Create StringWriter, output source there and convert to String
Create javax.xml.transform.stream.StreamSource instance based on this string (using StringReader)
Create new StringWriter instance and wrap it into javax.xml.transform.stream.StreamResult
Perform transformation using instance of javax.xml.transform.Transformer
Convert StringWriter to String
While the solution does work, I'm not pleased with its efficiency. This method will be used really often, and I want to optimize it. What I'd like to avoid is the need to perform multiple conversions along the way:
From Writable to String (unformatted)
From String to StreamSource (which means the data will be parsed again)
From StreamSource to String again (formatted)
So the question is whether it's possible to build a pipe-like flow which eliminates the unnecessary conversions.
UPDATE #1:
To give a little bit more context: I'm converting a GPathResult instance to a formatted string using the StreamingMarkupBuilder.bindNode() method, which produces a Writable instance. Unfortunately there is no way to tell StreamingMarkupBuilder to produce formatted output.
UPDATE #2:
I did experiment with an implementation based on PipedWriter + PipedReader, but the experiments didn't show much of a speed gain from this approach. It looks like this is not that critical an issue in this case.
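For reference, a hedged sketch of that PipedWriter/PipedReader variant (assuming the Writable can be produced on a second thread while the Transformer consumes the pipe; error handling is simplified):

import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import groovy.lang.Writable;

public class XmlFormat {
    static String serializeToString(Writable source) throws Exception {
        PipedWriter pw = new PipedWriter();
        PipedReader pr = new PipedReader(pw);

        // Producer thread feeds the pipe while the transformer parses it,
        // so the unformatted intermediate String is never materialized.
        Thread producer = new Thread(() -> {
            try {
                source.writeTo(pw);
                pw.close();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(pr), new StreamResult(out));
        producer.join();
        return out.toString();
    }
}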
I don't know what you mean exactly by "XML data", but you could think of representing the "yet-to-be" stuff as a SAXSource directly, thereby bypassing the "to-string" and "parse-string" steps.
