How to get predicted values from JavaDecisionTreeRegressionExample.java of Spark MLlib? - java

I would like to get the predicted values from JavaDecisionTreeRegressionExample.java, not only the description of the decision tree and metrics such as MAE and RMSE. Does anyone know how to do it, or which method I can use to get the predicted values?
I have tried many of the methods provided by the RegressionEvaluator and DecisionTreeRegressionModel classes, but I still don't know how to get the predicted values. So, if anyone knows how to do it, please show me. Thank you very much!
The following is the source code of JavaDecisionTreeRegressionExample.java
package org.apache.spark.examples.ml;
// $example on$
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.ml.regression.DecisionTreeRegressionModel;
import org.apache.spark.ml.regression.DecisionTreeRegressor;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// $example off$
public class JavaDecisionTreeRegressionExample {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("JavaDecisionTreeRegressionExample")
.getOrCreate();
// $example on$
// Load the data stored in LIBSVM format as a DataFrame.
Dataset<Row> data = spark.read().format("libsvm")
.load("data/mllib/sample_libsvm_data.txt");
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data);
// Split the data into training and test sets (30% held out for testing).
Dataset<Row>[] splits = data.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
// Train a DecisionTree model.
DecisionTreeRegressor dt = new DecisionTreeRegressor()
.setFeaturesCol("indexedFeatures");
// Chain indexer and tree in a Pipeline.
Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{featureIndexer, dt});
// Train model. This also runs the indexer.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
// Select example rows to display.
predictions.select("label", "features").show(5);
// Select (prediction, true label) and compute test error.
RegressionEvaluator evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root Mean Squared Error (RMSE) on test data = " + rmse);
DecisionTreeRegressionModel treeModel =
(DecisionTreeRegressionModel) (model.stages()[1]);
System.out.println("Learned regression tree model:\n" + treeModel.toDebugString());
// $example off$
spark.stop();
}
}

I solved my problem. Modify predictions.select("label", "features").show(5); to predictions.select("prediction", "label", "features").show(5); Then you can see the predicted values.
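If you need the predicted values in your own code rather than just printed by show(), you can collect the prediction column from the predictions DataFrame. A minimal sketch (not part of the original example), placed right after model.transform(testData):
// Collect (prediction, label) pairs from the transformed test data.
java.util.List<Row> rows = predictions.select("prediction", "label").collectAsList();
for (Row row : rows) {
  double predicted = row.getDouble(0); // value predicted by the decision tree
  double actual = row.getDouble(1);    // true label from the test set
  System.out.println("predicted = " + predicted + ", label = " + actual);
}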

Related

Iterating over RelationalGroupedDataset to find average and count of each key in Java

I have a Dataset<Row> which is built by reading a CSV file. I want to group by one of the fields in the CSV, then merge all the records with the same name and do some further computation over the merged Dataset.
My input CSV file looks like this
name,math_marks,science_marks
Ajay,10,20
Ram,15,25
Sita,18,30
Ajay,20,30
Sita,12,10
Sita,20,20
Ram,25,45
I want the final output to be something like this
name,math_avg,science_avg,count_of_records
Ajay,15,25,2
Ram,20,35,2
Sita,25,20,3
My initial code in Java is below:
import lombok.extern.slf4j.Slf4j;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.List;
import java.util.stream.Collectors;
@Slf4j
public class ReadCSVFiles {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName(ReadCSVFiles.class.getName()).setMaster("local");
// create Spark Context
SparkContext context = new SparkContext(conf);
// create spark Session
SparkSession sparkSession = new SparkSession(context);
context.setLogLevel("INFO");
Dataset<Row> df = sparkSession.read()
.format("csv")
.option("header", true)
.option("inferSchema", true)
.load("/Users/ajaychoudhary/Downloads/marksInputFile.csv");
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
System.out.println("========== Print name of dataframe ==============");
df.select("name").show();
RelationalGroupedDataset relationalGroupedDataset = df.groupBy("name");
List<String> relationalGroupedDatasetRows = relationalGroupedDataset.count().collectAsList().stream()
.map(a -> a.mkString("::")).collect(Collectors.toList());
log.info("relationalGroupedDatasetRows is = {} ", relationalGroupedDatasetRows);
}
}
Right now I am receiving this output, which gives the count of records for each user, but I am unable to find the average of the marks.
relationalGroupedDatasetRows is = [Ram::2, Ajay::2, Sita::3]
Also, I need to understand whether the above approach of using groupBy is fine, or whether there is a better alternative to achieve this.
I don't know much about this, but you are using the count method, which "counts the number of rows for each group". Instead, try the avg method, which "returns the average for each group".
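For example, a sketch of the aggregation in Java (column names taken from the sample CSV above; note the static imports for the SQL functions):
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.count;

Dataset<Row> result = df.groupBy("name")
    .agg(avg("math_marks").as("math_avg"),
         avg("science_marks").as("science_avg"),
         count("name").as("count_of_records"));
result.show();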

How to use QueryParser for Lucene range queries (IntPoint/LongPoint)

One thing I really like about Lucene is the query language, where I (or an application user) can write dynamic queries. I parse these queries via
QueryParser parser = new QueryParser("", indexWriter.getAnalyzer());
Query query = parser.parse("id:1 OR id:3");
But this does not work for range queries like these:
Query query = parser.parse("value:[100 TO 202]"); // Returns nothing
Query query = parser.parse("id:1 OR value:167"); // Returns only document with ID 1 and not 1
On the other hand, it works via the API (but then I give up the convenient way of just using the query string as input):
Query query = LongPoint.newRangeQuery("value", 100L, 202L); // Returns 1, 2 and 3
Is this a bug in the query parser, or am I missing an important point, such as QueryParser comparing the lexical rather than the numerical value? How can I change this while still parsing the query string instead of using the query API?
This question is a follow-up to this question, which pointed out the problem but not the reason: Lucene LongPoint Range search doesn't work
Full code:
package acme.prod;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
public class LuceneRangeExample {
public static void main(String[] arguments) throws Exception {
// Create the index
Directory searchDirectoryIndex = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(searchDirectoryIndex, new IndexWriterConfig(new StandardAnalyzer()));
// Add several documents that have an ID and a value
List<Long> values = Arrays.asList(23L, 145L, 167L, 201L, 20100L);
int counter = 0;
for (Long value : values) {
Document document = new Document();
document.add(new StringField("id", Integer.toString(counter), Field.Store.YES));
document.add(new LongPoint("value", value));
document.add(new StoredField("value", Long.toString(value)));
indexWriter.addDocument(document);
indexWriter.commit();
counter++;
}
// Create the reader and search for the range 100 to 200
IndexReader indexReader = DirectoryReader.open(indexWriter);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser parser = new QueryParser("", indexWriter.getAnalyzer());
// Query query = parser.parse("id:1 OR value:167");
// Query query = parser.parse("value:[100 TO 202]");
Query query = LongPoint.newRangeQuery("value", 100L, 202L);
TopDocs hits = indexSearcher.search(query, 100);
for (int i = 0; i < hits.scoreDocs.length; i++) {
int docid = hits.scoreDocs[i].doc;
Document document = indexSearcher.doc(docid);
System.out.println("ID: " + document.get("id") + " with range value " + document.get("value"));
}
}
}
I think there are a few different things to note here:
1. Using the classic parser
As you show in your question, the classic parser supports range searches, as documented here. But the key point to note in the documentation is:
Sorting is done lexicographically.
That is to say, it uses text-based sorting to determine whether a field's values are within the range or not.
However, your field is a LongPoint field (again, as you show in your code). This field stores your data as an array of longs, as shown in the constructor.
This is not lexicographical data - and even when you only have one value, it's not handled as string data.
I assume that this is why the following queries do not work as expected - but I am not 100% sure of this, because I did not find any documentation confirming this:
Query query = parser.parse("id:1 OR value:167");
Query query = parser.parse("value:[100 TO 202]");
(I am slightly surprised that these queries do not throw errors).
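One quick way to see this for yourself (a small check I added, not from the original post, reusing the parser from the code above) is to inspect what the classic parser actually builds for the range syntax; as far as I know it produces a text-based TermRangeQuery, which matches the lexicographic behaviour described above:
Query parsed = parser.parse("value:[100 TO 202]");
System.out.println(parsed.getClass().getSimpleName()); // expected: TermRangeQuery
System.out.println(parsed);                            // prints the textual range, e.g. value:[100 TO 202]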
2. Using a LongPoint Query
As you have also shown, you can use one of the specialized LongPoint queries to get the results you expect - in your case, you used LongPoint.newRangeQuery("value", 100L, 202L);.
But as you also note, you lose the benefits of the classic parser syntax.
3. Using the Standard Query Parser
This may be a good approach which allows you to continue using your preferred syntax, while also supporting number-based range searches.
The StandardQueryParser is a newer alternative to the classic parser, but it uses the same syntax as the classic parser by default.
This parser lets you configure a "points config map", which tells the parser which fields to handle as numeric data, for operations such as range searches.
For example:
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
import org.apache.lucene.queryparser.flexible.standard.config.PointsConfig;
import java.text.DecimalFormat;
import java.util.Map;
import java.util.HashMap;
...
StandardQueryParser parser = new StandardQueryParser();
parser.setAnalyzer(indexWriter.getAnalyzer());
// Here I am just using the default decimal format - but you can provide
// a specific format string, as needed:
PointsConfig pointsConfig = new PointsConfig(new DecimalFormat(), Long.class);
Map<String, PointsConfig> pointsConfigMap = new HashMap<>();
pointsConfigMap.put("value", pointsConfig);
parser.setPointsConfigMap(pointsConfigMap);
Query query1 = parser.parse("value:[101 TO 203]", "");
Running your index searcher code with the above query gives the following output:
ID: 1 with range value 145
ID: 2 with range value 167
ID: 3 with range value 201
Note that this correctly excludes the 20100L value (which would be included if the query was using lexical sorting).
I don't know of any way to get the same results using only the classic query parser - but at least this is using the same query syntax that you would prefer to use.

Group by column and write each group of strings to text file using Apache Spark and Java

I have a .csv file with the columns id and a couple of string columns. I want to group by id and then write all of the values from string_column1 to a text file (each value on a new row). Finally, I want the name of the text file to be "allstrings"+id.
I'm using Apache Spark with Java.
I've tried to use groupBy("id").agg(collect_list("string_column1")) but I get "The method collect_list(String) is undefined for the type Main".
I don't know how to name the text files using the distinct values from the id column.
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.RelationalGroupedDataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class Main {
public static void main(String[] args) {
Logger.getLogger("org.apache").setLevel(Level.WARN);
SparkSession spark = SparkSession.builder()
.appName("testingSql")
.master("local[*]")
.getOrCreate();
Dataset<Row> dataset = spark.read()
.option("header", true)
.csv("src/main/resources/maininput.csv");
// make a separate .csv file for each group of strings (grouped by id),
// with each string on a new line
// and the name of the file should be "allstrings"+id
RelationalGroupedDataset result = dataset.groupBy("id")
.agg(collect_list("string_column1"))
.?????????;
spark.close();
}
}
You can partition the data on write; this will create a separate directory for each group id,
and each directory will be named in the format column_name=value.
df.write.partitionBy("id").csv("output_directory")
Then you can use the org.apache.hadoop.fs API to rename the files from each group directory.
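In Java the same idea looks roughly like the sketch below (the directory name is a placeholder). As a side note, the collect_list compile error from the question usually just means the static import is missing: import static org.apache.spark.sql.functions.collect_list;
import org.apache.spark.sql.SaveMode;

// One sub-directory per id (named id=<value>); each part file inside contains
// the string_column1 values for that id, one per line.
dataset.select("id", "string_column1")
    .write()
    .partitionBy("id")
    .mode(SaveMode.Overwrite)
    .csv("output_directory");
// The id=<value> directories can afterwards be renamed to "allstrings"+id
// using the org.apache.hadoop.fs.FileSystem API (FileSystem.rename).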

Apache Spark MultilayerPerceptronClassifier setting features

I am trying to do a multi-class classification using org.apache.spark.ml.classification.MultilayerPerceptronClassifier. Given below is the code I used. I have 262 features, and I have to give the feature columns to the MultilayerPerceptronClassifier. Can someone explain a way to give the features to the MultilayerPerceptronClassifier?
I can use the setFeaturesCol() method to give features, but that is infeasible because it lets me add only one feature at a time, and I have 262 features.
import org.apache.commons.lang3.ArrayUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.sql.DataFrame;
public class NN {
final static String RESPONSE_VARIABLE = "Activity";
public static void main(String args[]){
// Load training data
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("test-client").setMaster("local[2]");
sparkConf.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(javaSparkContext);
// Convert data in csv format to Spark data frame
DataFrame trainDataFrame = sqlContext.read().format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.load("/home/thamali/Desktop/Project/csv/libsvm/train.csv");
DataFrame testDataFrame = sqlContext.read().format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("header", "true")
.load("/home/thamali/Desktop/Project/csv/libsvm/train.csv");
String [] predictors = trainDataFrame.columns();
predictors = ArrayUtils.removeElement(predictors, RESPONSE_VARIABLE);
// specify layers for the neural network:
// input layer of size 262 (features), two intermediate layers of size 50 and 40,
// and output of size 12 (classes)
int[] layers = new int[] {262, 50, 40, 12};
// create the trainer and set its parameters
MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100);
// train the model
MultilayerPerceptronClassificationModel model = trainer.fit(trainDataFrame);
// compute accuracy on the test set
DataFrame result = model.transform(testDataFrame);
DataFrame predictionAndLabels = result.select("prediction", "label");
MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
.setMetricName("accuracy");
System.out.println("Accuracy = " + evaluator.evaluate(predictionAndLabels));
}
}
We can use Apache Spark's VectorAssembler to create a single vector column containing all the necessary features.
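A sketch of that idea in Java, reusing the predictors array and layers from the question (exact API details may vary between Spark versions, and a StringIndexer may additionally be needed if Activity is not already a 0-based numeric label):
import org.apache.spark.ml.feature.VectorAssembler;

// Combine the 262 predictor columns into a single vector column named "features".
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(predictors)
    .setOutputCol("features");
DataFrame assembledTrain = assembler.transform(trainDataFrame);
DataFrame assembledTest = assembler.transform(testDataFrame);

MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
    .setLayers(layers)
    .setBlockSize(128)
    .setSeed(1234L)
    .setMaxIter(100)
    .setFeaturesCol("features")
    .setLabelCol(RESPONSE_VARIABLE);
MultilayerPerceptronClassificationModel model = trainer.fit(assembledTrain);
DataFrame result = model.transform(assembledTest);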

Apache Spark - datediff for dataframes?

I'm trying to compute a column based on a date difference. Is there a corresponding function to datediff that can be used on a column/dataframe? For example:
Column newCol = old.col("one").divide(old.col("max").minus(old.col("min")));
But in this case, the minus function doesn't work, because the min and max columns contain dates. So I need something like datediff for Columns. Is there such a thing?
Thank you!
There is, and it is called datediff (org.apache.spark.sql.functions.datediff):
public static Column datediff(Column end, Column start)
Returns the number of days from start to end. (Since 1.5.0)
Example:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.DataFrame;
public class App {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext= new SQLContext(sc);
DataFrame df = sqlContext.sql(
"SELECT CAST('2012-01-01' AS DATE), CAST('2013-08-02' AS DATE)").toDF("first", "second");
df.select(datediff(df.col("first"), df.col("second"))).show();
}
}
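Applied to the expression from the question, this would look roughly as follows (the column names "one", "max" and "min" are the question's own; note the argument order is datediff(end, start)):
import org.apache.spark.sql.Column;
import static org.apache.spark.sql.functions.datediff;

Column daysBetween = datediff(old.col("max"), old.col("min")); // days from min to max
Column result = old.col("one").divide(daysBetween);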
