Calculating TDIST using Apache Commons library - java

I'm trying to calculate a two-tailed Student's t-distribution using commons-math. I'm using Excel to compare values and validate whether my results are correct.
So, using Excel to calculate TDIST(x, df, t) with x = 5.968191467, df = 8 and tails t = 2:
=TDIST(ABS(5.968191467),8,2)
I get the result: 0.000335084
Using Commons Math like so:
TDistribution tDistribution = new TDistribution(8);
System.out.println(BigDecimal.valueOf(tDistribution.density(5.968191467)));
I get the result: 0.00018738010608336254
What should I be using to get exactly the same result as the TDIST value?

density(x) gives the value of the probability density function at x, not a tail probability. To replicate your Excel formula you need the cumulative distribution function (CDF):
2*(1.0 - tDistribution.cumulativeProbability(5.968191467))

The right formula for a general x is:
2*(1.0 - tDistribution.cumulativeProbability(Math.abs(x)))
(thanks to ryuichiro). Do not forget to take the absolute value, because the two-tailed TDIST in Excel is a symmetric function, that is,
TDIST(-x,df,2) = TDIST(x,df,2)
and the formula without Math.abs(x) would not work for negative x's. Check also the docs.
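Putting it together, a minimal runnable sketch (assuming commons-math3 and the numbers from the question):
import java.math.BigDecimal;
import org.apache.commons.math3.distribution.TDistribution;

public class TwoTailedTDist {
    public static void main(String[] args) {
        double x = 5.968191467;
        TDistribution tDistribution = new TDistribution(8);   // df = 8
        // two-tailed p-value, the equivalent of Excel's TDIST(ABS(x), 8, 2)
        double p = 2.0 * (1.0 - tDistribution.cumulativeProbability(Math.abs(x)));
        System.out.println(BigDecimal.valueOf(p));   // prints roughly 0.000335084
    }
}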


Get unique words from Spark Dataset in Java

I'm using Apache Spark 2 to tokenize some text:
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
It returns an array of strings:
Dataset<Row> words = regexTokenized.select("words");
Sample data looks like this:
+--------------------+
| words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
Now I want to get all the unique words. I tried a couple of filter, flatMap, map and reduce functions, but I couldn't figure it out because I'm new to Spark.
Based on @Haroun Mohammedi's answer, I was able to figure it out in Java:
Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();
I'm coming from Scala, but I do believe that there's a similar way in Java.
I think in this case you have to use the explode method in order to transform your data into a Dataset of words.
This code should give you the desired results:
import org.apache.spark.sql.functions.{col, explode}

val dsWords = regexTokenized.select(explode(col("words")))
val dsUniqueWords = dsWords.distinct()
For information about the explode method, please refer to the official documentation.
Hope it helps.
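For a fully self-contained Java version, here is a hedged sketch of the whole pipeline; the input column name "text" and the tokenizer pattern are assumptions, since the question only shows the transform step:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// `data` is a Dataset<Row> with a string column named "text" (assumed)
RegexTokenizer regexTokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words")
        .setPattern("\\W");                 // split on non-word characters
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
// explode() emits one row per array element; distinct() keeps each word once
Dataset<Row> uniqueWords = regexTokenized
        .select(explode(col("words")).as("word"))
        .distinct();
uniqueWords.show();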

How to use MongoDB java-driver Projections.slice

I am trying to use Aggregates.project to slice the array in my documents.
My documents look like:
{
    "date": "",
    "stype_0": [1, 2, 3, 4]
}
and my code in Java is:
Aggregates.project(Projections.fields(
        Projections.slice("stype_0", pst - 1, pen - pst),
        Projections.slice("stype_1", pst - 1, pen - pst),
        Projections.slice("stype_2", pst - 1, pen - pst),
        Projections.slice("stype_3", pst - 1, pen - pst)))
Finally I get the error:
First argument to $slice must be an array, but is of type: int
I guess that is because the first element in stype_0 is an int, but I really do not know why. Thanks a lot!
Slice has two versions: $slice (aggregation) and $slice (projection). You are using the wrong one.
The Java driver has no builder for the aggregation $slice, so you have to build the document yourself. Below is an example for one such projection; do the same for all the other projection fields.
List<Object> stype_0 = Arrays.asList("$stype_0", 1, 1);
Bson project = Aggregates.project(Projections.fields(
        new Document("stype_0", new Document("$slice", stype_0))));
AggregateIterable<Document> iterable = dbCollection.aggregate(Arrays.asList(project));
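Putting it together for all four fields, a hedged, self-contained sketch; it assumes dbCollection is a MongoCollection<Document> and that pst and pen hold the 1-based start and end positions, as in the question:
import static java.util.Arrays.asList;

import com.mongodb.client.AggregateIterable;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.conversions.Bson;

int pst = 1, pen = 3;   // hypothetical bounds
Document sliced = new Document();
for (String field : asList("stype_0", "stype_1", "stype_2", "stype_3")) {
    // aggregation form: { $slice: [ "$field", <skip>, <count> ] }
    sliced.append(field, new Document("$slice", asList("$" + field, pst - 1, pen - pst)));
}
Bson project = new Document("$project", sliced);
AggregateIterable<Document> iterable = dbCollection.aggregate(asList(project));
for (Document doc : iterable) {
    System.out.println(doc.toJson());
}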

How to train Matrix Factorization Model in Apache Spark MLlib's ALS Using Training, Test and Validation datasets

I want to implement Apache Spark's ALS machine learning algorithm. I found that the best model should be chosen to get the best results. I have split the training data into three sets (training, validation and test), as suggested on forums.
I've found the following code sample to train the model on these sets:
val ranks = List(8, 12)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}
val testRmse = computeRmse(bestModel.get, test, numTest)
This code trains a model for each combination of rank and lambda and compares the RMSE (root mean squared error) against the validation set. These iterations yield the best model, which we can say is represented by a (rank, lambda) pair. But it doesn't do much after that with the test set; it just computes the RMSE on the test set.
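(computeRmse itself is not shown in the snippet. As a rough illustration, here is a hedged sketch of what such a helper typically does, written against MLlib's Java API since the surrounding threads use Java; the parameter names just follow the call sites above:)
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;

public static double computeRmse(MatrixFactorizationModel model, JavaRDD<Rating> data, long n) {
    // Predict a rating for every (user, product) pair in the evaluation set
    JavaRDD<Rating> predictions = model.predict(
            data.mapToPair(r -> new Tuple2<>(r.user(), r.product())));
    // Key both RDDs by (user, product) and join predicted with actual ratings
    JavaPairRDD<Tuple2<Integer, Integer>, Tuple2<Double, Double>> joined = predictions
            .mapToPair(r -> new Tuple2<>(new Tuple2<>(r.user(), r.product()), r.rating()))
            .join(data.mapToPair(r -> new Tuple2<>(new Tuple2<>(r.user(), r.product()), r.rating())));
    // RMSE = sqrt( sum((predicted - actual)^2) / n )
    double sumSquaredError = joined.values()
            .mapToDouble(p -> (p._1() - p._2()) * (p._1() - p._2()))
            .sum();
    return Math.sqrt(sumSquaredError / n);
}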
My question is: how can it be tuned further with the test set data?
No, one would never fine-tune the model using the test data. If you do that, it stops being your test data.
I'd recommend this section of Prof. Andrew Ng's famous course that discusses the model training process: https://www.coursera.org/learn/machine-learning/home/week/6
Depending on your observation of the error values with the validation data set, you might want to add or remove features, get more data, make changes in the model, or maybe even try a different algorithm altogether. If the cross-validation and the test RMSE look reasonable, then you are done with the model and you can use it for the purpose (some prediction, I would assume) that made you build it in the first place.

JavaPlot timestamps not working

I have a file like
1429520881 15.0
1429520882 3.0
1429520883 340.0
and I try to use it in JavaPlot:
JavaPlot plot = new JavaPlot();
GenericDataSet dataset = new GenericDataSet();
// ... filling dataset with data ...
plot.set("xdata", "time");
plot.set("timefmt", "'%s'");
plot.set("format x", "'%H:%M:%S'");
plot.plot();
As a result, gnuplot's window doesn't appear. But if I use this file directly in gnuplot with the same data and options, it shows time on the x-axis. If in JavaPlot I delete the last settings (xdata, timefmt, format x), it works but shows only numbers.
I also tried to fill the dataset manually in the program, but got the same result.
I also implemented a new DataSet with the date as a String, but the xdata time option still doesn't seem to work.
It took forever to figure this one out. I found that if you have a DataSetPlot object you can set the 'using' option:
DataSetPlot dataSet = new DataSetPlot( values );
dataSet.set( "using", "1:2" );
This will then make use of the 'using' option for the plot command, e.g.:
plot '-' using 1:2 title 'Success' with lines linetype rgb 'green'
You have to have the 'using' option when making use of time for the x axis, otherwise you will see this error:
Need full using spec for x time data
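Putting the time settings and the 'using' option together, a minimal hedged sketch (the data rows are the epoch-second samples from the question):
import com.panayotis.gnuplot.JavaPlot;
import com.panayotis.gnuplot.plot.DataSetPlot;

double[][] values = {
        {1429520881, 15.0},
        {1429520882, 3.0},
        {1429520883, 340.0}
};
JavaPlot plot = new JavaPlot();
plot.set("xdata", "time");
plot.set("timefmt", "'%s'");
plot.set("format x", "'%H:%M:%S'");
DataSetPlot dataSet = new DataSetPlot(values);
dataSet.set("using", "1:2");   // required: gnuplot needs a full using spec for x time data
plot.addPlot(dataSet);
plot.plot();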
JavaPlot generates the temp script file (with the data inline) in a weird order because ParametersHolder inherits from HashMap, and there should be a 'using' clause after '-'. I wrote a LinkedParams class extending GNUPlotParameters, with an inner LinkedMap, and overrode the methods to use the inner structure. With that fix the generated script looks, for example, like this:
set ... (xrange, yrange, etc.)
set xdata time
set timefmt '%s'
set format x '%H:%M:%S'
plot '-' using 1:2 title 'ololo' with linespoints linetype 2 linewidth 3
1429520881 15.0
1429520882 3.0
1429520883 340.0
e
quit
but before it was:
set xdata time
set ... (xrange, yrange, etc.)
set format x '%H:%M:%S'
set timefmt '%s'
plot '-' title 'ololo' with linespoints linetype 2 linewidth 3
1429520881 15.0
1429520882 3.0
1429520883 340.0
e
quit

What is this error and how do I prevent it? "The bucket expression values are not comparable and no comparator specified"

I'm using JasperReports with DynamicReports and I want to build a crosstab report. So far I have figured out that this error happens when I add numeric columns to rowGroups or columnGroups. This is what I get; I don't know why, and I don't know how to solve it.
The error is:
The bucket expression values are not comparable and no comparator specified
My code is:
CrosstabValues crosstabValues = report.getCrosstab().getCrosstabValues();
Collection<CrosstabRowGroupBuilder> rowGroup = generateRowGroup(crosstabValues);
Collection<CrosstabColumnGroupBuilder> columnGroup = generateColumnGroup(crosstabValues);
Collection<CrosstabMeasureBuilder> measures = generateMeasures(crosstabValues);
CrosstabBuilder crosstab = ctab.crosstab();
for (CrosstabRowGroupBuilder row : rowGroup)
    crosstab.addRowGroup(row);
for (CrosstabColumnGroupBuilder columnGroupBuilder : columnGroup)
    crosstab.addColumnGroup(columnGroupBuilder);
for (CrosstabMeasureBuilder measure : measures)
    crosstab.addMeasure(measure);
crosstab.headerCell(cmp.text(crosstabValues.getHeader())
        .setStyle(getCrosstabHeaderCellStyle(report.getTemplate().getReportTemplateValues())));
The problem was the class I was giving to this method:
CrosstabRowGroupBuilder cTabRow = ctab.rowGroup(column.getName(), getColumnTypeClass(column));
I was using the Number class for all numeric data. The funny thing is that it worked for measures, but it did not work for rowGroup or columnGroup; that is why I got confused.
Now, with Integer.class or Long.class, it works fine.
The crosstab must know in which order to display the row and column headers, and into which cell of the crosstab to put each measure. This is possible only if the crosstab is able to compare the rowGroup (and columnGroup) values.
Classes used in rowGroup and columnGroup must implement the Comparable interface.
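For illustration, a hedged DynamicReports sketch where the group value classes implement Comparable; the field names "year", "state" and "quantity" are made up:
import static net.sf.dynamicreports.report.builder.DynamicReports.ctab;

import net.sf.dynamicreports.report.builder.crosstab.CrosstabBuilder;
import net.sf.dynamicreports.report.builder.crosstab.CrosstabColumnGroupBuilder;
import net.sf.dynamicreports.report.builder.crosstab.CrosstabMeasureBuilder;
import net.sf.dynamicreports.report.builder.crosstab.CrosstabRowGroupBuilder;
import net.sf.dynamicreports.report.constant.Calculation;

// Integer and String implement Comparable, so the buckets can be ordered
CrosstabRowGroupBuilder<Integer> rowGroup = ctab.rowGroup("year", Integer.class);
CrosstabColumnGroupBuilder<String> columnGroup = ctab.columnGroup("state", String.class);
CrosstabMeasureBuilder<Integer> measure = ctab.measure("quantity", Integer.class, Calculation.SUM);
CrosstabBuilder crosstab = ctab.crosstab()
        .rowGroups(rowGroup)
        .columnGroups(columnGroup)
        .measures(measure);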
