I am using the Apache Commons Math library to calculate the p-value with ChiSquareTest.
I use the method chiSquareTest(double[] expected, long[] observed), but the values I get back don't make sense to me. So I tried several online chi-square calculators to find out what this function calculates.
An example:
Group 1: {25,25}
Group 2: {30,20}
(Taken from Wikipedia, German Chi Square Test article)
P-values from
http://www.quantpsy.org/chisq/chisq.htm and
http://vassarstats.net/newcs.html:
Without Yates correction: P = 0.3149 and 0.31490284
With Yates correction: P = 0.42154642 and 0.4201
Apache Commons: 0.1489146731787664
Code:
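ChiSquareTest tester = new ChiSquareTest();
long[] b = {25, 25};    // passed as the 'observed' counts
double[] a = {30, 20};  // passed as the 'expected' counts
double p = tester.chiSquareTest(a, b);  // p-value reported above: 0.1489146731787664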
Another thing I do not understand is the need to have a long and a double array. Why not two long arrays?
There are two functions in the lib:
chiSquareTest(double[] expected, long[] observed)
chiSquareTest(long[][] values)
The first one (which I used in the question above) performs a goodness-of-fit test. But I expected the result of the second one, the test of independence.
The answer was given to me on the Apache Commons user mailing list; I will add a link to the archive once it is available. It is also described in the Javadoc.
Update:
Mailinglist Archive
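To make the difference between the two tests concrete, here is a minimal, library-free sketch that computes both chi-square statistics for the example data. The class and method names are hypothetical and this is not the Commons Math implementation; it only illustrates why the two overloads give different results for the same numbers.

```java
class ChiSquareStatistics {
    /** Goodness-of-fit statistic: sum of (observed - expected)^2 / expected. */
    static double goodnessOfFit(double[] expected, long[] observed) {
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double d = observed[i] - expected[i];
            sum += d * d / expected[i];
        }
        return sum;
    }

    /** Independence statistic: expected counts are derived from row and column totals. */
    static double independence(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowSum[r] += counts[r][c];
                colSum[c] += counts[r][c];
                total += counts[r][c];
            }
        }
        double sum = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double e = rowSum[r] * colSum[c] / total; // expected count under independence
                double d = counts[r][c] - e;
                sum += d * d / e;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Group 2 treated as the expected distribution, Group 1 as observed
        System.out.println(goodnessOfFit(new double[] {30, 20}, new long[] {25, 25}));
        // Both groups treated as rows of a 2x2 contingency table
        System.out.println(independence(new long[][] {{25, 25}, {30, 20}}));
    }
}
```

With one degree of freedom, the first statistic (about 2.08) corresponds to the p-value of roughly 0.149 that Apache Commons returned, and the second (about 1.01) corresponds to the 0.3149 from the online calculators.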
I would like to stop training a network once I see the error calculated from the validation set starts to increase. I'm using a BasicNetwork with RPROP as the training algorithm, and I have the following training iteration:
double validationError = 999.999;
while (!stop) {
    train.iteration(); // weights are updated here
    System.out.println("Epoch #" + epoch + " Error : " + train.getError());
    // I'm just comparing to see if the error on the validation set increased or not
    if (network.calculateError(validationSet) < validationError)
        validationError = network.calculateError(validationSet);
    else
        // once the error increases I stop the training.
        stop = true;
    System.out.println("Epoch #" + epoch + "Validation Error" + network.calculateError(validationSet));
    epoch++;
}
train.finishTraining();
Obviously this isn't working, because the weights have already been changed before I find out whether I need to stop training. Is there any way I can take a step back and use the old weights?
I also see the EarlyStoppingStrategy class, which is probably what I need; it would be registered with the addStrategy() method. However, I really don't understand why the EarlyStoppingStrategy constructor takes both the validation set and the test set. I thought it would only need the validation set, and that the test set shouldn't be used at all until I test the output of the network.
Encog's EarlyStoppingStrategy class implements an early stopping strategy according to this paper:
Proben1 | A Set of Neural Network Benchmark Problems and Benchmarking Rules
(a full cite is included in the Javadoc)
If you just want to stop as soon as the validation set's error no longer improves, you may prefer the Encog SimpleEarlyStoppingStrategy, found here:
org.encog.ml.train.strategy.end.SimpleEarlyStoppingStrategy
Note that SimpleEarlyStoppingStrategy requires Encog 3.3.
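For the "step back to the old weights" part of the question, the general idea can be sketched independently of Encog: keep a copy of the best weights seen so far and restore it when the validation error stops improving. The Trainable interface below is a hypothetical stand-in, not an Encog type, so treat this as an illustration of the logic rather than as working Encog code.

```java
class EarlyStoppingSketch {
    /** Hypothetical stand-in for a trainable model; not part of Encog. */
    interface Trainable {
        void trainOneEpoch();
        double validationError();
        double[] getWeights();
        void setWeights(double[] weights);
    }

    /**
     * Trains until the validation error fails to improve, then restores the
     * weights from the best epoch. Returns the number of that best epoch.
     */
    static int trainWithRollback(Trainable model, int maxEpochs) {
        double bestError = Double.POSITIVE_INFINITY;
        double[] bestWeights = model.getWeights().clone();
        int bestEpoch = 0;
        for (int epoch = 1; epoch <= maxEpochs; epoch++) {
            model.trainOneEpoch();
            double error = model.validationError(); // evaluate once per epoch
            if (error < bestError) {
                bestError = error;
                bestWeights = model.getWeights().clone(); // snapshot the improved weights
                bestEpoch = epoch;
            } else {
                model.setWeights(bestWeights); // step back to the last good weights
                break;
            }
        }
        return bestEpoch;
    }
}
```

Stopping at the first non-improving epoch is the simplest rule; validation error is often noisy, so tolerating a few bad epochs before giving up is a common variation.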
For my App I need compact code for converting between LatLon (WGS84) and MGRS.
JCoord.jar:
Looks great, but the version 1.1 jar is 0.5 MB in size. That doubles the size of my app just for performing a two-way conversion of coordinates.
Openmap:
Isolating just the MGRSPoint.java (https://code.google.com/p/openmap/source/browse/src/openmap/com/bbn/openmap/proj/coords/MGRSPoint.java) from the rest is not easy.
GeographicLib:
This seems a good solution, but I could not find a Java source file.
Is it available for usage?
NASA:
The NASA code looks great, see http://worldwind31.arc.nasa.gov/svn/trunk/WorldWind/src/gov/nasa/worldwind/geom/coords/MGRSCoordConverter.java. Isolating just the MGRS conversion code was not easy.
GDAL:
Was implemented in another programming language.
IBM (via j-coordconvert.zip):
Is compact and works well for the UTM conversion, but the MGRS conversion is reported to be erroneous. Alas.
Is there a good (compact) Java source for converting between LatLon/wgs84 and MGRS?
I finally found a sufficiently good answer. Berico, thank you!
https://github.com/Berico-Technologies/Geo-Coordinate-Conversion-Java
This source code isolates the NASA Java source code and adds one nice utility class.
Examples:
double lat = 52.202050;
double lon = 6.102050;
System.out.println( "To MGRS is " + Coordinates.mgrsFromLatLon( lat, lon));
And the other way around:
String mgrs = "31UCU 59248 14149";
double[] latlon = Coordinates.latLonFromMgrs( mgrs);
I want to be able to build a model using Java. I am able to do so with the CLI as follows:
./mahout trainlogistic --input Candy-Crush.twtr.csv \
--output ./model \
--target hd_click --categories 2 \
--predictors click_frequency country_code ctr device_price_range hd_conversion time_of_day num_clicks phone_type twitter is_weekend app_entertainment app_wallpaper app_widgets arcade books_and_reference brain business cards casual comics communication education entertainment finance game_wallpaper game_widgets health_and_fitness health_fitness libraries_and_demo libraries_demo lifestyle media_and_video media_video medical music_and_audio news_and_magazines news_magazines personalization photography productivity racing shopping social sports sports_apps sports_games tools transportation travel_and_local weather app_entertainment_percentage app_wallpaper_percentage app_widgets_percentage arcade_percentage books_and_reference_percentage brain_percentage business_percentage cards_percentage casual_percentage comics_percentage communication_percentage education_percentage entertainment_percentage finance_percentage game_wallpaper_percentage game_widgets_percentage health_and_fitness_percentage health_fitness_percentage libraries_and_demo_percentage libraries_demo_percentage lifestyle_percentage media_and_video_percentage media_video_percentage medical_percentage music_and_audio_percentage news_and_magazines_percentage news_magazines_percentage personalization_percentage photography_percentage productivity_percentage racing_percentage shopping_percentage social_percentage sports_apps_percentage sports_games_percentage sports_percentage tools_percentage transportation_percentage travel_and_local_percentage weather_percentage reads_magazine_sum reads_magazine_count interested_in_gardening_sum interested_in_gardening_count kids_birthday_coming_sum kids_birthday_coming_count job_seeker_sum job_seeker_count friends_sum friends_count married_sum married_count charity_donor_sum charity_donor_count student_sum student_count interested_in_real_estate_sum interested_in_real_estate_count sports_fan_sum sports_fan_count bascketball_sum bascketball_count interested_in_politics_sum 
interested_in_politics_count gamer_sum gamer_count activist_sum activist_count traveler_sum traveler_count likes_soccer_sum likes_soccer_count interested_in_celebs_sum interested_in_celebs_count auto_racing_sum auto_racing_count age_group_sum age_group_count healthy_lifestyle_sum healthy_lifestyle_count interested_in_finance_sum interested_in_finance_count sports_teams_usa_sum sports_teams_usa_count interested_in_deals_sum interested_in_deals_count business_oriented_sum business_oriented_count interested_in_cooking_sum interested_in_cooking_count music_lover_sum music_lover_count beauty_sum beauty_count follows_fashion_sum follows_fashion_count likes_wrestling_sum likes_wrestling_count name_sum name_count shopper_sum shopper_count golf_sum golf_count vegetarian_sum vegetarian_count dating_sum dating_count interested_in_fashion_sum interested_in_fashion_count interested_in_news_sum interested_in_news_count likes_tennis_sum likes_tennis_count male_sum male_count interested_in_cars_sum interested_in_cars_count follows_bloggers_sum follows_bloggers_count entertainment_sum entertainment_count interested_in_books_sum interested_in_books_count has_kids_sum has_kids_count interested_in_movies_sum interested_in_movies_count musicians_sum musicians_count tech_oriented_sum tech_oriented_count female_sum female_count has_pet_sum has_pet_count practicing_sports_sum practicing_sports_count \
--types numeric word numeric word word word numeric word word word numeric \
--features 100 --passes 1 --rate 50
I can't understand the 20 newsgroups example because it's too big to learn from.
Can anyone give me code that does the same as the CLI command?
To clarify, I need something like this:
model.train(1,0,"monday",6,44,1,7,4,6,78,7,3,4,6,........,"good");
model.train(1,0,"sunday",6,44,5,7,9,2,4,6,78,7,3,4,6,........,"bad");
model.train(1,0,"monday",4,99,2,4,6,3,4,6,........,"good");
model.writeTofile("myModel.model");
Please do not answer if you are not familiar with classification and only want to tell me how to execute a CLI command from Java.
I am not 100% familiar with the Mahout API (I agree that documentation is very sparse) so I can only give pointers, but I hope it helps:
The Java source code for the trainlogistic example can actually be found in the mahout-examples library - it's on maven [0] (in org.apache.mahout.classifier.sgd.TrainLogistic). I suppose if you wanted to, you could just use the exact same source code, but it depends on a couple of utility classes in the mahout-examples library (and it's not very clean, either).
The class performing the training in this example is org.apache.mahout.classifier.sgd.OnlineLogisticRegression [1], although considering the large number of predictor variables you have you might want to use the AdaptiveLogisticRegression [2] (same package), which uses a number of OnlineLogisticRegressions internally. But you have to see for yourself which works best with your data.
The API is fairly straightforward: there's a train method, which takes a Vector of your input data, and a classify method to test your model, as well as learningRate and other methods to change the model's parameters.
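To show the shape of that train/classify workflow without depending on Mahout, here is a minimal, self-contained sketch of online logistic regression trained by stochastic gradient descent. The class TinyLogisticRegression and its methods are hypothetical and are not Mahout's implementation; the sketch also works on plain double arrays, whereas Mahout works on its own Vector type and you would still need to encode word-valued predictors into numbers.

```java
class TinyLogisticRegression {
    private final double[] weights;
    private final double learningRate;

    TinyLogisticRegression(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the instance belongs to category 1. */
    double classify(double[] features) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One stochastic gradient step on a single labelled instance (actual is 0 or 1). */
    void train(int actual, double[] features) {
        double error = actual - classify(features);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
    }

    public static void main(String[] args) {
        TinyLogisticRegression model = new TinyLogisticRegression(2, 0.5);
        // Toy data: features are {bias, x}; the label is 1 when x is positive
        for (int pass = 0; pass < 100; pass++) {
            model.train(1, new double[] {1, 2});
            model.train(0, new double[] {1, -2});
        }
        System.out.println(model.classify(new double[] {1, 2}));
        System.out.println(model.classify(new double[] {1, -2}));
    }
}
```

The point of the sketch is the calling pattern: one train call per labelled row, repeated for the desired number of passes, followed by classify calls on unseen rows.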
To save the model to disk like the command line tool does, use org.apache.mahout.classifier.sgd.ModelSerializer, which has a straightforward API to write and read your model. (There are also write and readFields methods in the OLR class itself, but frankly, I'm not sure what they do or whether they differ from ModelSerializer; they're not documented either.)
Lastly, aside from the source code in mahout-examples, here are two other examples of using the Mahout API directly that might be useful [3, 4].
Sources:
[0] http://repo1.maven.org/maven2/org/apache/mahout/mahout-examples/0.8/
[1] http://archive.cloudera.com/cdh4/cdh/4/mahout/mahout-core/org/apache/mahout/classifier/sgd/OnlineLogisticRegression.html
[2] http://archive.cloudera.com/cdh4/cdh/4/mahout/mahout-core/org/apache/mahout/classifier/sgd/AdaptiveLogisticRegression.html
[3] http://mail-archives.apache.org/mod_mbox/mahout-user/201206.mbox/%3CCAJwFCa3X2fL_SRxT7f7v9uMjS3Tc9WrT7vuMQCVXyH71k0H0zQ#mail.gmail.com%3E
[4] http://skife.org/mahout/2013/02/14/first_steps_with_mahout.html
This blog has a good post about how to do training and classification with Mahout Java API: http://nigap.blogspot.com/2012/02/bayes-algorithm-with-apache-mahout.html
You could use Runtime.exec to execute the same command line from Java.
The simple approach is:
Process p = Runtime.getRuntime().exec("/usr/bin/bash -ic \"<path_to_mahout>/mahout trainlogistic --input Candy-Crush.twtr.csv "
+ "--output ./model "
+ "--target hd_click --categories 2 "
+ "--predictors click_frequency country_code ctr device_price_range hd_conversion time_of_day num_clicks phone_type twitter is_weekend app_entertainment app_wallpaper app_widgets arcade books_and_reference brain business cards casual comics communication education entertainment finance game_wallpaper game_widgets health_and_fitness health_fitness libraries_and_demo libraries_demo lifestyle media_and_video media_video medical music_and_audio news_and_magazines news_magazines personalization photography productivity racing shopping social sports sports_apps sports_games tools transportation travel_and_local weather app_entertainment_percentage app_wallpaper_percentage app_widgets_percentage arcade_percentage books_and_reference_percentage brain_percentage business_percentage cards_percentage casual_percentage comics_percentage communication_percentage education_percentage entertainment_percentage finance_percentage game_wallpaper_percentage game_widgets_percentage health_and_fitness_percentage health_fitness_percentage libraries_and_demo_percentage libraries_demo_percentage lifestyle_percentage media_and_video_percentage media_video_percentage medical_percentage music_and_audio_percentage news_and_magazines_percentage news_magazines_percentage personalization_percentage photography_percentage productivity_percentage racing_percentage shopping_percentage social_percentage sports_apps_percentage sports_games_percentage sports_percentage tools_percentage transportation_percentage travel_and_local_percentage weather_percentage reads_magazine_sum reads_magazine_count interested_in_gardening_sum interested_in_gardening_count kids_birthday_coming_sum kids_birthday_coming_count job_seeker_sum job_seeker_count friends_sum friends_count married_sum married_count charity_donor_sum charity_donor_count student_sum student_count interested_in_real_estate_sum interested_in_real_estate_count sports_fan_sum sports_fan_count bascketball_sum bascketball_count interested_in_politics_sum 
interested_in_politics_count gamer_sum gamer_count activist_sum activist_count traveler_sum traveler_count likes_soccer_sum likes_soccer_count interested_in_celebs_sum interested_in_celebs_count auto_racing_sum auto_racing_count age_group_sum age_group_count healthy_lifestyle_sum healthy_lifestyle_count interested_in_finance_sum interested_in_finance_count sports_teams_usa_sum sports_teams_usa_count interested_in_deals_sum interested_in_deals_count business_oriented_sum business_oriented_count interested_in_cooking_sum interested_in_cooking_count music_lover_sum music_lover_count beauty_sum beauty_count follows_fashion_sum follows_fashion_count likes_wrestling_sum likes_wrestling_count name_sum name_count shopper_sum shopper_count golf_sum golf_count vegetarian_sum vegetarian_count dating_sum dating_count interested_in_fashion_sum interested_in_fashion_count interested_in_news_sum interested_in_news_count likes_tennis_sum likes_tennis_count male_sum male_count interested_in_cars_sum interested_in_cars_count follows_bloggers_sum follows_bloggers_count entertainment_sum entertainment_count interested_in_books_sum interested_in_books_count has_kids_sum has_kids_count interested_in_movies_sum interested_in_movies_count musicians_sum musicians_count tech_oriented_sum tech_oriented_count female_sum female_count has_pet_sum has_pet_count practicing_sports_sum practicing_sports_count "
+ "--types numeric word numeric word word word numeric word word word numeric "
+ "--features 100 --passes 1 --rate 50\"");
If you opt for this, then I suggest reading this first:
When Runtime.exec() won't
This way the application will run in a different process.
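As an alternative to passing one long string to Runtime.exec, a sketch using ProcessBuilder with an explicit argument list avoids shell quoting entirely. The flags and file names are taken from the command in the question; the class name, the method names and the mahout path you pass in are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MahoutLauncher {
    /** Builds the argument list; each element reaches the process as exactly one argument. */
    static List<String> buildCommand(String mahoutPath, List<String> predictors, List<String> types) {
        List<String> cmd = new ArrayList<>();
        cmd.add(mahoutPath);
        cmd.add("trainlogistic");
        cmd.addAll(Arrays.asList("--input", "Candy-Crush.twtr.csv"));
        cmd.addAll(Arrays.asList("--output", "./model"));
        cmd.addAll(Arrays.asList("--target", "hd_click", "--categories", "2"));
        cmd.add("--predictors");
        cmd.addAll(predictors);
        cmd.add("--types");
        cmd.addAll(types);
        cmd.addAll(Arrays.asList("--features", "100", "--passes", "1", "--rate", "50"));
        return cmd;
    }

    /** Starts the command, forwards its output to this JVM's console, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // avoids blocking when the child's output buffers fill up
                .start();
        return process.waitFor();
    }
}
```

Keeping the predictor names in a list also makes the very long --predictors argument easier to maintain than a single string literal.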
Additionally you can follow the section 'Integration with your application' from the following site:
Recommender Documentation
Also, this is a good reference on writing a recommender:
Introducing Apache Mahout
Hope this helps.
Cheers
I am getting a wrong eigenvector (I ran it multiple times to be sure) when I use matrix.eig(). The matrix is:
1.2290 1.2168 2.8760 2.6370 2.2949 2.6402
1.2168 0.9476 2.5179 2.1737 1.9795 2.2828
2.8760 2.5179 8.8114 8.6530 7.3910 8.1058
2.6370 2.1737 8.6530 7.6366 6.9503 7.6743
2.2949 1.9795 7.3910 6.9503 6.2722 7.3441
2.6402 2.2828 8.1058 7.6743 7.3441 7.6870
The function returns the eigen vectors:
-0.1698 0.6764 0.1442 -0.6929 -0.1069 0.0365
-0.1460 0.6478 0.1926 0.6898 0.0483 -0.2094
-0.5239 0.0780 -0.5236 0.1621 -0.2244 0.6072
-0.4906 -0.0758 -0.4573 -0.1279 0.2842 -0.6688
-0.4428 -0.2770 0.4307 0.0226 -0.6959 -0.2383
-0.4884 -0.1852 0.5228 -0.0312 0.6089 0.2865
Matlab gives the following eigen-vector for the same input:
0.1698 -0.6762 -0.1439 0.6931 0.1069 0.0365
0.1460 -0.6481 -0.1926 -0.6895 -0.0483 -0.2094
0.5237 -0.0780 0.5233 -0.1622 0.2238 0.6077
0.4907 0.0758 0.4577 0.1278 -0.2840 -0.6686
0.4425 0.2766 -0.4298 -0.0227 0.6968 -0.2384
0.4888 0.1854 -0.5236 0.0313 -0.6082 0.2857
The eigenvalues from Matlab and Jama match, but in the eigenvectors the first five columns are reversed in sign and only the last column matches.
Is there any issue with the kind of input that Jama.Matrix.EigenvalueDecomposition.eig()
accepts, or any other problem? Please tell me how I can fix the error. Thanks in advance.
There is no error here: both results are correct, as is any other nonzero scalar multiple of the eigenvectors.
There are infinitely many eigenvectors that work; it's just convention that most software reports vectors of length one. That Jama reports eigenvectors equal to -1 times those of Matlab is probably just an artifact of the algorithm it uses.
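A small self-contained check illustrates this: for an eigenpair (lambda, v), both v and -v satisfy A v = lambda v. The sketch below uses a simple 2x2 example rather than the matrix from the question, and the helper names are hypothetical. It also shows one possible way to normalise signs before comparing the output of two libraries.

```java
class EigenSignCheck {
    /** Largest absolute entry of A*v - lambda*v; (near) zero means v is an eigenvector for lambda. */
    static double residual(double[][] a, double[] v, double lambda) {
        double worst = 0.0;
        for (int i = 0; i < a.length; i++) {
            double av = 0.0;
            for (int j = 0; j < v.length; j++) {
                av += a[i][j] * v[j];
            }
            worst = Math.max(worst, Math.abs(av - lambda * v[i]));
        }
        return worst;
    }

    /** Returns a copy of v, flipped if necessary so its largest-magnitude component is positive. */
    static double[] canonicalSign(double[] v) {
        int k = 0;
        for (int i = 1; i < v.length; i++) {
            if (Math.abs(v[i]) > Math.abs(v[k])) {
                k = i;
            }
        }
        double sign = v[k] < 0 ? -1.0 : 1.0;
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = sign * v[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 2}}; // eigenvalue 3 with eigenvector (1, 1) / sqrt(2)
        double s = Math.sqrt(0.5);
        double[] v = {s, s};
        double[] minusV = {-s, -s};
        System.out.println(residual(a, v, 3.0));      // v is an eigenvector
        System.out.println(residual(a, minusV, 3.0)); // and so is -v
        System.out.println(java.util.Arrays.toString(canonicalSign(minusV)));
    }
}
```

Applying the same sign rule to each column of both results makes the eigenvectors directly comparable, up to rounding differences.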
For a given matrix, the eigenvalues are unique, and their number equals the dimension of the matrix if multiplicity is counted. The corresponding eigenvectors, however, may differ, because each one can be scaled by any nonzero factor along its direction. In the results you posted, both the Java and Matlab versions are correct.
Also, you could check the D matrix, which holds the eigenvalues. You will find they are the same.