Get cluster assignments in Weka - java

I have a CSV file as follows:
id,at1,at2,at3
1072,0.5,0.2,0.7
1092,0.2,0.5,0.7
...
I've loaded it into Weka for clustering:
DataSource source = new DataSource("test.csv");
Instances data = source.getDataSet();
kmeans.buildClusterer(data);
Question #1: How do I treat the first column as an ID, i.e. ignore it for clustering purposes?
I then try to print out the assignments:
int[] assignments = kmeans.getAssignments();
int i = 0;
for (int clusterNum : assignments) {
    System.out.printf("Instance %d -> Cluster %d \n", i, clusterNum);
    i++;
}
This prints:
Instance 1 -> Cluster 0
Instance 2 -> Cluster 2
...
Question #2: How do I refer to the ID when printing out the assignments? For example:
Instance 1072 -> Cluster 0
Instance 1092 -> Cluster 2

I realize this is an old question, but I came here looking for an answer as well and was then able to figure it out myself, so I'm putting my solution here for the next person with this problem. In my case the clustering component is part of a Java application, so I don't have the option of using the Weka workbench. Here is what I did to pull out the id along with the cluster assignments.
int[] assignments = kmeans.getAssignments();
for (int i = 0; i < assignments.length; i++) {
    int id = (int) data.instance(i).value(0); // cast from double
    System.out.printf("ID %d -> Cluster %d \n", id, assignments[i]);
}
Unlike the OP, I did not build my Instances from DataSource.getDataSet(); I built mine manually from a database table. But the id field was the first one in my case as well, so I think the code above should work for the OP too. I used a custom distance function that skipped the id field when computing similarity.
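For Question #1, when you are working purely in code (no GUI), one option is to cluster on a filtered copy of the data with the ID attribute removed, while keeping the original Instances around to look up the IDs. This is only a sketch based on the standard Weka filter API, not the OP's code; note also that SimpleKMeans.getAssignments() requires setPreserveInstancesOrder(true) before building the clusterer.
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Cluster on a copy with the first (id) attribute removed; keep 'data' for the IDs.
Remove remove = new Remove();
remove.setAttributeIndices("1");                       // 1-based index of the id column
remove.setInputFormat(data);
Instances dataForClustering = Filter.useFilter(data, remove);

SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setNumClusters(3);                              // arbitrary choice for the example
kmeans.setPreserveInstancesOrder(true);                // needed for getAssignments()
kmeans.buildClusterer(dataForClustering);

int[] assignments = kmeans.getAssignments();
for (int i = 0; i < assignments.length; i++) {
    int id = (int) data.instance(i).value(0);          // id comes from the unfiltered data
    System.out.printf("Instance %d -> Cluster %d%n", id, assignments[i]);
}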

Your life would be much easier if you used the GUI version of Weka.
In the Cluster tab there is a button for ignoring attributes such as the ID.
As for the ID-to-cluster assignments: after you are done with the clustering algorithm you chose, right-click the result in the list on the left of the screen, visualize the results, and then save them.

Related

How to get the total count of entities in a kind in Google Cloud Datastore

I have a kind with around 5 million entities in Google Cloud Datastore, and I want to get this count programmatically using Java. I tried the following code, but it only works up to a certain threshold (around 800K).
When I run the query for 5 million records it never returns a count; my guess is that it goes into an infinite loop. How can I get the entity count for data this big? I would prefer not to use the Google App Engine API, since it requires setting up that environment.
private static Datastore datastore;
datastore = DatastoreOptions.getDefaultInstance().getService();
Query query = Query.newKeyQueryBuilder().setKind(kind).build();
int count = Iterators.size(datastore.run(query)); //count has the entities count
How accurate do you need the count to be? For a slightly out-of-date count you can use the built-in Datastore statistics entities to fetch the number of entities for a kind.
If you can't use the stale counts from the stats entity, then you'll need to keep counter entities for the real-time counts that you need. You should consider using a sharded counter.
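For the stats-based approach, here is a minimal sketch using the same com.google.cloud.datastore client as in the question. It reads the built-in __Stat_Kind__ statistics kind (the count and kind_name property names come from the Datastore statistics docs); note that these statistics are only updated periodically, so the value can lag behind reality.
import com.google.cloud.datastore.*;

// Sketch: fetch the (possibly stale) per-kind entity count from Datastore statistics.
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
Query<Entity> statQuery = Query.newEntityQueryBuilder()
        .setKind("__Stat_Kind__")                                         // built-in statistics kind
        .setFilter(StructuredQuery.PropertyFilter.eq("kind_name", kind))  // 'kind' as in the question
        .build();
QueryResults<Entity> results = datastore.run(statQuery);
if (results.hasNext()) {
    long count = results.next().getLong("count");                         // O(1) to fetch, but stale
    System.out.println("Approximate entity count: " + count);
}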
Check out Google Dataflow. A pipeline like the following should do it:
import json
import requests

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud.proto.datastore.v1 import query_pb2


def send_count_to_call_back(callback_url):
    def f(record_count):
        requests.post(callback_url, data=json.dumps({
            'record_count': record_count,
        }))
    return f


def run_pipeline(project, callback_url):
    pipeline_options = PipelineOptions.from_dictionary({
        'project': project,
        'runner': 'DataflowRunner',
        'staging_location': 'gs://%s.appspot.com/dataflow-data/staging' % project,
        'temp_location': 'gs://%s.appspot.com/dataflow-data/temp' % project,
        # .... other options
    })
    query = query_pb2.Query()
    query.kind.add().name = 'YOUR_KIND_NAME_GOES HERE'
    p = beam.Pipeline(options=pipeline_options)
    _ = (p
         | 'fetch all rows for query' >> ReadFromDatastore(project, query)
         | 'count rows' >> beam.combiners.Count.Globally()
         | 'send count to callback' >> beam.Map(send_count_to_call_back(callback_url)))
    p.run()
I use Python, but there is a Java SDK too: https://beam.apache.org/documentation/programming-guide/
The only catch is that your process has to trigger this pipeline, let it run on its own for a few minutes, and then have it hit a callback URL to let you know it's done.

Computing preference values in Apache Mahout

I am trying to learn Apache Mahout and am very new to this topic. I want to implement a user-based recommender. After exploring the internet I found some samples like the one below:
public static void main(String[] args) {
    try {
        int userId = 2;
        DataModel model = new FileDataModel(new File("data/mydataset.csv"), ";");
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(userId, 3);
        for (RecommendedItem recommendation : recommendations) {
            logger.log(Level.INFO, "Item Id recommended : " + recommendation.getItemID() + " Ratings : "
                    + recommendation.getValue() + " For UserId : " + userId);
        }
    } catch (Exception e) {
        logger.log(Level.SEVERE, "Exception in main() ::", e);
    }
}
I am using the following dataset, whose columns are userid, itemid, and preference value, respectively:
1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0
1,14,5.0
1,15,4.0
1,16,5.0
1,17,1.0
1,18,5.0
2,10,1.0
2,11,2.0
2,15,5.0
2,16,4.5
2,17,1.0
2,18,5.0
3,11,2.5
3,12,4.5
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
In this case it works fine, but my main question is that I have a different dataset that does not have preference values; instead it contains other columns from which I am thinking of computing preference values. The following is my new dataset:
userid itemid likes shares comments
1 4 1 20 3
2 6 18 20 12
3 12 10 2 20
4 7 0 20 13
5 9 0 2 1
6 5 5 3 2
7 3 9 7 0
8 1 15 0 0
My question is: how can I compute a preference value for a particular record based on other columns such as likes, shares, comments, etc.? Is there any way to compute this in Mahout?
Yes. I think your snippet is from an older version of Mahout, but what you want to use is the Correlated Co-Occurrence (CCO) recommender. The CCO recommender is multi-modal (it allows the user to have various kinds of input).
There are CLI drivers, but I'm guessing you want to code; there is a Scala tutorial here.
In the tutorial I think it recommends 'friends' based on genres tagged and artists 'liked', as well as your current friends.
As @rawkintrevo says, Mahout has moved on from the older "taste" recommenders, and they will be deprecated from Mahout soon.
You can build your own system from the CCO algorithm in Mahout here. It allows you to use data from different kinds of user behavior like "likes, shares, comments", which is why we call it multi-modal.
Or, in another project, we have created a full-featured recommendation server based on Mahout, called the Universal Recommender. It is built on Apache PredictionIO, where the UR is a plugin called a Template. Together they deliver a nearly turnkey server that takes input and responds to queries. To get started easily, try the AWS AMI that has the whole system working. Some other methods to install are shown here.
This is all Apache-licensed OSS, but Mahout alone no longer really provides a production-ready environment; Mahout supplies the algorithms, but you need a system around them. Build your own or try the PredictionIO-based one. Since everything is OSS, you can tweak things if needed.
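If you do decide to stay with the classic taste-style recommender from your snippet, a common workaround (plain preprocessing, not a Mahout feature) is to collapse the implicit signals into one synthetic preference before writing the userid,itemid,preference CSV that FileDataModel reads. A minimal Java sketch follows; the weights and the 0-5 scaling are arbitrary assumptions you would tune for your data.
// Hypothetical helper: derive a synthetic preference from implicit signals.
static double preference(int likes, int shares, int comments) {
    double score = 1.0 * likes + 2.0 * shares + 1.5 * comments;   // assumed weights
    double maxExpectedScore = 100.0;                              // assumed normalization cap
    return Math.min(5.0, 5.0 * score / maxExpectedScore);         // clamp onto a 0-5 scale
}
// e.g. the row "1 4 1 20 3" would be written out as "1,4," + preference(1, 20, 3)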

hbase how to choose pre split strategies and how its affect your rowkeys

I am trying to pre-split an HBase table. One of the HBaseAdmin Java APIs creates an HBase table as a function of a start key, an end key, and a number of regions. Here is the method I use from HBaseAdmin: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
Is there any recommendation on choosing the start key and end key based on the dataset?
My approach: let's say we have 100 records in the dataset, and I want the data divided into approximately 10 regions so each has roughly 10 records. To find the start key I would run scan '/mytable', {LIMIT => 10} and pick the last rowkey as my start key, then run scan '/mytable', {LIMIT => 90} and pick the last rowkey as my end key.
Does this approach to finding the start key and end key look OK, or is there a better practice?
EDIT
I tried the following approaches to pre-split an empty table. None of the three worked the way I used them. I think I will need to salt the key to get an even distribution.
PS: I am only displaying some of the region info.
1)
byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);
This gives regions with boundaries like:
{
"startkey":"-INFINITY",
"endkey":"11111111",
"numberofrows":3628951,
},
{
"startkey":"11111111",
"endkey":"22222222",
},
{
"startkey":"22222222",
"endkey":"33333333",
},
{
"startkey":"33333333",
"endkey":"44444444",
},
{
"startkey":"88888888",
"endkey":"99999999",
},
{
"startkey":"99999999",
"endkey":"aaaaaaaa",
},
{
"startkey":"aaaaaaaa",
"endkey":"bbbbbbbb",
},
{
"startkey":"eeeeeeee",
"endkey":"INFINITY",
}
This is useless, as my rowkeys are of a composite form like 'deptId|month|roleId|regionId' and don't fit into the above boundaries.
2)
byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits)
This has same issue:
{
"startkey":"-INFINITY",
"endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
}
{
"startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\
"endkey":"33333332",
}
{
"startkey":"33333332",
"endkey":"L\\xCC\\xCC\\xCC\\xCC\\xCC\\xCC\\xCB",
}
{
"startkey":"\\xE6ffffffa",
"endkey":"INFINITY",
}
3) I tried supplying a start key and an end key and got the following useless regions.
hBaseAdmin.createTable(tabledescriptor, Bytes.toBytes("04120|200808|805|1999"),
Bytes.toBytes("01253|201501|805|1999"), 10);
{
"startkey":"-INFINITY",
"endkey":"04120|200808|805|1999",
}
{
"startkey":"04120|200808|805|1999",
"endkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
}
{
"startkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
"endkey":"000ptq<200wp6\\xBC805|1999",
}
{
"startkey":"001\\x11\\x15\\x13\\x1C201\\x15\\x902\\x5C805|1999",
"endkey":"01253|201501|805|1999",
}
{
"startkey":"01253|201501|805|1999",
"endkey":"INFINITY",
}
First question: In my experience with HBase, I am not aware of any hard rule for choosing the number of regions or the start and end keys.
The underlying requirement is that, with your rowkey design, data should be distributed across the regions and not hotspotted
(see 36.1 "Hotspotting" in the HBase reference guide).
Also, even if you define a fixed number of regions (10, as you mentioned), there may not be 10 after a heavy data load: once a region reaches a certain size limit, it splits again.
For the method you are using, the HBaseAdmin documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
Moreover, I prefer creating the table through a script with pre-splits (say 10 of them) and designing the rowkey so that it is salted onto one of those regions, which avoids hotspotting. For example, something like the sketch below.
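A minimal sketch (not from the original answer; the single-character salt, the modulo-10 hash, and the reuse of the question's hBaseAdmin / tabledescriptor variables are assumptions for illustration):
// Pre-split on single-character salt prefixes, then salt each rowkey on write.
byte[][] splits = new byte[9][];
for (int i = 1; i <= 9; i++) {
    splits[i - 1] = Bytes.toBytes(String.valueOf(i));    // region boundaries "1" .. "9"
}
hBaseAdmin.createTable(tabledescriptor, splits);         // 10 regions: (-inf,"1"), ["1","2"), ..., ["9",+inf)

// Derive the salt deterministically from the natural key so reads can recompute it.
String naturalKey = "deptId|month|roleId|regionId";      // your composite key
int salt = Math.abs(naturalKey.hashCode()) % 10;
byte[] rowKey = Bytes.toBytes(salt + "|" + naturalKey);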
EDIT: If you want to implement your own region split, you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override
public byte[][] split(int numberOfSplits)
Second question:
My understanding is that you want to find the start rowkey and end rowkey of the data inserted into a specific table. Below are a few ways to do that.
To find the start and end rowkeys of each region, scan the meta table ('hbase:meta', or '.META.' on older versions).
You can also open the master UI at http://hbasemaster:60010 to see how the rowkeys are spread across the regions; for each region the start and end keys are listed there.
To see how your keys are organized after pre-splitting the table and inserting into HBase, use a FirstKeyOnlyFilter,
for example: scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'
which displays all 100 of your rowkeys.
If you have a huge amount of data (not 100 rows as you mentioned) and want to take a dump of all the rowkeys, you can run the following from outside the shell:
echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt

How to train Matrix Factorization Model in Apache Spark MLlib's ALS Using Training, Test and Validation datasets

I want to implement Apache Spark MLlib's ALS machine learning algorithm. I found that the best model should be chosen to get the best results, so I have split the data into three sets (training, validation, and test) as suggested on forums.
I've found the following code sample to train the model on these sets:
val ranks = List(8, 12)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}
val testRmse = computeRmse(bestModel.get, test, numTest)
This code trains a model for each combination of rank and lambda and compares the RMSE (root mean squared error) on the validation set. These iterations yield the best model, which we can say is represented by the (rank, lambda) pair.
But it doesn't do much with the test set after that; it just computes the RMSE on the `test` set.
My question is: how can the model be further tuned with the test set data?
No; one should never fine-tune the model using the test data. If you do that, it stops being test data.
I'd recommend this section of Prof. Andrew Ng's famous course that discusses the model training process: https://www.coursera.org/learn/machine-learning/home/week/6
Depending on what you observe in the error values on the validation data set, you might want to add/remove features, get more data, make changes to the model, or maybe even try a different algorithm altogether. If the validation and test RMSE look reasonable, then you are done with the model and you can use it for the purpose (some kind of prediction, I would assume) that made you build it in the first place.

Faceting using SolrJ and Solr4

I've gone through the related questions on this site but haven't found a relevant solution.
When querying my Solr4 index using an HTTP request of the form
&facet=true&facet.field=country
the response contains all the different countries along with a count per country.
How can I get this information using SolrJ?
I have tried the following but it only returns total counts across all countries, not per country:
solrQuery.setFacet(true);
solrQuery.addFacetField("country");
The following does seem to work, but I do not want to have to explicitly set all the groupings beforehand:
solrQuery.addFacetQuery("country:usa");
solrQuery.addFacetQuery("country:canada");
Secondly, I'm not sure how to extract the facet data from the QueryResponse object.
So two questions:
1) Using SolrJ how can I facet on a field and return the groupings without explicitly specifying the groups?
2) Using SolrJ how can I extract the facet data from the QueryResponse object?
Thanks.
Update:
I also tried something similar to Sergey's response (below).
List<FacetField> ffList = resp.getFacetFields();
log.info("size of ffList:" + ffList.size());
for (FacetField ff : ffList) {
    String ffname = ff.getName();
    int ffcount = ff.getValueCount();
    log.info("ffname:" + ffname + "|ffcount:" + ffcount);
}
The above code shows ffList with size=1 and the loop goes through 1 iteration. In the output ffname="country" and ffcount is the total number of rows that match the original query.
There is no per-country breakdown here.
I should mention that on the same solrQuery object I am also calling addField and addFilterQuery. Not sure if this impacts faceting:
solrQuery.addField("user-name");
solrQuery.addField("user-bio");
solrQuery.addField("country");
solrQuery.addFilterQuery("user-bio:" + "(Apple OR Google OR Facebook)");
Update 2:
I think I got it, again based on what Sergey said below. I extracted the list of Count objects using FacetField.getValues().
List<FacetField> fflist = resp.getFacetFields();
for (FacetField ff : fflist) {
    String ffname = ff.getName();
    int ffcount = ff.getValueCount();
    List<Count> counts = ff.getValues();
    for (Count c : counts) {
        String facetLabel = c.getName();
        long facetCount = c.getCount();
    }
}
In the above code, facetLabel holds the name of each facet group and facetCount is the corresponding count for that grouping.
Actually you only need to add the facet field and faceting will be activated (check the SolrJ source code):
solrQuery.addFacetField("country");
Where did you look for the facet information? It should be in QueryResponse.getFacetFields() (then getValues() and getCount() on each entry).
In the Solr response you should use QueryResponse.getFacetFields() to get the list of FacetField objects, among which is "country"; so "country" is identified by QueryResponse.getFacetFields().get(0).
You then iterate over it to get the list of Count objects using
QueryResponse.getFacetFields().get(0).getValues().get(i)
and get the name of each facet value using QueryResponse.getFacetFields().get(0).getValues().get(i).getName()
and the corresponding count using
QueryResponse.getFacetFields().get(0).getValues().get(i).getCount()
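Putting the pieces together, here is a minimal end-to-end sketch for Solr 4 / SolrJ 4 (the core URL, the *:* query, and setFacetMinCount are illustrative choices, not taken from the answers above):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

// Facet on 'country' and print each country with its count.
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setFacet(true);
solrQuery.addFacetField("country");
solrQuery.setFacetMinCount(1);                          // skip zero-count buckets
QueryResponse resp = server.query(solrQuery);

FacetField countryFacet = resp.getFacetField("country");
for (FacetField.Count c : countryFacet.getValues()) {
    System.out.println(c.getName() + " -> " + c.getCount());
}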
