Computing preference values in Apache Mahout - java

I am trying to learn Apache Mahout and am very new to this topic. I want to implement a user-based recommender. After exploring on the internet I have found some samples like the one below:
public static void main(String[] args) {
    try {
        int userId = 2;
        DataModel model = new FileDataModel(new File("data/mydataset.csv"), ";");
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(userId, 3);
        for (RecommendedItem recommendation : recommendations) {
            logger.log(Level.INFO, "Item Id recommended : " + recommendation.getItemID() + " Ratings : "
                    + recommendation.getValue() + " For UserId : " + userId);
        }
    } catch (Exception e) {
        logger.log(Level.SEVERE, "Exception in main() ::", e);
    }
}
I am using the following dataset, which contains userid, itemid and preference value respectively:
1,10,1.0
1,11,2.0
1,12,5.0
1,13,5.0
1,14,5.0
1,15,4.0
1,16,5.0
1,17,1.0
1,18,5.0
2,10,1.0
2,11,2.0
2,15,5.0
2,16,4.5
2,17,1.0
2,18,5.0
3,11,2.5
3,12,4.5
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
In this case it works fine, but my main question is that I have a different set of data which doesn't have preference values; instead it contains other columns from which I am thinking of computing the preference values. Following is my new dataset:
userid itemid likes shares comments
1 4 1 20 3
2 6 18 20 12
3 12 10 2 20
4 7 0 20 13
5 9 0 2 1
6 5 5 3 2
7 3 9 7 0
8 1 15 0 0
My question is: how can I compute a preference value for a particular record based on the other columns such as likes, shares, comments etc.? Is there any way to compute this in Mahout?

Yes. I think your snippet is from an older version of Mahout; what you want to use is the Correlated Co-Occurrence (CCO) recommender. The CCO recommender is multi-modal (it allows the user to have various kinds of input).
There are CLI drivers, but since I'm guessing you want to code, there is a Scala tutorial here.
In the tutorial I think it recommends 'friends' based on genres tagged and artists 'liked', as well as your current friends.

As @rawkintrevo says, Mahout has moved on from the older "taste" recommenders and they will be deprecated from Mahout soon.
You can build your own system from the CCO algorithm in Mahout here. It allows you to use data from different kinds of user behavior like "likes, shares, comments", which is why we call it multi-modal.
Or, in another project, we have created a full-featured recommendation server based on Mahout, called the Universal Recommender. It is built on Apache PredictionIO, where the UR is a plugin called a Template. Together they deliver a nearly turnkey server that takes input and responds to queries. To get started easily, try the AWS AMI that has the whole system working. Some other methods to install are shown here.
This is all Apache-licensed OSS, but Mahout alone can no longer really provide a production-ready environment; Mahout provides algorithms, but you need a system around them. Build your own or try the PredictionIO-based one. Since everything is OSS you can tweak things if needed.
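If you just want to keep using the classic taste-style recommender from your snippet while you evaluate CCO, one simple stopgap (this is not a Mahout feature; the file name, weights and squashing function below are made-up assumptions you would need to tune) is to collapse the engagement columns into a single preference value and write out a userid,itemid,preference CSV that FileDataModel can read:

import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;

public class PreferenceBuilder {

    // Hypothetical weights for how much a like, a share and a comment should count.
    private static final double W_LIKE = 1.0;
    private static final double W_SHARE = 2.0;
    private static final double W_COMMENT = 1.5;

    public static void main(String[] args) throws Exception {
        // "data/engagement.csv" is a placeholder for the likes/shares/comments file shown above.
        List<String> lines = Files.readAllLines(Paths.get("data/engagement.csv"), StandardCharsets.UTF_8);
        try (PrintWriter out = new PrintWriter("data/mydataset.csv", "UTF-8")) {
            for (String line : lines) {
                String[] f = line.trim().split("\\s+");    // userid itemid likes shares comments
                if (!Character.isDigit(f[0].charAt(0))) {
                    continue;                              // skip the header row
                }
                double raw = W_LIKE * Double.parseDouble(f[2])
                        + W_SHARE * Double.parseDouble(f[3])
                        + W_COMMENT * Double.parseDouble(f[4]);
                // Squash the unbounded engagement score into a 0-5 rating-like range.
                double preference = 5.0 * (1.0 - Math.exp(-raw / 20.0));
                out.printf(Locale.US, "%s,%s,%.2f%n", f[0], f[1], preference);
            }
        }
    }
}

The CCO approach is still the better long-term answer, since it learns how much each behaviour should count instead of relying on hard-coded weights like these.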

Related

Cassandra, Java and MANY Async request : is this good?

I'm developing a Java application with Cassandra, with this table:
id      | registration | name
1       | 1            | xxx
1       | 2            | xxx
1       | 3            | xxx
2       | 1            | xxx
2       | 2            | xxx
...     | ...          | ...
...     | ...          | ...
100,000 | 34           | xxx
My table has a very large number of rows (more than 50,000,000). I have a myListIds list of String ids to iterate over. I could use:
SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
// imagine more than 10,000,000 numbers in 'IN'
But this is a bad pattern. So instead I'm using async request this way :
Map<String, ResultSetFuture> mapFutures = new HashMap<>();
// mapFutures : key = id & value = future for the data from Cassandra
for (String id : myListIds)
{
    ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
    mapFutures.put(id, resultSetFuture);
}
Then I will process my data with the getUninterruptibly() method.
Here is my problem: I'm making maybe more than 10,000,000 Cassandra requests (one request for each 'id'), and I'm putting all these results inside a Map.
Can this cause a heap memory error? What's the best way to deal with that?
Thank you
Note: your question is "is this a good design pattern".
If you have to perform 10,000,000 Cassandra data requests then you have structured your data incorrectly. Ultimately you should design your database from the ground up so that you only ever have to perform 1-2 fetches.
Now, granted, if you have 5000 Cassandra nodes this might not be a huge problem (it probably still is), but it still reeks of bad database design. I think the solution is to take a look at your schema.
I see the following problems with your code:
An overloaded Cassandra cluster: it won't be able to process so many async requests, and your requests will fail with NoHostAvailableException.
An overloaded Cassandra driver: your client app will fail with IO exceptions, because the system will not be able to process so many async requests (see details about connection tuning at https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/).
And yes, memory issues are possible; it depends on the data size.
A possible solution is to limit the number of async requests and process the data in chunks (e.g. see this answer).
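A rough sketch of that chunking approach (assuming the DataStax Java driver 3.x; the chunk size and the prepared statement are placeholders you would tune for your cluster):

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedFetch {

    private static final int CHUNK_SIZE = 1000; // max in-flight async requests; tune for your cluster

    public static void fetchByChunks(Session session, PreparedStatement statement, List<String> myListIds) {
        Map<String, ResultSetFuture> inFlight = new HashMap<>();
        for (String id : myListIds) {
            inFlight.put(id, session.executeAsync(statement.bind(id)));
            if (inFlight.size() >= CHUNK_SIZE) {
                processChunk(inFlight);    // wait for this chunk before sending more
            }
        }
        processChunk(inFlight);            // pick up the last partial chunk
    }

    private static void processChunk(Map<String, ResultSetFuture> inFlight) {
        for (Map.Entry<String, ResultSetFuture> entry : inFlight.entrySet()) {
            ResultSet rs = entry.getValue().getUninterruptibly();  // blocks until this query finishes
            // Process the rows for entry.getKey() here and let them go out of scope,
            // instead of keeping 10,000,000 results in one big Map.
        }
        inFlight.clear();
    }
}

Bounding the number of in-flight futures gives the cluster back-pressure and keeps the client heap from holding millions of ResultSets at once.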

How to get Dynamic Xpath of a webtable?

List<WebElement> table = driver.findElements(By.xpath("//*[@id=\"prodDetails\"]/div[2]/div[1]/div/div[2]/div/div/table/tbody/tr"));
JavascriptExecutor jse = (JavascriptExecutor) driver;
// jse.executeScript("arguments[0].scrollIntoView();", table);
jse.executeScript("arguments[0].style.border='3px solid red'", table);
int row = table.size();
I am unable to get the required number of rows and columns. The XPath I provided does not find the table on the site.
Link : Click here
I have to fetch the specification of the mobile.
Instead of this xpath:
//*[@id=\"prodDetails\"]/div[2]/div[1]/div/div[2]/div/div/table/tbody/tr
Use this:
//*[@id="prodDetails"]/div[2]/div[1]/div/div[2]/div/div/table/tbody/tr
Though I would not suggest you use an absolute XPath. You can go for a relative XPath, which is more readable and easier to maintain.
Relative Xpath :
//div[@id='prodDetails']/descendant::div[@class='pdTab'][1]/descendant::tbody/tr
In code something like :
List<WebElement> table = driver.findElements(By.xpath("//div[@id='prodDetails']/descendant::div[@class='pdTab'][1]/descendant::tbody/tr"));
Instead of absolute xpath:
//*[@id=\"prodDetails\"]/div[2]/div[1]/div/div[2]/div/div/table/tbody/tr
I would suggest using a simple relative xpath:
//*[@id='prodDetails']//table/tbody/tr
This xpath will work if there are no other tables on the page. Otherwise, make sure the tables can be differentiated by some attribute.
You can get the total number of rows using the below XPath.
In the above link there are multiple sections with the same class, and the two tables also have similar locators. So you need to get the element based on the table name, as below.
Note: you can achieve this without using JavascriptExecutor
WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//div[@class='section techD']//span[text()='Technical Details']/ancestor::div[@class='section techD']//table//tr")));
List<WebElement> rowElementList = driver.findElements(By.xpath("//div[@class='section techD']//span[text()='Technical Details']/ancestor::div[@class='section techD']//table//tr"));
int row = rowElementList.size();
System.out.println(row);//16
output:
16
Suppose you want to get the Additional Information table row details; you can use the above XPath, replacing the section name with Additional Information.
List<WebElement> additionInfoList = driver.findElements(By.xpath("//div[@class='section techD']//span[text()='Additional Information']/ancestor::div[@class='section techD']//table//tr"));
System.out.println(additionInfoList.size());//Output: 5
Output: 5
Finally, you can iterate over the above list and extract the table content details, for example as in the sketch below.
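A minimal sketch of that iteration (it assumes each row holds a label cell followed by a value cell, which is an assumption about this particular page):

for (WebElement rowElement : rowElementList) {
    List<WebElement> cells = rowElement.findElements(By.tagName("td"));
    if (cells.size() >= 2) {
        String label = cells.get(0).getText().trim();
        String value = cells.get(1).getText().trim();
        System.out.println(label + " : " + value);
    }
}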
XPATH can be pretty hard to read, especially when you need to use it a lot.
You could try the univocity-html-parser
HtmlElement e = HtmlParser.parseTree(new UrlReaderProvider("your_url"));
List<HtmlElement> rows = e.query()
.match("div").precededBy("div").withExactText("Technical Details")
.match("tr").getElements();
for(HtmlElement row : rows){
System.out.println(row.text());
}
The above code will print out:
OS Android
RAM 2 GB
Item Weight 150 g
Product Dimensions 7.2 x 14.2 x 0.9 cm
Batteries: 1 AA batteries required. (included)
Item model number G-550FY
Wireless communication technologies Bluetooth, WiFi Hotspot
Connectivity technologies GSM, (850/900/1800/1900 MHz), 4G LTE, (2300/2100/1900/1800/850/900 MHz)
Special features Dual SIM, GPS, Music Player, Video Player, FM Radio, Accelerometer, Proximity sensor, E-mail
Other camera features 8MP primary & 5MP front
Form factor Touchscreen Phone
Weight 150 Grams
Colour Gold
Battery Power Rating 2600
Whats in the box Handset, Travel Adaptor, USB Cable and User Guide
Alternatively, the following code is a bit more usable, as I believe you probably want more stuff from that page too, and getting rows of data is usually what you want to end up with:
HtmlEntityList entityList = new HtmlEntityList();
HtmlEntitySettings product = entityList.configureEntity("product");
PartialPath technicalDetailRows = product.newPath()
.match("div").precededBy("div").withExactText("Technical Details")
.match("tr");
technicalDetailRows.addField("technical_detail_field").matchFirst("td").classes("label").getText();
technicalDetailRows.addField("technical_detail_value").matchLast("td").classes("value").getText();
HtmlParserResult results = new HtmlParser(entityList).parse(new UrlReaderProvider("your_url")).get("product");
System.out.println("-- " + Arrays.toString(results.getHeaders()) + " --");
for(String[] row : results.getRows()){
System.out.println(Arrays.toString(row));
}
Now this produces:
OS = Android
RAM = 2 GB
Item Weight = 150 g
Product Dimensions = 7.2 x 14.2 x 0.9 cm
Batteries: = 1 AA batteries required. (included)
Item model number = G-550FY
Wireless communication technologies = Bluetooth, WiFi Hotspot
Connectivity technologies = GSM, (850/900/1800/1900 MHz), 4G LTE, (2300/2100/1900/1800/850/900 MHz)
Special features = Dual SIM, GPS, Music Player, Video Player, FM Radio, Accelerometer, Proximity sensor, E-mail
Other camera features = 8MP primary & 5MP front
Form factor = Touchscreen Phone
Weight = 150 Grams
Colour = Gold
Battery Power Rating = 2600
Whats in the box = Handset, Travel Adaptor, USB Cable and User Guide
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.

Get N terms with top TFIDF scores for each documents in Lucene (PyLucene)

I am currently using PyLucene but since there is no documentation for it, I guess a solution in Java for Lucene will also do (but if anyone has one in Python it would be even better).
I am working with scientific publications and, for now, I retrieve their keywords. However, for some documents there are simply no keywords. An alternative would be to take the N words (5-8) with the highest TF-IDF scores.
I am not sure how to do it, and also when. By when, I mean: do I have to tell Lucene at the indexing stage to compute these values, or is it possible to do it when searching the index?
What I would like to have for each query would be something like this :
Query Ranking
Document1, top 5 TFIDF terms, Lucene score (default TFIDF)
Document2, " " , " "
...
What would also be possible is to first retrieve the ranking for the query, and then compute the top 5 TFIDF terms for each of these documents.
Does anyone have an idea how I should do this?
If a field is indexed, document frequencies can be retrieved with getTerms. If a field has stored term vectors, term frequencies can be retrieved with getTermVector.
I also suggest looking at MoreLikeThis, which uses tf*idf to create a query similar to the document, from which you can extract the terms.
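On the Java side, here is a minimal sketch of the term-vector idea above (assuming a Lucene 4.x-style API and a field named "contents" that was indexed with term vectors; in Lucene 5+ the iterator takes no argument):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TopTfIdfTerms {

    // Returns term -> tf*idf for one document; sort the entries and keep the top N afterwards.
    static Map<String, Double> tfIdfForDoc(IndexReader reader, int docId, String field) throws IOException {
        Map<String, Double> scores = new HashMap<>();
        Terms vector = reader.getTermVector(docId, field);   // null if the field has no term vector
        if (vector == null) {
            return scores;
        }
        TermsEnum termsEnum = vector.iterator(null);          // Lucene 5+: vector.iterator()
        int numDocs = reader.numDocs();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            long tf = termsEnum.totalTermFreq();              // frequency of the term in this document
            int df = reader.docFreq(new Term(field, BytesRef.deepCopyOf(term)));
            double idf = 1 + Math.log((double) numDocs / (df + 1));
            scores.put(term.utf8ToString(), tf * idf);
        }
        return scores;
    }
}

Sorting the returned map by value and keeping the first N entries gives the top-N terms per retrieved document.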
And if you'd like a more pythonic interface, that was my motivation for lupyne:
from lupyne import engine
searcher = engine.IndexSearcher(<filepath>)
df = dict(searcher.terms(<field>, counts=True))
tf = dict(searcher.termvector(<docnum>, <field>, counts=True))
query = searcher.morelikethis(<docnum>, <field>)
After digging a bit in the mailing list, I ended up with what I was looking for.
Here is the method I came up with :
def getTopTFIDFTerms(docID, reader):
    termVector = reader.getTermVector(docID, "contents")
    termsEnumvar = termVector.iterator(None)
    termsref = BytesRefIterator.cast_(termsEnumvar)
    tc_dict = {}     # Counts of each term
    dc_dict = {}     # Number of docs associated with each term
    tfidf_dict = {}  # TF-IDF values of each term in the doc
    N_terms = 0
    try:
        while (termsref.next()):
            termval = TermsEnum.cast_(termsref)
            fg = termval.term().utf8ToString()  # Term in unicode
            tc = termval.totalTermFreq()        # Term count in the doc
            # Number of docs having this term in the index
            dc = reader.docFreq(Term("contents", termval.term()))
            N_terms = N_terms + 1
            tc_dict[fg] = tc
            dc_dict[fg] = dc
    except:
        print 'error in term_dict'
    # Compute TF-IDF for each term (float() avoids Python 2 integer division)
    for term in tc_dict:
        tf = float(tc_dict[term]) / N_terms
        idf = 1 + math.log(float(N_DOCS_INDEX) / (dc_dict[term] + 1))
        tfidf_dict[term] = tf * idf
    # Here I get a representation of the sorted dictionary
    sorted_x = sorted(tfidf_dict.items(), key=operator.itemgetter(1), reverse=True)
    # Get the top 5
    top5 = [i[0] for i in sorted_x[:5]]  # replace 5 by TOP N
    return top5
I am not sure why I have to cast the termsEnum as a BytesRefIterator; I got this from a thread in the mailing list, which can be found here.
Hope this will help :)

Get cluster assignments in Weka

I have a CSV file as follows:
id,at1,at2,at3
1072,0.5,0.2,0.7
1092,0.2,0.5,0.7
...
I've loaded it into Weka for clustering:
DataSource source = new DataSource("test.csv");
Instances data = source.getDataSet();
SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setPreserveInstancesOrder(true); // needed later for getAssignments()
kmeans.buildClusterer(data);
Question #1: How do I set the first column as an ID, i.e. ignore the first column for clustering purposes?
I then try to print out the assignments:
int[] assignments = kmeans.getAssignments();
int i = 0;
for (int clusterNum : assignments) {
System.out.printf("Instance %d -> Cluster %d \n", i, clusterNum);
i++;
}
This prints:
Instance 1 -> Cluster 0
Instance 2 -> Cluster 2
...
Question #2: How do I refer to the ID when printing out the assignments? For example:
Instance 1072 -> Cluster 0
Instance 1092 -> Cluster 2
I realize this is an old question, but I came here looking for an answer as well and was able to figure it out myself, so I'm putting my solution here for the next person with this problem. In my case the clustering component is part of a Java application, so I don't have the option of using the Weka workbench. Here is what I did to pull out the id along with the cluster assignments.
int[] assignments = kmeans.getAssignments();
for (int i = 0; i < assignments.length; i++) {
int id = (int) data.instance(i).value(0); // cast from double
System.out.printf("ID %d -> Cluster %d \n", id, assignments[i]);
}
Unlike the OP, I did not build my Instances from DataSource.getDataSet(); I built them manually from a database table, but the id field was the first one in my case as well, so I think the code above should work. I had a custom distance function that skipped the id field when computing similarity.
Your life would be much easier if you use the Windows version of Weka with the GUI.
In the Cluster tab there is a button for ignoring attributes like the ID.
And for the ID-to-cluster assignments: after you are done with the clustering algorithm you chose, right-click the result on the left of the screen, then visualize the results and save them.
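If you need the same "ignore the ID attribute" behaviour from Java code rather than the GUI, a minimal sketch using Weka's Remove filter (the file name and clusterer settings are placeholders) could look like this:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterWithoutId {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("test.csv").getDataSet();

        // Drop the first attribute (the id) before clustering.
        Remove remove = new Remove();
        remove.setAttributeIndices("1");           // 1-based index of the id column
        remove.setInputFormat(data);
        Instances dataForClustering = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setPreserveInstancesOrder(true);    // required for getAssignments()
        kmeans.setNumClusters(3);
        kmeans.buildClusterer(dataForClustering);

        // The filtered set keeps the original row order, so index i still maps
        // back to the id stored in the unfiltered data.
        int[] assignments = kmeans.getAssignments();
        for (int i = 0; i < assignments.length; i++) {
            int id = (int) data.instance(i).value(0);
            System.out.printf("ID %d -> Cluster %d%n", id, assignments[i]);
        }
    }
}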

Faceting using SolrJ and Solr4

I've gone through the related questions on this site but haven't found a relevant solution.
When querying my Solr4 index using an HTTP request of the form
&facet=true&facet.field=country
The response contains all the different countries along with counts per country.
How can I get this information using SolrJ?
I have tried the following but it only returns total counts across all countries, not per country:
solrQuery.setFacet(true);
solrQuery.addFacetField("country");
The following does seem to work, but I do not want to have to explicitly set all the groupings beforehand:
solrQuery.addFacetQuery("country:usa");
solrQuery.addFacetQuery("country:canada");
Secondly, I'm not sure how to extract the facet data from the QueryResponse object.
So two questions:
1) Using SolrJ how can I facet on a field and return the groupings without explicitly specifying the groups?
2) Using SolrJ how can I extract the facet data from the QueryResponse object?
Thanks.
Update:
I also tried something similar to Sergey's response (below).
List<FacetField> ffList = resp.getFacetFields();
log.info("size of ffList:" + ffList.size());
for(FacetField ff : ffList){
String ffname = ff.getName();
int ffcount = ff.getValueCount();
log.info("ffname:" + ffname + "|ffcount:" + ffcount);
}
The above code shows ffList with size = 1 and the loop goes through one iteration. In the output ffname = "country" and ffcount is the total number of rows that match the original query.
There is no per-country breakdown here.
I should mention that on the same solrQuery object I am also calling addField and addFilterQuery. Not sure if this impacts faceting:
solrQuery.addField("user-name");
solrQuery.addField("user-bio");
solrQuery.addField("country");
solrQuery.addFilterQuery("user-bio:" + "(Apple OR Google OR Facebook)");
Update 2:
I think I got it, again based on what Sergey said below. I extracted the List object using FacetField.getValues().
List<FacetField> fflist = resp.getFacetFields();
for(FacetField ff : fflist){
String ffname = ff.getName();
int ffcount = ff.getValueCount();
List<Count> counts = ff.getValues();
for(Count c : counts){
String facetLabel = c.getName();
long facetCount = c.getCount();
}
}
In the above code the facetLabel variable holds each facet grouping and facetCount is the corresponding count for that grouping.
Actually, you only need to set the facet field and faceting will be activated (check the SolrJ source code):
solrQuery.addFacetField("country");
Where did you look for the facet information? It must be in QueryResponse.getFacetFields() (then getValues() and getCount()).
In the Solr response you should use QueryResponse.getFacetFields() to get the List of FacetFields, among which is "country"; so "country" is identified by QueryResponse.getFacetFields().get(0).
You then iterate over it to get the List of Count objects using
QueryResponse.getFacetFields().get(0).getValues().get(i)
and get the value name of the facet using QueryResponse.getFacetFields().get(0).getValues().get(i).getName()
and the corresponding weight using
QueryResponse.getFacetFields().get(0).getValues().get(i).getCount()
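Putting the two answers together, here is a minimal end-to-end sketch (the query string, core URL and client construction are assumptions for illustration; with older SolrJ 4.x you would construct an HttpSolrServer instead, but the faceting calls are the same):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountryFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("country");   // group by country without listing the values up front
        query.setRows(0);                 // only the facet counts are needed here

        QueryResponse response = solr.query(query);

        // One FacetField per field passed to addFacetField; its values are the per-country groupings.
        FacetField country = response.getFacetField("country");
        for (FacetField.Count c : country.getValues()) {
            System.out.println(c.getName() + " : " + c.getCount());
        }
        solr.close();
    }
}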
