I'm working with the Mahout framework to get recommendations in an implicit-feedback context, using the well-known MovieLens dataset (ml-100k). I have binarized it by treating every rating of four or five as 1 and every other rating as 0.
The dataset comes with five splits, each divided into a training set and a test set, as usual.
In the recommendation process I train the recommender using a simple GenericBooleanPrefUserBasedRecommender and the TanimotoCoefficientSimilarity as described in these lines of code:
DataModel trainModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.base").getFile())));
DataModel testModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.test").getFile())));
UserSimilarity similarity = new TanimotoCoefficientSimilarity(trainModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(35, similarity, trainModel);
GenericBooleanPrefUserBasedRecommender userBased = new GenericBooleanPrefUserBasedRecommender(trainModel, neighborhood, similarity);
long firstUser = testModel.getUserIDs().nextLong(); // get the first user
// try to recommend items for the first user
for(LongPrimitiveIterator iterItem = testModel.getItemIDsFromUser(firstUser).iterator(); iterItem.hasNext(); ) {
long currItem = iterItem.nextLong();
// estimates preference for the current item for the first user
System.out.println("Estimated preference for item " + currItem + " is " + userBased.estimatePreference(firstUser, currItem));
}
When I execute this code, the result is a list of 0.0 and 1.0 values, which are not useful for top-n recommendation in an implicit-feedback context: for each item I need an estimated score in the range [0, 1], so that I can sort the list in decreasing order and build the top-n recommendations properly.
So what's the problem with this code? Have I missed something, or is something incorrect?
Or is it that the Mahout framework doesn't provide a proper way of using binary feedback?
Thank you in advance,
Alessandro Suglia
If you want recommendations, you are calling the wrong function. You have to call recommend:
List<RecommendedItem> items = userBased.recommend(firstUser, 10);
for(RecommendedItem item : items) {
System.out.println(item.getItemID()+" Estimated preference: "+item.getValue());
}
More information can be found at the javadocs:
https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/recommender/Recommender.html
An extensive code example can be found here:
https://github.com/ManuelB/facebook-recommender-demo/blob/master/src/main/java/de/apaxo/bedcon/FacebookRecommender.java
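To make the ranking idea concrete, here is a minimal sketch in plain Python of how a top-n list can be built from boolean data: Tanimoto similarity picks the nearest users, and each unseen item is scored by the summed similarity of the neighbours who liked it. This is an illustration under my own assumptions, not Mahout's implementation; the names tanimoto, recommend and the toy likes dictionary are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of item IDs."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(likes, user, n_neighbors=2, top_n=3):
    """Rank unseen items by the summed similarity of the neighbours who liked them."""
    mine = likes[user]
    # Most similar users first, keep only the nearest n_neighbors
    neighbours = sorted(
        ((tanimoto(mine, items), other)
         for other, items in likes.items() if other != user),
        reverse=True,
    )[:n_neighbors]
    scores = {}
    for sim, other in neighbours:
        for item in likes[other] - mine:  # only items the user has not seen
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken by item ID for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy boolean data: user -> set of liked items (hypothetical)
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "c", "d", "e"},
    "u4": {"f"},
}
ranking = recommend(likes, "u1")
```

The scores are not bounded by 1, but they give a usable ordering, which is all a top-n list needs.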
If you are trying to evaluate the recommender offline and you are using the in-memory item- or user-based recommender, then Mahout has an evaluation framework for this. It splits the data into training and test sets automatically and randomly, trains on the training set, and runs an evaluation on the test set, giving back several metrics.
Check out the "Evaluation" section at the bottom of the wiki page here:
https://mahout.apache.org/users/recommender/userbased-5-minutes.html
Each run of this will yield slightly different results due to the random hold-out set.
I would caution against doing this across different recommender algorithms, since the test only checks one algorithm against itself. Comparing two algorithms or implementations is more complicated. Be sure to use exactly the same data and the same training and test splits, and even then the results are questionable until you do A/B user testing.
Update:
Offline you said you are using a particular evaluation system and can't use Mahout's. No matter, here is how it's done:
You can remove some data from the dataset, that is, withhold certain preferences. Then train and obtain recommendations for the users who had some data withheld. The test data has not been used to train and get recs, so you then compare what users actually preferred to the predictions made by the recommender. If all of them match you have 100% precision. Note that you are comparing recommendations to actual but held-out preferences.
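As a rough sketch of that comparison, assuming the recommender has already produced a ranked list for a user and you kept that user's withheld preferences as a set (precision_at_k and the item IDs are hypothetical names, not part of any library):

```python
def precision_at_k(recommended, held_out, k):
    """Fraction of the top-k recommendations found in the held-out preferences."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in held_out)
    return hits / len(top_k)


# Ranked recommendations from a model trained WITHOUT the held-out data
recommended = ["i7", "i3", "i9", "i1"]
# Preferences that were withheld from training for this user
held_out = {"i3", "i1", "i5"}

p = precision_at_k(recommended, held_out, k=4)
```

Here two of the four recommendations are in the held-out set, so the precision is 0.5; averaging this over all test users gives one number per run.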
If you are using some special tools you may be doing this to compare algorithms, which is not an exact thing at all, no matter what the Netflix prize may have led us to believe. If you are using offline tests to tune a specific recommender you may have better luck with the results.
In one installation we had real data and split it into test and training sets by date. The oldest 90% of the data was used to train, and the most recent 10% was used to test. This mimics the way data comes in. We compared the recommendations from the training data against the actual preferences in the held-out data and used MAP@some-number-of-recs as the score. This allows you to measure ranking, where RMSE does not. The MAP score led us to several useful conclusions about tuning that were data dependent.
http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision
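A minimal sketch of the date split and the MAP@k score, in plain Python. The function names and the event tuple layout are my own assumptions, and the normalisation by min(number of relevant items, k) is just one common convention for average precision at k.

```python
def split_by_time(events, train_fraction=0.9):
    """events: list of (timestamp, user, item). Oldest part trains, newest part tests."""
    ordered = sorted(events)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def average_precision_at_k(recommended, relevant, k):
    """Average of precision values taken at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def map_at_k(recs_by_user, relevant_by_user, k):
    """Mean of the per-user average precision, over users with held-out data."""
    users = [u for u, rel in relevant_by_user.items() if rel]
    if not users:
        return 0.0
    return sum(
        average_precision_at_k(recs_by_user.get(u, []), relevant_by_user[u], k)
        for u in users
    ) / len(users)


events = [(t, "u1", "item%d" % t) for t in range(10)]
train, test = split_by_time(events)

recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
relevant = {"u1": {"a", "c"}, "u2": {"y"}}
score = map_at_k(recs, relevant, k=3)
```

Unlike plain precision, the score drops when relevant items appear lower in the list, which is why it is sensitive to ranking.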
I am new to Apache Mahout recommender. The use case involves providing suggestions to users based on their purchase history.
I am planning to use the following information :
Purchase category
Purchase amount
Time of purchase (Example - recommend a pair of denims 6 months after the first pair was bought)
Location of user
To identify users with similar purchase pattern/time of purchase and give them more preference, do I have to make custom data model for every user?
I was planning to import from the database periodically to recreate the data model.
Is there a way to dynamically give preference like mentioned below:
Location + purchase category + time match
Purchase category + time match
Location + time match (example winter clothing)
Currently I am using the sample code provided. (A lot of modifications are needed)
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
List<RecommendedItem> recommendations = recommender.recommend(74, 10);
In general, to achieve what you are suggesting you need a preprocessing step on your data where you add a feature like t_since_last_purchase, which is an integer from 0 upward, e.g. days since last purchase.
This feature, time, will be another user feature which is correlated.
I think you are looking at some of the older Map-Reduce based recommenders, which are in fact first class, but given your use case you might want to check out correlated cooccurrence based recommenders, which have the significant benefit of being able to look at multiple activities of the user (in your case location, previous purchases, time).
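The preprocessing step for the time feature could look roughly like this. It is only a sketch: the tuple layout (user, category, purchase date) and the function name are assumptions for illustration, and the feature would still have to be fed into whatever recommender you choose.

```python
from datetime import date


def days_since_last_purchase(purchases, today):
    """purchases: list of (user, category, purchase_date).

    Returns {(user, category): whole days since the most recent purchase}.
    """
    latest = {}
    for user, category, when in purchases:
        key = (user, category)
        # Keep only the most recent purchase per user and category
        if key not in latest or when > latest[key]:
            latest[key] = when
    return {key: (today - when).days for key, when in latest.items()}


purchases = [
    ("u1", "denim", date(2024, 1, 10)),
    ("u1", "denim", date(2024, 3, 1)),
    ("u2", "winter", date(2023, 12, 25)),
]
features = days_since_last_purchase(purchases, today=date(2024, 9, 1))
```

A rule such as "recommend denims again after about six months" then becomes a simple threshold on this integer.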
I'm building a recommender where the actual similarity computation is done with the ItemSimilarityJob and which is then loaded into a non distributed recommender through FileItemSimilarity.
All this works so far (2), but there's one thing I'm a bit puzzled about.
When instantiating the recommender (GenericItemBasedRecommender), I have to pass along a data model, which would be a FileDataModel in my case. But since the similarity computation has already taken place, I don't really know what data I should pass into the model.
Clearly the model is used to determine the maximum and minimum preference values and the item and user IDs. Regarding the users, I'm planning to have only anonymous "profiles" anyway, so would it then be OK to pass along fake data?
How is that supposed to work? The Mahout examples (1) and the MiA book don't give any answers on that, but both state that pre-computation is the way to go :(
(1) I'm running on Mahout 0.7 but also looked into trunk already.
(2) I had to transfer the generated similarity matrix into a textual format myself of course.
You should pass the same DataModel that was fed to the similarity computation. The recommender's output is certainly a function of the similarities, but also, of course, of the original data! That's why it's an input.
You could in theory build similarities off a different DataModel than the data you are actually making recommendations from. It's possible and might make sense in some cases but is not normal.
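To see why both inputs matter, here is a small sketch of a textbook item-based estimate: a similarity-weighted average of the user's own preferences. It is an illustration in plain Python, not Mahout's code, and the names estimate, user_prefs and item_sims are hypothetical.

```python
def estimate(user_prefs, item_sims, target):
    """Similarity-weighted average of the user's preferences over items similar to target.

    user_prefs: {item: preference} taken from the data model
    item_sims:  {(item_a, item_b): similarity} taken from the precomputed file
    """
    num = den = 0.0
    for item, pref in user_prefs.items():
        if item == target:
            continue
        # The similarity file may store each pair in either order
        sim = item_sims.get((target, item), item_sims.get((item, target)))
        if sim is None:
            continue
        num += sim * pref
        den += abs(sim)
    return num / den if den else None


item_sims = {("a", "c"): 0.8, ("b", "c"): 0.2}   # precomputed similarities
user_prefs = {"a": 5.0, "b": 3.0}                # this user's real data

with_data = estimate(user_prefs, item_sims, "c")
without_data = estimate({}, item_sims, "c")
```

With the user's real preferences the estimate for c is 4.6; with an empty (or fake) data model there is nothing to weight, so no meaningful estimate comes out.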
I have about 3000 text documents, each associated with the duration of time during which the document was "interesting". So let's say document 1 has 300 lines of text with content, which led to a duration of interest of 5.5 days, whereas another document with 40 lines of text led to a duration of 6.7 days being "interesting", and so on.
Now the task is to predict the duration of interest (which is a continuous value) based on the text content.
I have two ideas to approach the problem:
1. Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives one could try to find the 10 most similar documents in the past and simply compute the average of their duration and take that value as prediction for the duration of interest for the new document.
2. Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.
The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me how I could compute a similar measure of uncertainty for my prediction. It is also unclear to me which categories to choose to get the best results from a classifier.
So is there a rule of thumb for how to build a system that best predicts a continuous value like time from text documents? Should one use a classifier, or an approach using average values of similar documents? I have no real experience in that area and would like to know which approach you think would probably yield the best results. Bonus points are given if you know a simple existing technology (Java or Python based) which could be used to solve this problem.
Approach (1) is called k-nearest neighbors regression. It's perfectly valid. So are myriad other approaches to regression, e.g. plain multiple regression using the documents' tokens as features.
Here's a skeleton script to fit a linear regression model using scikit-learn(*):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents) # documents: list of filenames
# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...
# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)
That's it. If you now have new documents for which you want to predict the duration of interest, do
Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)
This is a simple linear regression. In reality, I would expect interest duration to not be a linear function of a text's contents, but this might get you started. The next step would be to pick up any textbook on machine learning or statistics that treats more advanced regression models.
(*) I'm a contributor to this project, so this is not unbiased advice. Just about any half-decent machine learning toolkit has linear regression models.
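For comparison, approach (1) from the question, k-nearest neighbors regression, can be sketched with the standard library alone. This is a toy illustration: it uses raw word counts with cosine similarity rather than tf-idf, and the function names and the four training texts are made up. It also returns the standard deviation of the neighbours' durations, which is the uncertainty measure the question asks about.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """train: list of (text, duration). Returns (mean, std) over the k nearest documents."""
    query = Counter(text.lower().split())
    ranked = sorted(
        train,
        key=lambda pair: cosine(query, Counter(pair[0].lower().split())),
        reverse=True,
    )
    nearest = [duration for _, duration in ranked[:k]]
    return mean(nearest), pstdev(nearest)


train = [
    ("storm warning coast", 2.0),
    ("storm damage coast report", 4.0),
    ("election results city", 7.0),
    ("election debate city council", 9.0),
]
prediction, spread = knn_predict(train, "storm coast flood", k=2)
```

The two storm documents are the nearest neighbours, so the prediction is their mean (3.0 days) with a spread of 1.0.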
(The following is based on my academic "experience", but it seems informative enough to post.)
It looks like your task can be reformulated as:
Given a training set of scored documents, design a system for scoring
arbitrary documents based on their content.
"based on their content" is very ambiguous. In fact, I'd say it's too ambiguous.
You could try to find a specific feature of those documents which seems to be responsible for the score. It's more of a human task until you can narrow it down, e.g. you know you're looking for certain "valuable" words which make up the score, or maybe groups of words (have a look at http://en.wikipedia.org/wiki/N-gram).
You might also try developing a search-engine-like system, based on a similarity measure, sim(doc1, doc2). However, you'd need a large corpus featuring all possible scores (from the lowest to the highest, multiple times), so that for every input document, similar documents would have a chance to exist. Otherwise, the results would be inconclusive.
Depending on what values sim() returns, the measure should fulfill a relationship like:
sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.
To test the quality of the measure, you could compute the similarity and the score difference for each pair of documents, and check the correlation.
The first pick would be the cosine similarity using tf-idf.
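The pairwise check can be sketched as follows. To keep it short I plug in a simple word-overlap (Jaccard) similarity instead of tf-idf cosine, and the documents and scores are invented; any sim() can be passed in. A good measure should give a strongly negative correlation, since similar documents should have a small score difference.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def similarity_vs_score_gap(docs, scores, sim):
    """Correlate sim(doc_i, doc_j) with |score_i - score_j| over all pairs."""
    sims, gaps = [], []
    for i, j in combinations(range(len(docs)), 2):
        sims.append(sim(docs[i], docs[j]))
        gaps.append(abs(scores[i] - scores[j]))
    return pearson(sims, gaps)


def jaccard(a, b):
    """Stand-in similarity: overlap of the two documents' word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


docs = ["a b c", "a b d", "x y z", "x y w"]
scores = [0.1, 0.2, 0.9, 0.8]   # scores scaled to [0, 1]
r = similarity_vs_score_gap(docs, scores, jaccard)
```

If r comes out near zero or positive on your real corpus, the similarity measure carries little information about the score and averaging over neighbours will not help.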
You've also mentioned categorizing the data. It seems to me like a method "justifying" a poor similarity measure. I.e. if the measure is good, it should be clear which category the document would fall into. As for classifiers, your documents should first have some "features" defined.
If you had a large corpus of the documents, you could try clustering to speed up the process.
Lastly, to determine the final score, I would suggest processing the scores of a few most similar documents. A raw average might not be the best idea in this case, because "less similar" would also mean "less accurate".
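One simple alternative to the raw average is to weight each neighbour's score by its similarity. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def weighted_score(neighbours):
    """neighbours: list of (similarity, score) for the most similar documents."""
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour
    return sum(sim * score for sim, score in neighbours) / total


# A very similar document scored 6.0, a barely similar one scored 2.0
estimate = weighted_score([(0.9, 6.0), (0.1, 2.0)])
```

The raw average of these two scores would be 4.0, while the weighted estimate is 5.6, pulled towards the document that actually resembles the input.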
As for implementation, have a look at: Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
(IMHO, 3000 documents is far too few to do anything reliable without further knowledge of their content or of the relationship between the content and the score.)
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: Money amount, Date, Comment and Section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and learn from the input whenever the user selects another section.
But my problem is that I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input and output.
Could anyone give me a hint on how to start, or suggest any other possible solution?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are many simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse the comments for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then) and what any development effort will bring you. One hour of work could bring you an 80% satisfying feature, while aiming for 90% could cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// Upper-case the comment first so that "Fuel" and "fuel" also match
String myString = new String(paramInput).toUpperCase();
if (myString.contains("FUEL")) {
// do the fuel functionality
}
In a simple app with only a few fixed sections, you can take the comment string, check whether it contains certain keywords, and set the value of Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and the potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
plain word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)
Finding the best fit is a linear problem, so in theory a fit for a neural net, but why not directly take the numerical best fit?
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input represent?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input that your users will provide. This is impossible; in practice ANNs are re-trained when more data is available, to produce better results and a more accurate decision boundary / representation of the function that maps the inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand,
an easy solution to program would be to keep a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find every Section that the typed keywords belong to, suggest multiple Sections and let the user pick one. The feedback that you get from user selections would enable better suggestions in the future. Another option would be to calculate a Bayesian probability (the probability that this word belongs to Section X, given the previously stored mappings) for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done in a background thread with every new addition.
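A rough sketch of the second option, in Python for brevity (the same logic ports directly to Java). It estimates P(Section | word) by simple relative frequency from the stored comment -> Section mappings and averages over the known keywords. The class name, the stop-word list and the sample comments are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real one would be larger and per language
STOP_WORDS = {"and", "or", "the", "at", "a", "for"}


class SectionSuggester:
    def __init__(self):
        # word -> Counter of how often each Section was chosen for it
        self.counts = defaultdict(Counter)

    @staticmethod
    def keywords(comment):
        return [w for w in comment.lower().split() if w not in STOP_WORDS]

    def learn(self, comment, section):
        """Record the Section the user finally picked for this comment."""
        for word in self.keywords(comment):
            self.counts[word][section] += 1

    def suggest(self, comment):
        """Average P(section | word) over the known keywords, best Section first."""
        known = [w for w in self.keywords(comment) if w in self.counts]
        totals = Counter()
        for word in known:
            seen = sum(self.counts[word].values())
            for section, n in self.counts[word].items():
                totals[section] += n / seen / len(known)
        return [section for section, _ in totals.most_common()]


suggester = SectionSuggester()
suggester.learn("Fuel for the car", "Car")
suggester.learn("Car service", "Car")
suggester.learn("Lunch at the office", "Food")
suggester.learn("Fuel and snacks", "Food")

ranked = suggester.suggest("car fuel")
```

Calling learn() every time the user confirms or corrects a Section is the feedback loop described above; unknown comments simply yield an empty suggestion list.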
I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?
This is a classification problem, where you essentially want to learn a function y(x) which predicts whether 'x', an unlabeled tweet, belongs in the class 'valuable' or in the class 'not valuable'.
The trickiest bits here are not the algorithm (Naive Bayes is just counting and multiplying and is easy to code!) but:
Gathering the training data
Defining the optimal feature set
For the first, I suggest you track the tweets that the user favorites, replies to, and retweets; for the second, look at qualities like who wrote the tweet, the words in the tweet, and whether it contains a link or not.
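To show how little code the "counting and multiplying" part needs, here is a small Naive Bayes sketch, written in Python for brevity even though the question asks about the JVM. It uses the feature ideas above (author, words, link present). It is a simplification under my own assumptions: only features present in the tweet are used, add-one smoothing is applied, and the tweet dictionaries and names are invented.

```python
import math
from collections import Counter


def features(tweet):
    """Turn a tweet into a set of binary features: author, words, link present."""
    feats = {"author=" + tweet["author"]}
    feats.update("word=" + w for w in tweet["text"].lower().split())
    if "http" in tweet["text"]:
        feats.add("has_link")
    return feats


class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()                      # tweets seen per class
        self.feature_counts = {True: Counter(), False: Counter()}
        self.vocab = set()

    def train(self, tweet, valuable):
        """valuable=True if the user favorited, replied to or retweeted it."""
        self.class_counts[valuable] += 1
        for f in features(tweet):
            self.feature_counts[valuable][f] += 1
            self.vocab.add(f)

    def score(self, tweet):
        """Log-odds that the tweet is valuable; sort by this to rank tweets."""
        total = sum(self.class_counts.values())
        log_odds = 0.0
        for label, sign in ((True, 1), (False, -1)):
            # Smoothed class prior
            log_p = math.log((self.class_counts[label] + 1) / (total + 2))
            for f in features(tweet):
                if f in self.vocab:  # ignore features never seen in training
                    count = self.feature_counts[label][f]
                    log_p += math.log((count + 1) / (self.class_counts[label] + 2))
            log_odds += sign * log_p
        return log_odds


nb = NaiveBayes()
nb.train({"author": "alice", "text": "new scala release http://x"}, True)
nb.train({"author": "alice", "text": "scala tips"}, True)
nb.train({"author": "bob", "text": "lunch was great"}, False)
nb.train({"author": "bob", "text": "so bored"}, False)

good = nb.score({"author": "alice", "text": "scala news"})
bad = nb.score({"author": "bob", "text": "bored at lunch"})
```

With this toy training data the first tweet gets positive log-odds and the second negative, so sorting incoming tweets by score() puts the likely valuable ones first.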
Doing this well is not easy. Google would love to be able to do such things ("What links will the user value"), as would Netflix ("What movies will they value") and many others. In fact, you'd probably do well to read through the notes about the winning entry for the Netflix Prize.
Then you need to extract a bunch of features, as @hmason says. And then you need an appropriate machine learning algorithm; you either need a function approximator (where you try to use your features to predict a value between, say, 0 and 1, where 1 is "best tweet ever" and 0 is "omg who cares") or a classifier (where you use your features to try to predict whether it's a "good" or "bad" tweet).
If you go for the latter--which makes user-training easy, since they just have to score tweets with "like" (to mix social network metaphors)--then you typically do best with support vector machines, for which there exists a fairly comprehensive Java library.
In the former case, there are a variety of techniques that might be worth trying; if you decide to use the LIBSVM library, they have variants for regression (i.e. parameter estimation) as well.