I have about 3000 text documents, each associated with a duration of time during which the document was "interesting". So let's say document 1 has 300 lines of text and led to a duration of interest of 5.5 days, whereas another document with 40 lines of text led to a duration of 6.7 days of being "interesting", and so on.
Now the task is to predict the duration of interest (which is a continuous value) based on the text content.
I have two ideas to approach the problem:
Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives, one could find the 10 most similar documents from the past, compute the average of their durations, and take that value as the prediction of the duration of interest for the new document.
Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.
The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me how I could compute a similar measure of uncertainty. It is also unclear to me which categories to choose to get the best results from a classifier.
So is there a rule of thumb for how to build a system to best predict a continuous value like time from text documents? Should one use a classifier, or an approach based on average values of similar documents? I have no real experience in that area and would like to know which approach you think would probably yield the best results. Bonus points are given if you know a simple existing technology (Java or Python based) which could be used to solve this problem.
Approach (1) is called k-nearest neighbors regression. It's perfectly valid. So are myriad other approaches to regression, e.g. plain multiple regression using the documents' tokens as features.
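If you want to try the nearest-neighbors idea directly, here's a rough sketch (untested; it assumes scikit-learn's KNeighborsRegressor, with documents being your list of training filenames and ytrain the matching list of durations, as below):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents)   # documents: list of filenames
# average the durations of the 10 most similar training documents,
# weighting closer neighbors more heavily
knn = KNeighborsRegressor(n_neighbors=10, weights="distance")
knn.fit(Xtrain, ytrain)
predicted_durations = knn.predict(vect.transform(new_documents))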
Here's a skeleton script to fit a linear regression model using scikit-learn(*):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents) # documents: list of filenames
# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...
# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)
That's it. If you now have new documents for which you want to predict the duration of interest, do:
Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)
This is a simple linear regression. In reality, I would expect interest duration not to be a linear function of a text's contents, but this might get you started. The next step would be to pick up any textbook on machine learning or statistics that treats more advanced regression models.
(*) I'm a contributor to this project, so this is not unbiased advice. Just about any half-decent machine learning toolkit has linear regression models.
(The following is based on my academic "experience", but seems informative enough to post.)
It looks like your task can be reformulated as:
Given a training set of scored documents, design a system for scoring
arbitrary documents based on their content.
"based on their content" is very ambiguous. In fact, I'd say it's too ambiguous.
You could try to find a specific feature of those documents which seems to be responsible for the score. That's more of a human task until you can narrow it down, e.g. until you know you're looking for certain "valuable" words that make up the score, or maybe for groups of words (have a look at http://en.wikipedia.org/wiki/N-gram).
You might also try developing a search-engine-like system based on a similarity measure, sim(doc1, doc2). However, you'd need a large corpus featuring all possible scores (from the lowest to the highest, multiple times), so that for every input document, similar documents would have a chance to exist. Otherwise, the results would be inconclusive.
Depending on what values sim() returns, the measure should fulfill a relationship like:
sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.
To test the quality of the measure, you could compute the similarity and the score difference for each pair of documents, and check the correlation.
The first pick would be the cosine similarity using tf-idf.
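As a rough sketch of that correlation check (assuming scikit-learn and scipy are available, and that filenames and scores are hypothetical parallel lists of your documents and their durations):
from itertools import combinations
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
X = TfidfVectorizer(input="filename").fit_transform(filenames)
S = cosine_similarity(X)  # S[i, j] = cosine similarity of doc i and doc j
# collect similarity and absolute score difference for every document pair
pairs = list(combinations(range(len(filenames)), 2))
sims = [S[i, j] for i, j in pairs]
diffs = [abs(scores[i] - scores[j]) for i, j in pairs]
# a strongly negative correlation suggests the measure tracks the scores
print(pearsonr(sims, diffs))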
You've also mentioned categorizing the data. That seems to me like a way of "justifying" a poor similarity measure: if the measure is good, it should be clear which category a document falls into. As for classifiers, your documents would first need some "features" defined.
If you had a large corpus of the documents, you could try clustering to speed up the process.
Lastly, to determine the final score, I would suggest processing the scores of the few most similar documents. A raw average might not be the best idea in this case, because "less similar" also means "less accurate".
As for implementation, have a look at: Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
(IMHO, 3000 documents is far too few to do anything reliable without further knowledge of their content or of the relationship between the content and the score.)
I'm working on a Java project where I need to match user queries against several engines.
Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.
Example: user searches "Dragon Ball".
Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT results (similarity = -1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
The system queries several other engines (10+).
I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.
Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years experience) but I'm completely new at clustering.
Thank you!
The obvious approach would be to use 1 - similarity as a distance function, which will thus go from 0 to 2. Then add the distances from the different engines up.
Or you could use 1 + similarity and take the product of these values, ... or, or, or, ...
But since you apparently trust the first engine more, you may also want to increase its influence. There is no mathematical solution for this; you have to choose the weights depending on your data and preferences. If you have training data, you can optimize the weights for your approach, and you may even want to discard some rankers if they don't work well or are correlated.
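As a tiny illustration of that weighted combination (in Python for brevity, with made-up engine outputs and weights, so not a drop-in for your Java system):
def combined_distance(similarities, weights):
    # each engine's similarity in [-1, 1] is mapped to a distance 1 - sim in [0, 2];
    # the per-engine distances are then combined as a weighted average
    total = sum(w * (1.0 - s) for s, w in zip(similarities, weights))
    return total / sum(weights)
# engine 1 is trusted more, so it gets a higher weight (values are hypothetical)
print(combined_distance([-1.0, 0.7, 0.5], [3.0, 1.0, 1.0]))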
Consider that we have 16 different categories, e.g., Computer, Science, Art, Business, etc. Under each category we have some words, such as synonyms and homonyms, which describe the possible meaning of each topic and its range. Consequently there might be similar or even identical words which fall into more than one category. Our aim is to submit a query (with a maximum length of 3, after stop-word removal) to the system and ask the system to put this query into the category with the highest similarity. So my question is: besides cosine similarity, is there any good technique for doing this?
I already know about WordNet and its extended version, extjwnl; however, I wish to implement something which gives me enough flexibility for small use cases.
There are a few things that can be done on top of this to improve performance, like stemming and lemmatization, so that similarity is calculated properly.
As far as similarity is concerned, you can use LDA (latent Dirichlet allocation) to treat each document as a combination of multiple topics.
LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you
Decide on the number of words N the document will have (say, according to a Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have two topics, food and cute animals, you might choose the document to consist of 1/3 food and 2/3 cute animals.
Generate each word w_i in the document by:
First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
https://www.cs.princeton.edu/~blei/topicmodeling.html
Although this is unsupervised training where the topics (categories) are latent, you can use an extension of LDA called LLDA (Labeled LDA).
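A minimal sketch of fitting plain (unlabeled) LDA, assuming scikit-learn is available and docs is a hypothetical list of raw text strings for your corpus:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(docs)          # term counts, one row per document
lda = LatentDirichletAllocation(n_components=16, random_state=0)
doc_topics = lda.fit_transform(X)     # rows are per-document topic mixtures
# a short query can then be scored against the same topic space
query_topics = lda.transform(vect.transform(["computer science query"]))
Note that the 16 fitted topics won't automatically line up with your 16 named categories; that is exactly the gap Labeled LDA is meant to close.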
I would not recommend using WordNet and cosine similarity, as they don't consider co-occurrences of terms and therefore might not work well with all datasets.
Jaccard Similarity can also be used in your case.
Jaccard similarity converts each sentence into a set of tokens and then measures the overlap (intersection over union) of the sets of the documents being compared.
For more information on Jaccard Similarity you could take a look at https://en.wikipedia.org/wiki/Jaccard_index
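For the record, Jaccard similarity is only a few lines in Python (a sketch with naive lowercased whitespace tokenization):
def jaccard_similarity(a, b):
    # Jaccard index of the token sets of two sentences
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
print(jaccard_similarity("machine learning in python", "learning python"))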
I have a data set of time series data I would like to display on a line graph. The data is currently stored in an Oracle table and is sampled at 1 point per second. The question is: how do I plot the data over a 6-month period of time? Is there a way to downsample the data once it has been returned from Oracle (this can be done in various charts, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I downsample this to 1K points and still have the line graph keep the visual characteristics (peaks/valleys) of the 10K points?
I looked at Apache Commons, but without knowing exactly what the statistical name for this is, I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.
It sounds like what you want is to segment the 10K data points into 1K buckets. The value of each of these buckets may be any statistic that makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might want to use the median (a percentile) to summarize the 10 points in each bucket; Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map them to 1K data points by taking the median of every 10 points -- that will tell me the most common load time within the range -- and plot that. Or maybe I can use the max to find the maximum load time in the period.
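A sketch of that bucketing in Python with numpy (assuming values holds the 10K samples in time order and divides evenly into 1000 buckets):
import numpy as np
values = np.asarray(values)                # 10K samples, in time order
buckets = values.reshape(1000, -1)         # 1000 buckets of 10 consecutive points
downsampled = np.median(buckets, axis=1)   # or buckets.max(axis=1) to keep peaks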
There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins with hard boundaries between them. This is perfectly fine, and it's called downsampling if you're working with a time series.
You can also use a smooth bin definition by applying a sliding window average/function convolution to points. This will give you a time series at the same sampling rate as your original, but much smoother. Prominent examples are the sliding window average (mean/median of all points in the window, equally weighted average) and Gaussian convolution (weighted average where the weights come from a Gaussian density curve).
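A sliding-window (moving-average) version as a rough sketch, again assuming values is a 1-D numpy array of the raw samples:
import numpy as np
window = 10
kernel = np.ones(window) / window                     # equal weights
smoothed = np.convolve(values, kernel, mode="same")   # same length, but smoother
# for Gaussian convolution, a weighted kernel can be used instead, e.g.
# scipy.ndimage.gaussian_filter1d(values, sigma=5)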
My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, is subinterval = 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
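If it helps, such a scheme is just a lookup from the overall range to a subinterval length; here is a sketch in Python with made-up cut-off points:
def choose_subinterval(overall_seconds):
    # pick an averaging interval based on the overall time range (thresholds are illustrative)
    day, hour = 86400, 3600
    if overall_seconds > 180 * day:   # roughly half a year or more
        return day
    if overall_seconds > 30 * day:    # more than about a month
        return hour
    if overall_seconds > day:         # more than a day
        return 60                     # one minute
    return None                       # short range: just plot the raw data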
If all you need is to reduce the number of points in your visualization without losing any visual information, I suggest using the code here. The tricky part of this approach is finding the correct threshold, where the threshold is the number of data points you want to have after downsampling. The lower the threshold, the more visual information you lose. However, going from 10K to 1K is feasible, since I have tried it with a similar amount of data.
As a side note, you should keep in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your chart, meaning that for bigger charts you need more data.
Any further analysis may not return correct results if it is applied to the downsampled data, or at least I haven't seen anyone prove the opposite.
I'm working with the Mahout framework in order to get recommendations in an implicit-feedback context, using the well-known MovieLens dataset (ml-100k), which I have binarized by treating all ratings equal to four or five as 1 and all others as 0.
This dataset comes with five splits, each of which is divided into a test set and a training set as usual.
In the recommendation process I train the recommender using a simple GenericBooleanPrefUserBasedRecommender and the TanimotoCoefficientSimilarity as described in these lines of code:
DataModel trainModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.base").getFile())));
DataModel testModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.test").getFile())));
UserSimilarity similarity = new TanimotoCoefficientSimilarity(trainModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(35, similarity, trainModel);
GenericBooleanPrefUserBasedRecommender userBased = new GenericBooleanPrefUserBasedRecommender(trainModel, neighborhood, similarity);
long firstUser = testModel.getUserIDs().nextLong(); // get the first user
// try to recommender items for the first user
for(LongPrimitiveIterator iterItem = testModel.getItemIDsFromUser(firstUser).iterator(); iterItem.hasNext(); ) {
long currItem = iterItem.nextLong();
// estimates preference for the current item for the first user
System.out.println("Estimated preference for item " + currItem + " is " + userBased.estimatePreference(firstUser, currItem));
}
When I execute this code, the result is a list of 0.0 or 1.0 values, which are not useful for top-n recommendation in an implicit-feedback context, simply because I need to obtain, for each item, an estimated score in the range [0, 1] so that I can rank the list in decreasing order and construct the top-n recommendation appropriately.
So what's the problem with this code? Have I missed something or something was incorrect?
Or maybe is the Mahout framework that doesn't provide a proper way of using binary feedback?
Thank you in advance,
Alessandro Suglia
If you want recommendations, you are calling the wrong function. You have to call recommend:
List<RecommendedItem> items = userBased.recommend(firstUser, 10);
for(RecommendedItem item : items) {
System.out.println(item.getItemID()+" Estimated preference: "+item.getValue());
}
More information can be found at the javadocs:
https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/recommender/Recommender.html
An extensive code example can be found here:
https://github.com/ManuelB/facebook-recommender-demo/blob/master/src/main/java/de/apaxo/bedcon/FacebookRecommender.java
If you are trying to evaluate the recommender offline and you are using the in-memory item- or user-based recommender, then Mahout has an evaluation framework for this. It will split the data into training and test sets automatically and randomly. It trains on the training set and runs an evaluation on the test set, giving back several metrics.
Check out the "Evaluation" section at the bottom of the wiki page here:
https://mahout.apache.org/users/recommender/userbased-5-minutes.html
Each run of this will yield slightly different results due to the random hold out set.
I would caution about doing this across different recommender algorithms since the test is only checking one against itself. To compare two algorithms or implementations is more complicated. Be sure to use exactly the same data, training and test splits, and even then the results are questionable until you do A/B user testing.
Update:
You said you are using a particular offline evaluation system and can't use Mahout's; no matter. Here is how it's done:
You can remove some data from the dataset, i.e. withhold certain preferences. Then train and obtain recommendations for the users who had some data withheld. Since the test data has not been used to train and get recs, you can then compare what those users actually preferred to the predictions made by the recommender. If all of them match, you have 100% precision. Note that you are comparing recommendations to actual but held-out preferences.
If you are using some special tools you may be doing this to compare algorithms, which is not an exact thing at all, no matter what the Netflix prize may have led us to believe. If you are using offline tests to tune a specific recommender you may have better luck with the results.
In one installation we had real data and split it into test and training sets by date. The older 90% of the data was used to train; the most recent 10% was used to test. This mimics the way data comes in. We compared the recommendations from the training data against the actual preferences in the held-out data and used MAP@some-number-of-recs as the score. This allows you to measure ranking, which RMSE does not. The MAP score led us to several useful conclusions about tuning that were data dependent.
http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision
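For reference, average precision at k is only a few lines; here is a simplified Python sketch, assuming recommended is one user's ranked list of item IDs and held_out is the set of items that user actually preferred:
def average_precision_at_k(recommended, held_out, k):
    # average precision of one user's top-k recommendations against held-out preferences
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in held_out:
            hits += 1
            score += hits / rank          # precision at this rank
    return score / min(len(held_out), k) if held_out else 0.0
# MAP@k is then simply the mean of this value over all test users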
I am working on a project where I need to group sentences based on how similar they are.
For Example, these sentences need to be grouped into a single cluster:
Apple's monster Q1 earnings still fall short on Wall Street
Apple announces Q1 2013 earnings: record $54.5 billion in revenue.
Apple posts record revenue and profits; iPhone sales jump nearly 30%.
The titles keep coming in, so I might need to arrange and modify the clusters on the fly. Currently I am using the Monge-Elkan algorithm to identify how similar two strings are, but I don't know how to cluster them.
Searching on the internet leads me to believe that I need to use K-Means algorithm to group content, but I am not sure how to proceed with what I have.
What makes matters slightly complicated is the fact that I have hosted it on Google App Engine, so I can't use the file system.
Edit distance metrics are unlikely to effectively model the similarity of the meaning of sentences, which I assume you are after. Same goes for the low-level representation of text as a string of characters.
A better approach is to use a higher-level representation, such as the vector space model. Here you collect all the unique words in your sentence collection (corpus) and map each of them to a number. Each document (sentence) is then represented as a vector:
[w1_count, w2_count, ..., wN_count]
where the N-th element is the count of the N-th word (the word mapped to number N) in the given sentence.
Now you can run k-means on this dataset, but better:
Process the data so that important words such as 'Apple' are given more weight than common words such as 'on' or 'in'. One such technique is TF-IDF. Then run standard k-means on this with euclidean distance.
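A minimal sketch of that pipeline with scikit-learn, using your three example titles plus two made-up ones for a second topic (the cluster count is a guess you would have to tune):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
titles = [
    "Apple's monster Q1 earnings still fall short on Wall Street",
    "Apple announces Q1 2013 earnings: record $54.5 billion in revenue.",
    "Apple posts record revenue and profits; iPhone sales jump nearly 30%.",
    "Samsung Galaxy S4 launch date announced",   # made-up title, second topic
    "Samsung confirms Galaxy S4 launch event",   # made-up title, second topic
]
X = TfidfVectorizer(stop_words="english").fit_transform(titles)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(labels)   # the three Apple titles should share one cluster label
Since your titles keep coming in, scikit-learn's MiniBatchKMeans (which has a partial_fit method) may suit on-the-fly updating better than plain KMeans.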
Even better, use an even higher-level tool such as Latent Semantic Analysis or Latent Dirichlet Allocation.
If you want to use your existing approach, Simon G.'s answer points you in the right direction, and similarity-to-distance conversion is answered in this question.
First, change your similarities into dissimilarities so that they can be thought of as distances.
Second, use a multidimensional scaling library to change the distances into points in space.
Third, use regular k-means on the points in space.
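A rough sketch of those three steps with scikit-learn, assuming S is your precomputed Monge-Elkan similarity matrix with values in [0, 1], and that 5 clusters is just a placeholder:
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans
D = 1.0 - np.asarray(S)                    # step 1: similarities -> dissimilarities
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
points = mds.fit_transform(D)              # step 2: embed the distances as points in space
labels = KMeans(n_clusters=5, random_state=0).fit_predict(points)   # step 3: regular k-means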