I was looking at a price comparison site like this one. My question is: how does it know that two products from two different sites are the same product, and how does it club the two into the same bucket to show the price comparison?
If it were only books, then I could understand it: all books have a unique ISBN, so just write some website-specific code which will fetch data from the websites and compare.
e.g. you have two websites:
www.xyz.com
www.pqr.com
Now these two websites list their books differently, i.e. the HTML will be different, so parse the HTML and fetch the ISBN and price from it. Then for each ISBN we can put the two websites' prices side by side. That is simple, but how will you match products which do not have an ID that is unique and uniform across websites like the ISBN (pressure cookers, watches, etc.)?
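For example, once each site-specific parser produces an ISBN-to-price map, the comparison itself would just be a join on ISBN (a rough sketch; the data here is made up):

import java.util.HashMap;
import java.util.Map;

public class IsbnJoin {
    public static void main(String[] args) {
        // Output of the site-specific parsers: ISBN -> price
        Map<String, Double> xyzPrices = new HashMap<>();
        Map<String, Double> pqrPrices = new HashMap<>();
        xyzPrices.put("9780132350884", 31.99);
        pqrPrices.put("9780132350884", 29.49);

        // Join on ISBN: every ISBN present on both sites is comparable
        for (Map.Entry<String, Double> e : xyzPrices.entrySet()) {
            Double other = pqrPrices.get(e.getKey());
            if (other != null) {
                System.out.printf("ISBN %s: xyz=%.2f, pqr=%.2f%n",
                        e.getKey(), e.getValue(), other);
            }
        }
    }
}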
Thanks.
Other products also have identification numbers; in Europe it is the EAN, which is currently being turned into a global number called the GTIN. In e-commerce, Amazon IDs (ASIN, of which ISBN is a subset) are also often used.
If you don't have these numbers available, which is usually the case, you will need a strategy called Record Linkage or Data Matching.
TL;DR: It usually uses a string-matching algorithm to find similarly worded products (using an inverted index on n-grams, for example). At the end, you can use machine learning to remove the wrong matches (false positives). This requires a lot of training data (there are no public datasets available, or only very small ones), and thus most of the time a human will check those matches.
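To make the string-matching idea concrete, here is a minimal sketch of n-gram similarity (Jaccard over character trigrams); a real record-linkage system would combine this with an inverted index so that only candidate pairs are scored, not all pairs:

import java.util.HashSet;
import java.util.Set;

public class NGramSimilarity {
    // Split a string into character n-grams, e.g. "sony" -> {"son", "ony"}
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + n <= t.length(); i++) {
            grams.add(t.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity: |intersection| / |union| of the n-gram sets
    static double jaccard(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        Set<String> union = new HashSet<>(ga);
        union.addAll(gb);
        ga.retainAll(gb); // ga is now the intersection
        return union.isEmpty() ? 0.0 : (double) ga.size() / union.size();
    }

    public static void main(String[] args) {
        // Similarly worded titles score high even without a shared product ID
        System.out.println(jaccard("Sony WH-1000XM4 Headphones",
                                   "Sony WH1000XM4 Wireless Headphones", 3));
    }
}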
For a more detailed analysis of the problem I can only recommend reading the book Data Matching by Peter Christen. It goes deep into information retrieval (how to find similar products) and then how to sort out wrong or right matches using machine-learning (e.g. via structural analysis).
There are also plenty of papers by him available on the net, so check out his Scholar profile.
I'm trying to better organise the types of tasks regularly sent to my team, based on the titles and short comments people enter.
Our team only handles a handful (maybe 10 or so) of different types of tasks, so I've put together a list of common words used within the description of a particular type of task, and I've been using this to categorise the issues. For example, an issue might come through like "User x doesn't have access to office after hours, please update their swipecard access level". What I've got so far is: if the comments contain 'swipecard' or 'access', it's a building-access type request.
I've quickly found myself with code that's LOTS of if contains(...) and if !contains(...) checks...
Is there a neater way of doing what I'm after?
If you want to make it complex, it sounds like you have a classification problem.
If you want to keep it simple, you're probably on the right track with your if statements and contains(). To get to a cleaner solution, I would approach it as follows (a sketch follows the list):
Create a class to model your categories - give it two attributes: String categoryName, List<String> commonlyUsedWords.
Populate a list with instances of that class - one per type.
For each issue, loop through the list of categories and check how many words match, and store that as a percentage (e.g. 8 out of 10 words match, therefore 80% match).
Return the category with the highest match rate.
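A minimal sketch of those four steps (everything beyond the two named attributes is my own naming):

import java.util.Arrays;
import java.util.List;

class Category {
    String categoryName;
    List<String> commonlyUsedWords;

    Category(String name, List<String> words) {
        this.categoryName = name;
        this.commonlyUsedWords = words;
    }

    // Fraction of this category's keywords that appear in the issue text
    double matchRate(String issueText) {
        String lower = issueText.toLowerCase();
        long hits = commonlyUsedWords.stream()
                .filter(lower::contains)
                .count();
        return (double) hits / commonlyUsedWords.size();
    }
}

public class IssueClassifier {
    public static void main(String[] args) {
        // One Category instance per task type
        List<Category> categories = Arrays.asList(
                new Category("Building access", Arrays.asList("swipecard", "access", "door")),
                new Category("Password reset", Arrays.asList("password", "login", "locked")));

        String issue = "User x doesn't have access to office after hours, "
                + "please update their swipecard access level";

        // Return the category with the highest match rate
        Category best = categories.stream()
                .max((a, b) -> Double.compare(a.matchRate(issue), b.matchRate(issue)))
                .get();
        System.out.println(best.categoryName); // Building access
    }
}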
I'm working on the Wikipedia Category Graph (WCG). In the WCG, each article is associated to multiple categories.
For example, the article "Lists_of_Israeli_footballers" is linked to multiple categories, such as:
Lists of association football players by nationality - Israeli footballers - Association football in Israel lists
Now, if you climb back up the category tree, you are likely to find a lot of paths leading up to the "Football" category, but there is also at least one path leading up to "Science", for example.
This is problematic because my final goal is to be able to determine whether or not an article belongs to a given category using the list of categories it is linked with: right now a simple ancestor search gives false positives (for example, it identifies "Israeli footballers" as part of the "Science" category, which is obviously not the expected result).
I want an algorithm able to find out what the most likely ancestor is.
I thought about two main solutions:
Count the number of distinct paths in the WCG linking the article's category vertices to the candidate ancestor category (and, for comparison, use the number of paths linking to other categories of the same depth)
Use some kind of clustering algorithm and make ancestor search queries in isolated graph spaces
The issue with those options is that they seem very costly considering the size of the WCG (2 million vertices, and even more edges). Eventually, I could accept a solution that uses a preprocessing algorithm in O(n) or more to achieve O(1) queries later, but I need the queries to be overall very fast.
Are there existing solutions to my problem? I'm open to all suggestions.
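For concreteness, option 1 could be memoized so that each category vertex is solved only once. A rough sketch, assuming the parent links fit in a map and that cycles have been removed first (the raw WCG is not a strict DAG):

import java.util.*;

public class PathCounter {
    // parents.get(c) = direct parent categories of c; assumes cycles were removed
    static long countPaths(String category, String target,
                           Map<String, List<String>> parents,
                           Map<String, Long> memo) {
        if (category.equals(target)) return 1;
        Long cached = memo.get(category);
        if (cached != null) return cached;
        long total = 0;
        for (String p : parents.getOrDefault(category, Collections.emptyList())) {
            total += countPaths(p, target, parents, memo);
        }
        memo.put(category, total); // each vertex is solved exactly once
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parents = new HashMap<>();
        parents.put("Israeli footballers",
                Arrays.asList("Football in Israel", "Israeli sportspeople"));
        parents.put("Football in Israel", Arrays.asList("Football"));
        parents.put("Israeli sportspeople", Arrays.asList("Sports"));
        System.out.println(countPaths("Israeli footballers", "Football",
                parents, new HashMap<>())); // 1
    }
}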
No problem, thanks for clarifying. Anything like clustering is probably not a good idea, because those types of algorithms are meant to determine a category for an object that is not associated with any category yet. In your problem, every object (a footballer article) is already associated with different categories.
You should probably do a complete search through all articles and save the matched categories for each article in a hash table, so that you can retrieve this category information when you need it for a new article.
Whether or not a category is relevant for an article seems totally arbitrary to me, and seems to be something you should decide for yourself (e.g. determine a threshold of 5 links to a category before the article is considered part of it).
If you're getting these articles from Wikipedia, you're probably in for a pretty long run working through the entire tree, but in my opinion it seems like your only choice.
Search with DFS, and each time you find an article-category match, save the article in a hash table (you need to be able to reduce an article to a unique identifier).
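A sketch of that precomputation, assuming the graph fits in memory (all names here are illustrative). The hash table maps each article ID to its full set of ancestor categories, so later membership queries are O(1):

import java.util.*;

public class AncestorIndex {
    // categoryParents.get(c) = direct parent categories of category c
    static Set<String> ancestorsOf(List<String> directCategories,
                                   Map<String, List<String>> categoryParents) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(directCategories);
        while (!stack.isEmpty()) {   // iterative DFS
            String c = stack.pop();
            if (seen.add(c)) {       // skip already-visited vertices (handles cycles)
                for (String p : categoryParents.getOrDefault(c, Collections.emptyList())) {
                    stack.push(p);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parents = new HashMap<>();
        parents.put("Israeli footballers", Arrays.asList("Football in Israel"));

        // Build once: article ID -> all ancestor categories; queries are then O(1)
        Map<String, Set<String>> index = new HashMap<>();
        index.put("Lists_of_Israeli_footballers",
                ancestorsOf(Arrays.asList("Israeli footballers"), parents));

        System.out.println(index.get("Lists_of_Israeli_footballers")
                .contains("Football in Israel")); // true
    }
}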
This is probably the most vague answer I've ever posted here, and your question might be too broad... if this doesn't help you, please let me know so I can consider removing it in order to avoid confusing future readers.
I am planning to create an affiliate site (Price Comparison site).
As you all know, data (products and their info) from different e-commerce sites plays a vital role in this type of price comparison site.
I have already written scripts to scrape data for products from the sites of my interest, and they are working as expected.
In more detail, I am scraping the following common parameters and storing them in my DB:
1) Product title, 2) Product description, 3) Price, 4) Pay modes, etc.
[FYI: I used the jsoup API to scrape the data.]
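For context, the scraping part looks roughly like this (the URL and CSS selectors are placeholders; every site needs its own):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProductScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and selectors: adapt these per site
        Document doc = Jsoup.connect("https://www.example.com/product/123")
                .userAgent("Mozilla/5.0")
                .get();
        String title = doc.select("h1.product-title").text();
        String price = doc.select("span.price").text();
        String description = doc.select("div.description").text();
        System.out.println(title + " | " + price);
        // ... insert title, description, price into the DB here
    }
}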
PROBLEM STARTS HERE:
I want to group products [the same product] from different sources which I scraped from these sites.
To illustrate my question:
Say XYZ is a product sold on 5 different sites, with some variation in its product title.
I scraped data from these 5 sites and saved it to my DB. Now, how should I effectively group these products into a single group, so that I can show the 5 different sources on a single page of my site?
I do not have any clue how I should proceed.
[String comparison is the first thought that comes to my mind, but I do not think it will work in the long run.]
Any suggestions / recommendations are welcome and appreciated.
If you require any further information, please do not hesitate to add comments.
-JS
In the initial phase, you can use Solr to get the best similarity score when comparing product titles, or even their descriptions.
Going deeper, think about the user's side: why is a product considered the same product? It is features like brand, colour, material, and so on that make two listings the same product.
Make a dictionary of the feature set for each catalogue; these features should be the same before declaring any two products a common product.
It may then happen that many products share the same feature set; in this case, you can take help from Solr for scoring...
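A minimal SolrJ sketch of that scoring step (this assumes Solr 8.x and that the scraped products are already indexed in a "products" core with a "title" field; the names are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TitleMatcher {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build();

        // Ask Solr to score indexed products against a newly scraped title
        SolrQuery q = new SolrQuery("title:(Sony WH-1000XM4 Headphones)");
        q.setFields("id", "title", "score");
        q.setRows(5);

        QueryResponse resp = solr.query(q);
        for (SolrDocument d : resp.getResults()) {
            // High-scoring hits are candidates for the same product group
            System.out.println(d.getFieldValue("score") + "  " + d.getFieldValue("title"));
        }
        solr.close();
    }
}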
Moreover, you can check the Google image search API, which can help with image-similarity scoring. This will be helpful for finding common products in fashion catalogues.
Hope it helps...
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment, and section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting section?
My first thought was to do it with a neural network and learn from the input when the user selects a different section.
But my problem is that I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or suggest any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse the comments for these synonyms. You can then collect comments online for synonym analysis, or use the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// Case-insensitive keyword check on the user's comment
String myString = paramInput.toUpperCase();
if (myString.contains("FUEL")) {
    // do the fuel functionality, e.g. preselect the "Car" section
}
In a simple app, if you will only have some specific sections in your application, then you can take the string from the comment, check whether it contains certain keywords, and change the value of the section accordingly.
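A slightly more scalable variant of the same idea, assuming a hand-maintained keyword-to-section map instead of chained contains() calls:

import java.util.LinkedHashMap;
import java.util.Map;

public class SectionGuesser {
    // Hand-maintained keyword -> section table; grows as users add sections
    static final Map<String, String> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("fuel", "Car");
        KEYWORDS.put("service", "Car");
        KEYWORDS.put("lunch", "Food");
        KEYWORDS.put("salary", "Salary");
    }

    static String guessSection(String comment) {
        String lower = comment.toLowerCase();
        for (Map.Entry<String, String> e : KEYWORDS.entrySet()) {
            if (lower.contains(e.getKey())) return e.getValue();
        }
        return null; // no preselection: let the user choose
    }

    public static void main(String[] args) {
        System.out.println(guessSection("Car service at the garage")); // Car
    }
}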
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
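A rough sketch of that Lucene approach (assuming Lucene 8.x; the category keywords here are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSectionGuesser {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();

        // Index one document per category, with its keyword list
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            addCategory(writer, "Car", "fuel petrol gas car service garage tires");
            addCategory(writer, "Food", "lunch dinner restaurant groceries snack");
        }

        // Run the user's comment as a query; the top hit is the "best guess"
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("keywords", analyzer).parse("Lunch at the station");
            TopDocs top = searcher.search(q, 1);
            if (top.scoreDocs.length > 0) {
                System.out.println(searcher.doc(top.scoreDocs[0].doc).get("category"));
            }
        }
    }

    static void addCategory(IndexWriter w, String category, String keywords) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("category", category, Field.Store.YES));
        doc.add(new TextField("keywords", keywords, Field.Store.YES));
        w.addDocument(doc);
    }
}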
P.S. Neural network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement string matching with one, I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so it is in theory a fit for a neural net, but why not go straight for the numerical best fit?
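A rough illustration of those three pieces together; the suffix-stripping "stemmer" here is deliberately naive, and a real one would be a Porter stemmer:

import java.util.*;

public class WordStats {
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "at", "a", "for", "an", "of"));

    // Deliberately naive stemming: "lunches" -> "lunch", "cars" -> "car"
    static String stem(String w) {
        if (w.endsWith("es")) return w.substring(0, w.length() - 2);
        if (w.endsWith("s"))  return w.substring(0, w.length() - 1);
        return w;
    }

    // Count stemmed, non-stop-word frequencies per section as comments accumulate
    public static void main(String[] args) {
        Map<String, Integer> carWordCounts = new HashMap<>();
        for (String word : "Fuel for the car at the garage".toLowerCase().split("\\s+")) {
            if (!STOP_WORDS.contains(word)) {
                carWordCounts.merge(stem(word), 1, Integer::sum);
            }
        }
        System.out.println(carWordCounts); // e.g. {car=1, fuel=1, garage=1}
    }
}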
A machine learning algorithm such as an artificial neural network (ANN) doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. "which of the provided pre-trained classes does the input represent?", not just "does the input represent an X?"), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available, to produce better results and a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the sections beforehand, an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find the list of sections that the typed keywords belong to, suggest multiple sections, and let the user pick one; the feedback you get from user selections would improve the suggestions over time. Another would be to calculate a Bayesian probability - the probability that this word belongs to section X given the previously stored mappings - for all keywords and sections, and either take the modal section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread.
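A rough sketch of that Bayesian counting, assuming the stored comment-to-section mappings have already been tokenized into (word, section) pairs (the class and method names are my own):

import java.util.*;

public class SectionProbabilities {
    // count(word, section) and count(word), built from the stored mappings
    Map<String, Map<String, Integer>> wordSectionCounts = new HashMap<>();
    Map<String, Integer> wordCounts = new HashMap<>();

    void observe(String word, String section) {
        wordSectionCounts.computeIfAbsent(word, k -> new HashMap<>())
                .merge(section, 1, Integer::sum);
        wordCounts.merge(word, 1, Integer::sum);
    }

    // P(section | word) = count(word, section) / count(word)
    double probability(String section, String word) {
        Integer total = wordCounts.get(word);
        if (total == null) return 0.0;
        return wordSectionCounts.get(word).getOrDefault(section, 0) / (double) total;
    }

    public static void main(String[] args) {
        SectionProbabilities p = new SectionProbabilities();
        p.observe("fuel", "Car");
        p.observe("fuel", "Car");
        p.observe("fuel", "Travel");
        System.out.println(p.probability("Car", "fuel")); // 0.666...
    }
}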
I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?
This is a classification problem, where you essentially want to learn a function y(x) which predicts whether 'x', an unlabeled tweet, belongs in the class 'valuable' or in the class 'not valuable'.
The trickiest bits here are not the algorithm (Naive Bayes is just counting and multiplying and is easy to code!) but:
Gathering the training data
Defining the optimal feature set
For the first, I suggest you track tweets that the user favorites, replies to, and retweets; for the second, look at qualities like who wrote the tweet, the words in the tweet, and whether it contains a link or not.
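Since Naive Bayes really is just counting and multiplying, a minimal binary sketch for labeled tweets could look like the following (add-one smoothing and log-probabilities keep it numerically sane; this is an illustration, not TalkingPuffin code):

import java.util.*;

public class TweetNaiveBayes {
    Map<String, Integer> goodCounts = new HashMap<>(), badCounts = new HashMap<>();
    int goodTweets = 0, badTweets = 0, goodWords = 0, badWords = 0;
    Set<String> vocab = new HashSet<>();

    // Count word occurrences per class ("valuable" vs "not valuable")
    void train(String tweet, boolean valuable) {
        if (valuable) goodTweets++; else badTweets++;
        for (String w : tweet.toLowerCase().split("\\s+")) {
            vocab.add(w);
            if (valuable) { goodCounts.merge(w, 1, Integer::sum); goodWords++; }
            else          { badCounts.merge(w, 1, Integer::sum);  badWords++;  }
        }
    }

    boolean predictValuable(String tweet) {
        // log P(class) + sum of log P(word | class), with add-one smoothing
        double good = Math.log((double) goodTweets / (goodTweets + badTweets));
        double bad  = Math.log((double) badTweets  / (goodTweets + badTweets));
        int v = vocab.size();
        for (String w : tweet.toLowerCase().split("\\s+")) {
            good += Math.log((goodCounts.getOrDefault(w, 0) + 1.0) / (goodWords + v));
            bad  += Math.log((badCounts.getOrDefault(w, 0)  + 1.0) / (badWords + v));
        }
        return good > bad;
    }

    public static void main(String[] args) {
        TweetNaiveBayes nb = new TweetNaiveBayes();
        nb.train("great paper on scala concurrency", true);
        nb.train("ugh traffic again", false);
        System.out.println(nb.predictValuable("new scala release")); // true
    }
}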
Doing this well is not easy. Google would love to be able to do such things ("What links will the user value"), as would Netflix ("What movies will they value") and many others. In fact, you'd probably do well to read through the notes about the winning entry for the Netflix Prize.
Then you need to extract a bunch of features, as #hmason says. And then you need an appropriate machine learning algorithm; you either need a function approximator (where you try to use your features to predict a value between, say, 0 and 1, where 1 is "best tweet ever" and 0 is "omg who cares") or a classifier (where you use your features to try to predict whether it's a "good" or "bad" tweet).
If you go for the latter (which makes user training easy, since they just have to score tweets with "like", to mix social network metaphors), then you typically do best with support vector machines, for which there exists a fairly comprehensive Java library.
In the former case, there are a variety of techniques worth trying; if you decide to use the LIBSVM library, it has variants for regression (i.e. parameter estimation) as well.
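For the classifier route, a rough LIBSVM (Java port) sketch, assuming the tweets have already been turned into numeric feature vectors (e.g. has-link, author score; the toy data here is made up):

import libsvm.*;

public class TweetSvm {
    public static void main(String[] args) {
        // Toy training set: each row is a feature vector, labels are +1/-1
        double[][] features = { {1.0, 0.9}, {0.0, 0.1}, {1.0, 0.8}, {0.0, 0.2} };
        double[] labels = { 1, -1, 1, -1 };

        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[features.length][];
        for (int i = 0; i < features.length; i++) {
            prob.x[i] = toNodes(features[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = 1.0;
        param.gamma = 0.5;
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);
        // Predict the class of a new tweet's feature vector
        System.out.println(svm.svm_predict(model, toNodes(new double[] {1.0, 0.7})));
    }

    static svm_node[] toNodes(double[] values) {
        svm_node[] nodes = new svm_node[values.length];
        for (int i = 0; i < values.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1; // LIBSVM features are 1-indexed
            nodes[i].value = values[i];
        }
        return nodes;
    }
}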