I am planning to create an affiliate site (a price comparison site).
As you all know, data (products and their info) from the different e-commerce sites plays a vital role in this type of site.
I have already written scripts to scrape product data from the sites of my interest, and they are working as expected.
In more detail, I am scraping the following common parameters and storing them in my DB:
1) product title, 2) product description, 3) price, 4) payment modes, etc.
[FYI: I used the jsoup API to scrape the data.]
PROBLEM STARTS HERE:
I want to group products [the same product] from the different sources I scraped.
To illustrate my question:
Say XYZ is a product sold on 5 different sites, with some variation in its product title on each.
I scraped the data from these 5 sites and saved it to my DB. Now how should I effectively group these listings into a single group, so that I can show the 5 different sources on a single page of my site?
I do not have any clue how to proceed.
[String comparison is the first thought that comes to mind, but I do not think it will work in the long run.]
Any suggestions / recommendations are welcome and appreciated.
If you require any further information, please do not hesitate to add a comment.
-JS
In the initial phase you can use Solr to get a best-match score when comparing product titles, or even their descriptions.
Going deeper, think about it from the user's side: why is a product considered the same product? It is the features that make two listings the same product: brand, color, material, and so on.
Make a dictionary of the feature set for each catalogue; those features should match before you declare two listings the same product.
It may then happen that many products share the same feature set; in that case you can again take help from Solr's scoring.
You can also check out an image search API (such as Google's), which can give you an image similarity score. This is helpful for finding matching products in fashion catalogues.
Hope it helps.
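To make the feature-dictionary idea concrete, here is a minimal Java sketch; the feature names and the title-overlap tie-breaker are illustrative assumptions, not a full Solr setup:

```java
import java.util.*;

public class FeatureMatcher {
    // Features that must agree before two listings count as the same product.
    // This set would be per-catalogue; these names are illustrative.
    static final List<String> KEY_FEATURES = List.of("brand", "model", "color");

    /** True when every key feature present in both maps has the same value. */
    static boolean sameProduct(Map<String, String> a, Map<String, String> b) {
        for (String f : KEY_FEATURES) {
            String va = a.get(f), vb = b.get(f);
            if (va != null && vb != null && !va.equalsIgnoreCase(vb)) return false;
        }
        return true;
    }

    /** Token-overlap (Jaccard) score over titles, a stand-in for Solr's scoring. */
    static double titleScore(String t1, String t2) {
        Set<String> s1 = new HashSet<>(Arrays.asList(t1.toLowerCase().split("\\s+")));
        Set<String> s2 = new HashSet<>(Arrays.asList(t2.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(s1);
        inter.retainAll(s2);
        Set<String> union = new HashSet<>(s1);
        union.addAll(s2);
        return union.isEmpty() ? 0.0 : inter.size() / (double) union.size();
    }
}
```

In practice Solr (or Elasticsearch) would replace `titleScore` with a proper relevance score over an analyzed field; the feature-agreement check stays the same.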
I'm not sure on how to approach the following problem, and I'm looking for some guidance:
I have a file which contains a random ad title on each line. What I need to do is classify each title as smartphone or not-smartphone, depending on whether the ad is selling a mobile phone or not.
I'm sorry the file isn't in English, but here is a screenshot showing a little bit of it:
complete file here
Problems I've encountered:
Some ad titles are related to smartphones, but they aren't actually selling phones, only something related to them (an accessory). Example: an ad selling phone cases for the iPhone X.
Some ad titles don't even have the phone brand, only the model. Example: "White Xiaomi Mi Mix 2s Global 64GB" or "J7 Pro 64gb 4g J730".
It would be perfect if there were a way to extract the exact phone model from the title, but since each ad title is formatted differently, I couldn't find a way to do this.
Usually brands produce a variety of products, and smartphones are just one type of product. So when I filter by brand name, it often returns ads which aren't related to smartphones at all (tablets, TVs, chargers, etc.). More filtering would be needed.
Even though I am allowed to use one, I couldn't find a DB with a list of all smartphone models, or I don't know how to retrieve information from one.
What I've thought so far:
If I had access to a database with a large number of smartphone models, I could directly search the file for each model name (for example, "iPhone 5s" or "Moto G6").
I tried using FonoAPI https://fonoapi.freshpixl.com (a smartphone database for looking up data about phones from Java, PHP, etc.) to search for smartphone models from a specific brand, but the API will only return a maximum of 100 results at a time. So in order to use it, I would need to extract the product model name from the title first, and then check whether it is listed in the FonoAPI DB.
So, since each ad title in the file is formatted differently, I'm looking for ideas on how to do this: I couldn't find a way to extract the product model from the title to compare against the FonoAPI database, nor could I get access to some big DB containing a vast number of models to search for directly in the file.
My answer is not very precise and is more a set of ideas I wanted to propose (because I like this problem and would be happy to get the file; it seems impossible to download it from your link).
First, as with all NLP problems, you need to ensure all the text is formatted the same way.
To get a phone model database: I would try to get a database of phone brands, then do web scraping on a selling website. This way you would collect a lot of phone models.
I would try to use some NLP model like LDA, but on reformatted titles (for example, dropping the words beyond a certain distance after "gb" and the phone brands; we can hope all the phone model words sit close to those).
These may be naive ideas, but I wanted to share them (and I cannot comment :D).
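Building on the normalization idea above, here is a small Java sketch of a heuristic model-token extractor. The heuristic itself (a model token mixes letters and digits, and storage/network tokens like "64gb"/"4g" are skipped) is my assumption, not something from the question:

```java
import java.util.*;
import java.util.regex.*;

public class ModelExtractor {
    /** Lowercase, strip punctuation, collapse whitespace — the "same format" step. */
    static String normalize(String title) {
        return title.toLowerCase()
                    .replaceAll("[^a-z0-9 ]", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
    }

    // Heuristic: a candidate model token contains at least one digit (e.g. "j730", "2s").
    static final Pattern MODEL_TOKEN = Pattern.compile("[a-z]*\\d+[a-z0-9]*");

    /** Collects candidate model tokens, skipping storage/network tokens like "64gb" and "4g". */
    static List<String> candidateModels(String title) {
        List<String> out = new ArrayList<>();
        for (String tok : normalize(title).split(" ")) {
            if (tok.endsWith("gb") || tok.endsWith("g")) continue; // skip "64gb", "4g"
            if (MODEL_TOKEN.matcher(tok).matches()) out.add(tok);
        }
        return out;
    }
}
```

The extracted candidates could then be looked up against FonoAPI (or any model list) instead of scanning the whole title.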
I've been using the Yahoo! Finance API to fetch stock prices. The XML in links 1 and 2 has a list of most of the ticker symbols I would need, but they are jumbled and not sorted by the stock markets they are associated with (e.g., I would like to associate Google's ticker symbol 'GOOG' with NASDAQ). Given the total number of companies listed in the XML files, it would be extremely time-consuming to try to associate them with their stock markets manually.
Is there any way we can achieve this using code (preferably Java), or is there a site which does it for us? I have been Googling this for the past couple of days but haven't been able to find anything helpful. Please let me know if any more details are required.
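One approach: some exchanges publish their own listing files (NASDAQ, for example, publishes a pipe-delimited `nasdaqlisted.txt`), and you can join those against the Yahoo! symbols. The exact file format assumed below ("Symbol|Security Name|...") is illustrative; check the actual file before relying on it. A minimal parsing sketch:

```java
import java.util.*;

public class ExchangeMap {
    /**
     * Parses pipe-delimited listing lines ("Symbol|Security Name|...") into a
     * symbol -> exchange map. Header and blank lines are skipped.
     */
    static Map<String, String> parse(List<String> lines, String exchange) {
        Map<String, String> map = new HashMap<>();
        for (String line : lines) {
            if (line.startsWith("Symbol") || line.isBlank()) continue; // skip header
            String symbol = line.split("\\|", 2)[0].trim();
            map.put(symbol, exchange);
        }
        return map;
    }
}
```

You would run this once per exchange file, then look each Yahoo! ticker up in the merged map.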
I have read over the chapter "Learning from clicks" in the book Programming Collective Intelligence and liked the idea: The search engine there learns on which results the user clicked and use this information to improve the ranking of results.
I think it would improve the quality of the search ranking a lot in my Java/Elasticsearch application if I could learn from the user clicks.
In the book, they build a multilayer perceptron (MLP) network to use the learned information even for new search phrases. They use Python with a SQL database to calculate the search ranking.
Has anybody implemented something like this already with Elasticsearch or knows an example project?
It would be great, if I could manage the clicking information directly in Elasticsearch without needing an extra SQL database.
In the field of Information Retrieval (the general academic field of search and recommendations) this is more generally known as Learning to Rank. Whether it's clicks, conversions, or other forms of sussing out what's a "good" or "bad" result for a keyword search, learning to rank uses either a classification or regression process to learn what features of the query and document correlate with relevance.
Clicks?
For clicks specifically, there are reasons to be skeptical that optimizing for clicks is ideal. There's a paper from Microsoft Research (I'm trying to dig it up) that claims that in their case, clicks were only 45% correlated with relevance. Click + dwell is often a more useful general-purpose indicator of relevance.
There's also the risk of self-reinforcing bias in search, as I talk about in this blog article. There's a chance that if you're already showing a user mediocre results, and they keep clicking on those mediocre results, you'll end up reinforcing search to keep showing users mediocre results.
Beyond clicks, there are often domain-specific considerations for what you should measure. For example, classically in e-commerce, conversions matter. Perhaps a search result click that led to a purchase should count more. Netflix famously tries to suss out what it means when you watch a movie for 5 minutes and go back to the menu vs. watch for 30 minutes and exit. Some search use cases are informational: clicking may mean something different when you're researching and clicking many search results vs. when you're shopping for a single item.
So, sorry to say, it's not a silver bullet. I've heard of many successful and unsuccessful attempts at doing Learning to Rank, and it mostly boils down to how successful you are at measuring what your users consider relevant. The difficulty of this problem surprises a lot of people.
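Whatever signals you settle on, at some point they get collapsed into a relevance grade per (query, document) pair for training. A toy Java sketch of that step, with entirely illustrative thresholds:

```java
public class ClickGrader {
    /**
     * Maps a click plus dwell time (seconds) and a conversion flag to a
     * relevance grade 0-3. The thresholds here are illustrative, not tuned.
     */
    static int grade(boolean clicked, long dwellSeconds, boolean converted) {
        if (!clicked) return 0;           // shown but ignored
        if (converted) return 3;          // e.g. a purchase after the click
        if (dwellSeconds >= 30) return 2; // click + meaningful dwell
        return 1;                         // quick bounce: weak signal
    }
}
```

The graded pairs are what a learning-to-rank model is then trained on.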
For Elasticsearch...
For Elasticsearch specifically, there's this plugin (disclaimer: I'm the author), which is documented here. Once you've figured out how to "grade" a document for a specific query (whether from clicks or something more), you can train a model that can then be fed into Elasticsearch via this plugin for your ranking.
What you would need to do is store information about the clicks in a field inside the Elasticsearch index. Every click would result in an update of a document. Since an update action is actually a delete-and-insert (see the Update API), you need to make sure your document text is stored, not only indexed. You can then use a Function Score Query to build a score function reflecting the value stored in the index.
Alternatively, you could store the information in a separate database and use a script function inside the score function to access the database. I wouldn't suggest this solution due to performance issues.
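As a sketch of the function-score idea above: a `field_value_factor` function can fold a stored click count into the score. The field name `clicks` is illustrative, and the JSON is built as a plain string here to avoid tying the example to a specific client library (note the naive `%s` substitution does no escaping):

```java
public class ClickBoostQuery {
    /** Builds a function_score query that scales relevance by log(1 + clicks). */
    static String build(String userQuery) {
        return """
            {
              "query": {
                "function_score": {
                  "query": { "match": { "title": "%s" } },
                  "field_value_factor": {
                    "field": "clicks",
                    "modifier": "log1p",
                    "missing": 0
                  }
                }
              }
            }""".formatted(userQuery);
    }
}
```

`log1p` keeps heavily clicked documents from completely drowning out text relevance, and `missing: 0` handles documents that have never been clicked.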
I get the point of your question: you want to build a learning-to-rank model within the Elasticsearch framework, where the relevance of each doc to the query is computed online. You want to combine the query and the doc to compute the score, so a custom function to compute _score is needed. I am new to Elasticsearch, and I am still looking for a way to solve this problem.
Lucene is a more general search engine which is open to defining your own scorer to compute relevance, and I have developed several applications on it before.
This article describes the basics of customizing a scorer. On Elasticsearch, however, I haven't found related articles. You are welcome to discuss your progress on Elasticsearch with me.
I was looking at a price comparison site like this one. So the question is: how does it know that two products from two different sites are the same product, and how does it club the two into the same bucket to show the price comparison?
If it were only books, I could understand it: all books have a unique ISBN, so just write some website-specific code which fetches data from the websites and compares it.
e.g. you have two websites:
www.xyz.com
www.pqr.com
Now these two websites list their books differently, i.e. the HTML will be different, so parse the HTML and fetch the ISBN and price from it. Then for a given ISBN we can put the two websites' prices side by side. That is simple, but how do you match products which do not have an ID that is unique and uniform across websites (like a pressure cooker, a watch, etc.) the way an ISBN is?
Thanks.
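For the ISBN case the grouping itself is just a map from ISBN to offers. A minimal Java sketch (the `Offer` type is illustrative):

```java
import java.util.*;

public class IsbnGrouper {
    /** One scraped listing: which site, which book, at what price. */
    record Offer(String site, String isbn, double price) {}

    /** Groups offers from different sites under their shared ISBN. */
    static Map<String, List<Offer>> group(List<Offer> offers) {
        Map<String, List<Offer>> buckets = new HashMap<>();
        for (Offer o : offers) {
            buckets.computeIfAbsent(o.isbn(), k -> new ArrayList<>()).add(o);
        }
        return buckets;
    }
}
```

Each bucket then maps directly to one comparison page; the hard part, as the question says, is producing such a key for products without an ISBN-like identifier.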
Other products also have identification numbers; in Europe it is the EAN, which is currently being turned into a global number called the GTIN. In e-commerce, Amazon IDs (ASIN, of which ISBN is a subset) are also often used.
If you don't have these numbers available, which is usually the case, you will need a strategy called Record Linkage or Data Matching.
TL;DR It usually uses a string matching algorithm to find similar "worded" products (using an inverted index on n-grams for example). In the end you can use machine-learning to remove the wrong matches (false-positives). This requires a lot of training data (there are no or too small public datasets available) and thus most of the time a human will check those matches.
For a more detailed analysis of the problem I can only recommend reading the book Data Matching by Peter Christen. It goes deep into information retrieval (how to find similar products) and then how to sort out wrong or right matches using machine-learning (e.g. via structural analysis).
There are also plenty of papers by him available on the net, so check out his scholar profile.
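The inverted index on n-grams mentioned in the TL;DR can be sketched in a few lines of Java. This only covers the candidate-generation (blocking) step; the match/non-match decision on the candidates is where the machine learning or human review comes in:

```java
import java.util.*;

public class NgramIndex {
    // character n-gram -> ids of records containing it
    final Map<String, Set<Integer>> index = new HashMap<>();
    final int n = 3;

    /** Sliding-window character n-grams of a lowercased string. */
    static List<String> ngrams(String s, int n) {
        String t = s.toLowerCase();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= t.length(); i++) out.add(t.substring(i, i + n));
        return out;
    }

    void add(int id, String title) {
        for (String g : ngrams(title, n))
            index.computeIfAbsent(g, k -> new HashSet<>()).add(id);
    }

    /** Candidate records sharing at least one n-gram with the query title. */
    Set<Integer> candidates(String title) {
        Set<Integer> out = new HashSet<>();
        for (String g : ngrams(title, n)) out.addAll(index.getOrDefault(g, Set.of()));
        return out;
    }
}
```

A real system would additionally rank candidates by how many n-grams they share, rather than treating one shared trigram as enough.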
I recently saw a demo of a .NET 3.5 product which has a "Universal Search" widget; i.e., it allowed you to search their entire product, for either your own strings or their strings, and the results were context-sensitive links to different parts of the application.
For example, let's say this was a Point of Sale system, you could search it for "Burger" and find:
Employee "John Burgers"
Menu Items "CheeseBurger", and "Burger"
Report "Burger Sales"
Etc...
It was a pretty neat "one search box to rule them all" type control.
We'd love to throw something similar in our product, which is a Java web-app ... just not even sure where to start. Any ideas?
Hmm, one idea comes to mind. For the various categories in your product, have a "search agent". E.g., let's say you have the following categories (or modules) in your app:
Preferences
People (both people and companies)
Reporting
Tags (this is a good option for searching; tag most entities in your app)
Each of these will have a search agent. These search agents will register with the universal search widget's backend. (You can offer options on where to search; that panel would show when the user clicks on advanced search. By default the search would cover the entire app.)
Upon a search, the widget will ask each of these agents to search its own category and then collate the results.
Of course there could be other ideas; this is just one of them. You will have to think about its pros and cons, like how it will impact your DB, etc.
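A minimal Java sketch of the register-and-collate shape described above (all names illustrative; in a real app results would carry links, not plain strings):

```java
import java.util.*;

public class UniversalSearch {
    /** Each module (Preferences, People, Reporting, Tags, ...) implements this. */
    interface SearchAgent {
        String category();
        List<String> search(String query);
    }

    private final List<SearchAgent> agents = new ArrayList<>();

    void register(SearchAgent agent) { agents.add(agent); }

    /** Asks every registered agent, then collates non-empty results by category. */
    Map<String, List<String>> search(String query) {
        Map<String, List<String>> collated = new LinkedHashMap<>();
        for (SearchAgent a : agents) {
            List<String> hits = a.search(query);
            if (!hits.isEmpty()) collated.put(a.category(), hits);
        }
        return collated;
    }
}
```

The advanced-search panel then just becomes a filter on which registered agents get asked.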