I recently saw a demo of a .NET 3.5 product which has a "Universal Search" widget ... i.e., it allowed you to search their entire product for either your own strings or their strings, and the results were context-sensitive links to different parts of the application.
For example, let's say this was a Point of Sale system: you could search it for "Burger" and find:
Employee "John Burgers"
Menu Items "CheeseBurger", and "Burger"
Report "Burger Sales"
Etc...
It was a pretty neat "one search box to rule them all" type control.
We'd love to throw something similar into our product, which is a Java web app ... we're just not sure where to start. Any ideas?
Hmm, one idea comes to mind: have a "search agent" for each of the various categories in your product. E.g., let's say you have the following categories (or modules) in your app:
Preferences
People (both people and company)
Reporting
Tags (this is a good option for searching; tag most entities in your app)
Each of these will have a search agent. These search agents will register with the universal search widget's backend. (You can offer options for where to search; this panel would show when the user clicks on advanced search. By default the search would cover the entire app.)
Upon a search, the widget will ask each of these agents to search within its own category and then collate the results.
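A minimal sketch of that pattern in Java might look like this (all class and method names here are made up for illustration, not taken from any real product):

```java
import java.util.ArrayList;
import java.util.List;

// One agent per category/module; each knows how to search its own data.
interface SearchAgent {
    String category();                      // e.g. "People", "Reporting"
    List<SearchResult> search(String query);
}

// A result carries enough context to render a deep link into the app.
record SearchResult(String category, String label, String url) {}

class UniversalSearch {
    private final List<SearchAgent> agents = new ArrayList<>();

    void register(SearchAgent agent) {
        agents.add(agent);
    }

    // Ask every registered agent to search its own category, then collate.
    List<SearchResult> search(String query) {
        List<SearchResult> results = new ArrayList<>();
        for (SearchAgent agent : agents) {
            results.addAll(agent.search(query));
        }
        return results;
    }
}
```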
Of course there could be other ideas; this is just one of them. You will have to think about its pros and cons, like how it will impact your DB, etc.
I'm not sure on how to approach the following problem, and I'm looking for some guidance:
I have a file which contains a random ad title on each line. What I need to do is classify each title as smartphone or not-smartphone, depending on whether the ad is selling a mobile phone or not.
I'm sorry the file isn't in English, but here is a screenshot showing a little bit of it:
complete file here
Problems I've encountered:
Some ad titles are related to smartphones, but they aren't actually selling phones, just something related to them (an accessory). Example: an ad selling phone cases for the iPhone X.
Some ad titles don't even have the phone brand, only the model. Example: "White Xiaomi Mi Mix 2s Global 64GB" or "J7 Pro 64gb 4g J730".
It would be perfect if there were a way to extract the exact phone model from the title, but since each ad title is formatted differently, I couldn't find a way to do this.
Usually brands produce a variety of products, and smartphones are just one type of product. So when I filter by the brand name, it often returns ads which aren't related to smartphones at all (tablets, TVs, chargers, etc.), so more filtering would be needed.
Even though I am allowed to use one, I couldn't find a DB with a list of all smartphone models, and I don't know how to retrieve information from the ones that exist.
What I've thought so far:
If I had access to a database with a large number of smartphone models, I could directly search the file for each model name (example: "iPhone 5s" or "Moto G6").
I tried using FonoAPI https://fonoapi.freshpixl.com (a smartphone database for looking up data about phones from Java, PHP, etc.) to search for smartphone models from a specific brand, but the API will only return a maximum of 100 results at a time. So in order to use it, I would first need to extract the product model name from the title, so I can check whether it is listed in the FonoAPI DB.
Since each ad title in the file is formatted differently, I'm looking for ideas on how to do this: I couldn't find a way to extract the product model from the title to compare against the FonoAPI database, nor could I get access to a big DB containing a vast quantity of models to look them up directly in the file.
My answer is not very precise and is more a set of ideas I wanted to propose (I like this problem and would be happy to get the file; it seems impossible to get it from your link).
First, as with all NLP problems, you need to ensure all the text is formatted the same way.
To get a phone model database: I would try to get a database of phone brands, then go to a selling website and do web scraping. This way you would get a lot of phone models.
I would also try an NLP model like LDA, but on reformatted input (for instance, throwing away the words beyond some distance after "GB" and after the phone brand; we can hope the model names all sit close to those anchor words).
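As a rough illustration of that anchor idea, here is a hedged Java sketch that pulls candidate model strings from the tokens right after a known brand name; the brand list and the three-token window are assumptions to tune, not a tested extractor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModelCandidateExtractor {
    // Assumed brand list; in practice this would come from a scraped brand database.
    private static final String BRANDS = "iphone|xiaomi|samsung|motorola|moto|lg|sony";

    // Capture up to three tokens following a known brand, e.g. "xiaomi mi mix 2s".
    private static final Pattern NEAR_BRAND =
            Pattern.compile("(?i)\\b(" + BRANDS + ")\\b((?:\\s+[\\w-]+){1,3})");

    public static List<String> candidates(String title) {
        List<String> found = new ArrayList<>();
        Matcher m = NEAR_BRAND.matcher(title);
        while (m.find()) {
            found.add((m.group(1) + m.group(2)).toLowerCase().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [xiaomi mi mix 2s] — a candidate to look up against a model DB.
        System.out.println(candidates("White Xiaomi Mi Mix 2s Global 64GB"));
    }
}
```

Each candidate could then be checked against FonoAPI (or whatever model database you end up with) instead of scanning the whole title.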
These may be naive ideas, but I wanted to share them (and I cannot comment yet :D).
I have read the chapter "Learning from clicks" in the book Programming Collective Intelligence and liked the idea: the search engine there learns which results the user clicked on and uses this information to improve the ranking of results.
I think it would improve the quality of the search ranking a lot in my Java/Elasticsearch application if I could learn from the user clicks.
In the book, they build a multilayer perceptron (MLP) network to use the learned information even for new search phrases. They use Python with a SQL database to calculate the search ranking.
Has anybody implemented something like this already with Elasticsearch or knows an example project?
It would be great, if I could manage the clicking information directly in Elasticsearch without needing an extra SQL database.
In the field of Information Retrieval (the general academic field of search and recommendations) this is more generally known as Learning to Rank. Whether it's clicks, conversions, or other forms of sussing out what's a "good" or "bad" result for a keyword search, learning to rank uses either a classification or regression process to learn what features of the query and document correlate with relevance.
Clicks?
For clicks specifically, there are reasons to be skeptical that optimizing for clicks is ideal. There's a paper from Microsoft Research I'm trying to dig up that claims that, in their case, clicks were only 45% correlated with relevance. Click+dwell is often a more useful general-purpose indicator of relevance.
There's also the risk of self-reinforcing bias in search, as I talk about in this blog article. There's a chance that if you're already showing a user mediocre results, and they keep clicking on those mediocre results, you'll end up reinforcing search to keep showing users mediocre results.
Beyond clicks, there are often domain-specific considerations for what you should measure. For example, classically in e-commerce, conversions matter. Perhaps a search result click that led to a purchase should count more. Netflix famously tries to suss out what it means when you watch a movie for 5 minutes and go back to the menu vs 30 minutes and exit. Some search use cases are informational: clicking may mean something different when you're researching and clicking many search results vs when you're shopping for a single item.
So, sorry to say, it's not a silver bullet. I've heard of many successful and unsuccessful attempts at doing Learning to Rank, and it mostly boils down to how successful you are at measuring what your users consider relevant. The difficulty of this problem surprises a lot of people.
For Elasticsearch...
For Elasticsearch specifically, there's this plugin (disclaimer: I'm the author), which is documented here. Once you've figured out how to "grade" a document for a specific query (whether it's clicks or something more), you can train a model that can then be fed into Elasticsearch via this plugin for your ranking.
What you would need to do is store information about the clicks in a field inside the Elasticsearch index. Every click would result in an update of a document. Since an update action is actually a delete and re-insert (see the Update API), you need to make sure your document text is stored, not only indexed. You can then use a Function Score Query to build a score function reflecting the value stored in the index.
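For example, a sketch with the Elasticsearch Java API might look like the following. The exact builder names vary between client versions, so treat this as illustrative; it assumes a numeric "clicks" field in each document:

```java
import org.elasticsearch.common.lucene.search.function.FieldValueFactorFunction;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

public class ClickBoostQuery {
    // Boost the text-relevance score by log(1 + clicks), read from the stored "clicks" field.
    public static QueryBuilder build(String userQuery) {
        return QueryBuilders.functionScoreQuery(
                QueryBuilders.matchQuery("title", userQuery),
                ScoreFunctionBuilders.fieldValueFactorFunction("clicks")
                        .modifier(FieldValueFactorFunction.Modifier.LOG1P)
                        .factor(1.0f)
                        .missing(0)); // documents with no clicks yet score neutrally
    }
}
```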
Alternatively, you could store the information in a separate database and use a script function inside the score function to access the database. I wouldn't suggest this solution due to performance issues.
I get the point of your question. You want to build a learning-to-rank model within the Elasticsearch framework, where the relevance of each doc to the query is computed online. You want to combine the query and the doc to compute the score, so a custom function to compute _score is needed. I am new to Elasticsearch, and I'm still looking for a way to solve this problem.
Lucene is a more general search engine that lets you define your own scorer to compute relevance, and I have developed several applications on it before.
This article describes the basic idea behind customizing a scorer. However, for Elasticsearch I haven't found related articles. You're welcome to discuss your progress on Elasticsearch with me.
I am planning to create an affiliate site (Price Comparison site).
As you all know, data (products and their info) from different e-commerce sites plays a vital role in this type of price comparison site.
I have already written scripts to scrape product data from the sites of my interest, and they are working as expected.
In more detail, I am scraping the following common parameters and storing them in my DB:
1) product title, 2) product description, 3) price, 4) payment modes, etc.
[FYI: I used the jsoup API to scrape the data]
PROBLEM STARTS HERE:
I want to group identical products from the different sources I scraped from these sites.
To illustrate my question:
Say XYZ is a product sold on 5 different sites, each with some variation in its PRODUCT TITLE.
I scraped data from these 5 sites and saved it to my DB. Now, how should I effectively group these products into a single group, so that I can show the 5 different sources on a single page of my site?
I do not have any clue how I should proceed.
[String comparison is the first thought that comes to mind, but I don't think it will work in the long run.]
Any suggestions / recommendations are welcome and appreciated.
If you require any further information, please do not hesitate to add comments.
-JS
In the initial phase you can use Solr to get the best match score when comparing product titles, or even their descriptions.
Going deeper, think about it from the user's side: why is a product considered the same product? It is the features that make products the same, like brand, color, material, and so on.
Make a dictionary of feature sets per catalogue; those features should match before you declare any two products the same.
It may then happen that many products share the same feature set and still need to be told apart; in that case you can use Solr scoring to help...
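Before bringing in Solr, even a simple token-overlap measure can give you a first grouping pass. Here is a hedged Java sketch using Jaccard similarity over title tokens; the 0.6 threshold is an arbitrary assumption you would tune on real data:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TitleSimilarity {
    private static Set<String> tokens(String title) {
        return new HashSet<>(Arrays.asList(title.toLowerCase().split("\\W+")));
    }

    // Jaccard similarity: |intersection| / |union| of the two titles' token sets.
    public static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb); // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

    // Treat two scraped listings as the same product above a tuned threshold.
    public static boolean sameProduct(String titleA, String titleB) {
        return jaccard(titleA, titleB) >= 0.6; // assumed threshold; tune on real data
    }
}
```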
Moreover, you can check the Google image search API, which can ultimately help with image similarity scoring. This will be helpful in finding common products for fashion catalogues.
Hope it will help...
I was wondering if you know any algorithm that can do an automatic assignment for the following situation: I have some papers with some keywords defined, and some reviewers who have specific keywords defined. How could I do an automatic mapping, so that each reviewer reviews the papers from his/her area of interest?
If you are open to using external tools, Lucene is a library that will allow you to search text based on (from their website):
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
You will basically need to design your own parser, or specialize an existing one to your needs. You need to scan the papers and, according to your keywords, search and match tokens accordingly. The sentences containing these keywords can then be separated out and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword you need will fall under some part of speech. You can then just tag your complete document, search for those tags, and sort out the sentences accordingly.
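A minimal sketch with Stanford's MaxentTagger might look like this; the model path shown is the one usually bundled with the CoreNLP models jar, so adjust it for your setup:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class KeywordTagging {
    public static void main(String[] args) {
        // Assumed model path from the standard Stanford distribution; change if yours differs.
        MaxentTagger tagger = new MaxentTagger(
                "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");

        // Tags each token with its part of speech, e.g. "ranking_NN improves_VBZ".
        String tagged = tagger.tagString("machine learning improves search ranking");
        System.out.println(tagged);
    }
}
```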
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
It offers a lot of very interesting features, like filters and analyzers. You can, for example:
remove the stop words depending on the language of the documents (e.g. for English: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc., are treated as a single term);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask me for code samples if Lucene is the way you choose :)
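To give a flavor of it, here is a minimal sketch; Lucene's API moves between versions (RAMDirectory, for instance, is deprecated in recent releases), so treat this as illustrative rather than copy-paste ready:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;

public class PaperMatcher {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        RAMDirectory dir = new RAMDirectory(); // in-memory index

        // Index each paper's keywords as a searchable text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document paper = new Document();
            paper.add(new TextField("title", "Learning to Rank with Clicks", Field.Store.YES));
            paper.add(new TextField("keywords", "information retrieval ranking clicks", Field.Store.YES));
            writer.addDocument(paper);
        }

        // Query with a reviewer's keywords; top-scoring hits are candidate assignments.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("keywords", analyzer);
            for (ScoreDoc hit : searcher.search(parser.parse("ranking retrieval"), 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title") + " score=" + hit.score);
            }
        }
    }
}
```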
A market we are building allows people to list their stuff to sell, but in batches / bags / boxes. We are looking to build a recommendation engine for this, but most of the articles out there seem better suited to markets that "sell" large quantities of many products - i.e. Amazon, Netflix, etc. Because every listing is somewhat unique, what is the best approach for a recommendation engine? Any relevant articles out there?
We know items people have bought in the past.
We know the size or age appropriateness they are looking for.
The listed bundles have categories, brands, sizes/ages, colors and free form text.
Any ideas to help us get started? Any particular language you think would be best if our data is stored in MySQL?
There are several things you can filter with a recommendation engine. You can filter on what a particular user has bought before (in your case, which features have been present in the products they have bought). You can also filter on social groupings--users like them, or on product groupings--other products like the ones you have sold before. I'd recommend that you first cluster the products, and then map the individual or groups to the features in that cluster of products. So, you'll end up with a recommendation engine that says: people who bought items with this feature also bought items with these features. Then, you can create an engine for known users: you tend to buy products with these features, here are some more items like those. Finally, you can build an engine for groups: people like this tend to buy products with these features.
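To make that concrete, here is a toy Java sketch of the "people who bought items with this feature also bought items with these features" counting; the data shapes are assumptions for illustration, and a real system would normalize counts rather than use raw totals:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FeatureCooccurrence {
    // cooccur.get(x).get(y) = how often features x and y appeared in one user's purchases
    private final Map<String, Map<String, Integer>> cooccur = new HashMap<>();

    // Each call feeds in one user's purchase history, flattened to item features.
    public void addUserHistory(List<String> features) {
        for (String x : features) {
            for (String y : features) {
                if (!x.equals(y)) {
                    cooccur.computeIfAbsent(x, k -> new HashMap<>())
                           .merge(y, 1, Integer::sum);
                }
            }
        }
    }

    // Features most often bought alongside the given one, highest count first.
    public List<String> alsoBoughtWith(String feature) {
        return cooccur.getOrDefault(feature, Map.of()).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```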
With several models in hand, your system can turn to the appropriate one, depending on what they know at the moment: known user, known user group, or just known browsing history.
Since you are recommending batches of more unique products, you'll want to add an additional model after you get your recommendations that will filter out inappropriate recommendations. This model will represent compatibility. A new game using the same console that the user used before is more compatible than another console. If they bought a new car last month, you wouldn't recommend a new car, but maybe a package of ten car washes.
You could use several different concepts for this last model. If you are going to add explicit knowledge to your model that's in people's heads, you may want to build a belief network that filters out inappropriate recommendations. If you are going to use collective intelligence, you could use simple regression, a support vector machine, or an artificial neural network. I would go with the easiest to implement filter and not worry about choosing the first model you build. You'll probably build a handful of models before you settle on one giving you good results with appropriate effort.
Your filtering model will go through a test phase where you make a recommendation, filter it for appropriateness, then filter it again with some sort of human intervention--a set of "answers" you want your filter to learn, or just a human being double-checking results. Then you'll retrain your filter with the updated results, resample and test again.
As far as the recommendation engine goes, you can do SVD with the GNU scientific library (bindings available for about any platform). You could also choose the Mahout recommendation engine (part of the Hadoop world) if you are going to be using big data. For the filter, you may want to look at apophenia, libsvm, or FANN.
You could also choose to work in an analytics framework for a while until you feel like you've got a handle on things. Some to choose from are Weka, R, Octave, Matlab, Maple, and Mathematica. I think I've listed those in terms of price first, then ease of use.
As far as resources, there are a few good introductory books: Collective Intelligence, Mahout (MEAP from Manning), Data Mining (all about Weka), and Modeling with Data (apophenia reference).
My last thought is that however sophisticated you do or don't get with your recommendation engine, most of the value is in the user experience. One of the people from Amazon wrote that their recommendation engines worked best when they told the user why they were making a recommendation. That helps the user quickly adopt your reasoning (an emotive response to their old and good purchase), or reject it and keep going (they already have something like that, they don't need another one).
Personally I prefer Ruby, but Ruby, Python and Perl can easily connect to MySQL.
One of the reasons I like Ruby is its Sequel gem, which is a very powerful ORM, making database access very easy to manage. If you go with an MVC framework, Ruby has Rails, which favors ActiveRecord as its ORM and also makes it easy to talk to MySQL. There are also Sinatra and Padrino, which are lighter-weight web frameworks, but very capable too. They're more DB-agnostic out of the box and integrate nicely with Sequel.