I came across this Python library https://pypi.org/project/weighted-levenshtein/ which lets you specify different costs/weights for different operations (insertion, substitution, deletion and transposition), which is very helpful for detecting and correcting keystroke errors.
I have been searching through Lucene's fuzzy search, which uses Damerau-Levenshtein distance, to see whether something like this is supported, i.e. specifying different costs/weights for different operations, but I have not been able to find anything.
Please let me know if there is a way to specify custom costs/weights within Lucene fuzzy search.
Thanks in advance!
To accomplish this you would have to extend and/or edit Lucene code. To support fuzzy matching, Lucene compiles an Automaton using the LevenshteinAutomata class, which implements the Levenshtein-automata construction algorithm and not only doesn't support edit weights, but also only supports matching for 0 to 2 edits.
How one might edit this algorithm to produce an automaton that supports weighted edits is beyond my knowledge, but it could be worth a try, as it would keep your customization simple (you would only have to override the getAutomaton method) and would (theoretically) keep performance consistent.
The alternative would be to forgo the idea of an automaton to support fuzzy matching and simply implement a weighted levenshtein algorithm, like the one you have linked to, directly in the actual fuzzy match check. By doing this, however, you could pay a rather high performance cost depending on the nature of the fuzzy queries you handle and the content of your index.
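For reference, here is a rough sketch of what a weighted edit distance could look like in Java. The costs here are uniform per operation (the weighted-levenshtein library additionally supports per-character cost matrices and transpositions), and the class and method names are just placeholders, not anything from Lucene or that library.

```java
// Minimal sketch of a weighted Levenshtein distance with per-operation costs.
public final class WeightedLevenshtein {

    public static double distance(String a, String b,
                                  double insertCost,
                                  double deleteCost,
                                  double substituteCost) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            d[i][0] = i * deleteCost;                // delete all of a's prefix
        }
        for (int j = 1; j <= b.length(); j++) {
            d[0][j] = j * insertCost;                // insert all of b's prefix
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : substituteCost;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,        // substitute (or match)
                           Math.min(d[i - 1][j] + deleteCost,    // delete from a
                                    d[i][j - 1] + insertCost));  // insert into a
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Penalize substitutions more heavily than insertions/deletions.
        System.out.println(distance("kitten", "sitting", 1.0, 1.0, 2.0));
    }
}
```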
Recently I've been assigned to build a translation memory (TM) for a new project. The idea is that the TM is a cache layer on top of the RPC layer, which calls the Google Translate API when there is no match in the TM. I'm considering using the source text as the key in the TM, and I need a fuzzy matching algorithm to match a query text against the keys in the TM. If the score is higher than some threshold, say 0.85 (on a scale of 0 to 1), the cached translated text will be used instead of calling the Google service.
I've read a lot of articles/blogs/papers, but still don't know where to start.
TF-IDF + cosine similarity doesn't seem good enough? Levenshtein distance?
What about semantic similarity? But how?
I read about this
In the comments, @mbatchkarov seems to point in the right direction.
Does anyone have similar experience with this? Any suggestions are welcome.
A lot of the time the accepted answer to the question you linked to can get you quite far. You can compare the word (lemma) overlap between a query and all queries in the cache. To improve performance, you can incorporate word similarity to help you link semantically similar words. The thesaurus-building software I linked to in my answer is BSD-licensed, so you are free to use it as you see fit. If you need any help using it, the developers (disclaimer: I am a part of the team) will be happy to help out. In fact, I've got a few pre-built thesauri lying around. These should probably be a part of the software, but they are too large to upload to GitHub.
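To make the word-overlap idea concrete, here is a minimal sketch in Java. It uses plain lowercased tokens in place of lemmas (you would swap in a real lemmatizer, e.g. from OpenNLP or Stanford CoreNLP), Jaccard overlap as the score, and the 0.85 threshold from your question as a tunable parameter; all the names are placeholders.

```java
import java.util.*;

// Sketch of a TM lookup that matches a query against cached source texts by token overlap.
public final class OverlapMatcher {

    static Set<String> tokens(String text) {
        // Stand-in for real lemmatization: lowercase and split on non-word characters.
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Jaccard overlap: |A ∩ B| / |A ∪ B|, in the range [0, 1].
    static double overlap(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb);
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

    // Return the cached translation whose source text overlaps the query above
    // the threshold, or null to fall back to the translation API.
    static String lookup(Map<String, String> tm, String query, double threshold) {
        for (Map.Entry<String, String> e : tm.entrySet()) {
            if (overlap(query, e.getKey()) >= threshold) {
                return e.getValue();
            }
        }
        return null;
    }
}
```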
Whichever approach you go for, be aware that there will be many cases where this does not work well. This is because the approaches discussed in that question are about semantic similarity, and your application may require semantic equivalence. For example, "I like big ginger cats" and "We like big ginger cats" or "We like small ginger cats" are very similar in meaning, but it would be wrong to use the translation of one as a translation of the other.
I have a requirement to associate math terms that come under a common topic. For example, angles, cos, tan, etc. should relate to trigonometry. So when a user searches for angles, triangles, etc., the search should present results related to trigonometry as well. Can anyone provide leads on how to do this in Apache Lucene?
There is a classification API in Lucene which includes k-nearest neighbors and naive Bayes models.
You would first use the train() method with your training set. Once the classifier is trained, use the assignClass() method to classify a given string.
For a training set you could use Wikipedia pages for your given classes.
After you give those two a try you could make use of the Classifier interface to build a competing model.
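To illustrate, here is a rough sketch against the Lucene 5.x-style classification API with train()/assignClass(); the field names ("text", "topic") and the index contents are placeholders, and newer Lucene versions configure the classifiers through the constructor instead of a train() call.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

public class TopicClassifierSketch {

    // leafReader points at an index whose documents have a "text" field (e.g. the body
    // of a Wikipedia page) and a "topic" field (e.g. "trigonometry") used as the label.
    public static String classify(LeafReader leafReader, String query) throws Exception {
        SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
        classifier.train(leafReader, "text", "topic", new StandardAnalyzer());
        ClassificationResult<BytesRef> result = classifier.assignClass(query);
        return result.getAssignedClass().utf8ToString();  // e.g. "trigonometry"
    }
}
```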
If you already know the associations, you can just add them to the index for the specific terms -- i.e. indexing 'cos' as 'cos', 'trigonometry'.
Also if you know the associations, you could index the parent term and all of the sibling terms -- i.e. indexing 'cos' as 'trigonometry', 'cos', 'sin', etc. This sounds more like what you want.
In addition to @Josh S.'s good answer, you can also take a more direct approach of generating your own synonyms dictionary, e.g. see Match a word with similar words using Solr?
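For illustration, here is a small sketch of that synonym-style indexing using Lucene's SynonymMap and SynonymGraphFilter (available in recent Lucene versions; older ones use SynonymFilter). The term/topic pairs are just the examples from this thread; in practice you would load the dictionary from a file rather than hard-coding it.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

// Analyzer that indexes each math term together with its parent topic.
public class TopicSynonymAnalyzer extends Analyzer {

    private final SynonymMap synonyms;

    public TopicSynonymAnalyzer() throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        // Each term also emits its topic at the same token position (keepOrig = true).
        builder.add(new CharsRef("cos"), new CharsRef("trigonometry"), true);
        builder.add(new CharsRef("tan"), new CharsRef("trigonometry"), true);
        builder.add(new CharsRef("angles"), new CharsRef("trigonometry"), true);
        this.synonyms = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream filtered = new SynonymGraphFilter(source, synonyms, true);
        return new TokenStreamComponents(source, filtered);
    }
}
```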
I would like to use Lucene to write my own search engine. Because I work with spatial information, I would like to try some index structures that are more suitable for spatial data. As far as I know, there is no alternative structure available in Lucene itself, and LGTE (a Lucene extension for geo-temporal data) does not seem to let you access other structures either.
Did I just not see other structures or do I have to implement them?
The direct and simple answer to the title of your question, "can you use another index structure", is that you can't -- at least not if it would have a different API than Lucene's. In a nutshell, Lucene's index is fundamentally a sorted mapping of bytes to doc IDs, optionally with postings (position offsets for a document), which in turn may optionally carry "payloads" (arbitrary bytes per posting).
That said, I suppose you could implement a so-called Lucene Codec (new in Lucene 4.x) that has its own extended API, and search against a field that assumes your specific Codec. Codecs are envisioned to have different implementations of Lucene's APIs (e.g. balancing what's in memory vs. on disk, when to cleverly compress/encode vs. directly represent data) but not to introduce a different API as well. But I suppose you could.
The context beyond the title of your question is that you want to do this for spatial/temporal data because, it appears to me, you don't believe Lucene's index is sufficiently suitable. I strongly disagree. There have been great strides in Lucene spatial recently, and it has more to go still. For example, many non-relational databases will only let you index point data, but Lucene spatial can handle any Spatial4j shape (a dependency), and with JTS (another dependency) that covers basically all the typical shapes most people want. And there are scalable recursive algorithms for matching indexed shapes via the Intersects, Within, Contains, and Disjoint predicates. I expect some big performance enhancements by end of summer or end of year at the latest. You may find this very recent post I responded to on the Solr-user list interesting:
http://lucene.472066.n3.nabble.com/Multi-dimensional-spatial-search-tt4062515.html#a4062646
So instead I propose that you help me improve the parts of Lucene spatial that need it for whatever system you are building. Perhaps it already fits the bill.
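To give a feel for the existing API, here is a rough sketch of indexing a point and building an Intersects query with RecursivePrefixTreeStrategy over a geohash grid, roughly as it stood in the Lucene 4.x spatial module; the field name "location", the grid depth, and the Spatial4j package names are version-dependent details, so treat this as an outline rather than exact code.

```java
import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.shape.Shape;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Query;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
import org.apache.lucene.spatial.query.SpatialArgs;
import org.apache.lucene.spatial.query.SpatialOperation;

public class SpatialIndexSketch {

    private final SpatialContext ctx = SpatialContext.GEO;
    private final RecursivePrefixTreeStrategy strategy =
            new RecursivePrefixTreeStrategy(new GeohashPrefixTree(ctx, 11), "location");

    // Add a point (or any Spatial4j shape) to a document before indexing it.
    public Document makeDoc(double lon, double lat) {
        Document doc = new Document();
        Shape point = ctx.makePoint(lon, lat);
        for (Field f : strategy.createIndexableFields(point)) {
            doc.add(f);
        }
        return doc;
    }

    // Build an "intersects this circle" query (radius given in degrees here).
    public Query circleQuery(double lon, double lat, double radiusDegrees) {
        Shape circle = ctx.makeCircle(lon, lat, radiusDegrees);
        return strategy.makeQuery(new SpatialArgs(SpatialOperation.Intersects, circle));
    }
}
```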
I'm looking for a program or library in Java capable of finding non-random properties of a byte sequence. Something that, when given a huge file, runs some statistical tests and reports whether the data shows any regularities.
I know three such programs, but not in Java. I tried all of them, but they don't really seem to work for me (which is quite surprising as one of them is by NIST). The oldest of them, diehard, works fine, but it's a bit hard to use.
As some of the commenters have stated, this is really an expert mathematics problem. The simplest explanation I could find for you is:
Run Tests for Non-randomness
Autocorrelation
It's interesting, but as it uses 'heads or tails' to simplify its example, you'll find you need to go much deeper to apply the same theory to encryption / cryptography etc - but it's a good start.
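As a concrete starting point, here is a small sketch in Java of one of the simplest such checks, a Wald-Wolfowitz runs test over the bit sequence of the data. A |z| score much larger than about 3 suggests the run structure is unlikely for random bits. This is only one of the many tests that suites like the NIST STS or diehard run, and the class and method names are just placeholders.

```java
public final class RunsTest {

    // Returns the z-score of the observed number of runs of identical bits.
    public static double runsZScore(byte[] data) {
        long ones = 0, zeros = 0, runs = 0;
        int prev = -1;
        for (byte b : data) {
            for (int i = 7; i >= 0; i--) {
                int bit = (b >> i) & 1;
                if (bit == 1) ones++; else zeros++;
                if (bit != prev) runs++;   // a new run starts whenever the bit flips
                prev = bit;
            }
        }
        double n = ones + zeros;
        double mu = 2.0 * ones * zeros / n + 1.0;                      // expected number of runs
        double var = 2.0 * ones * zeros * (2.0 * ones * zeros - n)     // variance of the run count
                     / (n * n * (n - 1.0));
        return (runs - mu) / Math.sqrt(var);                           // ~N(0,1) for random data
    }

    public static void main(String[] args) {
        byte[] sample = new byte[1 << 20];
        new java.util.Random().nextBytes(sample);
        System.out.println("z = " + runsZScore(sample));   // should be near 0 for random bytes
    }
}
```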
Another approach would be using Fuzzy logic. You can extract fuzzy associative rules from sets of data. Those rules are basically implications in the form:
if A then B, interpreted for example "if 01101 (is present) then 1111 (will follow)"
Googling "fuzzy data mining"/"extracting fuzzy associative rules" should yield you more than enough results.
Your problem domain is quite huge, actually, since this is what data/text mining is all about. That, and statistical & combinatorial analysis, just to name a few.
About a program that does that - take a look at this.
Not so much an answer to your question as to your comment that "any observable pattern is bad". That got me thinking that randomness wasn't the problem but rather observable patterns, and to tackle this problem surely you need observers. So, in short, just set up a website and crowdsource it.
Some examples of this technique applied to colour naming: http://blog.xkcd.com/2010/05/03/color-survey-results/ and http://www.hpl.hp.com/personal/Nathan_Moroney/color-name-hpl.html
I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into sets of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality is different. This sort of technique seems to be used in full-text searches, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some amazing work with finite state machines, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like, and just splitting on whitespace doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non-stop-words of the two documents might not be the best approach for measuring similarity, depending on what you are actually trying to do, because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms that let you compare many documents at the same time this way, much more efficiently than doing a pairwise comparison between each pair of documents.
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
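If you do end up leaning on a library rather than rolling your own, here is a minimal sketch using Lucene's EnglishAnalyzer (which lowercases, drops stop words, and applies a Porter-style stemmer) to compare the stemmed term sets of two documents. The field name and the overlap measure (shared stems divided by the stems of the first document) are arbitrary choices, not anything prescribed by Lucene.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemmedOverlap {

    // Run the text through the analyzer and collect the stemmed terms it emits.
    static Set<String> stems(Analyzer analyzer, String text) throws Exception {
        Set<String> result = new HashSet<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                result.add(term.toString());
            }
            ts.end();
        }
        return result;
    }

    // Percentage of docA's stemmed keywords that also appear in docB.
    public static double matchPercentage(String docA, String docB) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer();
        Set<String> a = stems(analyzer, docA);
        Set<String> b = stems(analyzer, docB);
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return a.isEmpty() ? 0.0 : 100.0 * shared.size() / a.size();
    }
}
```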
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at, and you can go from there.