Text similarity function for strict document similarity - java

I'm writing a piece of java software that has to make the final judgement on the similarity of two documents encoded in UTF-8.
The two documents are very likely to be the same, or slightly different from each other, because they have many features in common like date, location, creator, etc., but their text is what decides whether they really are the same.
I expect the text of the two documents to be either very similar or not at all, so I can be rather strict about the threshold to set for similarity. For example I could say that the two documents are similar only if they have 90% of their words in common, but I would like to have something more robust, which would work for texts short and long alike.
To sum it up I have:
two documents, either very similar or not similar at all, but:
it is more likely for the two documents to be similar than not
documents can be both long (some paragraphs) and short (a few sentences)
I've experimented with simmetrics, which has a large array of string matching functions, but I'm most interested in suggestions about possible algorithms to use.
Possible candidates I have are:
Levenshtein: its output is more significant for short texts
overlapping coefficient: maybe, but will it discriminate well for documents of different length?
Also, considering two texts similar only when they are exactly the same would not work well, because I'd like documents that differ only by a few words to pass the similarity test.

Levenshtein is appropriate for the edit distance between two words; if you are comparing documents, something like diff will probably be more along the lines of what you need.
I would start here: http://c2.com/cgi/wiki?DiffAlgorithm. They provide links to a number of diff-style algorithms you can look into.

Levenshtein distance is used to compare two words. For documents, popular approaches are cosine similarity or Latent Semantic Analysis.
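For illustration, a rough sketch of cosine similarity over raw term-frequency vectors (no TF-IDF weighting or stemming); the class and method names are only illustrative:

import java.util.*;

// Cosine similarity over raw term-frequency vectors (a minimal sketch).
public class CosineSimilarity {

    private static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    public static double similarity(String a, String b) {
        Map<String, Integer> tfA = termFrequencies(a);
        Map<String, Integer> tfB = termFrequencies(b);

        double dot = 0.0;
        for (Map.Entry<String, Integer> e : tfA.entrySet()) {
            dot += e.getValue() * tfB.getOrDefault(e.getKey(), 0);
        }
        double normA = Math.sqrt(tfA.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(tfB.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }
}

A score close to 1.0 means the two documents use almost the same vocabulary with similar frequencies, which fits the "either very similar or not at all" setting described in the question.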

Levenshtein distance is the standard measure for a reason: it's easy to compute and easy to grasp the meaning of. If you are wary of the number of characters in a long document, you can just compute it on words or sentences or even paragraphs instead of characters. Since you expect the similar pairs to be very similar, that should still work well.

Levenshtein seems to be the best solution here. If you are trying to get a weighted similarity ranking - which I guess is the case because you mentioned that the output of Levenshtein is more significant for shorter texts - then just weight the result of the Levenshtein algorithm by dividing by the number of characters in the document.
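Combining the two suggestions above, here is a rough sketch of Levenshtein computed over words rather than characters and normalized by the longer document's word count, so the score stays in [0, 1] for short and long texts alike (names are illustrative, not from any library):

// Word-level Levenshtein distance, normalized to a similarity score in [0, 1].
public class WordLevenshtein {

    static int distance(String[] a, String[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length][b.length];
    }

    /** 1.0 means identical word sequences, 0.0 means nothing in common. */
    public static double similarity(String docA, String docB) {
        String[] a = docA.toLowerCase().split("\\s+");
        String[] b = docB.toLowerCase().split("\\s+");
        int maxLen = Math.max(a.length, b.length);
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(a, b) / maxLen;
    }
}

With the strict setting from the question, a threshold around 0.9 on this score roughly corresponds to "at most 10% of the words were changed".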


Algorithm to remove words in corpus with small occurrence

I have a large (+/- 300,000 lines) dataset of text fragments that contain some noisy elements. By noisy I mean slang words, typing errors, etc. I wish to filter out these noisy elements to have a cleaner dataset.
I read some papers that propose to filter these out by keeping track of the occurrence of each word. By setting a threshold (e.g. fewer than 20 occurrences) we can assume these words are noise and thus can safely be removed from the corpus.
Maybe there are some libraries or algorithms that do this in a fast and efficient way. Of course I tried it myself first, but this is EXTREMELY slow!
So to summarize, I am looking for an algorithm that can filter out, in a fast and efficient way, words that occur less than a particular threshold. Maybe I should add a small example:
This is just an example of whaat I wish to acccomplish.
The words 'whaat' and 'acccomplish' are misspelled and thus likely to occur less often (if we assume we live in a perfect world and typos are rare…). I wish to end up with
This is just an example of I wish to.
Thanks!
PS: If possible, I'd like to have an algorithm in Java (or pseudo-code so I can write it myself)
I think you are overcomplicating it with the approach suggested in the comments.
You can do it with 2 passes on the data:
Build a histogram: a Map<String,Integer> that counts the number of occurrences of each word
For each word, print it to the new 'clean' file if and only if map.get(word) >= THRESHOLD
As a side note, I think a fixed threshold is not the best choice; I personally would filter out words that occur less than MEAN - 3*STD times, where MEAN is the average occurrence count per word and STD is the standard deviation of those counts. (Three standard deviations means you are catching words that fall outside the expected normal distribution with ~99% probability.) You can 'play' with the constant factor and find what best suits your needs.
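A minimal Java sketch of the two-pass approach above, assuming whitespace tokenization and a fixed threshold; the file names are placeholders:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Two passes: build a word histogram, then rewrite the corpus keeping only
// words that occur at least THRESHOLD times.
public class RareWordFilter {

    static final int THRESHOLD = 20;

    public static void main(String[] args) throws IOException {
        Path in = Paths.get("corpus.txt"), out = Paths.get("corpus-clean.txt");

        // Pass 1: histogram of word occurrences.
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                for (String w : line.toLowerCase().split("\\s+")) {
                    if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
                }
            }
        }

        // Pass 2: keep only frequent-enough words.
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                StringJoiner kept = new StringJoiner(" ");
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty() && counts.getOrDefault(word.toLowerCase(), 0) >= THRESHOLD) {
                        kept.add(word);
                    }
                }
                w.write(kept.toString());
                w.newLine();
            }
        }
    }
}

Both passes stream the file line by line, so memory use is dominated by the histogram itself rather than by the corpus size.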

Using parallel algorithms when reading documents [duplicate]

Possible Duplicate:
Improving performance of preprocessing large set of documents
I have a document set containing about 100 documents. I have to preprocess each of these documents and compare them with each other. If I do it in a sequential manner it will consume a huge amount of time. So I want to know some parallel algorithms that can be used and how I can implement them using Java.
There is a lot of literature about detecting document similarity. You need to do a literature search and/or a web search for software / algorithms / techniques that matches your requirements.
Simply replacing a brute-force sequential pair-wise comparison with a brute-force parallel pair-wise comparison is not the answer. That approach gives you at best a speedup of P (the number of processors), while the underlying work is still O(N^2 * S^2), where N is the number of documents and S is the average document size.
For a start, the classic way of finding similarities between two large text files involves breaking each file into lines, calculating hashes of each file's lines, sorting the hashes and comparing them. This process is O(S log S)...
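A rough sketch of that line-hashing idea, using String.hashCode for brevity (a real implementation would likely use a stronger hash and normalize whitespace first):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Hash each line of both files, sort the hashes, then count shared hashes
// with a single merge-style pass.
public class LineHashSimilarity {

    private static long[] sortedLineHashes(Path file) throws Exception {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        long[] hashes = new long[lines.size()];
        for (int i = 0; i < lines.size(); i++) hashes[i] = lines.get(i).hashCode();
        Arrays.sort(hashes);
        return hashes;
    }

    /** Fraction of lines (by hash) that the two files share. */
    public static double sharedLineFraction(Path a, Path b) throws Exception {
        long[] ha = sortedLineHashes(a), hb = sortedLineHashes(b);
        int i = 0, j = 0, shared = 0;
        while (i < ha.length && j < hb.length) {
            if (ha[i] == hb[j]) { shared++; i++; j++; }
            else if (ha[i] < hb[j]) i++;
            else j++;
        }
        return (double) shared / Math.max(1, Math.max(ha.length, hb.length));
    }
}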
If you have documents d1, d2, d3, d4 and compare each document with all other documents, it would be O(N^2). However, comparing d1 to d2 is the same as comparing d2 to d1, so you can optimize there. So basically, you only need to compare d1-d2, d1-d3, d1-d4, d2-d3, d2-d4, d3-d4, which is N*(N-1)/2 pairs - still O(N^2), but half the comparisons.
Perhaps start by building a map of all comparisons that need to be done. Then split that map into X equal-sized collections, where X is the number of processes you want to run. Finally, spin off that many threads (or farm the work out to that many servers), let them run, then merge the results back together (a rough sketch of this appears after this answer).
If you need to preprocess each document individually (so the comparisons really don't matter at that point), then just break the problem up into as many processes as you want and distribute that work across them. Without really knowing what kind of preprocessing, comparison and document types you're dealing with, I can't get into much more specifics than that.
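A rough sketch of that split-and-merge idea, using a fixed-size thread pool rather than manually partitioning the list of pairs; the similarity() stub is a placeholder for whatever preprocessing and comparison you actually do:

import java.util.*;
import java.util.concurrent.*;

// Submit each of the N*(N-1)/2 unique pairs to a thread pool, then wait for
// all tasks and collect the scores into a symmetric matrix.
public class ParallelComparison {

    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;   // stand-in for a real similarity metric
    }

    public static double[][] compareAll(List<String> docs, int threads) throws Exception {
        int n = docs.size();
        double[][] result = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    final int a = i, b = j;
                    futures.add(pool.submit(() -> {
                        double s = similarity(docs.get(a), docs.get(b));
                        result[a][b] = s;
                        result[b][a] = s;
                    }));
                }
            }
            for (Future<?> f : futures) f.get();   // wait for everything to finish
        } finally {
            pool.shutdown();
        }
        return result;
    }
}

For ~100 documents this is only 4,950 comparisons, so a single machine with a handful of threads is usually enough; the same structure also works if each task does its per-document preprocessing first.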
I'm assuming you're looking for similarities between documents rather than identical documents - if it were the latter, you could generate a checksum for each document in parallel and then comparing them would be relatively easy.
For similarities you could use a fingerprinting approach. I have a friend who uses this for looking for text reuse in a large corpus of documents. You can calculate the fingerprints for each document in parallel and then load the fingerprints to do the matching in memory, also in parallel.
Winnowing: Local Algorithms for Document Fingerprinting

BLEU score implementation for sentence similarity detection

I need to calculate a BLEU score for identifying whether two sentences are similar or not. I have read some articles, which are mostly about using the BLEU score for measuring machine translation accuracy. But I need a BLEU score to find the similarity between sentences in the same language [English] (i.e. both sentences are in English). Thanks in anticipation.
For sentence level comparisons, use smoothed BLEU
The standard BLEU score used for machine translation evaluation (BLEU:4) is only really meaningful at the corpus level, since any sentence that does not have at least one 4-gram match will be given a score of 0.
This happens because, at its core, BLEU is really just the geometric mean of n-gram precisions that is scaled by a brevity penalty to prevent very short sentences with some matching material from being given inappropriately high scores. Since the geometric mean is calculated by multiplying together all the terms to be included in the mean, having a zero for any of the n-gram counts results in the entire score being zero.
If you want to apply BLEU to individual sentences, you're better off using smoothed BLEU (Lin and Och 2004 - see sec. 4), whereby you add 1 to each of the n-gram counts before you calculate the n-gram precisions. This will prevent any of the n-gram precisions from being zero, and thus will result in non-zero values even when there are not any 4-gram matches.
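For illustration, here is a rough self-contained sketch of sentence-level BLEU with add-one smoothing applied to every n-gram count, as described above (class and method names are made up; the implementation mentioned below is the more complete option):

import java.util.*;

// Smoothed sentence-level BLEU-4 against a single reference sentence.
public class SmoothedBleu {

    // n-gram counts for a given order n.
    private static Map<String, Integer> ngrams(String[] tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            counts.merge(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    public static double score(String candidate, String reference) {
        String[] cand = candidate.toLowerCase().split("\\s+");
        String[] ref = reference.toLowerCase().split("\\s+");

        int maxOrder = 4;
        double logPrecisionSum = 0.0;
        for (int n = 1; n <= maxOrder; n++) {
            Map<String, Integer> candGrams = ngrams(cand, n);
            Map<String, Integer> refGrams = ngrams(ref, n);
            int matches = 0, total = 0;
            for (Map.Entry<String, Integer> e : candGrams.entrySet()) {
                matches += Math.min(e.getValue(), refGrams.getOrDefault(e.getKey(), 0));
                total += e.getValue();
            }
            // Add-one smoothing keeps every n-gram precision strictly positive.
            double precision = (matches + 1.0) / (total + 1.0);
            logPrecisionSum += Math.log(precision) / maxOrder;
        }

        // Brevity penalty for candidates shorter than the reference.
        double bp = cand.length >= ref.length ? 1.0
                : Math.exp(1.0 - (double) ref.length / cand.length);
        return bp * Math.exp(logPrecisionSum);
    }
}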
Java Implementation
You'll find a Java implementation of both BLEU and smooth BLEU in the Stanford machine translation package Phrasal.
Alternatives
As Andreas already mentioned, you might want to use an alternative scoring metric such as the Levenshtein string edit distance. However, one problem with using the traditional Levenshtein string edit distance to compare sentences is that it isn't explicitly aware of word boundaries.
Other alternatives include:
Word Error Rate - This is essentially the Levenshtein distance applied to a sequence of words rather than a sequence of characters. It's widely used for scoring speech recognition systems.
Translation Edit Rate (TER) - This is similar to word error rate, but it allows for an additional swap edit operation for adjacent words and phrases. This metric has become popular within the machine translation community since it correlates better with human judgments than other sentence similarity measures such as BLEU. The most recent variant of this metric, known as Translation Edit Rate Plus (TERp), allows for matching of synonyms using WordNet as well as paraphrases of multiword sequences ("died" ~= "kicked the bucket").
METEOR - This metric first calculates an alignment that allows for arbitrary reordering of the words in the two sentences being compared. If there are multiple possible ways to align the sentences, METEOR selects the one that minimizes crisscrossing alignment edges. Like TERp, METEOR allows for matching of WordNet synonyms and paraphrases of multiword sequences. After alignment, the metric computes the similarity between the two sentences using the number of matching words to calculate a F-α score, a balanced measure of precision and recall, which is then scaled by a penalty for the amount of word order scrambling present in the alignment.
Here you go: http://code.google.com/p/lingutil/
Well, if you just want to calculate the BLEU score, it's straightforward. Treat one sentence as the reference translation and the other as the candidate translation.
Maybe the (Levenshtein) edit distance is also an option, or the Hamming distance. Either way, the BLEU score is appropriate for the job; it measures the similarity of one sentence against a reference, which only makes sense when both are in the same language, as in your problem.
You can use Moses multi-bleu script, where you can also use multiple references: https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl
You are not encouraged to implement BLEU yourself; SacreBLEU is a standard implementation.
from datasets import load_metric
metric = load_metric("sacrebleu")
# "references" is a list of reference lists (one list per prediction); the example strings are placeholders
score = metric.compute(predictions=["the cat sat"], references=[["the cat sat on the mat"]])["score"]

Percentage Similarity Analysis (Java)

I have following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";
Is there any idea or standard algorithm to calculate the percentage of similarity?
For instance, in the above case, the similarity estimated by manually looking at them should be 90%+.
My idea is to tokenize both Strings and compare the number of tokens matched. Something like
(7 tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of characters matched also seems ineffective...
Can anyone give some guidelines???
Above is part of my project, Plagiarism Analyzer.
Hence, the words matched will be exactly the same, without any synonyms.
The only thing that matters in this case is how to calculate a reasonably accurate percentage of similarity.
Thanks a lot for any help.
As Konrad pointed out, your question depends heavily on what you mean by "similar".
In general, I would say the following guidelines should be of use:
normalize the input by reducing each word to its base form and lowercasing it
use a word frequency list (obtainable easily on the web) and make each word's "similarity relevance" inversely proportional to its position on the frequency list
calculate the total sentence similarity as an aggregated similarity of the words appearing in both sentences divided by the total similarity relevance of the sentences
You can refine the technique to include differences between word forms, sentence word order, synonym lists etc. Although you'll never get perfect results, you have a lot of tweaking possibilities and I believe that in general you might get quite valuable measures of similarity.
That depends on your idea of similarity. Formally, you need to define a metric of what you consider "similar" strings to apply statistics to them. Usually, this is done via the hypothetical question: "how likely is it that the first string is a modified version of the second string where errors (e.g. by typing it) were introduced?"
A very simple yet effective measure for such similarity (or rather, the inverse) is the edit distance of two strings which can be computed using dynamic programming, which takes time O(nm) in general, where n and m are the lengths of the strings.
Depending on your usage, more elaborate measures (or completely unrelated ones, such as the soundex metric) might be required.
In your case, if you straightforwardly apply a token match (i.e. mere word count) you will never get a > 90% similarity. To get such a high similarity in a meaningful way would require advanced semantical analysis. If you get this done, please publish the paper because this is as yet a largely unsolved problem.
I second what Konrad Rudolph has already said.
Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.
Given what you seem to be looking for, I recommend that you apply some of the standard text processing methods. All of these have potential downfalls, so I list them in order of both application and difficulty to do well
Sentence splitting. Figure out your units of comparison.
stop-word removal: take out a, an, the, of, etc.
bag of words percentage: what percentage of the overall words match, independent of ordering (a rough sketch of this step follows the list)
(much more aggressive) you could try synonym expansion, which counts synonyms as matched words.
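A rough sketch of steps 2 and 3 - stop-word removal followed by a bag-of-words overlap percentage - using the question's two example strings; the stop-word list here is only a tiny sample:

import java.util.*;

// Bag-of-words overlap: percentage of the combined vocabulary that both
// sentences share, after dropping a few stop words.
public class BagOfWordsOverlap {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "that"));

    private static Set<String> bagOfWords(String sentence) {
        Set<String> bag = new HashSet<>();
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) bag.add(token);
        }
        return bag;
    }

    public static double overlapPercentage(String a, String b) {
        Set<String> bagA = bagOfWords(a), bagB = bagOfWords(b);
        Set<String> union = new HashSet<>(bagA);
        union.addAll(bagB);
        Set<String> common = new HashSet<>(bagA);
        common.retainAll(bagB);
        return union.isEmpty() ? 100.0 : 100.0 * common.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(overlapPercentage(
                "A Web crawler is a computer program that browses the World Wide Web internet automatically",
                "Web Crawler computer program browses the World Wide Web"));
    }
}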
The problem with this question is that the similarity may be either a human-perceived similarity (the "90%+ similarity" you describe) or a statistical similarity (Konrad Rudolph's answer).
The human-perceived similarity can never be easily calculated; for instance, take these three word pairs:
cellphone car message
mobile automobile post
The statistical similarity is very low, while to a human they are actually quite similar. Thus it'll be hard to solve this problem, and the only thing I can point you to is Bayesian filtering or artificial intelligence with Bayesian networks.
One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the Apache Commons string utility library (StringUtils).
The longest common subsequence is a well-known string dissimilarity metric, typically implemented with dynamic programming.
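For reference, a quick dynamic-programming sketch of the LCS length computed over words; turning that length into a (dis)similarity score, for example by dividing by the longer sentence's length, is left to you:

// Length of the longest common subsequence of two word sequences.
public class LongestCommonSubsequence {

    public static int lcsLength(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                dp[i][j] = a[i - 1].equals(b[j - 1])
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }

    public static void main(String[] args) {
        String[] x = "the quick brown fox".split(" ");
        String[] y = "the slow brown fox".split(" ");
        System.out.println(lcsLength(x, y));   // prints 3
    }
}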

Way to store a large dictionary with low memory footprint + fast lookups (on Android)

I'm developing an android word game app that needs a large (~250,000 word dictionary) available. I need:
reasonably fast lookups, e.g. constant time preferable; I need to do maybe 200 lookups a second on occasion to solve a word puzzle, and more often maybe 20 lookups within 0.2 seconds to check words the user just spelled.
EDIT: Lookups are typically asking "Is this word in the dictionary?". I'd like to support up to two wildcards in the word as well, but this is easy enough by just generating all possible letters the wildcards could have been and checking the generated words (i.e. 26 * 26 lookups for a word with two wildcards).
as it's a mobile app, using as little memory as possible and requiring only a small initial download for the dictionary data is top priority.
My first naive attempts used Java's HashMap class, which caused an out-of-memory exception. I've looked into using the SQLite databases available on Android, but this seems like overkill.
What's a good way to do what I need?
You can achieve your goals with more lowly approaches as well... if it's a word game then I suspect you are handling a 27-letter alphabet. So suppose an alphabet of not more than 32 letters, i.e. 5 bits per letter. You can then cram 12 letters (12 x 5 = 60 bits) into a single Java long by using a trivial 5-bits-per-letter encoding.
This means that if you don't have words longer than 12 letters, you can just represent your dictionary as a set of Java longs. If you have 250,000 words, a trivial representation of this set as a single, sorted array of longs should take 250,000 words x 8 bytes/word = 2,000,000 bytes ~ 2 MB of memory. Lookup is then by binary search, which should be very fast given the small size of the data set (fewer than 20 comparisons, since 2^20 takes you above one million).
If you have words longer than 12 letters, then I would store those in another array, where 1 word would be represented by 2 concatenated Java longs in an obvious manner.
NOTE: the reason why this works, and is likely more space-efficient than a trie and at least very simple to implement, is that the dictionary is constant... search trees are good if you need to modify the data set, but if the data set is constant, you can often get away with simple binary search.
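A rough sketch of that packing scheme for words of up to 12 lowercase ASCII letters; longer words would need the two-long variant described above, and the class name is just illustrative:

import java.util.Arrays;

// Packs each word into a long (5 bits per letter), sorts the array once,
// then answers membership queries by binary search.
public class PackedDictionary {

    private final long[] sortedWords;

    public PackedDictionary(String[] words) {
        sortedWords = new long[words.length];
        for (int i = 0; i < words.length; i++) sortedWords[i] = pack(words[i]);
        Arrays.sort(sortedWords);
    }

    // 'a' -> 1 ... 'z' -> 26, shifted in 5 bits at a time (max 12 letters = 60 bits).
    static long pack(String word) {
        long packed = 0;
        for (int i = 0; i < word.length(); i++) {
            packed = (packed << 5) | (word.charAt(i) - 'a' + 1);
        }
        return packed;
    }

    public boolean contains(String word) {
        return Arrays.binarySearch(sortedWords, pack(word)) >= 0;
    }
}

For the wildcard lookups mentioned in the question, you would simply call contains() once per generated candidate word.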
I am assuming that you want to check whether a given word belongs to the dictionary.
Have a look at bloom filter.
The Bloom filter can do "does X belong to a predefined set" type queries with very small storage requirements. If the answer to a query is yes, it has a small (and adjustable) probability of being wrong; if the answer is no, then it is guaranteed to be correct.
According to the Wikipedia article, you would need less than 4 MB of space for your dictionary of 250,000 words with a 1% error probability.
The Bloom filter will correctly answer "is in dictionary" if the word actually is contained in the dictionary. If the dictionary does not have the word, the Bloom filter may falsely answer "is in dictionary" with some small probability.
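For example, Guava's BloomFilter can be used directly (assuming adding Guava as a dependency is acceptable); the numbers below match the 250,000 words and 1% error probability discussed above:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Build a Bloom filter sized for 250,000 words at a 1% false-positive rate,
// then answer "is this word in the dictionary?" queries with mightContain().
public class DictionaryBloomFilter {

    public static void main(String[] args) {
        BloomFilter<String> dictionary = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 250_000, 0.01);

        dictionary.put("crawler");   // add every dictionary word once at startup

        System.out.println(dictionary.mightContain("crawler"));  // true
        System.out.println(dictionary.mightContain("zzzzzz"));   // false with high probability
    }
}

Note that a Bloom filter can only answer membership queries; if the game ever needs to enumerate words (e.g. for hints), one of the trie/DAWG approaches below is a better fit.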
A very efficient way to store a dictionary is a Directed Acyclic Word Graph (DAWG).
Here are some links:
Directed Acyclic Word Graph or DAWG description with sourcecode
Construction of the CDAWG for a Trie
Implementation of directed acyclic word graph
You'll be wanting some sort of trie. Perhaps a ternary search trie would be good, I think. They give very fast look-up and low memory usage. This paper gives some more info about TSTs. It also talks about sorting, so not all of it will apply. This article might be a little more applicable. As the article says, TSTs "combine the time efficiency of digital tries with the space efficiency of binary search trees."
As this table shows, the look-up times are very comparable to using a hash table.
You could also use the Android NDK and do the structure in C or C++.
The devices that I worked on basically worked from a compressed binary file, with a topology that resembled the structure of a binary tree. At the leaves, you would have the Huffman-compressed text. Finding a node would involve skipping to various locations of the file, and then loading only the portion of the data really needed.
Very cool idea suggested by "Antti Huima": store dictionary words as longs and then search using binary search.
