This question already has answers here:
Fuzzy string search library in Java [closed]
(8 answers)
Closed 9 years ago.
I'm looking for a way to programmatically detect the delta ratio between two strings. I can use string length, but this doesn't give much useful information for like-sized but different inputs. There is a Java diff tool on Google Code, Java Diff Utils, but it hasn't been updated since 2011 and I don't need to actually modify the strings themselves.
I'm attempting to do change detection with threshold values, for instance: "Updated string is 42% different from the existing string, are you sure you want to proceed?"
Does anyone know of a library that could be used for this, or is java-diff-utils my only option? I couldn't find much in Apache Commons, and Googling is returning irrelevant information.
You could use the Levenshtein distance to calculate how different two strings are from each other. There's some quite complex math behind it, but the actual code is rather short. You can easily rewrite the code in that wiki article in Java.
The difference is measured as an integer: the number of steps it takes to turn one string into the other. A step may be a character insertion, deletion, or replacement with another character. It tells you how many steps are needed, but not which steps, nor in which order. But then again, since you only want to measure the total difference, I'm sure that's enough information for your needs.
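To illustrate, here is a minimal sketch of the textbook dynamic-programming version in Java (method names are my own); dividing the distance by the longer string's length gives the kind of percentage figure the question asks about:

public static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;        // edits to build b's prefix from ""
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;                                          // edits to delete a's prefix
        for (int j = 1; j <= b.length(); j++) {
            int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
            curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
        }
        int[] tmp = prev; prev = curr; curr = tmp;            // reuse the two rows
    }
    return prev[b.length()];
}

// A rough "percent different", as in the question: distance relative to the longer string.
public static double percentDifferent(String a, String b) {
    int longer = Math.max(a.length(), b.length());
    return longer == 0 ? 0.0 : 100.0 * levenshtein(a, b) / longer;
}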
Edit: one of the commenters (kaos) provided a link to an implementation of Levenshtein distance in Apache Commons.
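If you already depend on Apache Commons, usage is roughly a one-liner; note that the class has moved between releases (older commons-lang exposed StringUtils.getLevenshteinDistance, newer commons-text has a LevenshteinDistance class), so check the version you pull in:

import org.apache.commons.text.similarity.LevenshteinDistance;

public class LevenshteinDemo {
    public static void main(String[] args) {
        // "kitten" -> "sitting" needs 3 edits
        int distance = LevenshteinDistance.getDefaultInstance().apply("kitten", "sitting");
        System.out.println(distance);   // 3
    }
}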
This question already has answers here:
How to evaluate a math expression given in string form?
(26 answers)
Closed 6 years ago.
I have a string with a math expression, like "35+20". I want a new double variable that takes the result of the expression, i.e. 55.0. How do I achieve this? This is actually for Android; I'm making a calculator.
Manually parse the string and do a calculation at each operator symbol. It will get more complicated when dealing with brackets, however.
If you want to write it yourself, you'll probably want to implement the Shunting Yard Algorithm.
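If you do want to roll it yourself, a rough sketch of that approach could look like the following (class and method names are made up for illustration; it only handles +, -, *, / and parentheses, with no unary minus or functions):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TinyCalc {

    // Evaluate an infix expression: tokenize, convert to Reverse Polish Notation
    // with the shunting-yard algorithm, then evaluate the RPN with a stack.
    public static double eval(String expr) {
        return evalRpn(toRpn(tokenize(expr)));
    }

    private static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isDigit(c) || c == '.') {
                int start = i;
                while (i < s.length() && (Character.isDigit(s.charAt(i)) || s.charAt(i) == '.')) i++;
                tokens.add(s.substring(start, i));          // a number
            } else {
                tokens.add(String.valueOf(c));              // operator or parenthesis
                i++;
            }
        }
        return tokens;
    }

    private static boolean isNumber(String t) {
        return Character.isDigit(t.charAt(0)) || t.charAt(0) == '.';
    }

    private static int precedence(String op) {
        return (op.equals("+") || op.equals("-")) ? 1 : 2;  // * and / bind tighter
    }

    private static List<String> toRpn(List<String> tokens) {
        List<String> output = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String t : tokens) {
            if (isNumber(t)) {
                output.add(t);
            } else if (t.equals("(")) {
                ops.push(t);
            } else if (t.equals(")")) {
                while (!ops.peek().equals("(")) output.add(ops.pop());
                ops.pop();                                  // discard the "("
            } else {                                        // + - * /
                while (!ops.isEmpty() && !ops.peek().equals("(")
                        && precedence(ops.peek()) >= precedence(t)) {
                    output.add(ops.pop());
                }
                ops.push(t);
            }
        }
        while (!ops.isEmpty()) output.add(ops.pop());
        return output;
    }

    private static double evalRpn(List<String> rpn) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String t : rpn) {
            if (isNumber(t)) {
                stack.push(Double.parseDouble(t));
            } else {
                double b = stack.pop(), a = stack.pop();    // note the operand order
                switch (t) {
                    case "+": stack.push(a + b); break;
                    case "-": stack.push(a - b); break;
                    case "*": stack.push(a * b); break;
                    default:  stack.push(a / b); break;     // "/"
                }
            }
        }
        return stack.pop();
    }
}

With this sketch, TinyCalc.eval("35+20") gives 55.0 and TinyCalc.eval("(3+4)*2") gives 14.0; brackets are handled, but there is no error reporting for malformed input.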
There are also some libraries that handle it for you.
https://github.com/uklimaschewski/EvalEx
Since you have mentioned you are working on a calculator, I am assuming that you might not only be interested in the + operation but in a bunch of other operations too.
You can look into the open-source GitHub project linked below, which provides a Java implementation of what you are trying to do: https://github.com/uklimaschewski/EvalEx
It can give you a good set of the functionality you desire.
The project takes in a string as an expression and returns the result as a BigDecimal.
You can always extend it and tweak it to suit your needs.
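A minimal usage sketch, assuming the classic com.udojava.evalex API; the exact package and method names differ between EvalEx releases, so check the README of the version you use:

import com.udojava.evalex.Expression;
import java.math.BigDecimal;

public class CalcDemo {
    public static void main(String[] args) {
        BigDecimal result = new Expression("35+20").eval();   // assumes the older 2.x-style API
        double value = result.doubleValue();                  // 55.0, ready to display
        System.out.println(value);
    }
}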
This question already has answers here:
Plural form of a word
Closed 10 years ago.
Is there some existing class or library for adding "s" to a String if I pass it a number that's not 1? Basically, I have a cat. If I have 1 cat, I need the String "cat". If I have 2 cats, I need the String "cats". It's a simple thing to do myself, but after I did it, I thought there's probably already a library I could import for it. However, as you can see, I am having difficulty putting this into words to Google a name for the package, if it exists. :P It's just that I write this function all the time, I'm wondering if it exists already.
Pluralizing is probably simple enough that you don't need an entire API and can implement it yourself, with the exception of words like "cactus" or "tooth" that are somewhat special cases. If you really don't want to have a stab at it, there are things like the Inflector library that can do it for you. There are probably others if you search for "java pluralize".
Javadocs for Inflector here: http://www.atteo.org/static/evo-inflector/apidocs/index.html
Really I'm doing just time words (seconds, minutes, hours) ...
If you have a small / fixed set of words, it is probably simpler to use a static lookup table or something like that. A general solution is too heavy-weight, IMO.
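For a small fixed set of words, a hypothetical helper along these lines (the names are my own) is probably all you need; irregular words go in the map and everything else just gets an "s":

import java.util.Map;

public final class Plurals {
    // Only irregular cases need an entry; regular words fall through to word + "s".
    private static final Map<String, String> IRREGULAR = Map.of(
            "tooth", "teeth",
            "cactus", "cacti");

    public static String pluralize(String word, long count) {
        if (count == 1) return word;
        return IRREGULAR.getOrDefault(word, word + "s");
    }
}

// Plurals.pluralize("second", 1) -> "second";  Plurals.pluralize("minute", 5) -> "minutes"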
What would be the best approach to comparing two hexadecimal file signatures against each other for similarities?
More specifically, what I would like to do is take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the file's (exe) hex representation into individual groups of N chars (i.e. 10 hex chars) and do the same with the virus signature. I am aiming to perform some sort of heuristic and therefore statistically check whether this exe file has X% similarity to a known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore exe1[0,9] against virus1[0,9]. Each subset would be graded statistically.
As you can imagine, there would be a massive number of comparisons and hence it would be very, very slow. So I thought I would ask whether you can think of a better approach to such a comparison, for example combining different data structures.
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version, while having an apparently different structure (variants). It achieves that by code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. So, much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
Update: After the advice you gave me regarding what to read, I did that, but it somewhat confused me more. I found several distance algorithms that could apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm
But now I don't know which to use; they all seem to do the same thing in different ways. I will continue to do research so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can give it priority during my research and study it more deeply?
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
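For reference, the longest-common-subsequence part of that mix can be computed with a small dynamic program like this sketch (not the exact code from the paper):

// Length of the longest common subsequence of two byte arrays (e.g. raw file bytes),
// using the standard O(n*m) dynamic program with two rolling rows.
public static int lcsLength(byte[] a, byte[] b) {
    int[] prev = new int[b.length + 1];
    int[] curr = new int[b.length + 1];
    for (int i = 1; i <= a.length; i++) {
        for (int j = 1; j <= b.length; j++) {
            curr[j] = (a[i - 1] == b[j - 1])
                    ? prev[j - 1] + 1
                    : Math.max(prev[j], curr[j - 1]);
        }
        int[] tmp = prev; prev = curr; curr = tmp;            // roll the rows
    }
    return prev[b.length];
}

// A crude similarity percentage relative to the shorter input:
// 100.0 * lcsLength(exeBytes, sigBytes) / Math.min(exeBytes.length, sigBytes.length)

Note that this is O(n*m) in time, so in practice it is run per chunk of the file rather than over the whole executable at once.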
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, when it comes to polymorphic malware, this field should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of specific approximate searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
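As a taste of that family, here is a sketch of the Horspool simplification of Boyer-Moore over raw bytes, using only the bad-character rule (which is where most of the practical speed-up comes from); it returns the first offset of the pattern in the text, or -1:

public static int horspoolSearch(byte[] text, byte[] pattern) {
    int n = text.length, m = pattern.length;
    if (m == 0 || m > n) return m == 0 ? 0 : -1;

    // Bad-character shift table: how far we may slide the pattern after a mismatch.
    int[] shift = new int[256];
    java.util.Arrays.fill(shift, m);
    for (int i = 0; i < m - 1; i++) {
        shift[pattern[i] & 0xFF] = m - 1 - i;
    }

    int pos = 0;
    while (pos <= n - m) {
        int j = m - 1;
        while (j >= 0 && text[pos + j] == pattern[j]) j--;    // compare right to left
        if (j < 0) return pos;                                // full match
        pos += shift[text[pos + m - 1] & 0xFF];               // slide by the last window byte
    }
    return -1;
}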
A number of papers have been published on finding near-duplicate documents in a large corpus of documents in the context of web search. I think you will find them useful. For example, see this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing. The difference is that you are using binary data. They are similar problems because you will be looking for strings that have the same basic pattern, even though the patterns may have some slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity to known-string and bioinformatics problems might help. Longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but one more efficient than Smith-Waterman. I would try looking at programs such as BLAST, BLAT or MUMMER3 to see if they fit your needs.
Remember that the default parameters for these programs are based on a biology application (how much to penalize an insertion or a substitution, for instance), so you should probably look at re-estimating the parameters based on your application domain, possibly using a training set. This is a known problem because even in biology different applications require different parameters (based, for instance, on the evolutionary distance of the two genomes being compared). It is also possible, though, that even at the defaults one of these algorithms might produce usable results. Best of all would be a generative model of how viruses change; that could guide you to an optimal choice of distance and comparison algorithm.
I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out of the box, or using some tiny library that just implements one or two testing algorithms. (In other words, without bringing in any complicated/overkill solution that stores indexes or relies on a database.)
What would be your choice in such case please?
EDIT: It seems like Levenshtein comes closest to what I need from what has been advised. Only it gets easily fooled when the search query is "John" and the names in the list are significantly longer.
You should look at various string comparison algorithms and see which one suits your data best. Options are Jaro-Winkler, Smith-Waterman, etc. Look up SimMetrics, an F/OSS library that offers a very comprehensive set of string comparison algorithms.
If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.
Check out Double Metaphone, an improved soundex from 1990.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup
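Usage via commons-codec looks roughly like this (a sketch; class locations and exact output codes can vary between codec versions, so verify against the version you use):

import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Soundex;

public class PhoneticDemo {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();
        System.out.println(dm.doubleMetaphone("Smith"));                  // primary phonetic code
        System.out.println(dm.isDoubleMetaphoneEqual("Smith", "Smythe")); // do they sound alike?

        Soundex soundex = new Soundex();
        System.out.println(soundex.soundex("Robert"));                    // "R163"
    }
}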
In my opinion, the Jaro-Winkler algorithm will suit your requirement best.
Here is a short summary of the Jaro-Winkler distance algorithm.
Here is one PDF that compares different algorithms: Link to PDF
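If you go that way, Apache Commons Text ships an implementation, so the comparison itself is only a couple of lines (a sketch; older versions name the class JaroWinklerDistance instead):

import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class NameMatchDemo {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();
        // Scores close to 1.0 indicate a strong match; tune your own threshold on real names.
        System.out.println(jw.apply("John", "Jon"));
        System.out.println(jw.apply("John", "Margaret"));
    }
}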
I have two subtitles files.
I need a function that tells whether they represent the same text or similar text.
Sometimes there are comments like "The wind is blowing... the music is playing" in one file only.
But 80% of the contents will be the same. The function must return TRUE (the files represent the same text).
And sometimes there are misspellings, like 1 instead of l (the digit one instead of a lowercase L), as here:
She 1eft the baggage.
Of course, the function must still return TRUE in that case.
My comments:
The function should return the percentage of similarity of the texts - AGREE
"all the people were happy" vs. "all the people were not happy" - here that would be treated like a misspelling, so the texts would be considered the same. To be exact, the percentage the function returns would be lower, but still high enough to say the phrases are similar.
Do consider whether you want to apply Levenshtein on a whole file or just a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.
Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far or near they are. The result is an integer.
For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on term frequency vectors.
You might want to look at several implementations that are described here: Cosine Similarity
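A minimal sketch of that idea over word-frequency vectors; the tokenization here is deliberately naive, and real subtitle files would at least need timestamps stripped first:

import java.util.HashMap;
import java.util.Map;

public class SubtitleSimilarity {

    public static double cosineSimilarity(String a, String b) {
        Map<String, Integer> va = termFrequencies(a);
        Map<String, Integer> vb = termFrequencies(b);

        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);   // only shared terms contribute
            normA += e.getValue() * e.getValue();
        }
        for (int count : vb.values()) normB += count * count;
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) tf.merge(word, 1, Integer::sum);
        }
        return tf;
    }
}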
You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to provide good results for your input.
Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.
EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep
There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.
The choice of algorithm depends on the language, the type of words, whether the words are entered by a human, and more...
Here you can find a helpful implementation of several algorithms within one library.
If you are still looking for a solution, then go with S-BERT (Sentence-BERT), which is a lightweight approach that internally uses cosine similarity.