semantic similarity between sentences - java

I'm doing a project and need an open-source tool or technique to find the semantic similarity of two sentences: I give two sentences as input and receive a score (i.e., their semantic similarity) as output. Any help?

Salma, I'm afraid this is not the right forum for your question, as it's not directly related to programming. I recommend that you ask it again on the Corpora mailing list. You may also want to search their archives first.
Apart from that, your question is not precise enough, and I'll explain what I mean by that. I assume that your project is about computing the semantic similarity between sentences, and not about something else for which semantic similarity is just one component among many. If this is the case, then there are a few things to consider. First of all, it is not clear, either from the perspective of computational linguistics or from that of theoretical linguistics, what the term 'semantic similarity' means exactly. There are numerous different views and definitions of it, all depending on the type of problem to be solved, the tools and techniques at hand, the background of whoever is approaching the task, and so on. Consider these examples:
1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-5 are similar to 1? Sentence 2 is the exact opposite of 1, yet it is still about Pete and Rob (not) finding a dog. Sentence 3 is about Pete and Rob, but in a completely different context. Sentence 4 is about finding a dog near the station, although the finder is someone else. Sentence 5 is about Pete, Rob, a dog, and a 'finding' event, but in a different way than in 1. As for me, I would not be able to rank these examples according to their similarity even if I did not have to write a computer program to do it.
In order to compute semantic similarity, you first need to decide what you want to treat as 'semantically similar' and what not. To compute semantic similarity on the sentence level, you would ideally compare some kind of meaning representation of the sentences. Meaning representations normally come as logic formulas and are extremely complex to generate. However, there are tools which attempt to do this, e.g. Boxer.
As a simplistic but often practical approach, you could define semantic similarity as the sum of the similarities between the words in one sentence and the words in the other. This makes the problem a lot easier, although some difficult issues remain, since the semantic similarity of words is just as poorly defined as that of sentences. If you want to get an impression of this, take a look at the book 'Lexical Semantics' by D.A. Cruse (1986). Still, there are quite a number of tools and techniques for computing the semantic similarity between words. Some of them define it essentially as the negative distance between two words in a taxonomy such as WordNet or the Wikipedia category taxonomy (see this paper, which describes an API for this). Others compute semantic similarity using statistical measures calculated over large text corpora, based on the insight that similar words occur in similar contexts. A third approach to calculating semantic similarity between sentences or words uses vector space models, which you may know from information retrieval. For an overview of these latter techniques, take a look at chapter 8.5 of Foundations of Statistical Natural Language Processing by Manning and Schütze.
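To make the word-level aggregation idea concrete, here is a minimal sketch in Java. The wordSim method below is a trivial stand-in (exact string match) for a real WordNet- or corpus-based word similarity measure; the rest just averages each word's best match in the other sentence and symmetrizes the result:

import java.util.Arrays;
import java.util.List;

public class WordOverlapSimilarity {

    // Stand-in word similarity: 1.0 for identical tokens, 0.0 otherwise.
    // In practice you would plug in a WordNet- or corpus-based measure here.
    static double wordSim(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // For each word in one sentence, take its best match in the other, average, then symmetrize.
    static double sentenceSim(List<String> s1, List<String> s2) {
        return (directedSim(s1, s2) + directedSim(s2, s1)) / 2.0;
    }

    static double directedSim(List<String> from, List<String> to) {
        double total = 0.0;
        for (String w1 : from) {
            double best = 0.0;
            for (String w2 : to) best = Math.max(best, wordSim(w1, w2));
            total += best;
        }
        return from.isEmpty() ? 0.0 : total / from.size();
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("Pete and Rob have found a dog near the station".split(" "));
        List<String> b = Arrays.asList("Patricia found a dog near the station".split(" "));
        System.out.println(sentenceSim(a, b));
    }
}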
Hope this gets you off the ground for now.

I have developed a simple open-source tool that does the semantic comparison according to categories:
https://sourceforge.net/projects/semantics/files/
It works with sentences of any length, and is simple, stable, fast, and small in size...
Here is a sample output:
Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
is: 1.0000000000
Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
is: 0.7363210405107239
Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
is: 0.0
Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
is: 0.29160592175990213
USAGE:
import semantics.Compare;

public class USAGE {
    public static void main(String[] args) {
        String a = "This is a first sentence.";
        String b = "This is a second one.";
        Compare c = new Compare(a, b);
        System.out.println("Similarity between the sentences\n-" + a + "\n-" + b + "\n is: " + c.getResult());
    }
}

You can try using the UMBC Semantic Similarity Service, which is based on the WordNet knowledge base.
There is also the UMBC STS (Semantic Textual Similarity) Service; here is the link: http://swoogle.umbc.edu/StsService/sts.html
Regards,

Related

Clause Segmentation using Stanford OpenIE

I'm in search of a good tool for segmenting complex sentences into clauses. Since I use CoreNLP tools for parsing, I learned that OpenIE handles clause segmentation in the process of extracting relation triples from a sentence. Currently, I use the sample code provided in the OpenIEDemo class from the GitHub repository, but it doesn't properly segment the sentence into clauses.
Here is the code:
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.OpenIE;
import edu.stanford.nlp.naturalli.SentenceFragment;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.PropertiesUtils;

// Create the Stanford CoreNLP pipeline
Properties props = PropertiesUtils.asProperties(
        "annotators", "tokenize,ssplit,pos,lemma,parse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// Annotate the sample sentence
String text = "I don't think he will be able to handle this.";
Annotation doc = new Annotation(text);
pipeline.annotate(doc);

// Loop over sentences in the document and print the clauses found in each
OpenIE openIE = new OpenIE(props);
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    List<SentenceFragment> clauses = openIE.clausesInSentence(sentence);
    for (SentenceFragment clause : clauses) {
        System.out.println("Clause: " + clause.toString());
    }
}
I expect to get three clauses as output:
I don't think
he will be able
to handle this
Instead, the code returns exactly the same input:
I do n't think he will be able to handle this
However, the sentence
Obama is born in Hawaii and he is no longer our president.
gets two clauses:
Obama is born in Hawaii and he is no longer our president
he is no longer our president
(it seems that the coordinating conjunction is a good segmentation indicator)
Is OpenIE generally used for clause segmentation, and if so, how do I do it properly?
Any other practical approaches/tools for clause segmentation are welcome. Thanks in advance.
So, the clause segmenter is a bit more tightly integrated with OpenIE than the name would imply. The goal of the module is to produce logically entailed clauses, which can then be shortened into logically entailed sentence fragments. Going through your two examples:
I don't think he will be able to handle this.
I don't think any of the three clauses are entailed by the original sentence:
"I don't think" -- you likely still "think," even if you don't think something is true.
"He will be able" -- If you "think the world is flat," it doesn't mean that the world is flat. Similarly, if you "think he'll be able" it doesn't mean he'll be able.
"to handle this" -- I'm not sure this is a clause... I'd group this with "He will be able to handle this," with "able to" being treated as a single verb.
Obama is born in Hawaii and he is no longer our president.
Naturally the two clauses should be "Obama was born in Hawaii" and "He is no longer our president." Nonetheless, the clause splitter outputs the original sentence in place of the first clause, in the expectation that the next step of the OpenIE extractor will strip off the "conj:and" edge.
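If what you ultimately want are the final extractions rather than the intermediate clauses, the usual pattern from the OpenIE demo is to read the RelationTriplesAnnotation after annotating. A minimal sketch along those lines (the annotator list and gloss methods follow the standard demo; treat the details as assumptions if your CoreNLP version differs):

import java.util.Collection;
import java.util.Properties;

import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.PropertiesUtils;

public class TriplesDemo {
    public static void main(String[] args) {
        Properties props = PropertiesUtils.asProperties(
                "annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(
                "Obama is born in Hawaii and he is no longer our president.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // The final OpenIE extractions, after the clause splitter and fragment shortener have run.
            Collection<RelationTriple> triples =
                    sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            for (RelationTriple triple : triples) {
                System.out.println(triple.confidence + "\t"
                        + triple.subjectLemmaGloss() + "\t"
                        + triple.relationLemmaGloss() + "\t"
                        + triple.objectLemmaGloss());
            }
        }
    }
}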
Have you seen this Stanford CoreNLP parse tree visualization tool? http://nlpviz.bpodgursky.com/
I don't program, but I've been looking for CoreNLP tag groups that might signify an independent clause (one that can stand on its own); a rough programmatic sketch of checking these patterns follows after the examples below.
Your example: I don't think he will be able to handle this
- I don't think (S-NP-VP)
- He will be able (S-NP-VP)
- Handle this (VP-VB-NP)
Another example: Researchers are developing algorithms to harness the force from a (MRI) to steer millimeter-sized robots
- Researchers are developing (S-NP-VP)
- Harness the force (VP-NN-NP)
- Steer millimeter-sized robots (VP-VB-NP)
The red line is for the first layer, and the blue line is for the second layer
- The red line is for the first layer (S-NP-VP)
- The blue line is for the second layer (S-NP-VP)
Some metal ions can be harmful to cells, whereas others are necessary for biochemical reactions
- Some metal ions can be harmful (S-NP-DT)
- Others are necessary (S-NP-NNS)
But how that is determined is often based on questioning that can be subject to interpretations and many other states have laws that players be kept out different numbers of days.
- How that is determined is often based on questioning (S-SBAR-VP)
- Many other states have (S-NP-VB)
- Kept out different numbers (VP-VPN-NP)
For instance, past data on older humans and non-human primates have suggested that dietary carotenoids could slow cognitive decline.
- Past data have suggested (S-NP-VP)
- Dietary carotenoids could slow (S-NP-VP)
Combinations that I've noticed:
S-NP-VP
S-NP-DT
S-NP-NNS
S-SBAR-VP
S-VP-VB
VP-VPN-NP
VP-NN-NP
VP-VB-NP
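For what it's worth, here is a rough sketch of how one might check for patterns like these programmatically with a CoreNLP constituency parse: walk every subtree and flag S nodes that have both an NP and a VP child as candidate independent clauses. The heuristic is exactly the one noticed above and will miss or over-generate clauses; the class and method names are from the standard CoreNLP Tree API:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.PropertiesUtils;

public class ClausePatternDemo {
    public static void main(String[] args) {
        StanfordCoreNLP pipeline = new StanfordCoreNLP(PropertiesUtils.asProperties(
                "annotators", "tokenize,ssplit,pos,parse"));
        Annotation doc = new Annotation(
                "The red line is for the first layer, and the blue line is for the second layer.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree root = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            // Tree is Iterable<Tree>, giving a pre-order walk over all subtrees.
            for (Tree node : root) {
                if (!"S".equals(node.label().value())) continue;
                boolean hasNP = false, hasVP = false;
                for (Tree child : node.children()) {
                    if ("NP".equals(child.label().value())) hasNP = true;
                    if ("VP".equals(child.label().value())) hasVP = true;
                }
                // Crude "independent clause" heuristic: an S node with NP and VP children.
                if (hasNP && hasVP) {
                    StringBuilder sb = new StringBuilder();
                    for (Word w : node.yieldWords()) {
                        if (sb.length() > 0) sb.append(' ');
                        sb.append(w.word());
                    }
                    System.out.println("Candidate clause: " + sb);
                }
            }
        }
    }
}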

Senserelate targetword: provide the "best" alternative for end-users

Intro to my problem: users can search for terms, and RitaWordNet provides a method called getSenseIds() to get the related senses. So far I am using WS4J (WordNet Similarity for Java, http://code.google.com/p/ws4j/), which has different algorithms to define distance. A search for "user" gives this result:
user
exploiter
drug user
http://wordnetweb.princeton.edu/perl/webwn?s=user&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=0
The Lin distance is measured by comparing two terms in WS4J (with the target word, I assume?):
Similarity between: user and: user = 1.7976931348623157E308
Similarity between: user and: exploiter = 0.1976958835785797
I would like to suggest to the end-user that the "user" sense is the most relevant/correct answer, but the problem is that this depends on the rest of the sentence.
Example: "The old man was a regular user of public transport", "The young man became a drug user while studying NLP..".
I assume that the SenseRelate project has something included that I'm missing. This thread also came up during my search:
word disambiguation algorithm (Lesk algorithm)
Hopefully someone got my question :)
You might want to try WordNet::SenseRelate::AllWords - there's an online demo at http://maraca.d.umn.edu
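If you want to stay within WS4J in Java, a crude word-level approximation of the SenseRelate idea is to score a representative lemma of each candidate sense against the other content words of the sentence and suggest the highest-scoring candidate. A minimal sketch, assuming the standard WS4J classes (NictWordNet, Lin); the candidate lemmas and the context word list below are hand-picked for illustration only:

import java.util.Arrays;
import java.util.List;

import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;

public class BestSenseGuess {
    public static void main(String[] args) {
        ILexicalDatabase db = new NictWordNet();
        RelatednessCalculator lin = new Lin(db);

        // Representative lemma for each candidate sense of "user" (hand-picked here, not from RitaWordNet).
        List<String> candidates = Arrays.asList("user", "exploiter", "addict");
        // Content words from the surrounding sentence, minus the target word itself.
        List<String> context = Arrays.asList("old", "man", "regular", "public", "transport");

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            double score = 0.0;
            for (String word : context) {
                score += lin.calcRelatednessOfWords(candidate, word);
            }
            System.out.println(candidate + " -> " + score);
            if (score > bestScore) { bestScore = score; best = candidate; }
        }
        System.out.println("Suggested sense representative: " + best);
    }
}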

Formula manipulation algorithm

I want to make a program that, when given a formula, can manipulate it to make any variable (or, in the case of simultaneous formulas, a common variable) the subject of the formula.
For example if given:
a + b = c
d + b = c
The program should therefore say:
b = c - a, d = c - b etc.
I'm not sure whether Java can do this automatically when I give the original formula as input. I am not really interested in solving the equation and getting the value of each variable; I am just interested in returning a manipulated formula.
Please let me know if I need to make an algorithm or not for this, and if so, how would I go about doing this. Also, if there are any helpful links that you might have, please post them.
Regards
Take a look at JavaCC. It's a little daunting at first but it's the right tool for something like this. Plus there are already examples of what you are trying to achieve.
Not sure what exactly you are after, but this problem in its general form is hard. Very hard.
In fact, given a set of "formulas" (axioms) and deduction rules (mathematical equivalence operations), we cannot decide whether a given formula is correct or not. This problem is actually undecidable.
This issue was first addressed by Hilbert as the Entscheidungsproblem.
I read a book called Fluid Concepts and Creative Analogies by Douglas Hofstadter that talked about this sort of algebraic manipulation: automatically rewriting equations and combining them with other equations in an unbounded (yet rule-restricted) number of ways. It was an attempt to prove as-yet-unproven theorems by brute force.
http://en.wikipedia.org/wiki/Fluid_Concepts_and_Creative_Analogies
Douglas Hofstadter's Numbo program attempts to do what you want. He doesn't give you the source, only describes how it works in detail.
It sounds like you want a program to do what high school students do when they solve algebraic problems: move from a position where you know something, modifying it and combining it with other equations, to prove something previously unknown. It takes a strong artificial intelligence to do this. The part of your brain that does this is the neocortex, which does science, and its operating principle is as yet not understood.
If you want something that will do what college students do when they manipulate equations in calculus, you'll have to build a fairly strong artificial intelligence.
http://en.wikipedia.org/wiki/Neocortex
When we can do whole-brain emulation of a human neocortex, I will post the answer here.
Yes, you need to write some algorithm to do this kind of computer algebra. At the least you need (a rough sketch follows below):
- a parser to interpret the input
- an algebra model to relate the parsed operands ('a', 'b', ...) and operators ('+', '=')
- an implementation of every rewrite rule needed to support the manipulations you wish to do
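To make those three points concrete, here is a minimal, illustrative sketch: an expression tree restricted to variables and '+', plus a single rewrite rule (x + y = z becomes x = z - y). All class names are made up for the example; a real system would need a parser, parenthesization, and many more rules:

// Illustrative only: variables, '+' and '=' with one rewrite rule; no parsing, no parentheses.
abstract class Expr {
    abstract boolean contains(String var);
}

class Var extends Expr {
    final String name;
    Var(String name) { this.name = name; }
    boolean contains(String var) { return name.equals(var); }
    public String toString() { return name; }
}

class Add extends Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    boolean contains(String var) { return left.contains(var) || right.contains(var); }
    public String toString() { return left + " + " + right; }
}

class Sub extends Expr {
    final Expr left, right;
    Sub(Expr left, Expr right) { this.left = left; this.right = right; }
    boolean contains(String var) { return left.contains(var) || right.contains(var); }
    public String toString() { return left + " - " + right; }
}

class Equation {
    final Expr lhs, rhs;
    Equation(Expr lhs, Expr rhs) { this.lhs = lhs; this.rhs = rhs; }

    // Rearranges the equation so that 'var' is alone on the left-hand side.
    // Only handles the rule  x + y = z  =>  x = z - y  (and its mirror image).
    Equation solveFor(String var) {
        Expr side = lhs.contains(var) ? lhs : rhs;
        Expr other = lhs.contains(var) ? rhs : lhs;
        while (!(side instanceof Var)) {
            Add add = (Add) side;             // assumes sums only
            if (add.left.contains(var)) {     // (x + y) = z  ->  x = z - y
                other = new Sub(other, add.right);
                side = add.left;
            } else {                          // (y + x) = z  ->  x = z - y
                other = new Sub(other, add.left);
                side = add.right;
            }
        }
        return new Equation(side, other);
    }

    public String toString() { return lhs + " = " + rhs; }
}

public class Rearrange {
    public static void main(String[] args) {
        // a + b = c
        Equation eq = new Equation(new Add(new Var("a"), new Var("b")), new Var("c"));
        System.out.println(eq.solveFor("b"));   // prints: b = c - a
    }
}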

Programmatic approach in Java for file comparison

What would be the best approach to compare two hexadecimal file signatures against each other for similarities?
More specifically, what I would like to do is take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. For this approach I plan to break the file's (exe) hex representation into individual groups of N chars (e.g. 10 hex chars) and do the same with the virus signatures. I am aiming to perform some sort of heuristic and therefore statistically check whether this exe file has X% similarity to a known virus signature.
The simplest, and likely very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, and therefore to compare exe1[0,9] against virus1[0,9]. Each subset will be graded statistically.
As you can see, there would be a massive number of comparisons, and hence it would be very, very slow. So I thought to ask whether you can think of a better approach to such a comparison, for example using different data structures together.
This is for a project I am doing for my BSc, where I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other is based on genetic algorithms to evolve the static virus signature. Any advice, comments, or general information such as resources are very welcome.
Definition: Polymorphic malware (virus, worm, ...) maintains the same functionality and payload as its "original" version while having an apparently different structure (variants). It achieves that by code obfuscation, thus altering its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting/removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination is not effective, polymorphic malware mutates to avoid detection.
Update: After the advice you gave me regarding what to read, I did that, but it somewhat confused me more. I found several distance algorithms that could apply to my problem, such as:
Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm
But now I don't know which to use; they all seem to do the same thing in different ways. I will continue researching so that I can understand each one better, but in the meantime, could you give me your opinion on which might be more suitable, so that I can give it priority during my research and study it deeper?
Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
There is a copy of the finished paper on GitHub
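For illustration, since the update above settled (among other things) on the Levenshtein distance, here is a minimal sketch of turning it into a similarity percentage over two hex strings. The dynamic-programming implementation is the standard one; the sample hex strings are made-up placeholders, and real signatures would be far longer, so you would want to chunk them or bound the computation:

public class HexSimilarity {

    // Classic dynamic-programming Levenshtein distance, O(n*m) time, O(m) space.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalizes the edit distance to a similarity percentage.
    static double similarityPercent(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 100.0;
        return 100.0 * (1.0 - (double) levenshtein(a, b) / maxLen);
    }

    public static void main(String[] args) {
        // Placeholder hex strings standing in for a file signature and a virus signature.
        String exeHex   = "4d5a90000300000004000000ffff";
        String virusHex = "4d5a90000300000004001100ffff";
        System.out.printf("Similarity: %.1f%%%n", similarityPercent(exeHex, virusHex));
    }
}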
For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).
Also, for polymorphic malware specifically, this field should offer you a lot, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)
One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
A number of papers have been published on finding near-duplicate documents in a large corpus of documents in the context of web search. I think you will find them useful. For example, see
this presentation.
There has been a serious amount of research recently into automating the detection of duplicate bug reports in bug repositories. This is essentially the same problem you are facing. The difference is that you are using binary data. They are similar problems because you will be looking for strings that have the same basic pattern, even though the patterns may have some slight differences. A straight-up distance algorithm probably won't serve you well here.
This paper gives a good summary of the problem as well as some approaches in its citations that have been tried.
ftp://ftp.computer.org/press/outgoing/proceedings/Patrick/apsec10/data/4266a366.pdf
As somebody has pointed out, similarity to known strings and bioinformatics methods might help. The longest common substring is very brittle, meaning that one difference can halve the length of such a string. You need a form of string alignment, but one more efficient than Smith-Waterman. I would try looking at programs such as BLAST, BLAT or MUMMER3 to see if they fit your needs. Remember that the default parameters for these programs are based on a biology application (how much to penalize an insertion or a substitution, for instance), so you should probably look at re-estimating the parameters based on your application domain, possibly using a training set. This is a known problem because even in biology different applications require different parameters (based, for instance, on the evolutionary distance of the two genomes being compared). It is also possible, though, that even at the defaults one of these algorithms might produce usable results. Best of all would be to have a generative model of how viruses change; that could guide you towards an optimal choice of distance and comparison algorithm.

Text similarity algorithm

I have two subtitles files.
I need a function that tells me whether they represent the same text or similar text.
Sometimes there are comments like "The wind is blowing... the music is playing" in one file only.
But 80% of the contents will be the same. The function must return TRUE (the files represent the same text).
And sometimes there are misspellings, like 1 instead of l (the digit one instead of lowercase L), as here:
She 1eft the baggage.
Of course, this means the function must still return TRUE.
My comments:
The function should return the percentage of similarity of the texts - AGREE
"all the people were happy" and "all the people were not happy" - here that would be considered a misspelling, so the texts would be considered the same. To be exact, the percentage the function returns will be lower, but high enough to say the phrases are similar
Do consider whether you want to apply Levenshtein to a whole file or just a search string - I'm not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It will be a very long string, though.
Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far/near they are. The result is an integer.
For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on term frequency vectors.
You might want to look at several implementations that are described here: Cosine Similarity
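For illustration, a minimal sketch of term-frequency cosine similarity in plain Java; the regex tokenizer and sample strings are placeholders, and a real implementation would likely normalize or weight terms (e.g. TF-IDF):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {

    // Builds a term-frequency vector using a naive lowercase, non-word-character tokenizer.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine of the angle between the two vectors: 1.0 = identical term distribution, 0.0 = no shared terms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double dot = 0.0;
        for (String term : shared) dot += a.get(term) * (double) b.get(term);
        double normA = 0.0, normB = 0.0;
        for (int f : a.values()) normA += (double) f * f;
        for (int f : b.values()) normB += (double) f * f;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String s1 = "She left the baggage at the station.";
        String s2 = "She 1eft the baggage at the station.";
        System.out.println(cosine(termFrequencies(s1), termFrequencies(s2)));
    }
}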
You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to provide good results for your input.
Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.
EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep
There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.
The choice of algorithm depends on the language, the type of words, whether the words are entered by humans, and much more...
Here you can find a helpful implementation of several algorithms within one library
If you are still looking for a solution, then go with S-BERT (Sentence-BERT), a lightweight approach that internally uses cosine similarity.
