Clause Segmentation using Stanford OpenIE - java

I'm in search of a good tool for segmenting complex sentences into clauses. Since I use CoreNLP tools for parsing, I learned that OpenIE performs clause segmentation in the process of extracting relation triples from a sentence. Currently, I use the sample code provided in the OpenIEDemo class from the GitHub repository, but it doesn't properly segment the sentence into clauses.
Here is the code:
// Create the Stanford CoreNLP pipeline
Properties props = PropertiesUtils.asProperties(
    "annotators", "tokenize,ssplit,pos,lemma,parse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Annotate sample sentence
String text = "I don't think he will be able to handle this.";
Annotation doc = new Annotation(text);
pipeline.annotate(doc);
// Loop over sentences in the document
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
    // Ask OpenIE for the clauses it would split this sentence into
    List<SentenceFragment> clauses = new OpenIE(props).clausesInSentence(sentence);
    for (SentenceFragment clause : clauses) {
        System.out.println("Clause: " + clause.toString());
    }
}
I expect to get three clauses as output:
I don't think
he will be able
to handle this
Instead, the code returns the input unchanged:
I do n't think he will be able to handle this
However, the sentence
Obama is born in Hawaii and he is no longer our president.
gets two clauses:
Obama is born in Hawaii and he is no longer our president
he is no longer our president
(seems that the coordinating conjunction is a good segmentation indicator)
Is OpenIE generally used for clause segmentation and if so, how to do it properly?
Any other practical approaches/tools on clause segmentation are welcome. Thanks in advance.

So, the clause segmenter is a bit more tightly integrated with OpenIE than the name would imply. The goal of the module is to produce logically entailed clauses, which can then be shortened into logically entailed sentence fragments. Going through your two examples:
I don't think he will be able to handle this.
None of the three clauses is, I think, entailed by the original sentence:
"I don't think" -- you likely still "think," even if you don't think something is true.
"He will be able" -- If you "think the world is flat," it doesn't mean that the world is flat. Similarly, if you "think he'll be able" it doesn't mean he'll be able.
"to handle this" -- I'm not sure this is a clause... I'd group this with "He will be able to handle this," with "able to" being treated as a single verb.
Obama is born in Hawaii and he is no longer our president.
Naturally the two clauses should be "Obama was born in Hawaii" and "He is no longer our president." Nonetheless, the clause splitter outputs the original sentence in place of the first clause, in expectation that the next step of the OpenIE extractor will strip off the "conj:and" edge.

Have you seen this Stanford CoreNLP parse tree visualization tool? http://nlpviz.bpodgursky.com/
I don't program, but I've been looking for CoreNLP tag groups that might signify an independent clause (stand on its own).
Your e.g.:
I don't think he will be able to handle this -
I don't think
S-NP-VP
He will be able
S-NP-VP
Handle this
VP-VB-NP
Another e.g. Researchers are developing algorithms to harness the force from a (MRI) to steer millimeter-sized robots
Researchers are developing
S-NP-VP
Harness the force
VP-NN-NP
steer millimeter-sized robots
VP-VB-NP
The red line is for the first layer, and the blue line is for the second layer
the red line is for the first layer
S-NP-VP
the blue line is for the second layer
S-NP-VP
Some metal ions can be harmful to cells, whereas others are necessary for biochemical reactions
Some metal ions can be harmful
S-NP-DT
Others are necessary
S-NP-NNS
But how that is determined is often based on questioning that can be subject to interpretations and many other states have laws that players be kept out different numbers of days.
how that is determined is often based on questioning
S-SBAR-VP
many other states have
S-NP-VB
kept out different numbers
VP-VPN-NP
For instance, past data on older humans and non-human primates have suggested that dietary carotenoids could slow cognitive decline.
past data have suggested
S-NP-VP
dietary carotenoids could slow
S-NP-VP
Combinations that I've noticed:
S-NP-VP
S-NP-DT
S-NP-NNS
S-SBAR-VP
S-NP-VB
VP-VPN-NP
VP-NN-NP
VP-VB-NP
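For anyone who does program, here is a minimal Java sketch of checking for the most common combination above, S-NP-VP. It does not call CoreNLP: it assumes you already have the parse as a Penn-style bracketed string, and the class name ClausePatternFinder and the parse in main are hand-written for illustration, not actual parser output.

```java
import java.util.ArrayList;
import java.util.List;

class ClausePatternFinder {

    /** Minimal constituency-tree node: a label plus children (a leaf word has none). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) { this.label = label; }

        /** Joins the leaf words under this node in sentence order. */
        String words() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder();
            for (Node child : children) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(child.words());
            }
            return sb.toString();
        }
    }

    private final String[] tokens;
    private int pos = 0;

    private ClausePatternFinder(String bracketed) {
        // Pad parentheses with spaces so a whitespace split yields the tokens.
        tokens = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    /** Recursive-descent reader for "(LABEL child child ...)" or a bare word. */
    private Node readNode() {
        String token = tokens[pos++];
        if (!token.equals("(")) return new Node(token);   // leaf word
        Node node = new Node(tokens[pos++]);               // the label follows "("
        while (!tokens[pos].equals(")")) node.children.add(readNode());
        pos++;                                             // consume ")"
        return node;
    }

    /** Adds the text of every S node that has both an NP child and a VP child. */
    private static void collect(Node node, List<String> out) {
        boolean hasNp = false;
        boolean hasVp = false;
        for (Node child : node.children) {
            if (child.label.equals("NP")) hasNp = true;
            if (child.label.equals("VP")) hasVp = true;
        }
        if (node.label.equals("S") && hasNp && hasVp) out.add(node.words());
        for (Node child : node.children) collect(child, out);
    }

    static List<String> findClauses(String bracketed) {
        List<String> out = new ArrayList<>();
        collect(new ClausePatternFinder(bracketed).readNode(), out);
        return out;
    }

    public static void main(String[] args) {
        // Hand-written parse of one of the example sentences (an assumption, not parser output).
        String parse = "(ROOT (S"
            + " (S (NP (DT The) (JJ red) (NN line))"
            + " (VP (VBZ is) (PP (IN for) (NP (DT the) (JJ first) (NN layer)))))"
            + " (, ,) (CC and)"
            + " (S (NP (DT the) (JJ blue) (NN line))"
            + " (VP (VBZ is) (PP (IN for) (NP (DT the) (JJ second) (NN layer)))))))";
        for (String clause : findClauses(parse)) {
            System.out.println("Clause: " + clause);
        }
    }
}
```

The outer S is skipped because its direct children are S, a comma, CC and S rather than NP and VP, so only the two inner clauses are reported. The other combinations in the list would need their own checks.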

Related

Determine the Plurality of a Noun/Verb

I have a program that randomly generates sentences based on a bunch of text documents containing all the nouns, verbs, adjectives, and adverbs. Does anyone know a way to determine whether a noun or verb is plural or singular, or whether there are any text documents that contain lists of singular nouns/verbs and plural nouns? I'm doing this all in Java, and I have a decent idea of how to get information off of a website, so if there are any websites that could do that as well, I'd also appreciate those.
I am afraid you cannot solve this with a fixed list of words, especially for verbs. Consider the sentences:
You are free. We are free.
In the first one, "are" is singular; in the second, it is plural. Using a proper tagger, as @jdaz suggests, is the only way to do this reliably.
If you work with English or one of a few other supported languages, StanfordNLP is an excellent choice. If you need broad language coverage, you can use UDPipe, which is written in C++ but has a Java binding.
The first step would be to look it up in a list. For English you can reduce the size of the list by only including singular nouns, and then apply some basic string processing to find plurals: if your word ends in -s and is not in the list, cut off the -s and look again. If it now is in the list, it was a simple plural (car/cars). If not, continue. If it ends in -ies, remove that, append -y and look again. Now you will capture remedies/remedy. There are a number of such patterns you can use.
Some irregular nouns need to be in an exception list (ox/oxen), but there aren't that many. Some words, of course, are unspecified for number, like sheep, data, or police. Here you need to look at the context: if the noun is followed by a singular verb (e.g. eats, or is), then it is singular as well.
With (English) verbs you can generally only identify the third person singular (with a procedure similar to the one used for nouns; you'd need a list of exceptions for verbs ending in -s, such as kiss). Forms of to be are more helpful, but the second person singular is an issue (are). However, unless you have direct speech in your texts, it will not be used very frequently.
Part of speech taggers can also only make these decisions on context, so I don't think they will be much of a help here. It's likely to be overkill. A couple of word lists and simple heuristic rules will probably give you equal or better accuracy using far fewer resources. This is the way these things were done before large amounts of annotated data were available.
In the end it depends on your circumstances. It might be quicker to simply use an existing tagger, but for this limited problem you might get better accuracy and speed with the rule-based approach (or even a combined one for accuracy).
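The list-lookup procedure for nouns described above can be sketched roughly as follows. The class name, the enum and the tiny word lists are placeholders I made up for illustration; a real program would load much larger lists from files, and the -es rule is just one more of the "such patterns" mentioned.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NounNumberGuesser {

    enum GrammaticalNumber { SINGULAR, PLURAL, UNSPECIFIED, UNKNOWN }

    // Tiny placeholder lists; real ones would be loaded from word-list files.
    static final Set<String> SINGULARS =
        new HashSet<>(Arrays.asList("car", "remedy", "ox", "kiss", "bus"));
    static final Set<String> IRREGULAR_PLURALS =
        new HashSet<>(Arrays.asList("oxen", "children", "mice"));
    static final Set<String> UNSPECIFIED =
        new HashSet<>(Arrays.asList("sheep", "data", "police"));

    static GrammaticalNumber classify(String word) {
        String w = word.toLowerCase();

        // Step 1: direct lookups in the lists.
        if (SINGULARS.contains(w)) return GrammaticalNumber.SINGULAR;
        if (UNSPECIFIED.contains(w)) return GrammaticalNumber.UNSPECIFIED; // needs context
        if (IRREGULAR_PLURALS.contains(w)) return GrammaticalNumber.PLURAL;

        // Step 2: cut off -s and look again (cars -> car).
        if (w.endsWith("s") && SINGULARS.contains(w.substring(0, w.length() - 1))) {
            return GrammaticalNumber.PLURAL;
        }
        // Step 3: replace -ies with -y and look again (remedies -> remedy).
        if (w.endsWith("ies") && SINGULARS.contains(w.substring(0, w.length() - 3) + "y")) {
            return GrammaticalNumber.PLURAL;
        }
        // Step 4: one more pattern of the same kind, -es (buses -> bus).
        if (w.endsWith("es") && SINGULARS.contains(w.substring(0, w.length() - 2))) {
            return GrammaticalNumber.PLURAL;
        }
        return GrammaticalNumber.UNKNOWN;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cars", "remedies", "oxen", "sheep", "kiss", "buses"}) {
            System.out.println(w + " -> " + classify(w));
        }
    }
}
```

Because the singular list is checked first, a word like kiss is not mistaken for a plural just because it ends in -s. Words reported as UNSPECIFIED still need the context check (for example, the following verb) described above.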

Machine Learning Classification of Lists of Strings in JAVA without any context surrounding them

I have several lists of Strings already classified like
<string> <tag>
088 9102355 PHONE NUMBER
091 910255 PHONE NUMBER
...
Alfred St STREET
German St STREET
...
RE98754TO IDENTIFIER
AUX9654TO IDENTIFIER
...
service open all day long DESCRIPTION
service open from 8 to 22 DESCRIPTION
...
jhon.smith@email.com EMAIL
jhon.smith@anothermail.com EMAIL
...
www.serviceSite.com URL
...
Florence CITY
...
with a lot of strings per tag, and I have to write a Java program which, given a new list of strings (all assumed to share the same tag), assigns a probability for each tag to the list.
The program has to be completely language independent, and all the knowledge has to come from lists of tagged strings like the one above.
I think this problem can be solved with NER approaches (i.e. machine learning algorithms like CRF), but those are usually meant for unstructured text, like a chapter from a book or a paragraph of a web page, and not for lists of independent strings.
I thought of using a CRF (i.e. Conditional Random Field) because I found a similar approach used in the Karma data integration tool, as described in this article, paragraph 3.1, where the "semantic types" are my tags.
To tackle the problem I downloaded the Stanford Named Entity Recognizer (NER) and played a bit with its Java API through NERDemo.java, finding two problems:
The training file for the CRFClassifier has to have one word per row, so I haven't found a way to classify groups of words with a single tag.
I don't understand whether I have to build one classifier per tag or a single classifier for all of them, because a single string could be classified with n different tags and it is the user who chooses between them. So I'm more interested in the probabilities assigned by the classifiers than in the exact class match. Furthermore, I don't have any "no tag" strings, so I don't know how the classifier behaves when assigning probabilities without them.
Is this the right approach to the problem? Is there a way to use the Stanford NER, or another Java API with CRF or another suitable machine learning algorithm, to do it?
Update
I managed to train the CRF classifier, first with each word classified independently with the tag and each group of words separated by two commas (classified as "no tag" (0)), then with each group of words treated as a single word with underscores replacing spaces, but the results of my small test were very disappointing. I haven't quite worked out which features to include and which to exclude from those described in the NERFeatureFactory javadoc, considering they can't have anything to do with language.
Update 2
The test results are beginning to make sense. I've separated each string (tagging every token) from the others with two newlines, instead of the horrible "two commas labeled with 0", and I've used the Stanford PTBTokenizer instead of the one I wrote. Moreover, I've tuned the features, turning on the usePrev and useNext features and using prefix/suffix n-grams of up to 6 characters, among other things.
The training file named training.tsv has this format:
rt05201201010to identifier
1442955884000 identifier
rt100005154602cv identifier
Alfred street
Street street
Robert street
Street street
and these are the flags in the properties file:
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useTags=false
useWordPairs=false
useDisjunctive=true
useSequences=false
usePrevSequences=true
useNextSequences=true
# the next flag can have these values: IO, IOB1, IOB2, IOE1, IOE2, SBIEO
entitySubclassification=IO
printClassifier=HighWeight
cacheNGrams=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
However, I found another problem: I managed to train only 39 labels with 100 strings each, though I have about 150 labels with more than 1000 strings each. Even so, training takes about 5 minutes, and if I raise these numbers a bit it throws a Java heap out-of-memory error.
Is there a way to scale up to those numbers with a single classifier? Is it better to train 150 (or fewer, maybe each with two or three labels) small classifiers and combine them later? Do I need to train with 1000+ strings per label, or can I stop at 100 (maybe choosing strings that differ a lot from one another)?
The first thing you should be aware of is that (linear-chain) CRF taggers are not designed for this purpose. They are a very nice solution for context-based prediction, i.e. when you have words before and after named entities and you look for clues in a limited window (e.g. 2 words before/after the current word). This is why you had to insert double newlines: to delimit sentences. They also provide coherence between the tags assigned to words, which is indeed a good thing in your case.
A CRF tagger should work, but at an extra cost in the learning step, which you could avoid by using simpler but still accurate machine learning methods (maximum entropy, SVM). In Java, for your task, wouldn't Weka be a better solution? I would also consider BIO tagging not relevant in your case.
Whatever software or coding you use, it is not surprising that character-level n-grams give good improvements, but I believe you may add dedicated features. For instance, since morphological clues are important (presence of an "@", upper-case characters or digits), you may use codes (see ref [1]), which are a very convenient way to describe strings. You'll also most probably obtain better results by using lists of names (lexicons) that can be triggered as additional features.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
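To make the idea of such codes concrete, here is a minimal Java sketch of one common way to compute them: map every upper-case letter to 'A', every lower-case letter to 'a', every digit to '0', and keep other characters as they are. This is my own illustration of the general technique and not necessarily the exact scheme used in [1]; the class and method names are made up.

```java
class WordShape {

    /** Maps each character to a type code: A = upper case, a = lower case, 0 = digit. */
    static String shape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isUpperCase(c)) sb.append('A');
            else if (Character.isLowerCase(c)) sb.append('a');
            else if (Character.isDigit(c)) sb.append('0');
            else sb.append(c);                 // keep punctuation such as '@' or '.'
        }
        return sb.toString();
    }

    /** Same code, but with runs of identical symbols collapsed to a single symbol. */
    static String compressedShape(String s) {
        String full = shape(s);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < full.length(); i++) {
            if (i == 0 || full.charAt(i) != full.charAt(i - 1)) sb.append(full.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"RE98754TO", "088 9102355", "Alfred St", "jhon.smith@email.com"};
        for (String s : samples) {
            System.out.println(s + " -> " + shape(s) + " / " + compressedShape(s));
        }
    }
}
```

Both codes depend only on character classes, so they stay language independent, and either one can be fed to a maximum entropy or SVM classifier as an extra feature next to the character n-grams.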

Tools for text simplification (Java) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
What is the best tool that can do text simplification using Java?
Here is an example of text simplification:
John, who was the CEO of a company, played golf.
↓
John played golf. John was the CEO of a company.
I see your problem as the task of converting a complex or compound sentence into simple sentences.
According to the literature on sentence types, a simple sentence is built from one independent clause, while a compound or complex sentence is built from at least two clauses. Also, a clause must have a subject and a verb.
So your task is to split a sentence into the clauses that form it.
Dependency parsing from Stanford CoreNLP is a well-suited tool for splitting compound and complex sentences into simple sentences. You can try the demo online.
From your sample sentence, we get the parse result in Stanford typed dependency (SD) notation as shown below:
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
A clause can be identified from a relation (in SD) whose category is subject, e.g. nsubj or nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses, i.e.
John CEO
John played
After you get the basic clauses, you can add other parts to make each clause a complete and meaningful sentence. To do so, please consult the Stanford Dependency Manual.
By the way, your question might be related to Finding meaningful sub-sentences from a sentence
Answer to 3rd comment:
Once you have the subject-verb pair, i.e. nsubj(CEO-6, John-1), get all dependencies linked to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.
In the example, nsubj(CEO-6, John-1), if you start traversing from John-1, you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.
The next step is traversing from CEO-6. You'll get
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
From the result above, you get new dependencies to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 as either head or dependent).
Now your dependencies are
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)
At this point, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from every head and dependent, then arrange the words in ascending order by the number appended to them. This number indicates the word's position in the original sentence.
John was the CEO a company
Our new sentence is missing one part, i.e. "of". This part is hidden in prep_of(CEO-6, company-9). If you read the Stanford Dependency Manual, there are two kinds of SD: collapsed and non-collapsed. Please read about them to understand why this "of" is hidden and how to recover the word order of this hidden part.
With the same approach, you'll get the second sentence
John played golf
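The traversal can be sketched in plain Java roughly like this. It is only a sketch under assumptions: the Dep class and all method names are made up (this is not the CoreNLP API), the dependency list is typed in by hand from the output above, and the walk only follows edges from a head to its dependents, starting at the head of each nsubj relation. That is narrower than "either head or dependent"; I chose it so that rcmod(John-1, CEO-6) does not pull the relative clause into the second sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class ClauseFromDependencies {

    /** One typed dependency such as nsubj(played-11, John-1). Hypothetical helper class. */
    static final class Dep {
        final String rel, head, dependent;
        final int headIdx, dependentIdx;

        Dep(String rel, String head, int headIdx, String dependent, int dependentIdx) {
            this.rel = rel;
            this.head = head;
            this.headIdx = headIdx;
            this.dependent = dependent;
            this.dependentIdx = dependentIdx;
        }
    }

    private static boolean isSubject(Dep d) {
        return d.rel.startsWith("nsubj");   // covers nsubj and nsubjpass
    }

    /** Adds the word, then recurses into its dependents, skipping subject relations. */
    private static void collect(String word, int idx, List<Dep> deps, TreeMap<Integer, String> words) {
        if (words.containsKey(idx)) return;  // already visited
        words.put(idx, word);
        for (Dep d : deps) {
            if (d.headIdx == idx && !isSubject(d)) {
                collect(d.dependent, d.dependentIdx, deps, words);
            }
        }
    }

    /** Builds one clause per subject relation; the TreeMap keeps original word order. */
    static List<String> clauses(List<Dep> deps) {
        List<String> out = new ArrayList<>();
        for (Dep d : deps) {
            if (!isSubject(d)) continue;
            TreeMap<Integer, String> words = new TreeMap<>();
            words.put(d.dependentIdx, d.dependent);   // the subject itself
            collect(d.head, d.headIdx, deps, words);  // the predicate and what hangs off it
            out.add(String.join(" ", words.values()));
        }
        return out;
    }

    /** The dependencies of the sample sentence, copied by hand from the answer. */
    static List<Dep> exampleDeps() {
        return Arrays.asList(
            new Dep("nsubj", "CEO", 6, "John", 1),
            new Dep("nsubj", "played", 11, "John", 1),
            new Dep("cop", "CEO", 6, "was", 4),
            new Dep("det", "CEO", 6, "the", 5),
            new Dep("rcmod", "John", 1, "CEO", 6),
            new Dep("det", "company", 9, "a", 8),
            new Dep("prep_of", "CEO", 6, "company", 9),
            new Dep("root", "ROOT", 0, "played", 11),
            new Dep("dobj", "played", 11, "golf", 12));
    }

    public static void main(String[] args) {
        for (String clause : clauses(exampleDeps())) {
            System.out.println(clause);
        }
    }
}
```

On the sample dependencies this gives "John was the CEO a company" and "John played golf". The "of" is still missing for the reason explained above: it only exists inside the collapsed relation name prep_of.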
I think one can design a very simple algorithm for the basic cases of this situation, though real-world cases may be so numerous that such an approach becomes unruly :)
Still, I thought I should think aloud, write down my approach, and maybe add some Python code. My basic idea is to derive a solution from first principles, mostly by explicitly exposing our model of what is really happening, and not to rely on other theories, models, or libraries BEFORE we do one by HAND and from SCRATCH.
Goal: given a sentence, extract subsentences from it.
Example: John, who was the ceo of the company, played Golf.
Expected output: John was the CEO of the company. John played Golf.
Here is my model of what is happening here written out in the form of model assumptions:
(axioms?)
MA1. Simple sentences can be expanded by inserting subsentences.
MA2. A subsentence is a qualification/modification(additional information) on one or more of the entities.
MA3. To insert a subsentence, we put a comma right next to the entity we want to expand on (provide more information on) and attach the subsentence, I am going to call it an extension - and place another comma when the extension ends.
Given this model, the algorithm can be straightforward at least to address the simple cases first.
1. DETECT: Given a sentence, detect whether it has an extension clause by looking for a pair of commas in the sentence.
2. EXTRACT: If you find two commas, generate two sentences:
2.1 EXTRACT-BASE: base sentence:
delete everything between the two commas (and the commas themselves). You get the base sentence.
2.2 EXTRACT-EXTENSION: extension sentence:
take everything between the two commas and replace 'who' with the word right before the first comma.
That is your second sentence.
3. PRINT: In fact you should print the extension sentence first, because the base sentence depends on it.
Well, that is our algorithm. Yes it sounds like a hack. It is. But something I am learning now, is that, if you use a trick in one program it is a hack, if it can handle more stuff, it is a technique.
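Since the rest of this thread is about Java, here is a minimal Java sketch of the simple DETECT / EXTRACT / PRINT version for exactly one comma pair. The class and method names are made up, and it assumes the extension starts with 'who'; anything else is passed through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

class CommaExtensionSplitter {

    private static String endWithPeriod(String s) {
        return s.endsWith(".") ? s : s + ".";
    }

    /** Returns the extension sentence first, then the base sentence. */
    static List<String> split(String sentence) {
        List<String> out = new ArrayList<>();

        // DETECT: look for a pair of commas.
        int open = sentence.indexOf(',');
        int close = open < 0 ? -1 : sentence.indexOf(',', open + 1);
        if (close < 0) {            // no pair of commas: nothing to extract
            out.add(sentence);
            return out;
        }

        String before = sentence.substring(0, open).trim();
        String inside = sentence.substring(open + 1, close).trim();
        String after = sentence.substring(close + 1).trim();

        // The entity being expanded is the word right before the opening comma.
        String entity = before.substring(before.lastIndexOf(' ') + 1);

        // EXTRACT-EXTENSION: replace a leading 'who' with the entity.
        String extension = inside.startsWith("who ")
            ? entity + inside.substring(3)
            : inside;               // assumption: leave other extensions untouched

        // EXTRACT-BASE: drop everything between the two commas.
        String base = before + " " + after;

        // PRINT order: extension first, because the base sentence depends on it.
        out.add(endWithPeriod(extension));
        out.add(endWithPeriod(base));
        return out;
    }

    public static void main(String[] args) {
        for (String s : split("John, who was the CEO of a company, played golf.")) {
            System.out.println(s);
        }
    }
}
```

For the sample sentence this prints "John was the CEO of a company." followed by "John played golf.".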
So let us expand and complicate the situation a bit.
Compounding cases:
Example 2. John, who was the CEO of the company, played Golf with Ram, the CFO.
As I was writing it, I noticed that I had omitted the 'who was' phrase for the CFO!
That brings us to a complicating case where our algorithm will fail. Before going there,
let me create a simpler version of 2 that WILL work.
Example 3. John, who was the CEO of the company, played Golf with Ram, who was the CFO.
Example 4. John, the CEO of the company, played Golf with Ram, the CFO.
Wait we are not done yet!
Example 5. John, who is the CEO and Ram, who was the CFO at that time, played Golf, which is an engaging game.
To allow for this I need to extend my model assumptions:
MA4. More than one entity may be expanded likewise; this should not cause confusion, because each
extension clause occurs right next to the entity it describes. (accounts for example 3)
MA5. The 'who was' phrase may be omitted since it can be inferred by the listener. (accounts for example 4)
MA6. Some entities are persons, they will be extended using a 'who' and some entities are things, extended using a 'which'. Either of these extension heads may be omitted.
Now how do we handle these complications in our algorithm?
Try this:
SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS:
If the sentence contains a comma, look for the following comma, and extract whatever is in between into an extension sentence. Continue until no opening or closing comma is left.
At this point you should have a list with the base sentence and one or more extension sentences.
PROCESS_EXTENSIONS:
For each extension, if it has 'who is' or 'which is', replace the 'who' or 'which' with the name that precedes the extension.
If the extension does not have a 'who is' or 'which is', prepend the leading word and an 'is'.
PRINT: all extension sentences first and then the base sentence.
Not scary.
When I get some time in the next few days, I will add a python implementation.
Thank you
Ravi Annaswamy
You are unlikely to solve this problem using any known algorithm in the general case - this is getting into strong AI territory. Even humans can't parse grammar very well!
Note that the problem is quite ambiguous regarding how far you simplify and what assumptions you are willing to make. You could take your example further and say:
John is assumed to be the name of a being. The race of John is unknown. John played
golf at some point in the past. Golf is assumed to refer to the ball
game called golf, but the variant of golf that John played is unknown.
At some point in the past John was the CEO of a company. CEO is assumed to
mean "Chief Executive Officer" in the context of a company but this is
not specified. The company is unknown.
In case the lesson is not obvious: the more you try to determine the exact meaning of words, the more cans of worms you start to open up...... it takes human-like levels of judgement and interpretation to know when to stop.
You may be able to solve some simpler cases using various Java-based NLP tools: see Is there a good natural language processing library
I believe AlchemyApi is your best option. Still, it will require a lot of work on your side to do exactly what you need and, as most commenters have already told you, you will most probably not get 100% quality results.

How can you localize a sentence with a dynamic number of conjunctions?

Suppose I've got a sentence that I'm forming dynamically based on data passed in from someone else. Let's say it's a list of food items, like: ["apples", "pears", "oranges"] or ["bread", "meat"]. I want to take these lists and form a sentence: "I like to eat apples, pears and oranges" and "I like to eat bread and meat."
It's easy to do this for just English, but I'm fairly sure that conjunctions don't work the same way in all languages (in fact I bet some of them are wildly different). How do you localize a sentence with a dynamic number of items, joined in some manner?
If it helps, I'm working on this for Android, so you will be able to use any libraries that it provides (or one for Java).
I would ask native speakers of the languages you want to target. If all the languages use the same construct (a comma between all the elements except before the last one), then simply build a comma-separated list of the first n - 1 elements, and use an externalized pattern with java.text.MessageFormat to build the whole sentence, passing the string of the first n - 1 elements as the first argument and the nth element as the second argument. You might also externalize the separator (comma in English) to make it more flexible if needed.
If some languages use another construct, then define several Strategy implementations, externalize the name of the strategy in order to know which strategy to use for a given locale, and then ask the appropriate strategy to format the list for you. The strategy for English would use the algorithm described above. The strategy for another language would use a different way of joining the elements together, but it could be reused for several languages using externalized patterns if those languages use the same construct but with different words.
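A minimal sketch of the English strategy described above could look like this. The class name, the method signature and the pattern strings are hypothetical; in a real application the separator and both patterns would come from the externalized resources for the current locale.

```java
import java.text.MessageFormat;
import java.util.Arrays;
import java.util.List;

class ListSentenceFormatter {

    /**
     * Joins the first n - 1 items with the separator and lets the pattern place
     * that string ({0}) and the last item ({1}) into the sentence.
     */
    static String format(List<String> items, String separator,
                         String singlePattern, String manyPattern) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("at least one item is required");
        }
        if (items.size() == 1) {
            return MessageFormat.format(singlePattern, items.get(0));
        }
        String head = String.join(separator, items.subList(0, items.size() - 1));
        String last = items.get(items.size() - 1);
        return MessageFormat.format(manyPattern, head, last);
    }

    public static void main(String[] args) {
        // These strings stand in for externalized, per-locale resources.
        String separator = ", ";
        String single = "I like to eat {0}.";
        String many = "I like to eat {0} and {1}.";

        System.out.println(format(Arrays.asList("apples", "pears", "oranges"), separator, single, many));
        System.out.println(format(Arrays.asList("bread", "meat"), separator, single, many));
    }
}
```

One thing to watch when writing the patterns: MessageFormat treats a single quote as a quoting character, so a literal apostrophe in a pattern has to be doubled.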
Note that it's because it's so difficult that most of the programs simply do List of what I like: bread, meat.
The problem, of course, is knowing the other languages. I don't know how you'd do it accurately without finding native speakers for your target languages. Google Translate is an option, but you should still have someone who speaks the language proofread it.
Android has this built-in functionality in the Plurals string resources.
Their documentation provides examples in English (res/values/strings.xml) and also in another language (placed in res/values-pl/strings.xml)
Valid quantities: {zero, one, two, few, many, other}
Their example:
<?xml version="1.0" encoding="utf-8"?>
<resources>
    <plurals name="numberOfSongsAvailable">
        <item quantity="one">One song found.</item>
        <item quantity="other">%d songs found.</item>
    </plurals>
</resources>
So it seems like their approach is to use longer sentence fragments than just the individual words and plurals.
Very difficult. Take French for example: it uses articles in front of each noun. Your example would be:
J'aime manger des pommes, des poires et des oranges.
Or better:
J'aime les pommes, les poires et les oranges.
(Literally: I like the apples, the pears and the oranges).
So it's no longer only a problem of conjunctions. It's a matter of different grammatical rules that go beyond conjunctions. Needless to say, other languages may raise totally different issues.
This is unfortunately nearly impossible to do. That is because of conjugation on one side and different forms depending on gender on the other. English doesn't have that distinction but many (most of?) other languages do. To make the matter worse, verb form might depend on the following noun (both gender-dependent form and conjugation).
Instead of such nice sentences, we tend to use lists of things, for example:
The list of things I would like to eat:
apples
pears
oranges
I am afraid that this is the only reasonable thing to do (variable, modifiable list).

semantic similarity between sentences

I'm doing a project. I need an open-source tool or technique to find the semantic similarity of two sentences, where I give two sentences as input and receive a score (i.e., the semantic similarity) as output. Any help?
Salma, I'm afraid this is not the right forum for your question as it's not directly related to programming. I recommend that you ask your question again on corpora list. You also may want to search their archives first.
Apart from that, your question is not precise enough, and I'll explain what I mean by that. I assume that your project is about computing the semantic similarity between sentences and not about something else to which semantic similarity is just one thing among many. If this is the case, then there are a few things to consider: First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. There are numerous different views and definitions of it, all depending on the type of problem to be solved, the tools and techniques which are at hand, and the background of the one approaching this task, etc. Consider these examples:
1. Pete and Rob have found a dog near the station.
2. Pete and Rob have never found a dog near the station.
3. Pete and Rob both like programming a lot.
4. Patricia found a dog near the station.
5. It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-5 are similar to 1? 2 is the exact opposite of 1, yet it is still about Pete and Rob (not) finding a dog. 3 is about Pete and Rob, but in a completely different context. 4 is about finding a dog near the station, although the finder is someone else. 5 is about Pete, Rob, a dog, and a 'finding' event, but in a different way than in 1. As for me, I would not be able to rank these examples according to their similarity even without having to write a computer program.
In order to compute semantic similarity, you first need to decide what you want to be treated as 'semantically similar' and what not. To compute semantic similarity at the sentence level, you would ideally compare some kind of meaning representation of the sentences. Meaning representations normally come as logical formulas and are extremely complex to generate. However, there are tools which attempt to do this, e.g. Boxer.
As a simplistic but often practical approach, you would define semantic similarity as the sum of the similarities between the words in one sentence and the other. This makes the problem a lot easier, although there are still some difficult issues to be addressed, since the semantic similarity of words is just as badly defined as that of sentences. If you want to get an impression of this, take a look at the book 'Lexical Semantics' by D.A. Cruse (1986). However, there are quite a number of tools and techniques to compute the semantic similarity between words. Some of them define it basically as the negative distance between two words in a taxonomy like WordNet or the Wikipedia taxonomy (see this paper which describes an API for this). Others compute semantic similarity by using statistical measures calculated over large text corpora. They are based on the insight that similar words occur in similar contexts. A third approach to calculating semantic similarity between sentences or words is concerned with vector space models, which you may know from information retrieval. To get an overview of these latter techniques, take a look at chapter 8.5 in the book Foundations of Statistical Natural Language Processing by Manning and Schütze.
Hope this gets you off on your feet for now.
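To give a feeling for the vector space idea, here is a minimal Java sketch that turns each sentence into a bag-of-words count vector and compares the two vectors by cosine similarity. The class and method names are made up, and this is deliberately the crudest possible variant: there is no stemming, no stop-word list and no word-to-word similarity, only exact word overlap.

```java
import java.util.HashMap;
import java.util.Map;

class BagOfWordsSimilarity {

    /** Lower-cases the sentence, splits on non-letter/non-digit runs, counts each word. */
    static Map<String, Integer> termCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between the two count vectors; 0.0 if either is empty. */
    static double cosine(String a, String b) {
        Map<String, Integer> va = termCounts(a);
        Map<String, Integer> vb = termCounts(b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = vb.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : vb.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String s1 = "Pete and Rob have found a dog near the station.";
        String s2 = "Pete and Rob have never found a dog near the station.";
        String s3 = "I am fine, thanks!";
        System.out.println(cosine(s1, s2));
        System.out.println(cosine(s1, s3));
    }
}
```

Note that sentences 1 and 2 from the list above come out as highly similar (roughly 0.95) although one negates the other, which is exactly the kind of problem described earlier: you have to decide first what 'similar' is supposed to mean.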
I have developed a simple open-source tool that does the semantic comparison according to categories:
https://sourceforge.net/projects/semantics/files/
It works with sentences of any length, is simple, stable, fast, small in size...
Here is a sample output:
Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
is: 1.0000000000
Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
is: 0.7363210405107239
Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
is: 0.0
Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
is: 0.29160592175990213
USAGE:
import semantics.Compare;

public class USAGE {
    public static void main(String[] args) {
        String a = "This is a first sentence.";
        String b = "This is a second one.";
        Compare c = new Compare(a, b);
        System.out.println("Similarity between the sentences\n-" + a + "\n-" + b + "\n is: " + c.getResult());
    }
}
You can try using the UMBC Semantic Similarity Service which is based on WordNet KB.
There is also the UMBC STS (Semantic Textual Similarity) Service. Here is the link: http://swoogle.umbc.edu/StsService/sts.html
Regards,
