Intro to my problem: users can search for terms, and RitaWordNet provides a method called getSenseIds() that returns the related senses. So far I have been using WS4J (WordNet Similarity for Java, http://code.google.com/p/ws4j/), which offers several algorithms for measuring semantic distance. A search for "user" gives this result:
user
exploiter
drug user
http://wordnetweb.princeton.edu/perl/webwn?s=user&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=0
The Lin distance is measured by comparing two terms in WS4J (against the target word, I assume):
Similarity between: user and: user = 1.7976931348623157E308
Similarity between: user and: exploiter = 0.1976958835785797
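Roughly, this is how I compute those scores - a minimal sketch assuming the standard WS4J setup (NictWordNet backend plus the Lin measure):

import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;
import edu.cmu.lti.ws4j.util.WS4JConfiguration;

public class LinSimilarityDemo {
    public static void main(String[] args) {
        ILexicalDatabase db = new NictWordNet();       // WordNet backend bundled with WS4J
        WS4JConfiguration.getInstance().setMFS(true);  // use the most frequent sense
        RelatednessCalculator lin = new Lin(db);
        System.out.println("user/user:      " + lin.calcRelatednessOfWords("user", "user"));
        System.out.println("user/exploiter: " + lin.calcRelatednessOfWords("user", "exploiter"));
    }
}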
I would like to suggest to the end user that the "user" sense is the most relevant/correct answer, but the problem is that this depends on the rest of the sentence.
Example: "The old man was a regular user of public transport", "The young man became a drug user while studying NLP..".
I assume that the SenseRelate project includes something that I'm missing. This thread also came up during my search:
word disambiguation algorithm (Lesk algorithm)
Hopefully someone got my question :)
You might want to try WordNet::SenseRelate::AllWords - there's an online demo at http://maraca.d.umn.edu
Related
Suppose I tell my Java program to find the subject of the sentence:
I enjoy spending time with my family.
My program should output:
Tell me more about your family.
How would it go about doing this?
EDIT
I could do this by having an array of String and have that filled with every noun in the English dictionary, but is there a simpler way?
This is way too open-ended a question. But a good place to start would be to learn about Natural Language Processing concepts and then look at using a framework like CoreNLP. It breaks down sentences into a parse tree and you can use this to identify parts of speech and things like the subject of a sentence. This is probably your best bet if you want a reasonably-reliable method.
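As a very rough illustration of that direction (not a real subject detector), here is a sketch that uses the CoreNLP POS tagger to pick out the last noun and echo it back; the response template and the "last noun" heuristic are made up for the example:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class EchoBotSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("I enjoy spending time with my family.");
        pipeline.annotate(doc);

        String topic = null;
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Remember the last noun we see (tags NN, NNS, NNP, NNPS).
                if (token.get(CoreAnnotations.PartOfSpeechAnnotation.class).startsWith("NN")) {
                    topic = token.word();
                }
            }
        }
        if (topic != null) {
            System.out.println("Tell me more about your " + topic + ".");
        }
    }
}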
What is the best tool that can do text simplification using Java?
Here is an example of text simplification:
John, who was the CEO of a company, played golf.
↓
John played golf. John was the CEO of a company.
I see your problem as a task of converting a complex or compound sentence into simple sentences.
According to the literature on sentence types, a simple sentence is built from one independent clause, while compound and complex sentences are built from at least two clauses. Also, a clause must have a subject and a verb.
So your task is to split the sentence into the clauses that form it.
Dependency parsing from Stanford CoreNLP is a perfect tool for splitting compound and complex sentences into simple ones. You can try the demo online.
From your sample sentence, we get the parse result in Stanford typed dependency (SD) notation as shown below:
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
A clause can be identified from any relation (in SD) whose category is subject, e.g. nsubj, nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses, i.e.
John CEO
John played
After you get the basic clauses, you can add the other parts to make each clause a complete and meaningful sentence. To do so, please consult the Stanford Dependency Manual.
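If you want to get these dependencies programmatically rather than from the online demo, here is a minimal sketch against the CoreNLP API (the pipeline setup is standard; the nsubj filtering is just the idea described above):

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class BasicClauseSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("John, who was the CEO of a company, played golf.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph sd = sentence.get(
                    SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            // Every subject relation (nsubj, nsubjpass, ...) marks one basic clause:
            // the dependent is the subject part, the governor is the verb/head part.
            for (SemanticGraphEdge edge : sd.edgeIterable()) {
                if (edge.getRelation().toString().startsWith("nsubj")) {
                    System.out.println(edge.getDependent().word() + " " + edge.getGovernor().word());
                }
            }
        }
    }
}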
By the way, your question might be related to Finding meaningful sub-sentences from a sentence
Answer to 3rd comment:
Once you have the pair of subject and verb, i.e. nsubj(CEO-6, John-1), get all dependencies that link to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.
Based on the example nsubj(CEO-6, John-1): if you start traversing from John-1, you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.
The next step is traversing from the CEO-6 part. You'll get
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
From the result above, you get new dependencies to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 in either the head or the dependent).
Now your dependencies are
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)
In this step, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from all heads and dependents, then arrange the words in ascending order based on the number appended to each word. This number indicates the word's position in the original sentence.
John was the CEO a company
Our new sentence is missing one part, i.e. of. This part is hidden in prep_of(CEO-6, company-9). If you read the Stanford Dependency Manual, there are two kinds of SD, collapsed and non-collapsed. Please read about them to understand why this of is hidden and how to get the word order of this hidden part.
With the same approach, you'll get the second sentence
John played golf
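This traversal can be sketched in a few lines on top of CoreNLP's SemanticGraph. The helper below is only an illustration of the walk described above (the helper class and method names are made up; the SemanticGraph calls are CoreNLP's):

import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

public class ClauseCollector {

    // Collects the words of one clause, given its subject edge, e.g. nsubj(CEO-6, John-1).
    static String collectClause(SemanticGraph sd, SemanticGraphEdge subjectEdge) {
        Set<IndexedWord> clause = new TreeSet<>(Comparator.comparingInt(IndexedWord::index));
        clause.add(subjectEdge.getDependent());   // the subject word itself (John-1)
        Deque<IndexedWord> queue = new ArrayDeque<>();
        queue.add(subjectEdge.getGovernor());     // start traversing from the head (CEO-6)
        while (!queue.isEmpty()) {
            IndexedWord node = queue.poll();
            if (!clause.add(node)) continue;      // already visited
            for (SemanticGraphEdge edge : sd.outgoingEdgeIterable(node)) {
                // Skip relations whose category is subject, as explained above.
                if (edge.getRelation().toString().startsWith("nsubj")) continue;
                queue.add(edge.getDependent());
            }
        }
        // Words come out sorted by index, i.e. by their position in the original sentence.
        StringBuilder sb = new StringBuilder();
        for (IndexedWord w : clause) sb.append(w.word()).append(' ');
        return sb.toString().trim();
    }
}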
I think one can design a very simple algorithm for the basic cases of this situation, though real-world cases may be so numerous that such an approach will become unruly :)
Still, I thought I should think aloud and write out my approach, and maybe add some python code later. My basic idea is to derive a solution from first principles, mostly by explicitly exposing our model of what is really happening, and not to rely on other theories, models, or libraries BEFORE we do one by HAND and from SCRATCH.
Goal: given a sentence, extract subsentences from it.
Example: John, who was the CEO of the company, played Golf.
Expected output: John was the CEO of the company. John played Golf.
Here is my model of what is happening, written out in the form of model assumptions (axioms?):
MA1. Simple sentences can be expanded by inserting subsentences.
MA2. A subsentence is a qualification/modification(additional information) on one or more of the entities.
MA3. To insert a subsentence, we put a comma right after the entity we want to expand on (provide more information about), attach the subsentence - I am going to call it an extension - and place another comma when the extension ends.
Given this model, the algorithm can be straightforward, at least for the simple cases.
DETECT: Given a sentence, detect if it has an extension clause, by looking for a pair of commas in the sentence.
EXTRACT: If you find two commas, generate two sentences:
2.1 EXTRACT-BASE: base sentence:
delete everything between the two commas, and you get the base sentence.
2.2 EXTRACT-EXTENSION: extension sentence:
take everything between the two commas and replace 'who' with the word right before the first comma.
That is your second sentence.
PRINT: In fact you should print the extension sentence first, because the base sentence depends on it.
Well, that is our algorithm. Yes, it sounds like a hack. It is. But something I am learning now is that if you use a trick in one program it is a hack; if it can handle more cases, it is a technique.
So let us expand and complicate the situation a bit.
Compounding cases:
Example 2. John, who was the CEO of the company, played Golf with Ram, the CFO.
As I was writing it, I noticed that I had omitted the 'who was' phrase for the CFO!
That brings us to a complicating case where our algorithm will fail. Before going there, let me create a simpler version of example 2 that WILL work.
Example 3. John, who was the CEO of the company, played Golf with Ram, who was the CFO.
Example 4. John, the CEO of the company, played Golf with Ram, the CFO.
Wait we are not done yet!
Example 5. John, who is the CEO and Ram, who was the CFO at that time, played Golf, which is an engaging game.
To allow for this I need to extend my model assumptions:
MA4. More than one entity may be expanded in this way, and this should not cause confusion because the extension clause occurs right next to the entity being described. (accounts for example 3)
MA5. The 'who was' phrase may be omitted since it can be inferred by the listener. (accounts for example 4)
MA6. Some entities are persons; they will be extended using a 'who'. Some entities are things, extended using a 'which'. Either of these extension heads may be omitted.
Now how do we handle these complications in our algorithm?
Try this:
SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS:
If the sentence contains a comma, look for the next comma and extract whatever is in between into an extension sentence. Continue until no opening or closing comma is left.
At this point you should have a list with the base sentence and one or more extension sentences.
PROCESS_EXTENSIONS:
For each extension, if it has 'who is' or 'which is', replace it with the name that appears right before the extension.
If the extension does not have a 'who is' or 'which is', prepend that name and an 'is'.
PRINT: all extension sentences first and then the base sentence.
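A minimal Java sketch of this splitting idea, under the model assumptions above (it only handles well-formed comma pairs and the 'who/which' patterns; the class and method names are made up for illustration):

import java.util.ArrayList;
import java.util.List;

public class ExtensionSplitter {

    public static void main(String[] args) {
        split("John, who was the CEO of the company, played Golf.");
        // prints: John was the CEO of the company. / John played Golf.
    }

    static void split(String sentence) {
        List<String> extensions = new ArrayList<>();
        StringBuilder base = new StringBuilder();
        String remaining = sentence;

        // SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS:
        // repeatedly cut out the text between the next pair of commas.
        while (true) {
            int open = remaining.indexOf(',');
            if (open < 0) break;
            int close = remaining.indexOf(',', open + 1);
            if (close < 0) break;
            String before = remaining.substring(0, open);
            String inside = remaining.substring(open + 1, close).trim();

            // PROCESS_EXTENSIONS: the entity being described is the last word before the comma.
            String entity = before.substring(before.lastIndexOf(' ') + 1);
            if (inside.startsWith("who ") || inside.startsWith("which ")) {
                inside = entity + inside.substring(inside.indexOf(' '));  // "who was the CEO" -> "John was the CEO"
            } else {
                inside = entity + " is " + inside;                        // MA5: the omitted 'who is'
            }
            extensions.add(inside + ".");
            base.append(before);
            remaining = remaining.substring(close + 1);
        }
        base.append(remaining);

        // PRINT: extensions first, then the base sentence.
        for (String e : extensions) System.out.println(e);
        System.out.println(base.toString().trim());
    }
}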
Not scary.
When I get some time in the next few days, I will add a python implementation.
Thank you
Ravi Annaswamy
You are unlikely to solve this problem using any known algorithm in the general case - this is getting into strong AI territory. Even humans can't parse grammar very well!
Note that the problem is quite ambiguous regarding how far you simplify and what assumptions you are willing to make. You could take your example further and say:
John is assumed to be the name of a being. The race of John is unknown. John played golf at some point in the past. Golf is assumed to refer to the ball game called golf, but the variant of golf that John played is unknown. At some point in the past John was the CEO of a company. CEO is assumed to mean "Chief Executive Officer" in the context of a company but this is not specified. The company is unknown.
In case the lesson is not obvious: the more you try to determine the exact meaning of words, the more cans of worms you open up... it takes human-like levels of judgement and interpretation to know when to stop.
You may be able to solve some simpler cases using various Java-based NLP tools: see Is there a good natural language processing library
I believe AlchemyAPI is your best option. Still, it will require a lot of work on your side to do exactly what you need, and, as most commentators have already told you, you most probably won't get 100% quality results.
I am currently working on a project where I need to create possible answers to WH questions based on a given sentence.
Example sentence is:
Anna went to the store.
I need to create possible answers to the following questions.
Who went to the store?
Where did Anna go?
I have already implemented POS tagging of the words, so I now know the part of speech of each word.
I am having difficulty determining what type of noun a word is (if it is a noun) so that I can create a template for my answer.
Example is:
Anna - noun - person: possible answer to question who.
Store - noun - place: possible answer to question where.
I want to implement this using Java.
You should not try to infer possible questions and their answers from the nouns present in a sentence. Instead, you should infer the type of questions you can ask from the activity described by the verb. In your example, the system should infer that the activity of going requires a subject who goes (possible answer to the "who" question), a place where the subject goes (possible answer to the "where to" question), as well as the time when the subject went there (possible answer to the "when" question) and potentially more (from where? with whom? by what means? which way? etc). Then it should check which answers are provided in the sentence. In your example, answers to "who" and "where" are provided, but "when" is not. In other words, you should have a mapping from verbs to questions that make sense for each verb.
Then for each question applicable to the verb you should store prepositions that are used to indicate that answer in a sentence. For example, the answer to the "where to" question is often preceded by "to" and an answer to the "when" question is often preceded by "at" or "on". Note that the subject (here the answer to the "who" question) needs special treatment. Also, some verbs can be immediately followed by an object without a preposition, and your verb dataset should indicate the question to which such objects constitute an answer. For example the verb "to enter" can be followed by an object which answers the "where/what" question, as in "Anna entered the room". Also, some nouns are exceptions and are never preceded by a preposition. For example "home" as in "Anna went home". You need to treat these specially as well. Another thing to watch out for is idiomatic expressions like "Anna went to great lengths". Again, special treatment is required.
Generally, nouns in English do not have enough structure to allow you to determine what type of thing (e.g. a place, an object, a person, a concept etc) they denote. You would need to have a large dataset that breaks up all the words known to the system into different categories. If you do use a noun set like this, it should have a supporting role to increase system's accuracy.
Relying on verbs and prepositions is far more flexible as it allows the system to handle unknown expressions. For example, someone might say "Anna went to Bob", but "Bob" is not a place. A system that infers the role of each element from the verb and preposition still handles this situation and properly treats "Bob" as the right answer to the "where to" question.
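A tiny sketch of the kind of verb-to-question mapping suggested above (the data and names are purely illustrative, not a real lexicon):

import java.util.List;
import java.util.Map;

public class VerbQuestionMap {
    // Illustrative only: which WH-questions make sense for a verb,
    // and which prepositions typically introduce each answer.
    static final Map<String, List<String>> QUESTIONS_FOR_VERB = Map.of(
            "go",    List.of("who", "where to", "when", "with whom"),
            "enter", List.of("who", "where/what", "when"),
            "play",  List.of("who", "what", "where", "when"));

    static final Map<String, List<String>> PREPOSITIONS_FOR_QUESTION = Map.of(
            "where to",  List.of("to"),
            "when",      List.of("at", "on", "in"),
            "with whom", List.of("with"));

    public static void main(String[] args) {
        // "Anna went to the store." -> the verb lemma is "go".
        System.out.println(QUESTIONS_FOR_VERB.get("go"));              // questions worth asking
        System.out.println(PREPOSITIONS_FOR_QUESTION.get("where to")); // "to the store" answers "where to"
    }
}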
I will focus only on your problem of determining the noun types rather than on the approach for answering factoid questions.
SuperSense Tagger is one resource/tool that provides more specific type/sense information on top of a POS tag. You can check the token categories, which include noun:person and noun:location. There are wrappers for Java (you can just search for them).
Are there any Java APIs that provide the plural form of English words (e.g. cacti for cactus)?
Check Evo Inflector, which implements an English pluralization algorithm based on Damian Conway's paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports a 100% success rate for the 1000 most-used English words and a 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that due to the open nature of Wiktionary some data there might be incorrect.
Example Usage:
English.plural("Facility", 1)); // == "Facility"
English.plural("Facility", 2)); // == "Facilities"
jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"
I know there is a simple pluralize() function in Ruby on Rails; maybe you could get at that through JRuby. The problem really isn't easy: I saw pages of rules on how to pluralize, and it wasn't even complete. Some rules are not algorithmic - they depend on stem origin etc., which isn't easily obtained. So you have to decide how perfect you want to be.
Considering Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "randall hauch".
It's hard to find this kind of API; rather, you need to find a web service that can serve your purpose. Check this. I am not sure if this can help you (I tried the word cacti and got cactus somewhere in the response).
If you can harness JavaScript, I created a lightweight (7.19 KB) JavaScript library for this. Or you could port my script over to Java. Very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: It looks like cacti to cactus is a super special conversion (most people are going to say '1 cactus' anyway). It is easy to add that if you want to. The source code is easy to read / update.
Wolfram|Alpha returns a list of inflected forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/
I'm doing a project. I need an open-source tool or technique to find the semantic similarity of two sentences, where I give two sentences as input and receive a score (i.e., the semantic similarity) as output. Any help?
Salma, I'm afraid this is not the right forum for your question as it's not directly related to programming. I recommend that you ask your question again on corpora list. You also may want to search their archives first.
Apart from that, your question is not precise enough, and I'll explain what I mean by that. I assume that your project is about computing the semantic similarity between sentences and not about something else to which semantic similarity is just one thing among many. If this is the case, then there are a few things to consider: First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. There are numerous different views and definitions of it, all depending on the type of problem to be solved, the tools and techniques which are at hand, and the background of the one approaching this task, etc. Consider these examples:
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station.
Pete and Rob both like programming a lot.
Patricia found a dog near the station.
It was a dog who found Pete and Rob under the snow.
Which of the sentences 2-5 are similar to 1? 2 is the exact opposite of 1, yet it is about Pete and Rob (not) finding a dog. 3 is about Pete and Rob, but in a completely different context. 4 is about finding a dog near the station, although the finder is someone else. 5 is about Pete, Rob, a dog, and a 'finding' event, but in a different way than in 1. As for me, I would not be able to rank these examples according to their similarity even without having to write a computer program.
In order to compute semantic similarity you need to first decide what you want to be treated as 'semantically similar' and what not. In order to compute semantic similarity on the sentence level, you would ideally compare some kind of meaning representation of the sentences. Meaning representations normally come as logic formulas and are extremely complex to generate. However, there are tools which attempt to do this, e.g. Boxer.
As a simplistic but often practical approach, you would define semantic similarity as the sum of the similarities between the words in one sentence and the other. This makes the problem a lot easier, although there are still some difficult issues to be addressed, since the semantic similarity of words is just as badly defined as that of sentences. If you want to get an impression of this, take a look at the book 'Lexical Semantics' by D.A. Cruse (1986). However, there are quite a number of tools and techniques to compute the semantic similarity between words. Some of them define it basically as the negative distance between two words in a taxonomy like WordNet or the Wikipedia taxonomy (see this paper which describes an API for this). Others compute semantic similarity by using statistical measures calculated over large text corpora. They are based on the insight that similar words occur in similar contexts. A third approach to calculating semantic similarity between sentences or words is concerned with vector space models, which you may know from information retrieval. To get an overview of these latter techniques, take a look at chapter 8.5 of the book Foundations of Statistical Natural Language Processing by Manning and Schütze.
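As one concrete illustration of that word-by-word approach, here is a rough sketch built on WS4J's Lin measure; the aggregation scheme, the naive whitespace tokenization, and the class names are arbitrary choices for the example, not a standard recipe:

import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;

public class BagOfWordsSimilarity {
    private static final ILexicalDatabase DB = new NictWordNet();
    private static final RelatednessCalculator LIN = new Lin(DB);

    // For each word in one sentence, take its best match in the other; average both directions.
    static double similarity(String s1, String s2) {
        return (directedScore(s1.split("\\s+"), s2.split("\\s+"))
              + directedScore(s2.split("\\s+"), s1.split("\\s+"))) / 2.0;
    }

    private static double directedScore(String[] from, String[] to) {
        double total = 0;
        for (String w1 : from) {
            double best = 0;
            for (String w2 : to) {
                double s = LIN.calcRelatednessOfWords(w1, w2);
                if (s > 1.0) s = 1.0;   // WS4J returns a huge sentinel for identical senses; cap it
                if (s > best) best = s;
            }
            total += best;
        }
        return from.length == 0 ? 0 : total / from.length;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Patricia found a dog near the station",
                                      "Pete and Rob have found a dog near the station"));
    }
}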
Hope this gets you off on your feet for now.
I have developed a simple open-source tool that does the semantic comparison according to categories:
https://sourceforge.net/projects/semantics/files/
It works with sentences of any length, is simple, stable, fast, small in size...
Here is a sample output:
Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
is: 1.0000000000
Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
is: 0.7363210405107239
Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
is: 0.0
Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
is: 0.29160592175990213
USAGE:
import semantics.Compare;

public class USAGE {
    public static void main(String[] args) {
        String a = "This is a first sentence.";
        String b = "This is a second one.";
        Compare c = new Compare(a, b);
        System.out.println("Similarity between the sentences\n-" + a + "\n-" + b + "\n is: " + c.getResult());
    }
}
You can try the UMBC Semantic Similarity Service, which is based on a WordNet knowledge base.
There is a UMBC STS (Semantic Textual Similarity) Service. Here is the link: http://swoogle.umbc.edu/StsService/sts.html
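A small sketch of calling it over HTTP; note that the GetStsSim endpoint and parameter names are an assumption on my part, so double-check them on the service page above before relying on this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class UmbcStsClient {
    public static void main(String[] args) throws Exception {
        String s1 = URLEncoder.encode("Anna went to the store.", "UTF-8");
        String s2 = URLEncoder.encode("Anna walked to the shop.", "UTF-8");
        // NOTE: endpoint and parameter names are assumed; verify them at
        // http://swoogle.umbc.edu/StsService/sts.html before using this.
        URL url = new URL("http://swoogle.umbc.edu/StsService/GetStsSim"
                + "?operation=api&phrase1=" + s1 + "&phrase2=" + s2);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            System.out.println(in.readLine());   // the service replies with the similarity score as plain text
        }
    }
}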