I'm comparing some senses with RiTaWordNet and using SenseRelate::AllWords to disambiguate them, but I'm in trouble: I can't figure out how to compare the output from RiTaWordNet with the AllWords script.
RiTa gives me a sense id, name, description, and pos/bestPos (adjective, verb, noun, etc.), but no sense number (#1, #2, #3...). The output I get looks like this:
"user","n", "someone who use something..".
SenseRelate::AllWords can't retrieve the description, but its script (wsd.pl) gives me
name#pos#sensenumber ("user#n#1").
Which was what I was hoping for, actually, but then I realized that RiTa doesn't support sense numbers (strangely).
So now I'm a little stuck on how to compare them, how to determine if they are the same sense. Any ideas on how to solve this?
I realized that WordNet returns the senses for a word in a sorted list within each part of speech (nouns, verbs, etc.).
Example: "story"
Noun: narrative, narration, story, tale #1
Noun: story #2
Noun: floor, level, storey, story #3
So now I can simply add sense numbers 1, 2 and 3 to the first three nouns (and the rest will continue, of course). Then I can compare this with user#n#1 from the word sense disambiguation. The creators of RiTaWordNet responded that they would like to implement sense numbers in their API, but they don't have an upcoming release for a new version, so this may take a while.
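Until RiTa exposes sense numbers, the mapping can be sketched in code. `toSenseKey` is a hypothetical helper, the gloss list is hardcoded for illustration, and the sketch assumes RiTa returns senses in WordNet's order, which is worth verifying against your WordNet version:

```java
import java.util.Arrays;
import java.util.List;

public class SenseMapper {
    // Build a wsd.pl-style key (word#pos#n) from RiTa-style output, assuming
    // the gloss list is in WordNet's 1-based sense order.
    static String toSenseKey(String word, String pos, List<String> orderedGlosses, String gloss) {
        int idx = orderedGlosses.indexOf(gloss);
        if (idx < 0) throw new IllegalArgumentException("gloss not found: " + gloss);
        return word + "#" + pos + "#" + (idx + 1);
    }

    public static void main(String[] args) {
        // Hardcoded noun senses of "story", in WordNet order
        List<String> storyNounGlosses = Arrays.asList(
            "narrative, narration, story, tale",
            "story",
            "floor, level, storey, story");
        // Comparable with story#n#3 from the disambiguator
        System.out.println(toSenseKey("story", "n", storyNounGlosses, "floor, level, storey, story"));
    }
}
```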
Related
I have a program that is randomly generating sentences based on a bunch of text documents of all the nouns, verbs, adjectives, and adverbs. Does anyone know a way to determine if a noun/verb are plural or singular, or if there any text documents that contain a list of singular nouns/verbs and plural nouns? I'm doing this all in Java, and I have a decent idea of how to get information off of a website, so if there are any websites that could do that as well, I'd also appreciate those.
I am afraid you cannot solve this with a fixed list of words, especially for verbs. Consider these sentences:
You are free. We are free.
In the first one are is singular, in the second it is plural. Using a proper tagger, as @jdaz suggests, is the only reliable way to do it.
If you work with English or a few other supported languages, StanfordNLP is an excellent choice. If you need broad language coverage, you can use UDPipe, which is written in C++ but has a Java binding.
The first step would be to look it up in a list. For English you can reduce the size of the list by only including singular nouns, and then apply some basic string processing to find plurals: if your word ends in -s and is not in the list, cut off the -s and look again. If it now is in the list, it was a simple plural (car/cars). If not, continue. If it ends in -ies, remove that, append -y and look again. Now you will capture remedies/remedy. There are a number of such patterns you can use.
Some irregular nouns need to be in an exception list (ox/oxen), but there aren't that many. Some words are of course unspecified, like sheep, data, or police. Here you need to look at the context: if the noun is followed by a singular verb (e.g. eats, or is), then it is singular as well.
With (English) verbs you can generally only identify the third person singular (with a procedure similar to the one used for nouns; you'd need a list of exceptions for verbs ending in -s, such as kiss). Forms of to be are more helpful, but the second person singular is an issue (are). However, unless you have direct speech in your texts, it will not occur very frequently.
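The suffix-stripping heuristic described above might be sketched like this; the word sets are tiny stand-ins for a real singular-noun list and exception list:

```java
import java.util.Set;

public class PluralHeuristic {
    // Tiny stand-ins for a real singular-noun list and an irregular-plural list
    static final Set<String> SINGULAR = Set.of("car", "remedy", "kiss", "sheep");
    static final Set<String> IRREGULAR_PLURALS = Set.of("oxen", "geese", "mice");

    static boolean isPluralNoun(String w) {
        if (SINGULAR.contains(w)) return false;          // known singular, incl. "kiss"
        if (IRREGULAR_PLURALS.contains(w)) return true;  // exception list (ox/oxen)
        if (w.endsWith("ies")                            // remedies -> remedy
                && SINGULAR.contains(w.substring(0, w.length() - 3) + "y")) return true;
        if (w.endsWith("s")                              // cars -> car
                && SINGULAR.contains(w.substring(0, w.length() - 1))) return true;
        return false;                                    // unknown: needs context
    }

    public static void main(String[] args) {
        System.out.println(isPluralNoun("cars"));     // true
        System.out.println(isPluralNoun("remedies")); // true
        System.out.println(isPluralNoun("kiss"));     // false
    }
}
```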
Part of speech taggers can also only make these decisions on context, so I don't think they will be much of a help here. It's likely to be overkill. A couple of word lists and simple heuristic rules will probably give you equal or better accuracy using far fewer resources. This is the way these things were done before large amounts of annotated data were available.
In the end it depends on your circumstances. It might be quicker to simply use an existing tagger, but for this limited problem you might get better accuracy and speed with the rule-based approach, (or even a combined one for accuracy).
What is the best tool that can do text simplification using Java?
Here is an example of text simplification:
John, who was the CEO of a company, played golf.
↓
John played golf. John was the CEO of a company.
I see your problem as the task of converting a complex or compound sentence into simple sentences.
According to the literature on sentence types, a simple sentence is built from one independent clause, while compound and complex sentences are built from at least two clauses. A clause must have a subject and a verb.
So your task is to split the sentence into the clauses that form it.
Dependency parsing with Stanford CoreNLP is a perfect tool for splitting compound and complex sentences into simple ones. You can try the demo online.
From your sample sentence, we will get parse result in Stanford typed dependency (SD) notation as shown below:
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
A clause can be identified from a relation (in SD) whose category is subject, e.g. nsubj or nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses, i.e.
John CEO
John played
After you get basic clause, you can add another part to make your clause a complete and meaningful sentence. To do so, please consult Stanford Dependency Manual.
By the way, your question might be related with Finding meaningful sub-sentences from a sentence
Answer to 3rd comment:
Once you have a pair of subject and verb, i.e. nsubj(CEO-6, John-1), get all dependencies that link to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.
Based on the example nsubj(CEO-6, John-1), if you start traversing from John-1 you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.
Next step is traversing from CEO-6 part. You'll get
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
From result above, you got new dependencies to traverse (i.e. find another dependencies that have was-4, the-5, company-9 in either head or dependent).
Now your dependencies are
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)
In this step, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from all heads and dependents, then arrange the words in ascending order based on the number appended to them. This number indicates the word order in the original sentence.
John was the CEO a company
Our new sentence is missing one part, i.e. of. This part is hidden in prep_of(CEO-6, company-9). If you read the Stanford Dependency Manual, you'll see there are two kinds of SD, collapsed and non-collapsed. Please read about them to understand why this of is hidden and how to recover the word order of this hidden part.
With the same approach, you'll get the second sentence
John played golf
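The traversal described above can be sketched as self-contained code that walks the hardcoded SD output (rather than calling CoreNLP itself). Placing the hidden of right after its head word is a simplification of the collapsed-SD recovery step:

```java
import java.util.*;
import java.util.regex.*;

public class ClauseFromDeps {
    record Dep(String rel, String head, int headIdx, String dep, int depIdx) {}

    static final Pattern DEP = Pattern.compile("(\\w+)\\((\\w+)-(\\d+), (\\w+)-(\\d+)\\)");

    static Dep parse(String line) {
        Matcher m = DEP.matcher(line);
        if (!m.matches()) throw new IllegalArgumentException(line);
        return new Dep(m.group(1), m.group(2), Integer.parseInt(m.group(3)),
                       m.group(4), Integer.parseInt(m.group(5)));
    }

    // Walk outward from one nsubj pair, skipping other subject relations and
    // root, collecting every word reachable through the remaining dependencies.
    static String clause(List<Dep> deps, Dep subj) {
        TreeMap<Double, String> words = new TreeMap<>(); // word index -> word
        words.put((double) subj.depIdx(), subj.dep());
        Deque<Integer> frontier = new ArrayDeque<>(List.of(subj.headIdx()));
        Set<Integer> visited = new HashSet<>(List.of(subj.depIdx()));
        while (!frontier.isEmpty()) {
            int idx = frontier.pop();
            if (!visited.add(idx)) continue;
            for (Dep d : deps) {
                if (d.rel().startsWith("nsubj") || d.rel().equals("root")) continue;
                if (d.headIdx() == idx || d.depIdx() == idx) {
                    words.put((double) d.headIdx(), d.head());
                    words.put((double) d.depIdx(), d.dep());
                    if (d.rel().startsWith("prep_"))        // collapsed SD hides "of"
                        words.put(d.headIdx() + 0.5, d.rel().substring(5));
                    frontier.push(d.headIdx() == idx ? d.depIdx() : d.headIdx());
                }
            }
        }
        return String.join(" ", words.values());
    }

    static List<String> clauses(String[] sdLines) {
        List<Dep> deps = Arrays.stream(sdLines).map(ClauseFromDeps::parse).toList();
        List<String> out = new ArrayList<>();
        for (Dep d : deps)
            if (d.rel().startsWith("nsubj")) out.add(clause(deps, d));
        return out;
    }

    public static void main(String[] args) {
        String[] sd = { "nsubj(CEO-6, John-1)", "nsubj(played-11, John-1)",
            "cop(CEO-6, was-4)", "det(CEO-6, the-5)", "rcmod(John-1, CEO-6)",
            "det(company-9, a-8)", "prep_of(CEO-6, company-9)",
            "root(ROOT-0, played-11)", "dobj(played-11, golf-12)" };
        clauses(sd).forEach(System.out::println);
    }
}
```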
I think one can design a very simple algorithm for the basic cases of this situation, though real-world cases may be so numerous that such an approach will become unruly :)
Still, I thought I should think aloud and write up my approach, and maybe add some Python code later. My basic idea is to derive a solution from first principles, mostly by explicitly exposing our model of what is really happening, and not to rely on other theories, models, or libraries BEFORE we do one by HAND and from SCRATCH.
Goal: given a sentence, extract subsentences from it.
Example: John, who was the ceo of the company, played Golf.
Expected output: John was the CEO of the company. John played Golf.
Here is my model of what is happening here written out in the form of model assumptions:
(axioms?)
MA1. Simple sentences can be expanded by inserting subsentences.
MA2. A subsentence is a qualification/modification(additional information) on one or more of the entities.
MA3. To insert a subsentence, we put a comma right next to the entity we want to expand on (provide more information on) and attach the subsentence, I am going to call it an extension - and place another comma when the extension ends.
Given this model, the algorithm can be straightforward at least to address the simple cases first.
DETECT: Given a sentence, detect if it has an extension clause, by looking for a pair of commas in the sentence.
EXTRACT: If you find two commas, generate two sentences:
2.1 EXTRACT-BASE: base sentence:
delete everything between the two commas; you get the base sentence.
2.2 EXTRACT-EXTENSION: extension sentence:
take everything inside the extension sentence, replace 'who' with the word right before it.
That is your second sentence.
PRINT: In fact you should print the extension sentence first, because the base sentence depends on it.
Well, that is our algorithm. Yes it sounds like a hack. It is. But something I am learning now, is that, if you use a trick in one program it is a hack, if it can handle more stuff, it is a technique.
So let us expand and complicate the situation a bit.
Compounding cases:
Example 2. John, who was the CEO of the company, played Golf with Ram, the CFO.
As I am writing it, I noticed that I had omitted the 'who was' phrase for the CFO!
That brings us to the complicating case that our algorithm will fail. Before going there,
let me create a simpler version of 2 that WILL work.
Example 3. John, who was the CEO of the company, played Golf with Ram, who was the CFO.
Example 4. John, the CEO of the company, played Golf with Ram, the CFO.
Wait we are not done yet!
Example 5. John, who is the CEO and Ram, who was the CFO at that time, played Golf, which is an engaging game.
To allow for this I need to extend my model assumptions:
MA4. More than one entity may be expanded this way, but it should not cause confusion because each extension clause occurs right next to the entity it describes. (accounts for example 3)
MA5. The 'who was' phrase may be omitted since it can be inferred by the listener. (accounts for example 4)
MA6. Some entities are persons, they will be extended using a 'who' and some entities are things, extended using a 'which'. Either of these extension heads may be omitted.
Now how do we handle these complications in our algorithm?
Try this:
SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS:
If the sentence contains a comma, look for the following comma and extract whatever is in between into an extension sentence. Continue until no opening or closing commas are left.
At this point you should have a list with the base sentence and one or more extension sentences.
PROCESS_EXTENSIONS:
For each extension, if it starts with 'who is' or 'which is', replace that with the noun that appeared right before the extension.
If the extension does not start with 'who is' or 'which is', prepend that preceding noun and an 'is'.
PRINT: all extension sentences first and then the base sentences.
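As a rough sketch of the DETECT/EXTRACT/PRINT steps for the single-extension case (in Java rather than the promised Python; `split` is a hypothetical helper):

```java
public class ClauseSplitter {
    // Minimal sketch: handles one ", ... ," extension per sentence (MA1-MA5).
    static String[] split(String sentence) {
        int first = sentence.indexOf(',');
        int second = sentence.indexOf(',', first + 1);
        if (first < 0 || second < 0) return new String[] { sentence };
        // EXTRACT-BASE: delete everything between the two commas
        String base = (sentence.substring(0, first) + sentence.substring(second + 1)).trim();
        String inner = sentence.substring(first + 1, second).trim();
        // The entity being expanded is the word right before the first comma
        String head = sentence.substring(0, first).trim();
        head = head.substring(head.lastIndexOf(' ') + 1);
        // EXTRACT-EXTENSION: replace "who"/"which" with that entity,
        // or (MA5) prepend the entity and "is" when the head word was omitted
        String extension;
        if (inner.startsWith("who ") || inner.startsWith("which ")) {
            extension = head + inner.substring(inner.indexOf(' '));
        } else {
            extension = head + " is " + inner;
        }
        // PRINT order: extension sentence first, then the base sentence
        return new String[] { extension + ".", base };
    }

    public static void main(String[] args) {
        for (String s : split("John, who was the CEO of the company, played Golf."))
            System.out.println(s);
    }
}
```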
Not scary.
When I get some time in the next few days, I will add a python implementation.
Thank you
Ravi Annaswamy
You are unlikely to solve this problem using any known algorithm in the general case - this is getting into strong AI territory. Even humans can't parse grammar very well!
Note that the problem is quite ambiguous regarding how far you simplify and what assumptions you are willing to make. You could take your example further and say:
John is assumed to be the name of a being. The race of John is unknown. John played
golf at some point in the past. Golf is assumed to refer to the ball
game called golf, but the variant of golf that John played is unknown.
At some point in the past John was the CEO of a company. CEO is assumed to
mean "Chief Executive Officer" in the context of a company but this is
not specified. The company is unknown.
In case the lesson is not obvious: the more you try to determine the exact meaning of words, the more cans of worms you open up... it takes human-like levels of judgement and interpretation to know when to stop.
You may be able to solve some simpler cases using various Java-based NLP tools: see Is there a good natural language processing library
I believe AlchemyApi is your best option. Still, it will require a lot of work on your side to do exactly what you need, and as most commentators have already told you, you most probably won't get 100% quality results.
I have string containing a list of name like below:
"John asked Kim, Kelly, Lee and Bob about the new year plans". The number of names in the list can very.
How can I localize this in Java?
I am thinking about ResourceBundle and MessageFormat. How will I write the pattern for this in MessageFormat?
Is there any better approach?
Localizing an (inline) list is more than just translating the word “and.” CLDR deals with the issue of formatting lists; check out their page on lists. I'm afraid ICU doesn't have support for this yet, so you might need to code it separately.
Another issue is that you cannot expect to be able to use names as such in sentences like this. Many languages require the object to be in an inflected form, for example. In Finnish, your sample sentence would read “John kysyi Kimiltä, Kellyltä, Leeltä ja Bobilta uudenvuoden suunnitelmista.” So you may need to find out and include different inflected forms of the names. Moreover, if the language used does not have a Latin alphabet, you may need transliterated forms of the names (e.g., in Arabic, John is جون). There are other problems as well. In Russian, the verb corresponding to “asked” depends on the gender of the subject (e.g., спросила vs. спросил).
I know this sounds complex, but localization is often complex. If you target a limited set of languages only, things can be much easier, so it is important to define your goals—perhaps accepting some simplifications that may result in grammatically incorrect expressions. But for localization that is to cover a wide range of languages, you may need to make the generating function itself localized. That is, you would have, for each language, a function that accepts a list of names as arguments and returns a string representing the statement, possibly using resource files containing information (transliterated forms, different inflected forms, gender) about proper names that may appear.
In some situations, you might even consider generating the sentence in English, then sending it to an online translator. For example, Google Translator can deal with some of the issues that I mentioned. It surely produces wrong translations a lot, but for sentences with grammatically very simple structure, it might be a pragmatic solution, if you can accept some amount of errors. If you consider trying this, make sure you test sufficiently how the automatic translator can handle the specific sentences you will use. Quite often you can improve the results by reformulating the sentences. Dividing a sentence with several clauses into separate sentences often helps. But even your simple sentence causes problems in automatic translation.
You might avoid some complications if you can reformulate the sentence structure, e.g. so that all the nouns appear in the subject position and you avoid “packed” expressions like “new year plans.” For example, “John asked what plans Kim, Kelly, Lee, and Bob have for the new year” would be simpler, both for automatic translation and for pattern-based localization.
You could do something like:
"{0} asked {1} about the new year plans"
where 0 is the first name and 1 is a comma-separated list of the other names.
Hope this helps.
I see an answer was already accepted, I'm just adding this here as an alternative. The code has hard coded values for the data, but is only meant to present an idea that can be refined:
MessageFormat people = new MessageFormat("{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans");
String john = "John";
Object[][] parties = new Object[][] { {john, 0}, {john, 1, "Kim"}, {john, 2, "Kim", "Kelly"}, {john, 4, "Kim, Kelly, Lee", "Bob"} };
for (final Object[] strings : parties) {
    System.out.println(people.format(strings));
}
This outputs the following:
John asked no one about the new year plans
John asked Kim about the new year plans
John asked Kim and Kelly about the new year plans
John asked Kim, Kelly, Lee, and Bob about the new year plans
Determining the number of names that is used for the 2nd argument and creating the comma-delimited string for the 3rd argument isn't displayed in that sample, but can easily be done instead of using the hard coded values I used.
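A hypothetical helper that builds those arguments from an arbitrary name list might look like this:

```java
import java.text.MessageFormat;
import java.util.List;

public class PeopleList {
    // Hypothetical helper: derive the choice count, the comma-joined head of
    // the list, and the final name from an arbitrary list of names.
    static String format(String asker, List<String> names) {
        MessageFormat people = new MessageFormat(
            "{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans");
        int n = names.size();
        String head = n > 1 ? String.join(", ", names.subList(0, n - 1))
                            : (n == 1 ? names.get(0) : "");
        String last = n > 1 ? names.get(n - 1) : "";
        return people.format(new Object[] { asker, n, head, last });
    }

    public static void main(String[] args) {
        System.out.println(format("John", List.of("Kim", "Kelly", "Lee", "Bob")));
        System.out.println(format("John", List.of()));
    }
}
```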
For localization, the normal approach is to use external language packs: a file containing the text you're going to display. Assign each text a name/key, then load the text in the program by its key.
You could combine your ResourceBundle (for I18N) with a MessageFormat (to replace placeholders with the names) : "{0} asked {1} about the new year plans"
It would be up to you to prepare the names beforehand, though.
I think I'm facing a paradox here.
What I'm trying to do is, when I receive/make a call, I have the number, so I need to know whether it's an international number, a local number, etc.
The problem is:
For me to know if a number is international, I need to parse it and check its length, but the length differs from country to country. So should I write a method that parses and recognizes numbers for each country? (Impractical, in my opinion.)
For me to know if it's a local number, I need the area code, so I have to do the same thing: parse the number, check the length, and get the first digits based on the area code length.
It's kind of hard to find a solution for this. The library libphonenumber offers a lot of useful classes, but the one that I thought could help me led me to another paradox.
The method phoneUtil.parse(number, countryAcronym) returns the number with its country code, but if I pass the number with the acronym "US" it returns the number with country code '1'; if I change the acronym to "BR" it returns '55', which is the country code for Brazil. So, either way, I need the country acronym based on the number I get.
EX:
numberReturned = phoneUtil.parse(phoneNumber, "US");
phoneUtil.format(numberReturned, PhoneNumberFormat.INTERNATIONAL);
The above code returns the number with the US country code, but if I change "US" to any other country acronym it will return the same number with that country's code.
I know that this lib is not supposed to guess which country the number is from (THAT WOULD BE AWESOME!!), but that's what I need.
This is really driving me crazy. I need good advice from the wise mages of SO.
If you could help me with a good decision, I'd be so thankful.
Thanks.
PS: If you already use libphonenumber and have more experience with it, please guide me on which class to use, if there is one capable of solving this problem. =)
1) The second parameter to phoneUtil.parse must match the country you're currently in - it's used if the phone number received does not include an international prefix. That's why you get different results when you change the parameter: the phone number you pass it does not contain such a prefix, so it just uses what you've told it.
Any parsing solution set to determine if the phone number is international or not will need to be aware of this: depending on the source, even a national number may be represented with the international dialing prefix (usually abstracted as +, since it differs between countries, but this is not guaranteed).
2) For area code parsing, there is no universal standard; some countries don't use them, and even within a country, area codes may have differing lengths (e.g. Germany). I'm not aware of an international library for this - and a quick search doesn't find anything (though that doesn't mean one does not exist). You might need to roll your own here; if you only need to support a single country, this shouldn't be too hard.
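To make the first point concrete, here is a toy sketch (not libphonenumber; the prefix table is a tiny made-up sample) showing that a number can only be attributed to a country from the digits alone when it carries the international prefix:

```java
public class NumberOrigin {
    // Toy sketch, not libphonenumber: without the "+" international prefix,
    // the best you can do is fall back to the caller's own region.
    static final String[] KNOWN_CODES = { "1", "55", "44", "49" }; // tiny sample table

    static String countryCodeOf(String raw, String defaultCode) {
        String digits = raw.replaceAll("[^+0-9]", ""); // strip spaces, dashes, parens
        if (digits.startsWith("+")) {
            for (String cc : KNOWN_CODES)
                if (digits.startsWith("+" + cc)) return cc;
            return "unknown";
        }
        return defaultCode; // no prefix: assume the local region
    }

    public static void main(String[] args) {
        System.out.println(countryCodeOf("+55 11 91234-5678", "1")); // 55 (Brazil)
        System.out.println(countryCodeOf("212-555-0100", "1"));      // 1 (assumed local)
    }
}
```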
Suppose I've got a sentence that I'm forming dynamically based on data passed in from someone else. Let's say it's a list of food items, like: ["apples", "pears", "oranges"] or ["bread", "meat"]. I want to take these lists and form sentences: "I like to eat apples, pears and oranges" and "I like to eat bread and meat."
It's easy to do this for just English, but I'm fairly sure that conjunctions don't work the same way in all languages (in fact I bet some of them are wildly different). How do you localize a sentence with a dynamic number of items, joined in some manner?
If it helps, I'm working on this for Android, so you will be able to use any libraries that it provides (or one for Java).
I would ask native speakers of the languages you want to target. If all the languages use the same construct (a comma between all the elements except the last one), then simply build a comma-separated list of the first n - 1 elements and use an externalized pattern with java.util.MessageFormat to build the whole sentence, passing the joined n - 1 elements as the first argument and the nth element as the second. You might also externalize the separator (a comma in English) to make it more flexible if needed.
If some languages use another construct, then define several Strategy implementations, externalize the name of the strategy in order to know which strategy to use for a given locale, and then ask the appropriate strategy to format the list for you. The strategy for English would use the algorithm described above. The strategy for another language would use a different way of joining the elements together, but it could be reused for several languages using externalized patterns if those languages use the same construct but with different words.
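The strategy idea might be sketched like this; the interface and class names are hypothetical, and only an English implementation is shown:

```java
import java.util.List;

// Hypothetical per-locale strategy interface; other locales would provide
// their own join logic, looked up by an externalized strategy name.
interface ListJoinStrategy {
    String join(List<String> items);
}

class EnglishJoinStrategy implements ListJoinStrategy {
    public String join(List<String> items) {
        int n = items.size();
        if (n == 0) return "";
        if (n == 1) return items.get(0);
        // comma between all elements except the last, which gets "and"
        return String.join(", ", items.subList(0, n - 1)) + " and " + items.get(n - 1);
    }
}

public class JoinDemo {
    public static void main(String[] args) {
        ListJoinStrategy s = new EnglishJoinStrategy(); // would be chosen per locale
        System.out.println("I like to eat " + s.join(List.of("apples", "pears", "oranges")));
    }
}
```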
Note that it's because this is so difficult that most programs simply fall back to a list: "List of what I like: bread, meat."
The problem, of course, is knowing the other languages. I don't know how you'd do it accurately without finding native speakers for your target languages. Google Translate is an option, but you should still have someone who speaks the language proofread it.
Android has this built-in functionality in the Plurals string resources.
Their documentation provides examples in English (res/values/strings.xml) and also in another language (placed in res/values-pl/strings.xml).
Valid quantities: {zero, one, two, few, many, other}
Their example:
<?xml version="1.0" encoding="utf-8"?>
<resources>
    <plurals name="numberOfSongsAvailable">
        <item quantity="one">One song found.</item>
        <item quantity="other">%d songs found.</item>
    </plurals>
</resources>
So it seems like their approach is to use longer sentence fragments than just the individual words and plurals.
Very difficult. Take French, for example: it uses articles in front of each noun. Your example would be:
J'aime manger des pommes, des poires et des oranges.
Or better:
J'aime les pommes, les poires et les oranges.
(Literally: I like the apples, the pears and the oranges).
So it's no longer only a problem of conjunctions; it's a matter of different grammatical rules that go beyond conjunctions. Needless to say, other languages may raise totally different issues.
This is unfortunately nearly impossible to do in general. That is because of conjugation on one side and gender-dependent forms on the other. English doesn't make that distinction, but many (most?) other languages do. To make matters worse, the verb form might depend on the following noun (both the gender-dependent form and the conjugation).
Instead of such nice sentences, we tend to use lists of things, for example:
The list of things I would like to eat:
apples
pears
oranges
I am afraid that this is the only reasonable thing to do (variable, modifiable list).