Suppose I've got a sentence that I'm forming dynamically based on data passed in from someone else. Let's say it's a list of food items, like ["apples", "pears", "oranges"] or ["bread", "meat"]. I want to take these lists and form sentences: "I like to eat apples, pears and oranges" and "I like to eat bread and meat."
It's easy to do this for just English, but I'm fairly sure that conjunctions don't work the same way in all languages (in fact I bet some of them are wildly different). How do you localize a sentence with a dynamic number of items, joined in some manner?
If it helps, I'm working on this for Android, so I can use any libraries it provides (or any library for Java).
I would ask native speakers of the languages you want to target. If all the languages use the same construct (a comma between all the elements except for the last one), then simply build a comma-separated list of the first n - 1 elements, and use an externalized pattern with java.util.MessageFormat to build the whole sentence, passing the joined n - 1 elements as the first argument and the nth element as the second argument. You might also externalize the separator (a comma in English) to make it more flexible if needed.
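A minimal sketch of that approach in Java; the pattern and separator literals in the usage comment are illustrative stand-ins for values you would load from per-locale resources:

import java.text.MessageFormat;
import java.util.List;

public class SentenceJoiner {
    // pattern and separator would normally come from a per-locale resource
    // bundle; they are passed in here to keep the sketch self-contained.
    public static String buildSentence(List<String> items, String pattern, String separator) {
        // Join the first n - 1 elements with the externalized separator.
        String head = String.join(separator, items.subList(0, items.size() - 1));
        String last = items.get(items.size() - 1);
        // {0} = the joined n - 1 elements, {1} = the last element.
        return MessageFormat.format(pattern, head, last);
    }
}

For example, buildSentence(List.of("apples", "pears", "oranges"), "I like to eat {0} and {1}", ", ") yields "I like to eat apples, pears and oranges". Lists with zero or one element would need their own externalized patterns.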
If some languages use another construct, then define several Strategy implementations, externalize the name of the strategy in order to know which strategy to use for a given locale, and then ask the appropriate strategy to format the list for you. The strategy for English would use the algorithm described above. The strategy for another language would use a different way of joining the elements together, but it could be reused for several languages using externalized patterns if those languages use the same construct but with different words.
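A sketch of what such strategies might look like (the interface and class names are invented for illustration; the English strategy reuses the comma-join sketch above):

import java.util.List;

// Hypothetical interface; the strategy name to use for each locale
// would be externalized alongside the other resources.
interface ListJoinStrategy {
    String join(List<String> items);
}

// English-style strategy: comma-separate all but the last element,
// then combine head and last element with a localized pattern.
class EnglishJoinStrategy implements ListJoinStrategy {
    public String join(List<String> items) {
        return SentenceJoiner.buildSentence(items, "{0} and {1}", ", ");
    }
}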
Note that it's precisely because this is so difficult that most programs simply produce something like "List of what I like: bread, meat."
The problem, of course, is knowing the other languages. I don't know how you'd do it accurately without finding native speakers for your target languages. Google Translate is an option, but you should still have someone who speaks the language proofread it.
Android has this built-in functionality in the Plurals string resources.
Their documentation provides examples in English (res/values/strings.xml) and also in another language (placed in res/values-pl/strings.xml).
Valid quantities: {zero, one, two, few, many, other}
Their example:
<?xml version="1.0" encoding="utf-8"?>
<resources>
    <plurals name="numberOfSongsAvailable">
        <item quantity="one">One song found.</item>
        <item quantity="other">%d songs found.</item>
    </plurals>
</resources>
So it seems like their approach is to use longer sentence fragments than just the individual words and plurals.
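At runtime the matching variant is picked by quantity, as in the documentation's companion snippet (getNumberOfSongs() is a placeholder for however you obtain the count):

// Inside an Activity or other Context:
int count = getNumberOfSongs();
Resources res = getResources();
String songsFound = res.getQuantityString(R.plurals.numberOfSongsAvailable, count, count);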
Very difficult. Take French, for example: it uses articles in front of each noun. Your example would be:
J'aime manger des pommes, des poires et des oranges.
Or better:
J'aime les pommes, les poires et les oranges.
(Literally: I like the apples, the pears and the oranges).
So it's no longer only a problem of conjunctions. It's a matter of different grammatical rules that go beyond conjunctions. Needless to say, other languages may raise totally different issues.
This is unfortunately nearly impossible to do. That is because of conjugation on one side and gender-dependent forms on the other. English doesn't make that distinction, but many (most?) other languages do. To make matters worse, the verb form might depend on the following noun (both the gender-dependent form and the conjugation).
Instead of such nice sentences, we tend to use lists of things, for example:
The list of things I would like to eat:
apples
pears
oranges
I am afraid that this is the only reasonable thing to do (a variable, modifiable list).
Related
I have a program that is randomly generating sentences based on a bunch of text documents of all the nouns, verbs, adjectives, and adverbs. Does anyone know a way to determine whether a noun/verb is plural or singular, or are there any text documents that contain a list of singular and plural nouns/verbs? I'm doing this all in Java, and I have a decent idea of how to get information off of a website, so if there are any websites that could do that as well, I'd also appreciate those.
I am afraid you cannot solve this by having a fixed list of words, especially verbs. Consider the sentences:
You are free. We are free.
In the first one, "are" is singular; in the second, it is plural. Using a proper tagger, as @jdaz suggests, is the only way to do it reliably.
If you work with English or a few other supported languages, StanfordNLP is an excellent choice. If you need broad language coverage, you can use UDPipe, which is natively C++ but has a Java binding.
The first step would be to look the word up in a list. For English you can reduce the size of the list by only including singular nouns, and then apply some basic string processing to find plurals: if your word ends in -s and is not in the list, cut off the -s and look again. If it is now in the list, it was a simple plural (car/cars). If not, continue: if it ends in -ies, remove that, append -y and look again. Now you will capture remedies/remedy. There are a number of such patterns you can use.
Some irregular nouns need to be in an exception list (ox/oxen), but there aren't that many. Some words of course are unspecified, like sheep, data, or police. Here you need to look at the context: if the noun is followed by a singular verb (e.g. eats, or is), then it would be singular as well.
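A minimal sketch of the lookup-plus-suffix procedure described above; the word sets are toy placeholders for real lists:

import java.util.Set;

class PluralHeuristic {
    // Tiny illustrative lexicon of singular nouns; a real list would be much larger.
    static final Set<String> SINGULARS = Set.of("car", "remedy", "kiss", "sheep");
    // Irregular plurals kept in an explicit exception table.
    static final Set<String> IRREGULAR_PLURALS = Set.of("oxen", "mice", "geese");

    static boolean isPlural(String word) {
        String w = word.toLowerCase();
        if (IRREGULAR_PLURALS.contains(w)) return true;
        if (SINGULARS.contains(w)) return false;   // known singular (also catches "kiss")
        if (w.endsWith("ies") && SINGULARS.contains(w.substring(0, w.length() - 3) + "y"))
            return true;                            // remedies -> remedy
        if (w.endsWith("s") && SINGULARS.contains(w.substring(0, w.length() - 1)))
            return true;                            // cars -> car
        return false;                               // unknown: needs context
    }
}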
With (English) verbs you can generally only identify the third person singular (with a procedure similar to the one used for nouns; you'd need a list of exceptions for verbs ending in -s, such as kiss). Forms of "to be" are more helpful, but the second person singular is an issue (are). However, unless you have direct speech in your texts, it will not occur very frequently.
Part-of-speech taggers can also only make these decisions based on context, so I don't think they will be much help here. They are likely to be overkill. A couple of word lists and simple heuristic rules will probably give you equal or better accuracy using far fewer resources. This is how these things were done before large amounts of annotated data were available.
In the end it depends on your circumstances. It might be quicker to simply use an existing tagger, but for this limited problem you might get better accuracy and speed with the rule-based approach (or even a combined one, for accuracy).
I have several lists of Strings already classified, like:
<string> <tag>
088 9102355 PHONE NUMBER
091 910255 PHONE NUMBER
...
Alfred St STREET
German St STREET
...
RE98754TO IDENTIFIER
AUX9654TO IDENTIFIER
...
service open all day long DESCRIPTION
service open from 8 to 22 DESCRIPTION
...
jhon.smith@email.com EMAIL
jhon.smith@anothermail.com EMAIL
...
www.serviceSite.com URL
...
Florence CITY
...
with a lot of strings per tag, and I have to make a Java program which, given a new list of Strings (all assumed to have the same tag), assigns a probability for each tag to the list. The program has to be completely language independent, and all the knowledge has to come from the lists of tagged strings like the one described above.
I think this problem can be solved with NER approaches (i.e. machine learning algorithms like CRFs), but those are usually for unstructured text like a chapter from a book or a paragraph of a web page, not for lists of independent strings.
I thought of using a CRF (Conditional Random Field) because I found a similar approach used in the Karma data integration tool, as described in this article, paragraph 3.1, where the "semantic types" correspond to my tags.
To tackle the problem I downloaded the Stanford Named Entity Recognizer (NER) and played a bit with its Java API through NERDemo.java, finding two problems:
the training file for the CRFClassifier has to have one word per row, so I haven't found a way to classify groups of words with a single tag;
I don't understand whether I have to make one classifier per tag or a single classifier for all, because a single string could be classified with n different tags and it is the user who chooses between them. So I'm more interested in the probability assigned by the classifiers than in exact class matching. Furthermore, I don't have any "no tag" strings, so I don't know how the classifier behaves without them when assigning probabilities.
Is this the right approach to the problem? Is there a way to use the Stanford NER, or another Java API with CRFs or another suitable machine learning algorithm, to do it?
Update
I managed to train the CRF classifier, first with each word classified independently with its tag and each group of words separated by two commas (classified as "no tag" (0)), then with each group of words as a single word with underscores replacing spaces, but I got very disappointing results in the little test I made. I haven't quite figured out which features to include and which to exclude from the ones described in the NERFeatureFactory javadoc, considering they can't have anything to do with language.
Update 2
The test results are beginning to make sense. I've divided each string (tagging every token) from the others with two newlines, instead of the horrible "two commas labeled with 0", and I've used the Stanford PTBTokenizer instead of the one I made. Moreover, I've tuned the features, turning on the usePrev and useNext features, using suffix/prefix n-grams up to 6 characters in length, and other things.
The training file named training.tsv has this format:
rt05201201010to identifier
1442955884000 identifier
rt100005154602cv identifier
Alfred street
Street street
Robert street
Street street
and these are the flags in the properties file:
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useTags=false
useWordPairs=false
useDisjunctive=true
useSequences=false
usePrevSequences=true
useNextSequences=true
# the next flag can have these values: IO, IOB1, IOB2, IOE1, IOE2, SBIEO
entitySubclassification=IO
printClassifier=HighWeight
cacheNGrams=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
However, I found another problem: I managed to train only 39 labels with 100 strings each, though I have about 150 labels with more than 1000 strings each. Even so, it takes about 5 minutes to train, and if I raise these numbers a bit it throws a Java heap OutOfMemoryError.
Is there a way to scale up to those numbers with a single classifier? Is it better to train 150 (or fewer, maybe one with two or three labels each) little classifiers and combine them later? Do I need to train with 1000+ strings per label, or can I stop at 100 (maybe choosing them to be quite different from one another)?
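For reference, a minimal sketch of driving the training from Java with the flags above (trainFile and serializeTo are standard Stanford NER properties; the file names are placeholders):

import java.io.FileInputStream;
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class TrainListTagger {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new FileInputStream("ner.properties")); // the flags listed above
        props.setProperty("trainFile", "training.tsv");    // one token + tag per line
        props.setProperty("serializeTo", "list-tagger.ser.gz");
        CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
        crf.train();                                        // reads trainFile from the properties
        crf.serializeClassifier(props.getProperty("serializeTo"));
    }
}

If the heap error persists, the usual first step is to give the JVM more memory (e.g. java -mx4g) before resorting to splitting the data across several classifiers.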
The first thing you should be aware of is that (linear chain) CRF taggers are not designed for this purpose. They emerged as a very nice solution for context-based prediction, i.e. when you have words before and after named entities and you look for clues in a limited window (e.g. 2 words before/after the current word). This is why you had to insert double lines: to delimit sentences. They also provide coherence between the tags assigned to words, which is indeed a good thing in your case.
A CRF tagger should work, but with an extra cost in the learning step which could be avoided by using simpler (maximum entropy, SVM) but still accurate machine learning methods. In Java, for your task, wouldn't Weka be a better solution? I would also consider BIO tagging as not relevant in your case.
Whatever software/coding you use, it is not surprising that n-grams at the character level give good improvements, but I believe you could add dedicated features. For instance, since morphological clues are important (presence of an "@", uppercase or digit characters), you could use shape codes (see ref [1]), which are a very convenient way to describe strings. You will also most probably obtain better results by using lists of names (a lexicon) that can be triggered as additional features.
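For illustration, a shape code in that spirit might look like the following; the character classes chosen here are a common convention, not necessarily the exact ones from [1]:

class WordShape {
    // Map character classes and collapse runs, so that
    // "RE98754TO" -> "A0A" and "jhon.smith@email.com" -> "a.a@a.a".
    static String shape(String s) {
        StringBuilder sb = new StringBuilder();
        char prev = 0;
        for (char c : s.toCharArray()) {
            char code = Character.isUpperCase(c) ? 'A'
                      : Character.isLowerCase(c) ? 'a'
                      : Character.isDigit(c)     ? '0'
                      : c;                        // keep punctuation such as '@' literally
            if (code != prev) {                   // collapse repeated classes
                sb.append(code);
            }
            prev = code;
        }
        return sb.toString();
    }
}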
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
I'm comparing some senses with RitaWordNet and using SenseRelate::AllWords to disambiguate their word senses, but I'm running into trouble. I can't figure out how to compare the output from RitaWordNet with the AllWords script.
RitaWordNet gives me the sense id, name, description, and pos/bestPos (adjective, verb, noun, etc.), but not the sense number (#1, #2, #3, ...). The output I get is like this:
"user","n", "someone who use something..".
SenseRelate::AllWords can't retrieve the description, but (wsd.pl) gives me name#pos#sensenumber ("User#n#1").
Which was what I was hoping for, actually, but then I realized that Rita doesn't support sense numbers (strangely).
So now I'm a little stuck on how to compare them, how to determine if they are the same sense. Any ideas on how to solve this?
I realized that WordNet returns a list sorted within each part of speech (nouns, verbs, etc.).
Example: "story"
Noun: narrative, narration, story, tale #1
Noun: story #2
Noun: floor, level, storey, story #3
So now I can simply add sense numbers 1, 2 and 3 to the first three nouns (and the rest will continue, of course). Then I can compare this with user#n#1 from the word sense disambiguation. The creators of RitaWordNet responded that they would like to implement sense numbers in their API, but they don't have an upcoming release for a new version, so this may take a while.
I have a string containing a list of names like the one below:
"John asked Kim, Kelly, Lee and Bob about the new year plans". The number of names in the list can vary.
How can I localize this in Java?
I am thinking about ResourceBundle and MessageFormat. How will I write the pattern for this in MessageFormat?
Is there any better approach?
Localizing an (inline) list is more than just translating the word "and." CLDR deals with the issue of formatting lists; check out their page on lists. I'm afraid ICU doesn't have support for this yet, so you might need to code it separately.
Another issue is that you cannot expect to be able to use names as such in sentences like this. Many languages require the object to be in an inflected form, for example. In Finnish, your sample sentence would read as "John kysyi Kimiltä, Kellyltä, Leeltä ja Bobilta uudenvuoden suunnitelmista." So you may need to find out and include different inflected forms of the names. Moreover, if the language used does not use the Latin alphabet, you may need transliterated forms of the names (e.g., in Arabic, John is جون). There are other problems as well. In Russian, the verb corresponding to "asked" depends on the gender of the subject (e.g., спросила vs. спросил).
I know this sounds complex, but localization is often complex. If you target a limited set of languages only, things can be much easier, so it is important to define your goals, perhaps accepting some simplifications that may result in grammatically incorrect expressions. But for localization that is to cover a wide range of languages, you may need to make the generating function itself localized. That is, you would have, for each language, a function that accepts a list of names as arguments and returns a string representing the statement, possibly using resource files containing information (transliterated forms, different inflected forms, gender) about proper names that may appear.
In some situations, you might even consider generating the sentence in English, then sending it to an online translator. For example, Google Translate can deal with some of the issues I mentioned. It surely produces wrong translations a lot, but for sentences with a grammatically very simple structure, it might be a pragmatic solution, if you can accept some amount of error. If you consider trying this, make sure you test sufficiently how the automatic translator handles the specific sentences you will use. Quite often you can improve the results by reformulating the sentences. Dividing a sentence with several clauses into separate sentences often helps. But even your simple sentence causes problems in automatic translation.
You might avoid some complications if you can reformulate the sentence structure, e.g. so that all the nouns appear in the subject position and you avoid “packed” expressions like “new year plans.” For example, “John asked what plans Kim, Kelly, Lee, and Bob have for the new year” would be simpler, both for automatic translation and for pattern-based localization.
You could do something like:
"{0} asked {1} about the new year plans"
where {0} is the first name and {1} is a comma-separated list of the other names.
Hope this helps.
I see an answer was already accepted; I'm just adding this here as an alternative. The code has hard-coded values for the data, but is only meant to present an idea that can be refined:
MessageFormat people = new MessageFormat("{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans");
String john = "John";
Object[][] parties = new Object[][] { {john, 0}, {john, 1, "Kim"}, {john, 2, "Kim", "Kelly"}, {john, 4, "Kim, Kelly, Lee", "Bob"} };
for (final Object[] strings : parties) {
    System.out.println(people.format(strings));
}
This outputs the following:
John asked no one about the new year plans
John asked Kim about the new year plans
John asked Kim and Kelly about the new year plans
John asked Kim, Kelly, Lee, and Bob about the new year plans
Determining the number of names used for the 2nd argument and creating the comma-delimited string for the 3rd argument isn't shown in that sample, but can easily be done instead of using the hard-coded values I used.
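For example, a small (hypothetical) helper that derives both arguments from an actual list of names could look like this:

import java.util.List;

class AskedArgs {
    // Builds the argument array the pattern above expects:
    // {asker, count, comma-joined names except the last, last name}.
    static Object[] toArgs(String asker, List<String> names) {
        int n = names.size();
        if (n == 0) return new Object[] { asker, 0 };
        if (n == 1) return new Object[] { asker, 1, names.get(0) };
        String head = String.join(", ", names.subList(0, n - 1));
        return new Object[] { asker, n, head, names.get(n - 1) };
    }
}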
For localization, the normal approach is to use external language packs: a file containing the text you're going to display, with each text assigned a name/key; the program then loads the text by its key.
You could combine your ResourceBundle (for i18n) with a MessageFormat (to replace the placeholders with the names): "{0} asked {1} about the new year plans"
It would be up to you to prepare the names beforehand, though.
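A minimal sketch, assuming a bundle base name Messages and a key asked.sentence holding the pattern above (both names are made up for illustration):

import java.text.MessageFormat;
import java.util.Locale;
import java.util.ResourceBundle;

class AskedSentence {
    // e.g. Messages_en.properties contains:
    //   asked.sentence={0} asked {1} about the new year plans
    static String format(Locale locale, String asker, String joinedNames) {
        ResourceBundle bundle = ResourceBundle.getBundle("Messages", locale);
        MessageFormat fmt = new MessageFormat(bundle.getString("asked.sentence"), locale);
        return fmt.format(new Object[] { asker, joinedNames });
    }
}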
Are there any Java APIs which will provide the plural form of English words (e.g. cacti for cactus)?
Check out Evo Inflector, which implements an English pluralization algorithm based on Damian Conway's paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports a 100% success rate for the 1000 most used English words and a 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that due to the open nature of Wiktionary some data there might be incorrect.
Example Usage:
English.plural("Facility", 1)); // == "Facility"
English.plural("Facility", 2)); // == "Facilities"
jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"
I know there is a simple pluralize() function in Ruby on Rails; maybe you could get at it through JRuby. The problem really isn't easy: I saw pages of rules on how to pluralize, and even that wasn't complete. Some rules are not algorithmic; they depend on stem origin etc., which isn't easily obtained. So you have to decide how perfect you want to be.
Considering Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or Google for "inflector" and "randall hauch".
It's hard to find this kind of API; rather, you need to find a web service which can serve your purpose. Check this. I am not sure if it can help you.
(I tried the word cacti and got cactus somewhere in the response.)
If you can harness JavaScript, I created a lightweight (7.19 KB) JavaScript library for this. Or you could port my script over to Java. It's very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: it looks like cacti to cactus is a super special conversion (most people are going to say "1 cactus" anyway). It's easy to add that if you want to. The source code is easy to read/update.
Wolfram|Alpha returns a list of inflection forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/