Is there an open source project that handles localizing tokenized string text into other languages, with complex handling of grammar and inflection (definite/indefinite articles, plural/singular forms) and, for languages like German, of grammatical gender (masculine, feminine, neuter)?
Most localization frameworks do a wholesale replacement of strings and don't account for tokenized strings that may refer to objects which, in some languages, are masculine, feminine, or neuter.
The programming languages I'm interested in are JavaScript, Java, ActionScript, or Python; it would be nice if there were a programming-language-independent data format for creating the string tables.
To answer your question: I've not heard of any such framework.
From the limited amount that I understand and have heard about this topic, I'd say that this is beyond the state of the art ... certainly if you are trying to do this across multiple languages.
Here are some relevant resources:
"Open-Source Software and Localization" by Frank Bergman [2005]
Plural forms in GNU gettext - http://www.gnu.org/software/hello/manual/gettext/Plural-forms.html
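For a flavor of what gettext's plural handling looks like: the .po file header carries a Plural-Forms expression that maps a count n to a plural-form index. The Polish example from the gettext manual, which needs three forms, reads:

```
Plural-Forms: nplurals=3; plural=n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
```

Note this only solves pluralization, not gender agreement, which supports the point that the full problem is beyond current frameworks.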
Related
I found out that Apache Tika often misidentifies the language of Asian-language texts (e.g. Japanese, Chinese) that contain small pieces of English.
Text sample:
サンフランシスコ生まれのチョコレートショップ、ダンデライオン・チョコレートからほうじ茶ブレンドの「ホットチョコレートミックス」が新登場。 ドリンクとしてだけでなく、アイスクリームやお餅にかけてもOK。ほうじ茶ホットチョコレート 150g(約5杯分) ¥2,160 ©Dandelion Chocolate Japan カカオ豆からチョコレートバーになるまでの全工程を一貫して行う「ビーン・トゥー・バー(Bean to Bar)」メーカーの先駆け的存在、ダンデライオン・チョコレート。その人気商品ホットチョコレートミックスは「自宅でホットチョコレートを飲みたい!」という客からの要望を元に開発されたものだ。ほかのチョコレートバー同様、カカオ豆はシングルオリジンにこだわり、ほろ苦さとフルーティーな酸味のバランスがよいドミニカ共和国産を使用。カカオ本来の豊かな風味を味わえる。 このホットチョコレートミックスが新たに出合ったのが、京都・河原町の日本茶専門店「YUGEN」のほうじ茶だ。宇治を中心に各地の生産者を訪れ、厳選された日本茶のみを扱うYUGEN。ほうじ茶も春に収穫された新芽だけで作られる。その茶葉を石臼でじっくりと挽き、パウダー状にしてブレンドすることで、香ばしさだけでなく、旨味、甘み、栄養成分までそのまま溶け込む、深みのある味わいのホットチョコレートが誕生。 オンラインストアでも購入できるので、チョコレートとほうじ茶のおいしいマリアージュを自宅で堪能して。 ダンデライオン・チョコレート ファクトリー&カフェ蔵前 東京都台東区蔵前4-14-6 tel:03-5833-7270 営)10時〜20時(L.O. 19時30分) 不定休 https://dandelionchocolate.jp ※4/12まで休業中。そのほかの店舗詳細情報はウェブサイトで要確認。 ※この記事に記載している価格は、軽減税率8%の税込価格です。 【関連記事】 産地の魅力が際立つ、誘惑の「Bean to Bar」。 マロン×カカオ、ビーン・トゥ・バー発想のモンブラン。
languageDetector.detect() -> en: MEDIUM (0.570320)
languageDetector.detectAll() -> ArrayList(size=1) en: MEDIUM (0.570320)
I'm using the default LanguageDetector implementation, OptimaizeLangDetector.
Code sample:
// Imports added for clarity:
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

LanguageDetector languageDetector = LanguageDetector.getDefaultLanguageDetector().loadModels();
languageDetector.addText(text);
String result = languageDetector.detect().getLanguage();
I figured out that after splitting a text into small pieces (n-grams), Tika searches for matches in wordLangProbMap, which is built from only 114k n-grams across all languages. This leads to a situation where most of the Japanese n-grams are not found in that map, while all English n-grams are correctly identified. As a result, for Japanese or Chinese texts Tika often identifies the language as French, German, English, and so on.
Is there any way to extend the list over which the matching is done?
I'm also open to other alternatives if they behave better. Previously I used lingua, but it shows a dramatic performance drop and heavy resource usage on large texts.
I'm in the process of putting together an Android app which will have a Quenya translation out of the box (Quenya is one of Tolkien's Elvish languages).
If we would like to maintain the maximum compliance with the ISO standards while representing this fantasy world, how should we go about it?
Also, if there is a standard for representing Middle Earth that has already been agreed on by the community, what is it?
Perhaps we would:
require more letters for the language or country codes (like "TME" for "Tolkien's Middle Earth" or "MEGN" for "Middle Earth, Gondor")
I am not aware of any community-agreed standard for such a country code yet.
However, your suggestion of "more letters for the language or country codes" would surely be a bad idea. The ISO standards already define how many characters a country or language code can be. For example, ISO 3166-1 alpha-3 codes are 3 characters long, while ISO 3166-1 alpha-2 codes are 2 characters long.
I think your best bet is to choose a code that is not used by any country, or to choose from the deleted codes, so that presumably no one is using it or going to use it. (For example, MID is a deleted code which looks like a good fit for Middle Earth.)
Quenya has already been registered in ISO 639-3 with the code "qya": http://www-01.sil.org/iso639-3/documentation.asp?id=qya
As Kevin Gruber pointed out Quenya is registered. There are a few reserved user-assignable codes available in the 2 character space, these might be acceptable as well, to keep from colliding with valid country codes.
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#ZZ
The following alpha-2 codes can be user-assigned: AA, QM to QZ, XA to
XZ, and ZZ.
ZZ is unofficially used to represent "Unknown or Invalid Territory", which is probably the best fit in this case.
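Putting the two pieces together, a small sketch of how this might look in Java: the registered ISO 639-3 code "qya" combined with a user-assigned alpha-2 region code (here "AA", one of the user-assignable codes listed above) yields a well-formed BCP 47 language tag:

```java
import java.util.Locale;

public class MiddleEarthLocale {
    public static void main(String[] args) {
        // "qya" is the registered ISO 639-3 code for Quenya;
        // "AA" is one of the user-assignable ISO 3166-1 alpha-2 codes.
        Locale quenya = new Locale.Builder()
                .setLanguage("qya")
                .setRegion("AA")
                .build();
        System.out.println(quenya.toLanguageTag()); // qya-AA
    }
}
```

Because both subtags are standard-compliant, resource-lookup mechanisms that key on language tags (such as Android's values-qualifier directories) should handle this without special-casing.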
I have a sample data of employees.
Brad Senior
<Fname>Brad Junior</Fname>
CHICAGO, March 6 1990 - He is a great Java Developer.
He has worked in XYZ company.
Data is in the format:
Person's name
<Fname> xxx </Fname> // Optional
Current Location, DOB - Description about his work.
I am able to parse it using BufferedReader and a lot of conditional logic.
Is there a better way to parse this content (e.g. with regex) and store it in an Employee object?
I cannot use external libraries.
Thanks.
You could use a parser generator such as CUP.
It's really useful when the format becomes more complex. It also makes maintenance of the parser easier if the file format is extended.
"Better" is an opinion, so your question is innately unanswerable. As Tobías stated, parsers are arguable and your options are vast. I would recommend the industry-standard ANTLR over CUP, mainly because I could not find any license information for CUP anywhere.
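That said, since external libraries are ruled out, plain java.util.regex may be enough for the three-line record format shown in the question. A minimal sketch (the "|"-joined return value is just for illustration; in real code you would populate an Employee object instead):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmployeeParser {
    // Optional second line: <Fname> ... </Fname>
    private static final Pattern FNAME =
            Pattern.compile("<Fname>\\s*(.*?)\\s*</Fname>");
    // Third line: "Location, DOB - description"
    private static final Pattern DETAILS =
            Pattern.compile("(.+?),\\s*(.+?)\\s*-\\s*(.*)");

    // Returns "name|fname|location|dob|description" ("" for a missing Fname).
    static String parse(String[] lines) {
        String name = lines[0];
        int i = 1;
        String fname = "";
        Matcher fm = FNAME.matcher(lines[i]);
        if (fm.matches()) {
            fname = fm.group(1);
            i++;
        }
        Matcher dm = DETAILS.matcher(lines[i]);
        if (!dm.matches()) {
            throw new IllegalArgumentException("bad record: " + lines[i]);
        }
        return String.join("|", name, fname, dm.group(1), dm.group(2), dm.group(3));
    }

    public static void main(String[] args) {
        String[] record = {
            "Brad Senior",
            "<Fname>Brad Junior</Fname>",
            "CHICAGO, March 6 1990 - He is a great Java Developer."
        };
        System.out.println(parse(record));
        // Brad Senior|Brad Junior|CHICAGO|March 6 1990|He is a great Java Developer.
    }
}
```

The lazy quantifiers (`.+?`) stop at the first comma and the first " - " respectively, which matches the "Location, DOB - description" layout as long as the location itself contains no comma.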
Suppose I'm forming a sentence dynamically based on data passed in from someone else. Let's say it's a list of food items, like ["apples", "pears", "oranges"] or ["bread", "meat"]. I want to take these lists and form the sentences "I like to eat apples, pears and oranges" and "I like to eat bread and meat."
It's easy to do this for just English, but I'm fairly sure that conjunctions don't work the same way in all languages (in fact I bet some of them are wildly different). How do you localize a sentence with a dynamic number of items, joined in some manner?
If it helps, I'm working on this for Android, so you will be able to use any libraries that it provides (or one for Java).
I would ask native speakers of the languages you want to target. If all the languages use the same construct (a comma between all the elements except before the last one), then simply build a comma-separated list of the first n - 1 elements, externalize a pattern, and use java.util.MessageFormat to build the whole sentence, with the n - 1 elements string as the first argument and the nth element as the second. You might also externalize the separator (a comma in English) to make it more flexible if needed.
If some languages use another construct, then define several Strategy implementations, externalize the name of the strategy in order to know which strategy to use for a given locale, and then ask the appropriate strategy to format the list for you. The strategy for English would use the algorithm described above. The strategy for another language would use a different way of joining the elements together, but it could be reused for several languages using externalized patterns if those languages use the same construct but with different words.
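The MessageFormat approach described above can be sketched as follows. The pattern and separator strings are hard-coded assumptions here; in a real app they would be externalized per locale (e.g. in a ResourceBundle), and the per-language strategies would swap them out:

```java
import java.text.MessageFormat;
import java.util.List;

public class ListJoiner {
    // Hypothetical externalized values for English; per-locale in practice.
    static final String PATTERN = "I like to eat {0} and {1}.";
    static final String SINGLE_PATTERN = "I like to eat {0}.";
    static final String SEPARATOR = ", ";

    static String join(List<String> items) {
        if (items.size() == 1) {
            return MessageFormat.format(SINGLE_PATTERN, items.get(0));
        }
        // First n - 1 elements joined with the externalized separator...
        String head = String.join(SEPARATOR, items.subList(0, items.size() - 1));
        // ...then the nth element slotted in via the externalized pattern.
        String last = items.get(items.size() - 1);
        return MessageFormat.format(PATTERN, head, last);
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("apples", "pears", "oranges")));
        // I like to eat apples, pears and oranges.
        System.out.println(join(List.of("bread", "meat")));
        // I like to eat bread and meat.
    }
}
```

As the answers below point out, this only works for languages whose grammar doesn't inflect the items themselves; for those, a different strategy implementation is needed.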
Note that it's because this is so difficult that most programs simply write something like "List of what I like: bread, meat."
The problem, of course, is knowing the other languages. I don't know how you'd do it accurately without finding native speakers for your target languages. Google Translate is an option, but you should still have someone who speaks the language proofread it.
Android has this built-in functionality in the Plurals string resources.
Their documentation provides examples in English (res/values/strings.xml) and also in another language (placed in res/values-pl/strings.xml).
Valid quantities: {zero, one, two, few, many, other}
Their example:
<?xml version="1.0" encoding="utf-8"?>
<resources>
<plurals name="numberOfSongsAvailable">
<item quantity="one">One song found.</item>
<item quantity="other">%d songs found.</item>
</plurals>
</resources>
So it seems like their approach is to use longer sentence fragments than just the individual words and plurals.
Very difficult. Take French, for example: it uses articles in front of each noun. Your example would be:
J'aime manger des pommes, des poires et des oranges.
Or better:
J'aime les pommes, les poires et les oranges.
(Literally: I like the apples, the pears and the oranges).
So it's no longer only a problem of conjunctions. It's a matter of different grammatical rules that go beyond conjunctions. Needless to say, other languages may raise totally different issues.
This is unfortunately nearly impossible to do, because of conjugation on one side and gender-dependent forms on the other. English doesn't make that distinction, but many (most?) other languages do. To make matters worse, the verb form might depend on the following noun (both its gender-dependent form and its conjugation).
Instead of such nice sentences, we tend to use lists of things, for example:
The list of things I would like to eat:
apples
pears
oranges
I am afraid that this is the only reasonable thing to do (variable, modifiable list).
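The fallback list approach is trivially localizable, because the only translatable string is the header sentence; the items are appended verbatim, so no gender or conjugation rules apply. A minimal sketch:

```java
import java.util.List;

public class FallbackList {
    // The localized header is the only string a translator has to touch;
    // items are appended as-is, avoiding agreement rules entirely.
    static String format(String localizedHeader, List<String> items) {
        StringBuilder sb = new StringBuilder(localizedHeader);
        for (String item : items) {
            sb.append("\n- ").append(item);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("The list of things I would like to eat:",
                List.of("apples", "pears", "oranges")));
    }
}
```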
Are there any Java APIs which will provide the plural form of English words (e.g. cacti for cactus)?
Check Evo Inflector which implements English pluralization algorithm based on Damian Conway paper "An Algorithmic Approach to English Pluralization".
The library is tested against data from Wiktionary and reports a 100% success rate for the 1000 most used English words and a 70% success rate for all the words listed in Wiktionary.
If you want even more accuracy, you can take a Wiktionary dump and parse it to create a database of singular-to-plural mappings. Take into account that due to the open nature of Wiktionary, some data there might be incorrect.
Example Usage:
English.plural("Facility", 1); // == "Facility"
English.plural("Facility", 2); // == "Facilities"
jibx-tools provides a convenient pluralizer/depluralizer.
Groovy test:
NameConverter nameTools = new DefaultNameConverter();
assert nameTools.depluralize("apples") == "apple"
assert nameTools.pluralize("apple") == "apples"
I know there is simple pluralize() function in Ruby on Rails, maybe you could get that through JRuby. The problem really isn't easy, I saw pages of rules on how to pluralize and it wasn't even complete. Some rules are not algorithmic - they depend on stem origin etc. which isn't easily obtained. So you have to decide how perfect you want to be.
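To illustrate why those rule lists get so long: a naive stdlib-only sketch covering just a few regular English rules might look like the code below. Everything irregular (cactus -> cacti, goose -> geese, deer -> deer) would still need an explicit lookup table, which is exactly why the linked libraries exist:

```java
public class NaivePluralizer {
    // Handles only a handful of regular English rules; irregular nouns
    // need a lookup table on top of this.
    static String plural(String word) {
        if (word.matches(".*(s|x|z|ch|sh)")) {
            return word + "es";                                  // bus -> buses
        }
        if (word.matches(".*[^aeiou]y")) {
            return word.substring(0, word.length() - 1) + "ies"; // city -> cities
        }
        return word + "s";                                       // apple -> apples
    }

    public static void main(String[] args) {
        System.out.println(plural("bus"));   // buses
        System.out.println(plural("city"));  // cities
        System.out.println(plural("apple")); // apples
    }
}
```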
Considering Java, have a look at ModeShape's Inflector class in the package org.modeshape.common.text. Or google for "inflector" and "Randall Hauch".
It's hard to find this kind of API. Rather, you need to find a web service that can serve your purpose. Check this; I am not sure if it can help you.
(I tried the word cacti and got cactus somewhere in the response.)
If you can harness JavaScript, I created a lightweight (7.19 KB) JavaScript library for this. Or you could port my script over to Java. It's very easy to use:
pluralizer.run('goose') --> 'geese'
pluralizer.run('deer') --> 'deer'
pluralizer.run('can') --> 'cans'
https://github.com/rhroyston/pluralizer-js
BTW: it looks like cacti to cactus is a very special conversion (most people are going to say "1 cactus" anyway). It's easy to add that if you want to; the source code is easy to read and update.
Wolfram|Alpha returns a list of inflection forms for a given word.
See this as an example:
http://www.wolframalpha.com/input/?i=word+cactus+inflected+forms
And here is their API:
http://products.wolframalpha.com/api/