Apache Tika Language detection: Asian languages enhancement - java

I found out that Apache Tika often gets confused by Asian-language texts (e.g. Japanese, Chinese) that contain small pieces of English.
Text sample:
サンフランシスコ生まれのチョコレートショップ、ダンデライオン・チョコレートからほうじ茶ブレンドの「ホットチョコレートミックス」が新登場。 ドリンクとしてだけでなく、アイスクリームやお餅にかけてもOK。ほうじ茶ホットチョコレート 150g(約5杯分) ¥2,160 ©Dandelion Chocolate Japan カカオ豆からチョコレートバーになるまでの全工程を一貫して行う「ビーン・トゥー・バー(Bean to Bar)」メーカーの先駆け的存在、ダンデライオン・チョコレート。その人気商品ホットチョコレートミックスは「自宅でホットチョコレートを飲みたい!」という客からの要望を元に開発されたものだ。ほかのチョコレートバー同様、カカオ豆はシングルオリジンにこだわり、ほろ苦さとフルーティーな酸味のバランスがよいドミニカ共和国産を使用。カカオ本来の豊かな風味を味わえる。 このホットチョコレートミックスが新たに出合ったのが、京都・河原町の日本茶専門店「YUGEN」のほうじ茶だ。宇治を中心に各地の生産者を訪れ、厳選された日本茶のみを扱うYUGEN。ほうじ茶も春に収穫された新芽だけで作られる。その茶葉を石臼でじっくりと挽き、パウダー状にしてブレンドすることで、香ばしさだけでなく、旨味、甘み、栄養成分までそのまま溶け込む、深みのある味わいのホットチョコレートが誕生。 オンラインストアでも購入できるので、チョコレートとほうじ茶のおいしいマリアージュを自宅で堪能して。 ダンデライオン・チョコレート ファクトリー&カフェ蔵前 東京都台東区蔵前4-14-6 tel:03-5833-7270 営)10時〜20時(L.O. 19時30分) 不定休 https://dandelionchocolate.jp ※4/12まで休業中。そのほかの店舗詳細情報はウェブサイトで要確認。 ※この記事に記載している価格は、軽減税率8%の税込価格です。 【関連記事】 産地の魅力が際立つ、誘惑の「Bean to Bar」。 マロン×カカオ、ビーン・トゥ・バー発想のモンブラン。
languageDetector.detect() -> en: MEDIUM (0.570320)
languageDetector.detectAll() -> ArrayList(size=1) en: MEDIUM (0.570320)
I'm using default LanguageDetector implementation: OptimaizeLangDetector
Code sample:
import org.apache.tika.language.detect.LanguageDetector;

// loadModels() loads the built-in language profiles (declared to throw IOException)
LanguageDetector languageDetector = LanguageDetector.getDefaultLanguageDetector().loadModels();
languageDetector.addText(text);
var result = languageDetector.detect().getLanguage();
I figured out that after splitting a text into small pieces (n-grams), Tika searches for matches in wordLangProbMap, which is built from only 114k n-grams across all languages. This leads to a situation where most Japanese characters (n-grams) are not found in that map, while all the English n-grams are correctly identified. As a result, for Japanese or Chinese texts Tika often identifies the language as French, German, English, and so on.
Is there any way to extend the list over which the matching is done?
I'm also considering other alternatives if they behave better. Previously I used lingua, but it shows a dramatic performance drop and heavy resource usage on large texts.
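One workaround I would try (a rough sketch, not a tested fix: it assumes Tika's LanguageDetector.loadModels(Set<String>) overload and that the Optimaize models use codes such as "en", "ja", "zh-CN") is to restrict the detector to the candidate languages you actually expect, so stray English n-grams cannot pull the result towards unrelated European languages:

import java.util.Set;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

// Only load models for the languages expected in the corpus (assumed codes).
LanguageDetector detector = LanguageDetector.getDefaultLanguageDetector()
        .loadModels(Set.of("en", "ja", "zh-CN", "zh-TW"));

detector.addText(text);
for (LanguageResult result : detector.detectAll()) {
    System.out.println(result.getLanguage() + ": " + result.getConfidence()
            + " (" + result.getRawScore() + ")");
}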

Related

How to convert 3 letter language code to corresponding text?

Are there any Java libraries to convert a 3-letter language code to its corresponding language name, with localization support?
Like,
ENG -> English
PS: I guess it's a bad question, but Google was not much help, hence turning to you all. Probably my search term was not accurate.
Use Locale's getDisplayLanguage() method:
Locale eng = Locale.forLanguageTag("ENG"); // Make a locale from language code
System.out.println(eng.getDisplayLanguage()); // Obtain language display name
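If you also want the localized names the question asks about, getDisplayLanguage has an overload that takes the target locale (a small sketch using only standard Locale methods; the exact output strings depend on the JDK's locale data):

import java.util.Locale;

Locale eng = Locale.forLanguageTag("eng"); // 3-letter ISO 639-2 code

// Display name in the default locale, e.g. "English"
System.out.println(eng.getDisplayLanguage());

// Display name localized into other languages
System.out.println(eng.getDisplayLanguage(Locale.FRENCH)); // e.g. "anglais"
System.out.println(eng.getDisplayLanguage(Locale.GERMAN)); // e.g. "Englisch"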
I do not know about a Java library but this might help.
https://www.loc.gov/standards/iso639-2/php/code_list.php
It has the data you are looking for. You might have to scrape it off the page and put it into your Java code.
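If you do end up scraping that page, something like jsoup makes it a few lines (a hypothetical sketch: the table selector and column indices are assumptions about the page's current layout, so verify them before relying on this):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("https://www.loc.gov/standards/iso639-2/php/code_list.php").get();

// Assumption: rows hold the 3-letter code in the first cell and the English name in the third.
for (Element row : doc.select("table tr")) {
    if (row.select("td").size() >= 3) {
        String code = row.select("td").get(0).text();
        String english = row.select("td").get(2).text();
        System.out.println(code + " -> " + english);
    }
}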

What is the best way to represent Middle Earth using Java's Locale framework?

I'm in the process of putting together an android app which will have a Quenya translation out of the box (an Elvish dialect).
If we would like to maintain the maximum compliance with the ISO standards while representing this fantasy world, how should we go about it?
Also, if there is a standard for representing Middle Earth that has already been agreed on by the community, what is it?
Perhaps we would:
require more letters for the language or country codes (like "TME" for "Tolkien's Middle Earth" or "MEGN" for "Middle Earth, Gondor")
I am not aware of any community-agreed standard for such a country code yet.
However, your suggestion of "more letters for the language or country code" would surely be a bad idea. The ISO standards already define how many characters a country or language code can have. For example, the ISO 3166-1 alpha-3 standard defines 3-character country codes, while ISO 3166-1 alpha-2 defines 2-character ones.
I think your best bet is to choose a code that is not used by any country, or to pick one of the deleted codes, since supposedly no one is using them or going to use them. (For example, MID is a deleted code that looks like a good fit for Middle Earth.)
Quenya has already been registered in ISO 639-3 with the code "qya": http://www-01.sil.org/iso639-3/documentation.asp?id=qya
As Kevin Gruber pointed out, Quenya is already registered. There are also a few reserved, user-assignable codes available in the 2-character space; these might be acceptable as well, to avoid colliding with valid country codes.
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#ZZ
The following alpha-2 codes can be user-assigned: AA, QM to QZ, XA to
XZ, and ZZ.
ZZ is unofficially used to represent "Unknown or Invalid Territory", which is probably the best fit in this case.
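Putting those two answers together, you can build such a locale with the standard Locale.Builder (a small sketch; using "ZZ" as the region is just the convention described above, not an official assignment):

import java.util.Locale;

// "qya" is the registered ISO 639-3 code for Quenya;
// "ZZ" is a user-assigned / "unknown territory" region code.
Locale quenya = new Locale.Builder()
        .setLanguage("qya")
        .setRegion("ZZ")
        .build();

System.out.println(quenya.toLanguageTag()); // qya-ZZ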

About Locale class in Java

I'm using a TTS method on Android which takes a Locale instance as an argument. So I googled the Locale class and found some example code. But I don't understand the difference between the usages below, because I tested them all with a TTS method and they all seem to work the same to me.
Locale("ja")
Locale("ja_JP")
Locale("ja", "JP", "")
Locale.JAPAN
Locale.JAPANESE
Are there any differences?
The documentation for the Locale class describes this in (almost excruciating) detail. The valid language codes are described in ISO 639, and the country codes in ISO 3166.
Here are the differences between the five examples you give:
ja simply describes the Japanese language with no country.
ja_JP specifies both the Japanese language and the country of Japan
The three-parameter constructor splits the language, country, and variant into separate arguments; Locale("ja", "JP", "") is equivalent to Locale("ja", "JP") since no variant is provided. (Note that passing the single string "ja_JP" to the constructor treats the whole string as the language code, so it is not strictly equivalent; Locale.forLanguageTag("ja-JP") is the reliable way to parse a combined tag.)
Locale.JAPAN is a constant shortcut for ja_JP (the country of Japan).
Locale.JAPANESE is a constant shortcut for ja (the Japanese language).
What does this all mean? Well, it depends on where it is used. Locales are used with a number of different APIs, including the date-time APIs, text-to-speech APIs, and more.
In the context of text-to-speech, the locale can be used in a number of ways, such as:
Selecting the appropriate voice to use
Applying the proper inflection for certain words. Different locales may speak the same word in the same language differently.
Translating certain non-words into speech. For instance, different locales may speak numbers or fractions differently.
In general, you want to be as specific and accurate as possible when selecting a Locale.
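A quick way to see the differences for yourself (a small sketch; the last line is included only to show why passing "ja_JP" as a single string is a pitfall):

import java.util.Locale;

Locale language = new Locale("ja");              // language only
Locale languageCountry = new Locale("ja", "JP"); // language + country
Locale fromTag = Locale.forLanguageTag("ja-JP"); // BCP 47 tag

System.out.println(language.toLanguageTag());        // ja
System.out.println(languageCountry.toLanguageTag()); // ja-JP
System.out.println(fromTag.toLanguageTag());         // ja-JP
System.out.println(Locale.JAPAN.toLanguageTag());    // ja-JP
System.out.println(Locale.JAPANESE.toLanguageTag()); // ja

// Pitfall: the whole string becomes an ill-formed language code,
// which toLanguageTag() reports as "und" (undetermined).
System.out.println(new Locale("ja_JP").toLanguageTag());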

Open Source Text Localization Library

Is there an open source project that handles the process of localizing tokenized string text into other languages, with proper handling of grammar (definite/indefinite articles, plural/singular forms) and, for languages like German, of masculine, feminine, and neuter genders?
Most localization frameworks do wholesale replacement of strings and don't take into account tokenized strings that might refer to objects which, in some languages, could be masculine, feminine, or neuter.
The programming languages I'm looking at are JavaScript/Java/ActionScript/Python; it would be nice if there were a programming-language-independent data format for creating the string tables.
To answer your question: I've not heard of any such framework.
From the limited amount that I understand and have heard about this topic, I'd say that this is beyond the state of the art ... certainly if you are trying to do this across multiple languages.
Here are some relevant resources:
"Open-Source Software and Localization" by Frank Bergman [2005]
Plural forms in GNU gettext - http://www.gnu.org/software/hello/manual/gettext/Plural-forms.html
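I am not aware of anything that handles grammatical gender for tokenized strings, but for the plural-form part of the problem the JDK itself can do something in the spirit of gettext's plural rules (a minimal sketch using java.text.MessageFormat's choice format; a real project would externalize one pattern per language):

import java.text.MessageFormat;

// English plural rule encoded as a choice pattern: zero, one, or many.
MessageFormat format = new MessageFormat(
        "{0,choice,0#no files|1#one file|1<{0,number,integer} files} deleted.");

for (int count : new int[] {0, 1, 5}) {
    System.out.println(format.format(new Object[] {count}));
}
// no files deleted.
// one file deleted.
// 5 files deleted.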

How do I tell what language is a plain-text file written in? [closed]

Suppose we have a text file with the content:
"Je suis un beau homme ..."
another with:
"I am a brave man"
the third with a text in German:
"Guten morgen. Wie geht's ?"
How do we write a function that would tell us, with some probability, that the text in the first file is in French, the second in English, and so on?
Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.
My comments
There's one small comment I need to add: the text may contain phrases in different languages, either deliberately or as a result of a mistake. In classic literature we have a lot of examples, because members of the aristocracy were multilingual. So a probability describes the situation better: most of the text is in one language, while other parts may be written in another.
Regarding the Google API and its internet connection requirement: I would prefer not to use remote functions/services, as I need to do it myself or use a downloadable library. I'd also like to do some research on the topic.
There is a package called JLangDetect which seems to do exactly what you want:
langof("un texte en français") = fr : OK
langof("a text in english") = en : OK
langof("un texto en español") = es : OK
langof("un texte un peu plus long en français") = fr : OK
langof("a text a little longer in english") = en : OK
langof("a little longer text in english") = en : OK
langof("un texto un poco mas largo en español") = es : OK
langof("J'aime les bisounours !") = fr : OK
langof("Bienvenue à Montmartre !") = fr : OK
langof("Welcome to London !") = en : OK
// ...
Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.
Language detection by Google: http://code.google.com/apis/ajaxlanguage/documentation/#Detect
For larger corpora of text you usually use the distribution of letters, digraphs, and even trigraphs, and compare it with known distributions for the languages you want to detect.
However, a single sentence is very likely too short to yield any useful statistical measures. You may have more luck matching individual words against a dictionary instead.
NGramJ seems to be a bit more up-to-date:
http://ngramj.sourceforge.net/
It also has both character-oriented and byte-oriented profiles, so it should be able to identify the character set too.
For documents in multiple languages you need to identify the character set (ICU4J has a CharsetDetector that can do this), then split the text on something reasonable like multiple line breaks, or on paragraphs if the text is marked up.
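The charset-identification step with ICU4J looks roughly like this (a short sketch; it needs the com.ibm.icu:icu4j dependency):

import java.nio.file.Files;
import java.nio.file.Paths;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

byte[] bytes = Files.readAllBytes(Paths.get("document.txt"));

CharsetDetector detector = new CharsetDetector();
detector.setText(bytes);

CharsetMatch match = detector.detect();
System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");

// Decode with the detected charset before running language detection.
String text = new String(bytes, match.getName());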
Try Nutch's Language Identifier. It is trained with n-gram profiles of languages, and the profiles of the available languages are matched against the input text. The interesting thing is that you can add more languages if you need to.
Look up Markov chains.
Basically you will need statistically significant samples of the languages you want to recognize. When you get a new file, see what the frequencies of specific syllables or phonemes are and compare them to the pre-calculated samples. Pick the closest one.
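A toy version of that idea (a hypothetical sketch: character-bigram frequency profiles compared by a sum of squared differences; real detectors use far larger training corpora plus smoothing):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BigramProfile {

    // Build a normalized character-bigram frequency profile of a text.
    static Map<String, Double> profile(String text) {
        Map<String, Double> counts = new HashMap<>();
        String cleaned = text.toLowerCase();
        for (int i = 0; i < cleaned.length() - 1; i++) {
            counts.merge(cleaned.substring(i, i + 2), 1.0, Double::sum);
        }
        double total = counts.values().stream().mapToDouble(Double::doubleValue).sum();
        counts.replaceAll((k, v) -> v / total);
        return counts;
    }

    // Sum of squared differences between two profiles (smaller = more similar).
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double d = 0;
        for (String k : keys) {
            double diff = a.getOrDefault(k, 0.0) - b.getOrDefault(k, 0.0);
            d += diff * diff;
        }
        return d;
    }

    public static void main(String[] args) {
        // Tiny "training" samples; real ones must be far larger.
        Map<String, Double> en = profile("the quick brown fox jumps over the lazy dog and the cat");
        Map<String, Double> fr = profile("le renard brun saute par dessus le chien et le chat paresseux");

        Map<String, Double> unknown = profile("the dog and the fox are lazy");
        System.out.println(distance(unknown, en) < distance(unknown, fr) ? "en" : "fr");
    }
}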
Although a more complicated solution than you are looking for, you could use Vowpal Wabbit and train it with sentences from different languages.
In theory you could get back a language for every sentence in your documents.
http://hunch.net/~vw/
(Don't be fooled by the "online" in the project's subtitle - that's just mathspeak for learning without having to hold the whole training material in memory.)
If you are interested in the mechanism by which language detection can be performed, I refer you to the following article (Python-based), which uses a (very) naive method but is a good introduction to this problem in particular and to machine learning (just a big word) in general.
For java implementations, JLangDetect and Nutch as suggested by the other posters are pretty good. Also take a look at Lingpipe, JTCL and NGramJ.
For the problem where you have multiple languages in the same page, you can use a sentence boundary detector to chop a page into sentences and then attempt to identify the language of each sentence. Assuming that a sentence contains only one (primary) language, you should still get good results with any of the above implementations.
Note: A sentence boundary detector (SBD) is theoretically language-specific (a chicken-and-egg problem, since you need one for the other). But for Latin-script languages (English, French, German, etc.) that primarily use periods for sentence delimiting (apart from exclamations, etc.), you will get acceptable results even if you use an SBD designed for English. I wrote a rules-based English SBD that has worked really well for French text. For implementations, take a look at OpenNLP.
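With OpenNLP, the sentence-splitting step looks roughly like this (a sketch; en-sent.bin is OpenNLP's pre-trained English sentence model, downloaded separately):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME detector = new SentenceDetectorME(model);

    String[] sentences = detector.sentDetect(
            "Bonjour tout le monde. This page mixes languages. Guten Morgen.");
    for (String sentence : sentences) {
        // Hand each sentence to whichever language detector you chose above.
        System.out.println(sentence);
    }
}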
An alternative to using an SBD is to use a sliding window of, say, 10 tokens (whitespace-delimited) to create a pseudo-sentence (PS) and try to identify the border where the language changes. This has the disadvantage that if your entire document has n tokens, you will perform approximately n-10 classification operations on strings of 10 tokens each. With the other approach, if the average sentence has 10 tokens, you would have performed approximately n/10 classification operations. If n = 1000 words in a document, you are comparing 990 operations versus 100 operations: an order of magnitude difference.
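The sliding-window idea looks roughly like this (a hypothetical sketch; detectLanguage() is a toy stand-in for whichever classifier you pick from the answers above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PseudoSentences {

    // Toy stand-in for a real language classifier (assumption).
    static String detectLanguage(String text) {
        return text.matches(".*[\u3040-\u30ff].*") ? "ja" : "en";
    }

    // Slide a window of windowSize whitespace-delimited tokens over the text and
    // record the detected language of each pseudo-sentence; a change between
    // neighbouring windows marks a likely language border.
    static List<String> detectByWindow(String text, int windowSize) {
        String[] tokens = text.split("\\s+");
        List<String> languages = new ArrayList<>();
        for (int start = 0; start + windowSize <= tokens.length; start++) {
            String window = String.join(" ",
                    Arrays.copyOfRange(tokens, start, start + windowSize));
            languages.add(detectLanguage(window));
        }
        return languages;
    }

    public static void main(String[] args) {
        System.out.println(detectByWindow(
                "This is English text followed by 日本語 の テキスト が 続き ます ここ から", 10));
    }
}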
If you have short phrases (under 20 characters), the accuracy of language detection is poor in my experience, particularly for proper nouns and for nouns that are the same across languages, like "chocolate". E.g. is "New York" an English word or a French word if it appears in a French sentence?
Do you have a connection to the internet? If you do, then the Google Language API would be perfect for you.
// This example request includes an optional API key which you will need to
// remove or replace with your own key.
// Read more about why it's useful to have an API key.
// The request also includes the userip parameter which provides the end
// user's IP address. Doing so will help distinguish this legitimate
// server-side traffic from traffic which doesn't come from an end-user.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import org.json.JSONObject;

URL url = new URL(
"http://ajax.googleapis.com/ajax/services/search/web?v=1.0&"
+ "q=Paris%20Hilton&key=INSERT-YOUR-KEY&userip=USERS-IP-ADDRESS");
URLConnection connection = url.openConnection();
connection.addRequestProperty("Referer", "http://www.example.com"); // enter the URL of your site here
String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
while((line = reader.readLine()) != null) {
builder.append(line);
}
JSONObject json = new JSONObject(builder.toString());
// now have some fun with the results...
If you don't there are other methods.
Bigram models perform well, are simple to write, simple to train, and require only a small amount of text for detection. The Nutch language identifier is a Java implementation that we found and used with a thin wrapper.
We had problems with a bigram model for mixed CJK and English text (e.g. a tweet that is mostly Japanese but has a single English word). This is obvious in retrospect from looking at the math (Japanese has many more characters, so the probabilities of any given pair are low). I think you could solve this with some more complicated log-linear comparison, but I cheated and used a simple filter based on character sets that are unique to certain languages (i.e. if it only contains unified Han, then it's Chinese; if it contains some Japanese kana along with unified Han, then it's Japanese).
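That kind of script filter can be written with the JDK's Character.UnicodeScript (a simplified sketch of the heuristic, not the exact filter used above):

public class CjkFilter {

    // Returns a language code when the script heuristic is conclusive, null otherwise.
    static String detectCjk(String text) {
        boolean hasKana = false;
        boolean hasHan = false;
        boolean hasHangul = false;

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                hasKana = true;
            } else if (script == Character.UnicodeScript.HAN) {
                hasHan = true;
            } else if (script == Character.UnicodeScript.HANGUL) {
                hasHangul = true;
            }
            i += Character.charCount(cp);
        }

        if (hasKana) return "ja";   // kana only occurs in Japanese
        if (hasHangul) return "ko";
        if (hasHan) return "zh";    // Han without kana: assume Chinese
        return null;                // fall back to the n-gram detector
    }

    public static void main(String[] args) {
        System.out.println(detectCjk("ほうじ茶ホットチョコレート with some English")); // ja
        System.out.println(detectCjk("欢迎来到北京"));                                 // zh
        System.out.println(detectCjk("plain English text"));                           // null
    }
}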
