I am making an internationalized app in Java. I need a list of all the letters in a language, starting from the Locale. There are some questions like "Alphabet constant in Java?" or "Create Alphabet List from list : Java" which touch on the issue, but I'm wondering whether there is a Utils class or something where it's already defined, and where I can get a list of chars or a String containing all the letters in the alphabet of a language by its Locale.
You can refer to the library class com.ibm.icu.util.LocaleData and its methods. Pass Locale.ENGLISH (wrapped as a ULocale) as the argument to get the alphabet of English.
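For reference, a minimal sketch of what that call looks like, assuming ICU4J is on the classpath (the UnicodeSet.CASE option here is one choice among several; check the LocaleData Javadoc for your ICU version):

```java
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

public class ExemplarDemo {
    public static void main(String[] args) {
        // The standard exemplar set: the letters ordinarily used to write the language.
        UnicodeSet letters = LocaleData.getExemplarSet(ULocale.ENGLISH, UnicodeSet.CASE);
        StringBuilder sb = new StringBuilder();
        for (String s : letters) {  // UnicodeSet is Iterable<String>
            sb.append(s);
        }
        System.out.println(sb);     // roughly a-z for English
    }
}
```

Note that an exemplar set can contain multi-character strings (e.g. digraphs), which is why the loop iterates over Strings rather than chars.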
There are several issues here.
First, I have to point out that there are many languages that aren't alphabetic. Chinese and Japanese, for example, are ideographic. Unfortunately, it would be very hard, next to impossible, to create a list of all the characters in these languages.
Second, although the Common Locale Data Repository (and, as a consequence, ICU) has predefined sets of index exemplars and example characters, this information is far from complete.
Third, there are languages that use more than one script (aka writing system). Depending on the source of your locale, you may or may not know which characters need to be displayed.
Finally, it is hard to give you the right answer when you haven't provided your use case. The design of your application may impose serious limitations on usability or localizability...
Related
I'm currently in the interface design process of developing another Android app, and once again I seem to be trying to use reserved words for the resources (be it drawables or layouts). To my knowledge there is a set of rules you need to know:
No uppercase is allowed.
No symbols apart from underscore.
No numbers
Apart from those (please correct me if I'm wrong) I think you can't use any of the reserved words from Java, which after a little googling appear to be the following:
So my question would be: is there somewhere in the docs, which I've failed to locate, that explains in detail what we can and cannot use for resource names? This is right after reading the page about resources, so it's possible that I'm simply bad at reading.
Source for the reserved words
To my knowledge there is a set of rules you need to know:
No uppercase is allowed.
AFAIK, that is not a rule. A convention is to use all lowercase, but mixed case has worked.
NOTE: In layouts you can ONLY use lowercase letters (a-z), numbers (0-9) and underscore (_).
No symbols apart from underscore
Correct. More accurately, the name has to be a valid Java data member name, which limits you to letters, numbers, and underscores, and it cannot start with a number.
No numbers
That is not a rule, though your name cannot start with a number, as noted above.
Apart from those (please correct me if I'm wrong) I think you can't use any of the reserved words from Java, which after a little googling appear to be the following:
That is because reserved words are not valid Java data member names.
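Taken together, the rules above can be checked with a small helper. This is just a sketch: the method name is mine, the reserved-word set is abbreviated, and the pattern deliberately enforces the conservative lowercase-only convention rather than the full legal namespace:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ResourceNames {
    // A few of the Java reserved words mentioned above; the full list is longer.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "abstract", "class", "new", "public", "switch", "void"));

    // Conservative check: lowercase letters, digits and underscores only,
    // not starting with a digit, and not a Java reserved word.
    public static boolean isSafeResourceName(String name) {
        return name.matches("[a-z_][a-z0-9_]*") && !RESERVED.contains(name);
    }
}
```

So isSafeResourceName("icon_home") passes, while "2cool", "myIcon" and "switch" are all rejected.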
So my question would be if there is somewhere in the docs that I've failed to locate, that explains in detail what we can and can not use for resource names
Apparently not.
Well, my answer would be a mix of several pages where you can find what you need.
1.- First, I would recommend you read the conventions that Oracle recommends for Java.
NOTE: especially the section on "Naming Conventions" (this is something most of the other answers cover). After that, I would suggest you read the "Java Language Keywords", because you cannot use any of those words. BUT remember that Java is CASE-SENSITIVE, so if you write "Abstract" instead of "abstract" it is OK, but of course that may confuse someone later on (maybe yourself).
2.- Last but not least, you can read the "Code Style Guidelines"; these are the conventions that contributors to the Android source code need to apply to their code for it to be accepted.
If you follow these rules, your code will not only be valid (of course this is important), it will also be more readable for you and others, and if another person needs to make a modification later on, it will be an easier task than if you had just typed random names like "x1, x2, X1, _x1, etc."
OTHER USEFUL ARTICLE:
If you are starting your app, then this article is going to be very useful for you: it explains why using setters and getters in an exaggerated way is a very bad practice; they should exist only when needed, not just for setting and getting every variable in your object.
If you use identifiers that are valid Java variable names (that is, consisting only of a-z, A-Z, 0-9 and the underscore character), you will not have any problems. The actual namespace is probably larger, but this works for me.
documentation
I'll just chip in and say this:
You can't use keywords, but managing Android resources isn't quite easy either... For instance, you cannot have different folders for drawables; they need to go into a drawable-xxxx folder...
So, try to come up with sensible prefixes for your drawables and selectors.
Android accepts all valid Java variable names so I don't really see where this question comes from.
The following code is very well known to convert accented chars into plain Text:
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
I replaced my "hand-made" method with this one, but I need to understand the "regex" part of the replaceAll.
1) What is "InCombiningDiacriticalMarks" ?
2) Where is the documentation of it? (and similars?)
Thanks.
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
What it means is that the code point falls within a particular range (a block) that has been allocated for the things going by that name. This is a bad approach, because there is no guarantee that a code point in that range is or is not any particular thing, nor that code points outside that block are not essentially the same kind of character.
For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.
Blocks are nearly never what you want.
In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacriticals block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
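To illustrate the difference, here is the same stripping operation using \p{Mn} instead of the block property; this needs only the plain JDK (the class and method names are mine):

```java
import java.text.Normalizer;

public class Deaccent {
    // Decompose to NFD, then drop all nonspacing marks (general category Mn).
    public static String stripMarks(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                         .replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("café"));  // cafe
        System.out.println(stripMarks("Đ"));     // still Đ: no decomposition, no mark to strip
    }
}
```

The second example previews the caveat below: Đ carries no combining mark after decomposition, so no amount of mark-stripping will turn it into D.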
That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short-cut \p{Latin}, to get at any character from the Latin script. This leads to the very commonly needed [\p{Latin}\p{Common}\p{Inherited}].
Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
Another place where the \p{Mn} thing fails is of course enclosing marks like \p{Me}, obviously, but also there are \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.
Oh wait, I see you are Portuguese. You should have no problems at all then if you only are dealing with Portuguese text.
However, you don’t really want to remove accents, I bet, but rather you want to be able to match things “accent-insensitively”, right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
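For the accent-insensitive matching suggested here, the JDK's built-in java.text.Collator works along the same lines as the ICU4J collator class (a sketch; ICU4J's version gives better and more up-to-date behavior, but the API shape is similar):

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitive {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(Collator.PRIMARY);  // compare base letters only: ignore accents and case

        System.out.println(collator.compare("café", "cafe"));  // 0: considered equal
        System.out.println(collator.compare("café", "cafés")); // nonzero: different base letters
    }
}
```

At PRIMARY strength only base letters count, so "café" and "cafe" (and "Café") all compare equal, which is usually what "accent-insensitive" search wants.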
Took me a while, but I fished them all out:
Here's a regex that should include all the zalgo chars, including ones bypassed in the 'normal' range.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62])
Hope this saves you some time.
I am a Computer Science student working on a project based on the Nutch search engine. I want to develop Java algorithms to better index and search Arabic websites. How can I optimize for this purpose? Any ideas?
The Arabic language has 29 letters, and some of these letters have variant forms, like the Alif (أ), which can be written in several different ways.
If you manage to be tolerant of these variant forms, i.e. to allow spelling mistakes on these characters,
e.g. أحمد and احمد and إحمد and آحمد, then although they have different UTF-8 values, you can treat them as close results.
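A minimal sketch of that kind of tolerance is to fold the hamza-carrying Alif forms down to the bare Alif, both at indexing time and at query time (the class and method names are mine):

```java
public class ArabicFold {
    // Map أ (U+0623), إ (U+0625) and آ (U+0622) to the bare ا (U+0627).
    public static String foldAlif(String text) {
        return text.replaceAll("[\u0623\u0625\u0622]", "\u0627");
    }

    public static void main(String[] args) {
        // أحمد and احمد become identical after folding.
        System.out.println(foldAlif("أحمد").equals(foldAlif("احمد")));  // true
    }
}
```

Real Arabic analyzers (e.g. the ones shipped with Lucene, which Nutch builds on) do this plus more normalization, but the principle is the same.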
Moreover, if you can derive roots from words, you can allow searching for singulars, plurals, verbs, nouns, etc.
So if someone typed قال (said), you could include in the searched terms the words قول (saying), يقول (to say), مقال (a saying), etc.
It will require a complicated engine to do such a thing.
Finally, if you consider tashkeel (vowel diacritics), which are optional in typing, you could treat their presence as a more specific search while still allowing them to be ignored.
E.g. رجل could match رَجُلٌ (meaning a man), رَجَلَ (meaning walked on feet) or رِجْل (leg).
I hope this helps.
Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)?
There are various techniques, and a robust method would combine various ones:
look at the frequencies of groups of n letters (say, groups of 3 letters or trigrams) in your text and see if they are similar to the frequencies found for the language you are testing against
look at whether the frequencies of common words in the given language match the frequencies found in your text (this tends to work better for longer texts)
does the text contain characters which strongly narrow it down to a particular language? (e.g. if the text contains an upside down question mark there's a good chance it's Spanish)
can you "loosely parse" certain features in the text that would indicate a particular language, e.g. if it contains a match to the following regular expression, you could take this as a strong clue that the language is French:
\bvous\s+\p{L}+ez\b
To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code-- I'll leave it as an exercise to parse them):
Locale.ENGLISH,
"he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219",
"the=11209;and=6631;to=5763;of=5561;a=5487;in=3421;was=3214;his=2313;that=2311;he=2115",
Locale.FRENCH,
"es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260",
"de=10726;la=5581;le=3954;" + ((char)224) + "=3930;et=3563;des=3295;les=3277;du=2667;en=2505;un=1588",
Locale.ITALIAN,
"re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460",
"di=7014;e=4045;il=3313;che=3006;la=2943;a=2541;in=2434;per=2165;del=2013;un=1945",
(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
As I recall, the figures are cited in the Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.
If you want a really quick-and-dirty way of applying the above, try:
consider each sequence of three characters in your text (replacing word boundaries with '_')
for each trigram that matches one of the frequent ones for the given language, increment that language's "score" by 1 (more sophisticatedly, you could weight according to the position in the list)
at the end, assume the language is that with the highest score
optionally, do the same for the common words (combine scores)
Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".
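The quick-and-dirty scoring loop just described can be sketched like this, reusing the trigram strings above (the class and method names are mine, and I've left out the word-count and position-weighting refinements):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TrigramGuesser {
    // Frequent trigrams per language, copied from the counts above ('_' marks a word boundary).
    private static final Map<String, Set<String>> TRIGRAMS = new LinkedHashMap<>();
    static {
        TRIGRAMS.put("en", parse("he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219"));
        TRIGRAMS.put("fr", parse("es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260"));
        TRIGRAMS.put("it", parse("re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460"));
    }

    private static Set<String> parse(String data) {
        Set<String> trigrams = new HashSet<>();
        for (String entry : data.split(";")) {
            trigrams.add(entry.split("=")[0]);  // keep the trigram, drop the count
        }
        return trigrams;
    }

    public static String guess(String text) {
        // Normalize word boundaries to '_', matching the frequency tables.
        String s = "_" + text.toLowerCase().replaceAll("\\W+", "_") + "_";
        String best = "unknown";
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> language : TRIGRAMS.entrySet()) {
            int score = 0;
            for (int i = 0; i + 3 <= s.length(); i++) {
                if (language.getValue().contains(s.substring(i, i + 3))) {
                    score++;  // one point per frequent-trigram hit
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = language.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("the cat and the dog went to the market"));  // en
    }
}
```

Even with only ten trigrams per language, a sentence or two is usually enough to separate English from French or Italian.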
Have you tried Apache Tika? It has a good API to detect language, and it can also support different languages by loading the respective profiles.
You could try comparing each word to an English, French, or Italian dictionary. Keep in mind though some words may appear in multiple dictionaries.
Here's an interesting blog post that discusses this concept. The examples are in Scala, but you should be able to apply the same general concepts to Java.
If you are looking at individual characters or words, this is a tough problem. Since you're working with a whole document, however, there might be some hope. Unfortunately, I don't know of an existing library to do this.
In general, one would need a fairly comprehensive word list for each language. Then examine each word in the document. If it appears in the dictionary for a language, give that language a "vote". Some words will appear in more than one language, and sometimes a document in one language will use loanwords from another language, but a document wouldn't have to be very long before you saw a very clear trend toward one language.
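A sketch of this voting scheme with toy word lists (real lists would have tens of thousands of entries; the class and method names are mine):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DictionaryVote {
    public static String guess(String text, Map<String, Set<String>> dictionaries) {
        Map<String, Integer> votes = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            for (Map.Entry<String, Set<String>> dict : dictionaries.entrySet()) {
                if (dict.getValue().contains(word)) {
                    votes.merge(dict.getKey(), 1, Integer::sum);  // one vote per dictionary hit
                }
            }
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dicts = new LinkedHashMap<>();
        dicts.put("en", new HashSet<>(Arrays.asList("the", "cat", "is", "here")));
        dicts.put("fr", new HashSet<>(Arrays.asList("le", "chat", "est", "ici")));
        System.out.println(guess("the cat is here", dicts));  // en
    }
}
```

Words that appear in several dictionaries simply vote for each of them, which is fine: as noted above, the overall trend dominates once the document is more than a few words long.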
Some of the best word lists for English are those used by Scrabble players. These lists probably exist for other languages too. The raw lists can be hard to find via Google, but they are out there.
There's no "good" way of doing this, imo. All answers on this topic can get very complicated. The obvious part is to check for characters that appear in French or Italian but not in English, and then return false.
However, what if the word is French but has no special characters? Suppose you have a whole sentence: you could match each word against dictionaries, and if the sentence scores more French points than English points, it's not English. This also guards against the common words that French, Italian and English share.
Good Luck.
I would like to determine what the alphabet for a given locale is, preferably based on the browser Accept-Language header values. Anyone know how to do this, using a library if necessary?
take a look at [LocaleData.getExemplarSet][1]
For example, for English this returns abcdefghijklmnopqrstuvwxyz.
[1]: http://icu-project.org/apiref/icu4j/com/ibm/icu/util/LocaleData.html#getExemplarSet(com.ibm.icu.util.ULocale, int)
If you just want to know the name of an appropriate character set for a user's locale, then you might try the java.nio.charset.Charset class.
If you really want to use the Accept-Language header, then there's an old O'Reilly article on this matter which introduces a pretty handy class called LanguageNegotiator.
I think one of those will give you a decent enough start.
It depends on how specific you want to get. One place to look would be at the "Suppress-Script" properties in the IANA language registry.
Some languages have multiple "alphabets" that can be used for writing. For example, Azerbaijani can be written in Latin or Arabic script. Most languages, like English, are written almost exclusively in a single script, so the correct script goes without saying, and should be "suppressed" in language codes.
So, looking at the entry for Russian, you can tell that the preferred script is Cyrillic, while for Amharic it is Ethiopic. But German, Norwegian, and English aren't more specific than "Latin". So, with this method, you'd have a hard time hiding umlauts and thorns from Americans, or offering any script to a Kashmiri writer.
This is an English answer written in Århus. Yesterday, I heard some Germans say 'Blödheit, à propos, ist dumm'. However, one of them wore a shirt that said 'I know the difference between 文字 and الْعَرَبيّة'.
What's the answer to your question for this text? Is it allowed? Isn't this an English text?
The International Components for Unicode might help here. Specifically the UScript class looks promising.
Out of curiosity: What do you need it for?