Finding UnicodeBlock set for a given Locale - java

I'm currently trying to figure out how to get a Character.UnicodeBlock set for a given Locale.
Different languages need different sets of characters.
What I'm trying to achieve, exactly, is a String containing every character needed to write in a specific language. I can then use this String to precompute a set of OpenGL textures from a TrueType font file, so I can easily render text in any language.
Precaching every single character and having around 1000000 textures is of course not an option.
Does anyone have an idea? Or does anyone see a flaw in this approach?

It's not as simple as that. Text in most European languages can often be written with a simple set of precomposed Unicode characters, but for many more complex scripts you need to handle composing characters. This starts fairly easily with combining accents for Western alphabets, progresses through Arabic letters that are context-sensitive (they have different shapes depending on whether they are first, last, or in the middle of a word), and ends with the utter madness that is found in many Indic scripts.
The Unicode Standard has chapters about the intricacies involved in rendering the various scripts it can encode. Just sample, for example, the description of Tibetan early in chapter 10, and if that doesn't scare you away, flip back to Devanagari in chapter 9. You will quickly drop your ambition of being able to "write text in any language". Doing so correctly requires specialized rendering software, written by experts deeply familiar with the scripts in question.
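That said, if the goal is scaled back to merely enumerating the basic letters a locale's language normally uses (which will not make complex scripts render correctly, for the reasons above), ICU4J's exemplar character sets are one possible starting point. A rough sketch, assuming the com.ibm.icu:icu4j dependency is available:

import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.UnicodeSetIterator;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

public class ExemplarChars {
    public static void main(String[] args) {
        // Standard exemplar set: the letters commonly used to write the locale's language.
        UnicodeSet set = LocaleData.getExemplarSet(new ULocale("el"), 0);
        StringBuilder sb = new StringBuilder();
        UnicodeSetIterator it = new UnicodeSetIterator(set);
        while (it.next()) {
            if (it.codepoint != UnicodeSetIterator.IS_STRING) {
                sb.appendCodePoint(it.codepoint);
            }
        }
        System.out.println(sb); // for "el" this prints the basic Greek alphabet
    }
}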

Related

Localization of text in Android

What is the solution or best practice to display localized text strings in Android?
For example:
The English version text: "You have 1 message" and "You have 3 messages".
Note that the word "message" or "messages" is determined by the integer number.
If this were to be localized into another language, the integer could be inserted at the beginning or the end of the sentence, not necessarily in the middle of the sentence.
Further, for languages like Japanese it could be better to use the full-width "３" to display the number as part of the sentence.
That means, even if I manage all localization text in a strings file, I would still need some kind of logic to calculate the final displayed text.
What is the best practice?
Any library I could use?
I would recommend looking into an i18n lib that has a mature ecosystem, e.g. i18next.
There is an Android lib too: i18next-android.
It has good support for multiple plural forms: i18next-android#multiple-plural-forms
Further, you should not only consider that you have to instrument your code (i18n) to get your app/website translated; you should think about the process too - how you will handle continuous localization, how you keep track of progress, etc.
For a translation management system you might e.g. have a look at locize. It plays well with all JSON-based i18n frameworks, has a very simple API at its core, and provides a lot more than traditional systems.
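For the plural example specifically, it is also worth knowing that Android's built-in quantity strings already handle per-language plural rules and word order. A minimal sketch (the resource name numberOfMessages and the view id are made up for illustration):

// res/values/strings.xml (with a per-language copy in e.g. res/values-ja/strings.xml):
//   <plurals name="numberOfMessages">
//       <item quantity="one">You have %d message</item>
//       <item quantity="other">You have %d messages</item>
//   </plurals>
//
// Inside an Activity or Fragment:
void showMessageCount(int count) {
    // getQuantityString picks the right plural form for the current locale and
    // lets each translation place the %d wherever its word order requires.
    String text = getResources().getQuantityString(R.plurals.numberOfMessages, count, count);
    ((android.widget.TextView) findViewById(R.id.messageLabel)).setText(text);
}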

Tool for creating own rules for word lemmatization and similar tasks

I'm doing a lot of natural language processing with somewhat unusual requirements. Often I get tasks similar to lemmatization - given a word (or just a piece of text) I need to find some patterns and transform the word somehow. For example, I may need to correct misspellings, e.g. given the word "eatin" I need to transform it to "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha" and so on.
So I'm looking for a generic tool that lets me define transformation rules for such cases. Rules may look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is I need to be able to use captured patterns from the left side on the right side.
I work with linguists who don't know programming at all, so ideally this tool should use external files and simple language for rules.
I'm doing this project in Clojure, so ideally this tool should be a library for one of JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, and I'm not expert in all of them, so I could miss the tool I need there. If so, please let me know.
Note that I'm working with several languages and performing very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for specific languages do not fit my needs - I really need a more generic tool.
UPD. It seems like I need to give some more details/examples of what I need.
Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()) but with the possibility to use captured text in the replacement string. For example, in real-world text people often repeat characters to emphasize a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace these repetitive "oooo" with only a single character. So there may be a rule like this (in syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
that is, replace a word starting with some chars (chars1), followed by at least 3 repetitions of the same character (<char>), and possibly ending with some other chars (chars2), with a similar string that keeps only a single <char>. The key point here is that we capture text on the left side of a rule and use it on the right side.
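For that specific case, plain Java regex with a backreference already works; a minimal sketch (it only covers this repeated-character case, not the general rule-file requirement):

public class CollapseRepeats {
    public static void main(String[] args) {
        String input = "This film is soooo boooring...";
        // "(\\p{L})\\1{2,}" matches a letter followed by two or more copies of itself;
        // "$1" keeps a single copy of the captured letter.
        String output = input.replaceAll("(\\p{L})\\1{2,}", "$1");
        System.out.println(output); // This film is so boring...
    }
}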
I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for expressing stemming algorithms. Its stemmer is used in the Lucene search engine.
I've found http://userguide.icu-project.org/transforms/general to be useful as well for general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.
You can just load up rules from a file into a String and register them, etc.
http://userguide.icu-project.org/transforms/general/rules
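A rough sketch of wiring that up with ICU4J (assuming the com.ibm.icu:icu4j dependency; the two rules are deliberately simple illustrations - the full rule syntax described at the link above also supports (...) segments that can be referenced as $1, $2 on the right-hand side):

import com.ibm.icu.text.Transliterator;

public class RuleBasedRewriter {
    public static void main(String[] args) {
        // These rules could just as well be read from a plain-text file edited by linguists.
        String rules =
              "'eatin' > 'eating' ;\n"   // literal replacement
            + "'o' 'o'+ > 'o' ;\n";      // collapse a run of o's into a single o
        Transliterator t = Transliterator.createFromRules(
                "custom-rules", rules, Transliterator.FORWARD);
        System.out.println(t.transliterate("This film is soooo boooring")); // -> so boring
    }
}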

Handwritten character (English letters, kanji,etc.) analysis and correction

I would like to know how practical it would be to create a program which takes handwritten characters in some form, analyzes them, and offers corrections to the user. The inspiration for this idea is to help elementary school students in other countries, or university students in America, learn how to write in languages such as Japanese or Chinese, where there are a lot of characters and even the slightest mistake can make a big difference.
I am unsure how the program will analyze the character. My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work. Endpoints will also be useful to know. I would also like to tell the user if their character could be interpreted as another character similar to the one they wanted to write.
I imagine I will need a library of some sort to complete this project in any sort of timely manner, but I have been unable to locate one which meets the standards I will need for the program. I looked into OpenCV, but it appears to be meant more for computer vision than for this kind of image processing. I would also appreciate the library/module being in Python or Java, but I can learn a new language if absolutely necessary.
Thank you for any help in this project.
Character recognition is usually implemented using Artificial Neural Networks (ANNs). It is not a straightforward task to implement, given that different people usually write the same character in lots of different ways.
The good thing about neural networks is that they can be trained. So, to change from one language to another all you need to change are the weights between the neurons, and leave your network intact. Neural networks are also able to generalize to a certain extent, so they are usually able to cope with minor variances of the same letter.
Tesseract is an open source OCR engine which was developed in the mid '90s. You might want to read about it to gain some pointers.
You can follow company links from this Wikipedia article:
http://en.wikipedia.org/wiki/Intelligent_character_recognition
I would not recommend that you attempt to implement a solution yourself, especially if you want to complete the task in less than a year or two of full-time work. It would be unfortunate if an incomplete solution provided poor guidance for students.
A word of caution: some companies that offer commercial ICR libraries may not wish to support you and/or may not provide a quote. That's their right. However, if you do not feel comfortable working with a particular vendor, either ask for a different sales contact and/or try a different vendor first.
My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work.
The initial step of getting a stroke representation only a single pixel wide is much more difficult than you might guess. Although there are simple algorithms (e.g. Stentiford and Zhang-Suen) to perform thinning, stroke crossings and rough edges present serious problems. This is a classic (and unsolved) problem. Thinning works much of the time, but when it fails, it can fail miserably.
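For concreteness, here is a sketch of the Zhang-Suen pass structure mentioned above (binary image as a boolean array, foreground = true; it assumes a one-pixel background border and will still exhibit the failure modes just described):

public class ZhangSuenThinning {

    /** Thins a binary image in place. */
    public static void thin(boolean[][] img) {
        boolean changed = true;
        while (changed) {
            changed = pass(img, true);   // step 1
            changed |= pass(img, false); // step 2
        }
    }

    private static boolean pass(boolean[][] img, boolean firstStep) {
        java.util.List<int[]> toClear = new java.util.ArrayList<>();
        for (int r = 1; r < img.length - 1; r++) {
            for (int c = 1; c < img[0].length - 1; c++) {
                if (!img[r][c]) continue;
                // Neighbours p2..p9, clockwise, starting from the pixel above.
                boolean[] n = {
                    img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                    img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]
                };
                int neighbours = 0;
                for (boolean v : n) if (v) neighbours++;
                int transitions = 0; // 0 -> 1 transitions around the ring
                for (int i = 0; i < 8; i++) {
                    if (!n[i] && n[(i + 1) % 8]) transitions++;
                }
                boolean c1 = firstStep ? !(n[0] && n[2] && n[4]) : !(n[0] && n[2] && n[6]);
                boolean c2 = firstStep ? !(n[2] && n[4] && n[6]) : !(n[0] && n[4] && n[6]);
                if (neighbours >= 2 && neighbours <= 6 && transitions == 1 && c1 && c2) {
                    toClear.add(new int[] {r, c});
                }
            }
        }
        for (int[] p : toClear) img[p[0]][p[1]] = false;
        return !toClear.isEmpty();
    }
}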
You could work with an open source library, and although that will help you learn algorithms and their uses, to develop a good solution you will almost certainly need to dig into the algorithms themselves and understand how they work. That requires quite a bit of study.
Here are some books that are useful as introductory textbooks:
Digital Image Processing by Gonzalez and Woods
Character Recognition Systems by Cheriet, Kharma, Siu, and Suen
Reading in the Brain by Stanislas Dehaene
Gonzalez and Woods is a standard textbook in image processing. Without some background knowledge of image processing it will be difficult for you to make progress.
The book by Cheriet, et al., touches on the state of the art in optical character recognition (OCR) and also covers handwriting recognition. The sooner you read this book, the sooner you can learn about techniques that have already been attempted.
The Dehaene book is a readable presentation of the mental processes involved in human reading, and could inspire development of interesting new algorithms.
Have you seen http://www.skritter.com? They do this in combination with spaced repetition scheduling.
I guess you want to classify features such as curves in your strokes (http://en.wikipedia.org/wiki/CJK_strokes), then as a next layer identify components, then estimate the most likely character, all the while statistically weighting the candidates. Where there are two likely matches you will want to show them as likely to be confused. You will also need to create a database of probably 3000 to 5000 characters, or up to 10000 for the ambitious.
See also http://www.tegaki.org/ for an open source program to do this.

Greek/latin scientific JLabel in Java Swing application

For a scientific application I want to design an input form which lets the user enter certain parameters. Some of them are designated using greek letters, some of them have latin letters. The parameter names should be displayed using ordinary JLabel controls.
On Windows, the Tahoma font (which is used for Labels by default) contains both latin and greek letters, so I simply set the Text property of the label to a greek (unicode) string and everything works fine.
I'm wondering whether this also works without modifications on Linux and OS X systems, and for which Java/OS versions it would work.
Also I'm curious if there's an easy way to show subscripts in labels ("\eta_0" in TeX), but this is not that important for my application ...
I have no doubt that the vast majority of Unicode fonts include the Greek block.
On all platforms, and for all locales.
When Unicode blocks are missing, it's for space-saving concerns. The 50 or so characters in the Greek block are nothing compared with the thousands of east Asian characters (which my last Linux desktop actually included by default, btw).
Speaking of fancy Unicode: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
Of course, despite any confidence that you or I may have, you should test your application on as many configurations as you can before deploying. Java tries its best, but in practice I've always found a few things that needed tweaking.
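One cheap check worth adding to such testing is asking the label's font directly whether it can display your string; a small sketch (the sample text is a Greek eta followed by a subscript zero, purely for illustration):

import java.awt.Font;
import javax.swing.JLabel;

public class GlyphCheck {
    public static void main(String[] args) {
        JLabel label = new JLabel("\u03B7\u2080"); // eta, subscript zero
        Font font = label.getFont();
        // canDisplayUpTo returns -1 when the font has a glyph for every character,
        // otherwise the index of the first character it cannot display.
        int firstMissing = font.canDisplayUpTo(label.getText());
        if (firstMissing != -1) {
            System.out.println("Missing glyph at index " + firstMissing + ": U+"
                    + Integer.toHexString(label.getText().charAt(firstMissing)));
        }
    }
}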
@Gunslinger47's answer is dispositive, but you might also look at this game on various target platforms. It displays glyphs from several Unicode character code charts, including Greek.
enum GlyphSet {
    ASCII(0x0021, 0x007E), Greek(0x0370, 0x03FF), Letters(0x2100, 0x214F),
    Operators(0x2200, 0x22FF), Miscellany(0x2300, 0x23FF), Borders(0x2500, 0x257F),
    Symbols(0x2600, 0x26FF), Dingbats(0x2700, 0x27BF), Arrows(0x2900, 0x297F);
    ...
}

What is a fast and unsupervised way of checking quality of pdf-extracted text?

I am working on a somewhat large corpus, with articles numbering in the tens of thousands. I am currently using PDFBox to extract text with varying success, and I am looking for a way to programmatically check each file to see whether the extraction was moderately successful or not. I'm currently thinking of running a spellchecker on each of them, but the language can differ; I am not yet sure which languages I'm dealing with. Natural language detection with scores may also be an idea.
Oh, and any method also has to play nice with Java, be fast and relatively quick to integrate.
Try an automatically learning spell checker. That's not as scary as it sounds: Start with a big dictionary containing all the words you're likely to encounter. This can be from several languages.
When scanning a PDF, allow for a certain number of unknown words (say 5%). If any of these words are repeated often enough (say 5 times), add them to the dictionary. If the PDF contains more than 5% unknown words, it's very likely something that couldn't be processed.
The scanner will learn over time, allowing you to reduce the number of unknown words if that should be necessary. If that is too much hassle, a very big dictionary should work well, too.
If you don't have a dictionary, manually process a couple of documents and have the scanner learn. After a dozen files or so, your new dictionary should be large enough to give a reasonable baseline.
Of course no method will be perfect.
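A rough sketch of that thresholding idea in Java (the 5% and five-occurrence thresholds and the dictionary file are just the assumptions from above, not fixed values):

import java.nio.file.*;
import java.util.*;

public class ExtractionQualityCheck {

    private final Set<String> dictionary = new HashSet<>();

    public ExtractionQualityCheck(Path dictionaryFile) throws Exception {
        for (String line : Files.readAllLines(dictionaryFile)) {
            dictionary.add(line.trim().toLowerCase());
        }
    }

    /** Returns true if the extracted text looks usable. */
    public boolean looksOk(String extractedText) {
        Map<String, Integer> unknownCounts = new HashMap<>();
        int total = 0, unknown = 0;
        for (String token : extractedText.toLowerCase().split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue;
            total++;
            if (!dictionary.contains(token)) {
                unknown++;
                unknownCounts.merge(token, 1, Integer::sum);
            }
        }
        // Learn unknown words that repeat often enough; they are probably
        // domain terms rather than extraction garbage.
        unknownCounts.forEach((word, count) -> {
            if (count >= 5) dictionary.add(word);
        });
        return total > 0 && unknown < 0.05 * total;
    }
}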
There are usually two classes of text extraction problems:
1 - nothing gets extracted.
This can be because you've got a scanned document or something is invalid in the PDF.
Usually easy to detect; you should not need complicated code to check for those.
2 - You get garbage.
Most of the time this is because the PDF file is weirdly encoded.
This can be because of a homemade encoding not properly declared, or maybe the PDF author needed characters not recognized by PDF (for example, the Turkish S with cedilla was missing for some time from the Adobe glyph list: you could not create a correctly encoded file with it inside, so you had to cheat to get it visually onto the page).
I use an n-gram-based method to detect the language of PDF files based on the extracted text (with different technologies, but the idea is the same). Files where the language was not recognized are usually good suspects for a problem...
About spellchecking, I suspect it will give you tons of false positives, especially if you have multiple languages!
You could just run the corpus against a list of stop words (the most frequent words that search engines ignore, like "and" and "the"), but then you obviously need stop word lists for all possible/probable languages first.
