For a scientific application I want to design an input form which lets the user enter certain parameters. Some of them are designated using greek letters, some of them have latin letters. The parameter names should be displayed using ordinary JLabel controls.
On Windows, the Tahoma font (which is used for Labels by default) contains both latin and greek letters, so I simply set the Text property of the label to a greek (unicode) string and everything works fine.
I'm wondering whether this also works without modification on Linux and OS X systems, and for which Java/OS versions it would work.
Also, I'm curious whether there's an easy way to show subscripts in labels ("\eta_0" in TeX), but this is not that important for my application ...
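For reference, here is a stripped-down version of what I currently do (the parameter name and value are placeholders):
import javax.swing.JFrame;
import javax.swing.JLabel;
public class GreekLabelDemo {
    public static void main(String[] args) {
        JFrame frame = new JFrame("Parameters");
        // GREEK SMALL LETTER ETA, given as a plain Unicode escape
        frame.add(new JLabel("\u03B7 = 0.5"));
        frame.pack();
        frame.setVisible(true);
    }
}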
I have no doubt that the vast majority of Unicode fonts includes the Greek block.
On all platforms, and for all locales.
When there are missing Unicode blocks, it's for space-saving reasons. The 50 or so characters in the Greek block are nothing compared with the thousands of East Asian characters (which my last Linux desktop actually included by default, btw).
Speaking of fancy Unicode: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
Of course, despite any confidence that you or I may have, you should test your application on as many configurations as you can before deploying. Java tries its best, but in practice I've always found a few things that needed tweaking.
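To illustrate the subscript question, here is a quick sketch (just an illustration, not the only way) showing a plain eta, the precomposed SUBSCRIPT ZERO character from the page above, and Swing's built-in HTML rendering as a third option:
import java.awt.GridLayout;
import javax.swing.JFrame;
import javax.swing.JLabel;
public class SubscriptDemo {
    public static void main(String[] args) {
        JFrame frame = new JFrame("Subscripts");
        frame.setLayout(new GridLayout(3, 1));
        frame.add(new JLabel("\u03B7"));                          // plain eta
        frame.add(new JLabel("\u03B7\u2080"));                    // eta + U+2080 SUBSCRIPT ZERO
        frame.add(new JLabel("<html>\u03B7<sub>0</sub></html>")); // HTML <sub> rendering
        frame.pack();
        frame.setVisible(true);
    }
}
Note that not every font has a glyph for U+2080, so the HTML route can be the more portable one.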
@Gunslinger47's answer is dispositive, but you might also look at this game on various target platforms. It displays glyphs from several Unicode character code charts, including Greek.
enum GlyphSet {
    ASCII(0x0021, 0x007E), Greek(0x0370, 0x03FF), Letters(0x2100, 0x214F),
    Operators(0x2200, 0x22FF), Miscellany(0x2300, 0x23FF), Borders(0x2500, 0x257F),
    Symbols(0x2600, 0x26FF), Dingbats(0x2700, 0x27BF), Arrows(0x2900, 0x297F);
    ...
}
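As a hedged sketch of what those ranges can be used for, this standalone snippet (the names are mine, not from the game) lists which code points in the Greek block a given font can actually display:
import java.awt.Font;
public class GreekGlyphCheck {
    public static void main(String[] args) {
        Font font = new Font(Font.SANS_SERIF, Font.PLAIN, 14);
        StringBuilder displayable = new StringBuilder();
        // Same range as the Greek constant above: U+0370 to U+03FF
        for (int codePoint = 0x0370; codePoint <= 0x03FF; codePoint++) {
            if (font.canDisplay(codePoint)) {
                displayable.appendCodePoint(codePoint);
            }
        }
        System.out.println("Displayable Greek glyphs: " + displayable);
    }
}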
Related
The default thousands separator for Java's Locale.FRANCE DecimalFormat is \u00A0 (non-breaking space). That's fine for formatting, since it keeps numbers from breaking across lines. But for parsing, it depends on whether French users actually enter that character, a normal space (\u0020), or either one depending on the situation (for example, when pasting from another source). Is there any place I can find data on what real users actually do?
Real French users wouldn't type \u00A0, because the keyboard doesn't have a key for it. You're confusing presentation with data. You have to validate the user's input anyway, so it's easy to disallow any spaces (or use a component that avoids free-form typing entirely).
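If you do want to accept whatever the user typed or pasted, one defensive sketch (the class name is made up) is to strip every space-like character before parsing, since DecimalFormat does not require grouping separators when parsing:
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
public class FrenchNumberInput {
    public static Number parse(String userInput) throws ParseException {
        // Remove ordinary spaces, no-break spaces (U+00A0) and narrow no-break spaces (U+202F)
        String withoutSpaces = userInput.replaceAll("[\\s\\u00A0\\u202F]", "");
        return NumberFormat.getNumberInstance(Locale.FRANCE).parse(withoutSpaces);
    }
}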
I'm currently trying to figure out how to get a Character.UnicodeBlock set for a given Locale.
Different languages need different sets of characters.
What I'm exactly trying to achieve is having a String containing every character needed to write in a specific language. I can then use this String to precompute a set of OpenGL textures from a TrueTypeFont file, so I can easily write any text in any language.
Precaching every single character and having around 1000000 textures is of course not an option.
Does anyone have an idea? Or does anyone see a flaw in this approach?
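For what it's worth, the naive version of this (a sketch only, and it assumes you already know which blocks a language needs) just walks the code points of a block:
import java.util.ArrayList;
import java.util.List;
public class BlockCharacters {
    // Collect every defined character belonging to the given Unicode block
    public static List<String> charactersIn(Character.UnicodeBlock block) {
        List<String> result = new ArrayList<String>();
        for (int codePoint = 0; codePoint <= Character.MAX_CODE_POINT; codePoint++) {
            if (Character.isDefined(codePoint)
                    && block.equals(Character.UnicodeBlock.of(codePoint))) {
                result.add(new String(Character.toChars(codePoint)));
            }
        }
        return result;
    }
    public static void main(String[] args) {
        System.out.println(charactersIn(Character.UnicodeBlock.GREEK).size());
    }
}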
It's not as simple as that. Text in most European languages can often be written with a simple set of precomposed Unicode characters, but for many more complex scripts you need to handle composing characters. This starts fairly easily with combining accents for Western alphabets, progresses through Arabic letters that are context-sensitive (they have different shapes depending on whether they are first, last, or in the middle of a word), and ends with the utter madness that is found in many Indic scripts.
The Unicode Standard has chapters about the intricacies involved in rendering the various scripts it can encode. Just sample, for example, the description of Tibetan early in chapter 10, and if that doesn't scare you away, flip back to Devanagari in chapter 9. You will quickly drop your ambition of being able to "write text in any language". Doing so correctly requires specialized rendering software, written by experts deeply familiar with the scripts in question.
How big a task is it to implement support for Arabic localization? Our Java 1.5 applet was designed to be fully localizable (for European languages), but now we plan to add Arabic as a new language.
We are using custom GUI text I/O components inherited from the Component class, drawing text with e.g. drawString. How well is Arabic supported within the Component class?
Keyboard input is handled with KeyListener (getKeyChar, getKeyCode, etc.).
It depends on the quality of the original internationalization work. If everything is implemented correctly, then it will be similar to adding support for a new European language - most of the work will be translation and testing.
However, if you've only tested the software with European languages, you might find a lot of problems with your original internationalization work. In particular you might need to consider:
bi-directional text (see the sketch after this list)
ligatures (joining the characters)
rendering (characters change shape depending on their position in the word)
number and date formats
specialized input methods
cultural differences (for icons etc)
file encodings
testing
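For the bi-directional text point, here is a small sketch of what the standard toolkit gives you for free; custom components that call drawString directly still need their own bidi handling (see also java.text.Bidi):
import java.awt.ComponentOrientation;
import java.util.Locale;
import javax.swing.JFrame;
import javax.swing.JLabel;
public class RtlDemo {
    public static void main(String[] args) {
        Locale arabic = new Locale("ar");
        JFrame frame = new JFrame("RTL demo");
        frame.add(new JLabel("\u0645\u0631\u062D\u0628\u0627")); // Arabic "marhaba"
        // Flip the whole container hierarchy to right-to-left for Arabic locales
        frame.applyComponentOrientation(ComponentOrientation.getOrientation(arabic));
        frame.pack();
        frame.setVisible(true);
    }
}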
If you have custom code that implements software features in a way that isn't fully localizable then you need to budget for fixing this too.
If you have manuals, help text and other collateral that also needs to be translated, then the software cost might not be such a large proportion of the total budget.
Also, if you have plans to perform localization for any Far Eastern languages (Japanese, Chinese, Korean, ...) you might consider sharing the cost across those projects, since many of the issues will be similar.
One final point - maintaining the localization for future releases might cost substantially more than providing it in the first place.
I am working on a somewhat large corpus, with articles numbering in the tens of thousands. I am currently using PDFBox to extract text with varying success, and I am looking for a way to programmatically check each file to see whether the extraction was moderately successful or not. I'm currently thinking of running a spellchecker on each of them, but the language can differ, and I am not yet sure which languages I'm dealing with. Natural language detection with scores may also be an idea.
Oh, and any method also has to play nice with Java, be fast and relatively quick to integrate.
Try an automatically learning spell checker. That's not as scary as it sounds: Start with a big dictionary containing all the words you're likely to encounter. This can be from several languages.
When scanning a PDF, allow for a certain number of unknown words (say 5%). If any of these words are repeated often enough (say 5 times), add them to the dictionary. If the PDF contains more than 5% unknown words, it's very likely something that couldn't be processed.
The scanner will learn over time, allowing you to reduce the number of unknown words if that should be necessary. If that is too much hassle, a very big dictionary should work well, too.
If you don't have a dictionary, manually process a couple of documents and have the scanner learn. After a dozen files or so, your new dictionary should be large enough to give a reasonable baseline.
Of course no method will be perfect.
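A rough sketch of that idea (class and method names are made up; the 5% and 5-repetition thresholds are the ones suggested above):
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
class ExtractionChecker {
    // Seed this from one or more large word lists (not shown here)
    private final Set<String> dictionary = new HashSet<String>();
    boolean looksReasonable(String extractedText) {
        Map<String, Integer> unknownCounts = new HashMap<String, Integer>();
        int unknown = 0;
        int total = 0;
        for (String token : extractedText.toLowerCase().split("\\P{L}+")) {
            if (token.length() == 0) {
                continue;
            }
            total++;
            if (!dictionary.contains(token)) {
                unknown++;
                Integer seen = unknownCounts.get(token);
                unknownCounts.put(token, seen == null ? 1 : seen + 1);
            }
        }
        // Unknown words that repeat often enough are probably legitimate: learn them
        for (Map.Entry<String, Integer> entry : unknownCounts.entrySet()) {
            if (entry.getValue() >= 5) {
                dictionary.add(entry.getKey());
            }
        }
        return total > 0 && unknown <= total * 0.05;
    }
}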
There are usually two classes of text extraction problems:
1 - Nothing gets extracted.
This can be because you've got a scanned document or something is invalid in the PDF.
Usually easy to detect; you should not need complicated code to check those.
2 - You get garbage.
Most of the time this is because the PDF file is weirdly encoded.
This can be because of a homemade encoding not properly declared, or maybe the PDF author needed characters not recognized by PDF (for example, the Turkish S with cedilla was missing for some time from the Adobe Glyph List: you could not create a correctly encoded file with it inside, so you had to cheat to get it visually onto the page).
I use an n-gram-based method to detect the language of PDF files based on the extracted text (with different technologies, but the idea is the same). Files where the language was not recognized are usually good suspects for a problem...
As for spellchecking, I suspect it will give you tons of false positives, especially if you have multiple languages!
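A toy illustration of the n-gram idea (real detectors and their training data are far more elaborate, and the names here are made up):
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
class TrigramLanguageGuesser {
    private final Map<String, Set<String>> profiles = new HashMap<String, Set<String>>();
    void train(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText));
    }
    // Returns the best-matching language, or null when nothing overlaps;
    // no match at all is a useful hint that the extracted text may be garbage
    String guess(String text) {
        Set<String> target = trigrams(text);
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            int score = 0;
            for (String gram : target) {
                if (entry.getValue().contains(gram)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
    private static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<String>();
        String normalized = text.toLowerCase().replaceAll("\\s+", " ");
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + 3));
        }
        return grams;
    }
}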
You could just run the corpus against a list of stop words (the most frequent words that search engines ignore, like "and" and "the"), but then you obviously need stop word lists for all possible/probable languages first.
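A tiny sketch of that check (the word list here is a token English sample; real use needs per-language lists):
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
class StopWordCheck {
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("the", "and", "of", "to", "in", "a", "is", "that"));
    // Fraction of tokens that are common stop words; a very low ratio on an
    // English document suggests the extraction produced garbage
    static double stopWordRatio(String text) {
        int hits = 0;
        int total = 0;
        for (String token : text.toLowerCase().split("\\P{L}+")) {
            if (token.length() == 0) {
                continue;
            }
            total++;
            if (STOP_WORDS.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}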
I have a PC notebook running Windows Vista. When I first bought it, certain Chinese characters wouldn't show up; I could only see rectangles. I played with the control panel settings for a while and changed some properties, and now it displays Chinese correctly, but I don't remember what I did.
Now some of my programs display both English and Chinese, something like this: "Enter | 输入" (the Chinese here also means enter). But if a user doesn't have Chinese fonts installed properly on his machine, he will see something like this: "Enter | [][]". My question is: in Java, how can I detect whether those characters will show up correctly on a given machine? If not, just display "Enter"; if they do, show "Enter | 输入".
Frank
java.awt.GraphicsEnvironment.getAvailableFontFamilyNames() can give you a list of the available fonts installed on the current system. You could also use java.awt.GraphicsEnvironment.getAllFonts() to get java.awt.Font objects.
Then, you can use java.awt.Font.canDisplay(int) to check whether a Unicode character can be displayed in that font (where the int is the character's Unicode code point).
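Putting it together, a small sketch of the fallback described in the question (using canDisplayUpTo, which returns -1 when the whole string is displayable):
import java.awt.Font;
import javax.swing.JLabel;
public class LabelTextChooser {
    static String chooseText(JLabel label) {
        String bilingual = "Enter | \u8F93\u5165";
        Font font = label.getFont();
        // Fall back to plain "Enter" if any character cannot be rendered
        return font.canDisplayUpTo(bilingual) == -1 ? bilingual : "Enter";
    }
}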
Lazy version:
Arrays.asList(GraphicsEnvironment.getLocalGraphicsEnvironment().getAvailableFontFamilyNames()).contains(FONT_NAME)
For those who are still interested, a performance tip: using getAvailableFontFamilyNames(Locale.ROOT) might be significantly faster than just getAvailableFontFamilyNames(), because the latter performs locale-aware processing of the family names.
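The same check as the "lazy version" above, with Locale.ROOT passed in (the font name is just an example):
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.Locale;
public class FontCheck {
    static boolean isInstalled(String familyName) {
        return Arrays.asList(GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames(Locale.ROOT)).contains(familyName);
    }
    public static void main(String[] args) {
        System.out.println(isInstalled("SimSun"));
    }
}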