Java encodings for Japanese

Our software has a script that creates different language JAR files; for Japanese we use the encoding SJIS in a call to native2ascii. This worked the last time a Japanese build was attempted, but it now seems to work only in certain contexts. For example, in the following dialog the encoding seems to work only in the title bar:
Anyone have any idea about what might be causing this? Could this problem be related to a change in Java?

What exactly do you pass through native2ascii? Just to make sure, you're using native2ascii -encoding Shift_JIS, right? And you're passing text files or source files through native2ascii, right?
My only other idea is that after the text has been converted to \uXXXX format, the font you're using to display the dialog may not have all the Kanji and Kana. Explicitly set a font, and try that.
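A minimal sketch of that idea (the font name "MS Gothic" is an assumption; use any installed font that covers the Kanji and Kana ranges):
import java.awt.Font;
import javax.swing.JLabel;
import javax.swing.JOptionPane;

public class FontCheck {
    public static void main(String[] args) {
        // "Nihongo" written in Kanji, via Unicode escapes
        JLabel label = new JLabel("\u65E5\u672C\u8A9E");
        // Set an explicit font instead of the look-and-feel default
        label.setFont(new Font("MS Gothic", Font.PLAIN, 14));
        JOptionPane.showMessageDialog(null, label);
    }
}
If the text renders with the explicit font but not without it, the default dialog font is the culprit rather than the native2ascii conversion.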

I would suggest checking these two things:
Make absolutely sure that the native2ascii conversions are correct. You should do a round trip conversion with the -reverse flag, and make sure that your input and output are in sync.
Double-check that the fonts you use can support Shift-JIS text. The blocks and symbols that appear in the dialog text and button text look like the characters themselves might be OK, but the fonts might not have glyphs for them.
An additional word of caution: If this application is intended for use on Windows, then you really should be using the MS932 or windows-31j encoding. SJIS will work for all but a dozen or so symbols, but it turns out these symbols (like the full-width tilde) are actually used quite frequently in Japan.
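To see the difference concretely, here is a small sketch (it assumes both charsets are available, as they are in standard JDKs): the full-width tilde U+FF5E is typically encodable in windows-31j but not in Java's Shift_JIS.
import java.nio.charset.Charset;

public class TildeCheck {
    public static void main(String[] args) {
        char fullWidthTilde = '\uFF5E'; // FULLWIDTH TILDE, common in Japanese text
        for (String name : new String[] {"Shift_JIS", "windows-31j"}) {
            boolean ok = Charset.forName(name).newEncoder().canEncode(fullWidthTilde);
            System.out.println(name + " can encode U+FF5E: " + ok);
        }
        // Expect false for Shift_JIS and true for windows-31j: the two
        // charsets map the byte pair 0x81 0x60 to different Unicode characters.
    }
}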

I think the right way to do this is to use UTF-8 or UTF-16 exclusively. Kanji and Katakana demand special attention.
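For example, you could keep the source properties files in UTF-8 and let native2ascii do the escaping (file names here are hypothetical):
native2ascii -encoding UTF-8 labels_ja.properties labels_ja_escaped.properties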

Related

How to compare a string value's encoding with a specific encoding in Java?

I'm told to write code that takes a string of text and checks whether its encoding equals a specific encoding that we want. I've searched a lot but didn't find anything. I found a method (getEncoding()), but it only works with files, and that is not what I want. I'm also told that I should use the Java library, not methods from Mozilla or Apache.
I really appreciate any help. Thanks in advance.
What you are thinking of is "internationalization". There are libraries for this, like Loc4j, but you can also get this using java.util.Locale in Java. However, in general text is just text: a token with a certain value. No localization information is stored in a character, which is why a file normally provides its encoding in a header. A console or terminal can also provide localization using certain commands/functions.
Unless you know the source encoding and the tokens used, you have only a limited ability to guess what encoding is used on the other end. If you still want to do this, you will need to go into deeper areas such as decryption, where this kind of thing is usually done with statistical analysis. That in turn requires databases on the usage of different tokens, and depending on the quality of the text, databases, and algorithms, a certain amount of text is needed. Special cases, like writing Swedish with e.g. a US encoding (using a for å and ä, or o for ö), require even more advanced analysis.
EDIT
Since I got a comment that encoding and internationalization are different things, I will add some remarks. It is possible to deal with different encodings while working plainly with English (e.g. some English special characters). It is also possible to work with encodings directly, using for example Charset. However, for many applications that handle different encodings it may still be efficient to use Locale, since that library can do a lot of operations on text in different encodings.
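If the actual goal is to test whether a given byte sequence is valid in a particular encoding, java.nio.charset can do that without external libraries. A minimal sketch (note the assumption that the text is available as raw bytes; once it is a Java String it no longer has an encoding):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class EncodingCheck {
    // True if the bytes decode cleanly under the named charset.
    static boolean isValidIn(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA4}; // "ä" encoded as UTF-8
        System.out.println(isValidIn(data, "UTF-8"));    // true
        System.out.println(isValidIn(data, "US-ASCII")); // false
    }
}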
Thanks for your answers and contributions, but these two links did the trick. I had already seen these two pages, but they didn't seem to work for me because I was thinking of getting the encoding directly and then comparing it with the specific one.
This is one of them
This is another one.

How to print Unicode symbols U+2610 and U+2612 to Windows console with Java?

What I do:
public class Main {
    public static void main(String[] args) {
        char i = 0x25A0;
        System.out.println(i);
        i = 0x2612;
        System.out.println(i);
        i = 0x2610;
        System.out.println(i);
    }
}
What I get in IDE: (screenshot)
What I get in Windows console: (screenshot)
I have Windows 10 (Russian locale), Cp866 default encoding in the console, UTF-8 encoding in the IDE.
How to make characters in console look correct?
Two problems here, actually:
Java converts output to its default encoding, which usually has nothing to do with the console encoding. This can apparently only be overridden at VM startup with, e.g.
java -Dfile.encoding=UTF-8 MyClass
The console window has to use a TrueType font in order to display Unicode. However, neither Consolas nor Lucida Console has ☐ or ☒, so they show up as boxes with Lucida Console and as boxes with a question mark with Consolas (i.e. the missing-glyph glyph). The output is still fine; you can copy/paste it easily, it just doesn't look right. And since the Windows console doesn't use font substitution (hard to do with a character grid anyway), there's little you can do to make them show up.
I'd probably just use [█], [ ], and [X] instead.
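If ASCII substitutes are acceptable, one way to stay in sync with the console is to wrap System.out with an explicit charset instead of relying on -Dfile.encoding. A sketch (Cp866 is taken from the question; U+2588 FULL BLOCK does exist in that code page):
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleOut {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode output with the console's code page explicitly,
        // rather than with the JVM default picked up from the OS locale.
        PrintStream out = new PrintStream(System.out, true, "Cp866");
        out.println("[\u2588] [ ] [X]"); // U+2588 maps to byte 0xDB in Cp866
    }
}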
Cp866 default encoding in console
Well, yeah. Code page 866 doesn't include the characters U+2610 or U+2612 (it does happen to contain U+25A0, at byte 0xFE). So even if Java were using the correct encoding for the console (either because you set something like -Dfile.encoding=cp866, or it guessed the right encoding, which it almost never manages), you couldn't get the other two characters out.
How to make characters in console look correct?
You can't.
In theory you could use -Dfile.encoding=utf-8, and set the console encoding to UTF-8 (or near enough, code page 65001). Unfortunately the Windows console is broken for multi-byte encodings (other than the legacy locale-default supported ones, which UTF-8 isn't); you'll get garbled output and hangs on input. This approach is normally unworkable.
The only reliable way to get Unicode to the Windows console is to skip the byte-based C-standard-library I/O functions that Java uses and go straight to the Win32 native WriteConsoleW interface, which accepts Unicode characters (well, UTF-16 code units, same as Java strings) and so avoids the console bugs in byte conversion. You can use JNA to access this API—see example code in this question: Java, UTF-8, and Windows console though it takes some extra tedious work if you want to make it switch between console character output and regular byte output for command piping.
And then you have to hope the user has non-raster fonts (as @Joey mentioned), and then you have to hope the font has glyphs for the characters you want (Consolas doesn't for U+2610 or U+2612). Unless you really, really have to, getting the Windows console to do Unicode is largely a waste of your time.
Are you sure that the font you use has glyphs for those characters? No font supports every possible Unicode character. Code points U+2610, U+25A0 and U+2612 (decimal 9744, 9632 and 9746) are not supported by e.g. the Arial font. You can change the font of your IDE console and of your Windows console, too.
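You can check a font's coverage programmatically with java.awt.Font.canDisplay. A small sketch (the font name is just an example):
import java.awt.Font;

public class GlyphCheck {
    public static void main(String[] args) {
        Font font = new Font("Consolas", Font.PLAIN, 12); // example font name
        for (int cp : new int[] {0x25A0, 0x2610, 0x2612}) {
            // canDisplay reports whether the font has a glyph for the code point
            System.out.printf("%s U+%04X: %b%n", font.getFontName(), cp, font.canDisplay(cp));
        }
    }
}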

How to read textfiles with unknown encoding?

I want to read several text files (e.g. CSV), but I don't know their encoding.
As the text files may contain special characters like umlauts, choosing the right encoding seems to be crucial.
new BufferedReader(new InputStreamReader(resource.getInputStream(), encoding));
I tried reading with ISO_8859_1, which did not handle the encoded umlauts properly. Then I tried UTF-8, which works.
But I don't know whether this might also cause problems with different files in the future, and I never know before reading a file which encoding it is in.
So how should I best read files whose encoding is unknown?
Strictly speaking the other two answers are right - you just have to know what the encoding is to be guaranteed of anything. However, there are libraries out there that will allow you to make educated guesses about the encoding. Check out ICU4J or jchardet, for example.
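For instance, a sketch using ICU4J's CharsetDetector (this assumes the icu4j dependency is on the classpath, and "input.csv" stands in for your file; the result is an educated guess with a confidence score, not a guarantee):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectEncoding {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("input.csv"));
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect(); // best guess; may be null
        if (match != null) {
            System.out.println(match.getName() + " (confidence " + match.getConfidence() + "/100)");
        }
    }
}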
You have to know the encoding; you cannot read the files correctly if you don't know it. As UTF-8 works, just keep using it. Also check with the producer of the files whether they will keep producing them in UTF-8; they should document this.
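If UTF-8 is the agreed format, state it explicitly whenever you read, so the platform default never sneaks in. A minimal sketch using java.nio ("data.csv" is a placeholder):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf8 {
    public static void main(String[] args) throws IOException {
        // Passing the charset explicitly avoids depending on file.encoding.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("data.csv"), StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine());
        }
    }
}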
It is impossible to reliably recognize the encoding of a text file programmatically. The only way is to try opening it in a text editor with different encodings until you can read the text.

Java - How can I use Arabic characters?

I'm creating a flashcard-type game to help in learning a new language. The actual language I'm trying to use in my program is Urdu, but when I look at a Unicode chart, Arabic and Urdu letters are mixed together, and I thought more people would know what I'm talking about if I said Arabic.
On my Windows 8 machine I can change the keyboard layout to Urdu, and whatever I type into Java is correctly displayed back to me. However, transferring this code to another computer with Windows 7 (at my school) changes the Urdu characters in the raw Java file to odd characters/mumbo-jumbo. Copying and pasting the characters from the online Unicode chart displays fine in the Java file, but they show up as '?' in the actual program itself and in the System.out output.
When I use Unicode escapes (e.g. \uXXXX), these are displayed correctly on both computers.
The problem is that I don't want to use escapes every time I write something in Urdu; I plan on writing long sentences and many words. So I was thinking of making an array of the Unicode code points and then perhaps a method that converts an English string of letters into Urdu using this array, but I thought there must be an easier way to fix this problem.
I'm still kinda a beginner, but I wasn't planning on making a very complex program anyway. For any help, thanks.
This sounds like a problem with the encoding used by the compiler on the Windows 7 computer. You should make sure that both computers use an encoding that supports Arabic/Urdu characters, such as UTF-8, when compiling.
If this is not specified, the compiler will use the system's default encoding, which might not support Arabic/Urdu characters. See this link for information on how to find/set encoding properties.
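For example, when compiling from the command line (the source file name here is hypothetical):
javac -encoding UTF-8 Flashcards.java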
You can get the encoding currently used for compiling by adding this piece of code:
System.out.println(System.getProperty("file.encoding"));

How to get correct encoding?

I have a UTF-8 file which I want to read and display in my Java program.
In the Eclipse console (stdout) or in Swing I'm getting question marks instead of the correct characters.
InputStreamReader inputStreamReader =
    new InputStreamReader(new FileInputStream(f), "UTF-8");
BufferedReader fr = new BufferedReader(inputStreamReader);
System.out.println(fr.readLine());
System.out.println(inputStreamReader.getEncoding()); // prints UTF-8
I generally don't have problems displaying accented letters on the Linux console or in Firefox etc.
Why is that so? It's making me ill :/
Thank you for your help.
I'm not a Java expert, but it seems like you're creating a UTF-8 InputStreamReader with a file that's not necessarily UTF-8.
See also: Java : How to determine the correct charset encoding of a stream
It sounds like the Eclipse console is not processing UTF-8 characters, and/or the font configured for that console does not support the Unicode characters you are trying to display.
You might be able to get this to work if you configure Eclipse to expect UTF-8 characters, and also make sure that the font in use can display those Unicode characters that are encoded in your file.
From the Eclipse 3.1 New and Noteworthy page:
You can configure the console to display output using a character encoding different from the default using the Console Encoding settings on the Common tab of a launch configuration.
As for Swing, I think you're going to need to select the right font.
There are several parameters at work, when the system has to display Unicode characters -
The first and foremost that comes to the mind, is the encoding of the input stream or buffer, which you've already figured out.
The next one in the list is the Unicode capabilities of the application - Eclipse does support display of Unicode characters in the console output; with a workaround :).
The last one in my mind is the font used in your console output: not all fonts come with glyphs for displaying Unicode characters.
Update
The non-display of Unicode characters is most likely due to the fact that Cp1252 is used for encoding characters in the console output. This can be modified by visiting the Run configuration of the application - it appears in the Common tab of the run-time configuration.
