How to get correct encoding? - Java

I have a UTF-8 file which I want to read and display in my Java program.
In the Eclipse console (stdout) or in Swing I get question marks instead of the correct characters.
InputStreamReader inputStreamReader =
        new InputStreamReader(new FileInputStream(f), "UTF-8");
BufferedReader fr = new BufferedReader(inputStreamReader);
System.out.println(fr.readLine());
inputStreamReader.getEncoding(); // prints me UTF-8
I generally have no problem displaying accented letters on the Linux console, in Firefox, etc.
Why is that so? This is driving me crazy :/
Thanks for any help.

I'm not a Java expert, but it seems like you're creating a UTF-8 InputStreamReader with a file that's not necessarily UTF-8.
See also: Java : How to determine the correct charset encoding of a stream

It sounds like the Eclipse console is not processing UTF-8 characters, and/or the font configured for that console does not support the Unicode characters you are trying to display.
You might be able to get this to work if you configure Eclipse to expect UTF-8 characters, and also make sure that the font in use can display those Unicode characters that are encoded in your file.
From the Eclipse 3.1 New and Noteworthy page:
You can configure the console to display output using a character encoding different from the default using the Console Encoding settings on the Common tab of a launch configuration.
As for Swing, I think you're going to need to select the right font.
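For instance, here is a minimal Swing sketch that sets an explicit font (the font name "DejaVu Sans" is just an example; use any installed font that has the glyphs you need):

import java.awt.Font;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;

public class SwingUnicodeDemo {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JLabel label = new JLabel("\u00E9 \u00E8 \u010D \u0161");
            // Explicitly pick a font that carries the needed glyphs,
            // instead of relying on the look-and-feel default.
            label.setFont(new Font("DejaVu Sans", Font.PLAIN, 16));
            JFrame frame = new JFrame("Encoding test");
            frame.add(label);
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}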

There are several parameters at work when the system has to display Unicode characters:
The first and foremost that comes to mind is the encoding of the input stream or buffer, which you've already figured out.
The next one on the list is the Unicode capability of the application itself - Eclipse does support display of Unicode characters in the console output, with a workaround :).
The last one in my mind is the font used for your console output - not all fonts come with glyphs for displaying Unicode characters.
Update
The non-display of Unicode characters is most likely due to the fact that Cp1252 is used for encoding characters in the console output. This can be modified by visiting the Run configuration of the application - it appears in the Common tab of the run-time configuration.
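If you cannot touch the launch configuration, a workaround on the Java side is to wrap System.out in a PrintStream with an explicit charset. This is only a sketch, and it only helps if the console actually decodes the bytes as UTF-8:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode stdout as UTF-8 instead of the platform default (e.g. Cp1252).
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u00E9 \u00E8 \u00FC"); // é è ü
    }
}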

Related

Java SAPJCO3 codepage does not work

After I call the SAP RFC from Java, I get the returned data and print it in Eclipse. But the Eclipse console shows wrongly encoded characters (the returned data is in Traditional Chinese).
My SAP codepage is 1100, and I've tried setting different codepages, including 8400, 8402 and 8300, but it still does not work.
connectProperties.setProperty(DestinationDataProvider.JCO_CODEPAGE, "8400");
How to solve this?
Your question is not about JCo but about how to export character data from Java (which is always Unicode internally) to some output stream (the console) and how to display it.
Console output - possibly without even the capability to display the data with an appropriate font - is a useless test. At least check which default file.encoding is set in your Java system environment.
Instead, I'd recommend writing the Java character data explicitly to a UTF-8-encoded file and using an editor capable of handling and displaying Unicode characters to inspect the file content. Or, if you don't have a Unicode editor, write a file converted to the code page fitting your character data; e.g. for SAP code page 8400, use the charset "GB2312" for your java.io.OutputStreamWriter.
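A minimal sketch of that suggestion (the file name is made up; swap "UTF-8" for "GB2312" if you want bytes matching SAP code page 8400):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class RfcDumpDemo {
    public static void main(String[] args) throws IOException {
        String rfcText = "..."; // the character data returned by the RFC call
        // Write the Java (Unicode) string with an explicit charset, then
        // inspect the file in a Unicode-capable editor.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("rfc-output.txt"), "UTF-8")) {
            w.write(rfcText);
        }
    }
}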

How to print Unicode symbols U+2610 and U+2612 to Windows console with Java?

What I do:
public class Main {
    public static void main(String[] args) {
        char i = 0x25A0;
        System.out.println(i);
        i = 0x2612;
        System.out.println(i);
        i = 0x2610;
        System.out.println(i);
    }
}
What I get in the IDE: (screenshot)
What I get in the Windows console: (screenshot)
I have Windows 10 (Russian locale), Cp866 as the default encoding in the console, and UTF-8 encoding in the IDE.
How can I make the characters look correct in the console?
Two problems here, actually:
Java converts output to its default encoding which doesn't have anything to do with the console encoding, usually. This can apparently only be overridden at VM startup with, e.g.
java -Dfile.encoding=UTF-8 MyClass
The console window has to use a TrueType font in order to display Unicode. However, neither Consolas nor Lucida Console has ☐ or ☒, so they show up as plain boxes in Lucida Console and as boxes with a question mark in Consolas (i.e. the missing-glyph glyph). The output itself is still fine: you can copy/paste it easily, it just doesn't look right. And since the Windows console doesn't use font substitution (hard to do that with a character grid anyway), there's little you can do to make them show up.
I'd probably just use [█], [ ], and [X] instead.
Cp866 default coding in console
Well, yeah. Code page 866 doesn't include the characters U+2610 or U+2612. So even if Java were using the correct encoding for the console (either because you set something like -Dfile.encoding=cp866, or it guessed the right encoding, which it almost never manages), you couldn't get those characters out.
How to make characters in console look correct?
You can't.
In theory you could use -Dfile.encoding=utf-8, and set the console encoding to UTF-8 (or near enough, code page 65001). Unfortunately the Windows console is broken for multi-byte encodings (other than the legacy locale-default supported ones, which UTF-8 isn't); you'll get garbled output and hangs on input. This approach is normally unworkable.
The only reliable way to get Unicode to the Windows console is to skip the byte-based C-standard-library I/O functions that Java uses and go straight to the Win32 native WriteConsoleW interface, which accepts Unicode characters (well, UTF-16 code units, same as Java strings) and so avoids the console bugs in byte conversion. You can use JNA to access this API—see example code in this question: Java, UTF-8, and Windows console though it takes some extra tedious work if you want to make it switch between console character output and regular byte output for command piping.
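A rough sketch of the JNA route is below. The interface mapping is hand-rolled and illustrative, and it only works while stdout really is an attached console (not a pipe or a redirected file):

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import com.sun.jna.ptr.IntByReference;

public class ConsoleUnicodeDemo {
    // Minimal hand-written mapping of the two Win32 calls we need.
    interface Kernel32 extends Library {
        Kernel32 INSTANCE = Native.load("kernel32", Kernel32.class);
        Pointer GetStdHandle(int nStdHandle);
        boolean WriteConsoleW(Pointer hConsoleOutput, char[] lpBuffer,
                              int nNumberOfCharsToWrite,
                              IntByReference lpNumberOfCharsWritten,
                              Pointer lpReserved);
    }

    private static final int STD_OUTPUT_HANDLE = -11;

    public static void main(String[] args) {
        String s = "\u2610 \u2612\r\n";
        Pointer handle = Kernel32.INSTANCE.GetStdHandle(STD_OUTPUT_HANDLE);
        // UTF-16 code units go straight to the console buffer,
        // bypassing the byte-oriented codepage conversion entirely.
        Kernel32.INSTANCE.WriteConsoleW(handle, s.toCharArray(), s.length(),
                new IntByReference(), Pointer.NULL);
    }
}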
And then you have to hope the user has non-raster fonts (as @Joey mentioned), and then you have to hope the font has glyphs for the characters you want (Consolas doesn't for U+2610 or U+2612). Unless you really, really have to, getting the Windows console to do Unicode is largely a waste of your time.
Are you sure that the font you use has glyphs for those Unicode characters? No font supports every possible Unicode character. The code points 9744, 9632 and 9746 (U+2610, U+25A0 and U+2612) are not supported by, e.g., the Arial font. You can change the font of your IDE console and of your Windows console, too.

How to read text files with unknown encoding?

I want to read several text files (e.g. CSV), but I don't know their encoding.
As the text files may contain special characters like umlauts, choosing the right encoding seems to be crucial.
new BufferedReader(new InputStreamReader(resource.getInputStream(), encoding));
I tried reading with ISO_8859_1, which did not work properly with encoded umlauts. So I tried UTF-8, which works.
But I don't know whether this might also cause problems with different files in the future. And I never know before reading a file which encoding it is in.
So how should I best read files with an unknown encoding?
Strictly speaking the other two answers are right - you just have to know what the encoding is to be guaranteed of anything. However, there are libraries out there that will allow you to make educated guesses about the encoding. Check out ICU4J or jchardet, for example.
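For example, with ICU4J's CharsetDetector (a sketch; the file name is made up, and the result is a ranked guess with a confidence score, not a guarantee):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EncodingGuessDemo {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("input.csv"));
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        // detect() returns the best match; check the confidence (0-100)
        // before trusting it.
        CharsetMatch match = detector.detect();
        System.out.println(match.getName()
                + " (confidence " + match.getConfidence() + ")");
    }
}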
You have to know the encoding; you cannot read the files correctly if you don't know it. As UTF-8 works, just keep using it. Also check with the producer of the files whether they will keep producing them in UTF-8. They should document this.
It is impossible to reliably recognize the encoding of a text file programmatically. The only way is to try opening it in a text editor with different encodings until you can read the text.

Java - How can I use Arabic characters?

I'm creating a flashcard-type game to help in learning a new language. The actual language I'm trying to use in my program is Urdu, but when I look at a Unicode chart, Arabic and Urdu letters are mixed together, and I thought more people would know what I'm talking about if I said Arabic.
On my Windows 8 machine I can change the keyboard layout to Urdu, and whatever I type into Java is correctly displayed back to me. However, transferring this code to another computer with Windows 7 (at my school) changes the Urdu characters in the raw Java file to mumbo-jumbo. Copying and pasting the characters from the online Unicode chart displays them in the Java file, but they are shown as '?' in the actual program itself and in the System.out output.
When I use Unicode escape sequences (e.g. \uXXXX), these are displayed correctly on both computers.
The problem is that I don't want to use escape sequences every time I want to write something in Urdu. I plan on writing long sentences and many words. So I was thinking of making an array of the Unicode code points and then perhaps a method that converts an English string of letters into Urdu using this array, but I thought there must be an easier way to fix this problem.
I'm still kind of a beginner, but I wasn't planning on making a very complex program anyway. Thanks for any help.
This sounds like a problem with the encoding used by the compiler on the Windows 7 computer. You should make sure that both computers compile using an encoding that supports Arabic/Urdu characters, such as UTF-8.
If this is not specified, the compiler will use the system's default encoding, which might not support Arabic/Urdu characters. See this link for information on how to find and set encoding properties.
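For example, you can pin the source encoding explicitly when compiling (the file name is illustrative):
javac -encoding UTF-8 Main.java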
You can get the encoding currently used for compiling by adding this piece of code:
System.out.println(System.getProperty("file.encoding"));
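Alternatively, to sidestep both escape sequences and source-encoding issues, you could keep the Urdu text out of the .java sources entirely and load it from a UTF-8 data file. A sketch, with a made-up file name:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class UrduFlashcardsDemo {
    public static void main(String[] args) throws IOException {
        // The .java source stays pure ASCII; the Urdu lives in cards_ur.txt
        // and is read with an explicit charset on every machine.
        List<String> cards = Files.readAllLines(
                Paths.get("cards_ur.txt"), StandardCharsets.UTF_8);
        cards.forEach(System.out::println);
    }
}

This way the program behaves the same no matter which default encoding the compiler on each machine assumes.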

Java encodings for Japanese

Our software has a script that creates JAR files for different languages; for Japanese we use the encoding SJIS in a call to native2ascii. This worked the last time a Japanese build was attempted, but now it seems to work only in certain contexts. For example, in the following dialog the encoding seems to work only in the title bar: (screenshot)
Anyone have any idea about what might be causing this? Could this problem be related to a change in Java?
What exactly do you pass through native2ascii? Just to make sure, you're using native2ascii -encoding Shift_JIS, right? And you're passing text files or source files through native2ascii, right?
My only other idea is that after the text has been converted to \uXXXX format, the font you're using to display the dialog may not have all the Kanji and Kana. Explicitly set a font, and try that.
I would suggest checking these 2 things:
Make absolutely sure that the native2ascii conversions are correct. You should do a round-trip conversion with the -reverse flag and make sure that your input and output are in sync.
Double-check that your fonts used can support Shift-JIS. Those blocks and symbols that appear in the dialog text and button text look like the characters might be OK, but the fonts might not support them.
An additional word of caution: If this application is intended for use on Windows, then you really should be using the MS932 or windows-31j encoding. SJIS will work for all but a dozen or so symbols, but it turns out these symbols (like the full-width tilde) are actually used quite frequently in Japan.
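A round trip with the Windows-oriented encoding might look like this (the file names are made up):
native2ascii -encoding MS932 Messages_ja.properties Messages_ascii.properties
native2ascii -reverse -encoding MS932 Messages_ascii.properties Messages_check.properties
If Messages_check.properties differs from the original Messages_ja.properties, the conversion is losing characters.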
I think the right way to do this is to use UTF-8 or UTF-16 exclusively. Kanji and Katakana demand special attention.
