Unable to print Thai string value in Java console
public static void main(String[] args) {
    String engParam = "Beautiful";
    String thaiParam = "สวย";
    System.out.println("Output :" + engParam + ":::" + thaiParam);
}
The output looks like this:
Output :Beautiful:::à?ªà??à?¢
I think System.out.println cannot print UTF-8 characters with the default console settings. Is there another way to resolve this issue?
You don't specify your environment, but this approach worked for me on Windows 10 from within my IDE, and also from the Command window:
First, use a font that supports Thai characters. But also make sure that the font you choose can be set in the Command window, and not just within your IDE. Some can (e.g. Courier Mono Thai), and some can't (e.g. Angsana New). You can mess with the Registry to add font selections, but Courier Mono Thai was available by default, so I used that one.
Once you have identified a font that you can set in the Command window, you can probably use that in your IDE as well if its default font(s) can't handle Thai characters.
Here are the steps to get things working:
Download font Courier Mono Thai. You can download it from several web sites but I got it from here.
Install the downloaded font. On Windows 10 all you have to do is select it (Courier_MonoThai.ttf) in File Explorer, right click, and select Install from the context menu.
Once the font is installed, make it the default font in the Command window. Open a Command window, click the icon in the top left corner, select Properties, and then select Courier Mono Thai as your font:
Run the application in your IDE. If the source code or the output doesn't render the Thai characters correctly, change the font. I used Courier Mono Thai in NetBeans, and everything looked good:
Finally run in the Command window. The Thai characters probably won't render correctly. To fix that just change the code page to the one that supports Thai (chcp 874) before running your application:
These instructions are specific to Windows 10. If you are running in a different environment update your question with full details of your platform and your IDE.
Updated 12/15/19 to provide an alternative approach:
Instead of using Code page 874 (Thai) from the Command window, you could do this instead:
Create a PrintStream that uses the UTF-8 charset, and write the output using that PrintStream.
In the Command window, use code page 65001 (UTF-8).
Here's the code:
package thaicharacters;

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ThaiCharacters {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String engParam = "Beautiful";
        String thaiParam = "สวย";
        // Write the output to a UTF-8 PrintStream:
        PrintStream ps = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
        ps.println("UTF-8: " + engParam + ":::" + thaiParam);
    }
}
And here's the output in the Command window, showing that:
The Thai characters are not rendered correctly when using the default code page (437), or the Thai code page (874).
The Thai characters render correctly using the UTF-8 code page (65001):
You cannot easily change the Windows console encoding, so write to a .txt file instead.
For Windows to detect the Unicode UTF-8 encoding, you could write an invisible BOM character, "\ufeff", at the beginning of the file:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

String text = "\uFEFF" + "Output :" + engParam + ":::" + thaiParam;
Path path = Paths.get("temp.txt");
Files.write(path, Collections.singletonList(text)); // Files.write uses UTF-8 by default
The problem is not in Java. When converted to UTF-8, the Thai string "สวย" gives the bytes 0xe0, 0xb8, 0xaa, 0xe0, 0xb8, 0xa7, 0xe0, 0xb8, 0xa2.
In Latin-1, 0xe0 is à, 0xaa is ª, and 0xa2 is ¢; the others have no representation, giving the ? characters.
That means that println has done its part of the job, but the thing that should have displayed the characters (terminal screen or IDE) cannot process UTF-8, or was not instructed to.
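A small sketch (not part of the original answer) that reproduces those exact bytes, so you can check what println actually sent to the console:

```java
import java.nio.charset.StandardCharsets;

public class ThaiBytes {
    public static void main(String[] args) {
        // Encode the Thai string to UTF-8 and dump the raw bytes
        for (byte b : "สวย".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("0x%02x ", b & 0xFF); // mask to avoid sign extension
        }
        System.out.println(); // prints 0xe0 0xb8 0xaa 0xe0 0xb8 0xa7 0xe0 0xb8 0xa2
    }
}
```

If the bytes match but the screen shows à?ª..., the encoding step is fine and the fault lies entirely with the console or font.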
Unfortunately, the Windows console is not really Unicode-friendly. Recent versions (Windows 7 and later) support a so-called UTF-8 code page (chcp 65001), which correctly processes UTF-8 byte strings provided the underlying font can display the characters. For example, after typing chcp 65001, my French system successfully displays all accented characters (éèùïêçàâ...) when they are UTF-8 encoded, but it cannot display your example Thai string.
If you need a truly UTF-8 capable console on Windows, you can try the excellent ConEmu.
This answer to a similar question might apply to your case if you are using Eclipse (and it is almost the same in IntelliJ).
This answer assumes that:
You are using Windows.
The "Java console" you mentioned is an invocation of the Command Prompt (you may not be aware of this if you are using an IDE, but cmd and IntelliJ IDEA certainly work this way; I don't know whether Eclipse or others do).
My guess was right :-)
Go to the Registry Editor (regedit), navigate to "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Command Processor", and create a REG_EXPAND_SZ value named AutoRun with the value chcp 65001. Then try again (no reboot required).
Actually, this is an example of creating and using an "initscript" for cmd.exe. It may be a way to change the de facto "default" console encoding to UTF-8 (code page 65001) without changing too much of the system configuration.
To restore it, simply delete this specified value.
Set the environment variable JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8, and in cmd use chcp 65001.
Related
I am running the following Java program in Eclipse. My string contains a Latin character. When I print the string, it looks weird. Here is my code:
String sample = "tést";
System.out.println(sample);
Output:
t?st
Please help me. Thanks in advance.
The actual string will contain the Latin character, since Java strings are UTF-16. You could verify this with a good debugger.
It's the rendering of the println call's output on your console that is at fault.
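If you don't have a debugger handy, a minimal sketch (not part of the original answer) that prints each char's code point, bypassing console and font rendering entirely:

```java
public class VerifyString {
    public static void main(String[] args) {
        String sample = "t\u00E9st"; // "tést" written with an escape
        // Print each char's Unicode code point; rendering cannot interfere here
        for (char c : sample.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println(); // prints U+0074 U+00E9 U+0073 U+0074
    }
}
```

Seeing U+00E9 (é) in the output proves the string itself is intact and only the display is wrong.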
Java does support Latin characters. Your display font at output probably doesn't.
Another option is that you have a strange (non-UTF) encoding in Eclipse.
I use NetBeans; it works fine for me. See my output.
run:
tést
BUILD SUCCESSFUL (total time: 1 second)
You should use an explicit charset encoding. ISO-8859-1 supports Latin characters, and so does UTF-8, which is used below.
This will help you:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(sample);
If you're running this in Eclipse and reading the output from Eclipse's console, try this:
Open Run Configurations (Menu Run > Run Configurations)
Find the run configuration you're using
Go to Common tab for that configuration
In Encoding section choose UTF-8
Consider the following program.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class HelloWorld {

    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset());

        char[] array = new char[3];
        array[0] = '\u0905';
        array[1] = '\u0905';
        array[2] = '\u0905';

        CharBuffer charBuffer = CharBuffer.wrap(array);
        Charset utf8 = Charset.forName("UTF-8");
        ByteBuffer encoded = utf8.encode(charBuffer);
        System.out.println(new String(encoded.array()));
    }
}
When I execute this using terminal,
java HelloWorld
I get properly encoded, shaped text. Default encoding was MacRoman.
Now when I execute the same code from Eclipse, I see incorrect text getting printed to the console.
When I change the file encoding option of Eclipse to UTF-8, it prints correct results in Eclipse.
I am wondering why this happens. Ideally, the file encoding options should not affect this code, because I am using UTF-8 explicitly here.
Any idea why this is happening?
I am using Java 1.6 (Sun JDK) on Mac OS X 10.7.
You need to specify what encoding you want to use when creating the string:
new String(encoded.array(), charset)
otherwise it will use the default charset.
Make sure the console you use to display the output is also encoded in UTF-8. In Eclipse for example, you need to go to Run Configuration > Common to do this.
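Putting both fixes together, a minimal sketch of the corrected program (assuming the goal is simply a lossless encode/decode round-trip):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class ExplicitDecode {
    public static void main(String[] args) {
        CharBuffer charBuffer = CharBuffer.wrap(new char[] {'\u0905', '\u0905', '\u0905'});
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(charBuffer);
        // Decode with the same charset used to encode, not the platform default.
        // Use limit(): the backing array may be larger than the encoded content.
        String decoded = new String(encoded.array(), 0, encoded.limit(), StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\u0905\u0905\u0905")); // prints true
    }
}
```

With the charset given explicitly, the result no longer depends on the IDE's or platform's default encoding.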
System.out.println("\u0905\u0905\u0905");
would be the straightforward usage.
And the encoding is missing in the String constructor, so it falls back to the platform default encoding.
new String(encoded.array(), "UTF-8")
This happens because Eclipse uses the default ANSI encoding, not UTF-8. If you're using a different encoding than what your IDE is using, you will get unreadable results.
You need to change your console run configuration:
Click on "Run"
Click on "Run Configurations", and then click on the "Common" tab
Change Encoding to UTF-8
I have tried the following:
System.out.println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "UTF-8"), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "ISO-8859-1"), true).println("rājshāhi");
Which yields the following output:
r?jsh?hi
r?jsh?hi
rÄ?jshÄ?hi
r?jsh?hi
So, what am I doing wrong?
Thanks.
P.S.
I am using Eclipse Indigo on Windows 7. The output goes to the Eclipse output console.
The java file must be encoded correctly. Look in the properties for that file, and set the encoding correctly:
What you did should work, even the simple System.out.println, if you have a recent version of Eclipse.
Look at the following:
The version of Eclipse you are using
Whether the file is encoded correctly. See @Matthew's answer. I assume this is the case, because otherwise Eclipse wouldn't allow you to save the file (it would warn about "unsupported characters")
The font for the console (Windows -> Preferences -> Fonts -> Default Console Font)
When you save the text to a file whether you get the characters correctly
Actually, copying your code and running it on my computer gave me the following output:
rājshāhi
rājshāhi
rājshāhi
r?jsh?hi
It looks like all lines work except the last one. Get your System default character set (see this answer). Mine is UTF-8. See if changing your default character set makes a difference.
Either of the following lines will get your default character set:
System.out.println(System.getProperty("file.encoding"));
System.out.println(Charset.defaultCharset());
To change the default encoding, see this answer.
Make sure that when you create your class, you assign it the text file encoding value UTF-8.
Once a class is created with any other text file encoding, you can't change the encoding style later; even though Eclipse will allow you to change it, the change won't be reflected.
So create a new class with text file encoding UTF-8. It will definitely work.
EDIT: In your case, although you are trying to assign the text file encoding programmatically, it is not making any impact; it is using the container's inherited encoding (Cp1252).
Using the latest Eclipse version helped me achieve UTF-8 encoding on the console.
I used the Luna version of Eclipse and set Properties -> Info -> Others -> UTF-8.
The following simple test is failing:
assertEquals(myStringComingFromTheDB, "£");
Giving:
Expected :£
Actual :£
I don't understand why this is happening, especially considering that it is the encoding of the actual string (the one specified as the second argument) that is wrong. The Java file is saved as UTF-8.
The following code:
System.out.println(bytesToHex(myStringComingFromTheDB.getBytes()));
System.out.println(bytesToHex("£".getBytes()));
Outputs:
C2A3
C382C2A3
Can anyone explain me why?
Thank you.
Update: I'm working under Windows 7.
Update 2: It's not related to JUnit, the following simple example:
byte[] bytes = "£".getBytes();
for(byte b : bytes)
{
System.out.println(Integer.toHexString(b));
}
Outputs:
ffffffc3
ffffff82
ffffffc2
ffffffa3
Update 3:
I'm working in IntelliJ IDEA, and I already checked the options: the encoding is UTF-8. It is also shown in the bottom bar, and when I select and right-click the pound sign, it says "Encoding (auto-detected): UTF-8".
Update 4:
I opened the Java file with a hex editor, and the pound sign is saved correctly as "C2A3".
Please note that assertEquals accepts parameters in the following order:
assertEquals(expected, actual)
so in your case the string coming from the DB is OK, but the one from your Java class is not (as you noticed already).
I guess that you copied £ from somewhere, probably along with some weird characters around it which your editor (IDE) does not print (I'm almost sure of this). I had similar issues a couple of times, especially when I worked on MS Windows: e.g. Ctrl+C & Ctrl+V from a website into the IDE.
(I printed the bytes of £ on my system with UTF-8 encoding, and it is C2A3):
for (byte b : "£".getBytes()) {
    System.out.println(Integer.toHexString(b & 0xFF)); // mask to avoid sign extension
}
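As an aside on the hex dump in Update 2: the ffffff prefixes appear because Integer.toHexString sign-extends negative byte values to 32 bits. A minimal sketch (not from the original answer) that masks the bytes and writes £ as an escape, so the source file's encoding cannot interfere:

```java
public class PoundBytes {
    public static void main(String[] args) throws Exception {
        // U+00A3 (£) written as an escape, so the source file encoding is irrelevant
        for (byte b : "\u00A3".getBytes("UTF-8")) {
            // Mask with 0xFF: toHexString would otherwise sign-extend negative bytes
            System.out.print(Integer.toHexString(b & 0xFF));
        }
        System.out.println(); // prints c2a3
    }
}
```

Comparing this c2a3 against the c382c2a3 from Update 2 shows the literal in the failing file was double-encoded (UTF-8 bytes re-encoded as UTF-8).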
The other possibility is that your file is not really UTF-8 encoded. Do you work on Windows or some other OS?
Some other possible solutions according to the question edits:
1) It's possible that the IDE uses some other encoding. For Eclipse, see this thread: http://www.eclipse.org/forums/index.php?t=msg&goto=543800&
2) If both the IDE settings and the final file encoding are OK, then it's a compiler issue. See:
Java compiler platform file encoding problem
I saved my Java source file specifying its encoding type as UTF-8 (using Notepad; by default Notepad's encoding type is ANSI), and then I tried to compile it with:
javac -encoding "UTF-8" One.java
but it gave an error message:
One.java:1: illegal character: \65279
?public class One {
^
1 error
Is there any other way I can compile this?
Here is the source:
public class One {
    public static void main(String[] args) {
        System.out.println("HI");
    }
}
Your file is being read as UTF-8, otherwise a character with value "65279" could never appear. javac expects your source code to be in the platform default encoding, according to the javac documentation:
If -encoding is not specified, the platform default converter is used.
Decimal 65279 is hex FEFF, which is the Unicode Byte Order Mark (BOM). It's unnecessary in UTF-8, because UTF-8 is always encoded as an octet stream and doesn't have endianness issues.
Notepad likes to stick in BOMs even when they're not necessary, but some programs don't like finding them. As others have pointed out, Notepad is not a very good text editor. Switching to a different text editor will almost certainly solve your problem.
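If you'd rather strip the BOM programmatically than switch editors, a minimal sketch (the One.java path is illustrative, taken from the question):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StripBom {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("One.java"); // illustrative: the file Notepad saved
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        // A UTF-8 BOM decodes to a single U+FEFF char at the start of the text
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

After this, javac no longer sees the \65279 character at the start of the file.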
Open the file in Notepad++ and select Encoding -> Convert to UTF-8 without BOM.
This isn't a problem with your text editor, it's a problem with javac!
The Unicode spec says the BOM is optional in UTF-8; it doesn't say it's forbidden!
If a BOM can be there, then javac HAS to handle it, but it doesn't. Actually, using the BOM in UTF-8 files IS useful for distinguishing an ANSI-encoded file from a Unicode-encoded file.
The proposed solution of removing the BOM is only a workaround and not the proper solution.
This bug report indicates that this "problem" will never be fixed : https://web.archive.org/web/20160506002035/http://bugs.java.com/view_bug.do?bug_id=4508058
Since this thread is in the top 2 google results for the "javac BOM" search, I'm leaving this here for future readers.
Try javac -encoding UTF8 One.java
Without the quotes, and it's UTF8 (no dash).
See this forum thread for more links
See below.
For example, we can work through a program that uses Telugu words.
Program (UnicodeEx.java):
class UnicodeEx {
    public static void main(String[] args) {
        double ఎత్తు = 10;
        double వెడల్పు = 25;
        double దీర్ఘ_చతురస్ర_వైశాల్యం;

        System.out.println("The Value of Height = " + ఎత్తు + " and Width = " + వెడల్పు + "\n");
        దీర్ఘ_చతురస్ర_వైశాల్యం = ఎత్తు * వెడల్పు;
        System.out.println("Area of Rectangle = " + దీర్ఘ_చతురస్ర_వైశాల్యం);
    }
}
This is the program. While saving it as "UnicodeEx.java", change the encoding to "unicode".
How to Compile
javac -encoding "unicode" UnicodeEx.java
How to Execute
java UnicodeEx
The Value of Height = 10.0 and Width = 25.0
Area of Rectangle = 250.0
I know this is a very old thread, but I was experiencing a similar problem with PHP instead of Java, and Google took me here. I was writing PHP in Notepad++ (not plain Notepad) and noticed that an extra blank line appeared every time I called an include file. Firebug showed that there was a 65279 character in those extra lines.
Actually, both the main PHP file and the included files were encoded in UTF-8. However, Notepad++ also has an option to encode as "UTF-8 without BOM". This solved my problem.
Bottom line: your editor may insert this extra BOM character into UTF-8 files unless you instruct it to use UTF-8 without BOM.
Works fine here, even edited in Notepad. Moral of the story is, don't use Notepad. There's likely an unprintable character in there that Notepad is either inserting or happily hiding from you.
I had the same problem. To solve it, I opened the file in a hex editor and found three "invisible" bytes at the beginning of the file (the UTF-8 BOM, EF BB BF). I removed them, and compilation worked.
Open your file with WordPad or any other editor except Notepad.
Select Save As type as Text Document - MS-DOS Format
Reopen the Project
To extend the existing answers with a solution for Linux users:
To remove the BOM on all .java files at once, go to your source directory and execute
find -iregex '.*\.java' -type f -print0 | xargs -0 dos2unix
Requires find, xargs and dos2unix to be installed, which should be included in most distributions. The first statement finds all .java files in the current directory recursively, the second one converts each of them with the dos2unix tool, which is intended to convert line endings but also removes the BOM.
The line-ending conversion should have no effect, since the files should already use Linux \n endings if your version control is configured correctly. But be warned that dos2unix converts line endings as well, in case you have one of those rare setups where that is not intended.
In IntelliJ IDEA (Settings > Editor > File Encodings), the project encoding was "windows-1256", so I used the following code to convert static strings to UTF-8:
protected String persianString(String persianString) throws UnsupportedEncodingException {
    return new String(persianString.getBytes("windows-1256"), "UTF-8");
}
Now it is OK!
Depending on the file encoding, change "windows-1256" to the proper charset.