I have come across a strange issue. In the piece of code below, I am searching for the presence of ß.
public static void main(String[] args) {
    char[] chArray = {'ß'};
    String str = "Testß";
    for (int i = 0; i < chArray.length; i++) {
        if (str.indexOf(chArray[i]) > -1) {
            System.out.println("ß is present");
            break;
        }
    }
}
I have a web application running on JBoss on Linux, with Java 6. The above code doesn't detect the presence of ß when I include it in that application.
Surprisingly, if I compile the same file in my Eclipse workspace and then apply the patch to the application, it runs as expected!
Points to note:
The application build environment is a black box to me, so I have no idea whether an -encoding option is passed to the javac command or anything like that
My Eclipse JRE is Java 8, but the compiler compliance level set for the project is Java 6
I changed the value in the array declaration from ß to its Unicode escape \u00DF, but the behavior is still the same.
char[] chArray = {'\u00DF'};
When I decompiled the generated class file, the declared value of the character array was shown as 65533, which is \uFFFD, the replacement character used for an unidentifiable symbol. I used JD-GUI as the decompiler, which I don't find entirely trustworthy!
Need your help, folks! I am sure it is not the same as the case-sensitivity issue in: Java's equalsIgnoreCase fails with ß ("Sharp S" used in the German alphabet)
Thanks in advance
I think your problem is the encoding of ß. You have two options to solve your error:
First convert your Java source code into ASCII characters (non-ASCII characters become \uXXXX escapes), then compile the converted file (native2ascii takes an input file and an output file):
native2ascii "your_class_file.java" "your_class_file_ascii.java"
javac "your_class_file_ascii.java"
Second, compile your Java file telling javac the file's actual encoding, typically UTF-8 on Linux and windows-1252 or ISO-8859-15 on Windows:
javac -encoding "encoding" "your_class_file.java"
As far as I can judge, it should have worked when you replaced "ß" with "\u00df". If the solutions above don't work, print every char and its Unicode value to System.out and check which char is 'ß'.
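For instance, a minimal sketch of that diagnostic (the class name and string literal here are just examples):
public class CharDump {
    public static void main(String[] args) {
        String str = "Testß";
        for (char c : str.toCharArray()) {
            // print each char with its UTF-16 code unit in hex, e.g. ß -> 00df
            System.out.printf("%c -> %04x%n", c, (int) c);
        }
    }
}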
Another possible error is that you read the text with an encoding that doesn't support ß; try building your String from the raw bytes with an explicit charset:
String input = new String(input_bytes, StandardCharsets.UTF_8); // on Linux
String input = new String(input_bytes, StandardCharsets.ISO_8859_1); // on Windows
For more information on charsets, see the StandardCharsets class reference.
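As a self-contained sketch of that (assuming Java 7+ for java.nio.file; the file name input.txt is hypothetical):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // read the raw bytes, then decode them with an explicit charset
        byte[] inputBytes = Files.readAllBytes(Paths.get("input.txt"));
        String input = new String(inputBytes, StandardCharsets.UTF_8);
        System.out.println(input.indexOf('\u00DF') > -1);
    }
}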
Thanks for your time and responses!
The actual problem was that the class file was not regenerated in the build, hence the change was not reflected. Using ß's Unicode escape \u00DF in the Java source file works fine.
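For future readers, a quick way to see what the compiler actually put into the class file, without a decompiler; a minimal sketch (the class name CheckLiteral is just for illustration):
public class CheckLiteral {
    public static void main(String[] args) {
        char[] chArray = {'\u00DF'};
        // if the build compiled the source with the wrong -encoding, a literal 'ß'
        // can turn into \uFFFD (65533); printing the numeric value shows what is really there
        System.out.println((int) chArray[0]); // expect 223 (0x00DF)
    }
}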
Related
I am running the following Java program in Eclipse. My string contains a Latin character. When I print the string, it looks weird. Here is my code:
String sample = "tést";
System.out.println(sample);
Output:
t?st
Please help me. Thanks in advance.
The actual string will contain the Latin character, since Java strings are UTF-16. You could verify this with a good debugger.
It's the rendering on your console of the println call that is at fault.
Java does support Latin characters. Your display font probably doesn't render them at output.
The other option is that you have a strange (non-UTF) encoding configured in Eclipse.
I use NetBeans, and it works fine for me. See my output:
run:
tést
BUILD SUCCESSFUL (total time: 1 second)
You should use the ISO-8859-1 charset encoding; it supports Latin characters.
This will help you:
PrintStream out = new PrintStream(System.out, true, "UTF-8"); // autoflush on, UTF-8 output encoding
out.println(sample);
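A self-contained version of that idea (note the constructor can throw UnsupportedEncodingException):
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // wrap stdout in a PrintStream that encodes its output as UTF-8
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("tést");
    }
}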
If you're running this in Eclipse and reading the output from Eclipse's console, try this:
Open Run Configurations (Menu Run > Run Configurations)
Find the run configuration you're using
Go to the Common tab for that configuration
In the Encoding section, choose UTF-8
Consider the following program.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset());

        char[] array = new char[3];
        array[0] = '\u0905';
        array[1] = '\u0905';
        array[2] = '\u0905';

        CharBuffer charBuffer = CharBuffer.wrap(array);
        Charset utf8 = Charset.forName("UTF-8");
        ByteBuffer encoded = utf8.encode(charBuffer);
        System.out.println(new String(encoded.array()));
    }
}
When I execute this from the terminal,
java HelloWorld
I get properly encoded and shaped text. The default encoding was MacRoman.
Now when I execute the same code from Eclipse, I see incorrect text getting printed to the console.
When I change the file encoding option of Eclipse to UTF-8, it prints correct results in Eclipse.
I am wondering why this happens. Ideally, the file encoding option should not have affected this code, because I am using UTF-8 explicitly here.
Any idea why this is happening?
I am using Java 1.6 (Sun JDK), Mac OSx 10.7.
You need to specify what encoding you want to use when creating the string:
new String(encoded.array(), charset)
otherwise it will use the default charset.
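For example, a corrected sketch of the program from the question (the encode() result's backing array can be larger than the encoded content, so the offset/length constructor is the safer choice):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class HelloWorldFixed {
    public static void main(String[] args) {
        CharBuffer charBuffer = CharBuffer.wrap(new char[] {'\u0905', '\u0905', '\u0905'});
        Charset utf8 = Charset.forName("UTF-8");
        ByteBuffer encoded = utf8.encode(charBuffer);
        // decode with the same charset instead of the platform default;
        // limit() marks the end of the actual encoded bytes
        String decoded = new String(encoded.array(), 0, encoded.limit(), utf8);
        System.out.println(decoded);
    }
}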
Make sure the console you use to display the output is also encoded in UTF-8. In Eclipse for example, you need to go to Run Configuration > Common to do this.
System.out.println("\u0905\u0905\u0905");
would be the straight-forward usage.
And encoding is missing for the String constructor, defaulting to the set default encoding.
new String(encoded.array(), "UTF-8")
This happens because Eclipse uses the default ANSI encoding, not UTF-8. If you're using a different encoding than the one your IDE uses, you will get unreadable results.
You need to change your console run configuration:
Click on "Run"
Click on "Run Configurations" and then click on the "Common" tab
Change Encoding to UTF-8
I have tried the following:
System.out.println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "UTF-8"), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "ISO-8859-1"), true).println("rājshāhi");
Which yields the following output:
r?jsh?hi
r?jsh?hi
rÄ?jshÄ?hi
r?jsh?hi
So, what am I doing wrong?
Thanks.
P.S.
I am using Eclipse Indigo on Windows 7. The output goes to the Eclipse output console.
The Java file must be encoded correctly. Look in the properties for that file, and set the encoding correctly.
What you did should work, even the simple System.out.println, if you have a recent version of Eclipse.
Look at the following:
The version of Eclipse you are using
Whether the file is encoded correctly. See #Matthew's answer. I assume it is, because otherwise Eclipse wouldn't allow you to save the file (it would warn about "unsupported characters")
The font for the console (Windows -> Preferences -> Fonts -> Default Console Font)
Whether you get the characters correctly when you save the text to a file (see the sketch after this list)
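For that last check, a minimal sketch that writes the string with an explicit UTF-8 encoding (check.txt is a hypothetical output path), so the console rendering is taken out of the equation:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteCheck {
    public static void main(String[] args) throws IOException {
        // write with an explicit charset, then inspect the file in a hex or text editor
        Writer w = new OutputStreamWriter(new FileOutputStream("check.txt"), "UTF-8");
        w.write("rājshāhi");
        w.close();
    }
}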
Actually, copying your code and running it on my computer gave me the following output:
rājshāhi
rājshāhi
rājshāhi
r?jsh?hi
It looks like all lines work except the last one. Get your System default character set (see this answer). Mine is UTF-8. See if changing your default character set makes a difference.
Either of the following lines will get your default character set:
System.out.println(System.getProperty("file.encoding"));
System.out.println(Charset.defaultCharset());
To change the default encoding, see this answer.
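For example, the default encoding can often be overridden when launching the JVM (how reliably this works varies by JVM version, so treat this as a sketch; YourMainClass is a placeholder):
java -Dfile.encoding=UTF-8 YourMainClass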
Make sure that when you create your class, you assign it the Text File Encoding value UTF-8.
Once a class is created with any other Text File Encoding, you can't effectively change the encoding later; even though Eclipse will let you change it, the change won't be reflected.
So create a new class with Text File Encoding UTF-8. That will definitely work.
EDIT: In your case, although you are trying to assign the text file encoding programmatically, it has no impact; the file keeps the container-inherited encoding (Cp1252).
Using the latest Eclipse version helped me achieve UTF-8 encoding on the console.
I used the Luna version of Eclipse and set Properties -> Info -> Others -> UTF-8.
The following simple test is failing:
assertEquals(myStringComingFromTheDB, "£");
Giving:
Expected :£
Actual :Â£
I don't understand why this is happening, especially considering that it is the encoding of the actual string (the one specified as the second argument) that is wrong. The Java file is saved as UTF-8.
The following code:
System.out.println(bytesToHex(myStringComingFromTheDB.getBytes()));
System.out.println(bytesToHex("£".getBytes()));
Outputs:
C2A3
C382C2A3
Can anyone explain to me why?
Thank you.
Update: I'm working under Windows 7.
Update 2: It's not related to JUnit; the following simple example:
byte[] bytes = "£".getBytes();
for (byte b : bytes)
{
    System.out.println(Integer.toHexString(b));
}
Outputs:
ffffffc3
ffffff82
ffffffc2
ffffffa3
Update 3:
I'm working in IntelliJ IDEA; I already checked the options, and the encoding is UTF-8. It is also shown in the bottom bar, and when I select the pound sign and right-click it, it says "Encoding (auto-detected): UTF-8".
Update 4:
Opened the Java file with a hex editor, and the pound sign is saved, correctly, as "C2A3".
Please note that assertEquals accepts parameters in the following order:
assertEquals(expected, actual)
so in your case the string coming from the DB is OK, but the one from your Java class is not (as you noticed already).
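So the corrected order would be the following (keeping the names from the question); note this only fixes the labels in the failure message, not the underlying mismatch:
assertEquals("£", myStringComingFromTheDB);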
I guess that you copied £ from somewhere, probably along with some weird characters around it which your editor (IDE) does not print out (almost surely). I had similar issues a couple of times, especially when I worked on MS Windows: e.g. Ctrl+C and Ctrl+V from a website into the IDE.
(I printed the bytes of £ on my system with UTF-8 encoding and got C2A3):
for (byte b : "£".getBytes()) {
    // mask with 0xFF: bytes are signed in Java, so without the mask
    // sign extension prints values like ffffffc2 instead of c2
    System.out.println(Integer.toHexString(b & 0xFF));
}
That sign extension is also why the output in your Update 2 shows the ffffff prefixes.
The other possibility is that your file is not really UTF-8 encoded. Do you work on Windows or some other OS?
Some other possible solutions according to the question edits:
1) It's possible that the IDE uses some other encoding. For Eclipse, see this thread: http://www.eclipse.org/forums/index.php?t=msg&goto=543800&
2) If both the IDE settings and the final file encoding are OK, then it's a compiler issue. See:
Java compiler platform file encoding problem
I saved my Java source file specifying its encoding as UTF-8 (using Notepad; by default Notepad's encoding is ANSI), and then I tried to compile it using:
javac -encoding "UTF-8" One.java
but it gave an error message:
One.java:1: illegal character: \65279
?public class One {
^
1 error
Is there any other way I can compile this?
Here is the source:
public class One {
    public static void main(String[] args) {
        System.out.println("HI");
    }
}
Your file is being read as UTF-8, otherwise a character with value "65279" could never appear. javac expects your source code to be in the platform default encoding, according to the javac documentation:
If -encoding is not specified, the platform default converter is used.
Decimal 65279 is hex FEFF, which is the Unicode Byte Order Mark (BOM). It's unnecessary in UTF-8, because UTF-8 is always encoded as an octet stream and doesn't have endianness issues.
Notepad likes to stick in BOMs even when they're not necessary, but some programs don't like finding them. As others have pointed out, Notepad is not a very good text editor. Switching to a different text editor will almost certainly solve your problem.
Open the file in Notepad++ and select Encoding -> Convert to UTF-8 without BOM.
This isn't a problem with your text editor, it's a problem with javac!
The Unicode spec says the BOM is optional in UTF-8; it doesn't say it's forbidden!
If a BOM can be there, then javac HAS to handle it, but it doesn't. Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-encoded file from a Unicode-encoded file.
The proposed solution of removing the BOM is only a workaround and not the proper solution.
This bug report indicates that this "problem" will never be fixed : https://web.archive.org/web/20160506002035/http://bugs.java.com/view_bug.do?bug_id=4508058
Since this thread is in the top 2 google results for the "javac BOM" search, I'm leaving this here for future readers.
Try javac -encoding UTF8 One.java
Without the quotes and it's UTF8, no dash.
See this forum thread for more links
See below.
For example, we can illustrate this with a program that uses Telugu words.
Program (UnicodeEx.java)
class UnicodeEx {
    public static void main(String[] args) {
        double ఎత్తు = 10;
        double వెడల్పు = 25;
        double దీర్ఘ_చతురస్ర_వైశాల్యం;

        System.out.println("The Value of Height = " + ఎత్తు + " and Width = " + వెడల్పు + "\n");
        దీర్ఘ_చతురస్ర_వైశాల్యం = ఎత్తు * వెడల్పు;
        System.out.println("Area of Rectangle = " + దీర్ఘ_చతురస్ర_వైశాల్యం);
    }
}
This is the program; save it as "UnicodeEx.java" with the encoding changed to "unicode".
How to Compile
javac -encoding "unicode" UnicodeEx.java
How to Execute
java UnicodeEx
The Value of Height = 10.0 and Width = 25.0
Area of Rectangle = 250.0
I know this is a very old thread, but I was experiencing a similar problem with PHP instead of Java, and Google took me here. I was writing PHP in Notepad++ (not plain Notepad) and noticed that an extra blank line appeared every time I called an include file. Firebug showed that there was a 65279 character in those extra lines.
Actually, both the main PHP file and the included files were encoded in UTF-8. However, Notepad++ also has an option to encode as "UTF-8 without BOM". This solved my problem.
Bottom line: an editor saving UTF-8 may insert this extra BOM character here and there unless you instruct it to use UTF-8 without BOM.
Works fine here, even edited in Notepad. The moral of the story is: don't use Notepad. There's likely an unprintable character in there that Notepad is either inserting or happily hiding from you.
I had the same problem. To solve it, I opened the file in a hex editor and found three "invisible" bytes at the beginning of the file. I removed them, and compilation worked.
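The same idea as a small program, in case a hex editor is not at hand; a minimal sketch (the StripBom class is hypothetical, and it overwrites the file in place, so keep a backup):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        byte[] bytes = Files.readAllBytes(path);
        // the UTF-8 BOM is the three bytes EF BB BF at the start of the file
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
    }
}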
Open your file with WordPad or any other editor except Notepad.
Select the Save As type "Text Document - MS-DOS Format".
Reopen the project.
To extend the existing answers with a solution for Linux users:
To remove the BOM on all .java files at once, go to your source directory and execute
find -iregex '.*\.java' -type f -print0 | xargs -0 dos2unix
This requires find, xargs and dos2unix to be installed, which should all be included in most distributions. The first command finds all .java files in the current directory recursively; the second converts each of them with the dos2unix tool, which is intended to convert line endings but also removes the BOM.
The line-ending conversion should have no effect, since the files should already be in Linux \n format if your version control is configured correctly. Still, be warned that it converts line endings as well, in case you have one of those rare setups where that is not intended.
In IntelliJ IDEA (Settings > Editor > File Encodings), the project encoding was "windows-1256", so I used the following code to convert static strings to UTF-8:
protected String persianString(String persianStr) throws UnsupportedEncodingException {
    // encode the string back to windows-1256 bytes, then decode those bytes as UTF-8
    return new String(persianStr.getBytes("windows-1256"), "UTF-8");
}
Now it is OK!
Depending on the file encoding, you should change "windows-1256" to the appropriate charset.