String UTF8 encoding issue - java

The following simple test is failing:
assertEquals(myStringComingFromTheDB, "£");
Giving:
Expected :£
Actual :Â£
I don't understand why this is happening, especially since it is the encoding of the actual string (the one specified as the second argument) that is wrong. The Java file is saved as UTF-8.
The following code:
System.out.println(bytesToHex(myStringComingFromTheDB.getBytes()));
System.out.println(bytesToHex("£".getBytes()));
Outputs:
C2A3
C382C2A3
Can anyone explain why?
Thank you.
Update: I'm working under Windows 7.
Update 2: It's not related to JUnit; the following simple example:
byte[] bytes = "£".getBytes();
for (byte b : bytes) {
    System.out.println(Integer.toHexString(b));
}
Outputs:
ffffffc3
ffffff82
ffffffc2
ffffffa3
Update 3:
I'm working in IntelliJ IDEA; I have already checked the options and the encoding is UTF-8. It is also shown in the bottom bar, and when I select the pound sign and right-click it, it says "Encoding (auto-detected): UTF-8".
Update 4:
Opened the Java file with a hex editor and the pound sign is saved, correctly, as "C2A3".

Please note that assertEquals accepts parameters in the following order:
assertEquals(expected, actual)
so in your case the string coming from the DB is OK, but the one from your Java class is not (as you already noticed).
I guess you copied £ from somewhere, probably along with some weird characters around it which your editor (IDE) does not print out (almost certainly). I have had similar issues a couple of times, especially when working on MS Windows: e.g. Ctrl+C & Ctrl+V from a website into the IDE.
(I printed the bytes of £ on my system with UTF-8 encoding and it is C2A3):
for (byte b : "£".getBytes()) {
    System.out.println(Integer.toHexString(b));
}
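As an aside, the ffffff prefixes in the Update 2 output are just sign extension: byte is signed in Java, so negative bytes widen to negative ints before Integer.toHexString formats them. Masking with 0xFF gives a clean two-digit dump (a minimal sketch of the same loop):
for (byte b : "£".getBytes()) {
    // b & 0xFF promotes the byte to an int in the range 0..255
    System.out.println(Integer.toHexString(b & 0xFF));
}
On the asker's machine this would print c3, 82, c2 and a3, matching the second hex dump.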
The other possibility is that your file is not really UTF-8 encoded. Do you work on Windows or some other OS?
Some other possible solutions according to the question edits:
1) It's possible that the IDE uses some other encoding. For Eclipse see this thread: http://www.eclipse.org/forums/index.php?t=msg&goto=543800&
2) If both the IDE settings and the final file encoding are OK, then it's a compiler issue. See:
Java compiler platform file encoding problem
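If the compiler turns out to be the culprit, forcing it to read the source as UTF-8 should make both hex dumps print C2A3 (a sketch; MyTest.java stands in for the actual file name):
javac -encoding UTF-8 MyTest.java
Without the flag, javac falls back to the platform default encoding, which on Windows 7 is typically Cp1252; reading the UTF-8 bytes C2 A3 through Cp1252 yields the two characters Â and £, whose UTF-8 dump is exactly the C382C2A3 seen in the question.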

Related

java does not support latin characters

I am running the following Java program in Eclipse. My string contains a Latin character. When I print the string it looks weird. Here is my code:
String sample = "tést";
System.out.println(sample);
Output:
t?st
Please help me. Thanks in advance.
The actual string will contain the Latin character, since Java strings are UTF-16. You could verify this with a good debugger.
It's the rendering of the println call on your console that is at fault.
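You can also verify it without a debugger by dumping each character's code point (a minimal sketch using the sample string from the question):
String sample = "tést";
// chars() yields each UTF-16 code unit as an int; é should appear as U+00E9
sample.chars().forEach(c -> System.out.printf("U+%04X%n", c));
If this prints U+0074, U+00E9, U+0073, U+0074, the string itself is intact and only the console rendering is at fault.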
Java does support Latin characters. Your display font at output probably doesn't.
Another possibility is that you have a strange (non-UTF) encoding configured in Eclipse.
I use NetBeans and it works fine for me. See my output:
run:
tést
BUILD SUCCESSFUL (total time: 1 second)
You could use the ISO-8859-1 charset encoding; it supports Latin characters.
This will help you:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(sample);
If you're running this in Eclipse and reading the output from Eclipse's console, try this:
Open Run Configurations (Menu Run > Run Configurations)
Find the run configuration you're using
Go to Common tab for that configuration
In Encoding section choose UTF-8
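Equivalently, assuming a JVM old enough that the file.encoding system property still governs the default charset (as of JDK 18 it defaults to UTF-8 anyway), you can force the encoding with a VM argument on the Arguments tab of the same run configuration:
-Dfile.encoding=UTF-8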

Java compilation issue with ß character

I have come across a strange issue. In the below piece of code, I am searching for the presence of ß.
public static void main(String[] args) {
    char[] chArray = {'ß'};
    String str = "Testß";
    for (int i = 0; i < chArray.length; i++) {
        if (str.indexOf(chArray[i]) > -1) {
            System.out.println("ß is present");
            break;
        }
    }
}
I have a web application running on JBoss on Linux, Java 6. The above code doesn't detect the presence of ß when I include it in that application.
Surprisingly, if I compile the same file in my Eclipse workspace and then apply the patch to the application, it runs as expected!
Points to note:
The application build environment is a black box to me, hence I have no idea whether any -encoding option is passed to the javac command or anything like that.
My Eclipse's JRE is Java 8, but the compiler version set for the project is Java 6.
I changed the value from ß to its Unicode equivalent \u00DF in the array declaration, but the behavior is still the same.
char [] chArray = {'\u00DF'};
When I decompiled the generated class file, the declared value of the character array was shown as 65533, which is \uFFFD, nothing but the replacement character used for unidentified symbols. I used JD-GUI as the decompiler, which I don't consider trustworthy!
Need your help, folks! I am sure it is not the same as this case-sensitivity issue with ß: Java's equalsIgnoreCase fails with ß ("Sharp S" used in the German alphabet)
Thanks in advance
I think your problem is the encoding of ß. You have two options to solve your error:
First convert your Java source code into ASCII characters, then compile it:
native2ascii "your_class_file.java"
javac "your_class_file.java"
Or compile your Java file with your platform encoding, UTF-8 on Linux and ISO-8859-15 on Windows:
javac -encoding "encoding" "your_class_file.java"
As far as I can judge, it should have worked after replacing "ß" with "\u00df". If the solutions above don't work, print every char and its Unicode value to System.out and check which char is 'ß', as in the sketch below.
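A minimal sketch of that check (assuming the str variable from the question):
for (char c : str.toCharArray()) {
    // print each character next to its Unicode code point
    System.out.printf("%c -> U+%04X%n", c, (int) c);
}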
Another possible error is that you read the text in an encoding that doesn't support ß; try reading your String by reading the bytes and calling:
String input = new String(input_bytes, StandardCharsets.UTF_8); // on Linux
String input = new String(input_bytes, StandardCharsets.ISO_8859_1); // on Windows
For more information on charsets, see the StandardCharsets class reference.
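A self-contained sketch of that approach (the input.txt file name and the ReadUtf8 class are placeholders, not from the original question):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf8 {
    public static void main(String[] args) throws Exception {
        // Read raw bytes and decode them explicitly instead of relying on the platform default.
        byte[] inputBytes = Files.readAllBytes(Paths.get("input.txt"));
        String input = new String(inputBytes, StandardCharsets.UTF_8);
        System.out.println(input.indexOf('\u00DF') > -1 ? "ß is present" : "ß is absent");
    }
}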
Thanks for your time and responses!
The actual problem was that the class file was not generated in the build, hence the change was not reflected. Using ß's Unicode value \u00DF in the Java source file works fine.

How to print "rājshāhi" to the Eclipse output console?

I have tried the following:
System.out.println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "UTF-8"), true).println("rājshāhi");
new PrintWriter(new OutputStreamWriter(System.out, "ISO-8859-1"), true).println("rājshāhi");
Which yields the following output:
r?jsh?hi
r?jsh?hi
rÄ?jshÄ?hi
r?jsh?hi
So, what am I doing wrong?
Thanks.
P.S.
I am using Eclipse Indigo on Windows 7. The output goes to the Eclipse output console.
The Java file must be encoded correctly. Look in the properties for that file and set the encoding correctly.
What you did should work, even the simple System.out.println, if you have a recent version of Eclipse.
Look at the following:
The version of eclipse you are using
Whether the file is encoded correctly. See Matthew's answer above. I assume this is the case, because otherwise Eclipse wouldn't allow you to save the file (it would warn about unsupported characters)
The font for the console (Windows -> Preferences -> Fonts -> Default Console Font)
When you save the text to a file whether you get the characters correctly
Actually, copying your code and running it on my computer gave me the following output:
rājshāhi
rājshāhi
rājshāhi
r?jsh?hi
It looks like all lines work except the last one. Get your system default character set (see this answer); mine is UTF-8. See if changing your default character set makes a difference.
Either of the following lines will get your default character set:
System.out.println(System.getProperty("file.encoding"));
System.out.println(Charset.defaultCharset());
To change the default encoding, see this answer.
Make sure that when you create your class, you assign it the text file encoding UTF-8.
Once a class is created with any other text file encoding, you can't change the encoding later; even though Eclipse will let you, the change won't take effect.
So create a new class with text file encoding UTF-8. It will definitely work.
EDIT: In your case, although you are trying to assign the text file encoding programmatically, it has no impact; the file keeps the container's inherited encoding (Cp1252).
Using the latest Eclipse version helped me to achieve UTF-8 encoding on the console.
I used the Luna version of Eclipse and set Properties -> Info -> Others -> UTF-8.

I have UTF-8 - but still get "Invalid byte 1 of 1-byte UTF-8 sequence"

I create an XML String on the fly (NOT reading from a file). Then I use Cocoon 3 to transform it via FOP to a PDF. Somewhere in the middle, Xerces runs. When I use the hardcoded stuff, everything works. As soon as I put a German umlaut into the database and enrich my XML with that data, I get:
Caused by: org.apache.cocoon.pipeline.ProcessingException: Can't parse the XML string.
at org.apache.cocoon.sax.component.XMLGenerator$StringGenerator.execute(XMLGenerator.java:326)
at org.apache.cocoon.sax.component.XMLGenerator.execute(XMLGenerator.java:104)
at org.apache.cocoon.pipeline.AbstractPipeline.invokeStarter(AbstractPipeline.java:146)
at org.apache.cocoon.pipeline.AbstractPipeline.execute(AbstractPipeline.java:76)
at de.grobmeier.tab.webapp.modules.documents.InvoicePipeline.generateInvoice(InvoicePipeline.java:74)
... 87 more
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:554)
I then debugged my app and found out that my "Ä" (which comes from the database) has the byte value 196, which is C4 in hex. This is what I expected according to this: http://www.utf8-zeichentabelle.de/
I do not know why my code fails.
I then tried to add a BOM manually, like this:
byte[] bom = new byte[3];
bom[0] = (byte) 0xEF;
bom[1] = (byte) 0xBB;
bom[2] = (byte) 0xBF;
String myString = new String(bom) + inputString;
I know this is not exactly good, but I tried it; of course it failed. I then tried to add an XML header in front:
<?xml version="1.0" encoding="UTF-8"?>
That failed too. Then I combined the two. Failed as well.
After all that, I tried something like this:
xmlInput = new String(xmlInput.getBytes("UTF8"), "UTF8");
which in fact does nothing, because it is already UTF-8. Still it fails.
So... any ideas what I am doing wrong and what Xerces is expecting from me?
Thanks
Christian
If your database contains only a single byte (with value 0xC4) then you aren't using UTF-8 encoding.
The character "LATIN CAPITAL LETTER A WITH DIAERESIS" has a code-point value U+00C4, but UTF-8 can't encode that in a single byte. If you check the third column "UTF-8 (hex.)" on UTF8-zeichentabelle.de you'll see that UTF-8 encodes that as 0xC3 84 (two bytes).
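You can see the difference directly in Java (a minimal sketch, not from the original answer; the EncodingDemo class name is made up):
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // UTF-8 needs two bytes for Ä (U+00C4): prints C3 84
        for (byte b : "Ä".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        // ISO-8859-1 encodes it as the single byte C4, matching what the database returned
        for (byte b : "Ä".getBytes(StandardCharsets.ISO_8859_1)) {
            System.out.printf("%02X ", b & 0xFF);
        }
    }
}
So a lone 0xC4 byte means the data was written as ISO-8859-1 (or a similar single-byte codepage), not UTF-8.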
Please read Joel's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more info.
EDIT: Christian found the answer himself; it turned out to be a problem in the Cocoon 3 SAX component (I guess in the alpha-3 version). If you pass an XML document as a String into the XMLGenerator class, something goes wrong during SAX parsing, causing this mess.
I looked up the code to find the actual problem in Cocoon-stax:
if (XMLGenerator.this.logger.isDebugEnabled()) {
    XMLGenerator.this.logger.debug("Using a string to produce SAX events.");
}
XMLUtils.toSax(new ByteArrayInputStream(this.xmlString.getBytes()), XMLGenerator.this.getSAXConsumer());
As you can see, the getBytes() call creates a byte array using the JRE's default encoding, which then fails to parse: the XML declares itself to be UTF-8, whereas the data is now in bytes again, likely in your Windows codepage.
As a workaround, one can use the following:
new org.apache.cocoon.sax.component.XMLGenerator(xmlInput.getBytes("UTF-8"),
"UTF-8");
This will trigger the right internal actions (as Christian found out by experimenting with the API).
I've opened an issue in Apache's bug tracker.
EDIT 2: The issue is fixed and will be included in an upcoming release.
The C4 you see on that page refers to the Unicode code point, U+00C4. The byte sequence used to represent such a code point in UTF-8 is NOT "\xC4". What you want is what's in the "UTF-8 (hex.)" column, namely "\xC3\x84".
Therefore, your data is not in UTF-8.
You can read about how data is encoded in UTF-8 here.
I'm running Windows 7 with TextPad as the text editor for manually building the XML data file. I was getting the MalformedByteSequenceException. The encoding specified in my XML file was UTF-8. After poking around, I found that my editor had a tool "Tools > Convert to DOS". I did that, re-saved the file, the exception went away, and my code ran fine.
I then looked at the default encoding for that file type in my editor. It was ASCII, though when I changed the XML encoding parameter to ASCII, I got a different MalformedByteSequenceException.
So on Windows systems, you might try keeping the XML encoding at UTF-8 but saving the file DOS-encoded. I did not dig any further into why this works.

How to compile a java source file which is encoded as "UTF-8"?

I saved my Java source file specifying its encoding as UTF-8 (using Notepad; by default Notepad's encoding is ANSI) and then I tried to compile it using:
javac -encoding "UTF-8" One.java
but it gave an error message:
One.java:1: illegal character: \65279
?public class One {
^
1 error
Is there any other way, I can compile this?
Here is the source:
public class One {
    public static void main(String[] args) {
        System.out.println("HI");
    }
}
Your file is being read as UTF-8, otherwise a character with value "65279" could never appear. javac expects your source code to be in the platform default encoding, according to the javac documentation:
If -encoding is not specified, the platform default converter is used.
Decimal 65279 is hex FEFF, which is the Unicode Byte Order Mark (BOM). It's unnecessary in UTF-8, because UTF-8 is always encoded as an octet stream and doesn't have endianness issues.
Notepad likes to stick in BOMs even when they're not necessary, but some programs don't like finding them. As others have pointed out, Notepad is not a very good text editor. Switching to a different text editor will almost certainly solve your problem.
Open the file in Notepad++ and select Encoding -> Convert to UTF-8 without BOM.
This isn't a problem with your text editor, it's a problem with javac!
The Unicode spec says the BOM is optional in UTF-8; it doesn't say it's forbidden!
If a BOM can be there, then javac HAS to handle it, but it doesn't. Actually, using the BOM in UTF-8 files IS useful to distinguish an ANSI-encoded file from a Unicode-encoded file.
The proposed solution of removing the BOM is only a workaround and not the proper solution.
This bug report indicates that this "problem" will never be fixed : https://web.archive.org/web/20160506002035/http://bugs.java.com/view_bug.do?bug_id=4508058
Since this thread is in the top 2 google results for the "javac BOM" search, I'm leaving this here for future readers.
Try javac -encoding UTF8 One.java
Without the quotes and it's UTF8, no dash.
See this forum thread for more links
See below.
For example, consider a program using Telugu words.
Program (UnicodeEx.java):
class UnicodeEx {
    public static void main(String[] args) {
        double ఎత్తు = 10;
        double వెడల్పు = 25;
        double దీర్ఘ_చతురస్ర_వైశాల్యం;
        System.out.println("The Value of Height = " + ఎత్తు + " and Width = " + వెడల్పు + "\n");
        దీర్ఘ_చతురస్ర_వైశాల్యం = ఎత్తు * వెడల్పు;
        System.out.println("Area of Rectangle = " + దీర్ఘ_చతురస్ర_వైశాల్యం);
    }
}
This is the program, saved as "UnicodeEx.java" with the encoding changed to "unicode".
How to compile:
javac -encoding "unicode" UnicodeEx.java
How to execute:
java UnicodeEx
The Value of Height = 10.0 and Width = 25.0
Area of Rectangle = 250.0
I know this is a very old thread, but I was experiencing a similar problem with PHP instead of Java, and Google took me here. I was writing PHP in Notepad++ (not plain Notepad) and noticed that an extra blank line appeared every time I called an include file. Firebug showed that there was a 65279 character in those extra lines.
Actually, both the main PHP file and the included files were encoded in UTF-8. However, Notepad++ also has an option to encode as "UTF-8 without BOM". This solved my problem.
Bottom line: an editor may insert this extra BOM character here and there unless you instruct it to use UTF-8 without BOM.
Works fine here, even edited in Notepad. Moral of the story: don't use Notepad. There's likely an unprintable character in there that Notepad is either inserting or happily hiding from you.
I had the same problem. To solve it, I opened the file in a hex editor and found three "invisible" bytes at the beginning of the file. I removed them, and compilation worked.
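The same cleanup can be done programmatically (a minimal sketch; the StripBom class name and the command-line argument are placeholders):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        byte[] bytes = Files.readAllBytes(path);
        // The UTF-8 BOM is the three-byte sequence EF BB BF at the start of the file.
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
    }
}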
Open your file with WordPad or any other editor except Notepad.
Select Save As type as Text Document - MS-DOS Format
Reopen the Project
To extend the existing answers with a solution for Linux users:
To remove the BOM on all .java files at once, go to your source directory and execute
find -iregex '.*\.java' -type f -print0 | xargs -0 dos2unix
Requires find, xargs and dos2unix to be installed, which should be included in most distributions. The first statement finds all .java files in the current directory recursively, the second one converts each of them with the dos2unix tool, which is intended to convert line endings but also removes the BOM.
The line-ending conversion should have no effect, since the files should already be in Linux \n format on Linux if your version control is configured correctly, but be warned that dos2unix converts line endings as well, in case you have one of those rare cases where that is not intended.
In IntelliJ IDEA (Settings > Editor > File Encodings), the project encoding was "windows-1256". So I used the following code to convert static strings to UTF-8:
protected String persianString(String persianText) throws UnsupportedEncodingException {
    return new String(persianText.getBytes("windows-1256"), "UTF-8");
}
Now it is OK!
Depending on the file encoding, you should change "windows-1256" to the proper one.
