How can I determine the correct encoding?

How can I determine the correct encoding? - java

I was trying to print some chinese characters as below but this won't work. I suppose there should be some sort of encoding to be done. Can you please help mo on this?
public static void main(String[] args)
{
String myString = "奥妙洗衣粉";
System.out.println(myString);
// Output in eclipse: Some characters cannot be mapped using Cp1252 character encoding.
// Either change the encoding or remove the characters which are not supported
// by the Cp1252 character encoding.
}
EDIT: How can I do it (change/apply the encoding) programatically before printing the string?

Windows-1252 character encoding does not support the characters in your code. Use UTF-8.

You can change the default encoding in file output:
new PrintWriter(fileName, "UTF-8");
Another problem, the compiler can read only ASCII characters (but the JVM could read others as well). This means, strings cannot be built from foreign characters. The proper way to do it, determine the character's Unicode - 4 char hexadecimal code - and build like this:
String myString = "\u3b12\uc2d4hello"
This creates a string with the first char as the code 3b12 (escaped with the \u Unicode character) + c2d4 + hello.
Here's my output:
㬒싔hello

the console in eclipse does not support those characters by default.
here is a tutorial on that

the problem is that the eclipse console encoding is not utf-8.
you should change the console encoding,
i hope this link help you : http://www.mkyong.com/java/how-to-display-chinese-character-in-eclipse-console/

Related

Java String some characters not showing [duplicate]

I have a problem with turkish special characters on different machines. The following code:
String turkish = "ğüşçĞÜŞÇı";
String test1 = new String(turkish.getBytes());
String test2 = new String(turkish.getBytes("UTF-8"));
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
System.out.println(test1);
System.out.println(test2);
System.out.println(test3);
On a Mac the three Strings are the same as the original string. On a Windows machine the three lines are (Printed with the Netbeans 6.7 console):
?ü?ç?Ü?Ç?
ÄŸÃ¼ÅŸÃ§ÄžÃœÅžÃ‡Ä±
?ü?ç?Ü?Ç?
I don't get the problem.

String test1 = new String(turkish.getBytes());
You're taking the Unicode String including the Turkish characters, and turning it into bytes using the default encoding (using the default encoding is usually a mistake). You're then taking those bytes and decoding them back into a String, again using the default encoding. The result is you've achieved nothing (except losing any characters that don't fit in the default encoding); whether you have put a String through an encode/decode cycle has no effect on what the following System.out.println(test1) does because that's still printing a String and not bytes.
String test2 = new String(turkish.getBytes("UTF-8"));
Encodes as UTF-8 and then decodes using the default encoding. On Mac the default encoding is UTF-8 so this does nothing. On Windows the default encoding is never UTF-8 so the result is the wrong characters.
String test3 = new String(turkish.getBytes("UTF-8"), "UTF-8");
Does precisely nothing.
To write Strings to stdout with a different encoding than the default encoding, you'd create a encoder something like new OutputStreamWriter(System.out, "cp1252") and send the string content to that.
However in this case, it looks like the console is using Windows code page 1252 Western European (+1 ATorres). There is no encoding mismatch issue here at all, so you won't be able to solve it by re-encoding strings!
The default encoding cp1252 matches the console's encoding, it's just that cp1252 doesn't contain the Turkish characters ğşĞŞı at all. You can see the other characters that are in cp1252, üçÜÇ, come through just fine. Unless you can reconfigure the console to use a different encoding that does include all the characters you want, there is no way you'll be able to output those characters.
Presumably on a Turkish Windows install, the default code page will be cp1254 instead and you will get the characters you expect (but other characters don't work). You can test this by changing the ‘Language to use for non-Unicode applications’ setting in the Regional and Language Options Control Panel app.
Unfortunately no Windows locale uses UTF-8 as the default code page. Putting non-ASCII output onto the console with the stdio stream functions is not something that's really reliable at all. There is a Win32 API to write Unicode directly to the console, but unfortunately nothing much uses it.

Don't rely on the console, or on the default platform encoding. Always specify the character encoding for calls like getBytes and the String constructor taking a byte array, and if you want to examine the contents of a string, print out the unicode value of each character.
I would also advise either restricting your source code to use ASCII (and \uxxxx to encode non-ASCII characters) or explicitly specifying the character encoding when you compile.
Now, what bigger problem are you trying to solve?

You may be dealing with different settings of the default encoding.
java -Dfile.encoding=utf-8
versus
java -Dfile.encoding=something else
Or, you may just be seeing the fact that the Mac terminal window works in UTF-8, and the Windows DOS box does not work in UTF-8.
As per Mr. Skeet, you have a third possible problem, which is that you are trying to embed UTF-8 chars in your source. Depending on the compiler options, you may or may not be getting what you intend there. Put this data in a properties file, or use \u escapes.
Finally, also per Mr. Skeet, never, ever call the zero-argument getBytes().

If you are using AspectJ compiler do not forget to set it's encoding to UTF-8 too. I have struggled to find this for hours.

Unable to convert Hyphen to UTF-8

I'm reading some text that I got from Wikipedia.
The text contains hyphen like in this String: "Australia for the [[2011–12 NBL season]]"
I'm trying to do is to convert the text to utf-8, using this code:
String myStr = "Australia for the [[2011–12 NBL season]]";
new String(myStr.getBytes(), "utf-8");
The result is:
Australia for the [[2011�12 NBL season]]
The problem is that the hyphen is not being mapped correctly.
The hyphen value in bytes is [-106] (I have no idea what to do with it...)
Do you know how to convert it to a hyphen that utf-8 encoding recognizes?
I would be happy to replace other special characters as well by some general code, but also specific "hyphens" replacement code will help.

The problem code point is U+2013 EN DASH which can be represented with the escape \u2013.
Try replacing the string with "2011\u201312". If this works then there is a mismatch between your editor character encoding and the one the compiler is using.
Otherwise, the problem is with the transcoding operation from string to whatever device you are writing to. Anywhere where you convert from bytes to chars or chars to bytes is a potential point of corruption when the wrong encoding is used; this can include System.out.
Note: Java strings are always UTF-16.
new String(myStr.getBytes(), "utf-8");
This code takes UTF-16, converts it to the platform encoding, which might be anything, then pretends its UTF-8 and converts it back to UTF-16. At best, the platform encoding is UTF-8 and this is a no-op; otherwise it will just corrupt the data.
This is how you create UTF-8 in Java:
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // Java 7
You can read more here.

This is because the source code (editor) is maybe in Windows-1252 (extended Latin-1), and it is compiled with another encoding UTF-8 (compiler). These two encodings must be the same, or use in the source: "\u00AD", the ASCII representation of the hyphen.

why '?' appears as output while Printing unicode characters in java

While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter

Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.

You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using through out and try to determine whether you are reading, storing or printing the value correctly.

Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.

Java's default behaviour when reading an invalid unicode character is to replace it with the Replacement Character (\uFFFD). This character is often rendered as a question mark.
In your case, the text you're reading is not encoded as unicode, it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).

I wrote an Open Source Library that has a utility that converts any String to Unicode sequence and vise-versa. It helps to diagnose such issues. So for instance to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library and where to download it and how to use it at Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison See the paragraph "String Unicode converter"

Replacing Unicode character codes with characters in String in Java

I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).

Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.

I remember giving the same response last week, use org.apache.commons.lang.StringEscapeUtils.

If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings in source, remember to also choose a suitable file encoding like UTF8.

How to parse word-created special chars in java

I am trying to parse some word documents in java. Some of the values are things like a date range and instead of showing up like Startdate - endDate I am getting some funky characters like so
StartDate ΓÇô EndDate
This is where word puts in a special character hypen. Can you search for these characters and replace them with a regular - or something int he string so that I can then tokenize on a "-" and what is that character - ascii? unicode or what?
Edited to add some code:
String projDateString = "08/2010 ΓÇô Present"
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
println ("S: " + s)
println("projDatestring: " + projDateString)
Outputs the following:
S: 08/2010 ΓÇô Present
projDatestring: 08/2010 ΓÇô Present
Also, using the same projDateString above, if I do:
projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");
and then print out projDateString, it still prints as
projDatestring: 08/2010 ΓÇô Present

You are probably getting Windows-1252 which is a character set, not an encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)
Windows-1252, formerly "Cp1252" is almost Unicode, but keeps some characters that came from Cp1252 in their same places. The En Dash is character 150 (0x96) which falls within the Unicode C1 reserved control character range and shouldn't be there.
You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.
There are quite a few other character that MS has in the 0x80 to 0x9f range, which is reserved in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.
Edit: By the way, Java uses Unicode code point values for characters internally. UTF-8 is an encoding, which Java uses as the default encoding when writing Strings to files or network connections.
Say you have
String stuff = MSWordUtil.getNextChunkOfText();
Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to
File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file
By default, as you read byte buffers from the file and make Strings out of them, Java will treat it as UTF-8 encoded text. There are ways, as Lord Torgamus says, to tell what encoding should be used, but without doing that Windows-1252 is pretty close to UTF-8, except there are those pesky characters that are in the C1 control range.
After getting some String like stuff above, you won't find \u2013 or \u2014 in it, you'll find 0x96 and 0x97 instead.
At that point you should be able to do
stuff.replaceAll("\u0096", "\u2013");
I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.

s = s.replace( (char)145, (char)'\'');
s = s.replace( (char)8216, (char)'\''); // left single quote
s = s.replace( (char)146, (char)'\'');
s = s.replace( (char)8217, (char)'\''); // right single quote
s = s.replace( (char)147, (char)'\"');
s = s.replace( (char)148, (char)'\"');
s = s.replace( (char)8220, (char)'\"'); // left double
s = s.replace( (char)8221, (char)'\"'); // right double
s = s.replace( (char)8211, (char)'-' ); // em dash??
s = s.replace( (char)150, (char)'-' );
http://www.coderanch.com/how-to/java/WeirdWordCharacters

Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,
Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.
Windows-1252 is recognized by the IANA Charset Registry
Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None
so you it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.

Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
If I remember correctly from when I did some work on character encodings in Java, String instances always internally use UTF-8; so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write
s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');
where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.