Java Unicode Confusion - java

Hey all, I have only just started attempting to learn Java and have run into something that is really confusing!
I was typing out an example from the book I am using. It is to demonstrate the
char data type.
The code is as follows :
public class CharDemo
{
public static void main(String [] args)
{
char a = 'A';
char b = (char) (a + 1);
System.out.println(a + b);
System.out.println("a + b is " + a + b);
int x = 75;
char y = (char) x;
char half = '\u00AB';
System.out.println("y is " + y + " and half is " + half);
}
}
The bit that is confusing me is the statement, char half = '\u00AB'. The book states that \u00AB is the code for the symbol '1/2'. As described, when I compile and run the program from cmd the symbol that is produced on this line is in fact a '1/2'.
So everything appears to be working as it should. I decided to play around with the code and try some different unicodes. I googled multiple unicode tables and found none of them to be consistent with the above result.
In every one I found it stated that the code \u00AB was not for '1/2' and was in fact for this:
http://www.fileformat.info/info/unic...r/ab/index.htm
So what character set is Java using? I thought Unicode was supposed to be just that: Uni, only one. I have searched for hours and nowhere can I find a character set that states \u00AB is equal to a 1/2, yet this is what my Java compiler interprets it as.
I must be missing something obvious here! Thanks for any help!

It's a well-known problem with console encoding mismatch on Windows platforms.
Java Runtime expects that encoding used by the system console is the same as the system default encoding. However, Windows uses two separate encodings: ANSI code page (system default encoding) and OEM code page (console encoding).
So, when you try to write Unicode character U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK to the console, Java runtime expects that console encoding is the ANSI encoding (that is Windows-1252 in your case), where this Unicode character is represented as 0xAB. However, the actual console encoding is the OEM encoding (CP437 in your case), where 0xAB means ½.
Therefore printing data to Windows console with System.out.println() produces wrong results.
To get correct results you can use System.console().writer().println() instead.
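The mismatch described above can be reproduced without a Windows console. The following is a minimal sketch, assuming the JDK in use provides the windows-1252 and IBM437 (CP437) charsets, as common JDK builds do. The class and method names are made up for illustration.

```java
import java.io.Console;
import java.nio.charset.Charset;

class ConsoleMismatchDemo {
    // Encode text with the charset Java assumes the console uses,
    // then decode the bytes with the charset the console really uses.
    static String asShownByConsole(String text, String assumed, String actual) {
        byte[] bytes = text.getBytes(Charset.forName(assumed));
        return new String(bytes, Charset.forName(actual));
    }

    public static void main(String[] args) {
        String shown = asShownByConsole("\u00AB", "windows-1252", "IBM437");
        // U+00AB becomes byte 0xAB, which CP437 displays as U+00BD (one half).
        System.out.printf("U+00AB is displayed as U+%04X%n", (int) shown.charAt(0));

        // System.console() returns null when no interactive console is attached
        // (for example when output is redirected), so check before using it.
        Console console = System.console();
        if (console != null) {
            console.writer().println("half is \u00AB");
        }
    }
}
```

Note the null check: code that calls System.console().writer() unconditionally fails with a NullPointerException when run from an IDE or with redirected output.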

The \u00ab character is not the 1/2 character; see this definitive code page from the Unicode.org website.
What you are seeing is (I think) a consequence of using the System.out PrintStream on a platform where the default character encoding is not UTF-8 or Latin-1. Maybe it is some Windows character set as suggested by @axtavt's answer? (It also has a plausible explanation of why \u00ab is displayed as 1/2 ... and not some "splat" character.)
(In Unicode and Latin-1, \u00BD is the code point for the 1/2 character.)
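If you would rather check this from Java than from a chart, the JDK can report the Unicode name of a code point. A small sketch (the class name is illustrative; Character.getName is available from Java 7 on):

```java
class CodePointNames {
    // Returns the Unicode name of a code point, as recorded in the JDK's Unicode data.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println("U+00AB = " + nameOf(0x00AB));
        System.out.println("U+00BD = " + nameOf(0x00BD));
        // The numeric value stored in the char from the question.
        System.out.println((int) '\u00AB');
    }
}
```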

0xAB is 1/2 in good old Codepage 437, which is what Windows terminals will use by default, no matter what codepage you actually set.
So, in fact, the char value represents the "«" character to a Java program, and if you render that char in a GUI or run it on a sane operating system, you will get that character. If you want to see proper output in Windows as well, switch your Font settings in CMD away from "Raster Fonts" (click top-left icon, Properties, Font tab). For example, with Lucida Console, I can do this:
C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is ½
C:\Users\Documents>chcp 1252
Active code page: 1252
C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is «
C:\Users\Documents>chcp 437
Active code page: 437

One great thing about Java is that it is Unicode-based. That means you can use characters from writing systems other than the English alphabet (e.g. Chinese or math symbols), not just in data strings, but in function and variable names too.
Here's example code using Unicode characters in class names and variable names.
class 方 {
String 北 = "north";
double π = 3.14159;
}
class UnicodeTest {
public static void main(String[] arg) {
方 x1 = new 方();
System.out.println( x1.北 );
System.out.println( x1.π );
}
}
Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. Back then it was felt that 16-bits would be more than enough to encode all the characters that would ever be needed. With that in mind Java was designed to use UTF-16. In fact, the char data type was originally used to be able to represent a 16-bit Unicode code point.
The UTF-8 charset is specified by RFC 2279;
The UTF-16 charsets are specified by RFC 2781
The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.
When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
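The byte-order-mark rules quoted above can be observed by encoding a single letter with each of the three charsets. This is only a sketch; the class and helper names are invented for the example.

```java
import java.nio.charset.Charset;

class Utf16BomDemo {
    // Encodes text in the named charset and formats the bytes as hex pairs.
    static String encodeHex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" writes a big-endian byte-order mark (FE FF) first;
        // "UTF-16BE" and "UTF-16LE" write no mark at all.
        System.out.println("UTF-16   : " + encodeHex("A", "UTF-16"));
        System.out.println("UTF-16BE : " + encodeHex("A", "UTF-16BE"));
        System.out.println("UTF-16LE : " + encodeHex("A", "UTF-16LE"));
    }
}
```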

Well, when I use that code I get the << as I should and 1/2 for \u00BD as it should be.
http://www.unicode.org/charts/

Related

Java byte stream non english characters

I read this code. As the content of xanadu.txt I use "test". The file is 4 bytes in size. If I use the debugger to run out.write(c) one byte at a time, and after each step open the file outagain.txt (with Notepad), I see successively: t-->te-->tes-->test. OK
BUT if we change the content of the source file (xanadu.txt) to the Greek (or other language) equivalent of test (τέστ), then the file is now 8 bytes in size (I think because UTF we have 2 bytes per character). When debugging again, a meaningless hieroglyphic character appears each time out.write(c) runs. When the last (8th) byte is written, the original Greek word (τέστ) suddenly appears. Why? The same happens if we choose the console stream (in NetBeans) as the destination, but in that case the strange characters remain at the end when debugging, though not when we run it normally(!!!).
As you observe, a single char (16 bits in Java internal representation) turns into a variable number of bytes in a byte-stream representation, in particular UTF-8.
(Some characters occupy two char values; I shall ignore those, but the answer still applies, only more so)
If you're outputting 'byte-wise' as in your experiment, in some cases you'll have output a fractional character. That is an illegal sequence that makes no sense; some software (such as Notepad) will nevertheless try to make sense of it. That may even include guessing at the encoding. For example, I don't know this to be the case, but if the file is not valid UTF-8 in its first several bytes -- and we know your half-a-character output is not valid UTF-8 -- then maybe Notepad guesses at an entirely different encoding, one that treats the byte sequence as a valid representation of entirely different characters.
tl;dr - garbage out, garbage displayed.
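To make the "fractional character" concrete, the sketch below decodes only the first few bytes of the UTF-8 form of the Greek word from the question. It assumes the file was UTF-8, and it shows how Java's own decoder reacts (it substitutes U+FFFD); Notepad's guessing may differ. The names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class PartialUtf8Demo {
    // Decodes only the first `count` bytes of the UTF-8 form of `text`.
    static String decodePrefix(String text, int count) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, count, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Greek word from the question, written with escapes so the
        // source file itself stays plain ASCII.
        String word = "\u03C4\u03AD\u03C3\u03C4";
        int total = word.getBytes(StandardCharsets.UTF_8).length;
        for (int n = 1; n <= total; n++) {
            // Odd prefixes end in half a character, decoded as U+FFFD.
            System.out.println(n + " byte(s): " + decodePrefix(word, n));
        }
    }
}
```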
Modern computers work from one gigantic table, Unicode, with room for over a million characters (code points U+0000 through U+10FFFF). Each character is identified by a single number, which fits comfortably in 32 bits. All characters you can think of are in here; from the basic 'test' to 'τέστ' to snowman ( ☃ ), to special non-visible ones that indicate a right-to-left spelled word is coming up, to a bunch of ligatures (such as ﬀ, which is a single character representing the ff ligature), to emoji, coloured and all: 😀.
This entire answer is essentially a sequence of these 32-bit numbers. But how would you like to store these in a file? That's where 'encoding' comes in. There are many, many encodings, and a crucial problem is that (almost) no encodings are 'detectable'.
It's like this:
If a complete stranger walks up to you and says "Hey!", what language are they speaking? Probably english. But maybe dutch, which also has 'Hey!'. It could also be japanese and they're not even greeting you, they're saying 'Yes' (more or less). How would you know?
The answer is, either from external context (if you're in the middle of Newcastle, UK, it's probably english), or because they explicitly tell you, but one is, well, external, and the other isn't common practice.
Text files are the same way.
They just contain the encoded text; they do not indicate what encoding it is. That means you need to tell the editor, or your newBufferedReader in java, or your browser when saving that txt content, what encoding you want. However, because that's annoying to have to do every time, most systems have a default choice. Some text editors even try to figure out what encoding it is, but just like that person saying 'Hey!' to you might be speaking english or japanese, with wildly different interpretations, the same happens with this semi-intelligent guessing at charset encoding.
This gets us to the following explanation:
You write τέστ in your editor and hit 'save'. What is your editor doing? Is it saving in UTF-16? UTF-8? UCS-4? ISO-8859-7? Completely different files are produced for all of these encodings! Given that it made 8 bytes, that means it's UTF-16 or UTF-8. Probably UTF-8.
You then copy these bytes over one by one, which is problematic: in UTF-8, a single byte can be half of a character. (You said UTF-8 stores characters as 2 bytes; that's not true. UTF-8 stores every character as 1, 2, 3, or 4 bytes; it's variable length per character! Each character in τέστ happens to be stored as 2 bytes, though.) That means if you've copied over, say, 3 bytes, your text editor's ability to guess what it might be is severely hampered: it might guess UTF-8 but then realize that it isn't valid UTF-8 at all (because of that half-of-a-character you ended up with), so it guesses wrong, and shows you gobbledygook.
The lesson to learn here is twofold:
When you want to process characters, use char, Reader, Writer, String, and other character-oriented things.
When you want to process bytes, use byte, byte[], InputStream, OutputStream, and other byte-oriented things.
Never make the mistake of thinking that these two are easily interchangeable, because they are not. Whenever you go from one 'world' to the other, you MUST specify the charset encoding, because if not, java picks 'platform default', which you don't want (because now you have software that depends on an external factor and which cannot be tested. Yikes).
Default to UTF-8 for everything you can.
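Put together, a character-oriented copy with the charset named on both sides could look like the sketch below. The temp-file names echo the tutorial's xanadu.txt and outagain.txt, and the roundTrip helper exists only so the example can check itself; neither is from the original code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CopyCharacters {
    // Copies a text file character by character, naming the charset explicitly
    // for both the Reader and the Writer instead of relying on a default.
    static void copy(Path source, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        }
    }

    // Hypothetical helper: writes text to a temp file, copies it,
    // and returns what the copy contains.
    static String roundTrip(String text) {
        try {
            Path source = Files.createTempFile("xanadu", ".txt");
            Path target = Files.createTempFile("outagain", ".txt");
            Files.write(source, text.getBytes(StandardCharsets.UTF_8));
            copy(source, target);
            String copied = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            Files.delete(source);
            Files.delete(target);
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Greek "test" word, written with escapes to keep the source ASCII.
        System.out.println(roundTrip("\u03C4\u03AD\u03C3\u03C4"));
    }
}
```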
tl;dr
Read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Do not parse text files by octets (bytes). Use classes purpose-built for handling text. For example, use Files and its readAllLines method.
Details
Notice at the bottom of that tutorial page the caution that this is not the proper way to handle text files:
CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. Since xanadu.txt contains character data, the best approach is to use character streams, as discussed in the next section.
Text files may or may not use single octets to represent single characters, such as US-ASCII files. Your example code assumes one octet per character, which works for test as the content but not for τέστ as the content.
As a programmer, you must know from the publisher of your data file what encoding was used in writing the data representing the original text. Generally best to use UTF-8 encoding when writing text.
Write a text file with two lines:
test
τέστ
…and save using a text-editor with an encoding of UTF-8.
Read that file as a collection of String objects.
Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
for ( String line : lines )
{
System.out.println( "line = " + line );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
When run:
line = test
line = τέστ
UTF-16 versus UTF-8
You said:
I think because UTF we have 2 bytes per character)
No such thing as “UTF”.
UTF-16 encoding uses one or two pairs of octets (2 or 4 octets) per character.
UTF-8 encoding uses 1, 2, 3, or 4 octets per character.
Text content such as τέστ can be written to a file using either encoding, UTF-16 or UTF-8. Be aware that UTF-16 is “considered harmful”, and UTF-8 is preferred generally nowadays. Note that UTF-8 is a superset of US-ASCII, so any US-ASCII file is also a UTF-8 file.
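The octet counts are easy to confirm. A brief sketch, with an invented class name, measuring three sample strings in both encodings (UTF-16BE is used so that no byte-order mark is counted):

```java
import java.nio.charset.Charset;

class EncodedLengths {
    // Number of bytes `text` occupies when encoded in the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "test", the Greek word from the question, and one emoji (U+1F600),
        // all written with escapes so the source stays ASCII.
        String[] samples = { "test", "\u03C4\u03AD\u03C3\u03C4", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(s.length() + " char(s): UTF-8 = " + byteLength(s, "UTF-8")
                    + " bytes, UTF-16BE = " + byteLength(s, "UTF-16BE") + " bytes");
        }
    }
}
```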
Characters as code points
If you want to examine each character in text, treat them as code point numbers.
Avoid using the char type to represent a single character. That type is unable to represent even half of the characters defined in Unicode: it holds one UTF-16 code unit, and for this purpose it is effectively legacy.
We can interrogate each character in our example file seen above by adding these two lines of code.
IntStream codePoints = line.codePoints();
codePoints.forEach( System.out :: println );
Like this:
Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
for ( String line : lines )
{
System.out.println( "line = " + line );
IntStream codePoints = line.codePoints();
codePoints.forEach( System.out :: println );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
When run:
line = test
116
101
115
116
line = τέστ
964
941
963
964
If you are not yet familiar with streams, convert IntStream to a collection, such as a List of Integer objects.
Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
for ( String line : lines )
{
System.out.println( "line = " + line );
List < Integer > codePoints = line.codePoints().boxed().collect( Collectors.toList() );
for ( Integer codePoint : codePoints )
{
System.out.println( "codePoint = " + codePoint );
}
}
}
catch ( IOException e )
{
e.printStackTrace();
}
When run:
line = test
codePoint = 116
codePoint = 101
codePoint = 115
codePoint = 116
line = τέστ
codePoint = 964
codePoint = 941
codePoint = 963
codePoint = 964
Given a code point number, we can determine the intended character.
String s = Character.toString( 941 ) ; // έ character. Character.toString( int ) requires Java 11 or later.
Be aware that some textual characters may be represented as multiple code points, such as a letter with a diacritical. (Text-handling is not a simple matter.)
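As a small illustration of that last point, an accented letter can arrive either as one precomposed code point or as a base letter plus a combining mark. This sketch (names are illustrative) counts code points before and after NFC normalization with java.text.Normalizer:

```java
import java.text.Normalizer;

class CombiningMarks {
    // Counts Unicode code points rather than char values.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Composes base letters and combining marks where a precomposed form exists (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        System.out.println("decomposed: " + codePoints(decomposed) + " code points");
        System.out.println("composed  : " + codePoints(compose(decomposed)) + " code point(s)");
    }
}
```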

When I assign a char (from a literal or otherwise), what does "Java's internal encoding is UTF-16" mean here? In what encoding is the value stored in the char?

//non-utf source file encoding
char ch = 'ё'; // some number within 0..65535 is stored in char.
System.out.println(ch); // the same number output to
"java internal encoding is UTF16". Where does it meaningfully come into play in that?
Besides, I can perfectly put into char one utf16 codeunit from surrogate range (say '\uD800') - making this char perfectly invalid Unicode. And let us stay within BMP, so to avoid thinking that we might have 2 chars (codeunits) for a supplementary symbol (thinking this way sounds to me that "char internally uses utf16" is complete nonsense). But maybe "char internally uses utf16" makes sense within BMP?
I could understand it if it were like this: my source code file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 encoding (what really happens), then this number is automatically converted to another number (from the windows-1251 number to the utf-16 number), which is NOT taking place (am I right?! this I could understand as "internally uses UTF-16"). And then that stored number is written out (really it is written as given, as from win-1251, with none of my "imaginary conversion from internal utf16 to output\console encoding" taking place), and the console shows it, converting from number to glyph using the console encoding (what really happens).
So this "UTF16 encoding used internally" is NEVER USED ANYHOW ??? char just stores any number (in [0..65535]), and besides specific range and being "unsigned" has NO DIFFERENCE FROM int (in scope of my example of course)???
P.S. Experimentally, code above with UTF-8 encoding of source file and console outputs
й
1081
with win-1251 encoding of source file and UTF-8 in console outputs
�
65533
Same output if we use String instead of char...
String s = "й";
System.out.println(s);
In API, all methods taking char as argument usually never take encoding as argument. But methods taking byte[] as argument often take encoding as another argument. Implying that with char we don't need encoding (meaning that we know this encoding for sure). But **how on earth do we know in what encoding something was put into char???
If char is just a storage for a number, we do need to understand what encoding this number originally came from?**
So char vs byte is just that char has two bytes of something with UNKNOWN encoding (instead of one byte of UNKNOWN encoding for a byte).
Given some initialized char variable, we don't know what encoding to use to correctly display it (to choose the correct console encoding for output), and we cannot tell what the encoding of the source file was where it was initialized with a char literal (not counting cases where various encodings and utf would be compatible).
Am I right, or am I a big idiot? Sorry for asking in latter case :)))
SO research shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object
correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.
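A short sketch of that distinction, assuming the JDK provides the windows-1251 charset (common builds do): the char holds one fixed number, and an encoding only enters the picture when that char is turned into bytes. The class and helper names are made up for the example.

```java
import java.nio.charset.Charset;

class CharToBytes {
    // Encodes text in the named charset and formats the bytes as hex pairs.
    static String encodeHex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char ch = '\u0439'; // CYRILLIC SMALL LETTER SHORT I
        // The value in the char is the UTF-16 code unit, whatever the source file encoding was.
        System.out.println("char value: " + (int) ch);
        // The bytes differ only once an encoding is chosen for output.
        for (String cs : new String[] { "UTF-8", "UTF-16BE", "windows-1251" }) {
            System.out.println(cs + ": " + encodeHex(String.valueOf(ch), cs));
        }
    }
}
```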
Surrogate pairs are only meaningful in a character sequence. A single char can not hold a character value outside the BMP. This is where the character abstraction breaks down.
Unicode is a system of expressing textual data as codepoints. These are typically characters, but not always. A Unicode codepoint is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a code unit. (For example UTF-8 is encoded as 8-bit bytes, and UTF-16 is encoded as 16-bit words.)
While the first version of Unicode only allowed code points in the range 0hex ... FFFFhex, in Unicode 2.0, they changed the range to 0hex to 10FFFFhex.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less or equal to FFFFhex. For larger codepoints, the UTF-16 representation consists of 2 16-bit values; a so-called surrogate pair. And that will fit into 2 Java chars. So in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (0hex through FFFFhex). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. A common case is the code points for emoji.
If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly ... that is to deal with the surrogate pair cases.
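For instance, here is a minimal sketch of how those methods behave on a single emoji (U+1F600), which lies outside plane zero. The class and helper names are illustrative.

```java
class SurrogatePairs {
    // Number of char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println("char count       : " + codeUnits(smiley));
        System.out.println("code point count : " + codePoints(smiley));
        System.out.printf("code point       : U+%X%n", smiley.codePointAt(0));
        System.out.println("starts with high surrogate: "
                + Character.isHighSurrogate(smiley.charAt(0)));
    }
}
```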
So how does this relate to UTF-8, windows-1251 and so on?
Well these are 8-bit character encodings that are used at the OS level in text files and so on. When you read a file using a Java Reader your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer) you transcode in the other direction.
This doesn't always work.
Many character encodings such as windows-1251 are not capable of representing the full range of Unicode codepoints. So, if you attempt to write (say) a CJK character via a Writer configured for windows-1251, you will get ? characters instead.
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa) then the transcoding is liable to give garbage. This phenomenon is so common it has a name: mojibake.
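Both failure modes can be shown in a few lines. This is a sketch that assumes the JDK provides windows-1251; the misread helper is invented for the example and simply encodes with one charset and decodes with another.

```java
import java.nio.charset.Charset;

class TranscodingFailures {
    // Encodes with one charset and decodes with another, as a mismatched reader would.
    static String misread(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        String shortI = "\u0439"; // CYRILLIC SMALL LETTER SHORT I
        // UTF-8 bytes read as windows-1251: two unrelated characters (mojibake).
        System.out.println(misread(shortI, "UTF-8", "windows-1251"));
        // windows-1251 byte read as UTF-8: malformed input, replaced by U+FFFD.
        System.out.println((int) misread(shortI, "windows-1251", "UTF-8").charAt(0));
        // A CJK character has no windows-1251 byte, so encoding substitutes '?'.
        System.out.println(misread("\u4E2D", "windows-1251", "windows-1251"));
    }
}
```

The second case prints 65533, the same number reported in the question's experiment with a windows-1251 source file treated as UTF-8.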
You asked:
Does that mean that in char ch = 'й'; literal 'й' is always converted to utf16 from whatever encoding source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.

How do I convert a single character code to a `char` given a character set?

I want to convert a decimal code to its ASCII character, and this code returns unexpected results. Here is the code I am using.
public static void main(String[] args) {
char ret= (char)146;
System.out.println(ret);// returns nothing.
}
I expect to get character single "'" as per http://www.ascii-code.com/
Anyone came across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define code point 146 (and another reference just because). So that's already asking for trouble. You can see this also if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", but this name is almost completely obsolete these days), which does define that code point as a right single quote (for other charsets that provide this character at 146 see this list and look for encodings that provide it at 0x92), and we can verify this as such:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that page is confusing.
But the big mistake is you can't do what you're trying to do in the way you are doing it. A char in Java is a UTF-16 code unit: a whole code point for BMP characters, or half of one for the supplementary characters > 0xFFFF. A single char corresponds to a BMP point, while a pair of them or an int covers the full range, including the supplementary ones.
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.
However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
As an aside: In your case, 146 (0x92) cast directly to char corresponds to the UTF-16 character "PRIVATE USE TWO" (see also), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and seems to fall in the range of characters reserved for ANSI terminal control (although AFAIK isn't actually used, but it's in that range regardless). I wouldn't be surprised if perhaps browsers in some locales rendered it as a right-single-quote for compatibility, but terminals did something weird with it.
Also, fyi, the official Unicode code point for the right single quote is U+2019 (0x2019). You could reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer to mentions that values 160 to 255 correspond to the ISO-8859-1 (aka Latin-1) table; the values in the range 128 to 159 come from the Windows-specific variant of Latin-1 (ISO-8859-1 leaves that range undefined, to be assigned by the operating system).
Java characters are based on UTF-16, which is itself an encoding of the Unicode table. If you want to specifically refer to the right quote character, you can specify it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).
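The conversion asked about in the title can be wrapped in a small helper. The following is a sketch rather than a standard API: the class and method names are made up, it handles single-byte charsets only, and it returns an int code point so that a result outside the BMP would not be truncated.

```java
import java.nio.charset.Charset;

class SingleByteToChar {
    // Decodes one byte value (0..255) using the named single-byte charset
    // and returns the resulting Unicode code point.
    static int decode(int code, String charsetName) {
        byte[] one = { (byte) code };
        String s = new String(one, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // 146 in windows-1252 is the right single quotation mark.
        System.out.printf("windows-1252: U+%04X%n", decode(146, "windows-1252"));
        // 146 in ISO-8859-1 maps straight to the C1 control U+0092.
        System.out.printf("ISO-8859-1  : U+%04X%n", decode(146, "ISO-8859-1"));
    }
}
```

Using the Charset overload of the String constructor avoids the checked UnsupportedEncodingException that the charset-name overload declares.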

How to parse word-created special chars in java

I am trying to parse some word documents in java. Some of the values are things like a date range and instead of showing up like Startdate - endDate I am getting some funky characters like so
StartDate ΓÇô EndDate
This is where Word puts in a special hyphen character. Can you search for these characters and replace them with a regular - or something in the string, so that I can then tokenize on a "-"? And what is that character: ASCII? Unicode? Or what?
Edited to add some code:
String projDateString = "08/2010 ΓÇô Present"
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
println ("S: " + s)
println("projDatestring: " + projDateString)
Outputs the following:
S: 08/2010 ΓÇô Present
projDatestring: 08/2010 ΓÇô Present
Also, using the same projDateString above, if I do:
projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");
and then print out projDateString, it still prints as
projDatestring: 08/2010 ΓÇô Present
You are probably getting Windows-1252 which is a character set, not an encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)
Windows-1252, formerly "Cp1252" is almost Unicode, but keeps some characters that came from Cp1252 in their same places. The En Dash is character 150 (0x96) which falls within the Unicode C1 reserved control character range and shouldn't be there.
You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.
There are quite a few other character that MS has in the 0x80 to 0x9f range, which is reserved in the Unicode standard, including Em Dash, bullets, and their "smart" quotes.
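Edit: By the way, Java uses Unicode (UTF-16 code units) for characters internally. UTF-8 is an encoding. When writing Strings to files or network connections without naming a charset, Java uses the default encoding, which is UTF-8 from Java 18 onward and the platform's encoding in earlier versions.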
Say you have
String stuff = MSWordUtil.getNextChunkOfText();
Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to
File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file
By default, as you read byte buffers from the file and make Strings out of them, Java will treat it as UTF-8 encoded text. There are ways, as Lord Torgamus says, to tell what encoding should be used, but without doing that Windows-1252 is pretty close to UTF-8, except there are those pesky characters that are in the C1 control range.
After getting some String like stuff above, you won't find \u2013 or \u2014 in it, you'll find 0x96 and 0x97 instead.
At that point you should be able to do
stuff = stuff.replaceAll("\u0096", "\u2013"); // Strings are immutable, so keep the returned value
I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
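The table-driven loop described above might look roughly like this. It is a sketch, not the answerer's actual code: instead of a hand-written array it fills the table by decoding each byte 0x80..0x9F as windows-1252 (assuming the JDK provides that charset), and all names are invented.

```java
import java.nio.charset.Charset;

class C1Repair {
    // Replacement for each char in 0x80..0x9F, built by decoding the same byte
    // value as windows-1252. Entries that decode to U+FFFD (bytes windows-1252
    // leaves undefined) keep the original char.
    private static final char[] TABLE = new char[0x20];

    static {
        Charset cp1252 = Charset.forName("windows-1252");
        for (int i = 0; i < TABLE.length; i++) {
            char original = (char) (0x80 + i);
            char decoded = new String(new byte[] { (byte) original }, cp1252).charAt(0);
            TABLE[i] = (decoded == '\uFFFD') ? original : decoded;
        }
    }

    // Walks the input one char at a time and swaps C1-range chars via the table.
    static String repair(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(c >= 0x80 && c <= 0x9F ? TABLE[c - 0x80] : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = repair("08/2010 \u0096 Present");
        System.out.printf("dash is now U+%04X%n", (int) fixed.charAt(8));
    }
}
```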
s = s.replace( (char)145, (char)'\'');
s = s.replace( (char)8216, (char)'\''); // left single quote
s = s.replace( (char)146, (char)'\'');
s = s.replace( (char)8217, (char)'\''); // right single quote
s = s.replace( (char)147, (char)'\"');
s = s.replace( (char)148, (char)'\"');
s = s.replace( (char)8220, (char)'\"'); // left double
s = s.replace( (char)8221, (char)'\"'); // right double
s = s.replace( (char)8211, (char)'-' ); // en dash (U+2013)
s = s.replace( (char)150, (char)'-' );
http://www.coderanch.com/how-to/java/WeirdWordCharacters
Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,
Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.
Windows-1252 is recognized by the IANA Charset Registry
Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None
so it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
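A minimal sketch of that constructor in use, under the assumption that the raw input bytes really are Windows-1252 (the class and method names are illustrative): decode the bytes into a String, then encode the String as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252ToUtf8 {
    // Decodes Windows-1252 bytes into a String (UTF-16 inside the JVM).
    static String decode(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, Charset.forName("windows-1252"));
    }

    // Re-encodes Windows-1252 bytes as UTF-8 bytes.
    static byte[] transcode(byte[] cp1252Bytes) {
        return decode(cp1252Bytes).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = { '0', '8', ' ', (byte) 0x96, ' ', 'x' }; // 0x96 = en dash in 1252
        String text = decode(input);
        System.out.printf("dash decoded as U+%04X%n", (int) text.charAt(3));
        System.out.println("UTF-8 length: " + transcode(input).length + " bytes");
    }
}
```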
Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
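One possible reading of the exact characters in the question, offered as a guess rather than a diagnosis: the three UTF-8 bytes of an en dash (E2 80 93), when displayed in code page 437, come out as the three characters shown there. The sketch below assumes the JDK provides the IBM437 charset; the names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EnDashMojibake {
    // Encodes text as UTF-8 and shows what a CP437 display would make of the bytes.
    static String utf8ShownAsCp437(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), Charset.forName("IBM437"));
    }

    // Reverses the damage: recover the bytes via CP437, then decode them as UTF-8.
    static String undo(String garbled) {
        return new String(garbled.getBytes(Charset.forName("IBM437")), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = utf8ShownAsCp437("\u2013"); // en dash
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X ", (int) garbled.charAt(i));
        }
        System.out.println();
        System.out.printf("restored: U+%04X%n", (int) undo(garbled).charAt(0));
    }
}
```

If this is what happened, the String in the program may already hold a correct en dash and only the console display is off, in which case replacing \u2013 (not \u0096) is what matters.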
If I remember correctly from when I did some work on character encodings in Java, String instances always internally use UTF-16; so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write
s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');
where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.
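Putting those replacements together, here is a small sketch that also handles the en dash and a few related characters. The class and method names (SmartPunctuation, toAscii) and the choice of which characters to cover are assumptions for illustration, not a complete list of what Word can produce.

```java
class SmartPunctuation {
    // Replace common typographic characters with plain ASCII equivalents.
    static String toAscii(String s) {
        return s.replace('\u201c', '"')    // left double quotation mark
                .replace('\u201d', '"')    // right double quotation mark
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'')   // right single quotation mark
                .replace('\u2013', '-')    // en dash
                .replace('\u2014', '-');   // em dash
    }

    public static void main(String[] args) {
        String fromWord = "\u201cquoted\u201d \u2013 it\u2019s done";
        System.out.println(toAscii(fromWord));
    }
}
```

This only works if the text was decoded with the right charset in the first place; otherwise the String will not contain these code points at all.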

Why new String(bytes, enc).getBytes(enc) does not return the original byte array?

I made the following "simulation":
byte[] b = new byte[256];
for (int i = 0; i < 256; i ++) {
b[i] = (byte) (i - 128);
}
byte[] transformed = new String(b, "cp1251").getBytes("cp1251");
for (int i = 0; i < b.length; i ++) {
if (b[i] != transformed[i]) {
System.out.println("Wrong : " + i);
}
}
For cp1251 this outputs only one wrong byte - at position 25.
For KOI8-R - all fine.
For cp1252 - 4 or 5 differences.
What is the reason for this and how can this be overcome?
I know it is wrong to represent byte arrays as strings in whatever encoding, but it is a requirement of the protocol of a payment provider, so I don't have a choice.
Update: representing it in ISO-8859-1 works, and I'll use it for the byte[] part, and cp1251 for the textual part, so the question remains only out of curiosity
Some of the "bytes" are not supported in the target set - they are replaced with the ? character. When you convert back, ? is normally converted to the byte value 63 - which isn't what it was before.
What is the reason for this
The reason is that character encodings are not necessarily bijective and there is no good reason to expect them to be. Not all bytes or byte sequences are legal in all encodings, and usually illegal sequences are decoded to some sort of placeholder character like '?' or U+FFFD, which of course does not produce the same bytes when re-encoded.
Additionally, some encodings may map several different legal byte sequences to the same string.
It appears that both cp1251 and cp1252 have byte values that do not correspond to defined characters; i.e. they are "unmappable".
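To see which byte values are affected, one can round-trip each value on its own. This is a sketch for single-byte charsets only; the class and method names (RoundTripCheck, lossyBytes) are hypothetical, and the charset names are assumed to be available in the JDK being used.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class RoundTripCheck {
    // Returns the unsigned byte values (0..255) that do not survive
    // decoding and re-encoding in the given single-byte charset.
    static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<>();
        for (int v = 0; v < 256; v++) {
            byte[] in = { (byte) v };
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(v);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        String[] names = { "windows-1251", "windows-1252", "KOI8-R", "ISO-8859-1" };
        for (String name : names) {
            System.out.println(name + " -> " + lossyBytes(Charset.forName(name)));
        }
    }
}
```

ISO-8859-1 assigns a character to every one of the 256 byte values, which is why it round-trips arbitrary binary data while the Windows code pages do not.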
The javadoc for String(byte[], String) says this:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
Other constructors say this:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.
If you see this kind of thing happening in practice it indicates that either you are using the wrong character set, or you've been given some bad data. Either way, it is probably not a good idea to carry on as if there was no problem.
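One way to avoid carrying on silently is to make decoding fail instead of substituting a replacement character. A minimal sketch using CharsetDecoder with CodingErrorAction.REPORT follows; the class and method names (StrictDecode, decodeStrict) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Decode the bytes, throwing CharacterCodingException on malformed or
    // unmappable input instead of replacing it with U+FFFD.
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                 .onMalformedInput(CodingErrorAction.REPORT)
                 .onUnmappableCharacter(CodingErrorAction.REPORT)
                 .decode(ByteBuffer.wrap(bytes))
                 .toString();
    }

    public static void main(String[] args) {
        byte[] bad = { 'o', 'k', (byte) 0xFF };  // 0xFF can never appear in valid UTF-8
        try {
            System.out.println(decodeStrict(bad, StandardCharsets.UTF_8));
        } catch (CharacterCodingException e) {
            System.out.println("Input is not valid in this charset: " + e);
        }
    }
}
```

With this in place, wrong-charset or corrupted input surfaces as an exception at the boundary rather than as stray '?' characters later on.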
I've been trying to figure out if there is a way to get a CharsetDecoder to "preserve" unmappable characters, and I don't think it is possible unless you are willing to implement a custom decoder/encoder pair. But I've also concluded that it does not make sense to even try. It is (theoretically) wrong to map those unmappable characters to real Unicode code points. And if you do, how is your application going to handle them?
Actually there should be one difference: the byte at index 24 of the test array (value 0x98, since b[24] = (byte) (24 - 128)) is converted to a char of value 0xFFFD; that's the "Unicode replacement character", used for untranslatable bytes. When converted back, you get a question mark (value 63).
In CP1251, the code 0x98 is not assigned to any character, which is why Java deems it "untranslatable".
Historical reason: in the ancient character encodings (EBCDIC, ASCII) the first 32 codes have special 'control' meaning and they may not map to readable characters. Examples: backspace, bell, carriage return. Newer character encoding standards usually inherit this and they don't define Unicode characters for every one of the first 32 positions. Java characters are Unicode.
