I am getting a java.nio.charset.UnmappableCharacterException when trying to write a CSV that contains the character µ.
I need to write the file with ASCII encoding so that Excel can open it directly, without the user having to do anything.
How can I convert my µ char into its ASCII equivalent before writing it to the file?
ASCII only uses the lower 7 bits of a character, so there are only 2^7 = 128 possible characters. Of those, only 95 are actually printable (read: visible), and that includes the space character (because it still has a fixed width). Unfortunately, your character is not on that list.
The most used ASCII-compatible character encoding is probably UTF-8 by now. However, UTF-8 requires two bytes to encode the micro sign µ (0xC2 0xB5).
Western Latin, also known as ISO/IEC 8859-1 (since 1987), encodes the character, U+00B5 (Alt+0181), as the single byte 0xB5. The name Western Latin is, however, not used that much. Instead, the extended version called Windows-1252 is used, which has the character at the same position.
You can see the Unicode encoding here and the Windows-1252 here (at the fileformat.info site).
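To make this concrete, here is a minimal sketch (the class name is my own) showing that the micro sign falls outside ASCII but maps to the single byte 0xB5 in windows-1252, assuming the runtime provides that charset (standard desktop JREs do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MicroSignDemo {
    // Encode a string with windows-1252, a superset of ASCII that
    // Excel on Windows typically opens without an import dialog.
    static byte[] encode(String s) {
        return s.getBytes(Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The micro sign cannot be encoded in pure ASCII...
        System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode('\u00B5')); // false
        // ...but in windows-1252 it is the single byte 0xB5.
        byte[] bytes = encode("\u00B5");
        System.out.printf("%d byte(s): 0x%02X%n", bytes.length, bytes[0] & 0xFF);
    }
}
```

If Excel on Windows is the only consumer, encoding the whole CSV as windows-1252 avoids the lossy ASCII substitution entirely.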
Related
// non-UTF source file encoding
char ch = 'ё'; // some number within 0..65535 is stored in the char
System.out.println(ch); // the same number is output to the console
People say "Java's internal encoding is UTF-16". Where does that meaningfully come into play here?
Besides, I can perfectly well put one UTF-16 code unit from the surrogate range (say '\uD800') into a char, making that char invalid Unicode on its own. And let us stay within the BMP, to avoid thinking about the case where a supplementary symbol takes 2 chars (code units) (thinking about that case makes "char internally uses UTF-16" sound like complete nonsense to me). But maybe "char internally uses UTF-16" makes sense within the BMP?
I could understand it if it were like this: my source file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 (which really happens), then this number is automatically converted to another number (from the windows-1251 number to the UTF-16 number), which is NOT taking place (am I right?!); that I could have understood as "internally uses UTF-16". Then the stored number is written out as given (as from win-1251; no imaginary conversion from internal UTF-16 to the output/console encoding takes place), and the console converts the number to a glyph using the console encoding (which really happens).
So this "UTF-16 encoding used internally" is NEVER USED ANYWHERE??? A char just stores some number in [0..65535], and apart from its specific range and being "unsigned", it has NO DIFFERENCE FROM int (in the scope of my example, of course)???
P.S. Experimentally, the code above, with UTF-8 encoding of both the source file and the console, outputs
й
1081
With win-1251 encoding of the source file and UTF-8 in the console, it outputs
�
65533
We get the same output if we use String instead of char:
String s = "й";
System.out.println(s);
In the API, methods taking char as an argument usually never take an encoding as an argument, while methods taking byte[] often take an encoding as another argument. The implication is that with a char we don't need an encoding (meaning we know the encoding for sure). But **how on earth do we know in what encoding something was put into a char? If a char is just storage for a number, don't we need to know what encoding that number originally came from?**
So char vs byte is just that a char holds two bytes of something in an UNKNOWN encoding (instead of one byte in an unknown encoding for a byte).
Given some initialized char variable, we don't know what encoding to use to display it correctly (i.e., which console encoding to choose for output), and we cannot tell what the encoding of the source file was in which it was initialized from a char literal (not counting cases where the various encodings happen to be compatible).
Am I right, or am I a big idiot? Sorry for asking in latter case :)))
My research on SO shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.
Surrogate pairs are only meaningful in a character sequence. A single char cannot hold a character value outside the BMP. This is where the character abstraction breaks down.
Unicode is a system for expressing textual data as code points. These are typically characters, but not always. A Unicode code point is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a code unit. (For example, UTF-8 is encoded as 8-bit bytes, and UTF-16 as 16-bit words.)
While the first version of Unicode only allowed code points in the range 0x0000 to 0xFFFF, Unicode 2.0 changed the range to 0x000000 to 0x10FFFF.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less than or equal to 0xFFFF. For larger code points, the UTF-16 representation consists of two 16-bit values, a so-called surrogate pair, and that will fit into two Java chars. So in fact the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (0x0000 through 0xFFFF). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. A common case is the code points for emojis.
If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly, that is, to deal with the surrogate-pair cases.
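To illustrate (a small sketch; the emoji code point U+1F600 is just an example I picked), length and charAt count UTF-16 code units, while the codePoint* methods count actual code points:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so in UTF-16 it takes a
        // surrogate pair: two char values for one code point.
        String s = "\uD83D\uDE00";
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.printf("U+%X%n", s.codePointAt(0));       // U+1F600
    }
}
```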
So how does this relate to UTF-8, windows-1251 and so on?
Well, these are byte-oriented character encodings that are used at the OS level, in text files and so on. When you read a file using a Java Reader, your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer), you transcode in the other direction.
This doesn't always work.
Many character encodings, such as windows-1251, are not capable of representing the full range of Unicode code points. So, if you attempt to write (say) a CJK character via a Writer configured for windows-1251, you will get ? characters instead.
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa), then the transcoding is liable to produce garbage. This phenomenon is so common that it has a name: mojibake.
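A quick way to see mojibake for yourself is to encode with one charset and decode with another. This sketch (the class name is my own; it assumes the windows-1251 charset is available, which it is on standard JREs) garbles the Cyrillic letter 'й' in exactly this way:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode with UTF-8 but decode with windows-1251: each UTF-8 byte
    // is misread as a separate windows-1251 character.
    static String garble(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1251"));
    }

    public static void main(String[] args) {
        // 'й' (U+0439) is the two UTF-8 bytes D0 B9; windows-1251 reads
        // those bytes as the two characters 'Р' and '№'.
        System.out.println(garble("\u0439"));
    }
}
```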
You asked:
Does that mean that in char ch = 'й'; the literal 'й' is always converted to UTF-16 from whatever encoding the source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
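One way to sidestep the source-file encoding question entirely is to use \u escapes in the source: they are plain ASCII, so they read the same under any -encoding setting. A small sketch:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // '\u0439' is the same character as the literal 'й', but the
        // escape survives any source-file encoding unchanged.
        char ch = '\u0439';
        System.out.println((int) ch); // 1081, matching the experiment above
    }
}
```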
I have a .java file with the string String s="P�rsh�ndetje bot�!";.
When I open this file in Notepad++ and change the encoding to ISO-8859-1, it shows the appropriate string: "Përshëndetje botë!". But if I open the file in IntelliJ IDEA and change the encoding to ISO-8859-1, it gives me a warning that some symbols can't be converted and then replaces those symbols with a ? mark: "P?rsh?ndetje bot?!".
Why is this happening? Why is Notepad++ able to convert the file while IDEA is not?
I'm not sure, but it is possible that when you first opened the file it was read as UTF-8 and the invalid byte sequences were turned into the Unicode replacement character. Then, when you try to convert to ISO-8859-1, it is trying to convert the Unicode replacement character, but there is no value for that in ISO-8859-1, so it is converted to ? instead.
(Even though text like "ërs" can be represented in Unicode and thus UTF-8, the ISO-8859-1 encoding of "ërs" is EB 72 73. EB is the start byte of a three-byte UTF-8 sequence, but the next two bytes are not continuation bytes, so a program treating the data as UTF-8 would consider those accented characters invalid.)
I think you need to get IntelliJ to open the file as ISO-8859-1, rather than opening it first as UTF-8 and then trying to convert to ISO-8859-1.
(When you switch the encoding in Notepad++ it must be going back to the original bytes of the file and interpreting them as ISO-8859-1, rather than trying to convert content that it has already altered by changing invalid bytes to the replacement character.)
Note that ë is a perfectly valid Unicode character. It can be represented as either U+00EB, Latin small letter e with diaeresis, or as two code points, U+0065 and U+0308, Latin small letter e combined with Combining diaeresis. But U+00EB would be encoded in UTF-8 as the two-byte sequence C3 AB, and for U+0065 U+0308 the "e" would be encoded as itself, 65, and U+0308 would be encoded as CC 88.
So "ë" in UTF-8 must be either C3 AB or 65 CC 88. It can't be EB.
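You can verify those byte sequences directly (a small sketch; I use the \u00EB escape so the snippet itself is encoding-proof):

```java
import java.nio.charset.StandardCharsets;

public class EDiaeresisDemo {
    public static void main(String[] args) {
        String e = "\u00EB"; // ë, Latin small letter e with diaeresis
        byte[] utf8   = e.getBytes(StandardCharsets.UTF_8);      // C3 AB
        byte[] latin1 = e.getBytes(StandardCharsets.ISO_8859_1); // EB
        System.out.printf("UTF-8:      %02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
        System.out.printf("ISO-8859-1: %02X%n", latin1[0] & 0xFF);
    }
}
```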
I believe there is a bug here in IDEA (where the default encoding is UTF-8): when you open a file containing valid ISO-8859-1 encoded characters and then change the file encoding to ISO-8859-1, it messes it up. The particular code point it messes up is ë. For some reason, it replaces it with \ufffd, whereas its correct code point is \u00eb. This is the character that shows up as � in your editor.
My suggestion is to just use UTF-8 and not change the file to ISO-8859-1. UTF-8 can represent everything that ISO-8859-1 can, and you could type this string using the input method on your OS (which appears to be Windows). I am not sure how to do it on Windows, but on a Mac I use the Unicode Hex Input keyboard and enter the character as 00eb while keeping the ALT key pressed. Then it shows up correctly.
In my project I am trying to convert a binary number to an integer and then convert the integer to a character. But for numbers above 128 it prints only the '?' character. Please help me print characters up to 250. My code is:
class b
{
    public static void main(String[] args)
    {
        String dec1 = "11011001";
        System.out.println(dec1);
        int dec = Integer.parseInt(dec1, 2);
        System.out.println(dec);
        String str = new Character((char) dec).toString();
        System.out.println("decrypted number is " + str);
    }
}
Thank you.
Not all byte values have a printable character associated with them. ASCII does not assign one to every value, and many Unicode code points are not printable either; the range 0x00-0x1F consists entirely of unprintable controls such as DC1, Bell, Backspace, etc. Unicode reserves the same first 32 code points as non-printable.
Byte values above 127 (0x7F) have different meanings in different encodings, and there are many encodings. Historically, ASCII was the default encoding and there were many extensions to it. These days the standard is Unicode, which exists in several varieties, including UTF-8, UTF-16 (LE, BE and with BOM) and UTF-32 (LE, BE and with BOM). UTF-8 is common for interchange, especially over the net, and UTF-16 is used internally in many systems.
Depending on the encoding and the glyph (the displayed representation), it may take from one to over 16 bytes to represent a single glyph. Most emoji are in code plane 1, meaning that they require more than 16 bits for their code point (Unicode is a 21-bit system). Additionally, some glyphs are represented by a sequence of code points; examples are flags, which combine a country code with the flag marker, and emoji joined with "joiners".
In the case of 217 (0xD9), that is not a legal single-byte sequence in UTF-8, but 217 as a 16-bit value (0x00D9) is the code point of Ù.
See ASCII and Unicode.
As per your code, first the binary string is converted to an int, and then you convert that int to a char, which works by ASCII value: you get the character whose ASCII value equals the integer dec. Since the ASCII table only goes up to 127, you only get a character for values of dec up to 127; for values greater than 127 you get ?, which is then converted into a String. The first 32 values are non-printable control characters, so you will get strange symbols for those, but for dec in the range 32-126 you will get the character assigned to that particular value in the ASCII table. Since value 127 is assigned to DEL, you will also get a strange symbol for 127.
The issue is that your console's encoding doesn't match the encoding of your Java program's output. I don't know which console you're using, but on Windows you can run this command to see your current encoding:
chcp
The default console encoding for the USA is code page 437, and for Western Europe and Canada it is 850. These encodings contain the 128 ASCII characters plus 128 additional characters that differ from one code page to another. You get nothing beyond the 128 ASCII characters because your Java output's encoding doesn't match the console's encoding; you have to change one of them to match the other.
You can change your console's encoding to UTF-8 by running this command:
chcp 65001
If you're not on Windows, you'll have to search for the equivalent commands for your system. But I believe on most Linux & Unix derived systems, you can use the locale command to see the current encoding and the export command to change it.
I get the following output from your code. I assume that you are running the program in an environment/console that doesn't support the character. You need a console that supports UTF-8, UTF-16 or similar to be able to print all the characters whose numeric values you set up.
11011001
217
decrypted number is Ù
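If you cannot change the console code page, another option is to give your program a PrintStream with an explicit encoding. A sketch under that assumption (the class name is mine; it only helps if the console actually renders UTF-8):

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class PrintCharDemo {
    public static void main(String[] args) throws Exception {
        int dec = Integer.parseInt("11011001", 2);   // 217
        String str = Character.toString((char) dec); // 'Ù' (U+00D9)
        // Wrap stdout in a PrintStream that emits UTF-8 bytes,
        // instead of the platform default encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("decrypted number is " + str);
    }
}
```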
I am creating a set of tests for the size of a String. To do so, I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java has the checked exception UnsupportedEncodingException.
Just for curiosity, and to further consider other possible test scenarios, is there a text that cannot be represented by UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) does it say that UTF-8/Unicode indeed includes ALL possible characters. I know that the code space is huge and much of it is still empty, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term "UCS transformation format" for UTF; the two terms are merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must map all code points (except surrogate code points) to unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters (including U+FFFE and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyways).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, the specified range covers the whole Unicode code space (excluding the surrogate range, of course).
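You can check the round-tripping property from Java yourself (a small sketch; the class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // A code point survives a UTF-8 round trip if encoding the string
    // and decoding the resulting bytes gives the original string back.
    static boolean roundTrips(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(0x41));     // plain ASCII 'A'
        System.out.println(roundTrips(0x00EB));   // ë
        System.out.println(roundTrips(0x10FFFF)); // the very last code point
    }
}
```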
is there a text that cannot be represented by UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that has some differences from, and limitations relative to, standard UTF-8.
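One place the difference shows up is DataOutputStream.writeUTF, which uses modified UTF-8: U+0000 is written as the two bytes C0 80 (an overlong form that standard UTF-8 forbids) rather than a single 00 byte. A sketch (the helper name is mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Demo {
    // Serialize a string the way writeUTF does: a 2-byte length prefix
    // followed by the modified-UTF-8 bytes.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen with an in-memory sink
        }
    }

    public static void main(String[] args) {
        byte[] b = writeUtf("\u0000");
        // 00 02 is the length prefix (2 bytes follow), then C0 80 for U+0000.
        for (byte x : b) System.out.printf("%02X ", x & 0xFF);
        System.out.println();
    }
}
```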
Usually when I need to convert my String to byte[] I use getBytes() without a parameter. I've been told that this is not safe and that I should specify a charset. Why should I do that? Won't the letter 'A' always be encoded as 0x41?
Usually when I need to convert my string to byte[] I use getBytes() without a param.
Stop doing that right now. I would suggest that you always specify an encoding. If you want to use the platform default encoding (which is what you'll get if you don't specify one), then do that explicitly so that it's clearer. But that should very rarely be the approach anyway. Personally I use UTF-8 in almost all cases.
Why should I do so? Won't the letter 'A' always be encoded as 0x41?
Nope. For example, using UTF-16, 'A' will be two bytes: 0x41 0x00 or 0x00 0x41 (depending on the endianness). In EBCDIC encodings it could be something completely different.
Most encodings treat ASCII characters in the same way - but characters outside ASCII are represented very differently in different encodings (and many encodings only support a subset of Unicode).
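A small sketch of that difference (the class name is mine), showing the bytes getBytes produces for 'A' under a few explicit charsets:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        String a = "A";
        // One byte in ASCII and UTF-8; two bytes in UTF-16, with the
        // order depending on endianness.
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.US_ASCII))); // [65]
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));    // [65]
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
    }
}
```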
See my article on Unicode (C#-focused, but the principles are the same) for a few more details - and links to more information than you're ever likely to want.
Different character encodings lead to different ways characters get encoded. In ASCII, sure, 'A' will encode to 0x41. In other encodings, this will be different.
This is why when you go to some webpages, you may see a bunch of weird characters. The browser doesn't know how to decode it, so it just decodes to the default.
Some background: when text is stored in files or sent between computers over a socket, the characters are stored or sent as a sequence of bits, almost always grouped into 8-bit bytes. The characters all have defined numeric values in Unicode, so that 'A' always has the value 0x41 (well, there are actually two other A-like letters in the Unicode character set, in the Greek and Russian alphabets, but that's not relevant here).
But there are many mechanisms for how those numeric codes are translated to a sequence of bits when storing in a file or sending to another computer. In UTF-8, 0x41 is represented as 8 bits (the byte 0x41), but other numeric values (code points) will be converted to 16 or more bits with an algorithm that rearranges the bits; in UTF-16, 0x41 is represented as 16 bits; and there are other encodings like JIS, some of which can represent only a subset of the Unicode characters.
Since String.getBytes() was intended to return a byte array that contains the bytes to be sent to a file or socket, the method needs to know what encoding it is supposed to use when creating those bytes. Basically, the encoding has to be the same one that a program later reading the file, or the computer at the other end of the socket, expects.