What happens if your input file contains some unsupported character? - java

I have this text file which might contain some unsupported characters in the Latin1 character set, which is the default character set of my JVM.
What would those characters be turned into when my java program tries to read from the file?
Concretely, suppose I had a 2-byte long character in the file, would it be read as a one-byte character (because each character in Latin1 is only 1-byte long)?
Thanks,

I can't use the InputStreamReader option, because the file has to be read with Latin1.
And
I have this text file which might contain some unsupported characters in the Latin1 character set ...
You have contradictory requirements here.
Either the file is LATIN-1 (and there are no "unsupported characters") or it is not LATIN-1. If it is not LATIN-1, you should be trying to find out what character set / encoding it really is, and use that one instead of LATIN-1 to read the file.
As other answers / comments have explained, you can either change the JVM's default character set, or specify a character set explicitly when you open the Reader.
I'm having trouble setting the default character set of my JVM .
Please explain what you are trying and what problems you are having.
(and was a bit afraid of messing it up!)
COWARD! :-)
FWIW - if you try to read a data stream in (say) LATIN-1 and the data stream is not actually in LATIN-1, then you can expect the following:
Characters that encode the same in LATIN-1 and the actual character set will be passed unharmed.
Characters that don't encode the same, will either be replaced by a character that means "unknown character" (e.g. a question mark), or will be garbled. Which happens will depend on whether that byte or byte sequence at issue encodes a valid (but wrong) character, or no character at all.
The net result will be partially garbled text. The garbling may or may not be reversible, depending on exactly what the real character set and characters are. But it is best to avoid "going there" ... by using the RIGHT character set to decode in the first instance.
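Both outcomes can be sketched in a few lines. This assumes the real data is either UTF-8 or Latin-1 and uses 'é' (U+00E9) as the sample character; the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Decode raw bytes with the given charset; malformed input is replaced, not rejected.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The UTF-8 form of U+00E9 is two bytes (0xC3 0xA9); Latin-1 turns each byte into its own char.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        String garbled = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());

        // Reversible in this direction: Latin-1 maps every byte value, so nothing was lost.
        String restored = decode(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(restored.equals("\u00e9"));

        // The Latin-1 byte 0xE9 is not valid UTF-8 on its own, so the decoder substitutes U+FFFD.
        byte[] latin1 = "\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String replaced = decode(latin1, StandardCharsets.UTF_8);
        System.out.println(replaced.equals("\uFFFD"));
    }
}
```

The first case is the "garbled but recoverable" one; the second is the "unknown character" one, and the original byte cannot be recovered from the decoded text.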

First of all you can specify the character set to use when reading a file. See for example: java.io.InputStreamReader
Secondly: yes, when reading with a single-byte character set, each byte is mapped to exactly one character.
Thirdly: Test it and you shall see, beyond doubt what actually happens!
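For the "test it" part, something along these lines should do. It is only a sketch: the helper name readAll is invented here, and an in-memory stream stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetRead {
    // Read the whole stream as text, decoding with an explicitly chosen charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Five bytes: 'a', 'b', a two-byte UTF-8 sequence (0xC3 0xA9), 'c'.
        byte[] data = {'a', 'b', (byte) 0xC3, (byte) 0xA9, 'c'};

        // A single-byte charset yields one char per byte.
        String asLatin1 = readAll(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.length());

        // UTF-8 combines the two middle bytes into one char.
        String asUtf8 = readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(asUtf8.length());
    }
}
```

Swapping the ByteArrayInputStream for a FileInputStream gives the same behaviour on a real file.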

If you don't know the charset you'll have to guess it. This is tricky and error prone.
Here is a question regarding this issue:
How can I detect the encoding/codepage of a text file
Check out how you can fool notepad into guessing wrong.


When I assign char (from literal or otherwise), what "java internal encoding is UTF16" means here? In what encoding is it stored in char?

//non-utf source file encoding
char ch = 'ё'; // some number within 0..65535 is stored in char.
System.out.println(ch); // the same number output to
"java internal encoding is UTF16". Where does it meaningfully come to play in that?
Besides, I can perfectly put into char one utf16 codeunit from surrogate range (say '\uD800') - making this char perfectly invalid Unicode. And let us stay within BMP, so to avoid thinking that we might have 2 chars (codeunits) for a supplementary symbol (thinking this way sounds to me that "char internally uses utf16" is complete nonsense). But maybe "char internally uses utf16" makes sense within BMP?
I could understand it if it were like this: my source code file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 encoding (what really happens), then this number is automatically converted to another number (from windows-1251 number to utf-16 number) - which is NOT taking place (am I right?! this I could understand as "internally uses UTF-16"). And then that stored number is written out (really it is written as given, as from win-1251, with none of my "imaginary conversion from internal utf16 to output\console encoding" taking place), and the console shows it, converting from number to glyph using the console encoding (what really happens)
So this "UTF16 encoding used internally" is NEVER USED ANYHOW ??? char just stores any number (in [0..65535]), and besides specific range and being "unsigned" has NO DIFFERENCE FROM int (in scope of my example of course)???
P.S. Experimentally, code above with UTF-8 encoding of source file and console outputs
й
1081
with win-1251 encoding of source file and UTF-8 in console outputs
�
65533
Same output if we use String instead of char...
String s = "й";
System.out.println(s);
In API, all methods taking char as argument usually never take encoding as argument. But methods taking byte[] as argument often take encoding as another argument. Implying that with char we don't need encoding (meaning that we know this encoding for sure). But **how on earth do we know in what encoding something was put into char???
If char is just a storage for a number, we do need to understand what encoding this number originally came from?**
So char vs byte is just that char has two bytes of something with UNKNOWN encoding (instead of one byte of UNKNOWN encoding for a byte).
Given some initialized char variable, we don't know what encoding to use to correctly display it (to choose correct console encoding for output), we cannot tell what was encoding of source file where it was initialized with char literal (not counting cases where various encodings and utf would be compatible).
Am I right, or am I a big idiot? Sorry for asking in latter case :)))
SO research shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.
Surrogate pairs are only meaningful in a character sequence. A single char can not hold a character value outside the BMP. This is where the character abstraction breaks down.
Unicode is a system of expressing textual data as codepoints. These are typically characters, but not always. A Unicode codepoint is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a codeunit. (For example UTF-8 is encoded as 8-bit bytes, and UTF-16 is encoded as 16-bit words.)
While the first version of Unicode only allowed code points in the range 0hex ... FFFFhex, in Unicode 2.0, they changed the range to 0hex to 10FFFFhex.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less or equal to FFFFhex. For larger codepoints, the UTF-16 representation consists of 2 16-bit values; a so-called surrogate pair. And that will fit into 2 Java chars. So in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (0hex through FFFFhex). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. (A common case is the code points for Emojis.)
If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly ... that is to deal with the surrogate pair cases.
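To make the surrogate pair case concrete, here is a small sketch using U+1F600 (an emoji outside the BMP) as the sample code point. The class and helper names are just for illustration:

```java
class CodePointDemo {
    // Number of Unicode code points in the string, as opposed to the number of char values.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by U+1F600, which UTF-16 stores as the surrogate pair D83D DE00.
        String s = "A\uD83D\uDE00";

        System.out.println(s.length());          // counts char values (UTF-16 code units)
        System.out.println(countCodePoints(s));  // counts code points

        // charAt sees only half of the pair; codePointAt reassembles it.
        System.out.println(Character.isHighSurrogate(s.charAt(1)));
        System.out.println(Integer.toHexString(s.codePointAt(1)));

        // Iterating by code point instead of by char.
        s.codePoints().forEach(cp -> System.out.println("U+" + Integer.toHexString(cp).toUpperCase()));
    }
}
```

Here length() reports 3 while codePointCount reports 2, which is exactly the gap between "char" and "character" discussed above.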
So how does this relate to UTF-8, windows-1251 and so on?
Well these are 8-bit character encodings that are used at the OS level in text files and so on. When you read a file using a Java Reader your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer) you transcode in the other direction.
This doesn't always work.
Many character encodings such as windows-1251 are not capable of representing the full range of Unicode codepoints. So, if you attempt to write (say) a CJK character via a Writer configured for windows-1251, you will get ? characters instead.
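A quick way to see the substitution is to encode in memory rather than through a Writer. This sketch assumes the runtime ships the windows-1251 charset (typical JDKs do); the class and method names are invented for the example:

```java
import java.nio.charset.Charset;

class UnmappableDemo {
    // Encode text with the named charset; unmappable characters are silently replaced.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+4E2D is a CJK ideograph with no windows-1251 mapping.
        byte[] cjk = encode("\u4e2d", "windows-1251");
        System.out.println(cjk.length + " byte(s), first is '" + (char) cjk[0] + "'");

        // U+0439 (Cyrillic) does have a mapping, so it survives as a single byte.
        byte[] cyrillic = encode("\u0439", "windows-1251");
        System.out.println(cyrillic.length + " byte(s)");

        // canEncode lets you check before writing instead of discovering '?' afterwards.
        Charset cp1251 = Charset.forName("windows-1251");
        System.out.println(cp1251.newEncoder().canEncode("\u4e2d"));
    }
}
```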
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa) then the transcoding is liable to give garbage. (This phenomenon is so common it has a name: Mojibake.)
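The round trip through a Writer and a Reader can be sketched as follows. The temp file, the helper names and the sample character U+0439 are all just for illustration, and windows-1251 is assumed to be available in the runtime:

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TranscodingRoundTrip {
    // Writer side: UTF-16 chars in memory are transcoded to bytes in the file.
    static void write(Path file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            writer.write(text);
        }
    }

    // Reader side: bytes in the file are transcoded back to UTF-16 chars.
    static String read(Path file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(Files.newInputStream(file), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("transcoding", ".txt");
        try {
            write(file, "\u0439", StandardCharsets.UTF_8);
            // Same charset on both sides: the text comes back intact.
            System.out.println(read(file, StandardCharsets.UTF_8).equals("\u0439"));
            // Wrong charset on the read side: each UTF-8 byte becomes its own character.
            System.out.println(read(file, Charset.forName("windows-1251")).length());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```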
You asked:
Does that mean that in char ch = 'й'; literal 'й' is always converted to utf16 from whatever encoding source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
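If you want to take the source file encoding out of the picture for a particular literal, a Unicode escape is one option, since the compiler reads escapes the same way whatever -encoding is in effect. A minimal sketch (class and method names are made up):

```java
class EscapedLiteral {
    // The numeric value of a char is its UTF-16 code unit.
    static int codeUnit(char c) {
        return c;
    }

    public static void main(String[] args) {
        // U+0439 written as an escape, so the bytes of this source file are plain ASCII.
        char ch = '\u0439';
        System.out.println(codeUnit(ch)); // 1081, i.e. 0x0439

        // What the console shows for the char itself still depends on the output encoding.
        System.out.println(ch);
    }
}
```

This matches the 1081 reported in the question for the correctly compiled case; the 65533 case is U+FFFD, the replacement character the compiler substituted when it could not decode the source bytes.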

Determining ISO-8859-1 vs US-ASCII charset

I am trying to determine whether to use
PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");
or
PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");
I was reading All about character sets to determine the character set of an example file which I must create in the same encoding via java code.
When my example file contains "European" letters (Norwegian: å ø æ), then the following command tells me the file encoding is "iso-8859-1"
file -bi example.txt
However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".
file -bi example-no-european-letters.txt
What does this mean? Is ISO-8859-1 in practise the same as US-ASCII if there are no "European" characters in it?
Should I just use a charset "ISO-8859-1" and everything will be ok?
If the file contains only the 7-bit US-ASCII characters it can be read as US-ASCII. It doesn't tell anything about what was intended as the charset. It may be just a coincidence that there were no characters that would require a different coding.
ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters, the first 128 characters being the same as in US-ASCII (as is often the case, for convenience reasons).
However you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding also contains the US-ASCII charset, but it will encode for example äöå characters as two bytes instead of ISO-8859-1's one byte.
TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
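The byte-level difference is easy to check. This is only a sketch, with "Bjørn" from the question as sample data and invented class and method names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetCompare {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String plain = "Bjorn";
        String norwegian = "Bj\u00f8rn"; // U+00F8 is the letter o with stroke

        // Pure 7-bit text produces identical bytes in all three charsets.
        System.out.println(Arrays.equals(plain.getBytes(StandardCharsets.US_ASCII),
                                         plain.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.equals(plain.getBytes(StandardCharsets.US_ASCII),
                                         plain.getBytes(StandardCharsets.UTF_8)));

        // The non-ASCII letter is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(encodedLength(norwegian, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(norwegian, StandardCharsets.UTF_8));

        // US-ASCII cannot represent it at all and substitutes '?'.
        System.out.println(new String(norwegian.getBytes(StandardCharsets.US_ASCII),
                                      StandardCharsets.US_ASCII));
    }
}
```

That is why the file tool reports us-ascii for the copy without Norwegian letters: with only 7-bit characters present, the bytes carry no evidence of which superset was intended.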
It depends on the types of characters used in the document. ASCII is a 7-bit charset and ISO-8859-1 is an 8-bit charset which supports some additional characters. But, mostly, if you are going to reproduce the document from an input stream, I recommend the ISO-8859-1 charset. It will work for text files opened in programs like Notepad and MS Word.
If you are using other international characters, you need to check for a charset that supports those particular characters, such as UTF-8.

Java safeguards for when UTF-16 doesn't cut it

My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.
So I ask:
Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like to have some way of making sure that this garbage-producing character gets caught somehow before making it into the file that I am writing. How do I safeguard against this?
Thanks in advance.
Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are sequences of UTF-16 code units that encode Unicode code points. By default char values are considered to be Big Endian (as any Java basic type is). You should however not use char values to write strings to files or memory. You should make use of the character encoding/decoding facilities in the Java API (see below).
UTF-16 is not a major superset of encodings. Actually, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode does define almost any character that you possibly want to use in modern communication.
If you read a file from disk and assume UTF-16 then you would quickly run into trouble. Most text files are using ASCII or an extension of ASCII to use all 8 bits of a byte. Examples of these extensions are UTF-8 (which can be used to read any ASCII text) or ISO 8859-1 (Latin). Then there are a lot of encodings e.g. used by Windows that are an extension of those extensions. UTF-16 is not compatible with ASCII, so it should not be used as default for most applications.
So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII. You choose UTF-8 (which is a pretty safe bet). There is no way of knowing that the program that opens the file will correctly guess that it should use UTF-8. It may try to use Latin or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately there are no smart tricks to make sure this never happens.
Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
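As a rough sketch of how those two classes can act as the safeguard asked about: an encoder can tell you up front whether a string is representable, and a decoder configured with CodingErrorAction.REPORT throws instead of silently substituting characters. The class and method names below are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictCodec {
    // True if every character of the text can be represented in the target charset.
    static boolean canWrite(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Decode bytes, failing loudly on malformed or unmappable input instead of substituting.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(canWrite("plain text", StandardCharsets.US_ASCII));
        System.out.println(canWrite("caf\u00e9", StandardCharsets.US_ASCII));

        // A lone 0xE9 byte is Latin-1 data, not valid UTF-8, so strict decoding rejects it.
        try {
            decodeStrict(new byte[] {(byte) 0xE9}, StandardCharsets.UTF_8);
            System.out.println("decoded without complaint");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

This catches input that is invalid for the chosen charset; it cannot catch input that happens to be valid in the wrong charset, which is the case no trick can rule out.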
Whenever a conversion between bytes and characters takes place, Java allows you to specify the character encoding to be used. If it is not specified, a machine dependent default encoding is used. In some encodings the bit pattern representing a certain character has no similarity with the bit pattern used for the same character in UTF-16 encoding.
To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.
It depends on the used encoding which characters are representable.

Can a file be encoded in multiple charsets in Java?

I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they would wish to use. However, I was confused as to how I would encode multiple encodings in a single file. For example, suppose that A characters come from one charset and B characters come from another, would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset since tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs for various reasons, so the scope of this question is purely in the standard java packages/code.
Thanks a lot!
N.S.
You'd need to read it as a byte stream and know beforehand at which byte positions the characters start and end, or to use some special separator character/byterange which indicates the start and end of the character group. This way you can get the bytes of the specific character group and finally decode it using the desired character encoding.
This problem is not specific to Java. The requirement is just strange. I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8 which supports practically all characters the mankind is aware of.
Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
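One possible tracking mechanism, purely as a sketch: prefix every segment with its charset name and byte length, so the reader knows how many bytes to take and how to decode them. The record layout and the class and method names are invented here, not an established file format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

class MixedEncodingFile {
    // Record layout (hypothetical): charset name, payload length, payload bytes.
    static void writeSegment(DataOutputStream out, String text, String charsetName) throws IOException {
        byte[] payload = text.getBytes(Charset.forName(charsetName));
        out.writeUTF(charsetName);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read one record back and decode it with the charset named in its header.
    static String readSegment(DataInputStream in) throws IOException {
        String charsetName = in.readUTF();
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return new String(payload, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for a real file
        try (DataOutputStream out = new DataOutputStream(file)) {
            writeSegment(out, "AAAAA", "ISO-8859-1");
            writeSegment(out, "\u0411\u0411\u0411\u0411\u0411", "UTF-8"); // five Cyrillic letters
            writeSegment(out, "AAAAA", "ISO-8859-1");
        }

        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < 3; i++) {
                text.append(readSegment(in));
            }
            System.out.println(text.length());
        }
    }
}
```

Note that the result is no longer a text file that an ordinary editor can open, which is part of why a single uniform encoding is usually the better choice.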
I was wondering about this as well, because my client just asked a similar question. Like BalusC mentioned this is not a java specific problem.
After some back and forth, I found that the real question might be about 'multiple encodings of information' rather than a file with multiple encodings.
For example, say we have an XML string whose text content needs to be stored as 8859-1. If we save the XML as a file, we have to encode it somehow. The default encoding for XML is UTF-8, and it is not necessary to encode the whole XML document as 8859-1, since the XML node is just a vehicle for passing information to another system; only the content (the value of the XML node) needs to be persisted as 8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it over. Once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as 8859-1.

Woodstox parser doesn't support certain unicode character

I am new to WoodStox. I have to read all possible combinations of Unicode characters and write them to an XML file. WoodStox fails while reading certain Unicode characters. Can someone help me either skip such a character when it is encountered, or find some way to write that Unicode character to the XML file?
The exception I get is
Error on line 1 column 1404735 of 24364002-data-set-results.xml:
SXXP0003: Error reported by XML parser: Character reference "&#xfffe" is an invalid XML character.
Exception is : net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException: Character reference "&#xfffe" is an invalid XML character.
I am not familiar with WoodStox either, but I can say that FFFE is indeed not a valid Unicode character, so it is probably more a problem with the input than with the parser. The related code point FEFF is used by many encoders as a byte order mark at the start of UTF-16 data to indicate the byte order (little or big endian). Depending on whether it is read back as FEFF or, byte-swapped, as FFFE, the decoder knows which byte order to choose. But the mark is optional and not all decoders handle it.
When used as such, it is always the first two bytes of the file.
So, what you need to check is:
Are you using the correct character encoding (usually either UTF-8 or UTF-16)?
If using UTF-16, does your file start with FFFE or FEFF?
Does WoodStox have a setting that enables automatic detection of byte order?
Worst case, if your file starts with FFFE or FEFF, simply remove it from the file before you feed it to WoodStox. Make sure that you set the correct byte-order in WoodStox, though.
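Checking the first bytes by hand might look something like this. It is a sketch that only knows the UTF-8 and UTF-16 marks (UTF-32 is ignored), and the class and method names are made up; nothing here is WoodStox API.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns the charset implied by a leading byte order mark, or null if there is none.
    static String charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE"; // mark stored high byte first
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // mark stored low byte first
        }
        return null;
    }

    public static void main(String[] args) {
        // Little-endian UTF-16 for "A", preceded by the byte order mark.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(charsetFromBom(data));

        // Java's "UTF-16" charset reads the mark itself, picks the byte order and drops the mark.
        String text = new String(data, StandardCharsets.UTF_16);
        System.out.println(text.equals("A"));
    }
}
```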
