Printing to System.out as UTF-8 regardless of platform default - java

In Java, System.out.print encodes text using the platform default charset.
Suppose you want to always print as UTF-8 regardless of the platform default: what's the most idiomatic/efficient way to do this?
The best idea I have so far is to explicitly spell out two steps: first convert the string to UTF-8 bytes with s.getBytes(StandardCharsets.UTF_8), then print the bytes with System.out.write, which bypasses the question of encoding.
Is this the recommended way, or is there a more idiomatic way?
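A minimal sketch of the two-step approach described above, plus the common alternative of wrapping System.out in a UTF-8 PrintStream (class and variable names are illustrative only):
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Print {
    public static void main(String[] args) throws IOException {
        String s = "héllo wörld";

        // Step 1: convert the string to UTF-8 bytes explicitly.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Step 2: write the raw bytes, bypassing System.out's default encoding.
        System.out.write(utf8);
        System.out.flush();

        // Alternative: a PrintStream over stdout that always encodes as UTF-8.
        PrintStream utf8Out =
                new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8");
        utf8Out.println(s);
    }
}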

Related

Store Arabic in String and insert it into database using Java

I am trying to pass an Arabic String into a function that stores it in a database, but the String's characters are converted into '?'.
For example:
String str = "عشب";
System.out.print(str);
The output will be:
"???"
and it is stored like this in the database.
If I insert into the database directly, it works fine.
Make sure your character encoding is UTF-8 everywhere: the snippet you showed works perfectly as expected when it is.
For example, if your source files are encoded as windows-1252, it won't work.
The problem is that System.out is a PrintStream, which converts the Arabic string into bytes using the default encoding, and that encoding presumably cannot handle the Arabic characters. Try
System.out.write(str.getBytes("UTF-8"));
System.out.println();
Many modern operating systems use UTF-8 as the default encoding, which supports non-Latin characters correctly. Windows is not one of those: ANSI code pages are the default in Western installations (I have not used Windows recently, so that may have changed). Either way, you should probably force the default character encoding for the Java process, irrespective of the platform.
As described in another Stack Overflow question (see Setting the default Java character encoding?), you'll need to change the default for the Java process as follows:
java -Dfile.encoding=UTF-8
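If you want to verify which default the JVM actually picked up, something like this will do (the class name is just for illustration):
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // The charset the JVM is using as its default for this run.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}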
Additionally, since you are running in an IDE, you may need to tell it to display the output in the indicated charset or risk corruption; that is IDE specific and the exact instructions will depend on your IDE.
One other thing: if you are reading or writing text files, you should always specify the expected character encoding, otherwise you risk falling back to the platform default (see the sketch below).
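For example, something along these lines (the file name is just a placeholder):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExplicitCharsetIo {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("arabic.txt"); // placeholder path

        // Write with an explicit charset instead of the platform default.
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write("عشب");
        }

        // Read it back with the same explicit charset.
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            System.out.println(in.readLine());
        }
    }
}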
You need to set the character set to UTF-8 for this.
At the Java level you can do:
Charset.forName("UTF-8").encode(myString);
If you want to do so at the IDE level (this is the Eclipse menu path), go to:
Window > Preferences > General > Content Types and set UTF-8 as the default encoding for all content types.
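Note that Charset.forName("UTF-8").encode(myString) returns a ByteBuffer; a small sketch of what you can do with it (names are illustrative only):
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeExample {
    public static void main(String[] args) {
        String myString = "عشب";
        ByteBuffer buffer = Charset.forName("UTF-8").encode(myString);

        // Copy the encoded bytes out of the buffer if you need a byte[].
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);

        // Decoding with the same charset restores the original text.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}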

Setting content-length when using Writer?

I am working on a response wrapper. When an OutputStream is used, I can determine the number of bytes. However, when a Writer is used, I buffer the content as a char[] that is used later for writing to the output.
It's a bit of a noobie question, but how can I be sure of the real content-length when using char[]? (I want to set the header; I know that I don't have to, but I want to.) I mean, I could convert the chars to bytes using the encoding in use and then flush the bytes, but I would like to skip another conversion.
Any other ideas?
It depends on the character encoding you are using:
If you are using an 8-bit encoding (such as Latin-1), and you only have mappable characters to convert, then the character count is the same as the byte count.
Otherwise, it is technically possible to work out how many bytes the character stream will encode to¹, but there isn't a simple and efficient way to do it.
Yeah, but that is the point: I do not know the encoding up front, and the user may use anything. I don't see any other pragmatic way of doing this besides converting to byte[], which probably happens under the hood anyway, I guess.
Actually, it should be easy to tell what the encoding is. Call getCharacterEncoding() on the response wrapper. If you do that after getWriter has been called, then it will be the encoding that was used to create the writer. (At least, that is what the javadocs imply ...)
¹ For instance, with UTF-8 you would first map char values to Unicode code points, and then figure out how many bytes are required to encode each code point. This can be done without copying, etc., but you would need to code it from the ground up ... unless you can find a third-party library that does this.
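A rough sketch of that idea for UTF-8, assuming the buffered chars are well-formed UTF-16 (unpaired surrogates are not handled carefully here):
public class Utf8Length {
    // Number of bytes the given chars would occupy when encoded as UTF-8.
    static long utf8Length(char[] buffer, int offset, int length) {
        long bytes = 0;
        int i = offset;
        int end = offset + length;
        while (i < end) {
            int codePoint = Character.codePointAt(buffer, i, end);
            if (codePoint < 0x80) {
                bytes += 1;      // ASCII range
            } else if (codePoint < 0x800) {
                bytes += 2;
            } else if (codePoint < 0x10000) {
                bytes += 3;      // rest of the Basic Multilingual Plane
            } else {
                bytes += 4;      // supplementary planes (surrogate pairs in the char[])
            }
            i += Character.charCount(codePoint);
        }
        return bytes;
    }

    public static void main(String[] args) {
        char[] chars = "naïve \uD83D\uDE00".toCharArray();
        System.out.println(utf8Length(chars, 0, chars.length)); // prints 11
    }
}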

Why no Java Charset setEncoding method?

I found this excellent SO question asking for the distinction between a character set vs. character encoding. And it makes sense: essentially the character set is the set of glyphs available for use, and its respective encoding is how each glyph translates to and from binary.
I then went to the Java 7 SE Charset API doc and was surprised only to see a getEncoding() method but no respective setter. So it seems that, at least in Java land, every character set gets "bound" to a pre-configured encoding.
This got me thinking: why is there no setter here? Why does Java not allow the user to define what encoding to use for binding a set of characters to binary?
Along those same lines, what if Java doesn't support a particular character set/encoding? Is there a way to extend the JRE with custom sets/encodings?
Because what Java calls a Charset is what you call an encoding. The documentation of Charset describes it as:
A named mapping between sequences of sixteen-bit Unicode code units and sequences of bytes.
BTW, there is no getEncoding() method in Charset.
For the question:
Along those same lines, what if Java doesn't support a particular
character set/encoding? Is there a way to extend the JRE with custom
sets/encodings?
Java has support for pretty much any encoding you might want: http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html
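If you need to check at runtime whether a particular encoding is available, a small sketch (the charset names are just examples):
import java.nio.charset.Charset;

public class CharsetSupport {
    public static void main(String[] args) {
        // Test for a specific charset by name.
        System.out.println(Charset.isSupported("windows-1256")); // true on most JREs
        System.out.println(Charset.isSupported("x-made-up"));    // false

        // Or list every charset this JRE knows about.
        Charset.availableCharsets().keySet().forEach(System.out::println);
    }
}
For genuinely custom encodings, the JRE can also be extended through the java.nio.charset.spi.CharsetProvider service-provider interface, though that is rarely necessary in practice.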

Java safeguards for when UTF-16 doesn't cut it

My understanding is that Java uses UTF-16 by default (for String and char and possibly other types) and that UTF-16 is a major superset of most character encodings on the planet (though, I could be wrong). But I need a way to protect my app for when it's reading files that were generated with encodings (I'm not sure if there are many, or none at all) that UTF-16 doesn't support.
So I ask:
Is it safe to assume the file is UTF-16 prior to reading it, or, to maximize my chances of not getting NPEs or other malformed input exceptions, should I be using a character encoding detector like JUniversalCharDet or JCharDet or ICU4J to first detect the encoding?
Then, when writing to a file, I need to be sure that a character/byte didn't make it into the in-memory object (the String, the OutputStream, whatever) that produces garbage text/characters when written to a string or file. Ideally, I'd like to have some way of making sure that this garbage-producing character gets caught before making it into the file that I am writing. How do I safeguard against this?
Thanks in advance.
Java normally uses UTF-16 for its internal representation of characters. In Java, char arrays are sequences of UTF-16 code units encoding Unicode code points. By default char values are considered to be big-endian (as any Java basic type is). You should however not use char values to write strings to files or memory; use the character encoding/decoding facilities in the Java API instead (see below).
UTF-16 is not a major superset of encodings. Actually, UTF-8 and UTF-16 can both encode any Unicode code point. In that sense, Unicode does define almost any character that you possibly want to use in modern communication.
If you read a file from disk and assume UTF-16, you will quickly run into trouble. Most text files use ASCII or an extension of ASCII that uses all 8 bits of a byte. Examples of these extensions are UTF-8 (which can be used to read any ASCII text) and ISO 8859-1 (Latin-1). Then there are a lot of encodings, e.g. those used by Windows, that are extensions of those extensions. UTF-16 is not compatible with ASCII, so it should not be used as the default for most applications.
So yes, please use some kind of detector if you want to read a lot of plain text files with unknown encoding. This should answer question #1.
As for question #2, think of a file that is completely ASCII. Now you want to add a character that is not in ASCII. You choose UTF-8 (which is a pretty safe bet). There is no way of knowing whether the program that later opens the file will correctly guess that it should use UTF-8. It may try to use Latin-1 or, even worse, assume 7-bit ASCII. In that case you get garbage. Unfortunately there are no smart tricks to make sure this never happens.
Look into the CharsetEncoder and CharsetDecoder classes to see how Java handles encoding/decoding.
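For instance, a decoder configured to fail loudly instead of silently replacing bad input (a sketch; the file name is a placeholder):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StrictDecode {
    public static void main(String[] args) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail on broken byte sequences
                .onUnmappableCharacter(CodingErrorAction.REPORT); // fail on unmappable input

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(Paths.get("input.txt")), decoder))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (CharacterCodingException e) {
            // The file was not valid UTF-8 after all; fall back to a detector or another charset.
            System.err.println("Not valid UTF-8: " + e);
        }
    }
}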
Whenever a conversion between bytes and characters takes place, Java allows you to specify the character encoding to be used. If it is not specified, a machine-dependent default encoding is used. In some encodings the bit pattern representing a certain character has no similarity to the bit pattern used for the same character in the UTF-16 encoding.
To question 1 the answer is therefore "no", you cannot assume the file is encoded in UTF-16.
Which characters are representable depends on the encoding used.

What could be the possible consequences of default encoding to UTF-8 for a String to Stream conversion?

I need to convert Strings obtained from some APIs into an InputStream consumed by other APIs. The only way is to convert the String to a stream without knowing the exact encoding. So I assume it to be UTF-8, and it works fine for now. However, I would like to know what a better solution would be, given that I have no way of identifying the encoding of the source of the string.
There is no good solution to the problem of not knowing the encoding.
Because of this, you must demand that the encoding be explicitly specified, or else use one single agreed-upon encoding that is strictly adhered to.
Also, make sure you use the rare form of the constructor to InputStreamReader that condescends to raise an exception on an encoding error. That is InputStreamReader(InputStream in, CharsetDecoder dec). The other three are either broken or else infelicitously designed, depending on your point of view or purposes, because they suppress encoding errors and render your program unreliable and nonportable.
Be very careful about missing errors, especially when you do not know for sure what you are getting — and even if you think you do :).
The consequence of applying the incorrect encoding is getting the wrong data out the other end.
The specific consequences will depend on the specific encodings. For example, if you receive a stream of ISO-8859-1 characters, and try to decode using UTF-8, you'll probably get errors due to incorrect sequences. If you start with UTF-16 and assume that it's ISO-8859-1, you'll get twice as many characters as you expect, and every other one will be garbage.
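A small demonstration of both mismatches described above:
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    public static void main(String[] args) {
        String original = "café";

        // ISO-8859-1 bytes decoded as UTF-8: the é is not a valid UTF-8 sequence,
        // so it comes out as the replacement character.
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.UTF_8));

        // UTF-16 (big-endian, no BOM) bytes decoded as ISO-8859-1:
        // twice as many chars, every other one a NUL.
        byte[] utf16 = original.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(new String(utf16, StandardCharsets.ISO_8859_1));
    }
}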
Encodings are not a property of Strings in Java, they're only relevant when you convert between Strings and bytes. If those APIs give you Strings, there is only one point where your program needs to use an encoding, which is when you convert the String back to bytes to be returned by the InputStream. And those "other APIs" of course need to know which encoding to use if they're going to interpret the contents as text data.
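A minimal sketch of that single conversion point, with the charset stated explicitly so the consuming side knows what to expect (names are illustrative only):
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StringToStream {
    public static void main(String[] args) throws IOException {
        String fromApi = "Grüße";

        // The only encode step: String -> bytes. Pick one charset and document it.
        InputStream in = new ByteArrayInputStream(fromApi.getBytes(StandardCharsets.UTF_8));

        // Whoever consumes the stream must decode with the same charset.
        byte[] buffer = in.readAllBytes(); // Java 9+
        System.out.println(new String(buffer, StandardCharsets.UTF_8));
    }
}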
To add to the other answers, your deployed application will no longer be portable between Windows and Linux, since these usually have different default encodings.
