How to force UTF-16 while reading/writing in Java?

I see that you can specify UTF-16 as the charset via Charset.forName("UTF-16"), and that you can create a new UTF-16 decoder via Charset.forName("UTF-16").newDecoder(), but I only see the ability to specify a CharsetDecoder on InputStreamReader's constructor.
So how do you specify UTF-16 when reading any stream in Java?

Input streams deal with raw bytes. When you read directly from an input stream, all you get is raw bytes where character sets are irrelevant.
The interpretation of raw bytes into characters, by definition, requires some sort of translation: how do I translate from raw bytes into a readable string? That "translation" comes in the form of a character set.
This "added" layer is implemented by Readers. Therefore, to read characters (rather than bytes) from a stream, you need to construct a Reader of some sort (depending on your needs) on top of the stream. For example:
InputStream is = ...;
Reader reader = new InputStreamReader(is, Charset.forName("UTF-16"));
This will cause reader.read() to read characters using the character set you specified. If you would like to read entire lines, use BufferedReader on top:
BufferedReader reader = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-16")));
String line = reader.readLine();
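The writing side works the same way through OutputStreamWriter. The following is a minimal sketch, not part of the original answer: the class name Utf16RoundTrip and its helper methods are hypothetical, and in-memory byte arrays stand in for whatever stream you actually have.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class Utf16RoundTrip {

    // Encodes text as UTF-16 by wrapping a byte stream in an OutputStreamWriter.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes UTF-16 bytes through an InputStreamReader and returns the first line.
    static String decodeFirstLine(byte[] bytes) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_16))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("h\u00e9llo\nworld");
        // Java's "UTF-16" encoder writes a 2-byte byte-order mark, then 2 bytes per char.
        System.out.println(bytes.length + " bytes");
        System.out.println(decodeFirstLine(bytes));
    }
}
```

If the other side expects no byte-order mark, StandardCharsets.UTF_16BE or UTF_16LE may be the better choice; which one applies depends on your data.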

Related

Are `InputStream` and `Reader` essentially the same, and are `OutputStream` and `Writer` essentially the same?

In Java, InputStream and OutputStream deal with byte[], and Reader and Writer with char[].
Do their input or output byte[] and char[] essentially have the same values? (That is my impression, because a char and a byte in IO have the same value)
In other words, are InputStream and Reader essentially the same, and are OutputStream and Writer essentially the same?
They're not essentially the same, but they do the same sorts of things for different kinds of data.
InputStream and OutputStream work in bytes. You'd use them when dealing with non-textual information (such as an image).
Reader and Writer work in characters. You'd use them when dealing with textual information.
So "yes" and "no". :-) InputStream and Reader are both for reading information (a stream of bytes or a stream of characters, respectively), and OutputStream and Writer are both for writing information (a stream of bytes or a stream of characters, respectively). Which you use depends on what kind of data you're dealing with. The streams are byte-oriented. The readers/writers are character-oriented.
There are bridging classes between the two kinds of data:
InputStreamReader reads from an InputStream and converts bytes to characters using a Charset (one provided explicitly or by name).
OutputStreamWriter does the converse: it converts characters to bytes (again via a Charset) and writes the bytes to an OutputStream.
...but most Reader/Writer subclasses read from/write to sources/destinations that are already character-based, and so don't deal with bytes at all. For instance, StringReader reads characters from a string. Since the source (the string) is already character-based, the Reader doesn't ever deal with bytes, just characters.
Yes, you have the right idea. The standard classes InputStreamReader and OutputStreamWriter act as adapters from the byte stream interfaces to the character stream interfaces, requiring only that a Charset (typically UTF-8) is specified. That Charset is used to convert the incoming bytes into Java's UTF-16 character type, so notably it is not true that the values read from an InputStream and from a Reader are always the same.
InputStream is typically used for reading data of any type, while Reader is only appropriate for reading text data.
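To make the difference concrete, here is a small sketch (the class and method names are made up for illustration) that reads the same UTF-8 data once as bytes and once as characters. The counts differ as soon as the text contains a non-ASCII character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class BytesVersusChars {

    // Counts the values an InputStream yields: one per byte.
    static int countBytes(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            int n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the values a Reader yields: one per decoded char.
    static int countChars(byte[] data) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int n = 0;
            while (reader.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "café": the é takes two bytes in UTF-8 but is a single char.
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(data)); // 5
        System.out.println(countChars(data)); // 4
    }
}
```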

Java.io Two ways to obtain buffered character stream from unbuffered byte one

I am switching to Java from c++ and now going through some of the documentation on Java IO. So if I want to make buffered character stream from unbuffered byte stream, I can do this in two ways:
Reader input1 = new BufferedReader(new InputStreamReader(new FileInputStream("Xanadu.txt")));
and
Reader input2 = new InputStreamReader(new BufferedInputStream(new FileInputStream("Xanadu.txt")));
So I can make it a character stream first and then buffer it, or vice versa.
What is the difference between them and which is better?
Functionally, there is no difference. The two versions will behave the same way.
There is likely to be a difference in performance, with the first version likely to be a bit faster than the second when you read characters from the Reader one at a time.
In the first version, an entire buffer full of data will be converted from bytes to chars in a single operation. Then each read() call on the Reader will fetch a character directly from the character buffer.
In the second version, each read() call on the Reader performs one or more read() calls on the input stream and converts only those bytes read to a character.
If I were going to implement this (precise) functionality, I would do it like this:
Reader input = new BufferedReader(new FileReader("Xanadu.txt"));
and let FileReader deal with the bytes-to-characters decoding under the hood.
There is a case for using an InputStreamReader, but only if you need to specify the character set for the bytes-to-characters conversion explicitly.
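For the explicit-charset case, a sketch along the following lines may help. The class name, the roundTrip helper and the temporary file are assumptions made for the demo; only the FileInputStream -> InputStreamReader -> BufferedReader chain comes from the discussion above.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, for illustration only.
public class ExplicitCharsetFileRead {

    // Reads the first line of a file, decoding its bytes with the given charset.
    static String firstLine(Path file, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to a temporary file in the given charset, then reads it back.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("xanadu", ".txt");
            try {
                Files.write(file, (text + "\n").getBytes(charset));
                return firstLine(file, charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("In Xanadu did Kubla Khan", StandardCharsets.UTF_16LE));
    }
}
```

The buffering sits on the character side here, matching the first of the two variants in the question.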

When should I use InputStreamReader and OutputStreamWriter?

From the Java Tutorial site, we know InputStreamReader and OutputStreamWriter can convert streams between bytes and characters.
InputStreamReader converts bytes read from input to characters, while OutputStreamWriter converts characters to bytes to output.
But when should I use these two classes?
We have InputStream/OutputStream to input/output byte by byte, and Reader/Writer to input/output character by character.
So when using InputStreamReader to input characters from a byte stream, why not just use the Reader class (or its subclasses) to read characters directly? Why not use OutputStream instead of OutputStreamWriter to write bytes directly?
EDIT:
When do I need to convert streams between bytes and characters using InputStreamReader and OutputStreamWriter?
EDIT:
Under which circumstances should I care about encoding scheme?
To understand the purpose of this, you need to get the following firmly into your mind. In Java char and String are for "text" expressed as Unicode, and byte or byte[] are for binary data. Bytes are NOT text. Bytes can represent encoded text ... but they have to be decoded before you can use the char and String types on them.
So when using InputStreamReader to input characters from byte stream, why not just use Reader class (or its sub classes) to read character directly?
(InputStreamReader is a subclass of Reader, so it is not a case of "either ... or ...".)
The purpose of the InputStreamReader is to adapt an InputStream to a Reader. This adapter takes care of decoding the text from bytes to chars that contain Unicode code points (see note 1 below).
So you would use it when you have an existing InputStream (e.g. from a socket) ... or when you need more control over the selection of the encoding scheme. (Re the latter: you can open a file directly using FileReader, but that implicitly uses the platform's default encoding for the file. By using FileInputStream -> InputStreamReader you can specify the encoding scheme explicitly.)
Why not use OutputStream instead of OutputStreamWriter to write bytes directly?
It's encodings again. If you want to write text to an OutputStream, you have to encode it according to some encoding scheme; e.g.
os.write(str.getBytes("UTF-8"));
By using a Writer, you move the encoding into the output pipeline where it is less obtrusive, and can typically be done more efficiently.
1 - or more strictly, a 16-bit representation of Unicode codepoints.
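As a rough illustration of the two approaches in this answer, the sketch below encodes the same string once by hand with getBytes and once through an OutputStreamWriter, and checks that the resulting bytes match. The class and method names are hypothetical, and a ByteArrayOutputStream stands in for the real destination.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
public class EncodeInPipeline {

    // Encoding by hand: the caller converts the String to bytes before writing.
    static byte[] viaGetBytes(String text) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        os.write(encoded, 0, encoded.length);
        return os.toByteArray();
    }

    // Encoding in the pipeline: the Writer converts chars to bytes as they pass through.
    static byte[] viaWriter(String text) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    public static void main(String[] args) {
        String text = "na\u00efve text";
        System.out.println(Arrays.equals(viaGetBytes(text), viaWriter(text))); // true
    }
}
```

With the Writer, the calling code only ever handles Strings; the charset is named once, where the pipeline is built.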
Reader/Writer provide an API for reading/writing Strings from/to a stream, whereas InputStream/OutputStream do not; they read/write byte by byte.
So if your program needs to read/write Strings, then I advise using Reader/Writer for simplicity.
Also, the bridging classes InputStreamReader/OutputStreamWriter sit on top of an InputStream/OutputStream, so using the streams directly avoids the conversion step and is a little faster when you only need raw bytes.

BufferedReader and InputStreamReader in Java

I recently started with Java and want to understand a java module of a large app. I came across this line of java code:
String line = (new BufferedReader(new InputStreamReader(System.in))).readLine();
What does this Java code do? Is there a C/C++ equivalent of this?
System.in is the standard input.
InputStreamReader wraps a byte stream (in this case the standard input) and decodes its bytes into characters, so now we have a character stream.
BufferedReader is an "abstraction" to help you work with character streams. For example, it implements readLine so that you don't have to read character by character until you find a '\n' to get the whole line. It just returns a String after this process.
So this line means: "Read a line from standard input and store it in line variable".
> What does this java code do:
String line is your string object
new BufferedReader(...).readLine() creates an instance of BufferedReader, which reads text from a character input stream; readLine() is the method it implements to read until a line terminator.
new InputStreamReader(...) gives you an instance of InputStreamReader, which is the "bridge" between the standard input byte stream and the character stream that a BufferedReader wants.
System.in is the standard input (byte stream)
> Is there a C/C++ equivalent of this
Well... there's no language called C/C++... ;)
So I'll assume you wanted an answer for each of them.
In C, there are no "strings"; you have to use a character array, but you can read data into a character array from stdin with something like:
char input[100];
...
scanf("%99[^\n]", input);
or
fgets(input, 100, stdin);
In C++, you'd use:
using namespace std;
string line;
getline(cin, line);
Your snippet uses a BufferedReader, chained to an InputStreamReader, to read a line from the standard input console and store it in the String line.
BufferedReader
Read text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders.
BufferedReader#readLine()
Read a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
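The line-termination rule quoted above can be checked with a short sketch. The class name and the lines helper are hypothetical; a StringReader stands in for the underlying character stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, for illustration only.
public class ReadLineTerminators {

    // Splits text into lines using BufferedReader.readLine().
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() returns null at end of stream; terminators are not included.
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Mixes "\n", "\r" and "\r\n" terminators in one input.
        System.out.println(lines("one\ntwo\rthree\r\nfour")); // [one, two, three, four]
    }
}
```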
InputStreamReader
An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.
Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
System
The System class contains several useful class fields and methods. It cannot be instantiated.
Among the facilities provided by the System class are standard input, standard output, and error output streams; access to externally defined "properties"; a means of loading files and libraries; and a utility method for quickly copying a portion of an array.
System.in
The "standard" input stream. This stream is already open and ready to supply input data. Typically this stream corresponds to keyboard input or another input source specified by the host environment or user.
What the code does is simply read a line from the input stream. From a design-pattern point of view, this is a decorator. BufferedReader is used to improve I/O performance.
For top efficiency, consider wrapping an InputStreamReader within a BufferedReader. For example:
BufferedReader in
= new BufferedReader(new InputStreamReader(System.in));

Trouble reading special characters (Java)

I'm making a chat client that uses special encryption. It has a problem reading letters like «, ƒ, ̕ from the input buffer.
I'm reading them into a byte array, and I tried using
Connection.getInputStream().read();
And also using
BufferedReader myInput = new BufferedReader(
new InputStreamReader(Connection.getInputStream()));
But there appears to be a problem as it displays them as square boxes.
You have to make sure that your InputStreamReader uses the same charset to decode the bytes into chars as the one used by the sender to encode chars into bytes. Look at the other constructors of InputStreamReader.
You must also make sure that the font you're using to display the chars supports your special characters.
Set the correct encoding on the stream through new InputStreamReader(..,"utf-8") or whatever your input is.
Convert the byte array to a String, specifying the character set.
String data = new String(bytes, "UTF-8"); // bytes is the byte[] you read
Make sure that the font used for display supports the characters you are decoding.
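To see what a charset mismatch looks like, here is a small sketch. The class name and the decode helper are made up, and the byte array stands in for whatever arrives on the connection. It decodes the UTF-8 bytes of « once with the right charset and once with the wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class CharsetMismatch {

    // Decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption for the demo: the sender encoded « (U+00AB) as UTF-8, i.e. bytes C2 AB.
        byte[] wire = "\u00ab".getBytes(StandardCharsets.UTF_8);

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: the original character
        System.out.println(wrong.length()); // 2: each byte became its own character
    }
}
```

When the decoder cannot map the bytes at all, you typically get replacement characters instead, which many fonts render as boxes or question marks.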
You can try using a DataInputStream and the readChar() method.
DataInputStream in = new DataInputStream(myinput);
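// where myinput is your BufferedInputStream
char c = in.readChar();
should do what you want, provided the sender wrote each character as two bytes, high byte first (for example with DataOutputStream.writeChar()).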
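Note that readChar() always consumes exactly two bytes and combines them high byte first. The sketch below (hypothetical class and helper names, with byte arrays standing in for the connection) suggests it recovers the character only when the sender used a matching two-byte encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class ReadCharDemo {

    // Reads one char from the start of the given bytes using DataInputStream.readChar().
    static char readOneChar(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readChar();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Big-endian UTF-16 bytes 00 AB match what readChar() expects.
        char fromUtf16 = readOneChar("\u00ab".getBytes(StandardCharsets.UTF_16BE));
        // UTF-8 bytes C2 AB are combined into the unrelated char U+C2AB.
        char fromUtf8 = readOneChar("\u00ab".getBytes(StandardCharsets.UTF_8));

        System.out.println(fromUtf16 == '\u00ab'); // true
        System.out.println(fromUtf8 == '\u00ab');  // false
    }
}
```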
