The difference between InputStream and InputStreamReader is that InputStream reads bytes, while InputStreamReader reads chars. For example, if the text in a file is abc, then both of them work fine. But if the text is a你们, which is composed of an a and two Chinese characters, then the InputStream does not work.
So we should use InputStreamReader, but my question is:
How does InputStreamReader recognize characters?
a is one byte, but a Chinese character is two bytes. Does it read a as one byte and recognize each of the other characters as two bytes, or does it read every character in this text as two bytes?
An InputStream reads raw octet (8-bit) data. In Java, the byte type is equivalent to the char type in C. In C, this type can be used to represent character data or binary data. In Java, the char type shares greater similarities with the C wchar_t type.
An InputStreamReader then will transform data from some encoding into UTF-16. If "a你们" is encoded as UTF-8 on disk, it will be the byte sequence 61 E4 BD A0 E4 BB AC. When you pass the InputStream to InputStreamReader with the UTF-8 encoding, it will be read as the char sequence 0061 4F60 4EEC.
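To see this concretely, here is a minimal sketch that hard-codes those seven UTF-8 bytes from the example above and feeds them through an InputStreamReader:

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// The UTF-8 bytes of "a你们": 61 E4 BD A0 E4 BB AC
byte[] utf8 = { 0x61, (byte) 0xE4, (byte) 0xBD, (byte) 0xA0,
                (byte) 0xE4, (byte) 0xBB, (byte) 0xAC };
Reader reader = new InputStreamReader(
        new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
int c;
while ((c = reader.read()) != -1) {
    System.out.printf("%04X ", c); // prints: 0061 4F60 4EEC
}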
The character encoding API in Java contains the algorithms to perform this transformation. A list of encodings supported by the Oracle JRE is published in its documentation. The ICU project is a good place to start if you want to understand the internals of how this works in practice.
As Alexander Pogrebnyak points out, you should almost always provide the encoding explicitly. byte-to-char methods that do not specify an encoding rely on the JRE default, which is dependent on operating systems and user settings.
You have to give the reader a hint by providing the character set that your file is written in. E.g.
Reader reader =
    new InputStreamReader(
        new FileInputStream( "/path/to/file" ),
        "UTF-8" // most likely the encoding of the file
    );
Without a hint it will use your platform default encoding, which in many cases is not what you want.
This link has a nice explanation of encodings: http://www.joelonsoftware.com/articles/Unicode.html
Related
We have byte streams and character streams. If you read some examples on the internet, you can find that a byte stream operates on bytes and nothing more.
Once I read that both streams encode bytes to characters depending on the encoding, like if it's a byte stream then UTF-8, and if it's a character stream then UTF-16. So both of them encode bytes to characters; if this is true, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
And then why do we need an encoding in a byte stream?
Some popular websites did not help me.
Once I read that both streams encode bytes to characters depending on the encoding, like if it's a byte stream then UTF-8, and if it's a character stream then UTF-16. So both of them encode bytes to characters; if this is true, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
Everything in a typical modern computer has to be represented in bytes: a file holds a sequence of bytes, a network connection lets you send a sequence of bytes, a pointer identifies the location of a byte in memory, and so on. So a byte stream — an InputStream or OutputStream or the like — provides basic processing to let you read or write a sequence of bytes, no matter what kind of data is being represented by those bytes. The data might be text encoded as UTF-8 or UTF-16 or some other encoding, or it might be an image in a GIF or PNG or JPEG or other format, or it might be audio data or video data or a PDF or a Word document or . . . well, you get the idea.
A character stream — a Reader or Writer — provides a higher level of processing specifically for text data, so that you don't need to worry about the specific bytes being used to represent the characters, you just need to worry about the characters themselves. You just need to tell the character stream which character encoding to use (or let it use an appropriate default), and it can handle the rest from there.
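For instance, a minimal sketch (the file name is hypothetical) using the java.nio convenience method that wires this up in one call:

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// "notes.txt" is a made-up example file. The encoding is named once;
// from here on you deal only in characters, never raw bytes.
BufferedReader reader = Files.newBufferedReader(Paths.get("notes.txt"), StandardCharsets.UTF_8);
String line = reader.readLine();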
But there's one big complication: Java didn't introduce this distinction until version 1.1, and because Java aims for a very high degree of backward-compatibility, there are some classes that survive from version 1.0 that kind of straddle the line. In particular, there is a PrintStream class that extends OutputStream and adds special 'print' methods that take more convenient types, such as String, and handle the character encoding internally. That PrintStream class has been there since version 1.0, and is still in wide use, especially because System.out and System.err are instances of it. (In theory, we should be using PrintWriter instead.)
And then why do we need an encoding in a byte stream?
We need a character encoding in whatever layer is converting between character sequences and byte sequences. Normally that layer is separate from the byte stream, but as I mentioned above, there are some holdovers from version 1.0 that handle the conversion themselves, which means they need to know which encoding to use.
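For example, a sketch of the PrintStream case mentioned above (the file name is hypothetical): because this holdover does the conversion itself, its constructor accepts the encoding to use.

import java.io.FileOutputStream;
import java.io.PrintStream;

// "out.txt" is a made-up example file. This PrintStream converts Strings
// to bytes internally, so it must be told which encoding to apply.
PrintStream ps = new PrintStream(new FileOutputStream("out.txt"), true, "UTF-8");
ps.println("some text");
ps.close();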
It is fundamentally quite a straightforward system, but because it requires some background knowledge and several parts can interact, it can be confusing.
Let's put down some fundamental truths/axioms:
an InputStream is fundamentally about reading bytes from somewhere.
an OutputStream is fundamentally about writing bytes to somewhere.
Reader/Writer are the equivalent of those two for chars/String/text.
In the Java world, as long as you handle only String (or its related types like StringBuilder, ...) you don't need to care about encoding. It will always look like UTF-16, but you might as well pretend no encoding happens.
if you only ever handle byte[] (and related types like ByteBuffer) then you also don't need to care about encoding.
the encoding only ever comes into play when you want to cross over from the byte[] world to the String world (or the other way around).
So some Writer classes, like OutputStreamWriter, take a Charset when they are constructed. And that's precisely because it's one of those borders that I mention in the last point above: it handles both String and (indirectly) byte[], because it is a Writer that writes to an OutputStream, and for that to work it needs to convert the String that gets written to it into a byte[] that it can forward to the OutputStream.
Other Writers (such as StringWriter) don't transfer data between those two worlds: they take in String and produce String, so no conversion is necessary.
On the other side a ByteArrayInputStream is an InputStream that reads from a byte[], so again: both the input and the output live in "the same world", so no conversion is necessary and thus no Charset parameter exists.
tl;dr the "purity" of InputStream/OutputStream/Reader/Writer exists as long as you look only at those interfaces. When you look at specific implementations some of those will need to convert from the text world to the binary world (or vice versa) and those implementations will need to handle both worlds.
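A small sketch of those borders (the sample string is arbitrary): OutputStreamWriter takes a Charset because it crosses from the String world into the byte[] world, while StringWriter never leaves the String world and therefore takes none.

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Border crossing: String -> byte[], so a Charset is required.
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
writer.write("a你们");
writer.close();
System.out.println(bytes.size());           // 7 bytes: 1 for 'a', 3 per Chinese character

// No border: String in, String out, so no Charset parameter exists.
StringWriter sw = new StringWriter();
sw.write("a你们");
System.out.println(sw.toString().length()); // 3 chars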
I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
    new InputStreamReader(
        new FileInputStream(file), "UTF-16"));
How can I determine which encoding the file is in before I read it?
When I read a UTF-8 encoded file using UTF-16, I can't read the characters correctly.
There is no good way to do that. The question you're asking is like determining the radix of a number by looking at it. For example, what is the radix of 101?
Best solution would be to read the data into a byte array. Then you can use String(byte[] bytes, Charset charset) to test it with multiple encodings, most likely to least likely.
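A minimal sketch of that approach (the path and the candidate list are made up), using a strict CharsetDecoder rather than plain new String so that a wrong guess throws instead of silently substituting replacement characters:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] data = Files.readAllBytes(Paths.get("/path/to/file")); // hypothetical path
for (Charset cs : new Charset[] { StandardCharsets.UTF_8, StandardCharsets.UTF_16 }) {
    try {
        String text = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
        System.out.println(cs + " decoded cleanly: " + text);
        break; // stop at the first candidate that decodes without error
    } catch (CharacterCodingException e) {
        System.out.println(cs + " failed, trying the next candidate");
    }
}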
You cannot. Which transformation format applies is usually indicated by the first two to four bytes of the file (assuming a BOM is present). You cannot see those just from the outside.
You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).
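Here is a minimal sketch of the BOM check (the method name is made up, and it only recognizes the UTF-8 and UTF-16 marks; many UTF-8 files carry no BOM at all):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Inspect the first bytes of a file for a byte order mark.
static Charset detectBom(byte[] head) {
    if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
        return StandardCharsets.UTF_8;
    }
    if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
        return StandardCharsets.UTF_16BE;
    }
    if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
        return StandardCharsets.UTF_16LE;
    }
    return null; // no BOM: fall back to guessing or a configured default
}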
Noticed this behavior while troubleshooting a file generation issue in a piece of Java code that moved from an AIX to a Linux server.
Charset.defaultCharset();
returns ISO-8859-1 on AIX, UTF-8 on Linux, and windows-1252 on my Windows 7. With that said, I am trying to figure out why on the Linux machine, nlength = 24 (3 bytes per alphanumeric character) whereas on AIX and Windows it is 8.
String inString = "ABC12345";
byte[] ebcdicByte = new byte[inString.length()];
System.out.println("Length:"+inString.getBytes("Cp1047").length);
ebcdicByte = inString.getBytes("Cp1047");
String ebcdicString = new String( ebcdicByte);
int nlength = ebcdicString.getBytes().length;
You are misunderstanding things.
This is Java.
There are bytes. There are chars. And there is the default encoding.
When translating from bytes to chars, you have to decode.
When translating from chars to bytes, you have to encode.
And of course, apart from very limited charsets you will never have a 1-1 char-byte mapping.
If you see problems with encoding/decoding, the cause is pretty simple: somewhere in your code (with luck, in only one place; if not lucky, in several places) you failed to specify the charset to use when decoding and encoding.
Also note that by default, the encoding/decoding behaviour on failure is to replace unmappable char/byte sequences.
All this to say: a String does not have an encoding. Sure, it is a series of chars and a char is a primitive type; but it could just as well have been a stream of carrier pigeons, the two basic processes remain the same: you need to decode from bytes and you need to encode to bytes; if either part fails you end with meaningless byte sequences/mutant carrier pigeons.
Building on fge's answer...
Your observation is occurring because new String(ebcdicByte) and ebcdicString.getBytes() use the platform's default charset.
ISO-8859-1 and windows-1252 are one-byte charsets. In those charsets, one byte always represents one character. So in AIX and Windows, when you do new String(ebcdicByte), you will always get a String whose character count is identical to your byte array's length. Similarly, converting a String back to bytes will use a one-to-one mapping.
But in UTF-8, one character does not necessarily correspond to one byte. In UTF-8, bytes 0 through 127 are single-byte representations of characters, but all other values are part of a multi-byte sequence.
However, not just any sequence of bytes with their high bit set is a valid UTF-8 sequence. If you give a UTF-8 decoder a sequence of bytes that isn't a properly encoded UTF-8 byte sequence, it is considered malformed. new String will simply map malformed sequences to a special default character, usually "�" ('\ufffd'). That behavior can be changed by explicitly creating your own CharsetDecoder and calling its onMalformedInput method, rather than just relying on new String(byte[]).
So, the ebcdicByte array contains this EBCDIC representation of "ABC12345":
C1 C2 C3 F1 F2 F3 F4 F5
None of those are valid UTF-8 byte sequences, so ebcdicString ends up as "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd" which is "��������".
Your last line of code calls ebcdicString.getBytes(), which again does not specify a character set, which means the default charset will be used. Using UTF-8, "�" gets encoded as three bytes, EF BF BD. Since there are eight of those in ebcdicString, you get 3×8=24 bytes.
You have to specify the charset in the second to last line.
String ebcdicString = new String(ebcdicByte, "Cp1047");
As already pointed out, you always have to specify the charset when encoding/decoding.
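Putting it together, a sketch of the corrected round trip, with the charset given on every conversion:

String inString = "ABC12345";
byte[] ebcdicByte = inString.getBytes("Cp1047");            // 8 EBCDIC bytes: C1 C2 C3 F1 F2 F3 F4 F5
String ebcdicString = new String(ebcdicByte, "Cp1047");     // decode with the same charset
System.out.println(ebcdicString);                           // ABC12345
System.out.println(ebcdicString.getBytes("UTF-8").length);  // 8 on every platform, not 24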
From the Java Tutorial site, we know InputStreamReader and OutputStreamWriter can convert streams between bytes and characters.
InputStreamReader converts bytes read from input to characters, while OutputStreamWriter converts characters to bytes to output.
But when should I use these two classes?
We have InputStream/OutputStream to input/output byte by byte, and Reader/Writer to input/output character by character.
So when using InputStreamReader to input characters from a byte stream, why not just use the Reader class (or its subclasses) to read characters directly? Why not use OutputStream instead of OutputStreamWriter to write bytes directly?
EDIT:
When do I need to convert streams between bytes and characters using InputStreamReader and OutputStreamWriter?
EDIT:
Under which circumstances should I care about encoding scheme?
To understand the purpose of this, you need to get the following firmly into your mind. In Java char and String are for "text" expressed as Unicode, and byte or byte[] are for binary data. Bytes are NOT text. Bytes can represent encoded text ... but they have to be decoded before you can use the char and String types on them.
So when using InputStreamReader to input characters from a byte stream, why not just use the Reader class (or its subclasses) to read characters directly?
(InputStreamReader is a subclass of Reader, so it is not a case of "either ... or ...".)
The purpose of the InputStreamReader is to adapt an InputStream to a Reader. This adapter takes care of decoding the text from bytes to chars which contain Unicode codepoints1.
So you would use it when you have an existing InputStream (e.g. from a socket) ... or when you need more control over the selection of the encoding scheme. (Re the latter - you can open a file directly using FileReader, but that implicitly uses the default platform encoding for the file. By using FileInputStream -> InputStreamReader you can specify the encoding scheme explicitly.)
Why not use OutputStream instead of OutputStreamWriter to write bytes directly?
It's encodings again. If you want to write text to an OutputStream, you have to encode it according to some encoding scheme; e.g.
os.write(str.getBytes("UTF-8"));
By using a Writer, you move the encoding into the output pipeline where it is less obtrusive, and can typically be done more efficiently.
1 - or more strictly, a 16-bit representation of Unicode codepoints.
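To make that last point concrete, a sketch of the Writer version of the snippet above (os and str as before):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// The encoding is named once when the pipeline is set up;
// every subsequent write deals purely in text.
Writer w = new OutputStreamWriter(os, StandardCharsets.UTF_8);
w.write(str);
w.flush();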
Reader/Writer provide an API to read/write Strings from/to the stream, whereas InputStream/OutputStream don't provide reading/writing of Strings; instead they read/write byte by byte.
So if your program needs to read/write Strings, then I advise using Reader/Writer for simplicity.
Also, Reader/Writer use InputStream/OutputStream internally, so streams read/write a little faster if used directly.
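For instance, a sketch of the convenience difference when reading a line of text from standard input:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// With a Reader you get whole Strings back...
BufferedReader r = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8));
String line = r.readLine();

// ...with the raw InputStream you get one byte at a time and must
// assemble and decode characters yourself.
int b = System.in.read();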
I understand that Java character streams wrap byte streams such that the underlying byte stream is interpreted as per the system default or an otherwise specifically defined character set.
My system's default char-set is UTF-8.
If I use a FileReader to read in a text file, everything looks normal as the default char-set is used to interpret the bytes from the underlying InputStreamReader. If I explicitly define an InputStreamReader to read the UTF-8 encoded text file in as UTF-16, everything obviously looks strange. Using a byte stream like FileInputStream and redirecting its output to System.out, everything looks fine.
So, my questions are;
Why is it useful to use a character stream?
Why would I use a character stream instead of directly using a byte stream?
When is it useful to define a specific char-set?
Code that deals with strings should only "think" in terms of text - for example, reading an input source line by line, you don't want to care about the nature of that source.
However, storage is usually byte-oriented - so you need to create a conversion between the byte-oriented view of a source (encapsulated by InputStream) and the character-oriented view of a source (encapsulated by Reader).
So a method which (say) counts the lines of text in an input source should take a Reader parameter. If you want to count the lines of text in two files, one of which is encoded in UTF-8 and one of which is encoded in UTF-16, you'd create an InputStreamReader around a FileInputStream for each file, specifying the appropriate encoding each time.
(Personally I would avoid FileReader completely - the fact that it doesn't let you specify an encoding makes it useless IMO.)
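A sketch of that line-counting idea (the method and file names are made up):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// The counting logic sees only text; it neither knows nor cares
// how the underlying source was encoded.
static long countLines(Reader source) throws IOException {
    BufferedReader br = new BufferedReader(source);
    long lines = 0;
    while (br.readLine() != null) {
        lines++;
    }
    return lines;
}

// Hypothetical usage with two differently encoded files:
long utf8Count  = countLines(new InputStreamReader(new FileInputStream("utf8.txt"),  StandardCharsets.UTF_8));
long utf16Count = countLines(new InputStreamReader(new FileInputStream("utf16.txt"), StandardCharsets.UTF_16));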
An InputStream reads bytes, while a Reader reads characters. Because of the way bytes map to characters, you need to specify the character set (or encoding) when you create an InputStreamReader, the default being the platform character set.
When you are reading/writing text which contains characters that could be > 127, use a char stream. When you are reading/writing binary data, use a byte stream.
You can read text as binary if you wish, but unless you make a lot of assumptions it rarely gains you much.