Why character streams? - java

I understand that Java character streams wrap byte streams such that the underlying byte stream is interpreted as per the system default or an otherwise specifically defined character set.
My system's default char-set is UTF-8.
If I use a FileReader to read in a text file, everything looks normal, as the default char-set is used to interpret the bytes from the underlying InputStreamReader. If I explicitly define an InputStreamReader to read the UTF-8 encoded text file in as UTF-16, everything obviously looks strange. Using a byte stream like FileInputStream and redirecting its output to System.out, everything looks fine.
So, my questions are:
Why is it useful to use a character stream?
Why would I use a character stream instead of directly using a byte stream?
When is it useful to define a specific char-set?

Code that deals with strings should only "think" in terms of text - for example, reading an input source line by line, you don't want to care about the nature of that source.
However, storage is usually byte-oriented - so you need to create a conversion between the byte-oriented view of a source (encapsulated by InputStream) and the character-oriented view of a source (encapsulated by Reader).
So a method which (say) counts the lines of text in an input source should take a Reader parameter. If you want to count the lines of text in two files, one of which is encoded in UTF-8 and one of which is encoded in UTF-16, you'd create an InputStreamReader around a FileInputStream for each file, specifying the appropriate encoding each time.
(Personally I would avoid FileReader completely - before Java 11 it doesn't let you specify an encoding at all, which makes it useless IMO.)
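As a rough sketch of the line-counting idea above: the counting method only sees a Reader, and the caller decides how bytes become characters. The class name LineCounter and the byte[] helper are made up for illustration, and a ByteArrayInputStream stands in for the FileInputStream you would use with real files.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LineCounter {
    // Counts lines of text. Knows nothing about bytes or encodings.
    static int countLines(Reader source) throws IOException {
        BufferedReader lines = new BufferedReader(source);
        int count = 0;
        while (lines.readLine() != null) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: decodes raw bytes with the given charset, then counts.
    // With real files, swap the ByteArrayInputStream for a FileInputStream.
    static int countLines(byte[] encoded, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(encoded), charset)) {
            return countLines(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "first\nsecond\nthird\n";
        // Same text, two different byte representations, same line count.
        System.out.println(countLines(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
        System.out.println(countLines(text.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
    }
}
```

The encoding is chosen once, where the InputStreamReader is created, and never leaks into the counting logic.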

An InputStream reads bytes, while a Reader reads characters. Because the mapping from bytes to characters depends on the encoding, you should specify the character set (or encoding) when you create an InputStreamReader; if you don't, the platform default character set is used.

When you are reading/writing text which contains characters which could be > 127, use a char stream. When you are reading/writing binary data, use a byte stream.
You can read text as binary if you wish, but unless you make a lot of assumptions it rarely gains you much.
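A small sketch of why characters above 127 matter, assuming the text is UTF-8 (the class and method names are invented for the example): the same data yields a different number of units depending on whether you read it as bytes or as characters.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BytesVersusChars {
    // Reads through a byte stream: one unit per byte.
    static int countBytes(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads through a character stream (assumed UTF-8): one unit per decoded char.
    static int countChars(byte[] data) {
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf" plus U+00E9 (e with acute accent), which takes two bytes in UTF-8.
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(data) + " bytes, " + countChars(data) + " chars");
    }
}
```

If you process that text byte by byte, you have to reassemble the two-byte sequence yourself; the Reader does it for you.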

Related

Does a byte stream encode bytes to characters or only operate on bytes?

We have byte streams and character streams. If you read some examples on the internet, you will find that a byte stream only operates on bytes and nothing more.
Once I read that both streams encode bytes to characters depending on the encoding: UTF-8 for a byte stream, UTF-16 for a character stream. So both of them encode bytes to characters; if this is true, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
And then why do we need an encoding in a byte stream?
Some popular websites did not help me.
Everything in a typical modern computer has to be represented in bytes: a file holds a sequence of bytes, a network connection lets you send a sequence of bytes, a pointer identifies the location of a byte in memory, and so on. So a byte stream — an InputStream or OutputStream or the like — provides basic processing to let you read or write a sequence of bytes, no matter what kind of data is being represented by those bytes. The data might be text encoded as UTF-8 or UTF-16 or some other encoding, or it might be an image in a GIF or PNG or JPEG or other format, or it might be audio data or video data or a PDF or a Word document or . . . well, you get the idea.
A character stream — a Reader or Writer — provides a higher level of processing specifically for text data, so that you don't need to worry about the specific bytes being used to represent the characters, you just need to worry about the characters themselves. You just need to tell the character stream which character encoding to use (or let it use an appropriate default), and it can handle the rest from there.
But there's one big complication: Java didn't introduce this distinction until version 1.1, and because Java aims for a very high degree of backward-compatibility, there are some classes that survive from version 1.0 that kind of straddle the line. In particular, there is a PrintStream class that extends OutputStream and adds special 'print' methods that take more convenient types, such as String, and handle the character encoding internally. That PrintStream class has been there since version 1.0, and is still in wide use, especially because System.out and System.err are instances of it. (In theory, we should be using PrintWriter instead.)
And then why do we need an encoding in a byte stream?
We need a character encoding in whatever layer is converting between character sequences and byte sequences. Normally that layer is separate from the byte stream, but as I mentioned above, there are some holdovers from version 1.0 that handle the conversion themselves, which means they need to know which encoding to use.
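To see a PrintStream doing the conversion itself, here is a minimal sketch (the class and method names are made up). It uses the PrintStream(OutputStream, boolean, String) constructor, which takes an encoding name, and captures the resulting bytes in a ByteArrayOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamEncoding {
    // Prints text through a PrintStream and returns the bytes it produced.
    static byte[] printWith(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encodingName)) {
            out.print(text); // the PrintStream encodes the String internally
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // ends with U+00E9
        System.out.println(printWith(text, "ISO-8859-1").length + " bytes as ISO-8859-1");
        System.out.println(printWith(text, "UTF-8").length + " bytes as UTF-8");
    }
}
```

The byte stream is still just moving bytes; it only needs an encoding because its print methods accept String.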
It is a fundamentally quite straightforward system, but due to some required existing knowledge and possible interactions of several parts it can be confusing.
Let's put down some fundamental truths/axioms:
An InputStream is fundamentally about reading bytes from somewhere.
An OutputStream is fundamentally about writing bytes to somewhere.
Reader/Writer are the equivalent of those two for chars/String/text.
In the Java world, as long as you handle only String (or its related types like StringBuilder, ...) you don't need to care about encoding. It will always look like UTF-16, but you might as well pretend no encoding happens.
If you only ever handle byte[] (and related types like ByteBuffer) then you also don't need to care about encoding.
The encoding only ever comes into play when you want to cross over from the byte[] world to the String world (or the other way around).
So some Writer classes like OutputStreamWriter take a Charset to construct. And that's precisely because it's one of those borders that I mention in the last point above: it's handling both String and byte[] (indirectly), because it is a Writer that writes to an OutputStream, and for that to work it will need to convert the String that gets written to it into a byte[] that it can forward to the OutputStream.
Other Writers (such as StringWriter) don't transfer data between those two worlds: a StringWriter takes in String and produces String, so no conversion is necessary.
On the other side a ByteArrayInputStream is an InputStream that reads from a byte[], so again: both the input and the output live in "the same world", so no conversion is necessary and thus no Charset parameter exists.
tl;dr the "purity" of InputStream/OutputStream/Reader/Writer exists as long as you look only at those interfaces. When you look at specific implementations some of those will need to convert from the text world to the binary world (or vice versa) and those implementations will need to handle both worlds.
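A short sketch of the border described above, with invented class and method names: OutputStreamWriter crosses from String to byte[] and so needs a Charset, while StringWriter stays in the String world and takes none.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBorder {
    // Crosses the border: text goes in, bytes come out, so a Charset is required.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Stays in the String world: no Charset parameter exists or is needed.
    static String passThrough(String text) {
        StringWriter writer = new StringWriter();
        writer.write(text);
        return writer.toString();
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // second letter is U+00E9
        System.out.println(encode(text, StandardCharsets.UTF_8).length);    // 6
        System.out.println(encode(text, StandardCharsets.UTF_16BE).length); // 10
        System.out.println(passThrough(text).equals(text));                 // true
    }
}
```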

Find out encoding directly from an input stream [duplicate]

I'm facing a problem.
A file can be written in some encoding such as UTF-8, UTF-16, UTF-32, etc.
When I read a UTF-16 file, I use the code below:
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file), "UTF-16"));
How can I determine which encoding the file is in before I read the file?
When I read UTF-8 encoded file using UTF-16 I can't read the characters correctly.
There is no good way to do that. The question you're asking is like determining the radix of a number by looking at it. For example, what is the radix of 101?
Best solution would be to read the data into a byte array. Then you can use String(byte[] bytes, Charset charset) to test it with multiple encodings, most likely to least likely.
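One caveat with that approach: new String(bytes, charset) never fails, it silently replaces malformed input with U+FFFD, so you would have to inspect the result yourself. A possible variation, sketched here with invented names, is to use a CharsetDecoder configured to report errors and take the first candidate that decodes cleanly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class CharsetGuesser {
    // Returns the first candidate that decodes the bytes without error, or null.
    // Order the candidates from most likely to least likely.
    static Charset firstThatDecodes(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return candidate;
            } catch (CharacterCodingException e) {
                // Not valid in this encoding; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // ISO-8859-1 accepts every byte sequence, so it only makes sense as a last resort.
        List<Charset> candidates = Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstThatDecodes(latin1, candidates));
    }
}
```

A clean decode only proves the bytes are valid in that encoding, not that the encoding is the right one, so this is still a guess.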
You cannot. Which transformation format applies is usually determined by the first two to four bytes of the file (assuming a BOM). You cannot see those just from the outside.
You can read the first few bytes and try to guess the encoding.
If all else fails, try reading with different encodings until one works (no exception when decoding and it 'looks' OK).
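Guessing from the first few bytes usually means looking for a byte order mark. A minimal sketch, assuming the file actually starts with a BOM (many UTF-8 files do not); the class and method names are made up:

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns the charset name implied by a leading BOM, or null if none is found.
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        // Must be checked before UTF-16LE, whose BOM is a prefix of this one.
        // (FF FE 00 00 could also be UTF-16LE text starting with U+0000; that is rare.)
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Java's "UTF-16" encoder writes a big-endian BOM before the text.
        byte[] utf16 = "hi".getBytes(StandardCharsets.UTF_16);
        System.out.println(charsetFromBom(utf16));
        System.out.println(charsetFromBom("hi".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

When there is no BOM, this returns null and you are back to trying encodings one at a time.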

What is the difference between OutputStream and Writer?

Can someone explain me the difference between OutputStream and Writer? Which of these classes should I work with?
Streams work at the byte level: they can read (InputStream) and write (OutputStream) bytes or arrays of bytes.
Readers and Writers add the concept of characters on top of a stream. Since characters can only be translated to and from bytes by using an encoding, readers and writers have an encoding component (which may be set automatically, since Java has a default encoding property). Characters written to a Writer are automatically encoded into bytes and sent to the stream; bytes read from the stream by a Reader are decoded into characters.
OutputStream classes write to the target byte by byte, whereas Writer classes write to the target character by character.
An OutputStream is a stream that can write information. This is fairly general, so there are specialized OutputStream for special purposes like writing to files. A stream can only write arrays of bytes.
Writers provide more flexibility in that they can write characters and even strings while taking a special encoding into account.
Which one to take is really a matter of what you want to write. If you do have bytes already, you can use the stream directly. If you have characters or strings, you either need to convert them to bytes yourself if you want to write them to a stream, or you need to use a Writer which does that job for you.
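The two options can be put side by side. In this sketch (names invented, UTF-8 assumed), converting the String yourself and handing it to a Writer end up producing the same bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TwoWaysToWriteText {
    // Option 1: convert to bytes yourself, then use the stream directly.
    static byte[] viaStream(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        out.write(encoded, 0, encoded.length);
        return out.toByteArray();
    }

    // Option 2: let a Writer do the conversion.
    static byte[] viaWriter(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "na\u00efve text";
        System.out.println(Arrays.equals(viaStream(text), viaWriter(text)));
    }
}
```

The Writer version pays off once you write many strings, since the encoding is stated in one place.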
OutputStream uses bare bytes, whereas Writer uses encoded characters.
The Reader/Writer class hierarchy is character-oriented, and the InputStream/OutputStream class hierarchy is byte-oriented.
Basically there are two types of streams: byte streams, which handle streams of bytes, and character streams, which handle streams of characters. In the byte stream hierarchy, InputStream and OutputStream are the abstract classes at the top, while Reader and Writer are the abstract classes at the top of the character stream hierarchy.

Readline() in Java does not handle Chinese characters properly

I have a text file with Chinese words written to a line. The line is surrounded with "\r\n", and written using fileOutputStream.write(string.getBytes()).
I have no problems reading lines of English words, my buffered reader parses it with readLine() perfectly. However, it recognizes the Chinese sentence as multiple lines, thus screwing up my programme flow.
Any solutions?
Using string.getBytes() encodes the String using the platform default encoding. That is rarely what you want, especially when you're trying to write characters that are not native to your current locale.
Specify the encoding instead (using string.getBytes("UTF-8"), for example).
A cleaner and more Java-esque way would be to wrap your OutputStream in an OutputStreamWriter like this:
Writer w = new OutputStreamWriter(out, "UTF-8");
Then you can simply call w.write(string) and don't need to repeat the encoding each time you want to write a String.
And, as commented below, specify the same encoding when reading the file (using a Reader, preferably).
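Putting both halves together, a round trip might look like the sketch below. The class name is made up, StandardCharsets.UTF_8 is used instead of the "UTF-8" string, and byte arrays stand in for the file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8RoundTrip {
    // Writes each line followed by "\r\n", encoding as UTF-8.
    static byte[] writeLines(String... lines) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.write("\r\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reads the lines back, decoding with the same encoding that was used to write.
    static List<String> readLines(byte[] data) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as Unicode escapes.
        String chinese = "\u4f60\u597d\u4e16\u754c";
        List<String> lines = readLines(writeLines("hello", chinese));
        System.out.println(lines.size() + " lines, second line intact: " + lines.get(1).equals(chinese));
    }
}
```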
If you're outputting the text via fileOutputStream.write(string.getBytes()), you're outputting with the default encoding for the platform. It's important to ensure you're then reading with the appropriate encoding, and using methods that are encoding-aware. The problem won't be in your BufferedReader instance, but whatever Reader you have under it that's converting bytes into characters.
This article may be of use: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Which class is used for writing characters rather than bytes?

Which class should be used in situations that require writing characters rather than bytes?
Please take a look at java.io.Writer and subclasses.
PrintWriter will be useful
http://download.oracle.com/javase/1.4.2/docs/api/java/io/PrintWriter.html
An important thing to know about I/O in Java is that streams (InputStream and OutputStream etc.) are used for reading and writing binary data (you read or write bytes exactly as they are in the file), and readers and writers (Reader and Writer etc.) are for reading and writing characters.
Readers and writers are a layer on top of streams. A Reader interprets the bytes from an InputStream using a character encoding (such as UTF-8, ISO-8859-1, US-ASCII) to convert them into characters, and a Writer uses a character encoding to turn characters into bytes.
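For instance, a PrintWriter can be layered on an OutputStreamWriter, which in turn sits on a byte stream. This is only a sketch with an invented class name; the ByteArrayOutputStream stands in for whatever OutputStream you actually write to.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

class PrintWriterExample {
    // Writes "name: count" as characters; the OutputStreamWriter turns them into UTF-8 bytes.
    static byte[] report(String name, int count) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, StandardCharsets.UTF_8))) {
            out.print(name);
            out.print(": ");
            out.print(count);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = report("na\u00efve", 3);
        // Decode with the same charset to get the text back.
        System.out.println(new String(bytes, StandardCharsets.UTF_8) + " (" + bytes.length + " bytes)");
    }
}
```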
