First, I know the difference between a character and a byte.
A character is a symbol that stands for something ("A", "中", and so on), while a byte is a concrete unit of storage in a computer. The size of a character in a computer depends on the encoding.
But what exactly are a character stream and a byte stream? What specific types do they stand for? Is a byte stream simply a stream of bytes? If so, what is a stream of characters? My last question is: what type of stream does TCP transport?
Character Stream is a higher level concept than Byte Stream. A Character Stream is, effectively, a Byte Stream that has been wrapped with logic that allows it to output characters from a specific encoding; as opposed to one having to read bytes and decode the characters they represent.
An InputStream reads bytes, and a Reader reads characters.
Everything over TCP will natively be in bytes. If you know that the byte stream is representing characters, you can use an InputStreamReader to use the InputStream as a Reader.
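As a minimal sketch of that last point, the example below wraps an InputStream in an InputStreamReader. The class name BytesToChars and the helper readAll are made up for illustration, and a ByteArrayInputStream stands in for what socket.getInputStream() would return, so this is an assumption-laden demo rather than real network code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BytesToChars {
    // Hypothetical helper: reads every character from a byte stream,
    // decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for socket.getInputStream(): the UTF-8 bytes of "A" followed by U+4E2D.
        byte[] wire = "A\u4e2d".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        System.out.println(wire.length + " bytes decoded to " + text.length() + " chars");
    }
}
```

Here four bytes on the wire become two characters, because U+4E2D takes three bytes in UTF-8.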
TCP transports bytes of course. What these bytes represent is up to the protocol.
You can read about the relation between character and byte streams here: http://docs.oracle.com/javase/tutorial/i18n/text/stream.html
Practically, a character stream is an application-side abstraction over a byte stream that lets you read bytes as characters, or write characters as bytes, using various encodings.
Have a look at these:
Character Streams versus Byte Streams
Character and Byte Streams
And I assume TCP transports packets that carry a stream of bytes.
Character stream classes in Java handle the input and output of characters (for example, Unicode text), whereas byte stream classes handle the input and output of raw bytes, whatever those bytes represent. The byte stream classes date from Java 1.0, whereas the character stream classes were added in Java 1.1.
We have byte and character streams. If you read some examples on the internet, you will find that a byte stream only operates on bytes and nothing more.
I once read that both streams encode bytes to characters depending on the encoding: UTF-8 for a byte stream, UTF-16 for a character stream. So both of them encode bytes to characters. If this is true, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
And then why do we need an encoding in a byte stream?
Some popular websites did not help me.
I once read that both streams encode bytes to characters depending on the encoding: UTF-8 for a byte stream, UTF-16 for a character stream. So both of them encode bytes to characters. If this is true, why is it written everywhere that a byte stream operates on bytes only? Can a byte stream read data other than bytes and then just convert it to bytes?
Everything in a typical modern computer has to be represented in bytes: a file holds a sequence of bytes, a network connection lets you send a sequence of bytes, a pointer identifies the location of a byte in memory, and so on. So a byte stream — an InputStream or OutputStream or the like — provides basic processing to let you read or write a sequence of bytes, no matter what kind of data is being represented by those bytes. The data might be text encoded as UTF-8 or UTF-16 or some other encoding, or it might be an image in a GIF or PNG or JPEG or other format, or it might be audio data or video data or a PDF or a Word document or . . . well, you get the idea.
A character stream — a Reader or Writer — provides a higher level of processing specifically for text data, so that you don't need to worry about the specific bytes being used to represent the characters, you just need to worry about the characters themselves. You just need to tell the character stream which character encoding to use (or let it use an appropriate default), and it can handle the rest from there.
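To make the difference concrete, here is a small sketch that reads the same data once through a byte stream and once through a character stream and counts the units each one delivers. The class and method names (ByteVsCharCount, countBytes, countChars) are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteVsCharCount {
    // Counts how many bytes an InputStream delivers.
    static int countBytes(byte[] data) throws IOException {
        int n = 0;
        try (InputStream in = new ByteArrayInputStream(data)) {
            while (in.read() != -1) {
                n++;
            }
        }
        return n;
    }

    // Counts how many chars a Reader delivers for the same data.
    static int countChars(byte[] data, Charset charset) throws IOException {
        int n = 0;
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            while (reader.read() != -1) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        // "h\u00e9llo" is "hello" with an e-acute, which takes two bytes in UTF-8.
        byte[] data = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + countBytes(data));
        System.out.println("chars: " + countChars(data, StandardCharsets.UTF_8));
    }
}
```

The byte stream sees six bytes and the character stream sees five characters, which is exactly the detail the Reader hides from you.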
But there's one big complication: Java didn't introduce this distinction until version 1.1, and because Java aims for a very high degree of backward-compatibility, there are some classes that survive from version 1.0 that kind of straddle the line. In particular, there is a PrintStream class that extends OutputStream and adds special 'print' methods that take more convenient types, such as String, and handle the character encoding internally. That PrintStream class has been there since version 1.0, and is still in wide use, especially because System.out and System.err are instances of it. (In theory, we should be using PrintWriter instead.)
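The following sketch illustrates how PrintStream straddles the line. It writes the same string through a PrintStream constructed with an encoding name and through a PrintWriter layered on an OutputStreamWriter; both end up producing the same bytes. The class name and the two helper methods are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrintStreamVsPrintWriter {
    // PrintStream is an OutputStream, yet it encodes text internally.
    static byte[] viaPrintStream(String s) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(sink, true, "UTF-8");
        ps.print(s);
        ps.flush();
        return sink.toByteArray();
    }

    // PrintWriter is a Writer; the encoding lives in the OutputStreamWriter below it.
    static byte[] viaPrintWriter(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter pw = new PrintWriter(new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        pw.print(s);
        pw.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "caf\u00e9";
        System.out.println(Arrays.equals(viaPrintStream(s), viaPrintWriter(s)));
    }
}
```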
And then why do we need an encoding in a byte stream?
We need a character encoding in whatever layer is converting between character sequences and byte sequences. Normally that layer is separate from the byte stream, but as I mentioned above, there are some holdovers from version 1.0 that handle the conversion themselves, which means they need to know which encoding to use.
It is a fundamentally quite straightforward system, but due to some required existing knowledge and possible interactions of several parts it can be confusing.
Let's put down some fundamental truths/axioms:
An InputStream is fundamentally about reading bytes from somewhere.
An OutputStream is fundamentally about writing bytes to somewhere.
Reader/Writer are the equivalent of those two for chars/String/text.
In the Java world, as long as you handle only String (or its related types like StringBuilder, ...) you don't need to care about encoding. It will always look like UTF-16, but you might as well pretend no encoding happens.
If you only ever handle byte[] (and related types like ByteBuffer) then you also don't need to care about encoding.
The encoding only ever comes into play when you want to cross over from the byte[] world to the String world (or the other way around).
So some Writer classes like OutputStreamWriter take a Charset to construct. And that's precisely because it's one of those borders that I mention in the last point above: it's handling both String and byte[] (indirectly), because it is a Writer that writes to an OutputStream, and for that to work it needs to convert the String that gets written to it into a byte[] that it can forward to the OutputStream.
Other Writers (such as StringWriter) don't transfer data between those two worlds: they take in String and produce String, so no conversion is necessary.
On the other side a ByteArrayInputStream is an InputStream that reads from a byte[], so again: both the input and the output live in "the same world", so no conversion is necessary and thus no Charset parameter exists.
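A short sketch of the two cases described above: a Writer that crosses the border needs a Charset, and one that stays in the String world does not. BorderCrossing, encode and passThrough are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BorderCrossing {
    // Crosses from the String world to the byte[] world, so a Charset is required.
    static byte[] encode(String s, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(s);
        }
        return bytes.toByteArray();
    }

    // Stays entirely in the String world, so no Charset is involved.
    static String passThrough(String s) {
        StringWriter writer = new StringWriter();
        writer.write(s);
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "\u4e2d"; // a single character
        System.out.println("UTF-8:    " + encode(s, StandardCharsets.UTF_8).length + " bytes");
        System.out.println("UTF-16BE: " + encode(s, StandardCharsets.UTF_16BE).length + " bytes");
        System.out.println("StringWriter unchanged: " + passThrough(s).equals(s));
    }
}
```

The same one-character string becomes three bytes in UTF-8 and two bytes in UTF-16BE, which is why the border-crossing Writer has to be told which encoding to use.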
tl;dr the "purity" of InputStream/OutputStream/Reader/Writer exists as long as you look only at those interfaces. When you look at specific implementations some of those will need to convert from the text world to the binary world (or vice versa) and those implementations will need to handle both worlds.
From the Java Tutorial site, we know InputStreamReader and OutputStreamWriter can convert streams between bytes and characters.
InputStreamReader converts bytes read from input to characters, while OutputStreamWriter converts characters to bytes to output.
But when should I use these two classes?
We have InputStream/OutputStream for byte-by-byte input/output, and Reader/Writer for character-by-character input/output.
So when using InputStreamReader to input characters from a byte stream, why not just use the Reader class (or its subclasses) to read characters directly? Why not use OutputStream instead of OutputStreamWriter to write bytes directly?
EDIT:
When do I need to convert streams between bytes and characters using InputStreamReader and OutputStreamWriter?
EDIT:
Under which circumstances should I care about encoding scheme?
To understand the purpose of this, you need to get the following firmly into your mind. In Java char and String are for "text" expressed as Unicode, and byte or byte[] are for binary data. Bytes are NOT text. Bytes can represent encoded text ... but they have to be decoded before you can use the char and String types on them.
So when using InputStreamReader to input characters from a byte stream, why not just use the Reader class (or its subclasses) to read characters directly?
(InputStreamReader is a subclass of Reader, so it is not a case of "either ... or ...".)
The purpose of the InputStreamReader is to adapt an InputStream to a Reader. This adapter takes care of decoding the text from bytes to chars which contain Unicode codepoints [1].
So you would use it when you have an existing InputStream (e.g. from a socket) ... or when you need more control over the selection of the encoding scheme. (Re the latter: you can open a file directly using FileReader, but that implicitly uses the platform's default encoding for the file. By using FileInputStream -> InputStreamReader you can specify the encoding scheme explicitly.)
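A minimal sketch of the FileInputStream -> InputStreamReader chain with an explicit encoding follows. The class name ExplicitEncodingRead, the helper readText and the temporary file are assumptions made for the demo.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingRead {
    // Reads a whole text file, decoding its bytes with the charset the caller names.
    static String readText(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("latin1-", ".txt");
        try {
            // The file is written as ISO-8859-1, so it must be read as ISO-8859-1.
            Files.write(tmp, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
            String text = readText(tmp, StandardCharsets.ISO_8859_1);
            System.out.println("round trip ok: " + text.equals("caf\u00e9"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```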
Why not use OutputStream instead of OutputStreamWriter to write bytes directly?
It's encodings again. If you want to write text to an OutputStream, you have to encode it according to some encoding scheme; e.g.
os.write(str.getBytes("UTF-8"));
By using a Writer, you move the encoding into the output pipeline where it is less obtrusive, and can typically be done more efficiently.
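As a rough sketch of what moving the encoding into the pipeline looks like, the example below picks the charset once when the Writer is constructed and then writes plain strings. EncodeInPipeline and writeLines are hypothetical names, and a ByteArrayOutputStream stands in for the real destination.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class EncodeInPipeline {
    // The charset is chosen once, when the pipeline is built;
    // the loop below only deals with strings.
    static byte[] writeLines(List<String> lines) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer writer = new BufferedWriter(
                new OutputStreamWriter(os, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
        return os.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] viaWriter = writeLines(Arrays.asList("one", "deux \u00e9"));
        // The manual alternative: encode each piece of text yourself.
        byte[] manual = "one\ndeux \u00e9\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("same bytes: " + Arrays.equals(viaWriter, manual));
    }
}
```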
1 - or more strictly, a 16-bit representation of Unicode codepoints.
Reader/Writer provide an API to read/write strings from/to a stream, whereas InputStream/OutputStream don't provide read/write of strings; instead they read/write byte by byte.
So if your program needs to read/write Strings, then I advise using Reader/Writer for simplicity.
Also, the bridge classes among Reader/Writer (InputStreamReader/OutputStreamWriter) use InputStream/OutputStream internally, so the streams can be a little faster when used directly on data that is already bytes.
Can someone explain to me the difference between OutputStream and Writer? Which of these classes should I work with?
Streams work at the byte level: they can read (InputStream) and write (OutputStream) bytes or arrays of bytes.
Readers/Writers add the concept of a character on top of a stream. Since a character can only be translated to bytes by using an encoding, readers and writers have an encoding component (which may be set automatically, since Java has a default encoding property). Characters written to a Writer are automatically converted to bytes by the encoding and sent to the stream; bytes read through a Reader are converted to characters in the same way.
OutputStream classes write to the target byte by byte, whereas Writer classes write to the target character by character.
An OutputStream is a stream that can write information. This is fairly general, so there are specialized OutputStreams for special purposes like writing to files. A stream can only write bytes and arrays of bytes.
Writers provide more flexibility in that they can write characters and even strings while taking a special encoding into account.
Which one to take is really a matter of what you want to write. If you do have bytes already, you can use the stream directly. If you have characters or strings, you either need to convert them to bytes yourself if you want to write them to a stream, or you need to use a Writer which does that job for you.
OutputStream uses bare bytes, whereas Writer uses characters, which it encodes into bytes.
The Reader/Writer class hierarchy is character-oriented, and the InputStream/OutputStream class hierarchy is byte-oriented.
Basically there are two types of streams: byte streams, which handle streams of bytes, and character streams, which handle streams of characters. In the byte stream hierarchy, InputStream/OutputStream are the abstract classes at the top, while Reader/Writer are the abstract classes at the top of the character stream hierarchy.
More details here
Cheers!!!
Question
What is the simplest way to append a byte to a StringBuffer (i.e. cast a byte to a char) and specify the character encoding used (ASCII, UTF-8, etc)?
Context
I want to append a byte to a StringBuffer. Doing so requires casting the byte to a char:
myStringBuffer.append((char)nextByte);
However, the code above uses the default character encoding for my machine (which is MacRoman). Meanwhile, other components in the system/network require UTF-8. So I need to do something like:
try {
myStringBuffer.append(new String(new byte[]{nextByte}, "UTF-8"));
} catch (UnsupportedEncodingException e) {
//handle error
}
Which, frankly, is pretty ugly.
Surely, there's a better way (other than breaking the same code into multiple lines)???????
The simple answer is 'no'. What if the byte is the first byte of a multi-byte sequence? Nothing would maintain the state.
If you have all the bytes of a logical character in hand, you can do:
sb.append(new String(bytes, charset));
But if you have one byte of UTF-8, you can't do this at all with stock classes.
It would not be terribly difficult to build a juiced-up StringBuffer that uses java.nio.charset classes to implement byte appending, but it would not be one or two lines of code.
Comments indicate that there's some basic Unicode knowledge needed here.
In UTF-8, 'a' is one byte, 'á' is two bytes, '丧' is three bytes, and '𝌎' is four bytes. The job of CharsetDecoder is to convert these sequences into Unicode characters. Viewed as a sequential operation over bytes, this is obviously a stateful process.
If you create a CharsetDecoder for UTF-8, you can feed it one byte at a time (in a ByteBuffer) via its decode method. The UTF-16 characters will accumulate in the output CharBuffer.
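A sketch of that idea, under stated assumptions: Utf8ByteAppender is a made-up class, not something in the JDK, and it uses CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) with endOfInput set to false so that the bytes of an unfinished sequence stay in the input buffer until the rest arrives. It is only one possible way to build the "juiced-up StringBuffer" mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class Utf8ByteAppender {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final ByteBuffer pending = ByteBuffer.allocate(8);   // bytes not yet decoded
    private final CharBuffer decoded = CharBuffer.allocate(4);   // room for a surrogate pair
    private final StringBuilder text = new StringBuilder();

    // Appends one byte; characters appear in the text once their last byte arrives.
    void append(byte b) throws CharacterCodingException {
        pending.put(b);
        pending.flip();                      // switch the buffer to read mode
        CoderResult result = decoder.decode(pending, decoded, false);
        if (result.isError()) {
            result.throwException();         // malformed or unmappable input
        }
        pending.compact();                   // keep bytes of an unfinished sequence
        decoded.flip();
        text.append(decoded);
        decoded.clear();
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // One, two, three and four byte characters: a, U+00E1, U+4E27, U+1D30E.
        String original = "a\u00e1\u4e27\ud834\udf0e";
        Utf8ByteAppender appender = new Utf8ByteAppender();
        for (byte b : original.getBytes(StandardCharsets.UTF_8)) {
            appender.append(b);
        }
        System.out.println("round trip ok: " + appender.toString().equals(original));
    }
}
```

The decoder itself is what carries the state between calls, which is why a plain cast or a one-byte String constructor cannot do the same job.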
I think the error here is in dealing with bytes at all. You want to deal with strings of characters instead.
Just interpose a reader on the input and output stream to do the mapping between bytes and characters for you. Use the InputStreamReader(InputStream in, CharsetDecoder dec) form of the constructor for the input, though, so that you can detect input encoding errors via an exception. Now you have strings of characters instead of buffers of bytes. Put an OutputStreamWriter on the other end.
Now you no longer have to worry about bytes or encodings. It’s much simpler this way.
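Here is a small sketch of the InputStreamReader(InputStream, CharsetDecoder) form in use. The class StrictDecoding and the helper readStrict are invented for the example; the decoder is configured to report errors, so malformed input surfaces as a CharacterCodingException instead of being silently replaced.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecoding {
    // Reads all text as UTF-8 and fails on bytes that are not valid UTF-8.
    static String readStrict(InputStream in) throws IOException {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, strict)) {
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] good = "ok".getBytes(StandardCharsets.UTF_8);
        System.out.println(readStrict(new ByteArrayInputStream(good)));

        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        try {
            readStrict(new ByteArrayInputStream(bad));
        } catch (CharacterCodingException e) {
            System.out.println("rejected invalid input: " + e);
        }
    }
}
```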
Please explain what Byte streams and Character streams are. What exactly do these mean? Is a Microsoft Word document Byte oriented or Character oriented?
Thanks
A stream is a way of sequentially accessing a file. A byte stream accesses the file byte by byte. A byte stream is suitable for any kind of file, but it is not quite appropriate for text files. For example, if the file uses a Unicode encoding in which a character is represented by two bytes, the byte stream will treat these bytes separately and you will need to do the conversion yourself.
A character stream will read a file character by character. A character stream needs to be given the file's encoding in order to work properly.
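To show why the encoding matters, this sketch decodes the same bytes with the right and the wrong charset. WrongEncoding and decode are hypothetical names, and a byte array stands in for the file contents.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongEncoding {
    // Decodes the data through a character stream using the given charset.
    static String decode(byte[] data, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+00E9 (e-acute) is stored as the two bytes 0xC3 0xA9 in UTF-8.
        byte[] fileContents = "\u00e9".getBytes(StandardCharsets.UTF_8);
        String right = decode(fileContents, StandardCharsets.UTF_8);
        String wrong = decode(fileContents, StandardCharsets.ISO_8859_1);
        System.out.println("correct charset: " + right.length() + " char");
        System.out.println("wrong charset:   " + wrong.length() + " chars");
    }
}
```

With the wrong charset each byte is turned into its own character (U+00C3 and U+00A9), which is the familiar garbled text you see when a file is opened with the wrong encoding.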
Although a Microsoft Word Document contains text, it can't be accessed with a character stream (it isn't a text file). You need to use a byte stream to access it.
ByteStreams:
From oracle documentation page about byte streams:
Programs use byte streams to perform input and output of 8-bit bytes. All byte stream classes are descended from InputStream and OutputStream.
When to use:
Byte streams should only be used for the most primitive I/O
When not to use:
You should not use a byte stream to read character data,
e.g. to read a text file.
Character Streams:
From oracle documentation page about character streams:
The Java platform stores character values using Unicode conventions. Character stream I/O automatically translates this internal format to and from the local character set.
All character stream classes are descended from Reader and Writer.
Character streams are often "wrappers" for byte streams. The character stream uses the byte stream to perform the physical I/O, while the character stream handles translation between characters and bytes.
There are two general-purpose byte-to-character "bridge" streams: InputStreamReader and OutputStreamWriter.
When to use:
To read character data, whether from a Socket or from a text File
In Summary:
A byte stream reads and writes a byte at a time. We should avoid using byte streams directly when dealing with more sophisticated data.
Character streams and the other available streams should be used to handle sophisticated data.
1. Character-oriented streams are tied to a data type: only characters and strings can be read through them. Byte-oriented streams are not tied to any data type; data of any type can be read, as long as you specify how the bytes are to be interpreted.
2. Character-oriented streams read character by character, while byte-oriented streams read byte by byte.
3. Character-oriented streams use a character encoding scheme to convert between bytes and Unicode characters, while byte-oriented streams do not use any encoding scheme.
4. Character-oriented streams are also known as reader and writer streams.
Byte-oriented streams are known as input and output streams; DataInputStream and DataOutputStream are two examples.
Read this. It tells you about the difference between bytes and characters (as well as loads of other useful stuff)
A character stream will read a file character by character. Character streams work with 16-bit char values, whereas byte streams work with 8-bit bytes. Character streams implicitly translate between the bytes of an external encoding and 16-bit chars. A character stream can support all kinds of character sets and encodings (ASCII, UTF-8, UTF-16 and so on), but a byte stream does no decoding at all, so reading text through it gives you only the raw encoded bytes. The Java platform stores character values using Unicode conventions. Character stream I/O automatically translates this internal format to and from the local character set.
Unless you are working with binary data, such as image and sound files, you should use readers and writers to read and write information with character streams.