The read method of FileInputStream reads 1 byte at a time, but how is it reading a character from a file, given that a character in Java is 16 bits (2 bytes)? Is it because the read method is native that it converts to 8 bits?
The read method of FileInputStream returns an int holding a single byte value (0 to 255, or -1 at end of stream), not a character. It's your responsibility to transform the result into a character.
As the javadoc of FileInputStream suggests: "for reading streams of characters, consider using FileReader".
The number of bytes to encode a character depends on the encoding of the file. For example, if the file is encoded with ASCII, each byte is a character, but if your file is encoded in UTF-8, a character is 1, 2, 3 or 4 bytes.
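To make the point about encodings concrete, here is a minimal sketch that counts how many bytes a few sample characters take in UTF-8 and UTF-16. The class and method names (EncodedLength, encodedLength) are made up for illustration and are not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // A, e with acute accent, A with dot below, and an emoji outside the BMP
        String[] samples = {"A", "\u00E9", "\u1EA0", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(s
                    + ": UTF-8 = " + encodedLength(s, StandardCharsets.UTF_8) + " bytes"
                    + ", UTF-16BE = " + encodedLength(s, StandardCharsets.UTF_16BE) + " bytes");
        }
    }
}
```

The same character can therefore occupy a different number of bytes depending on the charset, which is why a single read() on a byte stream does not necessarily give you one character.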
If you want more information about encoding, I would suggest reading the following article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When an I/O stream manages 8-bit bytes of raw binary data, it is
called a byte stream. And, when the I/O stream manages 16-bit Unicode
characters, it is called a character stream.
Byte stream is clear. It uses 8-bit bytes. So if I were to write a character that uses 3 bytes it would only write its last 8 bits! Thus making incorrect output.
So that is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store.
Do you see the pattern? We can't know how many bytes it will take to write any of these characters until we convert them. So my question is: why is it said that character streams manage 16-bit Unicode characters? When I wrote Ạ, which takes 3 bytes, the stream didn't cut it down to the last 16 bits the way byte streams cut to the last 8 bits. What does that quote even mean, then?
In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.
A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.
In UTF-16, Ạ is represented as the 16-bit char 0x1EA0. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.
If you converted it to bytes with the UTF-8 encoding, as here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();
Then you would get the 3 byte sequence 0xE1 0xBA 0xA0 as expected.
In Java, a character (char) is always 16 bits, as can be seen from its max value - 65535. This is why the quote is not wrong. 16 bit is indeed a character.
"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):
Every Unicode code point in the Basic Multilingual Plane is encoded in 16 bits. (Yes 16 bit is enough for that) Every code point outside of the BMP is encoded with a pair of 16 bit characters, called surrogate pairs.
"Ạ" (U+1EA0) is inside the BMP, so can be encoded with 16 bits.
You said:
Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!
That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you will give it with Java code. When you call println on a PrintStream, you are giving it a String, which is a bunch of chars under the hood, which is a bunch of 16-bits. So it is really managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.
It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character", which would refer to the high and low surrogates of the surrogate pair that you are printing.
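The difference between chars and code points can be checked directly. The following is a small sketch; the class name SurrogateDemo and its helper methods are hypothetical, and U+1F600 is just an arbitrary example of a code point outside the BMP.

```java
public class SurrogateDemo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u1EA0";          // inside the BMP: one char
        String astral = "\uD83D\uDE00"; // U+1F600, outside the BMP: a surrogate pair

        System.out.println(charUnits(bmp) + " char(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(charUnits(astral) + " char(s), " + codePoints(astral) + " code point(s)");
        System.out.println(Character.isHighSurrogate(astral.charAt(0))); // first half of the pair
        System.out.println(Character.isLowSurrogate(astral.charAt(1)));  // second half of the pair
    }
}
```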
I was studying Basic IO in java from https://docs.oracle.com/javase/tutorial/essential/io/bytestreams.html
A byte stream reads one byte (8 bits) at a time. But a char is stored as a 16-bit Unicode character:
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
My question or confusion is: if a character is stored in 16 bits, then how can FileInputStream's read method, which reads 8 bits at a time, read the 16 bits in which a character is stored internally?
In Java, strings are stored as arrays of char values, and a char is a 16-bit UTF-16 encoded value. Some Unicode characters (e.g. emojis) are encoded as 2 char values.
Files are sequences of bytes, so to store a string in a file the string has to be encoded. The encoding is called a Character Set, and whether a character takes up 1 byte, 2 bytes, or more depends on which character set you use.
These days, files are often encoded in UTF-8, so a Unicode character takes between 1 and 4 bytes.
In Java, the InputStream / OutputStream classes are used for reading / writing bytes. To read / write characters, you need to use a Reader / Writer, which is usually done by wrapping an InputStream / OutputStream with an InputStreamReader / OutputStreamWriter, for which you specify the character set used to convert bytes to / from char values.
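As a rough sketch of the wrapping described above, the following writes a short string to a temporary file through an OutputStreamWriter, then counts what comes back when reading it with a plain FileInputStream versus an InputStreamReader. The class and method names are invented for this example, and UTF-8 is an assumed choice of charset.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamVsReader {
    // Writes text as UTF-8 to a temp file, then returns {byte count, char count}.
    static int[] countBytesAndChars(String text) {
        try {
            Path file = Files.createTempFile("stream-vs-reader", ".txt");
            try {
                try (Writer writer = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
                    writer.write(text);
                }
                int bytes = 0;
                try (InputStream in = new FileInputStream(file.toFile())) {
                    while (in.read() != -1) bytes++; // read() yields 0..255, or -1 at end of stream
                }
                int chars = 0;
                try (Reader reader = new InputStreamReader(
                        new FileInputStream(file.toFile()), StandardCharsets.UTF_8)) {
                    while (reader.read() != -1) chars++; // read() yields one decoded char, or -1
                }
                return new int[] {bytes, chars};
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = countBytesAndChars("A\u1EA0"); // 'A' plus A with dot below
        System.out.println(counts[0] + " bytes, " + counts[1] + " chars");
    }
}
```

The byte stream sees the encoded form of the text, while the reader hands back char values after decoding, which is why the two counts differ for non-ASCII text.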
My java program is trying to read a text file (Mainframe VSAM file converted to flat file). I believe this means, the file is encoded in EBCDIC format.
I am using com.ibm.jzos.FileFactory.newBufferedReader(fullyQualifiedFileName, ZFile.DEFAULT_EBCDIC_CODE_PAGE); to open the file.
and use String inputLine = inputFileReader.readLine() to read a line and store it in a java string variable for processing. I read that text when stored in String variable becomes unicode.
How can I ensure that the content is not corrupted when storing in the java string variable?
The Charset Decoder will map the bytes to their correct Unicode for String. And vice versa.
The only problem is that the BufferedReader.readLine will drop the line endings (also the EBCDIC end-of-line NEL char, \u0085 - also a recognized Unicode newline). So on writing write the NEL yourself, or set the System line separator property.
Nothing easier than to write a unit test with 256 EBCDIC characters and convert them back and forth.
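Such a round-trip check could look roughly like the sketch below. The class name RoundTripCheck is made up, and the charset name "IBM1047" is only an assumption about which EBCDIC code page applies; substitute the one your file really uses. Extended charsets are not guaranteed to be present in every JVM, so the sketch checks for availability first.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class RoundTripCheck {
    // True if every byte value 0..255 survives bytes -> String -> bytes unchanged.
    static boolean roundTrips(Charset charset) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, charset);
        return Arrays.equals(all, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        String name = "IBM1047"; // assumed EBCDIC code page, not confirmed by the question
        if (Charset.isSupported(name)) {
            System.out.println(name + " round-trips: " + roundTrips(Charset.forName(name)));
        } else {
            System.out.println(name + " is not available in this JVM");
        }
    }
}
```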
If you have read the file with the correct character set (which is the biggest assumption here), then it doesn't matter that Java itself uses Unicode internally, Unicode contains all characters of EBCDIC.
A character set specifies the mapping between a character (codepoint) and one or more bytes. A file is nothing more than a stream of bytes, if you apply the right character set, then the right characters are mapped in memory.
Say the byte 1 maps to 'A' in character set X and bytes 0 and 65 in UTF-16, then reading a file which contains byte 1 using character set X will make the system read character 'A', even if that system in memory uses bytes 0 and 65 to store that character.
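The same idea can be tried with charsets that every JVM provides. In this sketch (class and method names are hypothetical), US-ASCII plays the role of "character set X": the file byte 65 decodes to 'A', and that same 'A' corresponds to the bytes 0 and 65 in big-endian UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameCharDifferentBytes {
    // Decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] asciiBytes = "A".getBytes(StandardCharsets.US_ASCII);
        byte[] utf16Bytes = "A".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(asciiBytes)); // the single byte 65
        System.out.println(Arrays.toString(utf16Bytes)); // the bytes 0 and 65

        // Different bytes, same character once decoded with the matching charset.
        String fromAscii = decode(asciiBytes, StandardCharsets.US_ASCII);
        String fromUtf16 = decode(utf16Bytes, StandardCharsets.UTF_16BE);
        System.out.println(fromAscii.equals(fromUtf16));
    }
}
```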
However, there is no way to know if you used the right character set unless you specifically know what the actual result should be.
I used RandomAccessFile to read a byte from a text file.
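public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1); // reads at most one byte from the file
    System.out.println(new String(cbuff)); // decodes with the platform default charset
}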
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way).
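A minimal sketch of decoding with an explicitly named charset follows, assuming the data is UTF-8. The class and method names are invented for illustration. It also shows what happens when only the first byte of a multi-byte character is passed to the decoder.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Decodes bytes as UTF-8 regardless of the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] full = "\u1EA0".getBytes(StandardCharsets.UTF_8); // three bytes in UTF-8
        byte[] firstOnly = {full[0]};                            // a truncated sequence

        System.out.println(decodeUtf8(new byte[] {0x41}));        // a one-byte character decodes fine
        System.out.println(decodeUtf8(full).equals("\u1EA0"));    // the complete sequence decodes fine
        System.out.println(decodeUtf8(firstOnly).equals("\uFFFD")); // truncated input becomes U+FFFD
    }
}
```

Reading a single byte only gives a full character when that character happens to be encoded in one byte under the charset being used.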
Java stores all its "chars" internally as two bytes. However, when they are encoded to bytes (for example when writing a string to a file), the number of bytes will depend on your encoding.
Some characters (ASCII) are single-byte in most encodings, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.
It uses the platform default charset to decode bytes to characters. If you know that your file contains text encoded in a different charset, you can use String(byte[] bytes, String charsetName) to apply the correct encoding (from bytes to characters).
In an ASCII text file, each character is just one byte.
It looks like your file contains ASCII characters, which are encoded in just 1 byte. If the text file contained a non-ASCII character, e.g. one that takes 2 bytes in UTF-8, then you would get just the first byte, not the whole character.
There are some great answers here, but I wanted to point out that the JVM is free to store a char value in any size space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that java characters are often misused. People don't realize they are writing code that won't properly handle codepoints over 16 bits in length.
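One way such code goes wrong is by treating each char as a whole character. The sketch below (class and method names are hypothetical) reverses a string char by char, which splits surrogate pairs, and then by code point, which keeps them intact.

```java
public class CodePointIteration {
    // Naive reversal: treats each 16-bit char as a character, so it
    // swaps the two halves of any surrogate pair.
    static String reverseByChar(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    // Reversal by code point: surrogate pairs stay together.
    static String reverseByCodePoint(String s) {
        int[] codePoints = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder();
        for (int i = codePoints.length - 1; i >= 0; i--) {
            sb.appendCodePoint(codePoints[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(reverseByCodePoint(text).equals("\uD83D\uDE00a"));
        System.out.println(reverseByChar(text).equals("\uD83D\uDE00a"));
    }
}
```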
Java allocates 2 bytes for a char because it follows UTF-16. A character occupies a minimum of 2 bytes (one char) and a maximum of 4 bytes (two chars forming a surrogate pair) when stored. There is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded, then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little endian. For example, the little endian UTF-16 for A is [65, 0]. Then when you read the first byte, it returns 65. After padding with 0 for the second byte, you will get A.
Please explain what Byte streams and Character streams are. What exactly do these mean? Is a Microsoft Word document Byte oriented or Character oriented?
Thanks
A stream is a way of sequentially accessing a file. A byte stream accesses the file byte by byte. A byte stream is suitable for any kind of file, but is not quite appropriate for text files. For example, if the file is using a Unicode encoding and a character is represented with two bytes, the byte stream will treat these separately and you will need to do the conversion yourself.
A character stream will read a file character by character. A character stream needs to be given the file's encoding in order to work properly.
Although a Microsoft Word Document contains text, it can't be accessed with a character stream (it isn't a text file). You need to use a byte stream to access it.
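To illustrate why a character stream needs the right encoding, here is a small in-memory sketch. The class and method names are invented, and the two charsets are just example choices. It writes text through an OutputStreamWriter with one charset and reads it back through an InputStreamReader with another.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    // Encodes text with one charset, then decodes the resulting bytes with another.
    static String writeThenRead(String text, Charset writeAs, Charset readAs) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Writer writer = new OutputStreamWriter(bytes, writeAs)) {
                writer.write(text);
            }
            StringBuilder result = new StringBuilder();
            try (Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(bytes.toByteArray()), readAs)) {
                int c;
                while ((c = reader.read()) != -1) {
                    result.append((char) c);
                }
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // e with acute accent
        // Matching charsets give the original text back.
        System.out.println(writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text));
        // Mismatched charsets turn the two UTF-8 bytes into two unrelated characters.
        System.out.println(writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length());
    }
}
```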
ByteStreams:
From oracle documentation page about byte streams:
Programs use byte streams to perform input and output of 8-bit bytes. All byte stream classes are descended from InputStream and OutputStream.
When to use:
Byte streams should only be used for the most primitive I/O
When not to use:
You should not use Byte stream to read Character streams
e.g. To read a text file
Character Streams:
From oracle documentation page about character streams:
The Java platform stores character values using Unicode conventions. Character stream I/O automatically translates this internal format to and from the local character set.
All character stream classes are descended from Reader and Writer.
Character streams are often "wrappers" for byte streams. The character stream uses the byte stream to perform the physical I/O, while the character stream handles translation between characters and bytes.
There are two general-purpose byte-to-character "bridge" streams: InputStreamReader and OutputStreamWriter.
When to use:
To read character streams either from Socket or File of characters
In Summary:
A byte stream reads and writes a byte at a time. We should avoid using byte streams when dealing with more sophisticated data.
Character Stream and other available streams should be used to handle sophisticated data.
1. Character-oriented streams are tied to a datatype: only string or character data can be read through them. Byte-oriented streams are not tied to any datatype; data of any datatype can be read (except string), you just have to specify it.
2. Character-oriented streams read character by character, while byte-oriented streams read byte by byte.
3. Character-oriented streams use a character encoding to convert between bytes and Unicode chars, while byte-oriented streams do not apply any encoding.
4. Character-oriented streams are also known as reader and writer streams.
Byte-oriented streams are known as input and output streams; DataInputStream and DataOutputStream are examples.
Read this. It tells you about the difference between bytes and characters (as well as loads of other useful stuff)
A character stream reads a file character by character. Character streams work with 16-bit chars, whereas byte streams work with 8-bit bytes. Character streams implicitly translate 8-bit data to 16-bit data and vice versa. A character stream can support any character set (ASCII, UTF-8, UTF-16, etc.), but a byte stream read one byte at a time yields correct characters directly only for single-byte encodings such as ASCII. The Java platform stores character values using Unicode conventions. Character stream I/O automatically translates this internal format to and from the local character set.
Unless you are working with binary data, such as image and sound files, you should use readers and writers to read and write information with character streams.