This is a basic question.
When I use a byte stream to write bytes to a file, am I creating a binary file?
For example: I use a byte stream to write text data to a file, and when I open the file in a HEX viewer I see the corresponding hex value for each character. But why not the binary values (i.e. 0s and 1s)?
I also learned that using a DataOutputStream/DataInputStream I write/read binary files.
I guess my confusion is with what it means to write bytes versus what it means to write binary data.
When I use a byte stream to write bytes to a file, am I creating a binary file?
You write the bytes as-is, i.e., as the ones and zeroes they are. If those bytes represent characters, then commonly no, it's just a text file (everything is ones and zeroes, after all). Otherwise the answer is: it depends. The term binary file is misleading, but it usually refers to a file that can contain arbitrary data.
when I open the file in a HEX viewer I see the corresponding hex value for each character. But why not the binary values
HEX is just another representation of bytes. The following three are equal
10 (Decimal value 10)
0xA (Hex value 10)
00001010 (Binary value 10)
A computer only stores binary values. But editors may choose to represent (display) those in another way, such as Hex or decimal form. Given enough bytes, it can even be represented as an image.
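The equivalence above can be checked directly in Java; a minimal sketch (the class name is just for the demo):

```java
// Minimal sketch: one value, three textual representations.
public class Radix {
    public static void main(String[] args) {
        int value = 10;
        System.out.println(Integer.toString(value));       // 10
        System.out.println(Integer.toHexString(value));    // a    (i.e. 0xA)
        System.out.println(Integer.toBinaryString(value)); // 1010 (00001010 with leading zeroes)
    }
}
```

The stored value never changes; only the textual representation does.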
what does it mean to write bytes and what does it mean to write binary data
Binary data means ones and zeroes, e.g., 00001010, which is 8 bits. 8 bits make a byte.
The confusion could be caused by the application you are using: if you open something in a HEX viewer, it is shown in hex, not binary.
The notions of "text" and "binary" files are mostly a conceptual distinction for you and me as "consumers" of the file. Strictly speaking, every file consists of 1's and 0's, and is thus binary in the truest sense of the word. Hexadecimal representations, encodings for a particular character set, image file formats: these are all interpretations layered on top of the bytes. You can spin up an array of 100 random bytes, spit it out to a file, and it's just as "binary" as any other file. It's all in the context of how the bytes are interpreted that makes the difference.
Here's an example. In old tried-and-true ASCII, an upper-case "A" is encoded as decimal 65. You can present that to people as 0x41 (hex) in a hex viewer, or as an "A" in an editor, but ultimately, when you write that byte to a file, it's just a byte: the series of eight bits 01000001.
Typically you create a text file using Writer(s), and a binary file using other means (Streams, Channels, etc.). However, if your 'binary' file contains text and only text, it is a text file regardless.
Regarding hexadecimal format, that is merely a compact (preferred) way of viewing byte values.
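To make the "A is just byte 65" point concrete, here is a small sketch (class and variable names are mine) that writes that single byte and shows both views of it:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class OneByteTwoViews {
    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(65); // writes the single byte 0100 0001
        byte[] bytes = out.toByteArray();
        // A hex viewer would show this byte as 41:
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // 41
        // A text editor would decode the same byte as the character A:
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // A
    }
}
```

Same byte, two presentations; nothing about the file itself makes it "text" or "binary".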
Related
When an I/O stream manages 8-bit bytes of raw binary data, it is
called a byte stream. And, when the I/O stream manages 16-bit Unicode
characters, it is called a character stream.
A byte stream is clear: it uses 8-bit bytes. So if I were to write a character that uses 3 bytes, it would only write its last 8 bits, thus producing incorrect output!
So that is why we use character streams. Say I want to write Latin Capital Letter Ạ; I would need 3 bytes to store it in UTF-8. But say I also want to store a 'normal' A; now it takes only 1 byte to store.
Are you seeing the pattern? We can't know how many bytes it will take to write any of these characters until we convert them. So my question is: why is it said that character streams manage 16-bit Unicode characters? In the case where I wrote Ạ, which takes 3 bytes, it didn't truncate it to its last 16 bits the way byte streams truncate to the last 8 bits. What does that quote even mean then?
In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.
A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.
In UTF-16, Ạ is represented as the 16-bit char 0x1EA0. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.
If you converted it to bytes with the UTF-8 encoding, as here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();
Then you would get the 3-byte sequence 0xE1 0xBA 0xA0 as expected.
In Java, a character (char) is always 16 bits, as can be seen from its max value, 65535. This is why the quote is not wrong: 16 bits is indeed a character.
"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):
Every Unicode code point in the Basic Multilingual Plane (BMP) is encoded in 16 bits. (Yes, 16 bits is enough for that.) Every code point outside the BMP is encoded with a pair of 16-bit chars, called a surrogate pair.
"Ạ" (U+1EA0) is inside the BMP, so it can be encoded in 16 bits.
You said:
Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!
That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you will give it from Java code. When you call println on a PrintStream, you are giving it a String, which is a bunch of chars under the hood, i.e. a bunch of 16-bit values. So it really is managing a stream of 16-bit characters; it just outputs them in a different encoding.
It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character" which would refer to the upper/lower surrogates of the surrogate pair that you are printing.
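A short sketch of the surrogate-pair case described above, using U+1F600 (an emoji outside the BMP; the class name is illustrative):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, a code point outside the BMP
        System.out.println(s.length());                      // 2 chars (16-bit code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```

The stream sees the two 16-bit surrogates; the single code point only exists at the interpretation level.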
When I use a ByteBuffer to write, say, the integer 1 into a file (a .txt file):
FileChannel fc;
buffer.putInt(1);
fc.write(buffer);
When I open it with a text editor, no "1" appears there, but it can be read back by the buffer correctly. But if I write characters such as 'a' or 'b' into the file, they show up fine.
Is it natural that, when I write integers with a ByteBuffer, I cannot open the file and see them clearly with my eyes?
In order to see the integer you write to the file, you must first convert it to readable characters. For example, the integer 20000 is different from the string "20000". The integer is represented as 4 bytes in the file, whereas the individual characters that make up the readable string consist of at least 5 bytes (in this example). Therefore, when you write the integer value to the text file, what you see is the text editor trying to interpret the 4 bytes that make up the integer as 4 ASCII characters (which may or may not be printable).
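The difference can be seen directly; a minimal sketch (class name is mine) comparing the two representations of 20000:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IntVsText {
    public static void main(String[] args) {
        // The int 20000 as 4 raw bytes (0x00004E20):
        byte[] raw = ByteBuffer.allocate(4).putInt(20000).array();
        System.out.println(Arrays.toString(raw)); // [0, 0, 78, 32]
        // The string "20000" as 5 ASCII bytes, one per digit:
        byte[] text = "20000".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(text)); // [50, 48, 48, 48, 48]
    }
}
```

A text editor decoding the first array finds nothing resembling "20000"; decoding the second yields exactly that string.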
All computer files everywhere are just a sequence of bits and bytes.
Humans have come up with a way to represent human readable characters with bit sequences. These are known as character sets or character encodings. A very basic one is ASCII.
For example, the English upper-case character A is represented with the binary value
100 0001
the decimal value
65
or the hex value
41
When you write
buffer.putInt(1);
fc.write(buffer); // assuming you've positioned the ByteBuffer
you're writing the decimal value 1 as binary to the file. The decimal value 1, as an int, in binary, is
00000000 00000000 00000000 00000001
since an int is 4 bytes.
When you open the file with a text editor (or any editor), it will see 4 bytes and try to give you the textual representation.
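A sketch of both options, assuming you want an editor to show "1" (names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EditorView {
    public static void main(String[] args) {
        // Binary: the int 1 becomes the bytes 0, 0, 0, 1 -- all unprintable.
        byte[] binary = ByteBuffer.allocate(4).putInt(1).array();
        System.out.println(Arrays.toString(binary)); // [0, 0, 0, 1]
        // Text: the character '1' is the single printable byte 0x31.
        byte[] text = "1".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(text)); // [49]
    }
}
```

Writing the text form (e.g. via a Writer) is what makes the value human-readable in an editor.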
We read and write binary files using the Java primitive byte, e.g. fileInputStream.read(bytes). In other examples we see byte[] bytes = string.getBytes(). A byte is just an 8-bit value. Why do we use byte[] to read binaries? What does a byte value contain after reading from a file or string?
We read and write binary files using the Java primitive byte, e.g. fileInputStream.read(bytes).
Because the operating system models files as sequences of bytes (or more precisely, as octets). The byte type is the most natural representation of an octet in Java.
Why do we use byte[] to read binaries?
Same answer as before. Though, in reality, you can also read binary files in other ways as well; e.g. using DataInputStream.
What does a byte value contain after reading from a file or string?
In the first case, the byte that was in the file.
In the second case, you don't "read" bytes from a String. Rather, when you call String.getBytes() you get the bytes that comprise the String's characters when they are encoded in a particular character set. If you use the no-args getBytes() method you will get the JVM's default character set / encoding. You can also supply an argument to choose a different encoding.
Java makes a clear distinction between bytes (8 bit) quantities and characters. Conceptually, Java characters are Unicode code points, and strings and similar representations of text are sequences of characters ... not sequences of bytes.
(Unfortunately, there is a "wrinkle" in the implementation. When Java was designed, the Unicode character space fitted into 16 bits; i.e. there were <= 65536 recognized code points. Java was designed to match this, and the char type was defined as a 16-bit unsigned integral type. But then Unicode was expanded to > 65536 code points, and Java was left with the awkward problem that some Unicode code points could not be represented using a single char value. Instead, they are represented by a pair of char values, a so-called surrogate pair, and Java strings are effectively represented in UTF-16. For most common characters / character sets, this doesn't matter. But if you need to deal with unusual characters / character sets, the correct way to deal with Strings is to use the "codepoint" methods.)
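A sketch of the code-point methods mentioned above, contrasted with the char-based ones (the string is "A" plus a non-BMP emoji):

```java
public class CodePointMethods {
    public static void main(String[] args) {
        String s = "A\uD83D\uDE00"; // "A" followed by U+1F600 (outside the BMP)
        System.out.println(s.length());                      // 3 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 2 (Unicode code points)
        // Iterate by code point rather than by char:
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
        // prints 41, then 1f600
    }
}
```

Code that indexes by char would split the surrogate pair; the code-point methods keep it intact.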
A String is built upon bytes, and bytes are built upon bits, which are what is "physically" stored on the drive.
So instead of reading data from the drive bit by bit, it is read in larger portions, namely bytes.
So the byte[] contains raw data, equal to what is stored on the drive.
You always read raw data first; then you can apply a decoder that turns those bytes into characters, and eventually into the letters displayed on the screen, if it is a .txt file. If you deal with an image, the bytes you read store color information instead of characters.
Because the smallest addressable storage unit is a byte.
I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java characters are 2 bytes. I am wondering how the following program works: it reads characters from standard input, stores them in a single byte, and prints them out. How do two-byte characters fit into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
Java stores characters as 2 bytes, but for normal ASCII characters the actual data fits in one byte. So as long as you can assume the file being read is ASCII, that will work fine, as the actual numeric value of the character fits in a single byte.
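For contrast, here is a sketch of that byte-to-char cast next to the robust approach with an explicit charset (the helper name and test string are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SafeRead {
    // Decode an entire byte stream with an explicitly chosen charset.
    static String readAll(byte[] data, Charset cs) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8); // 6 bytes
        // Fragile: casting each raw byte to char assumes ISO-8859-1 input
        // and mangles the two-byte UTF-8 sequence (mojibake):
        StringBuilder casted = new StringBuilder();
        for (byte b : utf8) casted.append((char) (b & 0xFF));
        System.out.println(casted);
        // Robust: let a Reader decode with the declared charset:
        System.out.println(readAll(utf8, StandardCharsets.UTF_8)); // prints "héllo"
    }
}
```

The cast only "works" when input and assumption happen to agree, which is exactly the fragility the answer points out.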
I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1); // reads exactly one byte
    System.out.println(new String(cbuff));
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8 and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way)
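A minimal sketch of passing the charset explicitly, as the answer recommends (the helper name is mine):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Decode bytes with an explicitly chosen charset, never the platform default.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] oneByte = { 65 };
        // ISO-8859-1 maps every byte value to exactly one character,
        // so a single byte always decodes to a single char:
        System.out.println(decode(oneByte, StandardCharsets.ISO_8859_1)); // A
    }
}
```

With the charset spelled out, the code behaves identically on every platform, which is the point of the recommendation.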
Java stores all its chars internally as two bytes. However, when they become strings etc., the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 in binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.
It uses the platform default charset to decode bytes to characters. If you know your file contains text that is encoded in a different charset, you can use String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).
In an ASCII text file, each character is just one byte.
It looks like your file contains ASCII characters, which are encoded in just 1 byte each. If the text file contained a non-ASCII character, e.g. a 2-byte UTF-8 sequence, then you would get just the first byte, not the whole character.
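A sketch of what happens when only the first byte of a multi-byte UTF-8 character is decoded ("é" here, written as a Unicode escape):

```java
import java.nio.charset.StandardCharsets;

public class PartialUtf8 {
    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // é is 2 bytes in UTF-8
        System.out.println(utf8.length); // 2
        // Decoding only the first byte yields the replacement character U+FFFD:
        String partial = new String(utf8, 0, 1, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(partial.charAt(0))); // fffd
    }
}
```

So the one-byte read only appears to work because the file happened to contain ASCII.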
There are some great answers here but I wanted to point out the jvm is free to store a char value in any size space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that Java characters are often misused. People don't realize they are writing code that won't properly handle code points that don't fit in 16 bits.
Java allocates 2 bytes per char, as it follows UTF-16. A character occupies a minimum of 2 bytes and a maximum of 4 bytes (a surrogate pair) when stored. There is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, if the file is ASCII encoded, or UTF-8 encoded and contains only ASCII characters, then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if it is little-endian UTF-16. For example, the little-endian UTF-16 encoding of A is [65, 0]. When you read the first byte, it returns 65; after padding with 0 for the second byte, you get A.
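That byte order can be checked with a short sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Endianness {
    public static void main(String[] args) {
        // Little-endian UTF-16 puts the low byte first:
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
        // Big-endian UTF-16 puts the high byte first:
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
    }
}
```

Reading the file one byte at a time would therefore see 65 first only in the little-endian case.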