We read and write binary files using the Java primitive byte, e.g. fileInputStream.read(byte[]). In other examples we see byte[] bytes = string.getBytes(). A byte is just an 8-bit value. Why do we use byte[] to read binary files? What does a byte value contain after reading it from a file or a string?
We read and write binary files using the Java primitive byte, e.g. fileInputStream.read(byte[]).
Because the operating system models files as sequences of bytes (or more precisely, as octets). The byte type is the most natural representation of an octet in Java.
Why do we use byte[] to read binary files?
Same answer as before. Though, in reality, you can also read binary files in other ways; e.g. using DataInputStream.
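For example, a minimal sketch of the DataInputStream approach (the file name data.bin is hypothetical, and the file is assumed to contain at least one byte followed by an 8-byte long):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadBinary {
    public static void main(String[] args) throws IOException {
        // "data.bin" is a hypothetical file with at least 9 bytes in it.
        try (DataInputStream in = new DataInputStream(new FileInputStream("data.bin"))) {
            int first = in.read();      // one byte, returned as an int in 0..255 (or -1 at EOF)
            long value = in.readLong(); // the next 8 bytes, interpreted as a big-endian long
            System.out.println(first + " " + value);
        }
    }
}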
What does a byte value contain after reading from a file or a string?
In the first case, the byte that was in the file.
In the second case, you don't "read" bytes from a String. Rather, when you call String.getBytes() you get the bytes that make up the String's characters when they are encoded in a particular character set. The no-args getBytes() method uses the JVM's default character set / encoding; you can also supply an argument to choose a different encoding.
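For example, here is a small sketch of that difference (the string literal is arbitrary; the byte count for the default charset depends on your platform):

import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "h\u00E9llo"; // "héllo"
        byte[] def = s.getBytes();                            // platform default charset
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);     // 6 bytes: 'é' takes two
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // 10 bytes: two per char
        System.out.println(def.length + " " + utf8.length + " " + utf16.length);
    }
}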
Java makes a clear distinction between byte (8-bit) quantities and characters. Conceptually, Java characters are Unicode code points, and strings and similar representations of text are sequences of characters ... not sequences of bytes.
(Unfortunately, there is a "wrinkle" in the implementation. When Java was designed, the Unicode character space fitted into 16 bits; i.e. there were <= 65536 recognized code points. Java was designed to match this ... and the char type was defined as a 16-bit unsigned integral type. But then Unicode was expanded to > 65536 code points, and Java was left with the awkward problem that some Unicode code points could not be represented using a single char value. Instead, they are represented by a pair of char values ... a so-called surrogate pair ... and Java strings are effectively represented in UTF-16. For most common characters / character sets, this doesn't matter. But if you need to deal with unusual characters / character sets, the correct way to deal with Strings is to use the "codepoint" methods.)
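For example, a quick sketch showing the wrinkle with a supplementary code point (U+1F600 is used here purely as an example; any code point above U+FFFF behaves the same way):

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // the single code point U+1F600, stored as a surrogate pair
        System.out.println(s.length());                       // 2: two char values (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1: but only one Unicode code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}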
The String is built upon bytes. The bytes are built upon bits. The bits are "physically" stored on the drive.
So instead of reading data from the drive bit by bit, it is read in larger portions, namely bytes.
So the byte[] contains raw data. Raw data is exactly what is stored on the drive.
You eventually always read raw data; then you can apply a formatter that turns those bytes into characters, and eventually into the letters displayed on the screen if it is a text file. If you deal with an image, the bytes you read store information about color instead of characters.
Because the smallest addressable storage unit is a byte.
// non-UTF source file encoding
char ch = 'й'; // some number within 0..65535 is stored in the char
System.out.println(ch);       // the character, printed via the console encoding
System.out.println((int) ch); // the same number, printed as a plain integer
"java internal encoding is UTF16". Where does it meanfully come to play in that?
Besides, I can perfectly well put into a char one UTF-16 code unit from the surrogate range (say '\uD800'), making that char invalid as Unicode on its own. And let us stay within the BMP, to avoid thinking about the case where a supplementary symbol needs 2 chars (code units) (thinking about that case makes "char internally uses UTF-16" sound like complete nonsense to me). But maybe "char internally uses UTF-16" makes sense within the BMP?
I could understand it if it worked like this: my source code file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 (which really happens), and then this number is automatically converted to another number (from the windows-1251 value to the UTF-16 value). That conversion is NOT taking place (am I right?!); that is what I could have understood as "internally uses UTF-16". Then the stored number is written out as given (as from win-1251, with none of my "imaginary conversion from internal UTF-16 to the output/console encoding" taking place), and the console shows it by converting the number to a glyph using the console encoding (which really happens).
So this "UTF16 encoding used internally" is NEVER USED ANYHOW ??? char just stores any number (in [0..65535]), and besides specific range and being "unsigned" has NO DIFFERENCE FROM int (in scope of my example of course)???
P.S. Experimentally, the code above with UTF-8 encoding for both the source file and the console outputs
й
1081
while with win-1251 encoding for the source file and UTF-8 in the console it outputs
�
65533
Same output if we use String instead of char...
String s = "й";
System.out.println(s);
In the API, methods taking a char as an argument usually never take an encoding as an argument, but methods taking a byte[] often take an encoding as another argument. This implies that with a char we don't need an encoding (meaning that we know the encoding for sure). But **how on earth do we know in what encoding something was put into a char???
If a char is just storage for a number, don't we need to know what encoding that number originally came from?**
So char vs byte is just that a char holds two bytes of something in an UNKNOWN encoding (instead of one byte in an UNKNOWN encoding for a byte).
Given some initialized char variable, we don't know what encoding to use to display it correctly (i.e. which console encoding to choose for output), and we cannot tell what the encoding of the source file was where it was initialized with a char literal (not counting cases where the various encodings and UTF happen to be compatible).
Am I right, or am I a big idiot? Sorry for asking in the latter case :)))
SO research shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.
Surrogate pairs are only meaningful in a character sequence. A single char cannot hold a character value outside the BMP. This is where the character abstraction breaks down.
Unicode is a system for expressing textual data as code points. These are typically characters, but not always. A Unicode code point is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a code unit. (For example, UTF-8 is encoded as 8-bit bytes, and UTF-16 as 16-bit words.)
While the first version of Unicode only allowed code points in the range 0x0000 ... 0xFFFF, in Unicode 2.0 they changed the range to 0x0000 to 0x10FFFF.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less than or equal to 0xFFFF. For larger code points, the UTF-16 representation consists of two 16-bit values: a so-called surrogate pair. And that will fit into two Java chars. So in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (0x0000 through 0xFFFF). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. A common case is the code points for Emojis.
If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly; that is, to deal with the surrogate-pair cases.
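For instance, a hedged sketch of iterating text by code point rather than by char (the sample string is arbitrary):

public class CodePointLoop {
    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b': length() is 4, but only 3 code points
        // Naive loop: iterates 16-bit code units, so the surrogate pair shows up as two halves.
        for (int i = 0; i < text.length(); i++) {
            System.out.println((int) text.charAt(i));
        }
        // Code-point loop: handles the surrogate pair as one value.
        text.codePoints().forEach(cp -> System.out.println(cp));
    }
}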
So how does this relate to UTF-8, windows-1251 and so on?
Well, these are byte-oriented character encodings that are used at the OS level, in text files and so on. When you read a file using a Java Reader, your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer), you transcode in the other direction.
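As a rough sketch of that transcoding (the file names are hypothetical, and windows-1251 must be available in your JRE):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names; the explicit charsets are the point of the example.
        try (Reader in = new InputStreamReader(new FileInputStream("in-1251.txt"),
                                               Charset.forName("windows-1251"));
             Writer out = new OutputStreamWriter(new FileOutputStream("out-utf8.txt"),
                                                 StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = in.read()) != -1) { // bytes are decoded to UTF-16 chars in memory...
                out.write(ch);               // ...and re-encoded as UTF-8 on the way out
            }
        }
    }
}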
This doesn't always work.
Many character encodings, such as windows-1251, are not capable of representing the full range of Unicode code points. So if you attempt to write (say) a CJK character via a Writer configured for windows-1251, you will get ? characters instead.
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa) then the transcoding is liable to give garbage. This phenomenon is so common it has a name: Mojibake.
You asked:
Does that mean that in char ch = 'й'; the literal 'й' is always converted to UTF-16 from whatever encoding the source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
My goal is to conserve space in my data store, which only accepts Strings.
Because a String in Java is a sequence of 16-bit chars, I figure that in theory I should be able to convert my 8-byte long into a 4-char String, as both are represented by 8 bytes. (To be clear, I am not interested in making my long integer human-readable in base 10; I want to store it in as short a String as possible.)
However, almost all the literature I have found on this is about converting to the 8-bit byte type, not the type char.
I could encode it as UTF-8. I am concerned this would mean I double the length of the String, as each 8-bit byte is stored as a 16-bit char. This would defeat my whole purpose of compacting my data into a 64-bit medium in the first place.
private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");
// Pack the long's 8 bytes into a byte[], then decode those bytes as UTF-8 text:
new String(ByteBuffer.allocate(8).putLong(value).array(), UTF8_CHARSET);
Is my concern correct that I would be wasting space, and if so, is there a way to not waste space?
char != int
Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed by a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
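Here is a small sketch showing that point in Java terms: an ill-formed byte sequence does not survive a decode/encode round trip through a String.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    public static void main(String[] args) {
        byte[] original = { (byte) 0xC3, (byte) 0x28 };          // ill-formed UTF-8: lead byte + '('
        String s = new String(original, StandardCharsets.UTF_8); // the bad byte becomes U+FFFD
        byte[] back = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(original)); // [-61, 40]
        System.out.println(Arrays.toString(back));     // [-17, -65, -67, 40], not the original bytes
    }
}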
String != byte[] && char != int
Internally String objects are Unicode and encoded as UTF-16 no matter what their source is.
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set,
and several libraries implement the Unicode standard. The primitive
data type char in the Java programming language is an unsigned 16-bit
integer that can represent a Unicode code point in the range U+0000 to
U+FFFF, or the code units of UTF-16. The various types and classes in
the Java platform that represent character sequences - char[],
implementations of java.lang.CharSequence (such as the String class),
and implementations of java.text.CharacterIterator - are UTF-16
sequences.
String is internally represented by UTF-16
The character encodings like UTF-8 are only for interpreting or converting to/from a byte[].
Even if you write a custom CharsetProvider, all it will do is encode/decode a byte[] externally; it will absolutely not change the fact that a String is internally represented by UTF-16, so what you want to do is kind of pointless.
Can't be done
A character (Unicode code point) is conceptually a number of up to 21 bits, commonly handled as a 32-bit int; the Charset is just an encoding of that number into bytes. UTF-8 uses 1, 2, 3 or 4 bytes per code point, for example, and UTF-16 uses 2 or 4 bytes, with a reserved range of values (the surrogates) indicating that a second 16-bit unit belongs to the same character.
I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input, stores them in a single byte, and prints them out. How do two-byte characters fit into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
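A sketch of a more robust approach, assuming the console actually delivers UTF-8 (adjust the charset to whatever your console really uses):

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class EchoReader {
    public static void main(String[] args) throws IOException {
        // Decode the incoming bytes with an explicit charset instead of assuming ISO-8859-1.
        Reader in = new InputStreamReader(System.in, StandardCharsets.UTF_8);
        int ch;
        while ((ch = in.read()) != -1) {
            System.out.print((char) ch); // System.out still applies its own encoding when printing
        }
        System.out.flush();
    }
}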
Java stores characters as 2 bytes, but for plain ASCII characters the actual value fits in one byte. So as long as you can assume that the file being read is ASCII, this will work fine, because the numeric value of each character fits in a single byte.
This is a basic question.
When I use a byte stream to write bytes to a file, am I creating a binary file?
For example: I use a byte stream to write text data to a file, and when I open the file in a hex viewer I see the corresponding hex value for each character. But why not the binary values (i.e. 0s and 1s)?
I also learned that using a DataOutputStream/DataInputStream I write/read binary files.
I guess my confusion is with what it means to write bytes versus what it means to write binary data.
When I use a byte stream to write bytes to a file, am I creating a binary file?
You write the bytes as-is, i.e. as the ones and zeroes they are. If those bytes represent characters then commonly no, it's just a text file (everything is ones and zeroes, after all). Otherwise the answer is: it depends. The term "binary file" is misleading, but it usually refers to a file which can contain arbitrary data.
when I open the file in a hex viewer I see the corresponding hex value for each character. But why not the binary values
HEX is just another representation of bytes. The following three are equal
10 (Decimal value 10)
0xA (Hex value 10)
00001010 (Binary value 10)
A computer only stores binary values. But editors may choose to represent (display) those in another way, such as Hex or decimal form. Given enough bytes, it can even be represented as an image.
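A minimal sketch showing the same value in all three notations:

public class Representations {
    public static void main(String[] args) {
        int value = 10;
        System.out.println(value);                         // 10   (decimal)
        System.out.println(Integer.toHexString(value));    // a    (hex, often written 0xA)
        System.out.println(Integer.toBinaryString(value)); // 1010 (binary, i.e. 00001010 in 8 bits)
    }
}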
what does it mean to write bytes and what does it mean to write a binary data
Binary data means ones and zeroes, e.g. 00001010, which is 8 bits; 8 bits make a byte.
The confusion could be caused by the application you are using: if you open something in a hex viewer, it is shown in hex, not binary.
The notions of "text" and "binary" files is mostly a notional understanding for you and me as "consumers" of the file. Strictly speaking, every file consists of 1's and 0's, and are thus all binary in the truest sense of the word. Hexadecimal representations, encodings for a particular character set, image file formats. You can spin up an array of 100 random bytes, spit it out to a file, and it's just as "binary" as any other file. Its all in the context of how the bytes are interpreted that makes the difference.
Here's an example. In old tried-and-true ACII, an upper-case "A" is encoded as decimal 65. You can represent that to people as 0x41 (hex) in a hex viewer, as an "A" an editor, but ultimately, you write that byte to a file, it's just a byte translated to a series of eight bits, 01000001.
Typically you create a text file using Writer(s), and a binary file using other means (Streams, Channels, etc.). However, if your 'binary' file contains text and only text, it is a text file regardless.
Regarding hexadecimal format, that is merely a compact (preferred) way of viewing byte values.
I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);                  // read a single byte from the file
    System.out.println(new String(cbuff)); // decode it using the platform default charset
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (though it can still be configured to behave the legacy way)
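As a rough illustration of that advice, a minimal sketch that names the charset explicitly on both conversions (the byte values are arbitrary):

import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) {
        byte[] bytes = { 72, 105 };                           // the ASCII/UTF-8 bytes for "Hi"
        String s = new String(bytes, StandardCharsets.UTF_8); // explicit charset, not the platform default
        byte[] back = s.getBytes(StandardCharsets.UTF_8);     // and again when going back to bytes
        System.out.println(s + " " + back.length);            // Hi 2
    }
}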
Java stores all its chars internally as two bytes. However, when they become strings etc., the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them into characters.
It uses the platform default charset to decode bytes into characters. If you know your file contains text that is encoded in a different charset, you can use String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).
In an ASCII text file each character is just one byte
It looks like your file contains ASCII characters, which are encoded in just 1 byte. If the text file contained a non-ASCII character, e.g. a 2-byte UTF-8 sequence, then you would get just the first byte, not the whole character.
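A quick sketch of that failure mode, using 'é' (U+00E9, a 2-byte sequence in UTF-8) purely as an example:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialRead {
    public static void main(String[] args) {
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8); // 'é' is two bytes: [-61, -87]
        System.out.println(Arrays.toString(utf8));
        byte[] firstByteOnly = { utf8[0] };
        // Decoding only the first byte yields U+FFFD (the replacement character), not 'é'.
        System.out.println(new String(firstByteOnly, StandardCharsets.UTF_8));
    }
}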
There are some great answers here, but I wanted to point out that the JVM is free to store a char value in any space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that java characters are often misused. People don't realize they are writing code that won't properly handle codepoints over 16 bits in length.
Java uses 2-byte char values because it follows UTF-16. A character (code point) stored in a String therefore occupies a minimum of 2 bytes and a maximum of 4 bytes (one or two 16-bit code units); there is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded; then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 file is little-endian. For example, the little-endian UTF-16 encoding of A is [65, 0]. When you read the first byte, it returns 65; after padding with 0 for the second byte, you will get A.
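A small sketch to confirm the [65, 0] claim:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LittleEndianDemo {
    public static void main(String[] args) {
        byte[] utf16le = "A".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(utf16le)); // [65, 0]
    }
}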