I'm trying to understand the difference between Scanner.nextByte() and FileInputStream.read(). I've read similar topics, but I didn't find the answer to my question. A similar question is asked here: Scanner vs FileInputStream
Let me say what I understand :
Say that a .txt file includes
1
Then,
FileInputStream.read() will return 49
Scanner.nextByte() will return 1
If .txt file includes
a
FileInputStream.read() will return 97.
Scanner.nextByte() will throw a java.util.InputMismatchException.
In the answers at the link I gave, it says:
FileInputStream.read() will evaluate the 1 as a byte, and return its
value: 49. Scanner.nextByte() will read the 1 and try to evaluate
it as an integer regular expression of radix 10, and give you: 1.
FileInputStream.read() will evaluate the a as a byte, and return its
value: 97. Scanner.nextByte() will read the a and try to evaluate
it as an integer regular expression of radix 10, and throw a
java.util.InputMismatchException.
But I didn't understand what they actually mean. Can you explain this in simple words, with clearer examples? I looked at the ASCII table: the character 1 corresponds to 49. Is that why FileInputStream.read() returns 49?
I'm totally confused. Please explain it to me in simple words.
Files contain bytes. FileInputStream reads these bytes. So if a file contains one byte whose value is 49, stream.read() will return 49. If the file contains two identical bytes 49, calling read() twice will return 49, then 49.
Characters like 'a', '1' or 'Z' can be stored in files. To be stored in files, they first have to be transformed into bytes, because that's what files contain. There are various ways (called "character encodings") to transform characters to bytes. Some of them (like ASCII, ISO-8859-1 or UTF-8) transform the character '1' into the byte 49.
Scanner reads characters from a file. So it transforms the bytes in the file to characters (using the character encoding, but in the other direction: from bytes to characters). Some sequences of characters form decimal numbers, like for example '123', '-5265', or '1'. Some don't, like 'abc'.
When you call nextByte() on a Scanner, you ask the scanner to read the next sequence of characters (until the next white space, or until the end of the file if there is no whitespace), then to check whether this sequence of characters represents a valid decimal number, and whether that decimal number fits in a byte (i.e. is a number between -128 and 127). If so, the sequence of characters is parsed as a decimal number, stored in a byte, and returned.
So if the file contains the byte 49 twice, the sequence of characters read and parsed by nextByte() would be '11', which would be transformed into the byte 11.
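The two behaviors can be sketched side by side with a temp file containing the single character '1':

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class ScannerVsStream {
    public static void main(String[] args) throws IOException {
        // Write a file containing the single character '1' (the byte 49)
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "1".getBytes(StandardCharsets.US_ASCII));

        // FileInputStream returns the raw byte value: 49
        try (InputStream in = new FileInputStream(file.toFile())) {
            System.out.println(in.read()); // prints 49
        }

        // Scanner decodes the byte to the character '1',
        // then parses it as a decimal number: 1
        try (Scanner sc = new Scanner(file.toFile())) {
            System.out.println(sc.nextByte()); // prints 1
        }

        Files.delete(file);
    }
}
```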
Related
I am trying to convert a string like "password" to hex values and store them in a long array. The loop works fine until it reaches the value "6F" (the hex value for the 'o' char); then I get a java.lang.NumberFormatException.
String password = "password";
char array[] = password.toCharArray();
long[] data = new long[array.length]; // declaration missing from the original snippet
int index = 0;
for (char c : array) {
    String hex = Integer.toHexString((int) c);
    data[index] = Long.parseLong(hex); // throws NumberFormatException on "6f"
    index++;
}
How can I store the 6F value inside a byte array, as 6F is greater than 1 byte? Please help me with this.
Long.parseLong parses decimal numbers. It turns the string "10" into the number 10. If the input is hex, that is incorrect - the string "10" is supposed to be turned into the number 16. The fix is to use the Long.parseLong(String input, int radix) method. The radix you want is 16, though writing that as 0x10 may be more readable - it's the same thing to the compiler, purely a personal style choice. Thus, Long.parseLong(hex, 0x10) is what you want.
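Concretely, with the 'o' character from the question (hex 6F):

```java
public class HexParse {
    public static void main(String[] args) {
        // "6F" parsed as hex (radix 16) is 6*16 + 15 = 111, the char value of 'o'
        System.out.println(Long.parseLong("6F", 16)); // prints 111

        // Without the radix, parseLong assumes decimal, and "6F" is not
        // a valid decimal number - this is the exception in the question.
        try {
            Long.parseLong("6F");
        } catch (NumberFormatException e) {
            System.out.println("decimal parse failed, as expected");
        }
    }
}
```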
Note that in practice char has numbers that go from 0 to 65535, which doesn't fit in bytes. In effect, you must put a marker down that passwords must not contain any characters that aren't ASCII characters (so no umlauts, snowmen, emoji, funny quotes, etc).
If you fail to check this, Integer.toHexString((int) c) will turn into something like 16F or worse (3 to 4 characters), and it may also turn into a single character.
More generally, converting from char c to a hex string, and then parse the hex string into a number, is completely pointless. It's turning 15 into "F" and then turning "F" into 15. If you just want to shove a char into a byte: data[index++] = (byte) c; is all you need - that is the only line you need in your for loop.
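So the whole loop from the question reduces to a single cast per character (assuming, per the note above, that the password is ASCII-only):

```java
public class PasswordBytes {
    public static void main(String[] args) {
        String password = "password";
        byte[] data = new byte[password.length()];
        int index = 0;
        for (char c : password.toCharArray()) {
            // For ASCII characters the char value fits in a byte directly;
            // no hex round-trip is needed.
            data[index++] = (byte) c;
        }
        // 'p' = 112, 'o' = 111, ...
        System.out.println(data[0]); // prints 112
    }
}
```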
But, heed this:
This really isn't how you're supposed to do that!
What you're doing is converting character data to a byte array. This is not actually simple - there are only 256 possible bytes, and there are way more characters that folks have invented. Literally hundreds of thousands of them.
Thus, to convert characters to bytes or vice versa, you must apply an encoding. Encodings have wildly varying properties. The most commonly used encoding, however, is 'UTF-8'. It represents every Unicode symbol, and has the interesting property that basic ASCII characters look the exact same. However, it has the downside that any given character is smeared out into 1, 2, 3, or even 4 bytes, depending on what character it is. Fortunately, Java has plenty of tools for this, so you don't need to care. What you really want is this:
byte[] data = password.getBytes(StandardCharsets.UTF_8);
That's asking the string to turn itself into a byte array, using the UTF-8 encoding. That means "password" turns into the sequence '112 97 115 115 119 111 114 100', which is no doubt what you want, but you can also have as password, say, außgescheignet ☃, and that works too - it's turned into bytes, and you can get back to your snowman-enabled password:
String in = "außgescheignet ☃";
byte[] data = in.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(data, StandardCharsets.UTF_8);
assert in.equals(andBackAgain); // true
If you stick this in a source file, make sure you save it in whatever text editor you use as UTF-8, and that javac compiles it that way too (javac has an -encoding parameter to enforce this).
If you think this is going to cause issues on whatever you send this to, and you want to restrict it to what someone with a rather USA-centric view would call 'normal' characters, then you want the exact same code as showcased here, but with StandardCharsets.US_ASCII instead. One caveat: password.getBytes(StandardCharsets.US_ASCII) does not flat out error on non-ASCII characters - it silently replaces each unmappable character with '?'. To actually get an error, run the string through a CharsetEncoder, which by default reports unmappable characters instead of replacing them. An error is a good thing here: your infrastructure would not deal with non-ASCII correctly, as we just posited in this hypothetical exercise. Throwing an exception early in the process, on a relevant line, is exactly what you want.
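For comparison: String.getBytes(StandardCharsets.US_ASCII) quietly substitutes '?' for unmappable characters, while a CharsetEncoder (whose default error action is to report) throws. A minimal sketch, with the snowman standing in for any non-ASCII input:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    public static void main(String[] args) {
        // getBytes silently replaces the unmappable snowman with '?'
        byte[] replaced = "\u2603".getBytes(StandardCharsets.US_ASCII);
        System.out.println(replaced[0]); // prints 63, the byte value of '?'

        // A CharsetEncoder reports the problem as an exception instead,
        // which fails fast, as desired.
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        try {
            encoder.encode(CharBuffer.wrap("\u2603"));
        } catch (CharacterCodingException e) {
            System.out.println("non-ASCII input rejected");
        }
    }
}
```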
I defined a unicode character as a byte array:
private static final byte[] UNICODE_MEXT_LINE = Charsets.UTF_8.encode("\u0085").array();
At the moment byte array length is 3, is it safe to assume the length of the array is always 3 across platforms?
Thank you
It's safe to assume that that particular character will always encode to the same bytes, regardless of platform - but note that U+0085 is actually two bytes in UTF-8 (0xC2 0x85), not three. The length of 3 you're seeing comes from .array(), which returns the ByteBuffer's entire backing array; its capacity can be larger than the encoded content. Use "\u0085".getBytes(StandardCharsets.UTF_8) to get exactly the encoded bytes.
But characters in UTF-8 can be one byte, two bytes, three bytes or even four bytes long, so no, you can't assume that if you convert any character to UTF-8 it will come out as a fixed number of bytes.
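It's worth checking with getBytes, which returns exactly the encoded bytes; the .array() of an encoded ByteBuffer, by contrast, exposes the whole backing array, which can include spare capacity:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class NelBytes {
    public static void main(String[] args) {
        // getBytes returns exactly the encoded bytes: 0xC2 0x85
        byte[] exact = "\u0085".getBytes(StandardCharsets.UTF_8);
        System.out.println(exact.length); // prints 2

        // Charset.encode returns a ByteBuffer; remaining() is the real
        // encoded length, while array() exposes the backing array,
        // whose capacity can exceed it.
        ByteBuffer bb = StandardCharsets.UTF_8.encode("\u0085");
        System.out.println(bb.remaining()); // prints 2
        System.out.println(bb.array().length >= bb.remaining()); // prints true
    }
}
```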
That particular character will always encode to the same number of bytes, but others will be different. UTF-8-encoded characters are anywhere from 1 to 4 bytes long. The 8 in 'UTF-8' just means that it uses 8-bit code units.
The Wikipedia page on UTF-8 provides a pretty good overview of how that works. Basically, the first bits of the first byte tell you how many bytes long that character will be. For instance, if the first bit of the first byte is a 0, as in 01111111, then that means this character is only one byte long (in UTF-8, these are the ASCII characters). If the first bits are 110, as in 11011111, then that tells you that this character will be two bytes long. The chart on the Wikipedia page provides a good illustration of this.
There's also this question, which has some good answers as well.
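A few concrete lengths, for illustration:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // UTF-8 uses 1 to 4 bytes depending on the code point.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1 (ASCII)
        System.out.println("\u00e9".getBytes(StandardCharsets.UTF_8).length); // 2 (é)
        System.out.println("\u20ac".getBytes(StandardCharsets.UTF_8).length); // 3 (€)
        // U+1F600 (😀) is a surrogate pair in a Java String, 4 bytes in UTF-8
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```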
When I use a ByteBuffer to write, say, the integer 1 into a .txt file:
FileChannel fc;
buffer.putInt(1);
fc.write(buffer);
when I open it with a text editor, it does not appear to be 1 there, but it can be read back by the buffer correctly. But if I write characters such as 'a', 'b' into the file, they appear fine.
Is it natural that, when I write integers with a ByteBuffer, I cannot open the file and see them clearly with my eyes?
In order to see the integer you write to the file, you must first convert it to readable characters. For example, the integer 20000 is different from the string "20000". The integer is represented as 4 bytes in the file, whereas the individual characters that make up the readable string take at least 5 bytes (in this example). So when you write the integer value to the text file, what you see (or don't see) is the text editor trying to interpret the 4 bytes that make up the integer as 4 ASCII characters (which may or may not be printable).
All computer files everywhere are just a sequence of bits and bytes.
Humans have come up with a way to represent human readable characters with bit sequences. These are known as character sets or character encodings. A very basic one is ASCII.
For example, the English upper-case character A is represented with the binary value
100 0001
the decimal value
65
or the hex value
41
When you write
(buffer.putInt(1))
fc.write(buffer) // assuming you've positioned the ByteBuffer
you're writing the decimal value 1 as binary to the file. The decimal value 1, as an int, in binary, is
00000000 00000000 00000000 00000001
since an int is 4 bytes.
When you open the file with a text editor (or any editor), it will see 4 bytes and try to give you the textual representation.
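The contrast can be sketched like this: writing the int 1 stores four raw bytes, while writing the text "1" stores the single printable byte 49.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class IntVsText {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("int", ".txt");
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.allocate(4);
            buffer.putInt(1);
            buffer.flip(); // position the buffer before writing
            fc.write(buffer);
        }
        // Four bytes: 0, 0, 0, 1 -- none of which is a printable character,
        // which is why a text editor shows nothing recognizable.
        System.out.println(Arrays.toString(Files.readAllBytes(file))); // prints [0, 0, 0, 1]

        // Writing the *text* "1" instead stores the single byte 49.
        Files.write(file, "1".getBytes(StandardCharsets.US_ASCII));
        System.out.println(Files.readAllBytes(file)[0]); // prints 49
        Files.delete(file);
    }
}
```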
What's the difference between writeUTF and writeChars? (methods of ObjectOutputStream)
Further, I have not found a corresponding readChars in ObjectInputStream.
writeUTF writes text in the (modified) UTF-8 encoding, preceded by its length in bytes, so readUTF knows how much to read from the stream.
writeChars writes text as a sequence of 2-byte chars with no length prefix. To read it back, we must use the readChar method, and we need to know how many chars were written.
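A round trip, sketched with DataOutputStream/DataInputStream (the Object streams share these methods via the DataOutput/DataInput interfaces):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class UtfVsChars {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("hi");   // 2 length bytes + 2 UTF-8 bytes = 4 bytes
        out.writeChars("hi"); // 2 chars * 2 bytes each = 4 bytes, no length
        out.flush();

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(in.readUTF()); // prints hi

        // There is no readChars(): we must know the count and call readChar()
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2; i++) {
            sb.append(in.readChar());
        }
        System.out.println(sb); // prints hi
    }
}
```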
writeChars() uses Unicode values
Writes every character in the string s, to the output stream, in
order, two bytes per character. If s is null, a NullPointerException
is thrown. If s.length is zero, then no characters are written.
Otherwise, the character s[0] is written first, then s[1], and so on;
the last character written is s[s.length-1]. For each character, two
bytes are actually written, high-order byte first, in exactly the
manner of the writeChar method.
writeUTF() uses a slightly-modified version of UTF-8
Writes two bytes of length information to the output stream, followed
by the modified UTF-8 representation of every character in the string
s. If s is null, a NullPointerException is thrown. Each character in
the string s is converted to a group of one, two, or three bytes,
depending on the value of the character.
I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java, characters are 2 bytes. I am wondering how the following program works: it reads characters from standard input, stores them in a single byte, and prints them out. How do two-byte characters fit into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
Java stores characters as 2 bytes, but for plain ASCII characters the actual data fits in one byte. So as long as you can assume the input being read is ASCII, this will work fine, since the actual numeric value of the character fits in a single byte.
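A small sketch of why the cast works for ASCII but not in general: a non-ASCII character takes more than one byte in UTF-8, and casting each byte individually produces mojibake.

```java
import java.nio.charset.StandardCharsets;

public class ByteToChar {
    public static void main(String[] args) {
        // For ASCII, casting the byte value back to char recovers the character.
        byte b = "A".getBytes(StandardCharsets.US_ASCII)[0];
        System.out.println((char) b); // prints A

        // 'é' (U+00E9) needs two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2

        // Casting the first byte alone gives 'Ã' (U+00C3), not 'é'.
        System.out.println((char) (utf8[0] & 0xFF));
    }
}
```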