Size and Size on disk of a .txt file - java

I opened a new file in Notepad and typed the sentence (without the quotes) "Four score and seven years ago".
Four 4 characters
score 5 characters
and 3 characters
seven 5 characters
years 5 characters
ago 3 characters
TOTAL : 25 + 5 spaces = 30 characters.
You will find that the file has a size of 30 bytes on disk: 1 byte for each character.
I saved the file to disk under the name gettingSize.txt.
Then I looked at the size of the file.
As a rule, each character consumes one byte.
Size : 30 bytes
Size on Disk : 4.00 KB (4,096 bytes)
The paragraphs below are copied from a PDF.
If you were to look at the file as a computer looks at it, you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character (see below). So on disk, the numbers for the file look like this:
F    o    u    r    (sp)  a    n    d    (sp)  s    e    v    e    n
70   111  117  114  32    97   110  100  32    115  101  118  101  110
By looking in the ASCII table, you can see a one-to-one correspondence between each character and the ASCII code used. Note the use of 32 for a space -- 32 is the ASCII code for a space. We could expand these decimal numbers out to binary numbers (so 32 = 00100000) if we wanted to be technically correct -- that is how the computer really deals with things.
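You can see this yourself from Java by dumping the raw bytes of the file (a minimal sketch, assuming the gettingSize.txt file saved above; Files and Paths are in java.nio.file):
byte[] raw = Files.readAllBytes(Paths.get("gettingSize.txt"));
for (byte b : raw)
    System.out.print(b + " ");   // prints 70 111 117 114 32 ... -- one ASCII code per byte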
1) I know that everything is stored in the form of bits and bytes, so what does this mean: "you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character"? A byte is 8 bits, so how can a byte contain an ASCII number (e.g. 49 for '1') rather than just 0s and 1s?
2) What exactly is the difference between Size and Size on Disk? And how do ASCII and Unicode fit into it?
3) In Java, Strings are objects. Can I say a String is multiple characters concatenated together?
String str = "Four score and seven years ago";
So how is str stored in memory? Is it stored in the same manner as when saving in the Notepad file?

Files are stored in blocks. If the file size is smaller than the block size (in your case, 4 KB), the file still occupies a whole block, but most of that space is unused. I think this question was answered on Super User; I'll find the link.
UPDATE: https://superuser.com/questions/704218/why-is-there-such-a-big-difference-between-size-and-size-on-disk

To make a few short points:
"How can a byte contain an ASCII number (eg. 49 for '1') other than 0 and 1?
A Byte is 8 bits. Thus you can store numbers between 0 and 255 in it.
What is the difference between file size and size on disk?
See MJafar Mash's answer: "size" is the actual size in bytes, and "size on disk" is the number of bytes allocated in whole blocks to hold the file.
In Java, Strings are objects. Can I say that a String is multiple characters concatenated together?
Yes, but it's actually more complicated than that:
Taken from this answer:
Initializes a newly created String object so that it represents the same sequence of characters as the argument; in other words, the newly created string is a copy of the argument string. Unless an explicit copy of original is needed, use of this constructor is unnecessary since Strings are immutable.

1) I know that everything is stored in the form of bits and bytes, so what does this mean: "you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character"? A byte is 8 bits, so how can a byte contain an ASCII number (e.g. 49 for '1') rather than just 0s and 1s?
Each ASCII character occupies 1 byte, and internally each character is stored as its ASCII number. A byte holds 8 bits of data, so its maximum value is 2^8 - 1 = 255, giving a range of 0-255.
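For example, casting a character to a byte shows the ASCII number that is actually stored (just a tiny illustration):
byte b = (byte) '1';
System.out.println(b);        // 49, the ASCII code for '1'
System.out.println((char) b); // prints 1 -- the same byte interpreted as a character again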
2) What exactly is the difference between Size and Size on Disk? And how do ASCII and Unicode fit into it?
Each ASCII character is 1 byte, so 30 bytes is the actual size of the data in the file. The 4 KB is the size of the allocation block in which the file is stored; in your case it is the minimum "new" space given to any file on the disk.
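If you want to check both numbers from Java, something along these lines should work (a sketch, assuming the gettingSize.txt file from the question exists; FileStore.getBlockSize() requires Java 10 or later):
import java.nio.file.*;

public class SizeCheck {
    public static void main(String[] args) throws Exception {
        Path p = Paths.get("gettingSize.txt");
        System.out.println("Size: " + Files.size(p) + " bytes");                              // e.g. 30
        // allocation unit of the file system holding the file
        System.out.println("Block size: " + Files.getFileStore(p).getBlockSize() + " bytes"); // e.g. 4096
    }
}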
3) In Java, Strings are objects. Can I say a String is multiple characters concatenated together? String str = "Four score and seven years ago"; So how is str stored in memory? Is it stored in the same manner as when saving in the Notepad file?
Yes, a String is indeed (internally) multiple characters held together, but the characters cannot be changed: Strings are immutable. A String object stores its text as an array of characters (in Java a char is 2 bytes). The default Charset used when converting between Strings and bytes is platform dependent (often UTF-8), and you can also specify a different one.
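To relate that to the Notepad file: encoding the same string as ASCII gives the 30 bytes you saw on disk, while the in-memory char data uses 2 bytes per character (a rough sketch; StandardCharsets is in java.nio.charset, and object headers are ignored):
String str = "Four score and seven years ago";
System.out.println(str.length());                                   // 30 characters
System.out.println(str.getBytes(StandardCharsets.US_ASCII).length); // 30 bytes, same as the file
System.out.println(str.toCharArray().length * 2);                   // 60 bytes of char data in memory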

Related

What is the difference in bytes of a number as a string and as an integer?

Let's say we have my_string = "123456".
I do
my_string.getBytes()
and
BigInteger.valueOf(123456).toByteArray()
The resulting byte arrays are different in these two cases. Why is that so? Isn't "123456" the same as 123456, apart from the difference in data type?
They are different because the String type is made up of Unicode characters. The character '2' is not at all the same as the numeric value 2.
No. Why would they be? "123456" is a sequence of the ASCII character 1 (which is not represented as the number 1, but as the number 49), followed by the character 2 (stored as 50), and so on. 123456 as an int isn't represented as a sequence of digits from 0-9 at all; it's stored as a number in binary.
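A small sketch of the difference (BigInteger.valueOf is used here for the numeric side; imports are java.math.BigInteger, java.util.Arrays and java.nio.charset.StandardCharsets):
System.out.println(Arrays.toString("123456".getBytes(StandardCharsets.US_ASCII)));
// [49, 50, 51, 52, 53, 54] -- one ASCII code per digit character
System.out.println(Arrays.toString(BigInteger.valueOf(123456).toByteArray()));
// [1, -30, 64]             -- 123456 = 0x01E240, the binary value itself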
I assume that you are asking about the total memory used to represent a number as a String versus a byte[].
The String size will depend on the actual string representation used. This depends on the JVM version; see What is the Java's internal represention for String? Modified UTF-8? UTF-16?
For Java 8 and earlier (with some caveats), the String consists of a String object with 1 int field and 1 reference field. Assuming 64-bit references, that adds up to 8 bytes of header + 1 x 4 bytes + 1 x 8 bytes + 4 bytes of padding. Then add the char[] used to represent the characters: 12 bytes of header + 2 bytes per character, rounded up to a multiple of 8.
For Java 9 and later, the main object has the same size. (There is an extra field ... but that fits into the "padding".) The char[] is replaced by a byte[], and since you are just storing ASCII decimal digits¹, they will be encoded one character per byte.
In short, the asymptotic space usage is 1 byte per decimal digit for Java 9 or later and 2 bytes per decimal digit in Java 8 or earlier.
For the byte[] representation produced by a BigInteger, the representation consists of 12 bytes of header + 1 byte per byte ... rounded up to a multiple of 8. The asymptotic size is 1 byte per byte.
In both cases there is also the size of the reference to the representation; i.e. another 8 bytes.
If you do the sums, the byte[] representation is more compact than the String representation in all cases. But int or long are significantly more compact than either of these representations in all cases.
¹ If you are not ... or if you are curious why I added this caveat ... read the Q&A at the link above!

Java 11 Compact Strings magic behind char[] to byte[]

I have been reading about Unicode encodings and Java 9 compact Strings for the last two days and I am getting along quite well, but there is something I don't understand.
About byte data type
1) It is 8-bit storage and ranges from -128 to 127.
Questions
1) Why didn't Java make byte unsigned, like char is an unsigned 16-bit type? Then it would have a range of 0-255. From 0 to 127 I can only hold an ASCII value; what happens if I set the value 200 (extended ASCII)? It overflows to -56.
2) Does the negative value mean something? I have tried a simple example using Java 11:
final char value = (char) 200; // in a byte this would overflow
final String stringValue = new String(new char[]{value});
System.out.println(stringValue); // prints the same value as on Java 8
I have inspected the String.value field and I see a byte array where
System.out.println(value[0]); // -56
The same question as before arises: does the -56 (the negative value) mean something? In other languages this overflow is detected and mapped back to 200. How does Java know that the -56 value is the same as 200 in a char?
I have tried harder examples, like code point 128048, and in the String.value field I see an array of bytes like this:
0 = 61
1 = -40
2 = 48
3 = -36
I know this code point takes 4 bytes, and I see how the char[] is transformed to a byte[], but I don't understand how String handles this byte[] data.
Sorry if this question is simple, and sorry for any typos; English is not my native language. Thanks a lot.
Why didn't Java make byte unsigned, like char is an unsigned 16-bit type? Then it would have a range of 0-255. From 0 to 127 I can only hold an ASCII value; what happens if I set the value 200 (extended ASCII)? It overflows to -56.
Java’s primitive data types were settled with Java 1.0 a quarter century ago. The compact strings were introduced in Java 9, less than two years ago. This new feature, which is merely an implementation detail, did not justify fundamental changes to Java’s type system.
Besides that, you are looking at one interpretation of the data stored in a byte. For the sake of representing iso-latin-1 units, it is entirely irrelevant whether interpreting the same data as Java’s built-in signed byte would result in a positive or negative number.
Likewise, Java’s I/O API allows reading a file into a byte[] array and writing byte[] arrays back to files, and these two operations are already sufficient to copy a file losslessly, regardless of its file format, which would only be relevant when interpreting its content.
So the following works since Java 1.1:
byte[] bytes = "È".getBytes("iso-8859-1");
System.out.println(bytes[0]);
System.out.println(bytes[0] & 0xff);
-56
200
The two numbers, -56 and 200 are just different interpretations of the bit pattern 11001000 whereas the iso-latin-1 interpretation of a byte containing the bit pattern 11001000 is the character È.
A char value is also just an interpretation of a two byte quantity, i.e. as UTF-16 code unit. Likewise, a char[] array is a sequence of bytes in the computer’s memory with a standard interpretation.
We can also interpret other byte sequences this way.
StringBuilder sb = new StringBuilder().appendCodePoint(128048);
byte[] array = new byte[4];
StandardCharsets.UTF_16LE.newEncoder()
    .encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
System.out.println(Arrays.toString(array));
will print the value you’ve seen, [61, -40, 48, -36].
The advantage of using a byte[] array inside the String class is that the interpretation can now be chosen: iso-latin-1 when all characters are representable in that encoding, utf-16 otherwise.
The possible numeric interpretations are irrelevant to the string. However, when you ask “How can Java know that -56 value is the same as 200”, you should ask yourself, how does it know that the bit pattern 11001000 of a byte is -56 in the first place?
System.out.println(value[0]);
bears an actually expensive operation, compared to ordinary computer arithmetic, the conversion of a byte (or an int) to a String. This conversion operation is often overlooked as it has been defined as the default way of printing a byte, but is not more natural than a conversion to a String interpreting the value as an unsigned quantity. For further reading, I recommend Two's complement.
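To make the "interpretation" point concrete, here is a tiny snippet (nothing Java 11 specific) showing the same byte read as signed and as unsigned:
byte b = (byte) 200;                       // bit pattern 11001000
System.out.println(b);                     // -56 (signed two's-complement interpretation)
System.out.println(b & 0xff);              // 200 (unsigned interpretation)
System.out.println(Byte.toUnsignedInt(b)); // 200, the same thing via the standard helper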
This is because not all bytes in a string are interpreted the same way. It depends on the string's character encoding.
Example:
in a UTF-8 string, the basic (ASCII) characters are 8 bits (1 byte) each.
in a UTF-16 string, each character is (at least) 16 bits (2 bytes).
etc...
This means that if the string is represented as UTF-8, the characters are made by reading 1 byte at a time; if 16-bit, the characters are made by reading 2 bytes at a time.
Look at this code: a single byte array data is transformed to string using UTF-8 and UTF-16.
byte[] data = new byte[] {97, 98, 99, 100};
System.out.println(new String(data, StandardCharsets.UTF_8));
System.out.println(new String(data, StandardCharsets.UTF_16));
The output of this code is:
abcd // 4 bytes = 4 chars, 1 byte per char
慢捤 // 4 bytes = 2 chars, 2 byte per char
Going back to the question, what motivated the developers to do this was to reduce the memory footprint of strings. Not all strings use all 16 bits a char offers.

How to write to a file in Java after Huffman Coding is done

I have implemented a class for Huffman coding. The class parses an input file, builds a Huffman tree from it, and creates a map whose keys are the distinct characters that appear in the file and whose values are the Huffman codes of those characters.
For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the Huffman tree and generate the Huffman code for each character, you can see that for the character 'a' the code is '101', for the character 'r' it is '0101', and so on.
My intention is to compress the file, so I cannot simply write the string obtained by replacing each character with its Huffman code directly to the file: each character would be replaced by at least 3 characters, since every '1' and '0' would still be written to the file as a character, not a bit. So I thought I would write bytes instead, since there is no way to write individual bits to a file. But then 'a' (code '101') and 'r' (code '0101') are both written as the byte 5, which causes problems when trying to decompress the file.
This is how I am converting a series of bits to bytes:
public byte[] compressString(String s, CharCodeHashMap map) {
    String byteString = "";
    byte[] byteArr = new byte[s.length()];
    int size = 0;
    for (int i = 0; i < s.length(); i++) {
        byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
        byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
        byteString = "";
    }
    return byteArr;
}
I tried prefixing '1' to each of the codes to fix the problem. But when you build a Huffman tree from a real file, some characters end up with codes longer than 8 bits, and then new BigInteger(byteString, 2).toByteArray() has more than one element in the array. (For example, if 'v' has the code '11010001', new BigInteger(byteString, 2).toByteArray() returns [0, -47].)
Can someone please suggest a way to write to the file so that it is actually compressed and these problems are also taken care of?
The problem is that files in modern operating systems are modeled as indexable sequences of bytes¹.
So what you need is a way to encode the fact that your file is representing a number of bits that may not be a multiple of 8. That means the bit stream size is not necessarily the file size (in bytes) multiplied by 8.
There are a variety of solutions:
Reserve N bytes at the start of the file for the file size in bits. For example, reserving 4 bytes allows you to represent file sizes up to 2^32 bits.
Reserve 3 bits at the start of the file to hold the number of bits modulo 8. You can use this to decide how many bits in the last byte of the file to ignore.
Use some kind of encoding to represent the end of stream; e.g. represent it as a character in the text stream that you are encoding.
Is there a way to deal with this without using some bits? AFAIK, No.
¹ And at a lower level, files are represented as sequences of disk blocks consisting of multiple bytes. So, from a physical storage perspective, compressing files that are already small (e.g. smaller than a disk block) doesn't achieve anything. Similarly, saving or not saving (say) 3 bits when the representation is modeled as a byte sequence is at the border of being pointless ... if that was what was concerning you.
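As a sketch of the first option above (the 4-byte header layout is my own choice for illustration, not a fixed format), the encoder writes the bit count before the packed bytes and the decoder reads it back so it knows how many padding bits to ignore:
import java.io.*;

// write a 4-byte bit count, then the packed bytes (last byte may contain padding)
static void writePacked(File f, byte[] packed, int bitCount) throws IOException {
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
        out.writeInt(bitCount); // number of valid bits that follow
        out.write(packed);
    }
}

// on the decoding side, read the bit count back before reading the data
static int readBitCount(DataInputStream in) throws IOException {
    return in.readInt();
}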
Yes, you can write bits to a file. In fact you are always writing bits to a file. The only thing is that you are writing eight bits at a time.
What you need is a bit buffer, say a 32-bit unsigned variable, into which you accumulate bits. Have another integer that tracks how many bits are in the bit buffer. Use the shift left and or (or plus) operators to put more bits in the bit buffer, and the and and shift right operators to remove them. Whenever you have eight or more bits in the bit buffer, you write those eight bits to the file as a byte. At the end, write the remaining bits (if any) to the file as the last byte.
So, to add bits bits from value to the buffer (value holds them in its low-order positions):
bitBuffer |= value << bitCount;
bitCount += bits;
to write and remove available bytes:
while (bitCount >= 8) {
    writeByte(bitBuffer & 0xff);
    bitBuffer >>>= 8;
    bitCount -= 8;
}
You need to make sure that when decoding, you don't mistake the filler bits in the last byte as another code. You can either send the actual number of bits in the message preceding the message (or the number of bits in the last byte), or you can add a symbol to your alphabet for end-of-stream that gets its own Huffman code, and end the message with that.
The other problem you have is that you will also need to transmit the Huffman code itself to the decoder before the coded symbols in order for the decoder to know how to decode. Look up "canonical Huffman codes" for how to approach that efficiently.
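Put together, a minimal bit writer along those lines might look like this (a sketch only: the class name is made up, writeByte from the snippets above is replaced by an OutputStream, and it assumes individual codes are at most 24 bits so the int buffer never overflows):
import java.io.IOException;
import java.io.OutputStream;

class BitWriter {
    private final OutputStream out;
    private int bitBuffer = 0; // accumulates bits, least significant first
    private int bitCount = 0;  // number of valid bits currently in bitBuffer

    BitWriter(OutputStream out) { this.out = out; }

    // append the 'bits' low-order bits of 'value' to the stream
    void write(int value, int bits) throws IOException {
        bitBuffer |= (value & ((1 << bits) - 1)) << bitCount;
        bitCount += bits;
        while (bitCount >= 8) { // flush whole bytes as they become available
            out.write(bitBuffer & 0xff);
            bitBuffer >>>= 8;
            bitCount -= 8;
        }
    }

    // write any remaining bits, zero-padded, as the final byte
    void finish() throws IOException {
        if (bitCount > 0) {
            out.write(bitBuffer & 0xff);
            bitBuffer = 0;
            bitCount = 0;
        }
        out.flush();
    }
}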

Java algorithm to Compress small numeric number

I need to compress a numeric number of 20-40 characters down to a 6-character number. So far I have tried Huffman and some zip algorithms but am not getting the desired outcome.
Can someone please advise another algorithm/API for this in Java?
Example:
Input: 98765432101234567890
Desired Output: 123456
Please note: I don't mean the output has to be 123456 for the given input, only that if I supply a 20-byte number it should be compressed to a 6-byte number.
Usage: The compressed number will be fed to a device (which can only take up to 6 numeric characters). The device will decode the number back to the original number.
Assumptions/Limits:
If required, both client and device (server) can share some common properties needed for encoding/decoding the number.
Only one request can be made to the device, i.e. all data should be fed in one request, not in small packets.
Thanks.
This will be the best you can get assuming that any combination of digits is a legal input:
final String s = "98765432101234567890";
for (byte b : new BigInteger('0' + s).toByteArray())
    System.out.format("%02x ", b & 0xff);
Prints
05 5a a5 4d 36 e2 0c 6a d2
Storing a number in binary form is theoretically the most efficient way since every combination of bits is a distinct legal value.
You may have other options only if there is more redundancy in your input, that is, if there are some constraints on the legal digit combinations.
The way you specify it, this is not possible. There are simply more 20-digit numbers than there are 6-digit numbers, so if you map 20 digits to only six digits, some 20-digit numbers will have to be mapped to the same six-digit number. If you know that not all numbers will be valid or that they don't all have the same likelihood, this can be used for compression, but otherwise it is impossible.
Although a reversible (bijective) mapping from 20 digit numbers to six digit numbers is impossible it is still possible to map long numbers to shorter output. This works by reducing the requirement that the output needs to be a number. The only important consideration is that the output sequence needs to have the same number of possibilities as the input. Here is an example:
There are 10^20 possible 20 digit numbers
If you use a sequence of full 8-bit ASCII (256 characters) of length x, you will have 256^x possible outputs. If you solve this for x, you will notice that 256^9 > 10^20, so 9 ASCII characters are enough to encode the 10^20 possible numerical inputs.
Marko's answer to the same question will tell you how to convert a number to its byte representation, which may be used as input. But be aware that this input will not be numerical and may contain many strange symbols.
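A small sketch of that round trip (just an illustration; for a 20-digit number the two's-complement byte form is at most 9 bytes and converts back losslessly, using java.math.BigInteger):
BigInteger original = new BigInteger("98765432101234567890");
byte[] packed = original.toByteArray();        // 9 bytes for this 20-digit value
BigInteger restored = new BigInteger(packed);  // the bytes convert back losslessly
System.out.println(packed.length);             // 9
System.out.println(restored.equals(original)); // true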

Generation of a unique id of size less than 11 bytes from a string

I am developing a piece of code to generate a unique hexadecimal value from an input string. The output size must be less than 11 bytes, which comes as a requirement. Can someone please give me an insight into this? I have done the string-to-binary conversion and then the hexadecimal mapping, which produces a combination of alphanumeric characters, but the size is always greater than 11 bytes. I also need to regenerate the input from this unique id. Is that possible?
Thanks in advance.
If your result must be absolutely unique and your input can be any length, then your task is impossible.
Think of it that way: how many different combinations of 11 bytes are there? 256^11 (or 2^(11*8) = 2^88).
That's a big number, right? Yes, but it's not big enough.
For simplicity's sake we'll talk about ASCII strings only, so we have 128 different values (in reality there are many more possibilities for a character in a Java String, but the principle stays the same; for simplicity's sake we also ignore that a \0 character in a String is kind of unlikely).
Now, there are 128^13 different 13-character ASCII strings. That's 2^(7*13), or 2^91, different combinations. Obviously you can't have a unique id out of 2^88 possible ids for 2^91 different strings.
Less than 11 bytes means at most 10 bytes.
10 bytes is 80 bits, and 2^80 is a huge number.
So if you take your hex value modulo that number, the result will fit into the 10 bytes. Convert the remainder back to hex.
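A quick sketch of that approach (the id is not guaranteed to be unique, as the other answer explains, and the input cannot be regenerated from it; imports are java.math.BigInteger and java.nio.charset.StandardCharsets):
String input = "some arbitrarily long input string";
BigInteger asNumber = new BigInteger(1, input.getBytes(StandardCharsets.UTF_8)); // positive number from the raw bytes
BigInteger id = asNumber.mod(BigInteger.ONE.shiftLeft(80));                      // reduce modulo 2^80 -> fits in 10 bytes
System.out.println(id.toString(16));                                             // at most 20 hex digits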
Regenerating the input will not be possible: if the input is allowed to be longer than 11 bytes, reversing the mapping would amount to unlimited compression.
