I have a Huffman coding project in which the first step is to obtain the code of each character from the Huffman tree. I have obtained the code of each character, for example: a = 01, b = 101, c = 111. These codes are Strings, and I want to save them in binary form to a file with a .cmp extension. For example, for the text "abc" the encoding is 01101111. How can I write this binary value to a .cmp file, and afterwards read it back and decode it?
Hopefully you know that bytes and integers consist of bits, so you just need to build a little queue of bits: a single integer that contains the bits, and another integer that tracks how many bits the first one holds, accumulating bits using the shift and or operators. Once you have accumulated a byte, write it out and shift it out of your queue. E.g. to put n bits in: buf |= val << bits; bits += n;, and then to pull bits out when you have enough: while (bits >= 8) { write_out(buf & 0xff); buf >>= 8; bits -= 8; }. Make sure that your integer is large enough to handle the largest value of n you will have, i.e. buf needs to be able to hold maxn+7 bits, since the while loop will never leave more than 7 bits in the buffer.
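A minimal sketch of that queue in Java (all names are illustrative, not from any library; using a long buffer comfortably covers maxn+7 bits for codes up to 56 bits):

import java.io.IOException;
import java.io.OutputStream;

// Minimal bit-queue sketch; illustrative names only.
class BitQueue {
    private long bitBuffer = 0; // accumulated bits, least significant first
    private int bitCount = 0;   // number of valid bits in bitBuffer
    private final OutputStream out;

    BitQueue(OutputStream out) { this.out = out; }

    // Append the low n bits of val, then emit every completed byte.
    void putBits(int val, int n) throws IOException {
        bitBuffer |= ((long) val) << bitCount;
        bitCount += n;
        while (bitCount >= 8) {
            out.write((int) (bitBuffer & 0xff));
            bitBuffer >>>= 8;
            bitCount -= 8;
        }
    }

    // Pad the final 1-7 bits with zeros and write them as the last byte.
    void flush() throws IOException {
        if (bitCount > 0) {
            out.write((int) (bitBuffer & 0xff));
            bitBuffer = 0;
            bitCount = 0;
        }
    }
}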
If you want to work with bit streams, it is easier to use an existing framework, for instance JBBP (Java Binary Block Parser), whose JBBPBitOutputStream class provides bit write operations (there is also a JBBPBitInputStream class to read bits from streams).
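A hedged sketch of what that might look like (the class and method names JBBPBitOutputStream, writeBits and JBBPBitNumber are taken from JBBP's documentation; verify them against the library version you actually use):

import java.io.FileOutputStream;
import com.igormaznitsa.jbbp.io.JBBPBitNumber;
import com.igormaznitsa.jbbp.io.JBBPBitOutputStream;

// Sketch only: write the Huffman codes from the question bit by bit.
// Check JBBP's default bit order (LSB-first) against what your decoder expects.
try (JBBPBitOutputStream bits = new JBBPBitOutputStream(new FileOutputStream("out.cmp"))) {
    bits.writeBits(0b01, JBBPBitNumber.BITS_2);  // code for 'a'
    bits.writeBits(0b101, JBBPBitNumber.BITS_3); // code for 'b'
    bits.writeBits(0b111, JBBPBitNumber.BITS_3); // code for 'c'
}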
I'm trying to read a 31-bit integer from an InputStream in Java and I can't figure out a way to do it. I receive four bytes from the InputStream; the first bit of the first byte is a reserved bit which is always unset (0), and the rest is a 31-bit integer. Here is a visualization of what I described:
+-+---------------------------------------------+
|R|              31-bit Integer                 |
+-+---------------------------------------------+
I would appreciate it if you could help me come up with a solution. Thanks!
I'm trying to read a 31 bit long Integer from an InputStream
That is impossible.
The minimum unit you can read is a byte, which is 8 bits; everything you read from a stream is a multiple of 8 bits.
and the first bit of the first byte is a reversed byte which is always onset (0x0)
This sentence doesn't make any sense. The first 'bit of a byte' cannot be a 'reversed byte'. Given that bits are a 1-dimensional concept, there is no such thing as a 'reversed bit', and 'onset', if it means anything, means '1' and not '0', and bits are not as a rule communicated in '0x' syntax, which is hexadecimal.
I conclude you must be confused about the API.
However, to be a bit more helpful: If you have 4 bytes of data that contains a 31-bit-length integer, then:
You need to know if it is 'big endian' or 'little endian'. It will be someplace in the docs; usually protocols are big endian.
That first bit can trivially be stripped away or isolated, which should help.
Assuming big endian:
try (InputStream raw = socket.getInputStream();
     DataInputStream data = new DataInputStream(raw)) {
    int v = data.readInt();
    boolean isolatedBit = (v >>> 31) != 0;
    v = v & 0x7FFFFFFF;
}
DataInputStream has the readInt() call that takes care of business.
isolatedBit will be false if that 'R' bit is unset, and true if it is set.
Even if this R thing is set, that last line will ensure that the value of v has that bit unset. As a consequence, the number will be between 0 and 2^31-1 (thus, always positive).
NB: After some corrections to the original question, this is much simpler:
Given that the reserved bit is always unset, you can just call int v = data.readInt(); that's the only thing in the try block that would then be required. Had the 'reserved bit' always been 1, you would need the & 0x7FFFFFFF to get rid of it.
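For completeness, the same thing done by hand instead of readInt() (a sketch reusing the data stream from the snippet above; the & 0xFF masks undo Java's sign extension when bytes are widened to int):

// Sketch: assemble the big-endian 31-bit value manually from 4 bytes.
byte[] b = new byte[4];
data.readFully(b);             // throws EOFException if fewer than 4 bytes remain
int v = ((b[0] & 0x7F) << 24)  // mask off the reserved top bit
      | ((b[1] & 0xFF) << 16)
      | ((b[2] & 0xFF) << 8)
      |  (b[3] & 0xFF);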
I read about SHA-256 in a book, but the book doesn't explain what it is for; it only explains how to compute it in Java. I failed to understand what Integer.toString((byteData[i] & 0xff) + 0x100, 16).substring(1) is for. Can someone explain it to me in detail?
Integer.toString((byteData[i] & 0xff) + 0x100, 16).substring(1)
is one way of converting a byte value to a string that's exactly two characters wide, showing the byte's hexadecimal value. Having a look at String's Javadoc page will help.
The combination of adding 0x100 and then calling substring(1) ensures that byte values less than 16 decimal (that is, single hex digits 0 to F) are also represented as two characters.
By the way:
String.format("%02x",byteData[i])
does exactly the same, and might be considered more readable, especially by people who are used to C printf style format strings.
Lastly, why (byteData[i] & 0xff)? Here is a detailed explanation:
It works because Java will perform a widening conversion to int, using sign extension, so instead of a negative byte you will have a negative int. Masking with 0xff will leave only the lower 8 bits, thus making the number positive again (and what you initially intended).
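To see both spellings side by side, here is a small self-contained sketch (the sample byte is arbitrary):

// Both conversions yield the same two hex characters.
// Trace of the book's version for -86: -86 & 0xff = 170; 170 + 0x100 = 426 = 0x1aa;
// Integer.toString(426, 16) = "1aa"; substring(1) = "aa".
byte b = (byte) 0xAA;   // -86 as a signed Java byte
String s1 = Integer.toString((b & 0xff) + 0x100, 16).substring(1);
String s2 = String.format("%02x", b);
System.out.println(s1 + " " + s2);   // prints: aa aa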
SHA-256 is called a hash algorithm, and its purpose is simple: it takes any data and generates a short, effectively unique series of bytes to represent it. There is no way to reverse this process, and no two inputs have ever been found that produce the same SHA-256 hash.
The purpose of the line of code in question is to produce two characters of the final SHA-256 output: Java gives you the hash as raw data (a byte array), and we typically convert it to hexadecimal to represent it as a string, two hex characters per byte. That line of code is pretty dense, so I'll go over what each part of it does separately.
sb.append(...); takes the converted text and adds it to the result stored in a StringBuilder.
Integer.toString(n, 16); takes a number and renders it as a string in base 16 (hexadecimal).
byteData[i] & 0xff selects the current byte of hash data and applies the bitwise AND operation with 0xff (an output bit is 1 only where both corresponding input bits are 1), keeping just the low 8 bits.
.substring(1); returns the string starting after the first character, dropping the leading '1' that adding 0x100 introduced.
This is an excerpt of code from a music tuner application. A byte[] array is created, audio data is read into the buffer array, and then the for loop iterates through buffer and combines the values at indices i and i+1 to create an array of 16-bit numbers that is half the length.
byte[] buffer = new byte[2 * 1200];
int n = buffer.length;
int[] a = new int[n / 2];
targetDataLine.read(buffer, 0, buffer.length);
for (int i = 0; i < n; i += 2) {
    int value = (short) ((buffer[i] & 0xFF) | ((buffer[i+1] & 0xFF) << 8)); // **Don't understand**
    a[i >> 1] = value;
}
So far, what I have is this:
From a different SO post, I learned that every byte being stored in a larger type must be ANDed with 0xFF, due to its conversion to a 32-bit number. I guess the leading 24 bits are filled with 1s (though I don't know why they aren't filled with zeros... wouldn't leading 1s change the value of the number? 000000000010 (2) is different from 111111110010 (-14), after all), so the purpose of 0xff is to only grab the last 8 bits (which is the whole byte).
When buffer[i+1] is shifted left by 8 bits, this makes it so that, when ORing, the eight bits from buffer[i+1] are in the most significant positions, and the eight bits from buffer[i] are in the least significant eight bits. We wind up with a 16-bit number that is of the form buffer[i+1] + buffer[i]. (I'm using + but I understand it's closer to concatenation.)
First, why are we ORing buffer[i] | buffer[i+1] << 8? This seems to destroy the original sound information unless we pull it back out in the same way; while I understand that OR will combine them into one value, I don't see how that value can be useful or used in calculations later. And the only way this data is accessed later is as its literal values:
diff += Math.abs(a[j] - a[i+j]);
If I have 101 and 111, added together I should get 12, or 1100. Yet 101 | 111 << 3 gives 111101, which is equal to 61. The closest I got to understanding was that 101 (5) | 111000 (56) is the same as adding 5+56=61. But the order matters -- doing the reverse, 101 << 3 | 111, is completely different. I really don't understand how the data can remain useful when it is OR'd in this way.
The other problem I'm having is that, because Java uses signed bytes, the eighth position doesn't indicate the value, but the sign. If I'm ORing two binary signed numbers, then in the resulting 16-bit number, the bit at 2⁷ is now acting as a value instead of a placeholder. If I had a negative byte before running the OR, then in my final value post-operation, it would now erroneously be acting as though the original number had a positive 2⁷ in it. 0xff doesn't get rid of this, because it preserves the eighth, signed byte, so shouldn't this be a problem?
For example, 1111 (-1) and 0101, when OR'd, might give 01011111. But 1111 wasn't representing POSITIVE 1111, it was representing the signed version; yet in the final answer, it now is acting as a positive 2³.
UPDATE: I marked the accepted answer, but it took that + a little extra work to figure out where I went wrong. For anyone who may read this in the future:
As far as the signing goes: the code I have uses signed bytes. My only guess as to why this doesn't mess anything up was that all of the values received might be of positive sign, except that doesn't make sense, given that a waveform varies in amplitude over [-1,1]. I also suspected that a few stray sign bits wouldn't hurt the outcome much, since diff accumulates very large values and the code only relies on comparisons. This was all wrong. I gave it some more thought and it's really simple, actually: the only reason this was such a problem is that I didn't know about big-endian, and once I read about it, I misunderstood exactly how it is implemented. Endian-ness is explained in the next bullet point.
Regarding the order in which the bits are placed, destroying the sound, etc.: the code I'm using sets bigEndian=false, meaning that the byte order goes from least significant byte to most significant byte. For this reason, combining the two indices of buffer requires taking the second index, placing its bits first, and placing the first index second (so we are now in big-endian byte order). One of the problems I had was the impression that "endian-ness" determines the bit order; I thought 10010101 big-endian would become 10101001 little-endian. Turns out this is not the case: the bits in each byte remain in their original order; the difference is that the bytes are ordered "backward". So 10110101 11100001 big-endian becomes 11100001 10110101 little-endian: same bit order within each byte, different byte order.
Finally, the accepted answer is correct: targetDataLine.read() can only place data into a byte array (not just in my code, but in all Java code using targetDataLine; read() only accepts a byte array as the destination), but the data is in fact one short split into two bytes. It is for this reason that every two indices must be combined.
Coming back to the signing: it should be obvious by now why this isn't an issue. This is the comment that I now have in the code, which explains more coherently what took all of the above to explain before:
/* The Javadoc explains that the targetDataLine will only read into a byte-typed array.
   However, because the sample size is 16-bit, it is actually storing 16-bit numbers
   (shorts) there, splitting them every eight bits. Additionally, because it stores
   them in little-endian, bits [2^0,2^7] are stored in index [i] in normal order
   (powers 7-6-5-4-3-2-1-0) while bits [2^8,2^15] are stored in index [i+1]. So together
   they currently read as [7-6-5-4-3-2-1-0 15-14-13-12-11-10-9-8], which is a problem.
   In the next for loop, we take care of this and reorganize the bytes by swapping every
   pair (remember: the bits are fine, but the bytes are out of order).
   Also, although the array is signed, this does not matter when we combine bytes,
   because the sign bit (2^15) is placed back at the beginning like it normally is;
   although 2^7 is currently the most significant bit of its byte, it is not a
   sign-indicating bit, because it is really the middle of the short that was split. */
This code combines a byte stream that arrives in low-byte-first (little-endian) order into a stream of shorts in the machine's internal byte order.
As for sign extension, it is more a question of how sign is encoded in the original byte stream. If the original byte stream is unsigned (coding values from 0 to 255), then the & 0xFF overcomes the otherwise unwanted effects of Java treating values as signed. So an educated guess is that the external byte stream encodes unsigned bytes.
Judging whether the code is plausible needs information on what external encoding is being treated and what internal encoding is used. E.g. (wild guess, could be totally wrong!): the two-byte chunks read could belong to the two channels of a stereo sound encoding and are put into a single short for ease of internal processing. You should look at the encoding being read and the use of the converted data within the application.
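A small sketch of that conversion, showing why the masks matter (the sample bytes are arbitrary and encode the 16-bit value 0x019C = 412):

byte lo = (byte) 0x9C;  // low byte; as a signed Java byte this is -100
byte hi = (byte) 0x01;  // high byte
int wrong = lo | (hi << 8);  // sign extension of lo floods the high bits with 1s
int right = (short) ((lo & 0xFF) | ((hi & 0xFF) << 8));
System.out.println(wrong);   // -100: the high byte has been wiped out
System.out.println(right);   // 412: masking kept each byte intact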
I have implemented a class for Huffman coding. The class parses an input file, builds a Huffman tree from it, and creates a map which has each of the distinct characters appearing in the file as the key and the Huffman code of the character as its value.
For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the Huffman tree and generate the Huffman code for each character, we can see that for the character 'a' the Huffman code is '101', for the character 'r' it is '0101', etc.
My intention is to compress the file, so I cannot write a string created by replacing each character with its Huffman code directly to the file, since each character would be replaced by at least 3 characters (each '1' and '0' would still be written into the file as a character, not a bit). So I thought I would write the codes to the file as bytes, since there is no way to write individual bits to a file. But then 'a' and 'r' are both written as 5 into the file (both '101' and '0101' parse to the value 5). This would cause a problem when trying to decompress the file.
This is how I am converting a series of bits to bytes:
public byte[] compressString(String s, CharCodeHashMap map) {
    String byteString = "";
    byte[] byteArr = new byte[s.length()];
    int size = 0;
    for (int i = 0; i < s.length(); i++) {
        byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
        byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
        byteString = "";
    }
    return byteArr;
}
I tried prefixing '1' to each of the codes to fix the problem. But when you build a Huffman tree by reading a file, some characters end up with codes of 8 or more bits. Then the problem is that new BigInteger(byteString, 2).toByteArray() has more than one element in the array (for example, if 'v' has the code '11010001', new BigInteger(byteString, 2).toByteArray() returns an array of elements [0, -47]).
Can someone please suggest a way to write to a file such that the file is compressed and these problems are also taken care of?
The problem is that files in modern operating systems are modeled as indexable sequences of bytes [1].
So what you need is a way to encode the fact that your file is representing a number of bits that may not be a multiple of 8. That means the bit stream size is not necessarily the file size (in bytes) multiplied by 8.
There are a variety of solutions:
Reserve N bytes at the start of the file for the file size in bits. For example, reserving 4 bytes allows you to represent file sizes up to 2^32 bits (see the sketch at the end of this answer).
Reserve 3 bits at the start of the file to hold the number of bits modulo 8. You can use this to decide how many bits in the last byte of the file to ignore.
Use some kind of encoding to represent the end of stream; e.g. represent it as a character in the text stream that you are encoding.
Is there a way to deal with this without using some bits? AFAIK, No.
[1] And at a lower level, files are represented as sequences of disk blocks consisting of multiple bytes. So, from a physical storage perspective, compressing files that are already small (e.g. smaller than a disk block) doesn't achieve anything. Similarly, saving or not saving (say) 3 bits when the representation is modeled as a byte sequence is at the border of being pointless... if that was what was concerning you.
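A minimal sketch of the first option (totalBits and payloadBytes are illustrative names, not from the question; readAllBytes needs Java 9+):

// Writer: a 4-byte header holding the payload size in bits.
try (DataOutputStream out = new DataOutputStream(new FileOutputStream("data.cmp"))) {
    out.writeInt(totalBits);   // number of meaningful bits
    out.write(payloadBytes);   // packed bits, last byte padded with filler
}

// Reader: the header says how many trailing filler bits to ignore.
try (DataInputStream in = new DataInputStream(new FileInputStream("data.cmp"))) {
    int bits = in.readInt();
    byte[] payload = in.readAllBytes();  // 8 * payload.length - bits filler bits at the end
}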
Yes, you can write bits to a file. In fact you are always writing bits to a file. The only thing is that you are writing eight bits at a time.
What you need is a bit buffer, say a 32-bit unsigned variable, into which you accumulate bits. Have another integer that tracks how many bits are in the bit buffer. Use the shift left and or (or plus) operators to put more bits in the bit buffer, and the and and shift right operators to remove them. Whenever you have eight or more bits in the bit buffer, you write those eight bits to the file as a byte. At the end, write the remaining bits (if any) to the file as the last byte.
So, to add bits bits held in value to the buffer:
bitBuffer |= value << bitCount;
bitCount += bits;
to write and remove available bytes:
while (bitCount >= 8) {
    writeByte(bitBuffer & 0xff);
    bitBuffer >>>= 8;
    bitCount -= 8;
}
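At the end of the stream, flush whatever is left, continuing the same variables:

// After the last code: write the 1-7 remaining bits, padded with zeros.
if (bitCount > 0) {
    writeByte(bitBuffer & 0xff);
    bitBuffer = 0;
    bitCount = 0;
}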
You need to make sure that when decoding, you don't mistake the filler bits in the last byte as another code. You can either send the actual number of bits in the message preceding the message (or the number of bits in the last byte), or you can add a symbol to your alphabet for end-of-stream that gets its own Huffman code, and end the message with that.
The other problem you have is that you will also need to transmit the Huffman code itself to the decoder before the coded symbols in order for the decoder to know how to decode. Look up "canonical Huffman codes" for how to approach that efficiently.
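For a taste of the canonical idea, here is a minimal, hedged sketch (method and variable names are illustrative; String.repeat needs Java 11+): only the per-symbol code lengths have to be stored in the file, because encoder and decoder can both regenerate identical codes from them.

import java.util.*;

// Sketch: assign canonical codes from code lengths alone. Symbols are
// sorted by (length, symbol); each code is the previous code plus one,
// shifted left whenever the length grows.
static Map<Character, String> canonicalCodes(Map<Character, Integer> lengths) {
    List<Character> symbols = new ArrayList<>(lengths.keySet());
    symbols.sort(Comparator.<Character>comparingInt(lengths::get)
                           .thenComparing(Comparator.naturalOrder()));
    Map<Character, String> codes = new LinkedHashMap<>();
    int code = 0;
    int prevLen = 0;
    for (char c : symbols) {
        int len = lengths.get(c);
        code <<= (len - prevLen);   // grow the code to the new length
        String bits = Integer.toBinaryString(code);
        codes.put(c, "0".repeat(len - bits.length()) + bits);
        code++;
        prevLen = len;
    }
    return codes;
}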
I have in C++:
typedef struct _msk {
    char abc[4];
    // some more variables
} _msk;

_msk mr;

if (some_condition >= 70) {
    mr.abc[0] |= 0xC0; // C0 in binary 11000000
    mr.abc[1] |= 0x20; // 20 in binary 100000
    mr.abc[2] |= 0x44; // 44 in binary 1000100
}
Here an OR operation is going on, after which the value will be stored.
So in memory is it like (0th)11000000(1st)100000(2nd)1000100, as these are in an array? How many bits can actually be stored in [4] (total 0+1+2+3+4)?
In Java:
private BitSet abc = new BitSet(40);
If an update or modification of bits is required, we can use the set or get methods provided by the BitSet class.
In Java, if we need to carry out the OR operation, do we need to pad with 0s as a suffix to get the same result, something we can avoid in C++?
Thanks
So in memory is it like (0th)11000000(1st)100000(2nd)1000100 as these are in array?
almost: (0th)11000000(1st)00100000(2nd)01000100
how many bits can be actually stored in [4] (total 0+1+2+3+4).
Not 0+...+4, but 0+...+3: 4 chars of 8 bits each = 32 bits (indices 0, 1, 2, 3).
And you can still use the bitwise operators in Java - I'm not entirely sure what you are asking regarding the Java code.
Since you are using the |=, you are actually doing mr.abc[0] = mr.abc[0] | 0xC0;
So this means the result depends on the original value of mr.abc[0], which may or may not be 0.
Also, there are 8 bits in a char (1 byte), so with 4 elements in the array there are 32 bits in total.
Java uses the exact same notations for all the bitwise operations. I am not sure where you are going with the BitSet.
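To illustrate, the C++ snippet translates almost verbatim to Java (a sketch; Java's byte is signed, but the compound |= assignment handles the narrowing cast):

byte[] abc = new byte[4];  // 4 bytes = 32 bits, like char abc[4] in C++
abc[0] |= 0xC0;            // binary 1100 0000
abc[1] |= 0x20;            // binary 0010 0000
abc[2] |= 0x44;            // binary 0100 0100
// No suffix zeros are needed: each array element is its own 8-bit cell,
// exactly as in the C++ char array.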