Java: gzip member trailer

This is part of a larger assignment that I've mostly got done except for this one part, which is a bit embarrassing because it sounds really simple on paper.
So basically, I've got a large amount of compressed data. I've been keeping a running checksum of the input using a CRC32:
CRC32 checksum = new CRC32();
...
//read input into buffer
checksum.update(buff, 0, bytesRead);
So it updates every time more data is read in. I've also kept track of the uncompressed length using
uncompressedLength += manage.read(buff);
So it is an int value that holds the number of bytes in the original file. This is a little-endian machine.
From what I can tell, what I need is a four-byte CRC, for which I used
public byte[] longToBytes(long x) {
ByteBuffer buffer = ByteBuffer.allocate(8);
buffer.putLong(x);
return buffer.array();
}
byte[] c = longToBytes(checksum.getValue());
BUT this is 8 bytes. CRC32.getValue returns a long. Can I convert it to an int in this case without losing information I need?
And then the ISIZE is supposed to be...the four-byte uncompressed length modulo 2^32. I've got the variable uncompressedLength, which is an int. I think I just have to convert it to bytes and that's all?
I've been hexdumping the result from gzip and the result from my program and my header and data are right, I'm just missing my trailer.
As for why I'm doing this manually, it's because of an assignment. Trust me, I'd love to just use GZIPOutputStream if I could.

CRC32 has 32 bits; the class returns a long because getValue() is declared that way by the Checksum interface it implements.
The uncompressed length should be a long, since files larger than 2 GB aren't uncommon nowadays.
So in both cases, you need to convert the lowest 32 bits of a long to 4 bytes.
static byte[] lower4bytes(long v)
{
    return new byte[] {
        (byte)(v    ),
        (byte)(v>> 8),
        (byte)(v>>16),
        (byte)(v>>24)
    };
}

To write an integer in little-endian form, simply write the low byte of the integer (i.e. modulo 256 or anded with 0xff), then shift it down eight bits or divide by 256, then write the resulting low byte, and repeat that two more times. You'll write four bytes. Since you only write four, you will automatically be writing the length modulo 2^32.
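Putting the two answers together, the eight-byte trailer could be built roughly as follows. This is only a sketch: the class name GzipTrailer and the method trailer are made up for illustration, and it uses ByteBuffer in little-endian order instead of manual shifts. The casts to int keep the low 32 bits, which is exactly the "modulo 2^32" behaviour described above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of any library.
class GzipTrailer {
    // Builds the 8-byte gzip member trailer: CRC32 first, then ISIZE,
    // each written as four little-endian bytes.
    static byte[] trailer(long crc, long uncompressedLength) {
        ByteBuffer buffer = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buffer.putInt((int) crc);                 // low 32 bits of the CRC value
        buffer.putInt((int) uncompressedLength);  // length modulo 2^32
        return buffer.array();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        CRC32 checksum = new CRC32();
        checksum.update(data, 0, data.length);
        // Print the trailer as hex so it can be compared against a hexdump.
        for (byte b : trailer(checksum.getValue(), data.length)) {
            System.out.printf("%02x ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Keeping the running length in a long, as suggested above, avoids overflow for inputs larger than 2 GB; the cast then discards the upper bits on purpose.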

Related

How byte array in Java is used to represent a binary data?

I have read that a byte array in Java is used to represent binary data. I am not able to understand this. How can a byte array represent binary data (which can be transferred over the network and reconstructed in its original form)?
A byte can have (integer) values from -128 to 127, so how does a byte array represent binary data?
"Byte can be (integer) values -128 to 127, so how does a byte array represent a binary data?"
Each byte (octet) is a sequence of eight bits, and having a sequence of bytes lets us represent binary data of any length (though only in 8-bit increments).
Memory of most modern computers is addressed as a sequence of bytes, network interfaces send packets containing sequences of bytes, hard drives store sequences of bytes (but are addressable only in much larger blocks, say, 4096 bytes).
There is rarely a need to access data bit-by-bit, and when needed it can be done with bitwise operators, so no data type for a sequence of bits is provided by default.
"So to conclude: 1 Byte == 8 bits, and Byte Array == stream of bits, and hence represent binary data?"
Yes. For example: A Byte Array of length 2 bytes is a stream of 16 bits of binary data.
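To make this concrete, here is a small sketch (the class and method names are invented for this example) that prints the bits held in a byte array and shows a value surviving a round trip through bytes. The signed range -128 to 127 is only how Java displays a byte; the eight bits themselves are what gets stored or transmitted.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class, names chosen for illustration only.
class ByteArrayDemo {
    // Renders each byte as its eight bits, so -1 is shown as 11111111.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // OR-ing with 0x100 forces nine digits; dropping the first leaves eight.
            sb.append(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1)).append(' ');
        }
        return sb.toString().trim();
    }

    // Encodes an int as four bytes (big-endian, ByteBuffer's default order).
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Rebuilds the int from the four bytes produced by encode.
    static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = encode(0xDEADBEEF);      // arbitrary example value
        System.out.println(toBits(raw));      // the 32 bits, one byte at a time
        System.out.println(decode(raw) == 0xDEADBEEF);
    }
}
```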

Combining elements of a byte[] array into 16-bit numbers

This is an excerpt of code from a music tuner application. A byte[] array is created, audio data is read into the buffer arrays, and then the for loop iterates through buffer and combines the values at indices n,n+1, to create an array of 16-bit numbers that is half the length.
byte[] buffer = new byte[2*1200];
targetDataLine.read(buffer, 0, buffer.length);
for ( int i = 0; i < n; i+=2 ) {
    int value = (short)((buffer[i]&0xFF) | ((buffer[i+1]&0xFF) << 8)); //**Don't understand**
    a[i >> 1] = value;
}
So far, what I have is this:
From a different SO post, I learned that every byte being stored in a larger type must be & with 0xFF, due to its conversion to a 32-bit number. I guess the leading 24 bits are filled with 1s (though I don't know why it isn't filled with zeros... wouldn't leading with 1s change the value of the number? 000000000010 (2) is different from 111111110010 (-14), after all.), so the purpose of 0xff is to only grab the last 8 bits (which is the whole byte).
When buffer[i+1] is shifted left by 8 bits, this makes it so that, when ORing, the eight bits from buffer[i+1] are in the most significant positions, and the eight bits from buffer[i] are in the least significant eight bits. We wind up with a 16-bit number that is of the form buffer[i+1] + buffer[i]. (I'm using + but I understand it's closer to concatenation.)
First, why are we ORing buffer[i] | buffer[i+1] << 8? This seems to destroy the original sound information unless we pull it back out in the same way; while I understand that OR will combine them into one value, I don't see how that value can be useful or used in calculations later. And the only way this data is accessed later is as its literal values:
diff += Math.abs(a[j]-a[i+j]);
If I have 101 and 111, added together I should get 12, or 1100. Yet 101 | 111 << 3 gives 111101, which is equal to 61. The closest I got to understanding was that 101 (5) | 111000 (56) is the same as adding 5+56=61. But the order matters -- doing the reverse 101 <<3 | 111 is completely different. I really don't understand how the data can remain useful, when it is OR'd in this way.
The other problem I'm having is that, because Java uses signed bytes, the eighth position doesn't indicate the value, but the sign. If I'm ORing two binary signed numbers, then in the resulting 16-bit number, the bit at 2⁷ is now acting as a value instead of a placeholder. If I had a negative byte before running the OR, then in my final value post-operation, it would now erroneously be acting as though the original number had a positive 2⁷ in it. 0xff doesn't get rid of this, because it preserves the eighth bit, the sign bit, so shouldn't this be a problem?
For example, 1111 (-1) and 0101, when OR'd, might give 01011111. But 1111 wasn't representing POSITIVE 1111, it was representing the signed version; yet in the final answer, it now is acting as a positive 2³.
UPDATE: I marked the accepted answer, but it took that + a little extra work to figure out where I went wrong. For anyone who may read this in the future:
As far as the signing goes, the code I have uses signed bytes. My only guess as to why this doesn't mess anything up is because all of the values received might be of positive sign. Except that this doesn't make sense, given a waveform varies in amplitude over [-1,1]. I'm going to play around with this to try and figure it out. If there are negative signs, the implementation of the code here doesn't seem to remove the 1 when ORing, so I suspect that it doesn't affect the computation too much, given that we're dealing with really large values (diff += means diff will be really large -- a few extra 1s shouldn't hurt the outcome given the code and the comparisons it relies on). So this was all wrong. I gave it some more thought and it's really simple, actually -- the only reason this was such a problem is because I didn't know about endianness, and then once I read about it, I misunderstood exactly how it is implemented. Endianness is explained in the next point.
Regarding the order in which the bits are placed, destroying the sound, etc. The code I'm using sets bigEndian=false, meaning that the byte order goes from least significant byte to most significant byte. For this reason, combining the two indices of buffer requires taking the second index, placing its bits first, and placing the first index as second (so we are now in big-endian byte order). One of the problems I had was the impression that "endian-ness" determines the bit order. I thought 10010101 big-endian would become 10101001 little-endian. Turns out this is not the case -- the bits in each byte remain in their original order; the difference is that the bytes are ordered "backward". So 10110101 11100001 big-endian becomes 11100001 10110101 little-endian -- same bit order within each byte; however, different byte order.
Finally, I'm not sure why, but the accepted answer is correct: targetDataLine.read() may place the bits into a byte array only (not just in my code, but in all Java code using targetDataLine -- read() only accepts arguments where the destination var is a byte array), but the data is in fact one short split into two bytes. It is for this reason that every two indices must be combined together.
Coming back to the signing, it should be obvious by now why this isn't an issue. This is the commenting that I now have in the code, which more coherently explains what it took all of this^ to explain before:
/* The Javadoc explains that the targetDataLine will only read to a byte-typed array.
However, because the sample size is 16-bit, it is actually storing 16-bit numbers
there (shorts), auto-parsing them every eight bits. Additionally, because it is storing
them in little-endian, bits [2^0,2^7] are stored in index[i] in normal order (powers 76543210)
while bits [2^8,2^15] are stored in index[i+1]. So, together they currently read as [7-6-5-4-3-2-1-0 15-14-13-12-11-10-9-8],
which is a problem. In the next for loop, we take care of this and re-organize the bytes by swapping every pair (remember the bits are ok, but the bytes are out of order).
Also, although the array is signed, this will not matter when we combine bytes, because the sign-bit (2^15) will be placed
back at the beginning like it normally is; although 2^7 currently exists as the most significant bit in its byte,
it is not a sign-indicating bit,
because it is really the middle of the short which was split. */
This combines the input byte stream, which arrives low byte first, into a stream of shorts in the internal byte order.
With sign extension, it is more a question of the sign encoding of the original byte stream. If the original byte stream is unsigned (coding values from 0 to 255), then the & 0xFF mask overcomes the unwanted effects of Java treating the values as signed. So an educated guess is that the external byte stream encodes unsigned bytes.
Judging whether the code is plausible needs information on what external encoding is being treated and what internal encoding is used. E.g. (wild guess, could be totally wrong!): the two byte chunks read could belong to 2 channels of a stereo sound encoding and are put into a single short for ease of internal processing. You should look at the encoding being read and the use of the converted data within the application.
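For the case described in the question's update (16-bit signed samples, little-endian), the combination step could be sketched like this. The class and method names are invented for this example. The second method does the same job with ByteBuffer, which makes the byte order explicit instead of encoding it in shifts.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; assumes 16-bit signed PCM stored low byte first.
class PcmDecode {
    // Manual version: buffer[i] is the low byte, buffer[i + 1] the high byte.
    static short[] combine(byte[] buffer) {
        short[] samples = new short[buffer.length / 2];
        for (int i = 0; i + 1 < buffer.length; i += 2) {
            int low = buffer[i] & 0xFF;        // mask removes the sign extension
            int high = buffer[i + 1] & 0xFF;
            // The cast to short restores the sign from bit 15 of the sample.
            samples[i >> 1] = (short) (low | (high << 8));
        }
        return samples;
    }

    // Same result, letting ByteBuffer handle the byte order.
    static short[] combineWithByteBuffer(byte[] buffer) {
        short[] samples = new short[buffer.length / 2];
        ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(samples);
        return samples;
    }

    public static void main(String[] args) {
        byte[] raw = {0x34, 0x12, (byte) 0xFE, (byte) 0xFF};
        for (short s : combine(raw)) {
            System.out.println(s);
        }
    }
}
```

Bit 7 of the low byte ends up as an ordinary value bit in the middle of the sample, which is why the signedness of Java's byte type does not corrupt the result once the mask is applied.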

How to write to a file in Java after Huffman Coding is done

I have implemented a class for Huffman coding. The class will parse an input file and build a huffman tree from it and creates a map which has each of the distinct characters appeared in the file as the key and the huffman code of the character as its value.
For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the huffman tree and generate the huffman code for each character, we can see that, for the character 'a', the huffman code is '101' and for the character 'r', the huffman code is '0101' etc.
My intention is to compress the file. So I cannot write a string created by replacing each character with its Huffman code directly to the file, since each character would be replaced by at least 3 characters (each '1' and '0' would still be written into the file as a character, not a bit). So I thought I would write it to the file as bytes, since there is no way you can write bits to a file. But then 'a' and 'r' are both written as '5' into the file. This would cause a problem when trying to decompress the file.
This is how I am converting a series of bits to bytes:
public byte[] compressString(String s, CharCodeHashMap map) {
String byteString = "";
byte[] byteArr = new byte[s.length()];
int size = 0;
for (int i = 0; i < s.length(); i++) {
byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
byteString = "";
}
return byteArr;
}
I tried prefixing '1' to each of the Huffman codes to fix the problem. But then, when you build a Huffman tree from a file, some characters would have more than 8 bits. Then the problem is that new BigInteger(byteString, 2).toByteArray() would have more than 1 element in the array. (E.g., if 'v' has the Huffman code '11010001', new BigInteger(byteString, 2).toByteArray() returns the array [0, -47].)
Can someone please suggest a way to write to a file such that the file is compressed and these problems are also taken care of?
The problem is that files in modern operating systems are modeled as indexable sequences of bytes [1].
So what you need is a way to encode the fact that your file is representing a number of bits that may not be a multiple of 8. That means the bit stream size is not necessarily the file size (in bytes) multiplied by 8.
There are a variety of solutions:
Reserve N bytes at the start of the file for the file size in bits. For example, reserving 4 bytes allows you to represent file sizes up to 2^32 bits.
Reserve 3 bits at the start of the file to hold the number of bits modulo 8. You can use this to decide how many bits in the last byte of the file to ignore.
Use some kind of encoding to represent the end of stream; e.g. represent it as a character in the text stream that you are encoding.
Is there a way to deal with this without using some bits? AFAIK, No.
[1] And at a lower level, files are represented as sequences of disk blocks consisting of multiple bytes. So, from a physical storage perspective, compressing files that are already small (e.g. smaller than a disk block) doesn't achieve anything. Similarly saving or not saving (say) 3 bits when the representation is modeled as a byte sequence is at the border of being pointless ... if that was what was concerning you.
Yes, you can write bits to a file. In fact you are always writing bits to a file. The only thing is that you are writing eight bits at a time.
What you need is a bit buffer, say a 32-bit unsigned variable, into which you accumulate bits. Have another integer that tracks how many bits are in the bit buffer. Use the shift-left and OR (or plus) operators to put more bits into the bit buffer, and the AND and shift-right operators to remove them. Whenever you have eight or more bits in the bit buffer, write the low eight bits to the file as a byte. At the end, write the remaining bits (if any) to the file as the last byte.
So, to append a code stored in value, whose length in bits is bits, to the buffer:
bitBuffer |= value << bitCount;
bitCount += bits;
To write and remove the available bytes:
while (bitCount >= 8) {
writeByte(bitBuffer & 0xff);
bitBuffer >>>= 8;
bitCount -= 8;
}
You need to make sure that when decoding, you don't mistake the filler bits in the last byte as another code. You can either send the actual number of bits in the message preceding the message (or the number of bits in the last byte), or you can add a symbol to your alphabet for end-of-stream that gets its own Huffman code, and end the message with that.
The other problem you have is that you will also need to transmit the Huffman code itself to the decoder before the coded symbols in order for the decoder to know how to decode. Look up "canonical Huffman codes" for how to approach that efficiently.
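The bit-buffer approach from this answer could be wrapped up along the following lines. This is a sketch under assumptions: the class BitWriter and its methods are invented names, output goes to an in-memory stream for simplicity, and bits are packed starting from the least significant position of each byte, so a decoder has to read them back in the same order. It uses a long for the buffer because Java has no unsigned 32-bit type.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical bit writer, not taken from any library.
class BitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private long bitBuffer = 0;   // pending bits, oldest in the lowest positions
    private int bitCount = 0;     // number of valid bits in bitBuffer (0..7 between calls)
    private long totalBits = 0;   // bits written so far, needed by a decoder

    // Appends the low `bits` bits of `value` (0 <= bits <= 32).
    void write(int value, int bits) {
        long masked = ((long) value) & ((1L << bits) - 1);
        bitBuffer |= masked << bitCount;
        bitCount += bits;
        totalBits += bits;
        while (bitCount >= 8) {           // emit every complete byte
            out.write((int) (bitBuffer & 0xFF));
            bitBuffer >>>= 8;
            bitCount -= 8;
        }
    }

    // Appends a code given as a string of '0' and '1', first character first.
    void writeCode(String code) {
        for (int i = 0; i < code.length(); i++) {
            write(code.charAt(i) == '1' ? 1 : 0, 1);
        }
    }

    // Flushes a partial last byte (padded with zero bits) and returns all bytes.
    byte[] finish() {
        if (bitCount > 0) {
            out.write((int) (bitBuffer & 0xFF));
            bitBuffer = 0;
            bitCount = 0;
        }
        return out.toByteArray();
    }

    long totalBits() {
        return totalBits;
    }
}
```

Storing totalBits() ahead of the payload is one way to let the decoder ignore the filler bits in the last byte; an end-of-stream symbol, as described above, is the alternative.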

How to send value bigger than 127 in byte Java

I am working with a smart card, using this constructor from javax.smartcardio.CommandAPDU:
CommandAPDU(int cla, int ins, int p1, int p2, byte[] data, int ne)
I need to send data as byte[] (5th argument). Now my problem is that, as Java primitive data types are signed the max value of a byte can not exceed 127. I need to send a value bigger than 127. To be precise, the hex value 94 which is equal to 148.
Some solutions suggest converting the byte to an int:
byte b = -108;
int i = b & 0xff;
I can't do that, as the CommandAPDU() constructor doesn't take an int[]. So how do I do it?
Depending on how it is interpreted by the smart card, you could just send the correct negative value. If the smart card interprets value as unsigned, you could for example send -1 for 255.
You're calculating the APDU with unsigned bytes, while Java uses signed bytes.
It's just a matter of how the data is interpreted, sending -108 to the smart card will be interpreted in exactly the same way as sending 148 from a platform using unsigned bytes. The bit combination is exactly the same.
Java can even do the conversion itself so that you can write the code using unsigned numbers:
byte data = (byte)0x94; // stores -108 in "data", which will be interpreted
// as 148 on an unsigned platform
For long blocks of data, it is probably best to use a hexadecimal encoder/decoder. But be sure that you handle the data as bytes internally (directly decode and don't look back to the hex String). The Apache codec library contains a good encoder/decoder, or you can use Bouncy Castle or Guava or use one of the many examples on SO.
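For a handful of values, a small helper can do the casts in one place. The sketch below is illustrative only: the class and method names are made up, and the resulting array is simply what one would pass as the data argument of the CommandAPDU constructor shown in the question.

```java
// Hypothetical helper for building byte[] data from unsigned values.
class UnsignedBytes {
    // Converts unsigned values (0..255) into Java's signed byte representation.
    static byte[] toBytes(int... unsignedValues) {
        byte[] result = new byte[unsignedValues.length];
        for (int i = 0; i < unsignedValues.length; i++) {
            if (unsignedValues[i] < 0 || unsignedValues[i] > 0xFF) {
                throw new IllegalArgumentException("not a byte value: " + unsignedValues[i]);
            }
            result[i] = (byte) unsignedValues[i];   // keeps the low 8 bits unchanged
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = toBytes(0x94, 0xFF, 0x10);
        for (byte b : data) {
            // Left: how Java displays the byte. Right: the unsigned reading of the same bits.
            System.out.println(b + " -> " + (b & 0xFF));
        }
    }
}
```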

Storing 16Bit Audio on a 8bit byte array in android

I'm confused. I needed to record sound from MIC in Android so I used the following code:
recorder = new AudioRecord(AudioSource.MIC, 44100,
AudioFormat.CHANNEL_IN_MONO,
AudioFormat.ENCODING_PCM_16BIT, N);
buffer = new byte[N];
//...
recorder.read(buffer, 0, N);
As we know, a byte can store values between -128 and +127, while 16-bit sound needs a larger type (e.g. short or int), but surprisingly Java and Android have a read method which saves recorded data to a byte array.
How that can be possible? What am I missing?
You are thinking of a byte as a short integer. It is just 8 bits. You need to store 1000111011100000 (16 bits)? First byte is 10001110, second byte is 11100000. That you can interpret these bits as numbers is not relevant here. In a more general way, byte[] is usually how you deal with binary "raw data" (be it audio streams or encrypted content or anything else that you treat like a stream of bits).
If you have n "words" of 16 bits, then you will need 2n bytes to store them. Byte 0 will be the lower (or higher) part of word 0, byte 1 will be the rest of word 0, byte 2 will be the lower (or higher) part of word 1, and so on.
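The split into 2n bytes can be seen directly with ByteBuffer. This is a sketch with invented names; which byte comes first depends on the byte order, which is passed in explicitly here because the answer leaves it open ("lower (or higher)").

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper showing how n 16-bit samples occupy 2n bytes.
class SampleBytes {
    static byte[] toBytes(short[] samples, ByteOrder order) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2).order(order);
        for (short s : samples) {
            buffer.putShort(s);   // two bytes per sample, in the chosen order
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        short[] samples = {(short) 0b1000111011100000};   // the 16-bit value from the answer
        for (byte b : toBytes(samples, ByteOrder.BIG_ENDIAN)) {
            System.out.println(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1));
        }
    }
}
```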
