Open file in its bit representation and manipulate bits in Java


I know that with a hexadecimal editor one can edit binary files, changing 4 bits with each hexadecimal digit. But I am thinking of a project that requires modifying a single bit rather than 4 bits.
So is there a way to read something (e.g. an ASCII-encoded plain text file) as bits and manipulate single bits in, e.g., Java?
As a noob, I can think of loading each byte and generating a string containing the 8-bit representation of each byte, but that is quite a complex way and will waste a lot of space. Also, this approach would require me to keep a list containing each possible byte's 8-bit representation to look it up.

Others have already hinted that your question is broad and can probably be better answered by other sites, resources or search.
However I'd like to show you some small snippet with which you can start.
// Read a file into a byte array; we are not interested
// in interpreting encodings, just plain bytes
final byte[] fileContent = Files.readAllBytes(Paths.get("pathToMyFile"));
// Iterate over the content and display one byte per line
for (final byte data : fileContent) {
    // Convert to a regular 8-bit representation
    System.out.println(Integer.toBinaryString(data & 255 | 256).substring(1));
}
You can also easily manipulate the bytes and also the bits by simply accessing the array contents and using simple bit operators like &, |, ^.
Here is a snippet showing you how to set and unset a bit at a given position:
byte data = ...;
// Set it (to 1); the cast is needed because | promotes its operands to int
data = (byte) (data | (1 << pos));
// Unset it (to 0)
data = (byte) (data & ~(1 << pos));
The conversion gets explained here: how to get the binary values of the bytes stored in byte array
The bit manipulation here: Set specific bit in byte
Here are some relevant Java-Docs: Files#readAllBytes and Integer#toBinaryString
Note that, from an efficiency point of view, the smallest type you can use in Java is byte; there is no bit data type. However, in practice you will probably not see any difference: the CPU loads the neighboring bits and bytes into the cache anyway, regardless of whether you want to use them or not. Thus you can just use the byte data type to manipulate single bits.
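To make the idea of addressing a single bit of a file concrete, here is a minimal sketch. The class name BitEditSketch and its methods are hypothetical, and it assumes one particular numbering convention: bit 0 is the most significant bit of the first byte. Other conventions work just as well as long as they are applied consistently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: treats a byte array as one long sequence of bits.
class BitEditSketch {

    // Returns true if the addressed bit is 1.
    // Assumed convention: bit 0 is the most significant bit of content[0].
    static boolean getBit(byte[] content, long bitIndex) {
        int byteIndex = (int) (bitIndex / 8);
        int shift = 7 - (int) (bitIndex % 8);
        return ((content[byteIndex] >> shift) & 1) == 1;
    }

    // Inverts the addressed bit in place using XOR.
    static void flipBit(byte[] content, long bitIndex) {
        int byteIndex = (int) (bitIndex / 8);
        int shift = 7 - (int) (bitIndex % 8);
        content[byteIndex] ^= (byte) (1 << shift);
    }

    // Reads the whole file, flips one bit and writes the file back.
    static void flipBitInFile(Path file, long bitIndex) throws IOException {
        byte[] content = Files.readAllBytes(file);
        flipBit(content, bitIndex);
        Files.write(file, content);
    }
}
```

For example, flipping bit 6 of a file that contains only the letter A (01000001) turns it into C (01000011). Reading the whole file into memory is only reasonable for small files; for large ones a RandomAccessFile that touches just the affected byte would probably be a better fit.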

Related

Combining elements of a byte[] array into 16-bit numbers

This is an excerpt of code from a music tuner application. A byte[] array is created, audio data is read into the buffer arrays, and then the for loop iterates through buffer and combines the values at indices n,n+1, to create an array of 16-bit numbers that is half the length.
byte[] buffer = new byte[2*1200];
targetDataLine.read(buffer, 0, buffer.length);
for ( int i = 0; i < n; i+=2 ) {
    int value = (short)((buffer[i]&0xFF) | ((buffer[i+1]&0xFF) << 8)); //**Don't understand**
    a[i >> 1] = value;
}
So far, what I have is this:
From a different SO post, I learned that every byte being stored in a larger type must be & with 0xFF, due to its conversion to a 32-bit number. I guess the leading 24 bits are filled with 1s (though I don't know why it isn't filled with zeros... wouldn't leading with 1s change the value of the number? 000000000010 (2) is different from 111111110010 (-14), after all.), so the purpose of 0xff is to only grab the last 8 bits (which is the whole byte).
When buffer[i+1] is shifted left by 8 bits, this makes it so that, when ORing, the eight bits from buffer[i+1] are in the most significant positions, and the eight bits from buffer[i] are in the least significant eight bits. We wind up with a 16-bit number that is of the form buffer[i+1] + buffer[i]. (I'm using + but I understand it's closer to concatenation.)
First, why are we ORing buffer[i] | buffer[i+1] << 8? This seems to destroy the original sound information unless we pull it back out in the same way; while I understand that OR will combine them into one value, I don't see how that value can be useful or used in calculations later. And the only way this data is accessed later is as its literal values:
diff += Math.abs(a[j]-a[i+j]);
If I have 101 and 111, added together I should get 12, or 1100. Yet 101 | 111 << 3 gives 111101, which is equal to 61. The closest I got to understanding was that 101 (5) | 111000 (56) is the same as adding 5+56=61. But the order matters -- doing the reverse 101 <<3 | 111 is completely different. I really don't understand how the data can remain useful, when it is OR'd in this way.
The other problem I'm having is that, because Java uses signed bytes, the eighth position doesn't indicate the value, but the sign. If I'm ORing two binary signed numbers, then in the resulting 16-bit number, the bit at 2⁷ is now acting as a value instead of a placeholder. If I had a negative byte before running the OR, then in my final value post-operation, it would now erroneously be acting as though the original number had a positive 2⁷ in it. 0xff doesn't get rid of this, because it preserves the eighth bit, the sign bit, so shouldn't this be a problem?
For example, 1111 (-1) and 0101, when OR'd, might give 01011111. But 1111 wasn't representing POSITIVE 1111, it was representing the signed version; yet in the final answer, it now is acting as a positive 2³.
UPDATE: I marked the accepted answer, but it took that + a little extra work to figure out where I went wrong. For anyone who may read this in the future:
As far as the signing goes, the code I have uses signed bytes. My only guess as to why this doesn't mess anything up was that all of the values received might be of positive sign. Except that this doesn't make sense, given that a waveform varies in amplitude over [-1,1]. I was going to play around with this to try and figure it out. If there are negative signs, the code here doesn't seem to remove the 1 when ORing, so I suspected that it doesn't affect the computation too much (given that we're dealing with really large values: diff += means diff will be really large, so a few extra 1s shouldn't hurt the outcome given the code and the comparisons it relies on).
So this was all wrong. I gave it some more thought and it's really simple, actually -- the only reason this was such a problem is that I didn't know about big-endian, and then once I read about it, I misunderstood exactly how it is implemented. Endianness is explained in the next point.
Regarding the order in which the bits are placed, destroying the sound, etc. The code I'm using sets bigEndian=false, meaning that the byte order goes from least significant byte to most significant byte. For this reason, combining the two indices of buffer requires taking the second index, placing its bits first, and placing the first index as second (so we are now in big-endian byte order). One of the problems I had was the impression that "endian-ness" determines the bit order. I thought 10010101 big-endian would become 10101001 little-endian. Turns out this is not the case -- the bits in each byte remain in their original order; the difference is that the bytes are ordered "backward". So 10110101 11100001 big-endian becomes 11100001 10110101 little-endian -- same bit order within each byte; however, different byte order.
Finally, I'm not sure why, but the accepted answer is correct: targetDataLine.read() may place the bits into a byte array only (not just in my code, but in all Java code using targetDataLine -- read() only accepts arguments where the destination var is a byte array), but the data is in fact one short split into two bytes. It is for this reason that every two indices must be combined together.
Coming back to the signing, it should be obvious by now why this isn't an issue. This is the comment that I now have in the code, which more coherently explains what it took all of this^ to explain before:
/* The Javadoc explains that the targetDataLine will only read to a byte-typed array.
However, because the sample size is 16-bit, it is actually storing 16-bit numbers
there (shorts), auto-parsing them every eight bits. Additionally, because it is storing
them in little-endian, bits [2^0,2^7] are stored in index[i] in normal order (powers 76543210)
while bits [2^8,2^15] are stored in index[i+1]. So, together they currently read as [7-6-5-4-3-2-1-0 15-14-13-12-11-10-9-8],
which is a problem. In the next for loop, we take care of this and re-organize the bytes by swapping every pair (remember the bits are ok, but the bytes are out of order).
Also, although the array is signed, this will not matter when we combine bytes, because the sign-bit (2^15) will be placed
back at the beginning like it normally is; although 2^7 currently exists as the most significant bit in its byte,
it is not a sign-indicating bit,
because it is really the middle of the short which was split. */

This is combining the byte stream from the input, in low-byte-first order, into a stream of shorts in internal byte order.
With sign extension, it is more a question of the sign encoding of the original byte stream. If the original byte stream is unsigned (coding values from 0 to 255), then the & 0xFF overcomes the then-unwanted effects of Java treating byte values as signed. So my educated guess is that the external byte stream encodes unsigned bytes.
Judging whether the code is plausible needs information on what external encoding is being read and what internal encoding is used. E.g. (wild guess, could be totally wrong!): the two byte chunks read could belong to 2 channels of a stereo sound encoding and are put into a single short for ease of internal processing. You should look at the encoding being read and the use of the converted data within the application.
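Assuming, as the question's update concludes, that the buffer holds signed 16-bit samples in little-endian order, the combination can be checked with a small sketch. The class and method names are hypothetical; the first method mirrors the loop from the question, the second lets ByteBuffer do the same work so the two results can be compared.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, assuming signed 16-bit little-endian samples.
class SampleDecodeSketch {

    // Manual combination, mirroring the loop in the question.
    static short[] decodeManually(byte[] buffer) {
        short[] samples = new short[buffer.length / 2];
        for (int i = 0; i + 1 < buffer.length; i += 2) {
            // & 0xFF undoes the sign extension that happens when a byte becomes an int
            int low = buffer[i] & 0xFF;
            int high = buffer[i + 1] & 0xFF;
            // The cast to short restores the sign of the 16-bit sample
            samples[i / 2] = (short) (low | (high << 8));
        }
        return samples;
    }

    // The same decoding, delegated to ByteBuffer with an explicit byte order.
    static short[] decodeWithByteBuffer(byte[] buffer) {
        short[] samples = new short[buffer.length / 2];
        ByteBuffer.wrap(buffer)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asShortBuffer()
                .get(samples);
        return samples;
    }
}
```

With the input bytes 0x34 0x12 0xFE 0xFF both methods yield the samples 4660 (0x1234) and -2 (0xFFFE), which shows that negative samples survive the OR as long as each byte is masked first.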

How to write to a file in Java after Huffman Coding is done

I have implemented a class for Huffman coding. The class parses an input file, builds a Huffman tree from it and creates a map which has each of the distinct characters that appear in the file as the key and the Huffman code of the character as its value.
For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the Huffman tree and generate the Huffman code for each character, we can see that for the character 'a' the Huffman code is '101', for the character 'r' the Huffman code is '0101', etc.
My intention is to compress the file. So I cannot write a string, created by replacing each character with its Huffman code, directly to the file, since each character would be replaced by at least 3 characters (each '1' and '0' would still be written into the file as a character, not a bit). So I thought I would write it to the file as bytes, since there is no way you can write bits to a file. But then 'a' and 'r' are both written as '5' into the file. This would cause a problem when trying to decompress the file.
This is how I am converting a series of bits to bytes:
public byte[] compressString(String s, CharCodeHashMap map) {
    String byteString = "";
    byte[] byteArr = new byte[s.length()];
    int size = 0;
    for (int i = 0; i < s.length(); i++) {
        byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
        byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
        byteString = "";
    }
    return byteArr;
}
I tried prefixing '1' to each of the Huffman codes to fix the problem. But then, when you build a Huffman tree from a file, some characters would have codes of more than 8 bits. Then the problem is that new BigInteger(byteString, 2).toByteArray() would have more than 1 element in the array. (For example, if 'v' has the code '11010001', new BigInteger(byteString, 2).toByteArray() returns the array [0, -47].)
Can someone please suggest a way to write to a file such that the file would be compressed and, at the same time, these problems are also taken care of?

The problem is that files in modern operating systems are modeled as indexable sequences of bytes [1].
So what you need is a way to encode the fact that your file is representing a number of bits that may not be a multiple of 8. That means the bit stream size is not necessarily the file size (in bytes) multiplied by 8.
There are a variety of solutions:
Reserve N bytes at the start of the file for the file size in bits. For example, reserving 4 bytes allows you to represent file sizes up to 2^32 bits.
Reserve 3 bits at the start of the file to hold the number of bits modulo 8. You can use this to decide how many bits in the last byte of the file to ignore.
Use some kind of encoding to represent the end of stream; e.g. represent it as a character in the text stream that you are encoding.
Is there a way to deal with this without using some bits? AFAIK, No.
[1] - And at a lower level, files are represented as sequences of disk blocks consisting of multiple bytes. So, from a physical storage perspective, compressing files that are already small (e.g. smaller than a disk block) doesn't achieve anything. Similarly saving or not saving (say) 3 bits when the representation is modeled as a byte sequence is at the border of being pointless ... if that was what was concerning you.
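The first option, a length header, could look roughly like the following sketch. The class BitLengthHeaderSketch and its methods are hypothetical, and the sketch assumes a 4-byte big-endian header holding the number of valid bits. Because a Java int is signed, this particular version only covers up to 2^31 - 1 bits rather than the full 2^32.

```java
import java.nio.ByteBuffer;

// Hypothetical framing: a 4-byte bit count followed by the packed bits.
class BitLengthHeaderSketch {

    // Prepends the number of valid bits to the packed payload.
    static byte[] frame(byte[] packedBits, int bitCount) {
        int byteCount = (bitCount + 7) / 8; // round up to whole bytes
        return ByteBuffer.allocate(4 + byteCount)
                .putInt(bitCount)
                .put(packedBits, 0, byteCount)
                .array();
    }

    // Reads the bit count back from the header.
    static int bitCount(byte[] framed) {
        return ByteBuffer.wrap(framed).getInt();
    }

    // Number of filler bits in the last payload byte that a decoder must ignore.
    static int paddingBits(byte[] framed) {
        return (8 - bitCount(framed) % 8) % 8;
    }
}
```

A payload of 11 bits then occupies 2 payload bytes plus the 4 header bytes, and the decoder knows to ignore the last 5 bits of the final byte.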

Yes, you can write bits to a file. In fact you are always writing bits to a file. The only thing is that you are writing eight bits at a time.
What you need is a bit buffer, say a 32-bit unsigned variable, into which you accumulate bits. Have another integer that tracks how many bits are in the bit buffer. Use the shift left and or (or plus) operators to put more bits in the bit buffer, and the and and shift right operators to remove them. Whenever you have eight or more bits in the bit buffer, you write those eight bits to the file as a byte. At the end, write the remaining bits (if any) to the file as the last byte.
So, to add the bits bits in value to the buffer:
bitBuffer |= value << bitCount;
bitCount += bits;
to write and remove available bytes:
while (bitCount >= 8) {
writeByte(bitBuffer & 0xff);
bitBuffer >>>= 8;
bitCount -= 8;
}
You need to make sure that when decoding, you don't mistake the filler bits in the last byte as another code. You can either send the actual number of bits in the message preceding the message (or the number of bits in the last byte), or you can add a symbol to your alphabet for end-of-stream that gets its own Huffman code, and end the message with that.
The other problem you have is that you will also need to transmit the Huffman code itself to the decoder before the coded symbols in order for the decoder to know how to decode. Look up "canonical Huffman codes" for how to approach that efficiently.
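Put together, the bit buffer described above might be sketched in Java as follows. The class BitWriterSketch is hypothetical. Java has no unsigned 32-bit type, so the sketch assumes a long as the buffer, which leaves headroom when up to 32 bits are added on top of 7 pending ones. Bits are packed least significant bit first, as in the snippets above; how the bits of a Huffman code are arranged inside value is a convention that the encoder and decoder have to share.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical bit writer that accumulates bits and emits whole bytes.
class BitWriterSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private long bitBuffer = 0;  // pending bits, lowest bit is the oldest
    private int bitCount = 0;    // number of pending bits in bitBuffer
    private long totalBits = 0;  // total number of bits written so far

    // Appends the low `bits` bits of `value` (0 to 32 bits).
    void write(int value, int bits) {
        long masked = value & ((1L << bits) - 1);
        bitBuffer |= masked << bitCount;
        bitCount += bits;
        totalBits += bits;
        // Emit every complete byte that has accumulated
        while (bitCount >= 8) {
            out.write((int) (bitBuffer & 0xFF));
            bitBuffer >>>= 8;
            bitCount -= 8;
        }
    }

    // Flushes the remaining bits (zero-padded) and returns all bytes.
    byte[] finish() {
        if (bitCount > 0) {
            out.write((int) (bitBuffer & 0xFF));
            bitBuffer = 0;
            bitCount = 0;
        }
        return out.toByteArray();
    }

    // The decoder needs this count to ignore the filler bits.
    long totalBits() {
        return totalBits;
    }
}
```

Writing the values 0b101 (3 bits), 0b0101 (4 bits) and 0b11 (2 bits) produces the two bytes 0xAD and 0x01 with a total of 9 valid bits; the remaining 7 bits of the second byte are filler.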

Java - Why does OutputStream.write(int) take an int to write a byte? [duplicate]

Maybe someone can help me understand because I feel I'm missing something that will likely have an effect on how my program runs.
I'm using a ByteArrayOutputStream. Unless I've missed something huge, the point of this class is to create a byte[] array for some other use.
However, the "plain" write function on BAOS takes an int not a byte (ByteArrayOutputStream.write).
According to this(Primitive Data Types) page, in Java, an int is a 32-bit data type and a byte is an 8-bit data type.
If I write this code...
int i = 32;
byte b = i;
I get a warning about possible lossy conversions requiring a change to this...
int i = 32;
byte b = (byte)i;
I'm really confused about write(int)...

ByteArrayOutputStream is just overriding the abstract method declared in OutputStream. So the real question is why OutputStream.write(int) is declared that way, when its stated goal is to write a single byte to the stream. The implementation of the stream is irrelevant here.
Your intuition is correct - it's a broken bit of design, in my view. And yes, it will lose data, as is explicitly called out in the docs:
The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
It would have been much more sensible (in my view) for this to be write(byte). The only downside is that you couldn't then call it with literal values without casting:
// Write a single byte 0. Works with current code, wouldn't work if the parameter
// were byte.
stream.write(0);
That looks okay, but isn't - because the type of the literal 0 is int, which isn't implicitly convertible to byte. You'd have to use:
// Ugly, but would have been okay with write(byte).
stream.write((byte) 0);
For me that's not a good enough reason to design the API the way it is, but that's what we've got - and have had since Java 1.0. It can't be fixed now without it being a breaking change all over the place, unfortunately.
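The truncation described in the docs can be observed with a tiny sketch; the class WriteIntSketch is hypothetical and exists only for illustration.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo: write(int) keeps only the eight low-order bits.
class WriteIntSketch {

    // Writes each int with write(int) and returns the resulting bytes.
    static byte[] writeValues(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int value : values) {
            out.write(value); // the 24 high-order bits are ignored
        }
        return out.toByteArray();
    }
}
```

Calling writeValues(0x41, 0x1FF, 200) returns the bytes 65, -1 and -56: 0x1FF is cut down to 0xFF, and 200 (0xC8) reads as -56 once it is stored in Java's signed byte.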

In order to facilitate unsigned bytes above 0x7F this takes place. The int will be silently narrowed to be written. In fact, the code does that with a (byte) cast.
As Ingo states:
A possible reason could be that the byte to write will most often be the result of some operation that automatically converts its operands to int[, like] some bit operations. Hence, the code would be littered with casts to byte, that add nothing to understanding.

That's mostly because the Java Virtual Machine's stack hates byte, but loves int. The stack uses 32-bit slots, which matches the size of an int.
You will notice however that Java *does* like byte[] references. But that's because the content of arrays is stored in the heap (not the stack). And whenever a specific byte is moved to the stack (the baload opcode for array elements, or bipush for constants), it is immediately converted to an int.
But sometimes Java actually uses 257(!) values. InputStream#read() can return 256 different byte values (0 to 255), but when it has no more content it returns -1. Alternatively, it would have been possible to throw an EOFException (like some other methods do), but exceptions are slow in Java.
Even though you don't need the -1 value for OutputStream#write, it's coherent and it reduces casting. But yes, it's misleading too.
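The 257 possible results of read() can be seen in a short sketch; the class ReadIntSketch is hypothetical and only serves as an illustration.

```java
import java.io.ByteArrayInputStream;

// Hypothetical demo: read() returns 0..255 for data and -1 at end of stream.
class ReadIntSketch {

    // Calls read() once per byte plus one extra time to hit end of stream.
    static int[] readAll(byte[] source) {
        ByteArrayInputStream in = new ByteArrayInputStream(source);
        int[] results = new int[source.length + 1];
        for (int i = 0; i < results.length; i++) {
            results[i] = in.read();
        }
        return results;
    }
}
```

For the input bytes 0xFF and 0x00 this returns 255, 0 and -1. The byte 0xFF comes back as 255 rather than -1, which is exactly why the return type has to be wider than byte.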

How can a java.lang.float be encoded as TH3IFMw?

I need to parse some data that has encoded primitive types (ints, floats, doubles) output by Java. I'm adding this functionality to an existing set of Python scripts, so rewriting it in Java isn't really an option. I'd like to re-implement it and/or use a Python library to decode the data (e.g. TH3IFMw for a float).
I don't recognize this encoding. I'm working with the requests sent to Google Web Toolkit, and based on the source here and here - I thought it was string.ValueOf - but this is incorrect. Does anyone recognize it?

I think this is encoding a long int, not a float. In particular, it's probably 0x0000004c7dc814cc, but might be 0x00000131f7205330.
My reasoning...
Looking through the code you linked to, it doesn't look like anything remotely out of the ordinary is being done to floats, and the standard valueOf implementation definitely does nothing like this.
On the other hand, the string TH3IFMw looks for all the world like a base64 encoded string. I can't think of many other common encodings that use upper alpha, lower alpha, and digits. Looking through the same code, I can only find one reference to base64... line 575 of StreamWriter, where it handles the encoding long instances. This is the only part of the linked code which seems even remotely capable of generating the output you observed.
Looking at the size of the string... assuming it is base64, it's missing a trailing = padding/alignment character, but some implementations of base64 do omit these for brevity. Adding that back (TH3IFMw=), and decoding as base64, this results in the hex value 0x4c7dc814cc. This is only 5 bytes in size, which is a little odd. But this does mean it's probably not a float (4 bytes) or double (8 bytes).
But this could fit with line 575's encoding of a long... looking at the documentation for Base64Utils.toBase64, it makes reference to the fact that "Leading groups of all zero bits are omitted." This would explain the 5 byte value, if the original long was 0x0000004c7dc814cc.
However, the documentation's wording is frustratingly ambiguous (and I don't have java+gwt available to me right now to test). "leading groups of all zero bits" could mean they are omitting source bytes which are all zeros, but it could also mean they're omitting leading A characters from the encoded base64 characters (A represents 6 0 bits in base64). If that's the case, then the actual base64 string is ATH3IFMw, which decodes to the long value 0x00000131f7205330.
If you can find either of those numbers in what you're providing as input, then that's probably what's happening. If not... I'm afraid I'm stumped.
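Both candidate values can be reproduced with java.util.Base64. This is only a sketch under the same assumption as the reasoning above, namely that the string uses the standard Base64 alphabet; the class Base64LongSketch and its method are hypothetical.

```java
import java.math.BigInteger;
import java.util.Base64;

// Hypothetical helper, assuming the standard Base64 alphabet.
class Base64LongSketch {

    // Restores any missing '=' padding, decodes, and returns the value as hex.
    static String decodeToHex(String encoded) {
        StringBuilder padded = new StringBuilder(encoded);
        while (padded.length() % 4 != 0) {
            padded.append('=');
        }
        byte[] raw = Base64.getDecoder().decode(padded.toString());
        // Signum 1 makes BigInteger treat the bytes as an unsigned magnitude
        return new BigInteger(1, raw).toString(16);
    }
}
```

decodeToHex("TH3IFMw") returns 4c7dc814cc, and decodeToHex("ATH3IFMw") returns 131f7205330, matching the two interpretations discussed above.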

Similar functionality for java to struct for python

I have a program that I made in Python to find specific tags in TIFF IFDs and return the values. It was just a proof of concept in Python, and now I need to move the functionality to Java. I think I can just use the String(byte[]) constructor for the ASCII data types, but I still need to get unsigned short (2 byte) and unsigned long (4 byte) values. I don't need to write them back to the file or modify them; all I need to do is get a Java Integer or Long object from them. This is easy in Python with the struct and mmap modules. Does anyone know of a similar way in Java? I looked at the DataInput class, but the readUnsignedLong method reads 8 bytes.

DataInputStream allows you to read shorts and ints. You should mask them with the appropriate bit mask (0xFFFF for 16 bit, 0xFFFFFFFFL for 32 bit) in order to account for the difference between signed and unsigned types.
e.g.
// omits error handling
FileInputStream fis = ...;
DataInputStream stream = new DataInputStream(fis);
int short_value = 0xFFFF & stream.readShort();
// The L suffix matters: without it the mask is the int -1 and the sign is kept
long long_value = 0xFFFFFFFFL & stream.readInt();
If you're sure that the data won't be towards the high end of the 2 byte field, or 4 byte field, you can forego the bit masking. Otherwise, you need to use a wider data type to account for the fact that unsigned values hold a larger range of values than their signed counterparts.

I looked at the DataInput class, but the readUnsignedLong method reads 8 bytes.
Java does not have unsigned integer types (apart from the 16-bit char). It takes 4 bytes to make an int, and 8 bytes to make a long, unsigned or otherwise.
If you don't want to use DataInput, you can read the bytes into byte arrays (byte[]) and use a ByteBuffer to turn those byte values into ints and longs with left padding. See ByteBuffer#getInt() and ByteBuffer#getLong().

DataInput would be the preferred method. You can use readUnsignedShort for the two byte values. For the 4 byte values you'll have to use this workaround...
long l = dis.readInt() & 0xffffffffL;
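One thing to keep in mind for TIFF is that DataInput always reads big-endian, while a TIFF file declares its own byte order in the header ("II" for little-endian, "MM" for big-endian). A ByteBuffer with an explicit byte order handles both cases. The following is a minimal sketch; the class UnsignedReadSketch and its methods are hypothetical, and it assumes the relevant part of the file has already been read into a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for reading unsigned 2-byte and 4-byte fields.
class UnsignedReadSketch {

    // Reads 2 bytes at offset and widens them to a non-negative int.
    static int readUnsignedShort(byte[] data, int offset, ByteOrder order) {
        short raw = ByteBuffer.wrap(data).order(order).getShort(offset);
        return Short.toUnsignedInt(raw);
    }

    // Reads 4 bytes at offset and widens them to a non-negative long.
    static long readUnsignedInt(byte[] data, int offset, ByteOrder order) {
        int raw = ByteBuffer.wrap(data).order(order).getInt(offset);
        return Integer.toUnsignedLong(raw);
    }
}
```

Short.toUnsignedInt and Integer.toUnsignedLong (available since Java 8) do the same job as the masks 0xFFFF and 0xFFFFFFFFL shown above.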

You could use Javolution's Struct class which provides structure to regions of data. You set up a wrapper and then use the wrapper to access the data. Simples. Java really needs this super-useful class in its default classpath TBQH.

The Preon library is good for creating structs in Java. I have tried Javolution's Struct, but it was not helpful in my case. Preon is open source and a very good library.
