XOR Encryption in Java: losing data after decryption - java

I'm currently writing a very small Java program to implement a one-time pad, where the pad (or key) itself is generated as a series of bytes using a SecureRandom object, seeded with the SHA-512 hash of a simple string.
Generating the one-time pad hasn't caused any problems, and if I supply the same seed string each time I get the same sequence of pseudo-random numbers, as expected, making decryption possible as long as the person decrypting has the seed string used to encrypt.
When I try to encrypt a file, the program reads the data in 64 chars at a time (except at the end of the file, where there are generally fewer), and generates 64 bytes (or the matching amount) of pseudo-random bytes. XOR is performed between the elements of both arrays, the resulting char array containing the cipher characters is written to file, and the process repeats until all text in the file has been read.
Now, because Java treats all primitives as signed numbers (the byte data type ranges from -128 to 127, not 0 to 255), the XOR operation can (and does) result in some negative values (-128 to -1). It seems that Java does not recognise these values as valid ASCII, and simply writes a ? (question mark) to the file for any negative value. When it comes to reading from the file to decrypt the cipher text, the negative value that caused the ? to be written is lost, replaced with 63, the valid ASCII code for a question mark.
This means that XORing this value is useless; without the original value there is no way to produce the plaintext. Incidentally, if I reproduce the behaviour of encrypting some data and then decrypting it immediately after, in the same program run, printing status along the way, there are no problems. Only if the data is written to file is the information lost.
I should also mention that I did try adding 128 to each encryption XOR result, and then subtracting it before performing the decryption XOR (to put each value in a valid ASCII range), but the ? problem still showed up, because there are character codes in the 128 to 159 range that I'm unable to read back and that also appear as ?.
I've been banging my head off the wall on this for a while now, any help is appreciated.
Cheers.

This is very confused. If you are processing a char array, the elements are 16 bits wide, they are unsigned, and not all values are valid characters. So (a) you can't possibly be having a problem with signs or bytes, and (b) you shouldn't be doing that at all. You should be reading the file into a byte array, XORing, and writing the byte array directly to the output file. No Readers or Writers, no chars, no Strings.

I guess the problem is in the way you write the file. Write the resulting byte array directly to a FileOutputStream and do not try to convert it to a String first. For reading, do the same thing: read it into a byte array.
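To make that concrete, here is a minimal sketch of the byte-oriented approach both answers describe (the file names, block size, and seed handling are illustrative, not taken from the question; note that SHA1PRNG gives a repeatable sequence for a given seed on the Sun/Oracle provider, mirroring the question's approach, whereas a true one-time pad would use a genuinely random key):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;

public class XorPad {
    public static void main(String[] args) throws IOException, GeneralSecurityException {
        // Deterministic keystream: seeding SHA1PRNG before first use with the
        // SHA-512 digest of the seed string reproduces the same byte sequence.
        SecureRandom pad = SecureRandom.getInstance("SHA1PRNG");
        pad.setSeed(MessageDigest.getInstance("SHA-512")
                .digest("seed string".getBytes(StandardCharsets.UTF_8)));

        try (FileInputStream in = new FileInputStream("input.bin");
             FileOutputStream out = new FileOutputStream("output.bin")) {
            byte[] buffer = new byte[64];
            byte[] key = new byte[64];
            int n;
            while ((n = in.read(buffer)) != -1) {
                pad.nextBytes(key);              // 64 keystream bytes per block
                for (int i = 0; i < n; i++) {
                    buffer[i] ^= key[i];         // XOR bytes in place; no chars involved
                }
                out.write(buffer, 0, n);         // raw bytes out, no String conversion
            }
        }
    }
}

Because XOR is its own inverse, running the same program again over the ciphertext with the same seed string reproduces the plaintext. No value can be corrupted on the way, because no byte is ever interpreted as a character.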

Related

String hex hash to bytes

I have a String hash in hex form ("e6fb06210fafc02fd7479ddbed2d042cc3a5155e") and I would like to compare it to crypt.digest().
One way, which works fine, is to convert crypt.digest() to hex, but I would like to avoid multiple conversions and instead convert the hash from its hex form (above) to a byte array.
What I tried was:
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
but it does not match with crypt.digest(). When I convert hashBytes back to hex I get "00e6fb06210fafc02fd7479ddbed2d042cc3a5155e".
The leading zeros seem to be the reason why I fail to match byte arrays. Why do they occur? How can I get the same result using crypt.digest() and toByteArray?
The reason for the extra 00 is that e6 has its high (sign) bit set.
The redundant 00 byte is how BigInteger keeps the value unsigned (positive).
String hash = "e6fb06210fafc02fd7479ddbed2d042cc3a5155e";
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
hashBytes = hashBytes.length > 1 && hashBytes[0] == 0
        ? Arrays.copyOfRange(hashBytes, 1, hashBytes.length)
        : hashBytes;
System.out.println(Arrays.toString(hashBytes));
The question arises, what if the hash actually starts with a 00?
Then you need the hash length, or do a lenient comparison.
The answer can be found in the following comment from a thread about the closely related question Convert a string representation of a hex dump to a byte array using Java?:
The issue with BigInteger is that there must be a "sign bit". If the leading byte has the high bit set then the resulting byte array has an extra 0 in the 1st position. But still +1.
– Gray Oct 28 '11 at 16:20
Since the first bit has a special meaning (indicating the sign, 0 for positive, 1 for negative), BigInteger will prefix the data with an additional 0 in case your data started with a 1 on the high bit. Otherwise it would be interpreted as negative although it was not negative to begin with.
I.e. data like
101110
is turned into
0101110
You could easily undo this manually by using Arrays.copyOfRange(data, 1, data.length) if it happens.
However, instead of fixing that code, I would suggest using one of the other solutions posted in the linked thread. They are cleaner and easier to read and maintain.
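For reference, a minimal sketch of such a direct conversion (the helper name is mine, not from the linked thread): each pair of hex digits is parsed independently, so no sign byte is ever prepended and leading zero bytes are preserved:

// Convert a hex string to bytes two digits at a time; unlike BigInteger,
// this never adds a sign byte and never drops leading 00 bytes.
static byte[] hexToBytes(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
        out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
}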

Combining elements of a byte[] array into 16-bit numbers

This is an excerpt of code from a music tuner application. A byte[] array is created, audio data is read into the buffer array, and then the for loop iterates through buffer, combining the values at indices i and i+1 to create an array of 16-bit numbers half the original length.
byte[] buffer = new byte[2 * 1200];
targetDataLine.read(buffer, 0, buffer.length);
for (int i = 0; i < n; i += 2) {
    int value = (short) ((buffer[i] & 0xFF) | ((buffer[i+1] & 0xFF) << 8)); //**Don't understand**
    a[i >> 1] = value;
}
So far, what I have is this:
From a different SO post, I learned that every byte being stored in a larger type must be ANDed with 0xFF, due to its conversion to a 32-bit number. I guess the leading 24 bits are filled with 1s (though I don't know why they aren't filled with zeros... wouldn't leading 1s change the value of the number? 000000000010 (2) is different from 111111110010 (-14), after all), so the purpose of 0xFF is to grab only the last 8 bits (which is the whole byte).
When buffer[i+1] is shifted left by 8 bits, this makes it so that, when ORing, the eight bits from buffer[i+1] are in the most significant positions and the eight bits from buffer[i] are in the least significant positions. We wind up with a 16-bit number of the form buffer[i+1] + buffer[i]. (I'm using + but I understand it's closer to concatenation.)
First, why are we ORing buffer[i] | buffer[i+1] << 8? This seems to destroy the original sound information unless we pull it back out in the same way; while I understand that OR will combine them into one value, I don't see how that value can be useful or used in calculations later. And the only way this data is accessed later is as its literal values:
diff += Math.abs(a[j] - a[i+j]);
If I have 101 and 111, added together I should get 12, or 1100. Yet 101 | 111 << 3 gives 111101, which is equal to 61. The closest I got to understanding was that 101 (5) | 111000 (56) is the same as adding 5+56=61. But the order matters -- doing the reverse 101 <<3 | 111 is completely different. I really don't understand how the data can remain useful, when it is OR'd in this way.
The other problem I'm having is that, because Java uses signed bytes, the eighth position doesn't indicate the value, but the sign. If I'm ORing two binary signed numbers, then in the resulting 16-bit number, the bit at 2⁷ is now acting as a value instead of a sign placeholder. If I had a negative byte before running the OR, then in my final value post-operation, it would now erroneously act as though the original number had a positive 2⁷ in it. 0xFF doesn't get rid of this, because it preserves the eighth (sign) bit, so shouldn't this be a problem?
For example, 1111 (-1) and 0101, when OR'd, might give 01011111. But 1111 wasn't representing POSITIVE 1111, it was representing the signed version; yet in the final answer, it now is acting as a positive 2³.
UPDATE: I marked the accepted answer, but it took that plus a little extra work to figure out where I went wrong. For anyone who may read this in the future:
As far as the signing goes, the code I have uses signed bytes. My only guess as to why this doesn't mess anything up was that all of the values received might be positive. Except that this doesn't make sense, given that a waveform varies in amplitude over [-1,1]. I figured that if there were negative values, the ORing wouldn't remove the sign bits, but that a few stray 1s shouldn't hurt the outcome much, since diff accumulates very large values and the code only relies on coarse comparisons. That was all wrong. I gave it some more thought and it's really simple, actually -- the only reason this was such a problem is that I didn't know about big-endian, and once I read about it, I misunderstood exactly how it is implemented. Endianness is explained in the next bullet point.
Regarding the order in which the bits are placed, destroying the sound, etc.: the code I'm using sets bigEndian = false, meaning the byte order runs from least significant byte to most significant byte. For this reason, combining the two indices of buffer requires taking the second index, placing its bits first, and placing the first index second (so we end up in big-endian byte order). One of the problems I had was the impression that endianness determines the bit order. I thought 10010101 big-endian would become 10101001 little-endian. Turns out this is not the case -- the bits in each byte remain in their original order; the difference is that the bytes are ordered backward. So big-endian 10110101 11100001 becomes 11100001 10110101 -- the same bit order within each byte, but a different byte order.
Finally, I'm not sure why, but the accepted answer is correct: targetDataLine.read() can only place the data into a byte array (not just in my code, but in all Java code using targetDataLine -- read() only accepts a byte array as its destination), yet the data is in fact one short split across two bytes. That is why every two indices must be combined.
Coming back to the signing, it should be obvious by now why this isn't an issue. This is the comment I now have in the code, which explains more coherently what it took all of the above to explain:
/* The Javadoc explains that the targetDataLine will only read into a byte-typed array.
   However, because the sample size is 16-bit, it is actually storing 16-bit numbers
   (shorts) there, splitting them every eight bits. Additionally, because it stores them
   little-endian, bits [2^0, 2^7] are stored in index [i] in normal order
   (powers 7-6-5-4-3-2-1-0), while bits [2^8, 2^15] are stored in index [i+1]. So together
   they currently read as [7-6-5-4-3-2-1-0 15-14-13-12-11-10-9-8], which is a problem.
   In the next for loop we take care of this and reorganize the bytes by swapping every
   pair (remember: the bits are fine, but the bytes are out of order).
   Also, although the array is signed, this will not matter when we combine bytes,
   because the sign bit (2^15) is placed back at the beginning like it normally is;
   although 2^7 currently sits as the most significant bit of its byte, it is not a
   sign-indicating bit, because it is really the middle of the short that was split. */
This is converting the byte stream from the input's low-byte-first (little-endian) order into a stream of shorts in the internal byte order.
As for sign extension, it is more a question of the sign encoding of the original byte stream. If the original byte stream is unsigned (coding values from 0 to 255), then the & 0xFF masking overcomes the otherwise unwanted effects of Java treating the values as signed. So the educated guess is that the external byte stream encodes unsigned bytes.
Judging whether the code is plausible needs information on what external encoding is being treated and what internal encoding is used. E.g. (wild guess, could be totally wrong!): the two-byte chunks read could belong to the 2 channels of a stereo sound encoding and are put into a single short for ease of internal processing. You should look at the encoding being read and at the use of the converted data within the application.
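A minimal demonstration of why the & 0xFF masks matter when assembling the short (the byte values here are made up for illustration):

public class SignExtensionDemo {
    public static void main(String[] args) {
        byte lo = (byte) 0x9C;   // -100 as a signed Java byte
        byte hi = (byte) 0x03;

        // Without masking, lo is sign-extended to the int 0xFFFFFF9C,
        // and the 1s in the high 24 bits corrupt the OR.
        int bad = lo | (hi << 8);                               // 0xFFFFFFBC == -68 (wrong)

        // Masking keeps only the low 8 bits of each byte before combining.
        int good = (short) ((lo & 0xFF) | ((hi & 0xFF) << 8));  // 0x039C == 924 (correct)

        System.out.println("bad = " + bad + ", good = " + good);
    }
}

The (short) cast at the end matters too: it sign-extends bit 15 back into the int, so a sample such as 0x8000 comes out as -32768 rather than 32768, which is how the original signed 16-bit sample should read.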

How to write to a file in Java after Huffman Coding is done

I have implemented a class for Huffman coding. The class parses an input file, builds a huffman tree from it, and creates a map which has each of the distinct characters appearing in the file as a key and the huffman code of that character as its value.
For example, let the string "aravind_is_a_good_boy" be the only line in the file. When you build the huffman tree and generate the huffman code for each character, you can see that, for the character 'a', the huffman code is '101' and for the character 'r' the huffman code is '0101', etc.
My intention is to compress the file. So I cannot write a string, created by replacing each character with its huffman code, directly to the file, since each character would be replaced by at least 3 characters (each '1' and '0' would still be written into the file as a character, not a bit). So I thought I would write it to the file as bytes, since there is no way to write individual bits to a file. But then 'a' ('101') and 'r' ('0101') are both written as the byte 5 into the file. This would cause problems when trying to decompress the file.
This is how I am converting a series of bits to bytes:
public byte[] compressString(String s, CharCodeHashMap map) {
    String byteString = "";
    byte[] byteArr = new byte[s.length()];
    int size = 0;
    for (int i = 0; i < s.length(); i++) {
        byteString += addPaddingZeros(map.getCompressedChar(s.charAt(i)));
        byteArr[size++] = new BigInteger(byteString, 2).toByteArray()[0];
        byteString = "";
    }
    return byteArr;
}
I tried prefixing '1' to each of the codes to fix the problem. But then, when you build a huffman tree from a larger file, some characters end up with codes longer than 8 bits, and new BigInteger(byteString, 2).toByteArray() returns more than one element in the array. (For example, if 'v' has the code '11010001', new BigInteger(byteString, 2).toByteArray() returns the array [0, -47].)
Can someone please suggest a way to write to the file such that the file is compressed and these problems are also taken care of?
The problem is that files in modern operating systems are modeled as indexable sequences of bytes [1].
So what you need is a way to encode the fact that your file is representing a number of bits that may not be a multiple of 8. That means the bit stream size is not necessarily the file size (in bytes) multiplied by 8.
There are a variety of solutions:
Reserve N bytes at the start of the file for the file size in bits. For example, reserving 4 bytes allows you to represent file sizes up to 2^32 bits.
Reserve 3 bits at the start of the file to hold the number of bits modulo 8. You can use this to decide how many bits in the last byte of the file to ignore.
Use some kind of encoding to represent the end of stream; e.g. represent it as a character in the text stream that you are encoding.
Is there a way to deal with this without using some bits? AFAIK, No.
[1] - And at a lower level, files are represented as sequences of disk blocks consisting of multiple bytes. So, from a physical storage perspective, compressing files that are already small (e.g. smaller than a disk block) doesn't achieve anything. Similarly, saving or not saving (say) 3 bits when the representation is modeled as a byte sequence is at the border of being pointless ... if that was what was concerning you.
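As a sketch of the first option above (the class and method names are illustrative, not from the answer), a 4-byte header can record the bit count ahead of the packed data:

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BitLengthHeader {
    // Write a 4-byte bit count, then the packed bits themselves.
    // The last byte of packedBits may contain up to 7 padding bits,
    // which the reader can ignore once it knows bitLength.
    static void writeWithHeader(String file, byte[] packedBits, int bitLength)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
            out.writeInt(bitLength);
            out.write(packedBits);
        }
    }
}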
Yes, you can write bits to a file. In fact you are always writing bits to a file. The only thing is that you are writing eight bits at a time.
What you need is a bit buffer, say a 32-bit unsigned variable, into which you accumulate bits. Have another integer that tracks how many bits are in the bit buffer. Use the shift left and or (or plus) operators to put more bits in the bit buffer, and the and and shift right operators to remove them. Whenever you have eight or more bits in the bit buffer, you write those eight bits to the file as a byte. At the end, write the remaining bits (if any) to the file as the last byte.
So, to add the low bits bits of value to the buffer:
bitBuffer |= value << bitCount;
bitCount += bits;
to write and remove available bytes:
while (bitCount >= 8) {
    writeByte(bitBuffer & 0xff);
    bitBuffer >>>= 8;
    bitCount -= 8;
}
You need to make sure that when decoding, you don't mistake the filler bits in the last byte as another code. You can either send the actual number of bits in the message preceding the message (or the number of bits in the last byte), or you can add a symbol to your alphabet for end-of-stream that gets its own Huffman code, and end the message with that.
The other problem you have is that you will also need to transmit the Huffman code itself to the decoder before the coded symbols in order for the decoder to know how to decode. Look up "canonical Huffman codes" for how to approach that efficiently.
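Putting those pieces together, here is a minimal sketch of the bit buffer described above, wrapped as a small class (the names are illustrative; it assumes each code is at most 24 bits, so the 32-bit buffer never overflows):

import java.io.IOException;
import java.io.OutputStream;

class BitWriter {
    private final OutputStream out;
    private int bitBuffer = 0;   // pending bits, least significant first
    private int bitCount = 0;    // how many bits of bitBuffer are valid

    BitWriter(OutputStream out) { this.out = out; }

    // Append the low 'bits' bits of 'value' to the stream.
    void write(int value, int bits) throws IOException {
        bitBuffer |= value << bitCount;
        bitCount += bits;
        while (bitCount >= 8) {              // flush complete bytes
            out.write(bitBuffer & 0xFF);
            bitBuffer >>>= 8;
            bitCount -= 8;
        }
    }

    // Write any leftover bits, zero-padded, as the final byte.
    void flush() throws IOException {
        if (bitCount > 0) {
            out.write(bitBuffer & 0xFF);
            bitBuffer = 0;
            bitCount = 0;
        }
    }
}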

how to find the type of integer from a file using java

I have a file that contains integer values of different bit lengths (4 bytes, 2 bytes), but I don't know the layout of these values in the file (i.e. whether a value is a 4 bytes or 2 bytes integer). For example, a file may have two 4-byte integers followed by five 2-byte integers, and another file may have three 2-byte integers first and then four 4-byte integers. Is there a way to read such values?
I want to write code that takes such a file and reads each value irrespective of its byte size. Right now I am using DataInputStream, having learned the layout of the values in advance using some viewer. But in this manner everything is hard-coded, and my code is not generic.
You're going to have to "parse" or "read" or "do something" with the viewer data, and use that refactored viewer info as a file-format definition during the reads.
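For example, if the viewer info can be reduced to a list of field widths, a hypothetical sketch (the layout representation here is invented for illustration) might drive DataInputStream from that list instead of hard-coding the reads:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class LayoutReader {
    // widths comes from the parsed viewer info, e.g. {4, 4, 2, 2, 2, 2, 2}
    // for two 4-byte integers followed by five 2-byte integers.
    static long[] readValues(String file, int[] widths) throws IOException {
        long[] values = new long[widths.length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            for (int i = 0; i < widths.length; i++) {
                switch (widths[i]) {
                    case 2: values[i] = in.readShort(); break;
                    case 4: values[i] = in.readInt();   break;
                    default: throw new IOException("unsupported width: " + widths[i]);
                }
            }
        }
        return values;
    }
}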

How to read file created by C++ program in java?

I have a file created by a C++ program, in an encrypted format. I want to read it in my Java program. For decryption of the file contents, the decryption algorithm performs operations on bytes [which are unsigned char / BYTE in C/C++]. I am using the same decryption algorithm that I used in my C/C++ program. This algorithm involves ^, %, * and - operations on bytes. But Java's byte datatype is signed, because of which I am facing problems in decryption. How can I read the file, or process the data read, one byte at a time as unsigned?
Thanks in advance.
byte b = <as read from file>;
int i = b & 0xFF;
// perform operations on i as required
The standard method InputStream.read() reads one byte and fits it into an int, so in practice it is an unsigned byte. There are no unsigned primitive data types in Java, so the only approach is to fit the value into a wider primitive.
That being said, you should have no trouble performing encryption/decryption over the data bytes read from the file, since the bytes are the same no matter whether they are interpreted as signed or unsigned (0xFF can be 255 or -1). You say the algorithm contains "^, %, *", etc.; signedness only matters when the raw bytes are interpreted, for instance through a character encoding. You should not perform encryption/decryption operations over anything other than raw bytes.
First, InputStream.read() returns an int, but it holds a byte; it uses an int so that -1 can be returned when EOF is reached. If the int is not -1, you can cast it to byte.
Second, there are read() methods that allow storing the bytes directly into a byte[].
And last, if you are going to use the file as a byte[] (and it is not too big), it may be worth copying the data from the FileInputStream into a ByteArrayOutputStream. You can get the resulting byte[] from the latter object (note: do not use the .read() method; use .read(byte[], int, int) for performance).
Since there is no unsigned primitive type in Java, I think what you can do is convert the signed byte into an integer (which will effectively be unsigned, because the integer will always be positive). You can follow the code here: Can we make unsigned byte in Java for the conversion.
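A minimal sketch of that approach (the file names and the example operations are made up; substitute the real algorithm): read each byte as 0..255, operate on it as an int, and let write() truncate the result back to a byte:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class UnsignedByteDecrypt {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("cipher.bin");
             FileOutputStream out = new FileOutputStream("plain.bin")) {
            int b;
            while ((b = != -1) {   // read() already returns 0..255, or -1 at EOF
                int u = (b ^ 0x5A) % 251;     // stand-in for the real ^, %, *, - steps
                out.write(u);                 // write() keeps only the low 8 bits
            }
        }
    }
}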
