Java Huffman compressor output bigger than original

I'm writing a Huffman compressor for homework. I managed to build the Huffman tree and the 0/1 code for every character, but the output file is bigger than the original.
There was a question like mine here:
Unable to compress file during Huffman Encoding in Java
but I didn't really understand it.
My code:
this.HuffmanTreeBulid(); // create the Huffman tree
HuffmanNode root = tree;
this.codeGenerator(root, codes); // create the hashmap of codes
try
{
    FileOutputStream out2 = new FileOutputStream(fileOut); // for the new file
    FileInputStream in = new FileInputStream(fileInput);   // for reading the original file again
    FileWriter out = new FileWriter(fileOut);
    //String code;
    char currentchar;
    int currentByte; // int for iterating over all the bytes of the file
    if (!fileOut.exists()) // if the new file exists, replace it; if not, create it
        fileOut.createNewFile();
    else
    {
        fileOut.delete();
        fileOut.createNewFile();
    }
    while ((currentByte = in.read()) != -1)
    {
        int currentint = currentByte & 0xff; // "& 0xff" is for unsigned int
        currentchar = (char) currentint;
        byte[] c = (huffmanCodes.get(currentchar)).getBytes();
        //out.write(huffmanCodes.get(code2));
        //out.write(huffmanCodes.get(currentchar)); // for FileWriter
        out2.write(c);
    }
    in.close();
    out.close();
    out2.close();
}
catch (IOException e)
{
    e.printStackTrace();
}
Update 1: I understand the problem now, so I tried doing this:
int bitIndex = 0;
for (int i = 0; i < codes.length(); i++)
{
    if (codes.charAt(i) == '1')
        buffer.set(bitIndex++);
    else
        buffer.clear(bitIndex++);
}
It still doesn't work :(
Update 2: I'm doing this to get the bytes from the string:
byte[] bytes = new BigInteger(binaryString, 2).toByteArray();
for (byte b : bytes)
{
    out2.write(b);
}
It still doesn't work, but it's the closest I've gotten so far.
Maybe the bytes are fine but I'm writing them the wrong way?

The problem is the following line:
byte[] c = (huffmanCodes.get(currentchar)).getBytes();
You are trying to turn your coded String into bare bits and bytes. But in fact, getBytes() just returns the String encoded in your platform's default charset. So you may get, say, a UTF-8 byte encoding of the character "1" and a UTF-8 byte encoding of the character "0", i.e. a full byte for every single bit of your code.
You have to parse your String to a byte. You can see how to do so here:
java: convert binary string to int
or here:
How to convert binary string to a byte?
You can read more about the getBytes method here:
https://beginnersbook.com/2013/12/java-string-getbytes-method-example/
As @9000 mentioned, you do not have a bit stream.
When working with compressors, bit streams are more suitable than complete bytes, so parsing each code into a complete byte will not compress your string, as a char remains the size of a char.
What you can do is concatenate the resulting binary strings and then parse the whole string into bytes at the end. Be aware of trailing zeros.
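To make the blow-up concrete, here is a minimal, self-contained sketch (the names and sample codes are illustrative, not taken from your program). It shows that getBytes() stores one full byte per '0'/'1' character, and how the concatenated code string can be packed into real bits instead:

import java.io.ByteArrayOutputStream;

public class BitPackDemo {
    public static void main(String[] args) {
        String code = "10110";                       // a Huffman code for one symbol
        System.out.println(code.getBytes().length);  // 5 bytes, i.e. 8x larger than 5 bits

        // Pack a concatenated code string into real bits, 8 per byte.
        // The final byte is zero-padded, so the bit count must be stored
        // alongside the data in order to decode correctly.
        String allCodes = "10110" + "0111" + "110";  // 12 bits total
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        int current = 0, nBits = 0;
        for (int i = 0; i < allCodes.length(); i++) {
            current = (current << 1) | (allCodes.charAt(i) == '1' ? 1 : 0);
            if (++nBits == 8) {
                packed.write(current);
                current = 0;
                nBits = 0;
            }
        }
        if (nBits > 0)                               // flush the zero-padded last byte
            packed.write(current << (8 - nBits));
        System.out.println(packed.size() + " bytes for "
                + allCodes.length() + " bits");      // 2 bytes for 12 bits
    }
}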

I would suggest adding something like this:
class BitstreamPacker {
    private int bitPos; // Actual values 0..7; where to add the next bit.
    private ArrayList<Byte> data;
    public void addBit(boolean bit) {
        // Add the bit to the last byte of data; allocate more if it does not fit.
        // Adjusts bitPos as it goes.
    }
    public void writeBytes(OutputStream output) {
        // Writes the number of bytes, then the last bit pos, then the bytes.
    }
}
Similarly,
class BitstreamUnpacker {
    private byte[] data; // Or ArrayList if you wish.
    private int currentBytePos;
    private int currentBitPos; // Could be enough to track the global bit position.
    public static BitstreamUnpacker fromByteStream(InputStream input) {
        // A factory method; reads the stream and creates an instance.
        // Uses the byte count to allocate the right amount of bytes;
        // uses the bit count to limit the last byte to the actual number of bits.
        return ...;
    }
    public Boolean getNextBit() {
        // Reads bits sequentially from the internal data.
        // Returns null when the end of data is reached.
        // Or feel free to implement an iterator / iterable.
    }
}
Note that the bit stream may end at the middle of the byte, so storing the count of bits in the last byte is required.
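For reference, here is a minimal compilable Java sketch of the packer along these lines (a sketch of the idea above, not a drop-in implementation; the single-byte length field limits it to 255 bytes of data):

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;

class BitstreamPacker {
    private int bitPos = 0;                  // 0..7; where the next bit goes.
    private final ArrayList<Byte> data = new ArrayList<>();

    public void addBit(boolean bit) {
        if (bitPos == 0)                     // begin a new byte
            data.add((byte) 0);
        if (bit) {                           // set the bit; cleared bits stay 0
            int last = data.size() - 1;
            data.set(last, (byte) (data.get(last) | (1 << bitPos)));
        }
        bitPos = (bitPos + 1) % 8;
    }

    public void writeBytes(OutputStream output) throws IOException {
        output.write(data.size());           // byte count (one byte, for brevity)
        output.write(bitPos);                // bits used in the last byte (0 = all 8)
        for (byte b : data)
            output.write(b);
    }
}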
To help you better understand the idea, here's some Python code (because Python is easy to play with interactively):
class BitstreamPacker(object):
    def __init__(self):
        self.data = []  # A list of bytes.
        self.bit_offset = 0  # 0..7.

    def add_bit(self, bit):
        if self.bit_offset == 0:  # We must begin a new byte.
            self.data.append(0)  # Append a new byte.
        # We use addition because we know that the bit we're affecting is 0.
        # [-1] means last element.
        self.data[-1] += (bit << self.bit_offset)
        self.bit_offset += 1
        if self.bit_offset > 7:  # We've exceeded one byte.
            self.bit_offset = 0  # Shift the offset to the beginning of a byte.

    def get_bytes(self):
        # Just returning the data instead of writing, to simplify interactive use.
        return (len(self.data), self.bit_offset, self.data)
How does it work from Python REPL?
>>> bp = BitstreamPacker()
>>> bp.add_bit(1)
>>> bp.add_bit(1)
>>> bp.get_bytes()
(1, 2, [3]) # One byte, two bits in it are used.
>>> bp.add_bit(0)
>>> bp.add_bit(0)
>>> bp.add_bit(0)
>>> bp.add_bit(1)
>>> bp.add_bit(1)
>>> bp.add_bit(1)
>>> bp.get_bytes()
(1, 0, [227]) # Whole 8 bits of one byte were used.
>>> bp.add_bit(1)
>>> bp.get_bytes()
(2, 1, [227, 1]) # Two bytes used: one full, and one bit in the next.
>>> assert 0b11100011 == 227 # The binary we sent matches.
>>> _
I hope this helps.

Related

Java, trying to create a specific network byte header based on length of command

I'm running into some trouble when attempting to create a network byte header. The header should be 2 bytes long and simply defines the length of the following command.
For example, the command String "HED>0123456789ABCDEF" is 20 characters long, which is 0014 in hex (signed two's complement). Creating the network byte header for this command works because the command is under 124 characters: the following snippet of code works out the byte header and prefixes the command with \u0000\u0014.
However, for commands that are 124 characters or above, the code in the if block doesn't work. Therefore, I looked into possible alternatives and tried a couple of things involving generating hex characters and setting them as the network byte header, but as they aren't bytes it's not going to work (as seen in the else block). Instead, the else block simply returns 0090 for a command that is 153 characters long, which is technically correct, but I'm not able to use this 'length' header the same way as the if block's length header.
public static void main(String[] args) {
    final String commandHeader = "HED>";
    final String command = "0123456789ABCDEF";
    short commandLength = (short) (commandHeader.length() + command.length());
    char[] array;
    if (commandLength < 124)
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = new String(bb.array()).toCharArray();
    }
    else
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = convertToHex(bb.array());
    }
    final String fullCommand = new String(array) + commandHeader + command;
    System.out.println(fullCommand);
}
private static char[] convertToHex(byte[] data) {
    final StringBuilder buf = new StringBuilder();
    for (byte b : data) {
        int halfByte = (b >>> 4) & 0x0F;
        int twoHalves = 0;
        do {
            if ((0 <= halfByte) && (halfByte <= 9))
                buf.append((char) ('0' + halfByte));
            halfByte = b & 0x0F;
        } while (twoHalves++ < 1);
    }
    return buf.toString().toCharArray();
}
Furthermore, I have managed to get this working in Python 2 with the following three lines, no less! This returns the network byte header for a 153-character command as \x00\x99.
msg_length = len(str_header + str_command)
command_length = pack('>h', msg_length)
command = command_length + str_header + str_command
This can also be replicated by running Python 2 and entering the following commands:
In [1]: import struct
In [2]: struct.pack('>h', 153)
Out[2]: '\x00\x99'
Any assistance, or light that could be shed on this issue, would be greatly appreciated.
The basic problem is that you (try to) convert fundamentally binary data to character data. Furthermore, you do it using the platform's default charset, which varies from machine to machine.
I think you have mischaracterized the problem slightly, however. I am confident that it arises when command.length() is at least 124, so that commandLength, which includes the length of commandHeader, too, is at least 128. You would also find that there are some (much) larger command lengths that worked, too.
The key point here is that when any of the bytes in the binary representation of the length have their most-significant bit set, that is meaningful to some character encodings, especially UTF-8, which is a common (but not universal) default. Unless you get very lucky, binary lengths that contain any such bytes will not be correctly decoded into characters in UTF-8. Moreover, they may get decoded into characters successfully but differently on machines that use different charsets for the purpose.
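You can see this directly by decoding the two length bytes for a 153-character command with different charsets (a hypothetical demo, not part of the fix below):

import java.nio.charset.Charset;

public class CharsetMangleDemo {
    public static void main(String[] args) {
        byte[] length = {0x00, (byte) 0x99};  // 153 as a big-endian short

        // In UTF-8, 0x99 is an invalid lone continuation byte, so it decodes
        // to the replacement character U+FFFD rather than back to 0x99.
        String utf8 = new String(length, Charset.forName("UTF-8"));
        System.out.printf("UTF-8 : %04x %04x%n",
                (int) utf8.charAt(0), (int) utf8.charAt(1));

        // ISO-8859-1 happens to map every byte to the same code point,
        // but the platform default may be neither of these charsets.
        String latin1 = new String(length, Charset.forName("ISO-8859-1"));
        System.out.printf("Latin1: %04x %04x%n",
                (int) latin1.charAt(0), (int) latin1.charAt(1));
    }
}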
You also have another, related inconsistency. You are formatting data for transmission over the network, which is a byte-oriented medium. The transmission will be a sequence of bytes. But you are measuring and reporting the number of characters in the decoded internal representation, not the number of bytes in the encoded representation that will go over the wire. The two counts are the same for your example command, but they would differ for some strings that you could express in Java.
Additionally, your code is inconsistent with your description of the format wanted. You say that the "network byte header" should be four bytes long, but your code emits only two.
You can address all these issues by taking character encoding explicitly into account, and by avoiding the unneeded and inappropriate conversion of raw binary data to character data. The ByteBuffer class you're already using can help with that. For example:
public static void main(String[] args) throws IOException {
    String commandHeader = "HED>";
    // a 128-byte command
    String command = "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF";

    // Convert characters to bytes, and do so with a specified charset.
    // Note that ALL Java implementations are required to support UTF-8.
    byte[] commandHeaderBytes = commandHeader.getBytes("UTF-8");
    byte[] commandBytes = command.getBytes("UTF-8");

    // Measure the command length in bytes, since that's what the receiver
    // will need to know.
    int commandLength = commandHeaderBytes.length + commandBytes.length;

    // Build the whole message in your ByteBuffer.
    // Allow a 4-byte length field, per spec.
    ByteBuffer bb = ByteBuffer.allocate(commandLength + 4);
    bb.putInt(commandLength)
      .put(commandHeaderBytes)
      .put(commandBytes);

    // DO NOT convert to a String or other character type. Output the
    // bytes directly.
    System.out.write(bb.array());
    System.out.println();
}
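On the receiving side, the length should be read as raw bytes too. A minimal sketch of a matching reader (the class and method names are illustrative):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CommandReader {
    // Reads one length-prefixed command written in the format above.
    public static String readCommand(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();       // the 4-byte big-endian length field
        byte[] body = new byte[length];
        data.readFully(body);              // read exactly 'length' bytes
        return new String(body, "UTF-8");  // decode with the agreed charset
    }
}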

How do FileInputStream and FileOutputStream work in Java?

I'm reading about all the input/output streams in Java in the Java Tutorials docs. The tutorial's writer uses this example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CopyBytes {
    public static void main(String[] args) throws IOException {
        FileInputStream in = null;
        FileOutputStream out = null;
        try {
            in = new FileInputStream("xanadu.txt");
            out = new FileOutputStream("outagain.txt");
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } finally {
            if (in != null) {
                in.close();
            }
            if (out != null) {
                out.close();
            }
        }
    }
}
xanadu.txt File data:
In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.
Output to outagain.txt file:
In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.
Why does the writer use int c even though we are reading characters?
Why use -1 in the while condition?
How does the out.write(c) method convert the int back into characters?
1: Why does the writer use int c even though we are reading characters?
FileInputStream.read() returns one byte of data as an int. This works because a byte can be represented as an int without loss of precision. See this answer to understand why int is returned instead of byte.
2: Why use -1 in the while condition?
When the end of file is reached, -1 is returned.
3: How does out.write(c) convert the int back into characters that produce the same output in the outagain.txt file?
FileOutputStream.write() takes the byte to write as an int. Since an int spans more values than a byte, the 24 high-order bits of the given int are ignored, making it a byte-compatible value: an int in Java is always 32 bits, and removing the 24 high-order bits leaves an 8-bit value, i.e. a byte.
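Here is a tiny standalone demo of that masking (the file name is made up for illustration):

import java.io.FileOutputStream;
import java.io.IOException;

public class WriteMaskDemo {
    public static void main(String[] args) throws IOException {
        try (FileOutputStream out = new FileOutputStream("demo.bin")) {
            // 321 is 0x141; only the low 8 bits (0x41 = 65 = 'A') are written.
            out.write(321);
            out.write(65);
            // demo.bin now contains two identical bytes: "AA"
        }
    }
}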
I suggest you carefully read the Javadocs for each of those methods. For reference, they answer all of your questions:
read:
Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255. If no byte is available because the end of the stream has been reached, the value -1 is returned. This method blocks until input data is available, the end of the stream is detected, or an exception is thrown.
write:
Writes the specified byte to this output stream. The general contract for write is that one byte is written to the output stream. The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
Just read the docs. Here are the docs for the read method:
http://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html#read()
public int read() throws IOException
Reads a byte of data from this input stream. This method blocks if no input is yet available.
Specified by:
read in class InputStream
Returns:
the next byte of data, or -1 if the end of the file is reached.
That int is your next byte of data.
Now, here are the answers.
1) When you assign a char to an int, the int receives that character's ASCII code.
If you are interested, here is a list of chars and their ASCII codes: https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html
2) -1 is returned if the end of the file is reached, so that's a check for whether any data is left.
3) When you send an ASCII code to the writer, it prints the corresponding char to the file.
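A quick illustration of the char/int relationship described in 1) (a standalone demo, not from the tutorial):

public class AsciiDemo {
    public static void main(String[] args) {
        int code = 'A';           // assigning a char to an int yields its code: 65
        char back = (char) 97;    // casting an int back to char yields 'a'
        System.out.println(code + " " + back);  // prints: 65 a
    }
}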

char[] into ascii and decimal value (double) java

Based on this array:
final char[] charValue = { 'u', ' ', '}','+' };
I want to print the double value and the ASCII value from it in Java.
I can't find a proper solution for that on the internet. I only found how to convert a single Character into an Integer value. But what about many characters?
The main problem is that I have a large char[] with some double and int values stored in it. The double values are stored in 4 bytes and the integers in 1 or 2 bytes, so I have to read all of this and convert it into double or int values.
Thanks for your help.
When Java was designed, C's char was being used for both binary bytes and text.
Java made a clear separation between binary data (byte[], InputStream/OutputStream) and Unicode text (char, String, Reader/Writer); hence Java has full Unicode support. Binary data, byte[], needs extra information, namely the encoding used, in order to be convertible to text (char[]/String).
In Java a char[] is rarely used (unlike in C/C++), and it seems byte[] is what you intend, as you mention 4 elements being used for an int, etcetera. A char is 16 bits, containing UTF-16 text.
For this case one can use a ByteBuffer either wrapping a byte[] or being taken from a memory mapped file.
Writing
ByteBuffer buf = ByteBuffer.allocate(13); // 13 bytes
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
buf.putInt(42); // at 0
buf.putDouble(Math.PI); // at 4
buf.put((byte) '0'); // at 12
buf.putDouble(4, 3.0); // at 4 overwrite PI
byte[] bytes = buf.array();
Reading
ByteBuffer buf = ByteBuffer.wrap(bytes);
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
int a = buf.getInt();
double b = buf.getDouble();
byte c = buf.get();

Converting string to binary and back again does not give the same string

I'm writing a Simplified DES algorithm to encrypt and subsequently decrypt a string. Suppose I have the initial character (, which has the binary value 00101000, which I get using the following algorithm:
public void getBinary() throws UnsupportedEncodingException {
    byte[] plaintextBinary = text.getBytes("UTF-8");
    for (byte b : plaintextBinary) {
        int val = b;
        int[] tempBinRep = new int[8];
        for (int i = 0; i < 8; i++) {
            tempBinRep[i] = (val & 128) == 0 ? 0 : 1;
            val <<= 1;
        }
        binaryRepresentations.add(tempBinRep);
    }
}
After I perform the various permutations and shifts, ( and its binary equivalent are transformed into 10001010, whose ASCII equivalent is Š. When I come around to decryption and pass the same character through the getBinary() method, I now get the binary string 11000010 and another binary string 10001010, which translates into ASCII as x(.
Where is this rogue x coming from?
Edit: The full class can be found here.
You haven't supplied the decrypting code, so we can't know for sure, but I would guess you missed the encoding when populating your String. Java Strings are encoded in UTF-16 internally. Since you're forcing UTF-8 when encrypting, I'm assuming you're doing the same when decrypting. The problem is that when you convert your encrypted bytes to a String for storage, you probably end up with a two-byte encoding for the character, because 10001010 is 138, which is beyond the 0-127 range of ASCII characters that get represented with a single byte.
So the "x" you're getting is the byte for the code page, followed by the actual character's byte. As suggested in the comments, you'd do better to just store the encrypted bytes as bytes, and not convert them to Strings until they're decrypted.
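The round-trip failure can be reproduced in isolation (a standalone demo, not the asker's class): decoding the byte 138 into a String and re-encoding it with UTF-8 yields two bytes, which are exactly the 11000010 10001010 observed above.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    public static void main(String[] args) {
        byte[] encrypted = {(byte) 0b10001010};  // one byte: 138

        // 138 maps to the code point U+008A, and UTF-8 encodes U+008A
        // as TWO bytes, 0xC2 0x8A: the "rogue x".
        String asText = new String(encrypted, StandardCharsets.ISO_8859_1);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(back));  // [-62, -118]
    }
}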

My java class implementation of XOR encryption has gone wrong

I am new to Java but I am very fluent in C++ and C#, especially C#. I know how to do XOR encryption in both C# and C++. The problem is that the algorithm I wrote in Java to implement XOR encryption seems to produce wrong results. The results are usually a bunch of spaces, and I am sure that is wrong. Here is the class:
public final class Encrypter {
    public static String EncryptString(String input, String key)
    {
        int length;
        int index = 0, index2 = 0;
        byte[] ibytes = input.getBytes();
        byte[] kbytes = key.getBytes();
        length = kbytes.length;
        char[] output = new char[ibytes.length];
        for (byte b : ibytes)
        {
            if (index == length)
            {
                index = 0;
            }
            int val = (b ^ kbytes[index]);
            output[index2] = (char) val;
            index++;
            index2++;
        }
        return new String(output);
    }

    public static String DecryptString(String input, String key)
    {
        int length;
        int index = 0, index2 = 0;
        byte[] ibytes = input.getBytes();
        byte[] kbytes = key.getBytes();
        length = kbytes.length;
        char[] output = new char[ibytes.length];
        for (byte b : ibytes)
        {
            if (index == length)
            {
                index = 0;
            }
            int val = (b ^ kbytes[index]);
            output[index2] = (char) val;
            index++;
            index2++;
        }
        return new String(output);
    }
}
Strings in Java are Unicode - and Unicode strings are not general holders for bytes like ASCII strings can be.
You're taking a string and converting it to bytes without specifying what character encoding you want, so you're getting the platform default encoding - probably US-ASCII, UTF-8 or one of the Windows code pages.
Then you're performing arithmetic/logic operations on these bytes. (I haven't looked at what you're doing here - you say you know the algorithm.)
Finally, you're taking these transformed bytes and trying to turn them back into a string - that is, back into characters. Again, you haven't specified the character encoding (but you'll get the same as you got converting characters to bytes, so that's OK), but, most importantly...
Unless your platform default encoding uses a single byte per character (e.g. US-ASCII), then not all of the byte sequences you will generate represent valid characters.
So, two pieces of advice come from this:
Don't use strings as general holders for bytes
Always specify a character encoding when converting between bytes and characters.
In this case, you might have more success if you specifically give US-ASCII as the encoding. EDIT: This last sentence is not true (see comments below). Refer back to point 1 above! Use bytes, not characters, when you want bytes.
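Following that advice, here is a minimal sketch of the XOR step done on raw bytes, with Base64 used when a printable form is needed (class and method names are illustrative, not a fix of the exact class above):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class XorCipher {
    // XOR is symmetric: the same method both encrypts and decrypts.
    public static byte[] xor(byte[] data, byte[] key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] plain = "attack at dawn".getBytes(StandardCharsets.UTF_8);
        byte[] key = "secret".getBytes(StandardCharsets.UTF_8);

        byte[] cipher = xor(plain, key);
        // Base64 gives a safe printable form; do NOT use new String(cipher).
        System.out.println(Base64.getEncoder().encodeToString(cipher));

        // XORing again with the same key restores the original bytes.
        System.out.println(new String(xor(cipher, key), StandardCharsets.UTF_8));
    }
}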
If you use non-ASCII strings as keys you'll get pretty strange results. The bytes in the kbytes array will be negative. Sign extension then means that val will come out negative. The cast to char will then produce a character in the FF80-FFFF range.
These characters will certainly not be printable, and depending on what you use to check the output you may be shown "box" or some other replacement characters.
