How do you write a custom deflation dictionary and implement it (Java)

{"playerX":"64","playerY":"224","playerTotalHealth":"100","playerCurrentHealth":"100","playerTotalMana":"50","playerCurrentMana":"50","playerExp":"0","playerExpTNL":"20","playerLevel":"1","points":"0","strength":"1","dexterity":"1","constitution":"1","intelligence":"1","wisdom":"1","items":["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24"],"currentMapX":"0","currentMapY":"0","playerBody":"1","playerHair":"6","playerClothes":"7"}
This is the string I am trying to compress.
The names of each variable will never change (this is a JSON object), so I want to add them to a dictionary.
There are a lot of things I could put into a dictionary, such as
"playerX":
"playerY":
I am trying to compress this to as small as I can get it.
I just don't know how to implement the dictionary. I know that I have to use a byte[], but how do I separate words in a byte[]?
Currently the code I provided below compresses it to a length of 253 from 494. I want to get it as small as I can. Since it is a small string, I would rather have more compression than speed.
You don't have to solve it for me, but maybe provide hints, sources, etc. on what I can do to make this string mega small.
public static void main(String[] args)
{
    deflater("String");
}

public static String deflater(String str)
{
    System.out.println("Original: " + str + ":End");
    System.out.println("Length: " + str.length());

    byte[] input = str.getBytes();
    Deflater d = new Deflater();
    d.setInput(input);
    d.setLevel(1); // note: level 1 is the fastest setting; Deflater.BEST_COMPRESSION (9) favours size
    d.finish();

    ByteArrayOutputStream dbos = new ByteArrayOutputStream(input.length);
    byte[] buffer = new byte[1024];
    while (!d.finished())
    {
        int bytesCompressed = d.deflate(buffer);
        System.out.println("Total Bytes: " + bytesCompressed);
        dbos.write(buffer, 0, bytesCompressed);
    }
    try
    {
        dbos.close();
    }
    catch (IOException e1)
    {
        e1.printStackTrace();
        System.exit(0);
    }

    //Dictionary implementation required!
    byte[] compressedArray = dbos.toByteArray();
    String compStr = new String(compressedArray);
    System.out.println("Compressed: " + compStr + ":End");
    System.out.println("Length: " + compStr.length());
    return null;
}

The dictionary is simply your common strings concatenated to make a sequence of bytes less than or equal to 32K in length. You do not need to separate words. There is no structure to the dictionary. It is simply used as a source of data to match the current string to. You should put the more common strings at the end of the dictionary, since it takes fewer bits to encode shorter distances back.
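As a rough sketch of how setDictionary() fits into the Deflater/Inflater API (the dictionary bytes here are just an illustrative subset of the JSON keys, not a tuned choice, and the JSON literal is a shortened stand-in for the full string above):

public static void main(String[] args) throws Exception {
    byte[] dictionary = "\"playerTotalHealth\":\"playerCurrentHealth\":\"items\":[\"playerX\":\"playerY\":"
            .getBytes(StandardCharsets.UTF_8);
    String json = "{\"playerX\":\"64\",\"playerY\":\"224\"}"; // stand-in for the full JSON above

    // Compression side: set the dictionary before deflating.
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setDictionary(dictionary);
    deflater.setInput(json.getBytes(StandardCharsets.UTF_8));
    deflater.finish();
    byte[] buf = new byte[1024];
    int compressedLen = deflater.deflate(buf);
    deflater.end();

    // Decompression side: the inflater signals that it needs the dictionary
    // after the first inflate() call returns 0.
    Inflater inflater = new Inflater();
    inflater.setInput(buf, 0, compressedLen);
    byte[] out = new byte[1024];
    int n = inflater.inflate(out);
    if (inflater.needsDictionary()) {
        inflater.setDictionary(dictionary); // must be the exact same bytes
        n = inflater.inflate(out);
    }
    inflater.end();
    System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));
}

The decompressor must have the identical dictionary bytes, so ship them with your game code rather than with the data.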

Related

Java Deflater for large set of random strings

I am using the Deflater class to try to compress a large set of random strings. My compression and decompression methods look like this:
public static String compressAndEncodeBase64(String text) {
    try {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(text.getBytes());
        }
        byte[] bytes = os.toByteArray();
        return new String(Base64.getEncoder().encode(bytes));
    } catch (Exception e) {
        log.info("Caught exception when trying to compress {}: ", text, e);
    }
    return null;
}
public static String decompressB64(String compressedAndEncodedText) {
    try {
        byte[] decodedText = Base64.getDecoder().decode(compressedAndEncodedText);
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(decodedText);
        }
        byte[] decompressedBArray = os.toByteArray();
        return new String(decompressedBArray, StandardCharsets.UTF_8);
    } catch (Exception e) {
        log.error("Caught following exception when trying to decode and decompress text {}: ", compressedAndEncodedText, e);
        throw new BadRequestException(Constants.ErrorMessages.COMPRESSED_GROUPS_HEADER_ERROR);
    }
}
However, when I test this on a large set of random strings, my "compressed" string is larger than the original string. Even for a relatively small random string, the compressed data is longer. For example, this unit test fails:
@Test
public void testCompressDecompressRandomString() {
    String orig = RandomStringUtils.random(71, true, true);
    String compressedString = compressAndEncodeBase64(orig);
    Assertions.assertTrue((orig.length() - compressedString.length()) > 0,
            "The decompressed string has length " + orig.length()
            + ", while compressed string has length " + compressedString.length());
}
Can anyone explain what's going on and suggest a possible alternative?
Note: I tried using the deflater without the base64 encoding:
public static String compress(String data) {
    Deflater new_deflater = new Deflater();
    new_deflater.setInput(data.getBytes(StandardCharsets.UTF_8));
    new_deflater.finish();
    byte[] compressed_string = new byte[1024];
    int compressed_size = new_deflater.deflate(compressed_string);
    byte[] returnValues = new byte[compressed_size];
    System.arraycopy(compressed_string, 0, returnValues, 0, compressed_size);
    log.info("The Original String: " + data + "\n Size: " + data.length());
    log.info("The Compressed String Output: " + new String(compressed_string) + "\n Size: " + compressed_size);
    return new String(returnValues, StandardCharsets.UTF_8);
}
My test still fails, however.
First off, you aren't going to get much or any compression on short strings. Compressors need more data to both collect statistics on the data and to have previous data in which to look for repeated strings.
Second, if you're testing with random data, you are further crippling the compressor, since now there are no repeated strings. For your test case with random alphanumeric strings, the only compression you can get is to take advantage of the fact that there are only 62 possible values for each byte. That can be compressed by a factor of log(62)/log(256) = 0.744. Even then, you need to have enough input to cancel the overhead of the code description. Your test case of 71 characters will always be compressed to 73 bytes by deflate, which is essentially just copying the data with a small overhead. There isn't enough input to justify the code description to take advantage of the limited character set. If I have 1,000,000 random characters from that set of 62, then deflate can compress that to about 752,000 bytes.
Third, you are then expanding the resulting compressed data by a factor of 1.333 by encoding it using Base64. So if I take that compression by a factor of 0.752 and then expand it by 1.333, I get an overall expansion of 1.002! You won't get anywhere that way on random characters from a set of 62, no matter how long the input is.
Given all that, you need to do your testing on real-world inputs. I suspect that your application does not have randomly-generated data. Don't attempt compression on short strings. Combine your strings into much longer input, so that the compressor has something to work with. If you must encode with Base64, then you must. But expect that there may be expansion instead of compression. You could include in your output format an option for chunks to be compressed or not compressed, indicated by a leading byte. Then when compressing, if it doesn't compress, send it without compression instead. You can also try a more efficient encoding, e.g. Base85, or whatever number of characters you can transmit transparently.
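As a hedged sketch of the compressed-or-stored flag byte mentioned above (the one-byte header layout is my own choice for illustration, not a standard format):

// Prefix each chunk with 1 if deflate shrank it, 0 if it is stored raw.
public static byte[] compressOrStore(byte[] input) {
    Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
    d.setInput(input);
    d.finish();
    byte[] buf = new byte[input.length + 64]; // room for deflate's worst-case overhead
    int len = d.deflate(buf);
    boolean complete = d.finished();
    d.end();
    byte[] out;
    if (complete && len < input.length) {
        out = new byte[len + 1];
        out[0] = 1;                           // flag: compressed
        System.arraycopy(buf, 0, out, 1, len);
    } else {
        out = new byte[input.length + 1];
        out[0] = 0;                           // flag: stored as-is
        System.arraycopy(input, 0, out, 1, input.length);
    }
    return out;
}

The decoder then reads the leading byte and either inflates the remainder or copies it verbatim.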

Java - My Huffman decompression doesn't work in rare cases. Can't figure out why

I just finished coding a Huffman compression/decompression program. The compression part of it seems to work fine, but I am having a little bit of a problem with the decompression. I am quite new to programming and this is my first time doing any sort of byte manipulation/file handling, so I am aware that my solution is probably awful :D.
For the most part my decompression method works as intended, but sometimes it drops data after decompression (i.e. my decompressed file is smaller than my original file).
Also, whenever I try to decompress a file that isn't a plain text file (for example a .jpg), the decompression returns a completely empty file (0 bytes). The compression handles these other types of files just fine, though.
Decompression method:
public static void decompress(File file) {
    try {
        BitFileReader bfr = new BitFileReader(file);
        int[] charFreqs = new int[256];
        TreeMap<String, Integer> decodeMap = new TreeMap<String, Integer>();
        File nF = new File(file.getName() + "_decomp");
        nF.createNewFile();
        BitFileWriter bfw = new BitFileWriter(nF);
        DataInputStream data = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
        int uniqueBytes;
        int counter = 0;
        int byteCount = 0;
        uniqueBytes = data.readUnsignedByte();

        // Read frequency table
        while (counter < uniqueBytes) {
            int index = data.readUnsignedByte();
            int freq = data.readInt();
            charFreqs[index] = freq;
            counter++;
        }

        // Build tree
        Tree tree = buildTree(charFreqs);

        // Build TreeMap
        fillDecodeMap(tree, new StringBuffer(), decodeMap);

        // Skip BitFileReader position to actual compressed code
        bfr.skip(uniqueBytes * 5);

        // Get total number of compressed bytes
        for (int i = 0; i < charFreqs.length; i++) {
            if (charFreqs[i] > 0) {
                byteCount += charFreqs[i];
            }
        }

        // Decompress data and write
        counter = 0;
        StringBuffer code = new StringBuffer();
        while (bfr.hasNextBit() && counter < byteCount) {
            code.append("" + bfr.nextBit());
            if (decodeMap.containsKey(code.toString())) {
                bfw.writeByte(decodeMap.get(code.toString()));
                code.setLength(0);
                counter++;
            }
        }

        bfw.close();
        bfr.close();
        data.close();
        System.out.println("Decompression successful!");
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public static void main(String[] args) {
    File f = new File("test");
    compress(f);
    f = new File("test_comp");
    decompress(f);
}
When I compress the file I save the "character" (byte) values and the frequencies of each unique "character", plus the compressed bytes, in the same file (all in binary form). I then use this saved info to fill my charFreqs array in my decompress() method, and then use that array to build my tree. The format of the saved structure looks like this:
<n><value 1><frequency>...<value n><frequency>[the compressed bytes]
(without the <> of course) where n is the number of unique bytes/characters in my original text (AKA my leaf values).
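For reference, here is a minimal sketch of how that header layout could be produced with a plain DataOutputStream (BitFileWriter above is the asker's own class, so this is only an approximation of the described format):

static void writeHeader(DataOutputStream out, int[] charFreqs) throws IOException {
    int uniqueBytes = 0;
    for (int f : charFreqs) {
        if (f > 0) uniqueBytes++;
    }
    out.writeByte(uniqueBytes);              // <n> - note: 256 unique values would overflow one byte
    for (int value = 0; value < charFreqs.length; value++) {
        if (charFreqs[value] > 0) {
            out.writeByte(value);            // <value i>: 1 byte
            out.writeInt(charFreqs[value]);  // <frequency i>: 4 bytes, matching readInt() above
        }
    }
}

Each table entry is 5 bytes, which matches the bfr.skip(uniqueBytes * 5) call in the decompress method.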
I have tested my code a bit and the bytes seem to get dropped somewhere in the while() loop at the bottom of my decompress method (charFreqs[] and the tree seem to retain all the original byte values).
EDIT: Upon request I have now shortened my post a bit in an attempt to make it less cluttered and more "straight to the point".
EDIT 2: I fixed it (but not fully)! The fault was in my BitFileWriter and not in my decompress method. My decompression still does not function properly, though. Whenever I try to decompress something that isn't a plain text file (for example a .jpg), it returns an empty "decompressed" file (0 bytes in size). I have no idea what is causing this...

Read faster a file & convert it into HEX

I need to read a file that is in ASCII and convert it into hex before applying some functions (searching for a specific character).
To do this, I read the file, convert it to hex and write it into a new file. Then I open my new hex file and apply my functions.
My issue is that it takes way too much time to read and convert it (approx. 8 seconds for a 9 MB file).
My reading method is :
public static void convertToHex2(PrintStream out, File file) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
    int value = 0;
    StringBuilder sbHex = new StringBuilder();
    StringBuilder sbResult = new StringBuilder();
    while ((value = bis.read()) != -1) {
        sbHex.append(String.format("%02X ", value));
    }
    sbResult.append(sbHex);
    out.print(sbResult);
    bis.close();
}
Do you have any suggestions to make it faster?
Did you measure what your actual bottleneck is? You read a very small amount of data in your loop and process it each time. You might as well read larger chunks of data and process those, e.g. using DataInputStream or whatever. That way you would benefit more from the optimized reads of your OS, file system, their caches, etc.
Additionally, you fill sbHex and append that to sbResult, just to print it somewhere. That looks like an unnecessary copy, because sbResult will always be empty in your case, and with sbHex you already have a StringBuilder for your PrintStream.
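A rough sketch of that chunked approach, assuming a 64 KB buffer (the size is arbitrary) and a precomputed hex table:

private static final char[] HEX = "0123456789ABCDEF".toCharArray();

public static void convertToHexBuffered(PrintStream out, File file) throws IOException {
    try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
        byte[] chunk = new byte[64 * 1024];    // read large blocks instead of single bytes
        StringBuilder sb = new StringBuilder();
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                int v = chunk[i] & 0xFF;
                sb.append(HEX[v >>> 4]).append(HEX[v & 0x0F]).append(' ');
            }
        }
        out.print(sb);                         // single write, no intermediate copy
    }
}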
Try this:
static String[] xx = new String[256];
static {
    for (int i = 0; i < 256; ++i) {
        xx[i] = String.format("%02X ", i);
    }
}
and use it:
sbHex.append(xx[value]);
Formatting is a heavy operation: it does not only do the conversion - it also has to parse the format string every time.

How to unpack a binary file in java?

Can somebody help me figure out how to do in Java what I do in Ruby with the code below?
The Ruby code below uses unpack('H*')[0] to store the complete binary file content in the variable "var" in ASCII (hex) format.
IO.foreach(ARGV[0]){ |l|
    var = l.unpack('H*')[0]
} if File.exists?(ARGV[0])
Update:
Hi Aru. I've tested what you suggest, in the form below:
byte[] bytes = Files.readAllBytes(testFile.toPath());
str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
But when I print the content of the variable "str", the printout shows only little squares, as if it is not decoding the content. I'd like to store in "str" the content of the binary file in ASCII (hex) format.
Update #2:
Hello Aru, I'm trying to store all the binary file's content in an array of bytes, but I don't know how to do it. It worked with "FileUtils.readFileToByteArray(myFile);", but that is an external library; is there a built-in option to do it?
File myFile = new File("./Binaryfile");
byte[] binary = FileUtils.readFileToByteArray(myFile); // I have issues here storing all the binary content in an array of bytes
String hexString = DatatypeConverter.printHexBinary(binary);
System.out.println(hexString);
Update #3:
Hello ursa and Aru, thanks for your help. I've tried both of your solutions and they work fine, but the Files.readAllBytes() documentation says it is not intended for big files, and the binary file I want to analyse is more than 2 GB :(. I see an option with your solutions: read chunk by chunk. The chunks inside the binary are separated by the sequence FF65, so is there a way to tweak your code to only process one chunk at a time, based on the chunk separator? If not, maybe with some external library.
Update #4:
Hello, I'm trying to modify your code since I'd like to read variable size chunks based of
value of "Var".
How can I set an offset to read the next chunk in your code?
I mean,
- in first iteration read the first 1024,
- In this step Var=500
- in 2d iteration read the next 1024 bytes, beginning from 1024 - Var = 1024-500 = 524
- In this step Var=712
- in 3rd iteration read the next 1024 bytes, beginning from 1548 - Var = 1548-712 = 836
- and so on
is there a method something like read(number of bytes, offset)?
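Regarding Update #4: a plain InputStream has no read(count, offset), but java.io.RandomAccessFile lets you seek to an absolute offset before each read. A minimal sketch (Var here is a placeholder for whatever value the asker computes per chunk):

try (RandomAccessFile raf = new RandomAccessFile("./Binaryfile", "r")) {
    byte[] chunk = new byte[1024];
    long offset = 0;
    while (offset < raf.length()) {
        raf.seek(offset);             // position the file pointer at an absolute offset
        int n = raf.read(chunk);      // reads up to 1024 bytes from that position
        if (n <= 0) break;
        // ... process chunk[0..n) here and compute the per-chunk Var ...
        long var = 0;                 // placeholder: the asker's Var value
        offset += n - var;            // e.g. 1024 - 500 = 524 for the second iteration
    }
}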
You can use the commons-codec Hex class + the commons-io FileUtils class:
byte[] binary = FileUtils.readFileToByteArray(new File("/Users/user/file.bin"));
String hexEncoded = Hex.encodeHexString(binary);
But if you just want to read the content of a TEXT file, you can use:
String content = FileUtils.readFileToString(new File("/Users/user/file.txt"), "ISO-8859-1");
With JRE 7 you can use standard classes:
public static void main(String[] args) throws Exception {
    Path path = Paths.get("path/to/file");
    byte[] data = Files.readAllBytes(path);
    char[] hexArray = "0123456789ABCDEF".toCharArray();
    char[] hexChars = new char[data.length * 2];
    for (int j = 0; j < data.length; j++) {
        int v = data[j] & 0xFF;
        hexChars[j * 2] = hexArray[v >>> 4];
        hexChars[j * 2 + 1] = hexArray[v & 0x0F];
    }
    System.out.println(new String(hexChars));
}
This should do what you want:
try {
    File inputFile = new File("someFile");
    byte[] inputBytes = Files.readAllBytes(inputFile.toPath());
    String hexCode = DatatypeConverter.printHexBinary(inputBytes);
    System.out.println(hexCode);
} catch (IOException e) {
    System.err.println("Couldn't read file: " + e);
}
If you don't want to read the entire file at once, you can do that as well. You'll need an InputStream of some sort.
File inputFile = new File("C:\\Windows\\explorer.exe");
try (InputStream input = new FileInputStream(inputFile)) {
byte inputBytes[] = new byte[1024];
int readBytes;
// Read until all bytes were read
while ((readBytes = input.read(inputBytes)) != -1) {
System.out.printf("%4d bytes were read.\n", readBytes);
System.out.println(DatatypeConverter.printHexBinary(inputBytes));
}
} catch (FileNotFoundException ex) {
System.err.println("Couldn't read file: " + ex);
} catch (IOException ex) {
System.err.println("Error while reading file: " + ex);
}

computing checksum for an input stream

I need to compute a checksum for an InputStream (or a file) to check if the file contents have changed. The code below generates a different value on each execution, even though I'm using the same stream. Can someone help me do this right?
public class CreateChecksum {
    public static void main(String args[]) {
        String test = "Hello world";
        ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
        System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
    }

    public static String checkSum(InputStream fis) {
        String checksum = null;
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Using MessageDigest update() method to provide input
            byte[] buffer = new byte[8192];
            int numOfBytesRead;
            while ((numOfBytesRead = fis.read(buffer)) > 0) {
                md.update(buffer, 0, numOfBytesRead);
            }
            byte[] hash = md.digest();
            checksum = new BigInteger(1, hash).toString(16); // don't use this, truncates leading zeros
        } catch (Exception ex) {
        }
        return checksum;
    }
}
You're using the same stream object for both calls - after you've called checkSum once, the stream will not have any more data to read, so the second call will be creating a hash of an empty stream. The simplest approach would be to create a new stream each time:
String test = "Hello world";
byte[] bytes = test.getBytes(StandardCharsets.UTF_8);
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
System.out.println("MD5 checksum for file using Java : "
+ checkSum(new ByteArrayInputStream(bytes)));
Note that your exception handling in checkSum really needs fixing, along with your hex conversion...
Check out the code in org.apache.commons.codec.digest.DigestUtils.
Changes to a file are relatively easy to monitor: File.lastModified() changes each time a file is changed (and closed). There is even a built-in API to get notified of selected changes to the file system: http://docs.oracle.com/javase/tutorial/essential/io/notification.html
The hashCode of an InputStream is not suitable for detecting changes (there is no definition of how an InputStream should calculate its hashCode - quite likely it's using Object.hashCode, meaning the hashCode doesn't depend on anything but object identity).
Building an MD5 like you are trying works, but requires reading the entire file every time. That is quite a performance killer if the file is large and/or you are watching multiple files.
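A minimal sketch of that notification API (WatchService), watching one directory for modifications:

Path dir = Paths.get(".");                       // directory containing the file(s) to watch
try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
    dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
    WatchKey key = watcher.take();               // blocks until something in dir changes
    for (WatchEvent<?> event : key.pollEvents()) {
        System.out.println("Modified: " + event.context());
    }
    key.reset();                                 // re-arm the key for further events
}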
You are confusing two related, but different responsibilities.
First you have a Stream which provides stuff to be read. Then you have a checksum on that stream; however, your implementation is a static method call, effectively divorcing it from a class, meaning that nobody has the responsibility for maintaining the checksum.
Try reworking your solution like so
public class ChecksumInputStream extends InputStream {
    private InputStream in;

    public ChecksumInputStream(InputStream source) {
        this.in = source;
    }

    @Override
    public int read() throws IOException {
        int value = in.read();
        updateChecksum(value);
        return value;
    }

    // and repeat for all the other read methods.
}
Note that now you only do one read, with the checksum calculator decorating the original input stream.
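Note that the JDK already ships exactly this kind of decorator as java.security.DigestInputStream, which updates a MessageDigest as bytes are read; a usage sketch:

MessageDigest md = MessageDigest.getInstance("MD5");
try (InputStream in = new DigestInputStream(new FileInputStream("somefile"), md)) {
    byte[] buffer = new byte[8192];
    while (in.read(buffer) != -1) {
        // reading transparently updates the digest
    }
}
byte[] hash = md.digest();                       // MD5 of everything that was read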
The issue arises after you first read the InputStream: the position has reached the end. The quick way to resolve your issue is:
ByteArrayInputStream bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
bis = new ByteArrayInputStream(test.getBytes());
System.out.println("MD5 checksum for file using Java : " + checkSum(bis));
