Java algorithm to Compress small numeric number - java

I need to compress a numeric string of 20-40 characters down to a 6-character number. So far I have tried Huffman and some Zip algorithms, but I am not getting the desired outcome.
Can someone please suggest another algorithm or API for this in Java?
Example:
Input: 98765432101234567890
Desired Output: 123456
Please note: I don't mean that the output has to be 123456 for the given input. I only mean that if I specify a 20-byte number, it should be compressed to a 6-byte number.
Usage: The compressed number will be fed to a device (which can only take up to 6 numeric chars). The device will decode the number back to the original number.
Assumption/Limits:
If required both client and device(server) can share some common
properties required for encoding/decoding the number.
Only one request can be made to a device i.e. all data should be fed
in one request, no chunk of small packets
Thanks.

This will be the best you can get assuming that any combination of digits is a legal input:
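import java.math.BigInteger;
final String s = "98765432101234567890";
// Parse the digits and print each byte of the binary form as two hex digits.
for (byte b : new BigInteger('0' + s).toByteArray())
    System.out.format("%02x ", b & 0xff);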
Prints
05 5a a5 4d 36 e2 0c 6a d2
Storing a number in binary form is theoretically the most efficient way since every combination of bits is a distinct legal value.
You may have other options only if there is more redundancy in your input, that is there are some constraints on the legal digit combinations.
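To make the round trip concrete, here is a minimal sketch of packing the digits into bytes and decoding them again. The class and method names are made up for illustration, and it assumes both sides know the original number of digits (the question says client and device may share such properties), so leading zeros can be restored.

```java
import java.math.BigInteger;
import java.util.Arrays;

class DigitPacking {
    // Encode a string of decimal digits as a big-endian byte array.
    static byte[] encode(String digits) {
        byte[] raw = new BigInteger(digits).toByteArray();
        // toByteArray() may prepend a 0x00 sign byte; it is not needed for non-negative values.
        if (raw.length > 1 && raw[0] == 0) {
            return Arrays.copyOfRange(raw, 1, raw.length);
        }
        return raw;
    }

    // Decode the bytes back to a digit string, left-padded with zeros to the shared width.
    static String decode(byte[] bytes, int width) {
        String s = new BigInteger(1, bytes).toString();
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        String input = "98765432101234567890";
        byte[] packed = encode(input);
        System.out.println(packed.length + " bytes");           // 9 bytes
        System.out.println(decode(packed, input.length()));     // the original digits
    }
}
```

Even in this densest form the 20-digit example needs 9 bytes, so it still does not fit into 6 numeric characters.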

The way you specify it, this is not possible. There simply are more 20-digit numbers than there are 6-digit numbers, so if you map 20 digits to only six digits, some 20-digit numbers will have to be mapped to the same six-digit number. If you know that not all numbers will be valid, or that they are not equally likely, this can be used for compression, but otherwise this is impossible.
Although a reversible (bijective) mapping from 20-digit numbers to six-digit numbers is impossible, it is still possible to map long numbers to shorter output. This works by relaxing the requirement that the output needs to be a number. The only important consideration is that the output sequence needs to have at least as many possibilities as the input. Here is an example:
There are 10^20 possible 20 digit numbers
If you use a sequence of full 8-bit ASCII (256 characters) of length x you will have 256^x possible outputs. If you solve this for x, you will notice that 256^9 > 10^20, so 9 ASCII characters are enough to encode 10^20 possible numerical inputs.
Marko's answer to the same question will tell you how to convert a number to its byte representation, which may be used as input to the device. But be aware that this input will not be numerical and may contain many strange symbols.
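The counting argument can be checked with a few lines of Java. This is only a sketch; the class and method names are invented for the example. It finds the smallest length x for which alphabetSize^x is at least 10^digits.

```java
import java.math.BigInteger;

class CodeLength {
    // Smallest x such that alphabetSize^x >= 10^digits.
    static int minSymbols(int alphabetSize, int digits) {
        BigInteger inputs = BigInteger.TEN.pow(digits);
        BigInteger base = BigInteger.valueOf(alphabetSize);
        BigInteger capacity = BigInteger.ONE;
        int x = 0;
        while (capacity.compareTo(inputs) < 0) {
            capacity = capacity.multiply(base);
            x++;
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(minSymbols(256, 20)); // 9 bytes for 20 decimal digits
        System.out.println(minSymbols(10, 20));  // 20 decimal digits, no saving
        System.out.println(minSymbols(256, 40)); // 17 bytes for 40 decimal digits
    }
}
```

With a 10-symbol alphabet (numeric characters only) the minimum is the input length itself, which is why 6 numeric characters cannot hold an arbitrary 20-digit number.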

Related

How is an Integer converted to ASCII in Java Behind the Scenes?

I'm not asking for a built-in class that accomplishes this; I'm just curious how encoding works behind the scenes in Java. For example, an integer in Java is stored in 4 bytes and ranges between -2147483648 and 2147483647. Let's use 500 as the number for this demonstration. From what I understand, the computer initially stores this number in memory as 1F4 in hex, which is 00000000 00000000 00000001 11110100 in binary. When I looked up how ASCII works, it encodes each digit 0-9 to its corresponding ASCII value (0 translates to 048). However, how is the binary number stored in RAM separated into digits so that each digit can be encoded to its corresponding ASCII value? We know that the number is 500, but this is just an abstraction. The computer just sees 1's and 0's. So how is the 5 mapped to 053, and both 0's mapped to 048 in this example? Does the JVM account for this automatically behind the scenes, or am I misunderstanding how the entire process works? Thanks.
Well, if I understand your question correctly...
You could start by dividing your integer by 10 and getting the remainder -- the mod function in Java (%) is useful for this.
int digitInteger = originalNumber % 10;
The digitInteger variable now holds a number 0-9, the integer value of the last digit in the number (not yet its ASCII or Unicode code). Unfortunately for your example, this will be a fairly boring 0.
Let us instead use the number 543 as our example; it will make things easier. If originalNumber was 543, then digitInteger will now be 3. Since 48 represents ASCII 0, you can add that to digitInteger to get the ASCII equivalent. I'll leave it to you to put that somewhere, since Java does not deal with ASCII as its default encoding. (Unicode, I believe, has the same values as ASCII for these digits.)
Now you execute something like
originalNumber = originalNumber / 10;
This integer division will truncate the digit you just isolated, 3 in our case, and originalNumber is now 54.
You repeat this process - mod by 10 to isolate the last digit, convert to character, prepend to characters previously found, integer divide by 10 to truncate the last digit -- until originalNumber is 0 -- digitInteger, as you can probably see now, will be 4 on the next round and 5 on the round after that, and you prepend those to the 3 you got the first time to get your ASCII equivalent.
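Put together, the loop described above could look like the following sketch. It is only an illustration for non-negative numbers, not the JDK's actual implementation, and the class and method names are made up.

```java
class IntToAscii {
    // Convert a non-negative int to its decimal digits by repeated % 10 and / 10.
    static String toDecimalString(int originalNumber) {
        if (originalNumber == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (originalNumber > 0) {
            int digitInteger = originalNumber % 10;   // isolate the last digit, 0-9
            char c = (char) ('0' + digitInteger);     // '0' has code 48, so 3 becomes 51 ('3')
            sb.insert(0, c);                          // prepend to the digits found so far
            originalNumber = originalNumber / 10;     // truncate the digit just handled
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDecimalString(543)); // 543
        System.out.println(toDecimalString(500)); // 500
    }
}
```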

Size and Size on disk of a .txt file

Opened up a new file in Notepad and inserted the sentence without the quotes, "Four score and seven years ago" in it.
Four 4 characters
score 5 characters
and 3 characters
seven 5 characters
years 5 characters
ago 3 characters
TOTAL : 25 + 5 spaces = 30 characters.
You will find that the file has a size of 30 bytes on disk: 1 byte for each character.
Saved the file to disk under the name gettingSize.txt.
Then look at the size of the file.
As a rule, each character consumes a byte.
Size : 30 bytes
Size on Disk : 4.00 KB (4,096 bytes)
The below paragraphs are copy pasted from a pdf.
If you were to look at the file as a computer looks at it, you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character (see below). So on disk, the numbers for the file look like this:
F   o   u   r   (sp) a   n   d   (sp) s   e   v   e   n
70  111 117 114 32   97  110 100 32   115 101 118 101 110
By looking in the ASCII table, you can see a one-to-one correspondence between each character and the ASCII code used. Note the use of 32 for a space -- 32 is the ASCII code for a space. We could expand these decimal numbers out to binary numbers (so 32 = 00100000) if we wanted to be technically correct -- that is how the computer really deals with things.
1) I know that everything is stored in the form of bits and bytes, so what does this generally mean: "you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character"? A byte is 8 bits. So how can a byte, which holds only 0s and 1s, contain an ASCII number (e.g. 49 for '1')?
2) What exactly is the difference between Size and Size on Disk? And how do ASCII and Unicode fit into it?
3) In Java, Strings are objects. Can I say a String is multiple characters concatenated together?
String str = "Four score and seven years ago";
So how is str stored in memory? Is it stored in the same manner as when saving the Notepad file?
Files are stored in blocks. If the file size is smaller than the block size (in your case, 4 KB), the file still occupies a whole block, but most of that block's space is unused. I think this question was answered on SuperUser; I'll find the link.
UPDATE: https://superuser.com/questions/704218/why-is-there-such-a-big-difference-between-size-and-size-on-disk
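The block rounding can be written down as simple arithmetic. This is a simplified sketch with made-up names; real file systems can deviate from it (some store very small files differently, for example), so treat it as an approximation of what the "Size on disk" field shows.

```java
class DiskSize {
    // Round the file size up to a whole number of blocks.
    static long sizeOnDisk(long fileSize, long blockSize) {
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        return blocks * blockSize;
    }

    public static void main(String[] args) {
        System.out.println(sizeOnDisk(30, 4096));   // 4096
        System.out.println(sizeOnDisk(4097, 4096)); // 8192
    }
}
```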
To make a few short points:
"How can a byte contain an ASCII number (e.g. 49 for '1') other than 0 and 1?"
A byte is 8 bits. Thus you can store numbers between 0 and 255 in it; 49, for example, is the bit pattern 00110001.
"What is the difference between file size and size on disk?"
See MJafar Mash's answer: "size" is the actual size in bytes and "size on disk" is the number of bytes you need to allocate as blocks for the file to be placed in.
"In Java Strings are Objects. Can I say that a String is multiple characters concatenated together?"
Yes, but it's actually more complicated than that:
Taken from this answer:
Initializes a newly created String object so that it represents the same sequence of characters as the argument; in other words, the newly created string is a copy of the argument string. Unless an explicit copy of original is needed, use of this constructor is unnecessary since Strings are immutable.
1) I know that everything is stored in the form of bits and bytes, so what does this generally mean: "you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character"? A byte is 8 bits. So how can a byte, which holds only 0s and 1s, contain an ASCII number (e.g. 49 for '1')?
Each ASCII character occupies 1 byte. Internally, each character is stored as its ASCII number. A byte holds 8 bits, so the largest value it can store is 2^8 - 1 = 255 and the range is 0-255 (ASCII itself only uses 0-127).
2) What exactly is the difference between Size and Size on Disk? And how do ASCII and Unicode fit into it?
Each ASCII character is 1 byte. So, 30 bytes is the actual size of the data in the file. Next, the 4 KB is the size of the segment/block in which the file is stored. In your case it is the minimum "new" space given to any file on the disk.
3) In Java, Strings are objects. Can I say a String is multiple characters concatenated together? String str = "Four score and seven years ago"; So how is str stored in memory? Is it stored in the same manner as when saving the Notepad file?
Yes. Strings are indeed (internally) multiple characters concatenated together, but the characters cannot be changed because a String is immutable. A String is an object that holds its characters as a sequence of char values, and in Java each char is a 2-byte UTF-16 code unit, so this is not the same layout as the 1-byte-per-character Notepad file. The charset used when converting a String to and from bytes (for example when writing a file) is a separate setting: the default depends on the platform and JVM configuration (often UTF-8), and you can also specify it explicitly.
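A quick way to see the difference between the characters of a String and the bytes written to a file is to encode the same String with different charsets. This is a small sketch; the class name is made up, and the charsets are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class StringBytes {
    public static void main(String[] args) {
        String str = "Four score and seven years ago";

        // One byte per character when encoded as ASCII, as in the Notepad file.
        byte[] ascii = str.getBytes(StandardCharsets.US_ASCII);
        System.out.println(ascii.length); // 30
        System.out.println(ascii[0]);     // 70, the code for 'F'

        // Two bytes per character when encoded as UTF-16 (no byte order mark with UTF_16BE).
        byte[] utf16 = str.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf16.length); // 60

        // A char is just a number: '1' has code 49.
        System.out.println((int) '1');    // 49
    }
}
```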

Hamming Code: Number of parity bits

I'm trying to write a method in java that will take an input of any number of 0 or 1 digits and output that line after being encoded with Hamming Code.
I have managed to write the code when knowing the number of digits the input will have (in this case 16) because knowing the number of digits in the input, I immediately know the number of parity bits there have to be added (5 in this case) to a total of 21 digits in the final output. I am working with int arrays so I need to declare a size in the beginning and my code works based on those exact sizes.
Can you guys think of any way/algorithm that can give me the number of digits the output will have (after adding the relevant parity digits to the number of input digits) based solely on the number of input digits?
Or do I have to tackle this problem in a totally different way? Any suggestions? Thank you in advance!
Cheers!
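From my understanding, you get your 6th parity bit at 32 bits, the 7th at 64, etc., so a first estimate is floor(lg(n)) + 1, which in Java you can get by using 32 - Integer.numberOfLeadingZeros(n). Strictly, a single-error-correcting Hamming code needs the smallest r with 2^r >= n + r + 1, where n is the number of input bits; the estimate matches for n = 16 (r = 5) but is one too small for some lengths, for example n = 27 needs r = 6.
Assuming your input is made up entirely of 0s and 1s, the estimate is
int parityDigits = 32 - Integer.numberOfLeadingZeros(input.length());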
Is your input a String or individual bits? If you input as a String, you can convert each character to a bit, and the length of the String gives you the length of the array.
If you need to input the bits one at a time, store them in an ArrayList. When all bits have been entered, you can convert your list to an array easily, or use the size of the list etc.
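For sizing the output array, the exact count can be found with a short loop. This is a sketch under the assumption of a plain single-error-correcting Hamming code (no extra overall parity bit); the class and method names are invented for the example.

```java
class HammingBits {
    // Smallest r with 2^r >= dataBits + r + 1.
    static int parityBits(int dataBits) {
        int r = 0;
        while ((1L << r) < (long) dataBits + r + 1) {
            r++;
        }
        return r;
    }

    // Total number of digits in the encoded output.
    static int encodedLength(int dataBits) {
        return dataBits + parityBits(dataBits);
    }

    public static void main(String[] args) {
        System.out.println(parityBits(16));    // 5
        System.out.println(encodedLength(16)); // 21
        System.out.println(parityBits(27));    // 6
    }
}
```

With this, the int array for the output can be declared as new int[encodedLength(input.length())] once the input length is known.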

Algorithm / Number format for "less storage taking" numbers

I have a very serious problem to address. I have a list of 75000 words. Each word is assigned a number for easy identification. The first word is assigned 0, the last word is assigned 75000. Now, I have a list of sentences. Let's take one sentence as an example.
I have big dog
When you represent this with the assigned numbers, it becomes 20 123 2332 3434. This simply means that the word I appears as the 20th word in our list, the word have appears as the 123rd word in our list, the word big appears as the 2332nd word, and so on.
Just like this, I have more than 2 billion sentences, and I need to save/write their numerical representation. We felt that saving long numbers like 20 123 2332 3434 for 2 billion records will take a huge amount of space. Instead, if we can represent them using a shorter number system like F3x G6e rRr, it will save storage space.
How can I achieve this? Maybe using hexadecimal numbers? I used this converter and it seems there is not much difference, because the number 123456 in hexadecimal is 1e240, the number 75000 in hexadecimal is 124f8, and so on; the number of characters is about the same, so I am not sure whether it is going to save any space.
Please provide me your advice to achieve this task. I will be writing this function in Java.
Decimal numbers give you 10 possibilities per byte. Hexadecimal numbers give you 16. If you could use all possible bit-patterns, you would have 256 possibilities per byte, equivalent to storing two hex digits in the space of one. Depending on how you store and retrieve your data, you may find that http://en.wikipedia.org/wiki/Base64 encoding avoids corruptions if e.g. you cannot store zero bytes, or some other bit patterns, such as bit patterns with the high bit set.
There are possibilities for more sophisticated compression. One would be simply to use a standard compressor, such as that provided in Java by the package java.util.zip, or equivalents in other languages. Another, if you know how common words are, would be to sort the words so that common words have low numbers, and therefore shorter numbers. You could also look up http://en.wikipedia.org/wiki/Huffman_coding. This would allow you to avoid having spaces between the numbers, and would also give more frequent words shorter sequences of bits.
Implement a binary representation of your string. The first 16/32 bits represent the length of the string n, followed by n 17-bit integers representing the indexes in your array of 75000 words. The number 17 is the base-2 logarithm of 75000 (about 16.2) rounded up. So your example will become (assuming 16 bits for the length):
0000 0000 0000 0100 | 0 0000 0000 0001 0100 | 0 0000 0000 0111 1011
      length = 4    |      index 20         |      index 123
0 0000 1001 0001 1100 | 0 0000 1101 0110 1010
      index 2332      |      index 3434
Then you can convert that stream of bits to/from a binary file using for example Robert Sedgewick's BinaryIn and BinaryOut classes. Note that the string above now only requires 16 + 4 * 17 = 84 bits, which is 11 bytes after padding to a whole byte.
You could use Huffman compression to compress the binary stream if you knew the distribution of the words beforehand. This could save a lot of space if the distribution is skewed towards a small subset of the words.
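The fixed-width layout described above (a 16-bit length followed by 17-bit indexes) can be sketched without any external classes by writing bits into a byte array directly. The class and method names here are made up for illustration, and the code assumes every index fits into 17 bits and the sentence length into 16 bits.

```java
import java.util.Arrays;

class WordIndexPacker {
    static final int LENGTH_BITS = 16; // bits for the number of words
    static final int INDEX_BITS = 17;  // bits per word index (enough for 75000 words)

    static byte[] pack(int[] indexes) {
        int totalBits = LENGTH_BITS + INDEX_BITS * indexes.length;
        byte[] out = new byte[(totalBits + 7) / 8]; // round up to whole bytes
        int pos = writeBits(out, 0, indexes.length, LENGTH_BITS);
        for (int index : indexes) {
            pos = writeBits(out, pos, index, INDEX_BITS);
        }
        return out;
    }

    static int[] unpack(byte[] data) {
        int n = readBits(data, 0, LENGTH_BITS);
        int[] indexes = new int[n];
        for (int i = 0; i < n; i++) {
            indexes[i] = readBits(data, LENGTH_BITS + i * INDEX_BITS, INDEX_BITS);
        }
        return indexes;
    }

    // Write the low `width` bits of value, most significant bit first, starting at bit offset pos.
    private static int writeBits(byte[] buf, int pos, int value, int width) {
        for (int i = width - 1; i >= 0; i--, pos++) {
            if (((value >>> i) & 1) != 0) {
                buf[pos / 8] |= (byte) (0x80 >>> (pos % 8));
            }
        }
        return pos;
    }

    // Read `width` bits starting at bit offset pos, most significant bit first.
    private static int readBits(byte[] buf, int pos, int width) {
        int value = 0;
        for (int i = 0; i < width; i++, pos++) {
            int bit = (buf[pos / 8] >>> (7 - pos % 8)) & 1;
            value = (value << 1) | bit;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] packed = pack(new int[] {20, 123, 2332, 3434});
        System.out.println(packed.length);                   // 11
        System.out.println(Arrays.toString(unpack(packed))); // [20, 123, 2332, 3434]
    }
}
```

For comparison, the text form "20 123 2332 3434" takes 16 bytes, and the gap widens for sentences whose word indexes have five digits.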

Generation of a unique id of size less than 11 bytes from a string

I am developing a piece of code to generate a unique hexadecimal value from an input string. The output size must be less than 11 bytes, which comes as a requirement. Can someone please give me some insight into this? I have done the string-to-binary conversion and then the hexadecimal mapping, which produces a combination of alphanumeric characters, but the size is always greater than 11 bytes. I also need to regenerate the input from this unique id. Is that possible?
Thanks in advance
If your result must be absolutely unique and your input can be any length, then your task is impossible.
Think of it that way: how many different combinations of 11 bytes are there? 256^11 (or 2^(11*8) = 2^88).
That's a big number, right? Yes, but it's not big enough.
For simplicity's sake we'll talk about ASCII strings only, so we have 128 different values (in reality there are many more possibilities for a character in a Java String, but the principle stays the same. For simplicity's sake we also ignore that a \0 character in a String is kind of unlikely).
Now, there are 128^13 different 13-character ASCII strings. That's 2^(7*13) or 2^91 different combinations. Obviously you can't have a unique id out of 2^88 possible ids for 2^91 different strings.
Less than 11 bytes means maximum 10 bytes.
Ten bytes are 80 bits, so there are 2^80 (256^10) possible values; note that 8^10 is only 1073741824 (2^30).
2^80 is a huge number.
So if you take your hex value modulo 2^80, the remainder fits into 10 bytes. Convert the remainder back to hex.
Regenerating the input will not be possible. If your input is allowed to be longer than 11 bytes, it will not be possible. That would be an endless compression.
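As a rough sketch of the modulo idea (the class and method names are made up, and UTF-8 is an assumed encoding for the input string), the reduction to at most 10 bytes could look like this. It also shows why the result is neither unique nor reversible for longer inputs: taking the value modulo 2^80 keeps only the last 10 bytes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class ShortId {
    // 2^80, the number of distinct 10-byte values.
    static final BigInteger MODULUS = BigInteger.ONE.shiftLeft(80);

    // Interpret the string's bytes as one big number and reduce it modulo 2^80.
    static String shortHexId(String input) {
        BigInteger value = new BigInteger(1, input.getBytes(StandardCharsets.UTF_8));
        return value.mod(MODULUS).toString(16); // at most 20 hex digits (10 bytes)
    }

    public static void main(String[] args) {
        System.out.println(shortHexId("abc"));          // 616263
        // Inputs longer than 10 bytes that share their last 10 bytes collide:
        System.out.println(shortHexId("AAhelloworld"));
        System.out.println(shortHexId("BBhelloworld"));
    }
}
```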
