Compressing a string in java with limited characters allowed

One of my friends got this interview question. In addition, he was told he could assume the characters were letters a to z (upper or lower case). I wrote the following, but I can't figure out how to use the assumption about the limited characters (a to z) the string contains. Am I using this assumption without realizing it or can I make use of it?
public static String compress(String str){
int count = 1;
char c = str.charAt(0);
StringBuffer result = new StringBuffer();
for (int i = 1; i < str.length();i++){
if (str.charAt(i) == c){
count++;
}
else{
String to_add = c + String.valueOf(count);
result.append(to_add);
count = 1;
c = str.charAt(i);
}
}
// last character
String to_add = c + String.valueOf(count);
result.append(to_add);
String result_str = result.toString();
// Check whether the compressed string is
// actually smaller than the original one
if (result_str.length() < str.length()){
return result_str;
}
else{
return str;
}
}

Assign each character to a number, eg a = 1, z = 26. So, to represent these 26 characters you need at least 5 bits.
You can now use 2 bytes (16 bits) to store a triplet of characters. This requires 1/3 less bytes than the initial one byte per character (if ascii). To store a triplet of characters read bits from your bytes (for example left to right).
First five bits of the first byte will represent the first character
The next three bits of the first byte, concatenated with the first two bits of the second byte represent the second
the next five bits from second byte represent the third character
there is one bit left (ignore it)
*To slightly improve in compression size, if your String's length % 3 = 1, then for the last character of your String you can use one byte only as you don't have another triplet.
**You can get if a specific bit is set on a byte using the algorithm from this post, which is:
public byte getBit(byte b, int position)
{
// the shift promotes b to int, so the result must be cast back to byte
return (byte) ((b >> position) & 1);
}
***You can set a bit to a byte using the algorithms from this post, which are:
To set a bit (set it to one):
b = (byte) (b | (1 << position));
To unset a bit (set it to zero):
b = (byte) (b & ~(1 << position));
****Using maths (least common multiple of 5 and 8), you could even slightly improve in compression size if you used 5 bytes = 40bits, which can represent 8 characters (8x5=40).
Then you would store octets of characters and there are no bits to ignore now. For the last characters of your String, depending on (string size % 8), you could again use less bytes.
*****Using the last 5-byte approach you save 3/8 of the size, which is better than the 1/3 saved by the 2-byte approach.
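The 5-bit idea above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the class and method names (FiveBitPacker, pack, unpack) are made up, it assumes lowercase a to z only, and it writes the bits continuously instead of grouping them into triplets or octets. Codes run from 1 to 26 as in the answer, so a code of 0 can mark unused padding bits at the end.

```java
class FiveBitPacker {
    // Packs lowercase letters at 5 bits per character (a = 1 ... z = 26).
    static byte[] pack(String s) {
        byte[] out = new byte[(s.length() * 5 + 7) / 8]; // round up to whole bytes
        int bitPos = 0;
        for (int i = 0; i < s.length(); i++) {
            int code = s.charAt(i) - 'a' + 1;
            if (code < 1 || code > 26) {
                throw new IllegalArgumentException("only a-z supported in this sketch");
            }
            for (int b = 4; b >= 0; b--) { // most significant bit first
                if (((code >> b) & 1) == 1) {
                    out[bitPos / 8] |= (0x80 >> (bitPos % 8));
                }
                bitPos++;
            }
        }
        return out;
    }

    // Reads 5 bits at a time; a code of 0 means only padding is left.
    static String unpack(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int totalBits = data.length * 8;
        for (int bitPos = 0; bitPos + 5 <= totalBits; bitPos += 5) {
            int code = 0;
            for (int b = 0; b < 5; b++) {
                int pos = bitPos + b;
                int bit = (data[pos / 8] >> (7 - pos % 8)) & 1;
                code = (code << 1) | bit;
            }
            if (code == 0) {
                break; // padding bits
            }
            sb.append((char) ('a' + code - 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("abcdefgh");
        System.out.println(packed.length);  // prints 5
        System.out.println(unpack(packed)); // prints abcdefgh
    }
}
```

Supporting upper and lower case together would need 6 bits per character instead of 5, with the same structure.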

'a' to 'Z' is 2*26=52 distinct characters, and it fits in 6-bits (2^6=64). You could just pack the code-points into sextets.
OTOH, RLE (what you have coded) works only for repetitions. If you have input like abcde, it would turn into a1b1c1d1e1 or something alike, which is longer than the input, so you can hardly call it compression (your code falls back to returning the original string in that case).

Related

US-ASCII string (de-)compression into/from a byte array (7 bits/character)

As we all know, ASCII uses 7 bits to encode a character, so I expected the number of bytes used to represent a text to be less than the number of letters in it.
For example:
StringBuilder text = new StringBuilder();
IntStream.range(0, 160).forEach(x -> text.append("a")); // generate 160 text
int letters = text.length();
int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
System.out.println(letters); // expected 160, actual 160
System.out.println(bytes); // expected 140, actual 160
I always get letters = bytes, but I expected letters > bytes.
The main problem: in the SMPP protocol the SMS body must be <= 140 bytes. With 7-bit ASCII encoding you could fit 160 letters (140*8/7 = 160), so I'd like the text to be encoded as 7-bit ASCII. We are using the JSMPP library.
Can anyone explain this to me and point me in the right direction? Thanks in advance (:

(160*8 - 160*7)/8 = 20, so you expect to use 20 fewer bytes by the end of your script. However, the smallest unit you can store is an 8-bit byte, so even if a character does not use all of its bits, the spare bit cannot simply be given to another value. You are still using 8-bit bytes for your ASCII codes, and that's why you get the same number. For example, the lowercase "a" is 97 in ASCII:
01100001
Note that the leading zero is still there, even though it is not used. You can't just use it to store part of another value.
In conclusion, in plain ASCII the number of letters always equals the number of bytes.
(Or imagine putting size 7 object into size 8 boxes. You can't hack the objects to pieces, so the number of boxes must equal the number of objects - at least in this case.)
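The arithmetic can be checked with a small sketch. The class and method names here (SeptetMath, encodedBytes, packedBytes) are hypothetical; the point is only that the standard US-ASCII encoder emits one byte per character, while 140 bytes is what you would need if the 7-bit codes were packed back to back by your own code.

```java
import java.nio.charset.StandardCharsets;

class SeptetMath {
    // Bytes produced by the standard US-ASCII encoder: one byte per character.
    static int encodedBytes(String text) {
        return text.getBytes(StandardCharsets.US_ASCII).length;
    }

    // Bytes needed if 7-bit codes were packed back to back, rounded up.
    static int packedBytes(int characters) {
        return (characters * 7 + 7) / 8;
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 160; i++) {
            text.append('a');
        }
        System.out.println(encodedBytes(text.toString())); // prints 160
        System.out.println(packedBytes(160));              // prints 140
    }
}
```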

Here is a quick & dirty solution without any libraries, i.e. only JRE on-board means. It is not optimised for efficiency and does not check if the message is indeed US-ASCII, it just assumes it. It is just a proof of concept:
package de.scrum_master.stackoverflow;
import java.util.BitSet;
public class ASCIIConverter {
public byte[] compress(String message) {
BitSet bits = new BitSet(message.length() * 7);
int currentBit = 0;
for (char character : message.toCharArray()) {
for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
if ((character & 1 << bitInCharacter) > 0)
bits.set(currentBit);
currentBit++;
}
}
return bits.toByteArray();
}
public String decompress(byte[] compressedMessage) {
BitSet bits = BitSet.valueOf(compressedMessage);
int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
char character = (char) bits.get(currentBit, currentBit + 7).toByteArray()[0];
decompressedMessage.append(character);
}
return decompressedMessage.toString();
}
public static void main(String[] args) {
String[] messages = {
"Hello world!",
"This is my message.\n\tAnd this is indented!",
" !\"#$%&'()*+,-./0123456789:;<=>?\n"
+ "#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
+ "`abcdefghijklmnopqrstuvwxyz{|}~",
"1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
};
ASCIIConverter asciiConverter = new ASCIIConverter();
for (String message : messages) {
System.out.println(message);
System.out.println("--------------------------------");
byte[] compressedMessage = asciiConverter.compress(message);
System.out.println("Number of ASCII characters = " + message.length());
System.out.println("Number of compressed bytes = " + compressedMessage.length);
System.out.println("--------------------------------");
System.out.println(asciiConverter.decompress(compressedMessage));
System.out.println("\n");
}
}
}
The console log looks like this:
Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!
This is my message.
And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
And this is indented!
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

Based on the encoding type, Byte length would be different. Check the below example.
String text = "0123456789";
byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(b1.length);
// prints "10"
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length);
// prints "10"
byte[] utf16= text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length);
// prints "22"
byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"

Nope. In "modern" environments (since 3 or 4 decades ago), the ASCII character encoding for the ASCII character set uses 8 bit code units which are then serialized to one byte each. This is because we want to move and store data in "octets" (8-bit bytes). This character encoding happens to always have the high bit set to 0.
You could say there was, long ago, a 7-bit character encoding for the ASCII character set. Even then, data might have been moved or stored as octets. The high bit would be used for some application-specific purpose such as parity. Some systems would zero it out in an attempt to increase interoperability, but in the end this hindered interoperability because they were not "8-bit safe". With strong Internet standards, such systems are almost all in the past.

Java int value type to Character

I am new to Java and am working on a crude calculator that takes input and holds it in an ArrayList to perform the calculations.
I am trying to add two values in an ArrayList<Character> and then replace the "+" with the sum.
if(listEqu.contains('+')) {
while(listEqu.indexOf('+') > -1) {
int plus = listEqu.indexOf('+');
int prev = listEqu.get(plus-1);
int nxt = listEqu.get(plus+1);
Character sum = (char) (nxt + prev);
listEqu.set(plus, sum);
System.out.println(listEqu);
}
}
When the input is 1+1, this returns [1, b, 1].
What I want is to return [1, 2, 1] .
Any advice? Thanks!

The problem is actually that adding two characters doesn't do what you expect.
The value of '1' + '1' is 'b'. If you want the next digit after '1' you add the integer 1 to it; i.e. '1' + 1 is '2'.
For a deeper understanding, you need to understand how character data is represented in Java.
Each char value in Java is an unsigned 16-bit integer that corresponds to a code point (or character code) in the Unicode basic plane. The first 128 of these code points (0 to 127) correspond to the characters in the old ASCII character set. In ASCII the codes that represent digits are 48 (for '0') through to 57 (for '9'). And the lowercase letters are 97 (for 'a') through to 122 (for 'z').
So as you can see, '1' + '1' -> 49 + 49 -> 98 -> 'b'.
(In fact there is a lot more to it than this. Not all char values represent real characters, and some Unicode code-points require two char values. But this is way beyond the scope of your question.)
How could I specify addition of numbers instead of addition of the characters?
You convert the character (digit) to a number, perform the arithmetic, and convert the result back to a character.
Read the javadoc for the Character class; e.g. the methods Character.digit and Character.forDigit.
Note that this only works while the numbers remain in the range 0 through 9. For a number outside of that range, the character representation consists of two or more characters. For those you should be using String rather than char. (A String also copes with the 1 digit case too ...)
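A minimal sketch of that conversion, using Character.digit and Character.forDigit from the standard library. The class and method names (DigitSum, addDigits) are invented for illustration, and the result is returned as a String so that sums above 9 are not lost.

```java
class DigitSum {
    // Adds two decimal digit characters numerically.
    // Returns a String because the sum may need two digits.
    static String addDigits(char left, char right) {
        int a = Character.digit(left, 10);  // -1 if not a decimal digit
        int b = Character.digit(right, 10);
        if (a < 0 || b < 0) {
            throw new IllegalArgumentException("not a decimal digit");
        }
        return Integer.toString(a + b);
    }

    public static void main(String[] args) {
        System.out.println(addDigits('1', '1'));           // prints 2
        System.out.println(addDigits('7', '8'));           // prints 15
        System.out.println(Character.forDigit(1 + 1, 10)); // prints 2
        System.out.println((char) ('1' + '1'));            // prints b
    }
}
```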

A few things can be improved in your code:
Convert the character '1' into its equivalent integer value:
int prev = Integer.parseInt(String.valueOf(listEqu.get(plus-1)));
int nxt = Integer.parseInt(String.valueOf(listEqu.get(plus+1)));
// Note : int prev = listEqu.get(plus-1) would store an ascii value of `1` to prev value i.e 49
And then converting the sum of those two values into Character back to be added to the list using Character.forDigit as:
Character sum = Character.forDigit(nxt+prev,10);
// Note: Character sum = (char) (nxt + prev); compiles, but it does not produce a digit.
// With the original int values it stores the character with ASCII value 98 (49 + 49), in your case 'b', in sum

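You should first convert prev and nxt to int values and then add them together, as follows:
if(listEqu.contains('+')) {
while(listEqu.indexOf('+') > -1) {
int plus = listEqu.indexOf('+');
// parseInt needs a String, so convert each Character first
int prev = Integer.parseInt(String.valueOf(listEqu.get(plus-1)));
int nxt = Integer.parseInt(String.valueOf(listEqu.get(plus+1)));
// forDigit turns the numeric sum back into a digit character (only valid for sums 0 to 9)
Character sum = Character.forDigit(nxt + prev, 10);
listEqu.set(plus, sum);
System.out.println(listEqu);
}
}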

nxt and prev are char values. They take their values from the ASCII table, where '1' is 49 and 'b' is 98 (thus, '1' + '1' = 'b').
You need to subtract '0' to get the number they represent ('1' - '0' = 49 - 48 = 1).
The sum is not necessarily writable with one character, so you shouldn't put it back into a char array.
If you want to convert an integer to a string, use Integer.toString(i).
(And, if you want to, get the first character of the string and put it in the array, if that's what you want)
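As a rough sketch of the subtraction approach (the names CharOffsetDemo and digitValue are made up for this example):

```java
class CharOffsetDemo {
    // Converts a digit character to its numeric value by subtracting '0'.
    static int digitValue(char c) {
        if (c < '0' || c > '9') {
            throw new IllegalArgumentException("not a digit: " + c);
        }
        return c - '0';
    }

    public static void main(String[] args) {
        int sum = digitValue('7') + digitValue('8');
        String text = Integer.toString(sum); // the sum may need more than one character
        System.out.println((int) '1');       // prints 49
        System.out.println(sum);             // prints 15
        System.out.println(text.charAt(0));  // prints 1
    }
}
```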

You need to parse the characters to their corresponding decimal value before you perform the addition, and then back to a character after. The methods Character.digit(char, int) and Character.forDigit(int, int) can do that (and I would use char since that is the type of prev and nxt). Like,
char prev = listEqu.get(plus - 1);
char nxt = listEqu.get(plus + 1);
Character sum = Character.forDigit(Character.digit(nxt, 10)
+ Character.digit(prev, 10), 10);

Hash a String into fixed bit hash value

I want to hash a word into fixed bit hash value say 64 bit,32 bit (binary).
I used the following code
long murmur_hash= MurmurHash.hash64(word);
Then murmur_hash value is converted into binary by the following function
public static String intToBinary (long n, int numOfBits) {
String binary = "";
for(int i = 0; i < numOfBits; ++i) {
// test the lowest bit first, then shift it out
if((n & 1) == 0)
{
binary="0"+binary;
}
else
binary="1"+binary;
n >>>= 1;
}
return binary;
}
Is there any direct hash method to convert into binary?

Just use this (or Long.toBinaryString(long i) for a 64-bit value):
Integer.toBinaryString(int i)

If you want to convert into a fixed binary string, that is, always get a 64-character long string with zero padding, then you have a couple of options. If you have Apache's StringUtils, you can use:
StringUtils.leftPad( Long.toBinaryString(murmurHash), Long.SIZE, "0" );
If you don't, you can write a padding method yourself:
public static String paddedBinaryFromLong( long val ) {
StringBuilder sb = new StringBuilder( Long.toBinaryString(val));
char[] zeros = new char[Long.SIZE - sb.length()];
Arrays.fill(zeros, '0');
sb.insert(0, zeros);
return sb.toString();
}
This method starts by using the Long.toBinaryString(long) method, which conveniently does the bit conversion for you. The only thing it doesn't do is pad on the left if the value is shorter than 64 characters.
The next step is to create an array of 0 characters with the missing zeros needed to pad to the left.
Finally, we insert that array of zeros at the beginning of our StringBuilder, and we have a 64-character, zero-padded bit string.
Note: there is a difference between using Long.toBinaryString(long) and Long.toString(long,radix). The difference is in negative numbers. In the first, you'll get the full, two's complement value of the number. In the second, you'll get the number with a minus sign:
System.out.println(Long.toString(-15L,2));
result:
-1111
System.out.println(Long.toBinaryString(-15L));
result:
1111111111111111111111111111111111111111111111111111111111110001

Another way is to use
Integer.toString(i, radix)
This returns the string representation of the first argument i in the radix (binary: 2, octal: 8, decimal: 10, hex: 16) specified by the second argument.

Converting a int to char and then back to int - doesn't give same result always

I am trying to get a char from an int value > 0xFFFF. But instead, I always get back the same char value, that when cast to an int, prints the value 65535 (0xFFFF).
I couldn't understand why it is generating symbols for unicode > 0xFFFF.
int hex = 0x10FFFF;
char c = (char)hex;
System.out.println((int)c);
I expected the output to be 0x10FFFF. Instead, the output comes back as 65535.

This is because, while an int is 4 bytes, a char is only 2 bytes. Thus, you can't represent all values in a char that you can in an int. Using a standard unsigned integer representation, you can only represent the range of values from 0 to 2^16 - 1 == 65535 in a 2-byte value, so if you convert any number outside that range to a 2-byte value and back, you'll lose data.
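To see what the cast actually does, here is a small sketch (the class and method names are hypothetical). A narrowing cast from int to char keeps only the low 16 bits, so the result equals value & 0xFFFF.

```java
class NarrowingDemo {
    // Casts to char and widens back to int, like the code in the question.
    static int roundTrip(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(0x10FFFF)); // prints 65535 (low 16 bits are 0xFFFF)
        System.out.println(roundTrip(0x10FFFE)); // prints 65534 (low 16 bits are 0xFFFE)
        System.out.println((char) 0x10041);      // prints A (low 16 bits are 0x0041)
        System.out.println(roundTrip(0x10FFFF) == (0x10FFFF & 0xFFFF)); // prints true
    }
}
```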

int is 4 byte. char is 2 byte.
Your number was well within range an int can hold, but not which char can.
So when you converted that number to a char, the upper bits were discarded and only the low 16 bits (0xFFFF) were kept, which is what it printed, i.e. 65535.

Your number was too big to be a char, which is 2 bytes, but it was small enough to fit in an int, which is 4 bytes. The cast keeps only the low 16 bits, and for 0x10FFFF those bits are 0xFFFF = 65535, which also happens to be the largest value that fits in a char; that's why you got that value. If a char were big enough to hold your number, converting it back to an int would have returned the decimal value of 0x10FFFF, which is 1114111.

Unfortunately, I think you were expecting a Java char to be the same thing as a Unicode code point. They are not the same thing.
The Java char, as already expressed by other answers, can only support code points that can be represented in 16 bits, whereas Unicode needs 21 bits to support all code points.
In other words, a Java char on its own, only supports Basic Multilingual Plane characters (code points <= 0xFFFF). In Java, if you want to represent a Unicode code point that is in one of the extended planes (code points > 0xFFFF), then you need surrogate characters, or a pair of characters to do that. This is how UTF-16 works. And, internally, this is how Java strings work as well. Just for fun, run the following snippet to see how a single Unicode code point is actually represented by 2 characters if the code point is > 0xFFFF:
// Printing string length for a string with
// a single unicode code point: 0x22BED.
System.out.println("𢯭".length()); // prints 2, because it uses a surrogate pair.
If you want to safely convert an int value that represents a Unicode code point to a char (or chars to be more exact), and then convert it back to an int code point, you will have to use code like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
char[] surrogateChars = Character.toChars(hex);
int codePointConvertedBack = Character.codePointAt(surrogateChars, 0);
System.out.println(codePointConvertedBack); // prints 1114111
}
Alternatively, instead of manipulating char arrays, you can use a String, like this:
public static void main(String[] args) {
int hex = 0x10FFFF;
System.out.println(Character.isSupplementaryCodePoint(hex)); // prints true because hex > 0xFFFF
String s = new String(new int[] {hex}, 0, 1);
int codePointConvertedBack = s.codePointAt(0);
System.out.println(codePointConvertedBack); // prints 1114111
}
For further reading: Java Character Class

What is the purpose to add padding to hex?

Hi, I read this post on how to implement salting and hashing for passwords, and I am stuck on the following code from that page.
private static String toHex(byte[] array)
{
BigInteger bi = new BigInteger(1, array);
String hex = bi.toString(16);
int paddingLength = (array.length * 2) - hex.length();
if(paddingLength > 0)
return String.format("%0" + paddingLength + "d", 0) + hex;
else
return hex;
}
My question is: why did they calculate paddingLength and prepend that many zeros to the hex string when paddingLength is greater than zero?

BigInteger(1, array) interprets the byte array as an unsigned (positive) magnitude; this means that it has 2^(8*N) possible values for an N-length array (since each byte contains 8 bits).
Meanwhile, a hex string of length M has 16^M possible values (since each character encodes one of 16 values).
The authors want a one-to-one mapping between the byte[] and the String: given a String, you should be able to exactly determine the byte[] it came from. To get that, we have to make sure the string can encode exactly as many values as the byte[]. Plugging in the numbers from above, we get:
(# values for an N-length byte[]) == (# values for an M-length String)
2^(8*N) == 16^M
Let's solve for M in terms of N. The first step is to re-write that right-hand side. If you remember your exponent power rules, a^(b*c) == (a^b)^c. Let's get the base of the exponent on the right to be a 2:
16^M == (2^4)^M
== 2^(4*M)
So we have 2^(8*N) == 2^(4*M). If 2^k == 2^j, that means k == j. So, 8*N == 4*M. Dividing both sides by 4 yields M = 2N.
To tie it back together, remember that N was the length of the byte array, and M was the length of the hex string. We've just figured out that for there to be a one-to-one mapping, M = 2N -- in other words, the hex string should be twice as long as the byte array.
The padding ensures that.
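A quick way to see the effect is to feed in an array that starts with a zero byte. The sketch below uses invented names (HexPaddingDemo, unpaddedHex, paddedHex) and a made-up three-byte input. BigInteger.toString(16) drops leading zeros, so without padding the string is shorter than 2N characters.

```java
import java.math.BigInteger;

class HexPaddingDemo {
    // Hex string straight from BigInteger: leading zeros are lost.
    static String unpaddedHex(byte[] array) {
        return new BigInteger(1, array).toString(16);
    }

    // Same value, left-padded with '0' to exactly two characters per byte.
    static String paddedHex(byte[] array) {
        String hex = unpaddedHex(array);
        StringBuilder sb = new StringBuilder();
        for (int i = hex.length(); i < array.length * 2; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x0f, (byte) 0xa0}; // hypothetical sample input
        System.out.println(unpaddedHex(bytes)); // prints fa0
        System.out.println(paddedHex(bytes));   // prints 000fa0
    }
}
```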

Because they wanted all the bytes in the array to be represented in the hex string, even if they are leading zero bytes.
It is not the most obvious way to write a toHex method though.
I find something like this much clearer:
private static String toHex(byte[] array) {
StringBuilder s = new StringBuilder();
for (byte b : array) {
s.append(String.format("%02x", b));
}
return s.toString();
}
