Character strings to binary strings - why are some characters multi-byte? - java

This code is supposed to convert character strings to binary ones, but for a few strings it returns a String with 16 binary digits, not 8 as I expected.
public class aaa {
    public static void main(String argv[]) {
        String nux = "ª";
        String nux2 = "Ø";
        String nux3 = "(";
        byte[] bites = nux.getBytes();
        byte[] bites2 = nux2.getBytes();
        byte[] bites3 = nux3.getBytes();
        System.out.println(AsciiToBinary(nux));
        System.out.println(AsciiToBinary(nux2));
        System.out.println(AsciiToBinary(nux3));
        System.out.println("number of bytes :" + bites.length);
        System.out.println("number of bytes :" + bites2.length);
        System.out.println("number of bytes :" + bites3.length);
    }

    public static String AsciiToBinary(String asciiString) {
        byte[] bytes = asciiString.getBytes();
        StringBuilder binary = new StringBuilder();
        for (byte b : bytes) {
            int val = b;
            for (int i = 0; i < 8; i++) {
                binary.append((val & 128) == 0 ? 0 : 1);
                val <<= 1;
            }
            binary.append(' ');
        }
        return binary.toString();
    }
}
I don't understand why the first two strings return 2 bytes, since they are single-character strings.
Compiled here: https://ideone.com/AbxBZ9
This returns:
11000010 10101010
11000011 10011000
00101000
number of bytes :2
number of bytes :2
number of bytes :1
I am using this code: Convert A String (like testing123) To Binary In Java
NetBeans IDE 8.1

A character is not always 1 byte long. Think about it: many languages, such as Chinese or Japanese, have thousands of characters; how would you map those characters to bytes?
You are using UTF-8 (one of the many, many ways of mapping characters to bytes). Looking up a character table for UTF-8 and searching for the sequence 11000010 10101010, I arrive at
U+00AA ª 11000010 10101010
which is the UTF-8 encoding for ª. UTF-8 is often the default character encoding (charset) for Java -- but you cannot rely on this. That is why you should always specify a charset when converting strings to bytes or vice versa.
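For example, a minimal sketch of the same conversion with an explicit charset (the method name toBinary and the charset parameter are additions, not part of the original code):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static String toBinary(String s, Charset charset) {
    StringBuilder binary = new StringBuilder();
    for (byte b : s.getBytes(charset)) { // explicit charset instead of the platform default
        for (int i = 7; i >= 0; i--) {
            binary.append((b >> i) & 1);
        }
        binary.append(' ');
    }
    return binary.toString();
}

toBinary("ª", StandardCharsets.UTF_8) then prints 11000010 10101010, while toBinary("ª", StandardCharsets.ISO_8859_1) prints 10101010, because Latin-1 maps ª to the single byte 0xAA.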

You can understand why some characters are two bytes by running this simple code:
// integer - binary
System.out.println(Byte.MIN_VALUE);
// -128 - 0b11111111111111111111111110000000
System.out.println(Byte.MAX_VALUE);
// 127 - 0b1111111
System.out.println((int) Character.MIN_VALUE);
// 0 - 0b0
System.out.println((int) Character.MAX_VALUE);
// 65535 - 0b1111111111111111
As you can see, we can represent Byte.MAX_VALUE with just 7 bits, i.e. 1 byte (01111111).
If you cast Character.MIN_VALUE to an integer, it is 0; its binary format fits in a single bit, i.e. 1 byte (00000000)!
But what about Character.MAX_VALUE? In binary format it is
1111111111111111, which is 65535 in decimal and needs 2 bytes (11111111 11111111).
So characters whose numeric value is between 0 and 65535 can be represented with 1 or 2 bytes.
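As a small illustration (a sketch, not from the original answer), note that a char's numeric value and its encoded byte count are two different things:

char c = 'ª';
// the char's numeric value (its UTF-16 code unit) is 170
System.out.println((int) c + " = 0b" + Integer.toBinaryString(c));
// its UTF-8 encoding still needs two bytes, because single bytes
// above 127 are reserved for multi-byte sequences in UTF-8
System.out.println(String.valueOf(c).getBytes(java.nio.charset.StandardCharsets.UTF_8).length);

The value 170 would fit numerically into one byte, yet UTF-8 spends two bytes on it.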
Hope you understand.

Related

Why does reading a character that has no ASCII representation with System.in not give the character in two bytes?

import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        char ch = '诶';
        System.out.println((int) ch);
        int c;
        while ((c = System.in.read()) != -1) {
            System.out.println(c);
        }
    }
}
Output:
35830
Here, the value that represents the char 诶 in Unicode is 35830. In binary, it is 10001011 11110110.
When I enter that character in the terminal, I expected to get two bytes, 10001011 and 11110110, and by combining them again I would be able to obtain the original char.
But what I actually get is:
232
175
182
10
I can see that 10 represents the newline character. But what do the first 3 numbers mean?
UTF-8 is a multi-byte variable-length encoding.
In order for something reading a stream of bytes to know that there are more bytes to be read in order to finish the current codepoint, there are some values that just cannot occur in a valid UTF-8 byte stream. Basically, certain patterns indicate "hang on, I'm not done".
There's a table which explains it here. For a codepoint in the range U+0800 to U+FFFF, it needs 16 bits to represent it; its byte representation consists of 3 bytes:
1st byte 2nd byte 3rd byte
1110xxxx 10xxxxxx 10xxxxxx
You are seeing 232 175 182 because those are the bytes of the UTF-8 encoding.
byte[] bytes = "诶".getBytes(StandardCharsets.UTF_8);
for (byte b : bytes) {
    System.out.println((0xFF & b) + " " + Integer.toString(0xFF & b, 2));
}
Ideone demo
Output:
232 11101000
175 10101111
182 10110110
So the 3 bytes follow the pattern described above.
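You can also reassemble the code point from those three bytes by stripping the marker bits and concatenating the payloads (a quick sketch, assuming the byte values from the output above):

int b1 = 232, b2 = 175, b3 = 182;
// 1110xxxx keeps its low 4 bits, each 10xxxxxx keeps its low 6 bits
int codePoint = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
System.out.println(codePoint);        // 35830
System.out.println((char) codePoint); // 诶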

US-ASCII string (de-)compression into/from a byte array (7 bits/character)

As we all know, ASCII uses 7 bits to encode characters, so the number of bytes used to represent the text should always be less than the number of letters in the text.
For example:
StringBuilder text = new StringBuilder();
IntStream.range(0, 160).forEach(x -> text.append("a")); // generate 160 characters
int letters = text.length();
int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
System.out.println(letters); // expected 160, actual 160
System.out.println(bytes); // expected 140, actual 160
letters always equals bytes, but I expected letters > bytes.
The main problem: in the SMPP protocol, the SMS body must be <= 140 bytes. With 7-bit ASCII encoding you can then write 160 letters (140*8/7), so I would like the text encoded in 7-bit packed ASCII. We are using the JSMPP library.
Can anyone explain this to me and guide me to the right way? Thanks in advance (:
(160*8 - 160*7)/8 = 20, so you would expect 20 fewer bytes used by the end of your script. However, memory is addressed in whole bytes, so even if you don't use all of your bits, you can't concatenate the leftover bits with another value; you are still using 8-bit bytes for your ASCII codes. That's why you get the same number. For example, the lowercase "a" is 97 in ASCII:
01100001
Note the leading zero is still there, even though it is not used. You can't just use it to store part of another value.
Which concludes: in pure ASCII, letters must always equal bytes.
(Or imagine putting size-7 objects into size-8 boxes. You can't hack the objects to pieces, so the number of boxes must equal the number of objects - at least in this case.)
Here is a quick & dirty solution without any libraries, i.e. using only JRE on-board means. It is not optimised for efficiency and does not check whether the message really is US-ASCII, it just assumes it. It is just a proof of concept:
package de.scrum_master.stackoverflow;

import java.util.BitSet;

public class ASCIIConverter {
    // Pack the 7 low bits of each character into a bit set, LSB first
    public byte[] compress(String message) {
        BitSet bits = new BitSet(message.length() * 7);
        int currentBit = 0;
        for (char character : message.toCharArray()) {
            for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
                if ((character & 1 << bitInCharacter) > 0)
                    bits.set(currentBit);
                currentBit++;
            }
        }
        return bits.toByteArray();
    }

    // Rebuild characters from consecutive 7-bit groups
    public String decompress(byte[] compressedMessage) {
        BitSet bits = BitSet.valueOf(compressedMessage);
        int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
        StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
        for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
            char character = (char) bits.get(currentBit, currentBit + 7).toByteArray()[0];
            decompressedMessage.append(character);
        }
        return decompressedMessage.toString();
    }

    public static void main(String[] args) {
        String[] messages = {
            "Hello world!",
            "This is my message.\n\tAnd this is indented!",
            " !\"#$%&'()*+,-./0123456789:;<=>?\n"
                + "#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
                + "`abcdefghijklmnopqrstuvwxyz{|}~",
            "1234567890123456789012345678901234567890"
                + "1234567890123456789012345678901234567890"
                + "1234567890123456789012345678901234567890"
                + "1234567890123456789012345678901234567890"
        };
        ASCIIConverter asciiConverter = new ASCIIConverter();
        for (String message : messages) {
            System.out.println(message);
            System.out.println("--------------------------------");
            byte[] compressedMessage = asciiConverter.compress(message);
            System.out.println("Number of ASCII characters = " + message.length());
            System.out.println("Number of compressed bytes = " + compressedMessage.length);
            System.out.println("--------------------------------");
            System.out.println(asciiConverter.decompress(compressedMessage));
            System.out.println("\n");
        }
    }
}
The console log looks like this:
Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!
This is my message.
And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
And this is indented!
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
The byte length differs depending on the encoding. Check the example below.
String text = "0123456789";
byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(ascii.length);
// prints "10"
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length);
// prints "10"
byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length);
// prints "22" (includes a 2-byte byte order mark)
byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"
Nope. In "modern" environments (since three or four decades ago), the ASCII character encoding for the ASCII character set uses 8-bit code units which are then serialized to one byte each. This is because we want to move and store data in "octets" (8-bit bytes). This character encoding happens to always have the high bit set to 0.
You could say there was, long ago, a 7-bit character encoding for the ASCII character set. Even then, data might have been moved or stored as octets. The high bit would be used for some application-specific purpose such as parity. Some systems would zero it out in an attempt to increase interoperability, but in the end they hindered interoperability by not being "8-bit safe". With strong Internet standards, such systems are almost all in the past.

Why am I getting 3 bytes instead of 1 byte after hexadecimal/string/byte conversion in Java?

I have this program:
String hexadecimal = "AF";
byte decimal[] = new byte[hexadecimal.length() / 2];
int j = 0;
for (int i = 0; i < decimal.length; i++) {
    decimal[i] = (byte) Integer.parseInt(hexadecimal.substring(j, j + 2), 16); // Maybe the problem is this statement
    j = j + 2;
}
String s = new String(decimal);
System.out.println("TOTAL LEN: " + s.length());
byte aux[] = s.getBytes();
System.out.println("TOTAL LEN: " + aux.length);
The first total is "1" and the second one is "3"; I thought I would get "1" in the second total as well. Why does this happen? My intention is to generate another hexadecimal string with the same value as the original string (AF), but I am having this issue.
Regards!
P.S. Sorry for my English, let me know if I explained myself well.
I don't know exactly what you are trying to achieve, but here is what you are doing:
Integer.parseInt(hexadecimal.substring(j, j + 2), 16) returns 175
(byte) 175 is -81
new String(decimal) tries to create a String from this byte array using your current character set (probably UTF-8)
As the byte array does not contain a valid UTF-8 byte sequence, the created String contains the REPLACEMENT CHARACTER for the Unicode code point U+FFFD. The UTF-8 byte representation for this code point is EF BF BD (or -17 -65 -67). That's why the second length is three.
Have a look at Wikipedia UTF-8. Any character with a code point <= 7F can be represented by a single byte. For all other characters the first byte must have bits 7 and 6 set (11......), which is not the case for the value -81, which is 10101111. Therefore this is not a valid start byte, and it is replaced with the REPLACEMENT CHARACTER.
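If the goal is just to get the string "AF" back, one option (a sketch, not from the original answer) is to format the bytes as hex directly instead of routing them through new String(...):

byte[] decimal = { (byte) 0xAF };
StringBuilder hex = new StringBuilder();
for (byte b : decimal) {
    hex.append(String.format("%02X", b & 0xFF)); // mask to undo sign extension
}
System.out.println(hex); // AF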

Type cast issue in java (ascii value to char)

The problem I am facing occurs when I try to cast some ASCII values to char.
For example:
(char)145 //returns ?
(char)129 //also returns ?
but it is supposed to return a different character. This happens with many other values as well.
I hope I have been clear enough.
ASCII is a 7-bit encoding system. Some programs even use this to detect whether a file is binary or textual. Characters below 32 are control characters and are used as directives (for instance new lines, print commands).
The program however will still work. A character is simply stored as a short (thus sixteen bits). But the values in that range don't have a visible interpretation, which means the textual output of both values shows nothing meaningful. On the other hand, comparisons like (char) 145 == (char) 129 will still work (return false), simply because for a processor there is no difference between a short and a character.
If you are interested in converting your value such that only the lowest seven bits count (thus modifying the value so that it is in the valid range), you can use masking:
int value = 145;
value &= 0x7f;
char c = (char) value;
The char type is Unicode 16-bit, UTF-16. So you could do (char) 265 for c-with-circumflex (ĉ). ASCII is 7 bits, 0 - 127.
String s = "" + ((char) 145) + ((char) 129);
The above is a string of two Unicode characters (each 2 bytes in UTF-16).
byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);             // ASCII, unmappable chars become '?'
String lossy = new String(ascii, StandardCharsets.US_ASCII);      // "??"
byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);          // ISO-8859-1 (Latin-1)
byte[] windows1252 = s.getBytes(Charset.forName("Windows-1252")); // Windows Latin-1
byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);                 // No information loss.
String original = new String(utf8, StandardCharsets.UTF_8);       // Original string.
In Java, String/char/Reader/Writer handle text (in Unicode), whereas byte[]/InputStream/OutputStream handle binary data, bytes.
And bytes must always be associated with an encoding to yield text.
Answer: as soon as there is a conversion from text to some encoding that cannot represent that char, a question mark can be written.
These expressions evaluate to true:
((char) 145) == '\u0091';
((char) 129) == '\u0081';
These UTF-16 values map to the Unicode code points U+0091 and U+0081:
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
These are both control characters without visible graphemes (the question mark acts as a substitution character), and one of them is "private use", so it has no designated purpose. Neither is in the ASCII set.
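You can verify this with the standard Character API (a small sketch, not part of the original answer):

System.out.println(Character.getType(0x91) == Character.CONTROL); // true (general category Cc)
System.out.println(Character.getType(0x81) == Character.CONTROL); // true
System.out.println(Character.isISOControl((char) 145));           // true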

Converting between decimal and hexadecimal in Java

Let's say I have a byte value of 0x12 (decimal 18). I need to convert this into decimal 12. Here's how I do it:
byte hex = 0x12;
byte dec = Byte.parseByte(String.format("%2x", hex));
System.out.println(dec); // 12
Is there a better way of doing this (for example without using strings and parsing)?
Try this:
byte dec = (byte)(hex / 16 * 10 + hex % 16);
Note that it assumes that the original input is a valid BCD encoding.
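For example, with hex = 0x12 (decimal 18): 18 / 16 = 1, 1 * 10 = 10, and 18 % 16 = 2, so the result is 10 + 2 = 12.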
If you want to interpret a binary-coded decimal and convert it to binary, you can convert it to a hexadecimal string and interpret that as a decimal string. Your method is a perfectly valid way to do it (be careful, though, that bytes are signed in Java).
Read up on binary-coded decimals here: http://en.wikipedia.org/wiki/Binary-coded_decimal
If you want to avoid string conversion, you could separate each nibble (four bits) and multiply by the correct value:
byte hex = 0x12;
if ((hex & 15) > 9 || (hex >>> 4) > 9)
    throw new NumberFormatException(); // check for valid input
byte dec = (byte) ((hex & 15) + 10 * (hex >>> 4 & 15));
If your input is wider than a byte, this method gets out of hand easily as you need to handle each nibble separately.
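For wider inputs, a loop over the nibbles keeps it manageable (a sketch; bcdToInt is a hypothetical helper, not from the original answer):

// decode an int whose nibbles are BCD digits, e.g. 0x1234 -> 1234
static int bcdToInt(int bcd) {
    int result = 0;
    int weight = 1;
    while (bcd != 0) {
        int digit = bcd & 0xF; // lowest nibble
        if (digit > 9)
            throw new NumberFormatException("invalid BCD nibble: " + digit);
        result += digit * weight;
        weight *= 10;
        bcd >>>= 4; // move to the next nibble
    }
    return result;
}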
Hex to Decimal - Integer.parseInt(str,16)
Decimal to Hex - Integer.toHexString(195012)
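For instance (a quick sketch of the two calls above):

int fromHex = Integer.parseInt("AF", 16);   // 175
String toHex = Integer.toHexString(195012); // "2f9c4"
System.out.println(fromHex + " " + toHex);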
