How to print extended value ascii for a string in java? - java

I have a requirement to print the ASCII value for a string, but when I try printing the values I get unexpected output. My program looks like this:
int s=1161;
String hex=Integer.toHexString(1161);
hex="0"+hex;
char firstByte = (char) (Integer.parseInt(hex.substring(0,2),16));
char secondByte = (char) (Integer.parseInt(hex.substring(2,4),16));
and the output of the program is
first byte-- some rectangle shape
second byte--?
whereas the ASCII characters I am expecting are
first byte-- EOT
second byte--‰
Can someone help me achieve this?

You intend to do the following, in a somewhat convoluted way:
String hex = String.format("%04x", s); // delivering 0489
The first byte is 0x04 = 4, an ASCII control character: Ctrl-D, or EOT.
The second byte is 0x89 = 137, which is outside the 7-bit ASCII range. Depending on the encoding it might be the per mille sign (‰), but in Unicode U+0089 is the control character for a tab with justification.
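A minimal sketch of the point above, assuming the goal is to split 1161 (0x0489) into its two bytes and see how the low byte 0x89 renders under different charsets. The class name ByteSplit and its helpers are made up for illustration, and the windows-1252 charset is assumed to be available in the JRE (it is not one of the charsets Java guarantees):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteSplit {
    // High byte of a 16-bit value, e.g. 0x0489 -> 0x04
    public static int highByte(int value) {
        return (value >> 8) & 0xFF;
    }

    // Low byte of a 16-bit value, e.g. 0x0489 -> 0x89
    public static int lowByte(int value) {
        return value & 0xFF;
    }

    // Decodes a single byte under the given charset
    public static String decode(int b, Charset charset) {
        return new String(new byte[] { (byte) b }, charset);
    }

    public static void main(String[] args) {
        int s = 1161;
        System.out.println(String.format("%04x", s)); // 0489
        // Assumption: windows-1252 is installed; it maps 0x89 to the per mille sign
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(decode(lowByte(s), cp1252));
        // Under ISO-8859-1 the same byte becomes the control character U+0089
        System.out.println((int) decode(lowByte(s), StandardCharsets.ISO_8859_1).charAt(0)); // 137
    }
}
```

Whether the console then shows ‰, a box, or "?" still depends on the console's own encoding and font.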

You should try following code...
int s = 1161;
String hex = Integer.toHexString(s);
// hex="0"+hex;
char firstByte = (char) (Integer.parseInt(hex.substring(0, 2), 16));
char secondByte = (char) (Integer.parseInt(hex.substring(2, 3), 16));
System.out.println("First = " + firstByte + ", Second = " + secondByte + ", Hex " + hex);
output
First = H, Second = , Hex 489

Test your functions with more reliable input. Control characters like EOT may be represented by squares or some other kind of placeholder. Anything above 127 is not defined in ASCII, so it might just show up as "?". It seems to me that your function works correctly.
See also http://en.wikipedia.org/wiki/Ascii for all well-defined ASCII symbols.

Related

US-ASCII string (de-)compression into/from a byte array (7 bits/character)

As far as I know, ASCII uses 7 bits to encode each char, so the number of bytes used to represent a text should be smaller than the number of letters in it.
For example:
StringBuilder text = new StringBuilder();
IntStream.range(0, 160).forEach(x -> text.append("a")); // generate 160 text
int letters = text.length();
int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
System.out.println(letters); // expected 160, actual 160
System.out.println(bytes); // expected 140, actual 160
letters always equals bytes, but I expected letters > bytes.
The main problem: in the SMPP protocol the SMS body must be <= 140 bytes. With 7-bit ASCII encoding you can fit 160 letters (= 140*8/7), so I'd like the text to be encoded in 7-bit ASCII. We are using the JSMPP library.
Can anyone explain this to me and point me in the right direction? Thanks in advance (:
(160*8 - 160*7)/8 = 20, so you expect 20 fewer bytes to be used. However, a byte is the smallest addressable unit of storage, so even if a character does not use all of its bits, you cannot append the next value to the leftover bit. Each ASCII code therefore still occupies a full 8-bit byte, which is why you get the same number. For example, the lowercase "a" is 97 in ASCII:
01100001
Note that the leading zero is still there, even though it is not used. You can't just use it to store part of another value.
In conclusion, in plain ASCII the number of letters always equals the number of bytes.
(Or imagine putting size-7 objects into size-8 boxes. You can't cut the objects into pieces, so the number of boxes must equal the number of objects, at least in this case.)
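The 160-versus-140 figure in the question only applies when the 7-bit codes are packed back to back, which getBytes does not do. A small arithmetic sketch of that relationship (the class and method names are hypothetical):

```java
public class SeptetMath {
    // Bytes needed to hold charCount 7-bit characters packed back to back
    public static int packedLength(int charCount) {
        return (charCount * 7 + 7) / 8; // integer form of ceil(charCount * 7 / 8)
    }

    // Number of 7-bit characters that fit into byteCount bytes when packed
    public static int capacity(int byteCount) {
        return byteCount * 8 / 7;
    }

    public static void main(String[] args) {
        System.out.println(packedLength(160)); // 140
        System.out.println(capacity(140));     // 160
    }
}
```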
Here is a quick & dirty solution without any libraries, i.e. only JRE on-board means. It is not optimised for efficiency and does not check if the message is indeed US-ASCII, it just assumes it. It is just a proof of concept:
package de.scrum_master.stackoverflow;
import java.util.BitSet;
public class ASCIIConverter {
public byte[] compress(String message) {
BitSet bits = new BitSet(message.length() * 7);
int currentBit = 0;
for (char character : message.toCharArray()) {
for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
if ((character & 1 << bitInCharacter) > 0)
bits.set(currentBit);
currentBit++;
}
}
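// toByteArray() drops trailing zero bytes, so restore the full packed length
return java.util.Arrays.copyOf(bits.toByteArray(), (message.length() * 7 + 7) / 8);
}
public String decompress(byte[] compressedMessage) {
BitSet bits = BitSet.valueOf(compressedMessage);
// Caveat: 8k-1 and 8k characters pack into the same number of bytes, so an
// input of 8k-1 characters comes back with one extra trailing NUL character
int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
// An all-zero septet yields an empty array, so guard against indexing it
byte[] septet = bits.get(currentBit, currentBit + 7).toByteArray();
char character = (char) (septet.length == 0 ? 0 : septet[0]);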
decompressedMessage.append(character);
}
return decompressedMessage.toString();
}
public static void main(String[] args) {
String[] messages = {
"Hello world!",
"This is my message.\n\tAnd this is indented!",
" !\"#$%&'()*+,-./0123456789:;<=>?\n"
+ "#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
+ "`abcdefghijklmnopqrstuvwxyz{|}~",
"1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
};
ASCIIConverter asciiConverter = new ASCIIConverter();
for (String message : messages) {
System.out.println(message);
System.out.println("--------------------------------");
byte[] compressedMessage = asciiConverter.compress(message);
System.out.println("Number of ASCII characters = " + message.length());
System.out.println("Number of compressed bytes = " + compressedMessage.length);
System.out.println("--------------------------------");
System.out.println(asciiConverter.decompress(compressedMessage));
System.out.println("\n");
}
}
}
The console log looks like this:
Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!
This is my message.
And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
And this is indented!
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
The byte length differs depending on the encoding. See the example below.
String text = "0123456789";
byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(b1.length);
// prints "10"
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length);
// prints "10"
byte[] utf16= text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length);
// prints "22"
byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"
Nope. In "modern" environments (since 3 or 4 decades ago), the ASCII character encoding for the ASCII character set uses 8 bit code units which are then serialized to one byte each. This is because we want to move and store data in "octets" (8-bit bytes). This character encoding happens to always have the high bit set to 0.
You could say there was, long ago, a 7-bit character encoding for the ASCII character set. Even then, data might have been moved or stored as octets. The high bit would be used for some application-specific purpose such as parity. Some systems would zero it out in an attempt to increase interoperability, but in the end they hindered interoperability by not being "8-bit safe". With strong Internet standards, such systems are almost all in the past.
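To illustrate the claim that the 8-bit ASCII encoding always leaves the high bit at 0, here is a short sketch (the class and method names are invented for this example). It also shows what String.getBytes does with a character that ASCII cannot represent:

```java
import java.nio.charset.StandardCharsets;

public class AsciiHighBit {
    // True if no byte has its most significant bit set
    public static boolean allHighBitsClear(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "Hello, world!".getBytes(StandardCharsets.US_ASCII);
        System.out.println(ascii.length);            // 13, one byte per character
        System.out.println(allHighBitsClear(ascii)); // true
        // A character outside ASCII is replaced by '?' (63) during encoding
        System.out.println("\u00e9".getBytes(StandardCharsets.US_ASCII)[0]); // 63
    }
}
```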

Will the result of String.getBytes() ever contain zeros?

I have tried numerous Strings with random characters, and except for the empty string "", their .getBytes() byte arrays seem to never contain any 0 values (like {123, -23, 54, 0, -92}).
Is it always the case that, except for the empty string, their .getBytes() byte arrays contain no zero?
Edit: the previous test code is as follows. Now I learned that in Java 8 the result seems always "contains no 0" if the String is made up of (char) random.nextInt(65535) + 1; and "contains 0" if the String contains (char) 0.
private static String randomString(int length){
Random random = new Random();
char[] chars = new char[length];
for (int i = 0; i < length; i++){
int integer = random.nextInt(65535) + 1;
chars[i] = (char) (integer);
}
return new String(chars);
}
public static void main(String[] args) throws Exception {
for (int i = 1; i < 100000; i++){
String s1 = randomString(10);
byte[] bytes = s1.getBytes();
for (byte b : bytes) {
if (b == 0){
System.out.println("contains 0");
System.exit(0);
}
}
}
System.out.println("contains no 0");
}
It does depend on your platform's default encoding. But in many encodings, the '\0' (null) character will result in getBytes() returning an array with a zero in it.
System.out.println("\0".getBytes()[0]);
This will work with the US-ASCII, ISO-8859-1 and the UTF-8 encodings:
System.out.println("\0".getBytes("US-ASCII")[0]);
System.out.println("\0".getBytes("ISO-8859-1")[0]);
System.out.println("\0".getBytes("UTF-8")[0]);
If you have a byte array and you want the string that corresponds to it, you can also do the reverse:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b);
However this will give different results for different encodings, and in some encodings it may be an invalid sequence.
And the characters in it may not be printable.
Your best bet is the ISO-8859-1 encoding, only the null character cannot be printed:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b, "ISO-8859-1");
System.out.println(s);
System.out.println((int) s.charAt(3));
Edit
In the code that you posted, it's also easy to get "contains 0" if you specify the UTF-16 encoding:
byte[] bytes = s1.getBytes("UTF-16");
It's all about encoding, and you haven't specified it. When you haven't passed it as an argument to the getBytes method, it takes your platform default encoding.
To find out what that is on your platform, run this:
System.out.println(System.getProperty("file.encoding"));
On macOS, it's UTF-8; on Windows it's likely to be one of the Windows code pages such as Cp1252. You can also specify the platform default on the command line when you run Java:
java -Dfile.encoding=UTF16 <the rest>
If you run your code that way you'll also see that it contains 0.
Is it always the case that, except for the empty string, their .getBytes() byte arrays contain no zero?
No, there is no such guarantee. First, and most importantly, .getBytes() returns "a sequence of bytes using the platform's default charset". As such there is nothing preventing you from defining your own custom charset that explicitly encodes certain values as 0s.
More practically, many common encodings will include zero bytes, notably to represent the NUL character. But even if your strings don't include NULs, it's possible for the byte sequence to include 0s. In particular, UTF-16 (which Java's char type is based on) represents every character of the Basic Multilingual Plane in two bytes, meaning ASCII characters (which only need one) are paired with a 0 byte.
You could also very easily test this yourself by trying to construct a String from a sequence of bytes containing 0s with an appropriate constructor, such as String(byte[] bytes) or String(byte[] bytes, Charset charset). For example (notice my system's default charset is UTF-8):
System.out.println("Default encoding: " + System.getProperty("file.encoding"));
System.out.println("Empty string: " + Arrays.toString("".getBytes()));
System.out.println("NUL char: " + Arrays.toString("\0".getBytes()));
System.out.println("String constructed from {0} array: " +
Arrays.toString(new String(new byte[]{0}).getBytes()));
System.out.println("'a' in UTF-16: " +
Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));
prints:
Default encoding: UTF-8
Empty string: []
NUL char: [0]
String constructed from {0} array: [0]
'a' in UTF-16: [-2, -1, 0, 97]
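The observation in the question (random non-NUL chars never produced a 0) fits a default charset of UTF-8, which is an assumption about the asker's platform. In UTF-8 the byte 0 is used only for U+0000, and the following sketch checks that exhaustively for all char values; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class Utf8ZeroScan {
    // True if the UTF-8 encoding of s contains a zero byte
    public static boolean hasZeroByte(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    // Counts the char values in 1..65535 whose UTF-8 encoding contains a zero byte
    public static int countNonNulCharsWithZeroByte() {
        int count = 0;
        for (int c = 1; c <= 0xFFFF; c++) {
            if (hasZeroByte(String.valueOf((char) c))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(hasZeroByte("\0"));              // true
        System.out.println(countNonNulCharsWithZeroByte()); // 0
    }
}
```

With UTF-16 or another charset the result can differ, so passing the charset explicitly is the safer habit.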

Java - Converting from unicode to a string?

I can easily create a unicode character and print it with the following lines of code
String uniChar = Character.toString((char)0000);
System.out.println(uniChar);
However, now I want to retrieve the number above, add 3, and print out the new unicode character that the numbers 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of unichar? As in "\u0000"? That way I could substring just the "0000", convert it to an int, add 3, and reverse the entire process.
I think you're looking for String#codePointAt:
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length()- 1.
If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
For instance:
// String containing smiling face with smiling eyes emoji
String str = "😊";
// Get the code point
int cp = str.codePointAt(0);
// Show it
System.out.println(str + ", code point = U+" + toHex(cp));
// Increase it
++cp;
// Get the updated string (from an array of code points)
String updated = new String(new int[] { cp }, 0, 1);
// Show it
System.out.println(updated + ", code point = U+" + toHex(cp));
(toHex is just return Integer.toString(n, 16).toUpperCase();)
That outputs:
😊, code point = U+1F60A
😋, code point = U+1F60B
This code will work in both cases: for code points from the Unicode BMP and for code points from the Unicode supplementary planes, which need 4 bytes in UTF-8 to encode a character. A 4-byte code point requires 2 Java char values to be stored, so in this case string.length() = 2.
// array will contain one or two characters
char[] chars = Character.toChars(codePoint);
// string.length will be 1 or 2
String str = new String(chars);
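Applied to the original question (read the number, add 3, print the new character), the two calls above can be combined as in the following sketch. The class and helper names are made up for illustration:

```java
public class CodePointShift {
    // Returns a string holding the code point that lies `delta` above the first code point of s
    public static String shiftFirst(String s, int delta) {
        int codePoint = s.codePointAt(0) + delta;
        // toChars yields one char for the BMP and a surrogate pair otherwise
        return new String(Character.toChars(codePoint));
    }

    // Formats a code point in the usual U+XXXX notation
    public static String label(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        String start = Character.toString((char) 0);
        String shifted = shiftFirst(start, 3);
        System.out.println(label(shifted.codePointAt(0))); // U+0003
        System.out.println(shiftFirst("A", 3));            // D
    }
}
```

No string parsing of "\u0000" is needed, because the code point is already available as an int.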
Unicode is a numbering of "characters" (code points) that fits in a 3-byte int range.
The UTF-16 encoding uses a sequence of byte pairs, and a Java char is one such byte pair. A single char, and therefore a cast between int and char, covers only part of Unicode. The correct way to convert a code point to possibly more than one char is:
int codePoint = 0x263B;
char[] chars = Character.toChars(codePoint);
To work with Unicode code points, one can do:
int[] codePoints = {0x2639, 0x263a, 0x263b};
String s = new String(codePoints, 0, codePoints.length);
codePoints[0] += 2;
You could also use an int array holding a single code point.
In java 8 one can get an IntStream of code points:
s.codePoints().forEach(cp -> {
System.out.printf("U+%X = %s%n", cp, Character.getName(cp));
});

Pad a binary String equal to zero ("0") with leading zeros in Java

Integer.toBinaryString(data)
gives me a binary String representation of my array data.
However I would like a simple way to add leading zeros to it, since a byte array equal to zero gives me a "0" String.
I'd like a one-liner like this:
String dataStr = Integer.toBinaryString(data).equals("0") ? String.format(format, Integer.toBinaryString(data)) : Integer.toBinaryString(data);
Is String.format() the correct approach? If yes, what format String should I use?
Thanks in advance!
Edit: The data array is of dynamic length, so should the number of leading zeros.
For padding with, say, 5 leading zeroes, this will work:
String.format("%5s", Integer.toBinaryString(data)).replace(' ', '0');
You didn't specify the expected length of the string. In the sample code above I used 5; replace it with the proper value.
EDIT
I just noticed the comments. Sure, you can build the pattern dynamically, but at some point you have to know the maximum expected size. Depending on your problem, you'll know how to determine that value:
String formatPattern = "%" + maximumExpectedSize + "s";
This is what you asked for—padding is added only when the value is zero.
String s = (data == 0) ? String.format("%0" + len + 'd', 0) : Integer.toBinaryString(data);
If what you really want is for all binary values to be padded so that they are the same length, I use something like this:
String pad = String.format("%0" + len + 'd', 0);
String s = Integer.toBinaryString(data);
s = pad.substring(s.length()) + s;
Using String.format() directly would be the best, but it only supports decimal, hexadecimal, and octal, not binary.
You could write your own version of that function:
public static String toBinaryString(int x){
char[] bits = new char[32]; // 32 bits per int
for (int pos = 0; pos < 32; pos++){
bits[31 - pos] = (char) ('0' + (x & 1)); // take the lowest bit, fill from the right
x >>>= 1; // unsigned shift, so negative values work too
}
return new String(bits);
}
would this satisfy your needs?
String dataStr = data == 0 ? "00" + Integer.toBinaryString(data) : Integer.toBinaryString(data);
edit: noticed the comment about dynamic length:
probably some of the other answers are more suited:)
This is, in concept, almost the same as @Óscar López's answer, but different methods are used, so I thought I should post it. Hope this is fine.
1] Building the format string
String format = "%0" + totalDigits + "d";
2] Integer to Binary Conversion
String dataStr = Integer.toBinaryString(data);
3] Padding with Leading Zeros
dataStr = String.format(format, Integer.valueOf(dataStr));
The major difference here is the 3rd step. I believe it's actually a hack.
@erickson is right that String.format() does not support binary. Hence, I parsed the binary digits as a decimal integer (not its equivalent value), i.e., "100" is converted to one hundred (100), not four (4). I then used normal formatting.
I'm not sure how well optimized this code is, but I think it's easier to read. Maybe that's just me.
EDIT
1] Parsing fails with a NumberFormatException for longer binary strings, because the digits no longer fit in an int. Long can be used, but even that has limitations.
2] BigInteger can be used, but I'm sure it will be the costliest at runtime compared to all the other methods.
So, it seems, unless only shorter binary strings are expected, replace() is the better method.
Seniors,
please correct me if I'm wrong.
Thanks.
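For comparison, the padding discussed in this thread can also be done without String.format at all. This is only a sketch under the assumption that the target width is known at the call site; the class name BinaryPad and the method padBinary are invented for this example:

```java
public class BinaryPad {
    // Left-pads the binary form of data with zeros up to `width` digits.
    // Values that already need more digits are returned unpadded.
    public static String padBinary(int data, int width) {
        String binary = Integer.toBinaryString(data);
        StringBuilder sb = new StringBuilder();
        for (int i = binary.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(binary).toString();
    }

    public static void main(String[] args) {
        System.out.println(padBinary(0, 8));   // 00000000
        System.out.println(padBinary(5, 8));   // 00000101
        System.out.println(padBinary(255, 4)); // 11111111
    }
}
```

Because no number parsing is involved, the length of the binary string is not limited by int or long.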

how to get the binary values of the bytes stored in byte array

I am working on a project that reads data from a file into a byte array and appends "0" to that byte array until its length is 224 bits. I was able to add the zeros, but I am unable to confirm how many zeros are sufficient. So I want to print the file data in the byte array in binary format. Can anyone help me?
For each byte:
cast to int (happens in the next step via automatic widening of byte to int)
bitwise-AND with mask 255 to zero all but the last 8 bits
bitwise-OR with 256 to set the 9th bit to one, making all values exactly 9 bits long
invoke Integer.toBinaryString() to produce a 9-bit String
invoke String#substring(1) to "delete" the leading "1", leaving exactly 8 binary characters (with leading zeroes, if any, intact)
Which as code is:
byte[] bytes = "\377\0\317\tabc".getBytes(StandardCharsets.ISO_8859_1); // one byte per char; needs java.nio.charset.StandardCharsets
for (byte b : bytes) {
System.out.println(Integer.toBinaryString(b & 255 | 256).substring(1));
}
Output of above code (always 8-bits wide):
11111111
00000000
11001111
00001001
01100001
01100010
01100011
Try Integer.toString(bytevalue, 2)
Okay, where'd toBinaryString come from? Might as well use that.
You can work with BigInteger as in the example below, especially if you have 256 bits or more.
Put your array into a string and start from there; see the sample below:
String string = "10000010";
BigInteger biStr = new BigInteger(string, 2);
System.out.println("binary: " + biStr.toString(2));
System.out.println("hex: " + biStr.toString(16));
System.out.println("dec: " + biStr.toString(10));
Another example which accepts bytes:
String string = "The girl on the red dress.";
byte[] byteString = string.getBytes(Charset.forName("UTF-8"));
System.out.println("[Input String]: " + string);
System.out.println("[Encoded String UTF-8]: " + byteString);
BigInteger biStr = new BigInteger(byteString);
System.out.println("binary: " + biStr.toString(2)); // binary
System.out.println("hex: " + biStr.toString(16)); // hex or base 16
System.out.println("dec: " + biStr.toString(10)); // this is base 10
Result:
[Input String]: The girl on the red dress.
[Encoded String UTF-8]: [B@70dea4e
binary: 101010001101000011001010010000001100111011010010111001001101100001000000110111101101110001000000111010001101000011001010010000001110010011001010110010000100000011001000111001001100101011100110111001100101110
hex: 546865206769726c206f6e20746865207265642064726573732e
You can also work to convert Binary to Byte format
try {
System.out.println("binary to byte: " + biStr.toString(2).getBytes("UTF-8"));
} catch (UnsupportedEncodingException e) {e.printStackTrace();}
Note:
To pad the binary string to a fixed width, you can use the sample below:
String.format("%256s", biStr.toString(2)).replace(' ', '0'); // pads to 256 binary digits
A newly allocated byte array is already filled with 0s, so no explicit fill is needed (Arrays.fill(b, (byte) 0) would only repeat that):
byte[] b = new byte[28]; // 224 bits = 28 bytes
Now just fill the array with your data. Any leftover bytes will be 0.
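Putting the pieces of this thread together, here is a hedged sketch that pads data to 224 bits and prints it in binary so the padding can be checked by eye. It assumes the data is already in memory as a byte array; the class and method names are hypothetical:

```java
import java.util.Arrays;

public class PadAndDump {
    public static final int TARGET_BYTES = 224 / 8; // 224 bits = 28 bytes

    // Copies data into a 28-byte array; missing trailing bytes are zero.
    // Note: input longer than 28 bytes is truncated by Arrays.copyOf.
    public static byte[] padTo224Bits(byte[] data) {
        return Arrays.copyOf(data, TARGET_BYTES);
    }

    // Renders each byte as exactly 8 binary digits, separated by spaces
    public static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toBinaryString(b & 255 | 256).substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] padded = padTo224Bits(new byte[] { 'a', 'b', 'c' });
        System.out.println(padded.length);    // 28
        System.out.println(toBinary(padded)); // three data bytes followed by 25 zero bytes
    }
}
```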
