String to binary and vice versa: extended ASCII - java

I want to convert a String to binary by putting it in a byte array (String.getBytes()) and then storing the binary string for each byte (Integer.toBinaryString(byte)) in a String[]. Then I want to convert back to a normal String via Byte.parseByte(stringarray[i], 2). This works great for the standard ASCII table, but not for the extended one. For example, an A gives me 1000001, but an Ä returns
11111111111111111111111111000011
11111111111111111111111110000100
Any ideas how to manage this?
public class BinString {
    public static void main(String[] args) {
        String s = "ä";
        System.out.println(binToString(stringToBin(s)));
    }

    public static String[] stringToBin(String s) {
        System.out.println("Converting: " + s);
        byte[] b = s.getBytes();
        String[] sa = new String[b.length];
        for (int i = 0; i < b.length; i++) {
            sa[i] = Integer.toBinaryString(b[i]);
        }
        return sa;
    }

    public static String binToString(String[] strar) {
        byte[] bar = new byte[strar.length];
        for (int i = 0; i < strar.length; i++) {
            bar[i] = Byte.parseByte(strar[i], 2);
            System.out.println(Byte.parseByte(strar[i], 2));
        }
        return new String(bar);
    }
}

First off: "extended ASCII" is a very misleading title that's used to refer to a ton of different encodings.
Second: byte in Java is signed, while bytes in encodings are usually treated as unsigned. Since you use Integer.toBinaryString(), the byte is converted to an int with sign extension (byte values > 127 are represented as negative values in Java).
To avoid this, simply use & 0xFF to mask all but the lower 8 bits, like this:
String binary = Integer.toBinaryString(byteArray[i] & 0xFF);
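Note that the reverse step needs a matching fix: Byte.parseByte(s, 2) throws a NumberFormatException for anything above 127 (e.g. "11000011" is 195). A minimal sketch of the symmetric parse:
byte b = (byte) Integer.parseInt(binary, 2); // parse as an int in 0..255, then narrow to a byte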

To expand on Joachim's point about "extended ASCII" I'd add...
Note that getBytes() is a transcoding operation that converts data from UTF-16 to the platform default encoding. The encoding varies from system to system and sometimes even between users on the same PC. This means that results are not consistent on all platforms and if a legacy encoding is the default (as it is on Windows) that data can be lost.
To make the operation symmetrical, you need to provide an encoding explicitly (preferably a Unicode encoding such as UTF-8 or UTF-16):
Charset encoding = Charset.forName("UTF-16");
byte[] b = s1.getBytes(encoding);
String s2 = new String(b, encoding);
assert s1.equals(s2);
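Putting both answers together, a corrected version of the original conversion pair might look like this (a sketch: it masks the sign extension, pins the charset to UTF-8, and parses with Integer.parseInt because Byte.parseByte rejects values above 127):
public static String[] stringToBin(String s) {
    byte[] b = s.getBytes(StandardCharsets.UTF_8); // explicit encoding
    String[] sa = new String[b.length];
    for (int i = 0; i < b.length; i++) {
        sa[i] = Integer.toBinaryString(b[i] & 0xFF); // mask off the sign extension
    }
    return sa;
}

public static String binToString(String[] strar) {
    byte[] bar = new byte[strar.length];
    for (int i = 0; i < strar.length; i++) {
        bar[i] = (byte) Integer.parseInt(strar[i], 2); // accepts 0..255, then narrows
    }
    return new String(bar, StandardCharsets.UTF_8); // same encoding both ways
}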

Related

Encrypt using repeating XOR

I have to encrypt a string using repeating XOR with the key "ICE".
I think I wrote a correct algorithm, but the expected solution is 5 bytes shorter than my calculated hex string. Why? Up to those 5 extra bytes, the strings are equal.
Did I miss something about how to do repeating XOR?
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

public class ES5 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String str1 = "Burning 'em, if you ain't quick and nimble";
        String str2 = "I go crazy when I hear a cymbal";
        String correct1 = "0b3637272a2b2e63622c2e69692a23693a2a3c6324202d623d63343c2a2622632427276527";
        byte[] cr = Encript(str1.getBytes(StandardCharsets.UTF_8), "ICE");
        String cr22 = HexFormat.of().formatHex(cr);
        System.out.println(cr22);
        System.out.println(correct1);
    }

    private static byte doXOR(byte b, byte b1) {
        return (byte) (b ^ b1);
    }

    private static byte[] Encript(byte[] bt1, String ice) {
        int x = 0;
        byte[] rt = new byte[bt1.length];
        for (int i = 0; i < bt1.length; i++) {
            rt[i] = doXOR(bt1[i], (byte) (ice.charAt(x) & 0x00FF));
            x++;
            if (x == 3) x = 0;
        }
        return rt;
    }
}
Hmmm. The String contains characters, and XOR works on bytes.
That's why the first thing to do is call String.getBytes() to get a byte array.
Here, depending on the characters and their encoding, the number of bytes can be larger than the number of characters. You may want to print and compare those counts first.
Then you perform XOR on the bytes, which can take you into a completely different area for characters, so you cannot rely on new String(byte[]) at all. Instead you have to create a hex string representation of the byte[].
Finally, compare this hex string with the value in correct1. That string already looks like a hex representation, so do not hex-encode it again.
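As a quick sanity check along those lines, printing the counts at each stage (a sketch using the names from the question) makes the 5-byte difference visible:
System.out.println("chars in str1: " + str1.length());
System.out.println("bytes in str1: " + str1.getBytes(StandardCharsets.UTF_8).length);
System.out.println("hex length mine:     " + cr22.length());     // 2 hex digits per byte
System.out.println("hex length expected: " + correct1.length()); // a 5-byte gap shows as 10 digits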

How to convert each char in a string to an 8-bit int? Java

I've been advised to use a TCP-like checksum, which consists of the sum of the (integer) sequence and ack field values, added to a character-by-character sum of the payload field of the packet (i.e., treat each character as if it were an 8-bit integer and just add them together).
I'm assuming it would go along the lines of:
char[] a = data.toCharArray();
for (int i = 0; i < a.length; i++) {
    ...
}
Though I'm pretty clueless as to how to complete the actual conversion.
My data is a String, and I want to go through it (converted to a char array, though if there's a better way to do this, let me know!). Now that I'm ready to iterate, how does one convert each character to an int? I will then sum the totals.
As String contains Unicode, and char is a two-byte UTF-16 implementation of Unicode, it might be better to first convert the String to bytes:
byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
data = new String(bytes, StandardCharsets.UTF_8); // the inverse operation

int crc = 0;
for (byte b : bytes) {
    int n = b & 0xFF; // an int 0..255, without sign extension
    crc ^= n;
}
Now you can handle any Unicode content of a String. UTF-8 is compact whenever enough ASCII characters are present (even Chinese HTML pages contain plenty of ASCII markup); for plain Chinese text, UTF-16 may be smaller.
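For the checksum actually described in the question, summing the header fields and the payload bytes, a minimal sketch (seq, ack, and data are assumed names for the packet fields) might be:
byte[] payload = data.getBytes(StandardCharsets.UTF_8);
int checksum = seq + ack; // the (integer) sequence and ack field values
for (byte b : payload) {
    checksum += b & 0xFF; // each byte treated as an unsigned 8-bit integer
}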

Convert String to/from byte array without encoding

I have a byte array read over a network connection that I need to transform into a String without any encoding, that is, simply by treating each byte as the low end of a character and leaving the high end zero. I also need to do the converse where I know that the high end of the character will always be zero.
Searching the web yields several similar questions, all with responses indicating that the original data source must be changed. This is not an option, so please don't suggest it.
This is trivial in C but Java appears to require me to write a conversion routine of my own that is likely to be very inefficient. Is there an easy way that I have missed?
No, you aren't missing anything. There is no easy way to do that because String and char are for text. You apparently don't want to handle your data as text, which would make complete sense if it isn't text. You could do it the hard way that you propose.
An alternative is to assume a character encoding that allows arbitrary sequences of arbitrary byte values (0-255). ISO-8859-1 or IBM437 both qualify. (Windows-1252 only has 251 codepoints. UTF-8 doesn't allow arbitrary sequences.) If you use ISO-8859-1, the resulting string will be the same as your hard way.
As for efficiency, the most efficient way to handle an array of bytes is to keep it as an array of bytes.
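To illustrate the ISO-8859-1 alternative, a round trip (a sketch, assuming bytes holds the data read from the network) is lossless for every byte value 0-255:
String s = new String(bytes, StandardCharsets.ISO_8859_1); // each byte maps to the code point with the same value
byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);     // recovers the original bytes exactly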
This will convert a byte array to a String while filling only the lower 8 bits of each char.
public static String stringFromBytes(byte[] byteData) {
    char[] charData = new char[byteData.length];
    for (int i = 0; i < charData.length; i++) {
        charData[i] = (char) (byteData[i] & 0xFF); // mask off sign extension
    }
    return new String(charData);
}
The efficiency should be quite good. As Ben Thurley said, if performance is really such an issue, don't convert to a String in the first place but work with the byte array instead.
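The question also asks for the converse (the high byte is known to be zero); the same loop runs in reverse. A sketch, with bytesFromString as an illustrative name:
public static byte[] bytesFromString(String stringData) {
    byte[] byteData = new byte[stringData.length()];
    for (int i = 0; i < byteData.length; i++) {
        byteData[i] = (byte) stringData.charAt(i); // keeps only the low 8 bits
    }
    return byteData;
}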
Here is sample code which converts a String to a byte array and back without using a Charset, by writing each char as two bytes, high byte first.
public class Test
{
    public static void main(String[] args)
    {
        Test t = new Test();
        t.Test();
    }

    public void Test()
    {
        String input = "Hèllo world";
        byte[] inputBytes = GetBytes(input);
        String output = GetString(inputBytes);
        System.out.println(output);
    }

    public byte[] GetBytes(String str)
    {
        char[] chars = str.toCharArray();
        byte[] bytes = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++)
        {
            bytes[i * 2] = (byte) (chars[i] >> 8); // high byte first
            bytes[i * 2 + 1] = (byte) chars[i];    // then the low byte
        }
        return bytes;
    }

    public String GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.length / 2];
        for (int i = 0; i < chars.length; i++)
            chars[i] = (char) (((bytes[i * 2] & 0xFF) << 8) + (bytes[i * 2 + 1] & 0xFF));
        return new String(chars);
    }
}
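For reference, the hand-rolled pair above is equivalent to using the built-in big-endian UTF-16 charset, which also writes the high byte first and adds no byte-order mark (a sketch):
byte[] bytes = input.getBytes(StandardCharsets.UTF_16BE);
String output = new String(bytes, StandardCharsets.UTF_16BE);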
Alternatively, you can use the deprecated constructor String(byte[] ascii, int hibyte), which fills the high byte of every char with the given value:
String string = new String(byteArray, 0);
String is already encoded as Unicode/UTF-16 internally. UTF-16 means that it can take up to two chars to make one displayable character. What you really want to use is:
byte[] bytes = myString.getBytes(StandardCharsets.UTF_16);
to convert a String to an array of bytes. This does essentially what the code above does by hand, with better performance. If you would like to cut the transmission size roughly in half, convert to UTF-8 (ASCII is a subset of UTF-8), the encoding most of the internet uses:
byte[] bytes = myString.getBytes(StandardCharsets.UTF_8);
To convert back to a String, use:
String myString = new String(bytes, StandardCharsets.UTF_16);
or
String myString = new String(bytes, StandardCharsets.UTF_8);

How to get chars represented by a range of ASCII values of a specific charset?

What I'm trying to do is generate an array of chars that represent certain ASCII values in a certain ISO/IEC charset. Let's say, if I'm interested in ASCII values 211-217 of the ISO/IEC 8859-7 charset, then the result should be { Σ, Τ, Υ, Φ, Χ, Ψ, Ω }. I tried this:
for (int i = 211; i <= 217; i++) {
    System.out.println(String.valueOf((char) i));
}
But the results are based on the default system charset.
You cannot convert individual character codes in a particular encoding to chars directly, so you need to use a byte[]-to-String conversion instead. Since ISO-8859-7 is a single-byte encoding, each character code corresponds to one byte:
Charset cs = Charset.forName("ISO-8859-7");
for (int i = 211; i <= 217; i++) {
    String s = new String(new byte[] { (byte) i }, cs);
    System.out.println(
            String.format("Character %s, codepoint %04X", s, (int) s.charAt(0)));
}
EDIT: Using the output format given above, you can make sure that Unicode code points are decoded correctly, as specified by ISO-8859-7. If you still see ?s instead of characters, it's a problem with the output: your console doesn't support these characters.
Check the result of System.getProperty("file.encoding") - it should be some kind of Unicode (UTF-8, etc.). If you run your code from an IDE, check its configuration for console encoding settings.
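If the console is the culprit, one workaround (a sketch) is to write through a PrintStream with an explicit encoding:
PrintStream out = new PrintStream(System.out, true, "UTF-8"); // declares a checked UnsupportedEncodingException before Java 10
out.println(s); // s decoded from ISO-8859-7 as above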
Your question isn't totally clear. I think what you mean is that you have bytes encoded in ISO-8859-7, and you want to convert them to Java characters (which are UTF-16-encoded Unicode code points).
In that case, try this:
byte[] encoded = new byte[7];
for (int e = 211; e <= 217; ++e)
    encoded[e - 211] = (byte) e;
String s = new String(encoded, Charset.forName("ISO-8859-7")); // the Charset overload avoids the checked UnsupportedEncodingException
for (int idx = 0; idx < s.length(); ++idx)
    System.out.println(s.charAt(idx));

My java class implementation of XOR encryption has gone wrong

I am new to Java but I am very fluent in C++ and C#, especially C#. I know how to do XOR encryption in both C# and C++. The problem is that the algorithm I wrote in Java to implement XOR encryption seems to produce wrong results: the results are usually a bunch of spaces, and I am sure that is wrong. Here is the class:
public final class Encrypter {
    public static String EncryptString(String input, String key)
    {
        int length;
        int index = 0, index2 = 0;
        byte[] ibytes = input.getBytes();
        byte[] kbytes = key.getBytes();
        length = kbytes.length;
        char[] output = new char[ibytes.length];
        for (byte b : ibytes)
        {
            if (index == length)
            {
                index = 0;
            }
            int val = (b ^ kbytes[index]);
            output[index2] = (char) val;
            index++;
            index2++;
        }
        return new String(output);
    }

    public static String DecryptString(String input, String key)
    {
        int length;
        int index = 0, index2 = 0;
        byte[] ibytes = input.getBytes();
        byte[] kbytes = key.getBytes();
        length = kbytes.length;
        char[] output = new char[ibytes.length];
        for (byte b : ibytes)
        {
            if (index == length)
            {
                index = 0;
            }
            int val = (b ^ kbytes[index]);
            output[index2] = (char) val;
            index++;
            index2++;
        }
        return new String(output);
    }
}
Strings in Java are Unicode - and Unicode strings are not general holders for bytes like ASCII strings can be.
You're taking a string and converting it to bytes without specifying what character encoding you want, so you're getting the platform default encoding - probably US-ASCII, UTF-8 or one of the Windows code pages.
Then you're performing arithmetic/logic operations on these bytes. (I haven't looked at what you're doing here - you say you know the algorithm.)
Finally, you're taking these transformed bytes and trying to turn them back into a string - that is, back into characters. Again, you haven't specified the character encoding (but you'll get the same as you got converting characters to bytes, so that's OK), but, most importantly...
Unless your platform default encoding uses a single byte per character (e.g. US-ASCII), then not all of the byte sequences you will generate represent valid characters.
So, two pieces of advice come from this:
1. Don't use strings as general holders for bytes.
2. Always specify a character encoding when converting between bytes and characters.
In this case, you might have more success if you specifically give US-ASCII as the encoding. EDIT: That last sentence is not true. Refer back to point 1 above! Use bytes, not characters, when you want bytes, as sketched below.
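Following both pieces of advice, here is a sketch of repeating-key XOR that stays in bytes end to end (xor is an illustrative name, not part of the original class):
public static byte[] xor(byte[] data, byte[] key) {
    byte[] out = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        out[i] = (byte) (data[i] ^ key[i % key.length]); // repeat the key over the data
    }
    return out;
}
Because XOR is its own inverse, decryption is the same call: xor(xor(plain, key), key) returns the original bytes. Convert the result to hex or Base64 if you need to display or transmit it as text.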
If you use non-ASCII strings as keys you'll get pretty strange results. The bytes in the kbytes array will be negative. Sign extension then means that val will come out negative, and the cast to char will then produce a character in the FF80-FFFF range.
These characters will certainly not be printable, and depending on what you use to check the output you may be shown boxes or other replacement characters.
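To see that effect directly, a tiny sketch (values chosen for illustration):
byte b = (byte) 0xE4;  // 'ä' in ISO-8859-1; negative as a Java byte
int val = b ^ 0x20;    // sign extension keeps val negative
System.out.println(Integer.toHexString(val));        // ffffffc4
System.out.println(Integer.toHexString((char) val)); // ffc4, i.e. in the FF80-FFFF range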
