Finding out UTF-8 value of small Theta char - java

There is something confusing about this.
I'm trying to get the UTF-8 value of the small Theta character, which should be the byte sequence 225 182 191:
http://en.wikipedia.org/wiki/Theta#Character_Encodings
But:
public static void main(String... args) {
    char c = 'Ɵ';
    System.out.println((byte) c);
}
Prints: -97 (????)
I did change my text encoding scheme in Eclipse from MacRoman to UTF-8.

The encoding of the text source file has nothing to do with how things are at runtime.
A Java char is a 16-bit wide value. It is always implicitly UTF-16.
When the compiler generates a .class file, char literals are transcoded to UTF-16 and stored as int values in the class's constant pool. String literals are converted to modified UTF-8 for compactness.
When either is loaded by the JVM, it is represented as UTF-16 values/sequences in memory.
Transcoding the value from UTF-16 to UTF-8:
char c = '\u03B8'; // greek small letter theta θ
for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
    int unsigned = b & 0xFF;
    System.out.append(" ").print(unsigned);
}
FYI: the three-byte decimal sequence 225 182 191 is the UTF-8 encoding of MODIFIER LETTER SMALL THETA (U+1DBF), not GREEK SMALL LETTER THETA (U+03B8).
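To see the difference, the same transcoding idea can be applied to both characters (a small sketch using java.util.Arrays; the byte values are printed signed, as Java shows them, with the unsigned equivalents in the comments):
char modifier = '\u1DBF'; // MODIFIER LETTER SMALL THETA
char greek = '\u03B8';    // GREEK SMALL LETTER THETA
System.out.println(Arrays.toString(String.valueOf(modifier).getBytes(StandardCharsets.UTF_8))); // [-31, -74, -65], i.e. 225 182 191 unsigned
System.out.println(Arrays.toString(String.valueOf(greek).getBytes(StandardCharsets.UTF_8)));    // [-50, -72], i.e. 206 184 unsigned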

It should be cast to an int, or alternatively, wrapped in a String so you can call the method codePointAt(0):
char c='Ɵ';
System.out.println((int)c);
System.out.println("Ɵ".codePointAt(0));

Related

What is the best way to get the size of text in bytes in Java?

I have implemented a cryptographic algorithm in Java. Now, I want to measure the size of the message before and after encryption in bytes.
How to get the size of the text in bytes?
For example, if I have a simple text Hi! I am alphanumeric (8÷4=2)
I have tried my best but can't find a good solution.
String temp = "Hi! I am alphanumeric (8÷4=2)";
temp.length()      // this works because in ASCII every char takes one byte
// and in Java every char in a String takes two bytes, so multiply by 2
temp.length() * 2
// also String.getBytes().length and getBytes("UTF-8").length
// return the same result
But in my case, after decryption the message's chars become a mixture of ASCII and non-ASCII Unicode characters.
e.g. QÂʫP†ǒ!‡˜q‡Úy¦\dƒὥ죉ὥ
The above methods return either the length or length * 2.
But I want to calculate the actual bytes (not the JVM's internal size). For example, the char a takes one byte in general, while a Unicode character such as ™ takes more than one byte.
So how to implement this technique in Java?
I want something like the technique used on this website: http://bytesizematters.com/
It gives me 26 bytes for the text QÂʫP†ǒ!‡˜q‡Úy¦\dƒὥ죉ὥ although the length of the text is 22.
Be aware: String is for Unicode text (able to mix all kinds of scripts) and a char is a two-byte UTF-16 code unit.
This means that binary data (byte[]) needs a known encoding/charset before it can be converted to a String.
byte[] b = ...
String s = ...
b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
Without an explicit charset for the bytes, the platform default is taken, which gives non-portable code.
UTF-8 allows all text, not just some scripts: Greek, Arabic, Japanese and so on.
However, as there is a conversion involved, non-text binary data can get corrupted (it will generally not be legal UTF-8), will cost roughly double the memory, and will be slower because of the conversion.
Hence avoid String for binary data at all costs.
To respond to your question:
You might get away with StandardCharsets.ISO_8859_1, which is a single-byte encoding.
String.getBytes(StandardCharsets.ISO_8859_1).length will then correspond to String.length(), though the String might use double the memory, as a char is two bytes.
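A minimal sketch of measuring the same text under different encodings (using the sample text from the question, with its backslash escaped for the Java string literal):
String text = "QÂʫP†ǒ!‡˜q‡Úy¦\\dƒὥ죉ὥ";
System.out.println(text.length());                                      // number of UTF-16 chars (what length() counts)
System.out.println(text.getBytes(StandardCharsets.UTF_8).length);       // bytes when encoded as UTF-8
System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length);  // one byte per char (unmappable chars become '?')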
Alternatives to String:
byte[] itself; java.util.Arrays provides utility functions, like Arrays.equals.
ByteArrayInputStream, ByteArrayOutputStream
ByteBuffer can wrap byte[]; can read and write short/int/...
Convert the byte[] to a Base64 String using Base64.getEncoder().encodeToString(bytes).
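For instance, ByteBuffer from the list above can read primitive values straight out of a byte[] without any String involved (a minimal sketch using java.nio.ByteBuffer):
byte[] raw = { 0x12, 0x34, 0x56, 0x78 };
ByteBuffer buffer = ByteBuffer.wrap(raw);        // big-endian by default
int value = buffer.getInt();                     // reads the four bytes as 0x12345678
System.out.println(Integer.toHexString(value));  // prints 12345678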
Converting a byte to some char
The goal is to convert a byte to a visible symbol displayable in a GUI text field, and where the length in chars is the same as the number of original bytes.
For instance, the font Lucida Sans Unicode has, starting at U+2400, symbols representing the ASCII control characters. For bytes with the 8th bit set, one could take Cyrillic, though confusion may arise because of the similarity of Cyrillic е and Latin e.
static char byte2char(byte b) {
    if (b < 0) {                  // -128 .. -1: map to Cyrillic (U+0401 .. U+0480)
        return (char) (0x400 - b);
    } else if (b < 32) {          // ASCII control characters: U+2400 block symbols
        return (char) (0x2400 + b);
    } else if (b == 127) {        // DEL
        return '\u25C1';
    } else {                      // printable ASCII as-is
        return (char) b;
    }
}
A char is a UTF-16 code unit, but here it also corresponds directly to a Unicode code point (int).
A byte is signed, hence ranges from -128 to 127.
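For example, a whole byte[] could be made displayable with it (a small sketch building on byte2char above; the method name is my own):
static String bytesToDisplay(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length);
    for (byte b : data) {
        sb.append(byte2char(b));   // exactly one visible char per original byte
    }
    return sb.toString();          // sb.length() == data.length
}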

String.getBytes() returns array of Unicode chars

I was reading about getBytes() and the documentation states that it returns
the resultant byte array.
But when I ran the following program, I found that it is returning an array of Unicode symbols.
public class GetBytesExample {
    public static void main(String args[]) {
        String str = new String("A");
        byte[] array1 = str.getBytes();
        System.out.print("Default Charset encoding:");
        for (byte b : array1) {
            System.out.print(b);
        }
    }
}
The above program prints output
Default Charset encoding:65
This 65 is equivalent to the Unicode representation of A. My question is: where are the bytes that the return type promises?
There is no PrintStream.print(byte) overload, so the byte needs to be widened to invoke the method.
Per JLS 5.1.2:
19 specific conversions on primitive types are called the widening primitive conversions:
byte to short, int, long, float, or double
...
There's no PrintStream.print(short) overload either.
The next most-specific one is PrintStream.print(int). So that's the one that's invoked, hence you are seeing the numeric value of the byte.
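A small sketch of how the same byte value ends up printed under different overloads and formats:
byte b = 65;
System.out.println(b);             // widened to int, prints 65
System.out.println((char) b);      // explicit cast to char, prints A
System.out.printf("%02x%n", b);    // formatted as hex, prints 41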
String.getBytes() returns the encoding of the string using the platform encoding. The result depends on which machine you run this. If the platform encoding is UTF-8, or ASCII, or ISO-8859-1, or a few others, an 'A' will be encoded as 65 (aka 0x41).
This 65 is equivalent to the Unicode representation of A
It is also equivalent to the UTF-8 representation of A.
It is also equivalent to the ASCII representation of A.
It is also equivalent to the ISO/IEC 8859-1 representation of A.
It so happens that the encoding of A is the same in a lot of character encodings, and that it matches the Unicode code point. This is not a coincidence; it is a result of the history of character set / character encoding standards.
My question is: where are the bytes that the return type promises?
In the byte array, of course :-)
You are (just) misinterpreting them.
When you do this:
for (byte b : array1) {
    System.out.print(b);
}
you output a series of bytes as decimal numbers with no spaces between them. This is consistent with the way that Java distinguishes between text / character data and binary data. Bytes are binary. The getBytes() method gives a binary encoding (in some character set) of the text in the string. You are then formatting and printing the binary (one byte at a time) as decimal numbers.
If you want more evidence of this, replace the "A" literal with a literal containing (say) some Chinese characters. Or any Unicode characters greater than \u00ff ... expressed using \u syntax.
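A sketch of that experiment (the byte values in the comment assume the string is encoded as UTF-8):
String str = "\u4F60\u597D";                          // two Chinese characters, 你好
for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
    System.out.print((b & 0xFF) + " ");               // each byte as an unsigned decimal
}
// prints 228 189 160 229 165 189 - three bytes per character, clearly not chars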

Is a Java char array always a valid UTF-16 (Big Endian) encoding?

Say that I would encode a Java character array (char[]) instance as bytes:
using two bytes for each character
using big endian encoding (storing the most significant 8 bits in the leftmost and the least significant 8 bits in the rightmost byte)
Would this always create a valid UTF-16BE encoding? If not, which code points will result in an invalid encoding?
This question is very much related to this question about the Java char type and this question about the internal representation of Java strings.
No. You can create char instances that contain any 16-bit value you desire; there is nothing that constrains them to be valid UTF-16 code units, nor constrains an array of them to be a valid UTF-16 sequence. Even String does not require that its data be valid UTF-16:
char data[] = {'\uD800', 'b', 'c'}; // Unpaired lead surrogate
String str = new String(data);
The requirements for valid UTF-16 data are set out in Chapter 3 of the Unicode Standard (basically, everything must be a Unicode scalar value, and all surrogates must be correctly paired). You can test if a char array is a valid UTF-16 sequence, and turn it into a sequence of UTF-16BE (or LE) bytes, by using a CharsetEncoder:
CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException
(And similarly using a CharsetDecoder if you have bytes.)
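A sketch of a reusable check built on that idea (the method name isValidUtf16 is my own, not a standard API):
static boolean isValidUtf16(char[] data) {
    try {
        StandardCharsets.UTF_16BE.newEncoder().encode(CharBuffer.wrap(data));
        return true;                           // all surrogates correctly paired
    } catch (CharacterCodingException e) {     // unpaired surrogates are reported here
        return false;
    }
}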

strange behaviour of java getBytes vs getBytes(charset)

consider the following:
public static void main(String... strings) throws Exception {
    byte[] b = { -30, -128, -94 };

    // section utf-32
    String string1 = new String(b, "UTF-32");
    System.out.println(string1);             // prints ?
    printBytes(string1.getBytes("UTF-32"));  // prints 0 0 -1 -3
    printBytes(string1.getBytes());          // prints 63

    // section utf-8
    String string2 = new String(b, "UTF-8");
    System.out.println(string2);             // prints •
    printBytes(string2.getBytes("UTF-8"));   // prints -30 -128 -94
    printBytes(string2.getBytes());          // prints -107
}

public static void printBytes(byte[] bytes) {
    for (byte b : bytes) {
        System.out.print(b + " ");
    }
    System.out.println();
}
output:
?
0 0 -1 -3
63
•
-30 -128 -94
-107
So I have two questions:
In both sections: why are the outputs of getBytes() and getBytes(charset) different, even though I have specifically mentioned the string's charset?
Why are both byte outputs of getBytes in the UTF-32 section different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
Question 1:
In both sections: why are the outputs of getBytes() and getBytes(charset) different, even though I have specifically mentioned the string's charset?
The character set you've specified is used during character encoding of the string to the byte array (i.e. in the method itself only). It's not part of the String instance itself. You are not setting the character set for the string; the character set is not stored.
Java does not have an internal byte encoding of the character set; it uses arrays of char internally. If you call String.getBytes() without specifying a character set, it will use the platform default - e.g. Windows-1252 on Windows machines.
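A hedged sketch of the difference, using the bullet character that the question's bytes decode to (the default-charset result varies per machine):
String s = "\u2022";                                   // BULLET, the character behind -30 -128 -94
byte[] explicit = s.getBytes(StandardCharsets.UTF_8);  // always -30 -128 -94
byte[] implicit = s.getBytes();                        // depends on the platform default
System.out.println(Charset.defaultCharset());          // e.g. windows-1252 or UTF-8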
Question 2:
Why are both byte outputs of getBytes in the UTF-32 section different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
You cannot always do this. Not all byte sequences are a valid encoding of characters. When such a sequence is decoded, the invalid parts are silently replaced with the replacement character (U+FFFD or ?) by the default decoder.
This already happens during String string1 = new String(b, "UTF-32"): the three bytes are not a valid UTF-32 sequence, so they are replaced. (The UTF-8 decoding of string2, by contrast, succeeds, which is why getBytes("UTF-8") returns the original bytes.)
You can change this behavior using an instance of CharsetDecoder, retrieved using Charset.newDecoder.
If you want to encode a random byte array into a String instance then you should use a hexadecimal or base 64 encoder. You should not use a character decoder for that.
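A minimal round-trip sketch with Base64 (java.util.Base64), using the byte array from the question:
byte[] original = { -30, -128, -94 };
String text = Base64.getEncoder().encodeToString(original);  // "4oCi"
byte[] restored = Base64.getDecoder().decode(text);          // identical to the original array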
Java String / char (16-bit UTF-16!) / Reader / Writer are for Unicode text, so all scripts may be combined in one text.
Java byte (8 bits) / InputStream / OutputStream are for binary data. If that data represents text, one needs to know its encoding to make text out of it.
So a conversion from bytes to text always needs a Charset. Often there exists an overloaded method without the charset, and then it defaults to the System.getProperty("file.encoding") which can differ on every platform.
Using the default is absolutely non-portable if the data has to cross platforms.
So you had the misconception that the encoding belonged to the String. This is understandable, seeing that in C/C++ unsigned char and byte were largely interchangeable, and encodings were a nightmare.

Why can't I make a char bound to the unicode castle character in Java?

class A {
    public static void main(String[] args) {
        char a = '∀';
        System.out.println(a);
        char castle = '𝍇';
        System.out.println(castle);
    }
}
I can make a char for the upside-down A just fine, but when I try to make the castle char I get 3 compile errors. Why?
$ javac A.java && java A
A.java:5: unclosed character literal
char castle = '𝍇';
^
A.java:5: illegal character: \57159
char castle = '𝍇';
^
A.java:5: unclosed character literal
char castle = '𝍇';
^
3 errors
I suspect that the castle character does not fit in a single char, but rather, requires an int code point. In that case, you could use it in a String literal, but not as a char.
The Javadoc for Character states:
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value.
So my guess would be that that character requires more than 16 bits, so it would need to be treated as an int code point.
If your source code file contains non-ascii characters (as does this file), you need to ensure that javac reads it with the proper encoding, otherwise it will default to an encoding that is possibly not the one in which it was saved.
So, if you saved your file in UTF-8 from your editor, you can compile it using:
javac -encoding utf8 A.java
Note that you can also use Unicode escape sequences instead of the actual characters; this makes the code compilable without the -encoding option:
char a = '\u2200';              // the escape for the ∀ character
String castle = "\ud834\udf47"; // outside the BMP, so it needs a surrogate pair and cannot fit in a single 16-bit char
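Alternatively, one can work with the int code point directly (a sketch; the code point value 0x1D347 is the one encoded by the surrogate pair above):
int codePoint = 0x1D347;                                        // the castle character
String castle = new String(Character.toChars(codePoint));      // Character.toChars builds the surrogate pair
System.out.println(castle.length());                           // 2 - two chars (a surrogate pair)
System.out.println(castle.codePointCount(0, castle.length())); // 1 - one code point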
If you are using Windows:
Create a new empty text document anywhere and copy-paste 𝍇 into it.
Close it, then right-click >> Properties and check the Size on the General tab.
The character does not fit into a single 16-bit char, so you need to use two chars (a surrogate pair).
You probably want to read about UTF-16.
