If I am converting a UTF-8 char to byte, will there ever be a difference in the result of these 3 implementations based on locale, environment, etc.?
byte a = "1".getBytes()[0];
byte b = "1".getBytes(Charset.forName("UTF-8"))[0];
byte c = '1';
Your first line is dependent on the environment, because it will encode the string using the default character encoding of your system, which may or may not be UTF-8.
Your second line will always produce the same result, no matter what the locale or the default character encoding of your system is. It will always use UTF-8 to encode the string.
Note that UTF-8 is a variable-length character encoding. Only the first 128 characters (U+0000 to U+007F) are encoded in one byte; all other characters take up between 2 and 4 bytes.
Your third line assigns a char constant to a byte, which narrows the character's UTF-16 code unit to 8 bits, since Java char stores UTF-16 code units. For characters in the ASCII range the UTF-16 code unit has the same numeric value as the single UTF-8 byte, so the result is the same as the second line, but this is not true for characters in general.
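To make the difference concrete, here is a small sketch (my addition, not part of the original answer; it assumes the source file is compiled as UTF-8 so the 'é' literal survives, and that your default charset is ASCII-compatible):
import java.nio.charset.StandardCharsets;

// ASCII-range character: all three agree
System.out.println("1".getBytes()[0]);                             // 49 on most systems (default charset)
System.out.println("1".getBytes(StandardCharsets.UTF_8)[0]);       // 49, always
System.out.println((byte) '1');                                    // 49, always

// Non-ASCII character: the results diverge
System.out.println("é".getBytes(StandardCharsets.ISO_8859_1)[0]);  // -23 (0xE9)
System.out.println("é".getBytes(StandardCharsets.UTF_8)[0]);       // -61 (0xC3, first of two UTF-8 bytes)
System.out.println((byte) 'é');                                    // -23 (low 8 bits of U+00E9)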
In principle the question is already answered, but I cannot resist posting a little scribble for those who like to play around with code:
import java.nio.charset.Charset;

public class EncodingTest {

    private static void checkCharacterConversion(String c) {
        byte asUtf8 = c.getBytes(Charset.forName("UTF-8"))[0];
        byte asDefaultEncoding = c.getBytes()[0];
        byte directConversion = (byte) c.charAt(0);
        if (asUtf8 != asDefaultEncoding) {
            System.out.println(String.format(
                    "First char of %s has different result in UTF-8 %d and default encoding %d",
                    c, asUtf8, asDefaultEncoding));
        }
        if (asUtf8 != directConversion) {
            System.out.println(String.format(
                    "First char of %s has different result in UTF-8 %d and direct as byte %d",
                    c, asUtf8, directConversion));
        }
    }

    public static void main(String[] argv) {
        // btw: first time I ever wrote a for loop with a char - feels weird to me
        for (char c = '\0'; c <= '\u007f'; c++) {
            String cc = new String(new char[] {c});
            checkCharacterConversion(cc);
        }
    }
}
If you run this e.g. with:
java -Dfile.encoding="UTF-16LE" EncodingTest
you will get no output.
But of course every single byte (ok, except for the first) will be wrong if you try:
java -Dfile.encoding="UTF-16BE" EncodingTest
because in "big endian" the first byte is always zero for ascii chars.
That is because in UTF-16 an ascii character '\u00xy is represented by two bytes, in UTF16-LE as [xy, 0] and in UTF16-BE as [0, xy]
However, only the first if statement produces any output, so b and c are indeed the same for the 128 ASCII characters, because in UTF-8 they are encoded by a single byte. This is not true for any further characters, however; they all have multi-byte representations in UTF-8.
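For example (a small addition of mine, not in the original answer), you can see the two byte orders directly:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]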
Related
I'm developing a JPEG decoder (I'm in the Huffman phase) and I want to write binary strings into a file.
For example, let's say we've this:
String huff = "00010010100010101000100100";
I've tried to convert it to an integer by splitting it by 8 and saving its integer representation, as I can't write bits:
huff.split("(?<=\\G.{8})")
int val = Integer.parseInt(str, 2);
out.write(val); //writes to a FileOutputStream
The problem is that, in my example, if I try to save "00010010" it converts it to 18 (10010), and I need the 0's.
And finally, when I read :
int enter;
String code = "";
while ((enter = in.read()) != -1) {
    code += Integer.toBinaryString(enter);
}
I got :
Code = 10010
instead of:
Code = 00010010
Also I've tried to convert it to a BitSet and then to byte[], but I have the same problem.
Your example is that you have the string "10010" and you want the string "00010010". That is, you need to left-pad this string with zeroes. Note that since you're joining the results of many calls to Integer.toBinaryString in a loop, you need to left-pad these strings inside the loop, before concatenating them.
while ((enter = in.read()) != -1) {
    String binary = Integer.toBinaryString(enter);
    // left-pad to length 8
    binary = ("00000000" + binary).substring(binary.length());
    code += binary;
}
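As a side note (not part of the original answer), an equivalent one-liner for the left-padding is String.format("%8s", binary).replace(' ', '0'), which right-aligns the string in a field of width 8 and then turns the padding spaces into zeroes.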
You might want to look at the UTF-8 algorithm, since it does exactly what you want. It stores massive amounts of data while discarding zeros, keeping relevant data, and encoding it to take up less disk space.
Works with: Java version 7+
import java.nio.charset.StandardCharsets;
import java.util.Formatter;
public class UTF8EncodeDecode {

    public static byte[] utf8encode(int codepoint) {
        return new String(new int[]{codepoint}, 0, 1).getBytes(StandardCharsets.UTF_8);
    }

    public static int utf8decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("%-7s %-43s %7s\t%s\t%7s%n",
                "Char", "Name", "Unicode", "UTF-8 encoded", "Decoded");
        for (int codepoint : new int[]{0x0041, 0x00F6, 0x0416, 0x20AC, 0x1D11E}) {
            byte[] encoded = utf8encode(codepoint);
            Formatter formatter = new Formatter();
            for (byte b : encoded) {
                formatter.format("%02X ", b);
            }
            String encodedHex = formatter.toString();
            int decoded = utf8decode(encoded);
            System.out.printf("%-7c %-43s U+%04X\t%-12s\tU+%04X%n",
                    codepoint, Character.getName(codepoint), codepoint, encodedHex, decoded);
        }
    }
}
https://rosettacode.org/wiki/UTF-8_encode_and_decode#Java
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" (slash) in filenames, "\" (backslash) in escape sequences, and "%" in printf.
https://en.wikipedia.org/wiki/UTF-8
Binary 11110000 10010000 10001101 10001000 becomes F0 90 8D 88 in UTF-8. Since you are storing it as text, you go from having to store 32 characters to storing 8. And because it's a well known and well designed encoding, you can reverse it easily. All the math is done for you.
Your example of 00010010100010101000100100 (or rather 00000001 0010100 0101010 00100100) converts to *$ (two unprintable characters on my machine). That's the UTF-8 encoding of the binary. I had mistakenly used a different site that was using the data I put in as decimal instead of binary.
https://onlineutf8tools.com/convert-binary-to-utf8
For a really good explanation of UTF-8 and how it can apply to the answer:
https://hackaday.com/2013/09/27/utf-8-the-most-elegant-hack/
Edit:
I took this question as a way to reduce the amount of characters needed to store values, which is a type of encoding. UTF-8 is a type of encoding. Used in a "non-standard" way, the OP can use UTF-8 to encode their strings of 0's & 1's in a much shorter format. That's how this answer is relevant.
If you concatenate the characters, you can go from 4x 8 bits (32 bits) to 8x 8 bits (64 bits) easily and encode a value as large as 9,223,372,036,854,775,807.
This question already has answers here:
Char - Java not working as intended / my code
I am following a tutorial on Udemy, and there the instructor says that we can store an integer value in the char data type. But when I try to print the value ... nothing shows up.
I tried assigning the "char one" value to an integer variable and then getting the output from the int variable. That works, but why can't I use the char to output the value?
public static void main(String[] args) {
    char one = 10;
    System.out.println(one);
}
If you look at the ASCII table you would see that the character 10 represents the newline character.
This can be proved by the code below:
public static void main(String[] args) {
    char one = 10;
    // no newline added by print, but println adds a newline implicitly
    System.out.print("Test");
    System.out.print(one);
    System.out.print("Test");
}
The output is:
Test
Test
Although I used System.out.print, a newline was still added to the output after the first Test. So you see, something was actually printed.
Furthermore, when you pass a char to System.out.println(), the char is converted to its String representation by invoking String.valueOf(char), since char is a primitive.
For Objects when you pass a reference in the System.out.println() the toString() method of the object would be called to get its String representation.
If you change the value to char one = 65 you would see the letter A printed.
In Java the char type is a 16-bit integer type, therefore values can be converted between char and int.
When you print an int you get an integer number. When you print a char you get the corresponding character; char ch = 10 is not a printable character.
char ch = 'A';
System.out.println(ch); // print 'A'
int code = ch;
System.out.println(code); // print 65 - ASCII code of 'A'
Adding to the above answers, if you want to output the int value from the variable "one", a cast would work:
char one = 10;
System.out.println((int) one);
If you take a look at the ASCII Table, you can see the value of 10 is LF which is a new line. If you print this alone, it will appear to be doing nothing because it is just a new line.
However, if you modify the code a bit to print some actual characters on both sides of the LF char:
char c1 = 70;
System.out.print(c1);
char one = 10;
System.out.print(one);
char c2 = 71;
System.out.print(c2);
This will output:
F
G
On separate lines, due to the newline in between; without it they would have printed on the same line.
Additionally you can see on that table 70 corresponds with F, and 71 with G.
Note: Java does not technically use ASCII, but rather a different encoding depending on your environment (commonly UTF-16 or ISO-8859-1); however, the characters are usually equivalent to ASCII for the range of values the ASCII table covers (a superset). For example char c1 = 202 will print Ê for me, which is not an ASCII value.
You are misinterpreting your output and drawing the wrong conclusion.
A char is a UTF-16 code unit. UTF-16 is a character encoding for the Unicode character set. UTF-16 encodes a Unicode codepoint with one or two UTF-16 code units. Typically, if it might be two code units, you'd use String or char[] instead of char. But if your codepoint is known to take only one UTF-16 code unit, you could use char.
The codepoint you are using is U+000A 'LINE FEED (LF)'. It does take one UTF-16 code unit \u000a, which is convertible from the integer value 0xa or 10. If you inspect your output carefully, you'll "see". Perhaps adding output before and after would help.
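A tiny sketch (my addition, not part of the original answer) that makes the invisible line feed visible by surrounding it with markers:
char one = 10;                        // U+000A LINE FEED (LF)
System.out.println("[" + one + "]");  // prints "[", then a line break, then "]" on the next line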
I have tried numerous Strings with random characters, and except for the empty string "", their .getBytes() byte arrays seem never to contain any 0 values (like {123, -23, 54, 0, -92}).
Is it always the case that their .getBytes() byte arrays contain no zero, except for an empty string?
Edit: the previous test code is as follows. Now I learned that in Java 8 the result seems to always be "contains no 0" if the String is made up of (char) (random.nextInt(65535) + 1); and "contains 0" if the String contains (char) 0.
private static String randomString(int length) {
    Random random = new Random();
    char[] chars = new char[length];
    for (int i = 0; i < length; i++) {
        int integer = random.nextInt(65535) + 1;
        chars[i] = (char) (integer);
    }
    return new String(chars);
}

public static void main(String[] args) throws Exception {
    for (int i = 1; i < 100000; i++) {
        String s1 = randomString(10);
        byte[] bytes = s1.getBytes();
        for (byte b : bytes) {
            if (b == 0) {
                System.out.println("contains 0");
                System.exit(0);
            }
        }
    }
    System.out.println("contains no 0");
}
It does depend on your platform's default encoding. But in many encodings, the '\0' (null) character will result in getBytes() returning an array with a zero in it.
System.out.println("\0".getBytes()[0]);
This will work with the US-ASCII, ISO-8859-1 and the UTF-8 encodings:
System.out.println("\0".getBytes("US-ASCII")[0]);
System.out.println("\0".getBytes("ISO-8859-1")[0]);
System.out.println("\0".getBytes("UTF-8")[0]);
If you have a byte array and you want the string that corresponds to it, you can also do the reverse:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b);
However this will give different results for different encodings, and in some encodings it may be an invalid sequence.
And the characters in it may not be printable.
Your best bet is the ISO-8859-1 encoding; in it, only the null character cannot be printed:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b, "ISO-8859-1");
System.out.println(s);
System.out.println((int) s.charAt(3));
Edit
In the code that you posted, it's also easy to get "contains 0" if you specify the UTF-16 encoding:
byte[] bytes = s1.getBytes("UTF-16");
It's all about encoding, and you haven't specified it. When you don't pass one as an argument to the getBytes method, it uses your platform's default encoding.
To find out what that is on your platform, run this:
System.out.println(System.getProperty("file.encoding"));
On MacOS, it's UTF-8; on Windows it's likely to be one of the Windows codepages like Cp-1252. You can also specify the platform default on the command line when you run Java:
java -Dfile.encoding=UTF16 <the rest>
If you run your code that way you'll also see that it contains 0.
Is it always the case that their .getBytes() byte arrays contain no zero, except for an empty string?
No, there is no such guarantee. First, and most importantly, .getBytes() returns "a sequence of bytes using the platform's default charset". As such there is nothing preventing you from defining your own custom charset that explicitly encodes certain values as 0s.
More practically, many common encodings will include zero bytes, notably to represent the NUL character. But even if your strings don't include NULs, it's possible for the byte sequence to include 0s. In particular UTF-16 (which Java uses internally) represents every BMP character in two bytes, meaning ASCII characters (which only need one) are paired with a 0 byte.
You could also very easily test this yourself by trying to construct a String from a sequence of bytes containing 0s with an appropriate constructor, such as String(byte[] bytes) or String(byte[] bytes, Charset charset). For example (notice my system's default charset is UTF-8):
System.out.println("Default encoding: " + System.getProperty("file.encoding"));
System.out.println("Empty string: " + Arrays.toString("".getBytes()));
System.out.println("NUL char: " + Arrays.toString("\0".getBytes()));
System.out.println("String constructed from {0} array: " +
Arrays.toString(new String(new byte[]{0}).getBytes()));
System.out.println("'a' in UTF-16: " +
Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));
prints:
Default encoding: UTF-8
Empty string: []
NUL char: [0]
String constructed from {0} array: [0]
'a' in UTF-16: [-2, -1, 0, 97]
The problem I am facing occurs when I try to type cast some ASCII values to char.
For example:
(char)145 //returns ?
(char)129 //also returns ?
but it is supposed to return a different character. It happens to many other values as well.
I hope I have been clear enough.
ASCII is a 7-bit encoding system. Some programs even use this to detect whether a file is binary or textual. Characters below 32 are control characters and are used as directives (for instance newline, carriage return).
The program will still work, however. A character is simply stored as a 16-bit value, but values such as 129 and 145 have no printable interpretation. This means that the textual output of both values shows nothing useful. On the other hand, comparisons like (char) 145 == (char) 129 will still work (return false), simply because for the processor there is no difference between a 16-bit number and a character.
If you are interested in converting your value such that only the lowest seven bits count (thus modifying the value so that it is in the valid range), you can use masking:
int value = 145;
value &= 0x7f;
char c = (char) value;
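Note (my addition, not part of the original answer) that for these particular inputs the masked value is still a control character: 145 & 0x7f is 17 (U+0011), so it will not produce a visible glyph either; masking only guarantees the value falls within the 7-bit ASCII range.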
The Java char type is a 16-bit UTF-16 code unit. So you could do (char) 265 for ĉ (c with circumflex). ASCII is 7 bits, 0 - 127.
String s = "" + ((char)145) + ((char)129);
The above is a string of two Unicode characters (each 2 bytes, UTF-16).
byte[] bytes = s.getBytes(StandardCharsets.US_ASCII); // ASCII with '?' as 7bit
s = new String(bytes, StandardCharsets.US_ASCII); // "??"
byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1); // ISO-8859-1 with Latin1
byte[] bytes = s.getBytes("Windows-1252"); // With Windows Latin1
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // No information loss.
s = new String(bytes, StandardCharsets.UTF_9); // Orinal string.
In Java, String/char/Reader/Writer handle text (in Unicode), whereas byte[]/InputStream/OutputStream handle binary data, bytes.
Bytes must always be associated with an encoding to yield text.
Answer: as soon as there is a conversion from text to some encoding that cannot represent that char, a question mark may be written.
These expressions evaluate to true:
((char) 145) == '\u0091';
((char) 129) == '\u0081';
These UTF-16 values map to the Unicode code points U+0091 and U+0081:
0091;<control>;Cc;0;BN;;;;;N;PRIVATE USE ONE;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
These are both control characters without visible graphemes (the question mark acts as a substitution character), and one of them is named "private use", so it has no designated purpose. Neither is in the ASCII set.
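A quick check (my addition, not part of the original answer) confirming that both values are control characters:
char c1 = (char) 145; // U+0091
char c2 = (char) 129; // U+0081
System.out.println(Character.isISOControl(c1)); // true - it lies in the C1 control range U+0080..U+009F
System.out.println(Character.isISOControl(c2)); // true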
Because MySQL 5.1 does not support 4 byte UTF-8 sequences, I need to replace/drop the 4 byte sequences in these strings.
I'm looking a clean way to replace these characters.
Apache libraries are replacing the characters with a question mark, which is fine for this case, although an ASCII equivalent would be nicer, of course.
N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.
We ended up implementing the following method in Java for this problem.
Basically it replaces the characters whose code point is higher than that of the last 3-byte UTF-8 char.
The offset calculations make sure we stay on Unicode code point boundaries.
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD";

public static String toValid3ByteUTF8String(String s) {
    final int length = s.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);
        // do something with the codepoint
        if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
            b.append(REPLACEMENT_CHAR);
        } else {
            if (Character.isValidCodePoint(codepoint)) {
                b.appendCodePoint(codepoint);
            } else {
                b.append(REPLACEMENT_CHAR);
            }
        }
        offset += Character.charCount(codepoint);
    }
    return b.toString();
}
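For example (a hypothetical usage of mine, not from the original answer), a string containing an emoji, which is a supplementary code point taking 4 bytes in UTF-8, comes back with the replacement character instead:
String in = "a\uD83D\uDE00b";                    // "a😀b"; U+1F600 is outside the BMP
System.out.println(toValid3ByteUTF8String(in));  // prints a, then the replacement character, then b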
Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:
text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
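This works because Java regular expressions match by code point, so a supplementary character (stored as a surrogate pair inside the String) is matched as a single unit and replaced by a single U+FFFD. A small illustration (my addition, not from the original answer):
String withEmoji = "abc\uD83D\uDE00def"; // the emoji is U+1F600, a 4-byte character in UTF-8
System.out.println(withEmoji.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD")); // abc + one replacement char + def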
5-byte UTF-8 sequences begin with a 111110xx byte and 6-byte UTF-8 sequences begin with a 1111110x byte. It is important to note that no continuation byte of a 1-4-byte UTF-8 sequence is that large, because continuation bytes are always of the form 10xxxxxx.
Therefore you can just go through the bytes, and every time you see a byte of the kind 111110xx, emit only a '?' to the output stream/array while skipping the next 4 bytes of the input; do the analogous thing for the 6-byte sequences.
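For completeness, here is a sketch of my own (not from the original answer) applying the same byte-scan idea to the 4-byte sequences the question actually asks about; their lead bytes look like 11110xxx (0xF0-0xF7), and continuation bytes are 10xxxxxx:
import java.io.ByteArrayOutputStream;

public static byte[] dropFourByteSequences(byte[] utf8) {
    ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length);
    int i = 0;
    while (i < utf8.length) {
        int b = utf8[i] & 0xFF;
        if (b >= 0xF0) {                  // lead byte of a 4-byte (or longer, invalid) sequence
            out.write('?');               // emit a placeholder instead
            i++;
            while (i < utf8.length && (utf8[i] & 0xC0) == 0x80) {
                i++;                      // skip its continuation bytes
            }
        } else {
            out.write(b);                 // 1-, 2- and 3-byte sequences pass through unchanged
            i++;
        }
    }
    return out.toByteArray();
}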