I can't believe I'm having a hard time with this, but so far haven't found the answer: Let's say I have a Java char (or a 1-character String) and I want to convert it into a byte of ASCII. How do I do this?
I know I could look up the decimal value of the ASCII character and create a byte from that, but it seems like there should be a simple conversion. I found what seems to be the answer for byte arrays:
byte[] asciiArray = "SomeString".getBytes( StandardCharsets.US_ASCII );
but not for just a single byte. Something like:
byte asciiA = <some conversion function>('A');
If you are sure that the character is in the range U+0000 to U+007F, then it fits in a single UTF-16 code unit (char), it is also in the ASCII character set, and that UTF-16 code unit has the same numeric value as the ASCII code unit. In that case a plain cast is enough.
You might want to add a guard because (byte)'½' wouldn't give you anything useful.
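For example, a minimal sketch (the helper name and the guard are just one way to do it):

static byte toAsciiByte(char c) {
    if (c > 0x7F) {
        throw new IllegalArgumentException("Not an ASCII character: " + c);
    }
    return (byte) c; // safe: U+0000..U+007F have the same values in ASCII
}

byte asciiA = toAsciiByte('A'); // 65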
I want to convert a decimal value to its ASCII character, but the code below returns an unexpected result. Here is the code I am using:
public static void main(String[] args) {
    char ret = (char) 146;
    System.out.println(ret); // prints nothing useful
}
I expect to get the right single quote character (') as per http://www.ascii-code.com/.
Has anyone come across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least misleadingly worded: ISO 8859-1 / Latin-1 does not define code point 146. So that's already asking for trouble. You can see this if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
This outputs the same "unexpected" result. What the page is actually describing is the Windows-1252 charset (aka "Windows Latin-1", though that name is largely obsolete these days), which does define code point 146 (0x92) as a right single quote, as do a handful of other Windows code pages. We can verify this:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that page is confusing.
But the big mistake is that you can't do what you're trying to do the way you're doing it. A char in Java is a UTF-16 code unit, not a code point: a single char covers the Basic Multilingual Plane, while the supplementary characters above 0xFFFF need a surrogate pair of chars (or a single int code point) to represent.
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and that the result is only one char long, so grabbing it via s.charAt(0) is safe.
In the general case, however, you have to watch out: your byte sequence and character set might yield a result in the UTF-16 supplementary character range, in which case s.charAt(0) is not sufficient and you need s.codePointAt(0) stored in an int instead.
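For example, a sketch using UTF-8 and the supplementary character U+1D11E (MUSICAL SYMBOL G CLEF), chosen just to illustrate; StandardCharsets avoids the checked exception of the charset-name constructor:

import java.nio.charset.StandardCharsets;

String s = new String(new byte[] {(byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E},
        StandardCharsets.UTF_8);
System.out.println(s.length());                 // 2 -- a surrogate pair
System.out.printf("0x%x%n", (int) s.charAt(0)); // 0xd834 -- only the high surrogate
System.out.printf("0x%x%n", s.codePointAt(0));  // 0x1d11e -- the full code point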
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte) 146}));
char c = cb.get(0); // first UTF-16 code unit of the decoded result
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
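For what it's worth, here's a quick experiment one could run (a sketch using StandardCharsets and U+1F600, GRINNING FACE, as the test character); it suggests decode simply produces the UTF-16 surrogate pair in the CharBuffer:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80}; // UTF-8 for U+1F600
CharBuffer cb = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
System.out.println(cb.length());                                    // 2: a surrogate pair
System.out.printf("0x%x 0x%x%n", (int) cb.get(0), (int) cb.get(1)); // 0xd83d 0xde00
System.out.printf("0x%x%n", Character.codePointAt(cb, 0));          // 0x1f600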
As an aside: in your case, 146 (0x92) cast directly to char gives you U+0092, "PRIVATE USE TWO", and all bets are off for what you'll end up displaying. Unicode classifies it as a control character, and it falls in the range reserved for C1 terminal control codes (although AFAIK it isn't actually used, it's in that range regardless). I wouldn't be surprised if some browsers rendered it as a right single quote for compatibility, while terminals did something weird with it.
Also, FYI, the Unicode code point for the right single quote is U+2019. You can reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer to mentions that values 160 to 255 correspond to the ISO-8859-1 (aka Latin-1) table; values in the range 128 to 159 come from the Windows-specific variant of Latin-1 (ISO-8859-1 leaves that range undefined, to be assigned by the operating system).
Java characters are based on UTF-16, which is itself based on the Unicode table. If you want to refer specifically to the right quote character, you can write it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).
I have a problem with encoding and decoding specific byte values. I'm implementing an application where I need to take String data, do some bit manipulation on it, and return another String.
I'm currently getting byte[] values via String.getBytes(), doing the manipulation, and then building a String with the constructor String(byte[] data). The issue is that when some of the bytes have specific values, e.g. -120, -127, etc., the decoding in the constructor returns the ? character, which is byte value 63. As far as I know these are values that can't be printed on Windows, given that -120 in Java is 10001000, which is the \b character according to the ASCII table.
Is there any charset, that I could use to properly code and decode every byte value (from -128 to 127)?
EDIT: I should also say that the ISO-8859-1 charset works pretty well, but does not encode language-specific characters such as ąęćśńźżół.
You seem to have some confusion regarding encodings, not specific to Java, so I'll try to help clear some of that up.
No charset or encoding uses code points from -128 to 0. If you treat the byte as an unsigned integer, you get the range 0-255, which is valid for all the cp-* and iso-8859-* charsets.
ASCII characters are in the range 0-127 and so look the same whether you treat the byte as signed or unsigned.
UTF-8 encodes a character as either a single byte in the range 0-127 or as a multi-byte sequence whose bytes are all in the range 128-255.
You mention some Polish characters, so instead of ISO-8859-1 you should encode as ISO-8859-2 or (preferably) UTF-8.
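For example, a round-trip sketch with UTF-8 (note that if you then flip arbitrary bits in the bytes, you can still produce sequences that are invalid UTF-8 and get mangled on decoding):

import java.nio.charset.StandardCharsets;

String polish = "ąęćśńźżół";
byte[] bytes = polish.getBytes(StandardCharsets.UTF_8); // every char is representable
String back = new String(bytes, StandardCharsets.UTF_8);
System.out.println(back.equals(polish)); // true -- a lossless round trip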
I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java, characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input, stores them in a single byte, and prints them out. How do two-byte characters fit into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
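If you want to see it done more safely, here is a sketch that makes the charset assumption explicit instead of casting bytes:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Echo {
    public static void main(String[] args) throws IOException {
        // Let a Reader decode the byte stream with a stated charset,
        // instead of casting each raw byte to char.
        Reader in = new InputStreamReader(System.in, StandardCharsets.ISO_8859_1);
        for (int c = in.read(); c != -1; c = in.read()) {
            System.out.print((char) c);
        }
    }
}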
Java stores characters as 2 bytes, but for plain ASCII characters the actual data fits in one byte. So as long as you can assume that the input being read is ASCII, this works: the numeric value of each character fits in a single byte.
TL;DR: In Java, will casting a character obtained from a String via the charAt method to a byte always yield the same value?
I am reading files which are encoded with arbitrary (unknown to us) character encodings. I need to parse these files and look for certain words, e.g. "TAG". I placed certain restrictions on the file contents, such as "when looking for a tag, the bytes for "TAG" must be the same as their ASCII representation".
For example, suppose I have the following file:
0x00 0x11 0x22 0x33 0x54 0x41 0x47 0x77 0x88 0x99 0xaa 0xbb
Since the ASCII values for T, A and G are respectively 0x54, 0x41 and 0x47, I can find "TAG" in the file by parsing the bytes themselves.
0x00 0x11 0x22 0x33 [0x54 0x41 0x47] 0x77 0x88 0x99 0xaa 0xbb
However, I need to hard-code the value of the bytes I am looking for. To do this, I call String's charAt(int i) method and cast the char to a byte.
Here is, for example, how I would verify an arbitrary byte (called b) for the byte representation of 'T':
String tag = "TAG";
char t = tag.charAt(0);
if ((byte) t == b) {
    // magic goes here, such as comparing the 'A' and the 'G'
}
Note: the code is not actually like that, and the verification algorithm is much more elegant.
This works fine on my local machine. However, this will be run on machines which may use very strange encodings. What worries me is whether casting a character obtained with charAt to a byte might yield a different value depending on the machine. I know that Java always encodes chars with the UTF-16 character encoding, but I am worried that converting from a String to a character and then to a byte might yield strange results.
So, in short, will casting a character obtained from a String via the charAt method to a byte always yield the same value? Or will it depend on an external factor?
Thanks for your help!
Note: I cannot hard-code the bytes themselves (in, for example, a byte array) since they can be very very long and may be changed very often in the future.
java.lang.String.charAt will always return a 16-bit UTF-16 code unit, which will always cast to the same byte value. Because char is a 16-bit unsigned type and byte is an 8-bit signed type, the cast can give you unwanted behavior for values outside 0-127; but if your source data is ASCII, you will get exactly the behavior you expect.
Yes: charAt(int) returns the Java-defined char type (UTF-16) and therefore always gives the same value when cast to byte.
By contrast, String.getBytes() returns bytes that depend either on the specified charset or, if none is specified, on the platform's default charset.
Converting a char to a byte with (byte) will give you the same result on all systems.
However, it is very rare that you need to mix char and byte. You should really use one or the other. Mixing the concepts can lead to confusion, as you suspect.
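To make both points concrete, a small sketch (the output comments assume an ASCII-compatible default charset, which is not guaranteed):

import java.nio.charset.Charset;

char t = "TAG".charAt(0);
System.out.println((byte) t);                 // 84 (0x54) on every JVM
System.out.println(Charset.defaultCharset()); // platform-dependent
System.out.println("TAG".getBytes()[0]);      // 84 here too, but only because the
                                              // default charset is ASCII-compatible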
Instead of casting directly, you could use String's codePointAt(int index) method. This should guarantee you the same result every time.
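For instance, a one-line sketch:

int cp = "TAG".codePointAt(0); // 84, the code point for 'T', on any machine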
Q: When casting an int to a char in Java, it seems that the default result is the ASCII character corresponding to that int value. My question is, is there some way to specify a different character set to be used when casting?
(Background info: I'm working on a project in which I read in a string of binary characters, convert it into chunks, and convert the chunks into their decimal values (ints), which I then cast to chars. I then need to be able to "expand" the resulting compressed characters back to binary by reversing the process.
I have been able to do this, but currently I can only compress up to 6 "bits" into a single character, because when I allow for larger amounts, some values in the range do not seem to be handled well by ASCII; they become boxes or question marks, and when they are cast back to an int, their original value has not been preserved. If I could use another character set, I imagine I could avoid this problem and compress the binary 8 bits at a time, which is my goal.)
I hope this was clear, and thanks in advance!
Your problem has nothing to do with ASCII or character sets.
In Java, a char is just a 16-bit integer. When casting ints (which are 32-bit integers) to chars, the only thing you are doing is keeping the 16 least significant bits of the int, and discarding the upper 16 bits. This is called a narrowing conversion.
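For example, a quick sketch of the narrowing behavior:

int i = 0x12345;
char c = (char) i;                     // keeps only the low 16 bits
System.out.printf("0x%x%n", (int) c); // prints 0x2345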
References:
http://java.sun.com/docs/books/jls/second_edition/html/conversions.doc.html#20232
http://java.sun.com/docs/books/jls/second_edition/html/conversions.doc.html#25363
The conversion between characters and integers uses the Unicode values, of which ASCII is a subset. If you are handling binary data you should avoid characters and strings and instead use an integer array - note that Java doesn't have unsigned 8-bit integers.
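For example, the usual masking trick to read a byte as unsigned (a one-line sketch):

byte b = (byte) 0x92;
int unsigned = b & 0xFF; // 146 rather than -110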
What you're searching for is not a cast, it's a conversion.
There is a String constructor that takes an array of bytes and a charset encoding. This should help you.
I'm working on a project in which I read in a string of binary characters, convert it into chunks, and convert the chunks into their decimal values (ints), which I then cast to chars. I then need to be able to "expand" the resulting compressed characters back to binary by reversing the process.
You don't mention why you are doing that, and (to be honest) it's a little hard to follow what you're trying to describe (for one thing, I don't see why the resulting characters would be "compressed" in any way).
If you just want to represent binary data as text, there are plenty of standard ways of accomplishing that. But it sounds like you may be after something else?
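For instance, Base64 is one such standard way; a sketch using java.util.Base64, assuming that's the kind of thing you're after:

byte[] data = {(byte) 0x00, (byte) 0xFF, (byte) 0x92};
String text = java.util.Base64.getEncoder().encodeToString(data); // "AP+S"
byte[] back = java.util.Base64.getDecoder().decode(text);         // the original bytes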