TL;DR: In Java, will casting a character obtained from a String via the charAt method to a byte always yield the same value?
I am reading files which are encoded with arbitrary (unknown to us) character encodings. I need to parse these files and look for certain words, e.g. "TAG". I placed certain restrictions on the file contents, such as: when looking for a tag, the bytes for "TAG" must be the same as their ASCII representation.
For example, suppose I have the following file:
0x00 0x11 0x22 0x33 0x54 0x41 0x47 0x77 0x88 0x99 0xaa 0xbb
Since the ASCII values for T, A and G are respectively 0x54, 0x41 and 0x47, I can find "TAG" in the file by parsing the bytes themselves.
0x00 0x11 0x22 0x33 [0x54 0x41 0x47] 0x77 0x88 0x99 0xaa 0xbb
However, I need to hard-code the value of the bytes I am looking for. To do this, I call String's charAt(int i) method and cast the char to a byte.
Here is, for example, how I would verify an arbitrary byte (called b) for the byte representation of 'T':
String tag = "TAG";
char t = tag.charAt(0);
if ((byte)t == b){
//magic goes here, such as comparing the 'A' and the 'G'
}
Note: the code is not actually like that, and the verification algorithm is much more elegant.
This works fine on my local machine. However, this will be run on machines which may use very strange encodings. What worries me is whether casting a character obtained with charAt to a byte might yield a different value depending on the machine. I know that Java always encodes chars with the UTF-16 character encoding, but I am worried that converting from a String to a char and then to a byte might yield strange results.
So, in short, will casting a character obtained from a String via the charAt method to a byte always yield the same value? Or will it depend on an external factor?
Thanks for your help!
Note: I cannot hard-code the bytes themselves (in, for example, a byte array) since they can be very very long and may be changed very often in the future.
java.lang.String.charAt will always return a 16-bit UTF-16 code unit, and casting it to a byte will always give the same value. However, because char is a 16-bit unsigned data type and byte is an 8-bit signed one, the cast can give you unwanted behavior for values above 127. If your source data is ASCII, you will get exactly the behavior you expect.
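To make that concrete, here is a minimal sketch (the values in the comments assume the cast semantics described above, and 'é' stands in for any non-ASCII character):
char t = "TAG".charAt(0);          // 'T', i.e. the UTF-16 code unit 0x0054
byte asByte = (byte) t;            // 0x54 (84) on every JVM
char eAcute = '\u00E9';            // 'é', outside the ASCII range
byte truncated = (byte) eAcute;    // -23 (0xE9): the cast simply discards the high bits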
Yes, charAt(int) returns a Java-defined char (a UTF-16 code unit) and is therefore always the same when cast to byte.
By contrast, String.getBytes() returns bytes that depend either on the specified charset, or on the default charset of the OS if none is specified.
Conversion of a char to a byte with (byte) will give you the same result on all systems.
However, it is very rare that you need to mix char and byte. You should really use one or the other. Mixing the concepts can lead to confusion as you suspect.
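For example, if you want bytes that are guaranteed to be the ASCII values regardless of the platform default, you could name the charset explicitly (a sketch; StandardCharsets requires Java 7+):
import java.nio.charset.StandardCharsets;

byte[] tagBytes = "TAG".getBytes(StandardCharsets.US_ASCII); // always {0x54, 0x41, 0x47}
byte[] defaultBytes = "TAG".getBytes();                      // depends on the platform's default charset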
Instead of casting them directly, you could use the Character.codePointAt(CharSequence, int) method. This should guarantee you the same result every time.
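A minimal sketch of that approach, assuming b holds the candidate byte read from the file:
String tag = "TAG";
int cp = Character.codePointAt(tag, 0); // 0x54 for 'T'
if (cp == (b & 0xFF)) {                 // mask so the byte is compared as an unsigned value
    // compare the 'A' and the 'G' the same way
}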
Related
My goal is to conserve space in my data store, which only accepts Strings.
Because a String in Java is backed by an array of 16-bit chars, I figure that in theory I should be able to convert my 8-byte long into a 4-char String, as both are represented by 8 bytes. (To be clear, I am not interested in making my long integer human-readable in base 10; I want to store it in as short a String as possible.)
However, almost all the literature I have found on this is about converting to the 8-bit byte type, not the type char.
I could encode it as UTF-8. I am concerned this would mean I double the length of the String, as each 8-bit byte is stored as a 16-bit char. This would defeat my whole purpose for compacting my data into a 64-bit medium in the first place.
private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");
new String(ByteBuffer.allocate(8).putLong(value).array(), UTF8_CHARSET);
Is my concern correct that I would be wasting space, and if so, is there a way to not waste space?
char != int
Q: Are there any byte sequences that are not generated by a UTF? How
should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte
sequences as characters, however, it may take error recovery actions.
No conformant process may use irregular byte sequences to encode
out-of-band information.
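You can see this bite you with a concrete long whose byte pattern is not valid UTF-8 (a sketch; the value is chosen purely for illustration):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

long value = 0xC200C200C200C2L; // its big-endian bytes alternate 0x00 and 0xC2, an illegal UTF-8 pattern
byte[] raw = ByteBuffer.allocate(8).putLong(value).array();
String s = new String(raw, StandardCharsets.UTF_8);      // illegal sequences become U+FFFD
byte[] roundTripped = s.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.equals(raw, roundTripped));     // false - the original bytes are gone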
String != byte[] && char != int
Internally String objects are Unicode and encoded as UTF-16 no matter what their source is.
How is text represented in the Java platform?
The Java programming language is based on the Unicode character set,
and several libraries implement the Unicode standard. The primitive
data type char in the Java programming language is an unsigned 16-bit
integer that can represent a Unicode code point in the range U+0000 to
U+FFFF, or the code units of UTF-16. The various types and classes in
the Java platform that represent character sequences - char[],
implementations of java.lang.CharSequence (such as the String class),
and implementations of java.text.CharacterIterator - are UTF-16
sequences.
String is internally represented by UTF-16
The character encodings like UTF-8 are only for interpreting or converting to/from a byte[].
Even if you write a custom CharsetProvider, all that will do is encode/decode a byte[] externally; it will absolutely not change the fact that a String is internally represented as UTF-16, so what you want to do is somewhat pointless.
Can't be done
A character is actually a number of up to 32 bits (a Unicode code point); a Charset is just an encoding of that number. UTF-8 uses 1, 2, 3 or 4 bytes per code point, for example, and UTF-16 uses 2 or 4 bytes, with the surrogate ranges indicating whether the next code unit is part of the same character or not.
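A short sketch of the char-vs-code-point distinction (any code point above U+FFFF needs two Java chars):
String s = new String(Character.toChars(0x1F600));   // a single code point outside the BMP
System.out.println(s.length());                      // 2 - it is stored as a surrogate pair of chars
System.out.println(s.codePointCount(0, s.length())); // 1 - but it is only one character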
I have a problem with encoding and decoding specific byte values. I'm implementing an application, where I need to get String data, make some bit manipulation on it and return another String.
I'm currently getting byte[] values via String.getBytes(), doing the proper manipulation, and then returning a String via the constructor String(byte[] data). The issue is that when some of the bytes have specific values, e.g. -120, -127, etc., the decoding in the constructor yields the ? character, i.e. byte value 63. As far as I know, these are values that can't be printed on Windows, given that -120 in Java is 10001000 in binary, i.e. 0x88, which is outside the printable ASCII range.
Is there any charset, that I could use to properly code and decode every byte value (from -128 to 127)?
EDIT: I shall also say, that ISO-8859-1 charset works pretty fine, but does not code language specific characters, such as ąęćśńźżół
You seem to have some confusion regarding encodings, not specific to Java, so I'll try to help clear some of that up.
There are no charsets or encodings which use code points from -128 to -1. If you treat the byte as an unsigned integer, then you get the range 0-255, which is valid for all the cp-* and iso-8859-* charsets.
ASCII characters are in the range 0-127 and so appear valid whether you treat the int as signed or unsigned.
UTF-8 encodes characters either as a single byte in the range 0-127 or as multi-byte sequences whose bytes are all in the range 128-255.
You mention some Polish characters, so instead of ISO-8859-1 you should encode as ISO-8859-2 or (preferably) UTF-8.
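For illustration, here is how one of those Polish characters comes out under each charset (a sketch; it assumes the ISO-8859-2 charset is available in your JRE, which is common but not guaranteed):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

String s = "\u0105"; // "ą"
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1)));   // [63]        -> '?', unmappable
System.out.println(Arrays.toString(s.getBytes(Charset.forName("ISO-8859-2")))); // [-79]       -> 0xB1
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));        // [-60, -123] -> 0xC4 0x85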
Usually when I need to convert my string to byte[] I use getBytes() without a param. I was told this is not safe and that I should use a charset. Why should I do so - the letter 'A' will always be parsed to 0x41, won't it?
Usually when I need to convert my string to byte[] I use getBytes() without a param.
Stop doing that right now. I would suggest that you always specify an encoding. If you want to use the platform default encoding (which is what you'll get if you don't specify one), then do that explicitly so that it's clearer. But that should very rarely be the approach anyway. Personally I use UTF-8 in almost all cases.
Why should I do so - the letter 'A' will always be parsed to 0x41, won't it?
Nope. For example, using UTF-16, 'A' will be two bytes - 0x41 0x00 or 0x00 0x41 (depending on the endianness). In EBCDIC encodings it could be something completely different.
Most encodings treat ASCII characters in the same way - but characters outside ASCII are represented very differently in different encodings (and many encodings only support a subset of Unicode).
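A quick way to see the difference for yourself (a sketch using only the charsets every JRE is required to support):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8)));    // [65]
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]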
See my article on Unicode (C#-focused, but the principles are the same) for a few more details - and links to more information than you're ever likely to want.
Different character encodings lead to different ways characters get parsed. In ASCII, sure, 'A' will parse to 0x41. In other encodings, this will be different.
This is why when you go to some webpages, you may see a bunch of weird characters. The browser doesn't know how to decode it, so it just decodes to the default.
Some background: When text is stored in files or sent between computers over a socket, the text characters are stored or sent as a sequence of bits, almost always grouped in 8-bit bytes. The characters all have defined numeric values in Unicode, so that 'A' always has the value 0x41 (well, there are actually two other A's in the Unicode character set, in the Greek and Russian alphabets, but that's not relevant).
But there are many mechanisms for how those numeric codes are translated to a sequence of bits when storing in a file or sending to another computer. In UTF-8, 0x41 is represented as 8 bits (the byte 0x41), but other numeric values (code points) will be converted to 16 or more bits with an algorithm that rearranges the bits; in UTF-16, 0x41 is represented as 16 bits; and there are other encodings like JIS and some which are capable of representing some but not all of the Unicode characters.
Since String.getBytes() was intended to return a byte array that contains the bytes to be sent to a file or socket, the method needs to know what encoding it's supposed to use when creating those bytes. Basically the encoding will have to be the same one that a program later reading a file, or a computer at the other end of the socket, expects it to be.
Title is pretty self-explanatory. In a lot of the JRE javadocs I see the phrases "stream of bytes" and "stream of characters" all over the place.
But aren't they the same thing? Or are they slightly different (e.g. interpreted differently) in Java-land? Thanks in advance.
In Java, a byte is not the same thing as a char. Therefore a byte stream is different from a character stream. Bytes are intended for arbitrary binary data; characters are specifically for data representing the building blocks of strings.
but if a char is only 1 byte in width
Except that it's not.
As per the JLS §4.2.1 a char is a number in the range:
from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
But a byte is a number in the range
from -128 to 127, inclusive
A stream of bytes is just plain bytes, like what you would see when you open a file in a hex editor.
A character is different from a plain byte. ASCII encoding uses exactly 1 byte per character, but that is not true for many other encodings. For example, UTF-8 encoding may use from 1 to 4 bytes to encode a single character. A stream of characters is designed to abstract away the underlying encoding and produce chars in one known form (in Java, char and String use UTF-16).
As a rule of thumb (see the sketch after this list):
When you are dealing with text, you must use stream of character to decode the byte into character with the appropriate encoding.
When you are dealing with binary data or a mix of binary and text, you must use a stream of bytes, since it doesn't make sense otherwise. If a sequence of bytes represents a String in a certain encoding, then you can always pick those bytes out and use the String(byte[] bytes, Charset charset) constructor to get back the String.
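Here is a small sketch of that rule of thumb (the file names are placeholders):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Text: wrap the byte stream in a Reader that knows the encoding.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
    String line = reader.readLine(); // chars, already decoded from bytes
} catch (IOException e) {
    e.printStackTrace();
}

// Binary or mixed content: read raw bytes and decode only the textual parts yourself.
try (InputStream in = new FileInputStream("data.bin")) {
    byte[] header = new byte[4];
    int read = in.read(header);
    String magic = new String(header, 0, Math.max(read, 0), StandardCharsets.US_ASCII);
} catch (IOException e) {
    e.printStackTrace();
}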
They are different. char is a 2-byte datatype in Java; byte is a 1-byte datatype.
Edit: char is also an unsigned type, while byte is not.
Generally it is better to talk about streams in terms of their sizes, rather than what they carry. A stream of bytes is more intuitive than a stream of chars, because a stream of chars makes us double-check that a char really is a single byte and not a Unicode char, or anything fancy.
A char is a representation, which can be represented by a byte, but a byte is always going to be a byte. The whole world will burn when bytes stop being 8 bits.
I made the following "simulation":
byte[] b = new byte[256];
for (int i = 0; i < 256; i++) {
    b[i] = (byte) (i - 128);
}

byte[] transformed = new String(b, "cp1251").getBytes("cp1251");
for (int i = 0; i < b.length; i++) {
    if (b[i] != transformed[i]) {
        System.out.println("Wrong : " + i);
    }
}
For cp1251 this outputs only one wrong byte - at position 25.
For KOI8-R - all fine.
For cp1252 - 4 or 5 differences.
What is the reason for this and how can this be overcome?
I know it is wrong to represent byte arrays as strings in whatever encoding, but it is a requirement of the protocol of a payment provider, so I don't have a choice.
Update: representing it in ISO-8859-1 works, and I'll use it for the byte[] part, and cp1251 for the textual part, so the question remains only out of curiosity.
Some of the "bytes" are not supported in the target set - they are replaced with the ? character. When you convert back, ? is normally converted to the byte value 63 - which isn't what it was before.
What is the reason for this
The reason is that character encodings are not necessarily bijective and there is no good reason to expect them to be. Not all bytes or byte sequences are legal in all encodings, and usually illegal sequences are decoded to some sort of placeholder character like '?' or U+FFFD, which of course does not produce the same bytes when re-encoded.
Additionally, some encodings may map some legal different byte sequences to the same string.
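For example, one of the five byte values that windows-1252 leaves undefined shows exactly this (a sketch; the exact replacement behavior is implementation-dependent, but it matches the 4-5 differences reported in the question):
import java.nio.charset.Charset;

byte[] original = { (byte) 0x81 };           // 0x81 is undefined in windows-1252
Charset cp1252 = Charset.forName("windows-1252");
String decoded = new String(original, cp1252);
System.out.println((int) decoded.charAt(0)); // 65533, i.e. U+FFFD, the replacement character
byte[] reencoded = decoded.getBytes(cp1252);
System.out.println(reencoded[0]);            // 63, i.e. '?', not the original 0x81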
It appears that both cp1251 and cp1252 have byte values that do not correspond to defined characters; i.e. they are "unmappable".
The javadoc for String(byte[], String) says this:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
Other constructors say this:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.
If you see this kind of thing happening in practice it indicates that either you are using the wrong character set, or you've been given some bad data. Either way, it is probably not a good idea to carry on as if there was no problem.
I've been trying to figure out if there is a way to get a CharsetDecoder to "preserve" unmappable characters, and I don't think it is possible unless you are willing to implement a custom decoder/encoder pair. But I've also concluded that it does not make sense to even try. It is (theoretically) wrong to map those unmappable characters to real Unicode code points. And if you do, how is your application going to handle them?
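If you would rather detect the problem than silently get '?' or U+FFFD, a CharsetDecoder can be told to report errors instead of replacing them (a sketch):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

CharsetDecoder decoder = Charset.forName("windows-1252").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    String s = decoder.decode(ByteBuffer.wrap(new byte[] { (byte) 0x81 })).toString();
} catch (CharacterCodingException e) {
    // the input contained a byte that the charset cannot decode
}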
Actually there is one difference: the byte at position 25 (value -103, i.e. 0x98) is converted to a char of value 0xFFFD; that's the "Unicode replacement character", used for untranslatable bytes. When converted back, you get a question mark (value 63).
In CP1251, the code 0x98 is not assigned to any character and cannot be part of a proper string, which is why Java deems it "untranslatable".
Historical reason: in the ancient character encodings (EBCDIC, ASCII) the first 32 codes have special 'control' meaning and they may not map to readable characters. Examples: backspace, bell, carriage return. Newer character encoding standards usually inherit this and they don't define Unicode characters for every one of the first 32 positions. Java characters are Unicode.