Recently I found some negative bytes hidden in a Java string in my code, which were causing a .equals String comparison to fail.
What is the significance of negative byte values in Strings? Can they mean anything? Is there any situation in which a negative byte value in a String could be interpreted as something meaningful? I'm a noob at this encoding business, so if it requires an explanation of different encoding schemes, please feel free.
A Java string contains characters, BUT you can interpret them in different ways. If you treat each character as a single byte, it can range from 0 to 255, inclusive. That's 8 bits.
Now, the leftmost bit can be interpreted as a sign bit or as part of the magnitude of the character. If that bit is interpreted as a sign bit then you will have data items ranging from -128 to +127, inclusive.
You didn't post the code you used to print the characters, but if you used logic that interpreted the characters as signed data items, then you will get negative numbers in your output.
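As a rough sketch of the difference (this is only an illustration, not the asker's code; it assumes the string contains a character whose ISO-8859-1 byte value is above 127):

    import java.nio.charset.StandardCharsets;

    public class SignedBytes {
        public static void main(String[] args) {
            String s = "é"; // 0xE9 (233) in ISO-8859-1

            for (byte b : s.getBytes(StandardCharsets.ISO_8859_1)) {
                // Java's byte type is signed, so values 128-255 print as negatives
                // unless you mask them back to an unsigned int with & 0xFF.
                System.out.println("signed: " + b + ", unsigned: " + (b & 0xFF));
            }
            // prints: signed: -23, unsigned: 233
        }
    }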
I have a problem with encoding and decoding specific byte values. I'm implementing an application where I need to take String data, do some bit manipulation on it, and return another String.
I'm currently getting byte[] values via String.getBytes(), doing the manipulation, and then building the result with the String(byte[] data) constructor. The issue is that when some of the bytes have specific values, e.g. -120, -127, etc., the decoding in the constructor turns them into the ? character, i.e. byte value 63. As far as I know, these are values that can't be printed on Windows, given that -120 in Java is 10001000, which is the \b character according to the ASCII table.
Is there any charset that I could use to properly encode and decode every byte value (from -128 to 127)?
EDIT: I should also say that the ISO-8859-1 charset works pretty well, but it does not encode language-specific characters such as ąęćśńźżół.
You seem to have some confusion regarding encodings, not specific to Java, so I'll try to help clear some of that up.
There do not exist any charsets or encodings that use code points from -128 to -1. If you treat the byte as an unsigned integer, then you get the range 0-255, which is valid for all the cp-* and ISO-8859-* charsets.
ASCII characters are in the range 0-127 and so appear valid whether you treat the int as signed or unsigned.
UTF-8 encodes characters either as a single byte in the range 0-127 or as multi-byte sequences whose bytes are all in the range 128-255.
You mention some Polish characters, so instead of ISO-8859-1 you should encode as ISO-8859-2 or (preferably) UTF-8.
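A minimal sketch of such a round trip (assuming the ISO-8859-2 charset is available on your JRE, which it normally is):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class RoundTrip {
        public static void main(String[] args) {
            String text = "ąęćśńźżół";

            // UTF-8 covers every character; ISO-8859-2 covers Polish specifically.
            for (Charset cs : new Charset[]{StandardCharsets.UTF_8, Charset.forName("ISO-8859-2")}) {
                byte[] encoded = text.getBytes(cs);
                String decoded = new String(encoded, cs);
                System.out.println(cs + ": " + decoded.equals(text)); // true for both
            }
        }
    }

As long as the same charset is used for both encoding and decoding, none of these characters is replaced by ?.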
I am a beginner, self-learning Java programming.
So, I want to know about difference between String.length() and String.getBytes().length in Java.
Which is more suitable for checking the length of a string?
String.length()
String.length() is the number of 16-bit UTF-16 code units needed to represent the string. That is, it is the number of char values used to represent the string, and thus also equal to toCharArray().length. For most characters used in Western languages this is typically the same as the number of Unicode characters (code points) in the string, but the number of code points will be less than the number of code units if any UTF-16 surrogate pairs are used. Such pairs are needed only to encode characters outside the BMP and are rarely used in most writing (emoji are a common exception).
String.getBytes().length
String.getBytes().length, on the other hand, is the number of bytes needed to represent your string in the platform's default encoding. For example, if the default encoding were UTF-16 (rare), it would be roughly 2x the value returned by String.length(), since each 16-bit code unit takes 2 bytes to represent (plus possibly a byte order mark). More commonly, your platform encoding will be a multi-byte encoding like UTF-8.
This means the relationship between those two lengths is more complex. For ASCII strings, the two calls will almost always produce the same result (outside of unusual default encodings that don't encode the ASCII subset in 1 byte). For non-ASCII strings, String.getBytes().length is likely to be longer, as it counts the bytes needed to represent the string, while length() counts 2-byte code units.
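A small sketch of the difference (using an explicit UTF-8 charset so the byte counts don't depend on the platform default):

    import java.nio.charset.StandardCharsets;

    public class Lengths {
        public static void main(String[] args) {
            String ascii = "hello";
            String accented = "héllo";       // 'é' needs 2 bytes in UTF-8
            String emoji = "hi\uD83D\uDE00"; // U+1F600 is a surrogate pair: 2 chars, 4 UTF-8 bytes

            System.out.println(ascii.length());                                   // 5
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 5

            System.out.println(accented.length());                                // 5
            System.out.println(accented.getBytes(StandardCharsets.UTF_8).length); // 6

            System.out.println(emoji.length());                                   // 4
            System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);    // 6
        }
    }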
Which is more suitable?
Usually you'll use String.length() in concert with other string methods that take offsets into the string. E.g., to get the last character, you'd use str.charAt(str.length()-1). You'd only use the getBytes().length if for some reason you were dealing with the array-of-bytes encoding returned by getBytes.
The length() method returns the length of the string in characters.
Characters may take more than a single byte. The expression String.getBytes().length returns the length of the string in bytes, using the platform's default character set.
The String.length() method returns the number of characters in the string, while String.getBytes().length returns the number of bytes used to store those characters.
In Java, chars are UTF-16 code units, so it takes 2 bytes to store one char.
Check this SO answer out.
I hope that it will help :)
In short, String.length() returns the number of characters in the string, while String.getBytes().length returns the number of bytes needed to represent those characters in the specified encoding.
In many cases, String.length() will have the same value as String.getBytes().length. But with an encoding like UTF-8, if the string contains characters with code points above 127, String.length() will not be the same as String.getBytes().length.
The sketch below shows how the characters in a string are converted to bytes when calling String.getBytes(), which should give you a sense of the difference between String.length() and String.getBytes().length.
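A minimal sketch (assuming UTF-8 is passed explicitly to getBytes()):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class GetBytesDemo {
        public static void main(String[] args) {
            String s = "aé€";

            // 'a' (U+0061) -> 1 byte, 'é' (U+00E9) -> 2 bytes, '€' (U+20AC) -> 3 bytes in UTF-8
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

            System.out.println(s.length());            // 3 characters
            System.out.println(utf8.length);           // 6 bytes
            System.out.println(Arrays.toString(utf8)); // [97, -61, -87, -30, -126, -84]
        }
    }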
I know that ASCII codes are between 0-127 in decimal and 0000 0000 to 0111 1111 in binary, and that values between 128-255 are extended ASCII.
I also know that an int accepts 9 digits (I was wrong about this; the int range is actually -2,147,483,648 to 2,147,483,647), so if we cast every number between 0 and the maximum int value to a char, there will be many, many symbols; for example:
(char)999,999,999 gives 짿 which is a Korean symbol (I don't know what it even means; Google Translate can't find any meaning!).
The same thing happens with values between the minimum int value and 0.
It doesn't make sense that those symbols were input one by one.
I don't understand: how could they assign each of those big numbers its own character?
I don't understand how they assign those big numbers to have their own symbol?
The assignments are made by the Unicode consortium. See http://unicode.org for details.
In your particular case, however, you are doing something completely nonsensical. You have the integer 999999999, which in hex is 0x3B9AC9FF. You then cast that to char, which discards the top two bytes and gives you 0xC9FF. If you then look that up at Unicode.org (http://www.unicode.org/cgi-bin/Code2Chart.pl), you discover that yes, it is a Korean character.
Unicode code points can in fact be quite large; there are over a million of them. But you can't get to them just by casting. To get to Unicode code points that are outside of the "normal" range using UTF-16 (as C# does), you need to use two characters. See http://en.wikipedia.org/wiki/UTF-16, the section on surrogate pairs.
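A rough Java illustration of that (Java, like C#, uses UTF-16 for its char type), showing how a code point outside the BMP becomes a surrogate pair:

    public class SurrogateDemo {
        public static void main(String[] args) {
            int codePoint = 0x1F600; // an emoji, well outside the 16-bit range of a single char

            // Character.toChars returns the UTF-16 encoding: a high/low surrogate pair here.
            char[] pair = Character.toChars(codePoint);
            System.out.printf("high: U+%04X, low: U+%04X%n", (int) pair[0], (int) pair[1]);

            // A String built from the pair reports 2 chars but only 1 code point.
            String s = new String(pair);
            System.out.println(s.length());                       // 2
            System.out.println(s.codePointCount(0, s.length()));  // 1
        }
    }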
To address some of the other concerns in your question:
I know that ACCII codes are between (0-127) in decimal and (0000 0000 to 0000 1111) in binary.
That's ASCII, not ACCII, and 127 in binary is 01111111, not 00001111
Also we know that int accepts 9 digits, so if we cast every number between
The range of an int is larger than that.
don't know what it mean even Google translate can't find any meaning
Korean is not like Chinese, where each glyph represents a word. Those are letters. They don't have a meaning unless they happen to accidentally form a word. You'd have about as much luck googling randomly chosen English letters and trying to find their meaning; maybe sometimes you'd choose CAT at random, but most of the time you'd choose XRO or some such thing that is not a word.
Read this if you want to understand how the Korean alphabet works: http://en.wikipedia.org/wiki/Hangul
I have a string in Radix64 characters:
HR5nYD8xGrw
and I need to be able to perform bitwise operations on the bits in this string, but preserve the Radix64 encoding. For example, if I do a left shift, have it drop the overflow bit, and stay inside the character set of Radix64, not turn into some random ASCII character. Aside from manually converting them to binary and writing my own versions of all of the operators I would need, is there a way to do this?
You just convert them to plain numbers, apply the shift to them and convert back to "base64".
It's no different from applying bit operators to numbers written in base 10: you don't use the string, you use the number corresponding to the string, and then print it back to a string.
9 << 1 == 18
but "9" and "18" are not really related as strings...
Q: When casting an int to a char in Java, it seems that the default result is the ASCII character corresponding to that int value. My question is, is there some way to specify a different character set to be used when casting?
(Background info: I'm working on a project in which I read in a string of binary characters, convert it into chunks, and convert the chunks into their values in decimal, ints, which I then cast as chars. I then need to be able to "expand" the resulting compressed characters back to binary by reversing the process.
I have been able to do this, but currently I have only been able to compress up to 6 "bits" into a single character, because when I allow for larger amounts, there are some values in the range which do not seem to be handled well by ASCII; they become boxes or question marks and when they are cast back into an int, their original value has not been preserved. If I could use another character set, I imagine I could avoid this problem and compress the binary by 8 bits at a time, which is my goal.)
I hope this was clear, and thanks in advance!
Your problem has nothing to do with ASCII or character sets.
In Java, a char is just a 16-bit integer. When casting ints (which are 32-bit integers) to chars, the only thing you are doing is keeping the 16 least significant bits of the int, and discarding the upper 16 bits. This is called a narrowing conversion.
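A small illustration of that narrowing conversion:

    public class NarrowingDemo {
        public static void main(String[] args) {
            int i = 0x1F041;   // a 32-bit value with more than 16 significant bits

            char c = (char) i; // keeps only the low 16 bits: 0xF041
            int back = c;      // widening back does not restore the lost bits

            System.out.printf("%X -> %X -> %X%n", i, (int) c, back); // 1F041 -> F041 -> F041
        }
    }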
References:
http://java.sun.com/docs/books/jls/second_edition/html/conversions.doc.html#20232
http://java.sun.com/docs/books/jls/second_edition/html/conversions.doc.html#25363
The conversion between characters and integers uses the Unicode values, of which ASCII is a subset. If you are handling binary data you should avoid characters and strings and instead use an integer array - note that Java doesn't have unsigned 8-bit integers.
What you are looking for is not a cast, it's a conversion.
There is a String constructor that takes a byte array and a charset. This should help you.
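For example, a minimal sketch using ISO-8859-1 (which maps every unsigned byte value 0-255 to a character and back losslessly):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ByteRoundTrip {
        public static void main(String[] args) {
            byte[] data = { -120, -127, 0, 65, 127 };

            // ISO-8859-1 maps each byte 0x00-0xFF to the code point with the same value,
            // so no byte ever turns into '?' on the way back.
            String s = new String(data, StandardCharsets.ISO_8859_1);
            byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);

            System.out.println(Arrays.equals(data, back)); // true
        }
    }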
I'm working on a project in which I read in a string of binary characters, convert it into chunks, and convert the chunks into their values in decimal, ints, which I then cast as chars. I then need to be able to "expand" the resulting compressed characters back to binary by reversing the process.
You don't mention why you are doing that, and (to be honest) it's a little hard to follow what you're trying to describe (for one thing, I don't see why the resulting characters would be "compressed" in any way).
If you just want to represent binary data as text, there are plenty of standard ways of accomplishing that. But it sounds like you may be after something else?
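If representing binary data as text really is the goal, Base64 is one of those standard ways; a minimal sketch:

    import java.util.Arrays;
    import java.util.Base64;

    public class BinaryAsText {
        public static void main(String[] args) {
            byte[] data = { -120, -127, 0, 65, 127 };

            // Encode arbitrary bytes into a printable ASCII string, and back again.
            String text = Base64.getEncoder().encodeToString(data);
            byte[] decoded = Base64.getDecoder().decode(text);

            System.out.println(text);                         // iIEAQX8=
            System.out.println(Arrays.equals(data, decoded)); // true
        }
    }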