Q: When casting an int to a char in Java, it seems that the default result is the ASCII character corresponding to that int value. My question is, is there some way to specify a different character set to be used when casting?
(Background info: I'm working on a project in which I read in a string of binary characters, convert it into chunks, and convert the chunks into their values in decimal, ints, which I then cast as chars. I then need to be able to "expand" the resulting compressed characters back to binary by reversing the process.
I have been able to do this, but currently I have only been able to compress up to 6 "bits" into a single character, because when I allow for larger amounts, some values in the range do not seem to be handled well by ASCII: they become boxes or question marks, and when they are cast back to an int, their original value has not been preserved. If I could use another character set, I imagine I could avoid this problem and compress the binary 8 bits at a time, which is my goal.)
I hope this was clear, and thanks in advance!
Your problem has nothing to do with ASCII or character sets.
In Java, a char is just an unsigned 16-bit integer. When you cast an int (a 32-bit integer) to a char, the only thing you are doing is keeping the 16 least significant bits of the int and discarding the upper 16 bits. This is called a narrowing primitive conversion.
References:
http://java.sun.com/docs/books/jls/second_edition/html/conversions.doc.html#20232
http://java.sun.com/docs/books/jls/second_edition/html/conversions.doc.html#25363
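A minimal sketch of what the narrowing conversion does (the values here are purely illustrative):

public class NarrowingDemo {
    public static void main(String[] args) {
        int i = 0x12345678;          // a 32-bit int
        char c = (char) i;           // narrowing: keeps only the low 16 bits, 0x5678
        int back = c;                // widening back to int; the upper 16 bits are gone
        System.out.printf("0x%08X -> 0x%04X -> 0x%08X%n", i, (int) c, back);
    }
}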
The conversion between characters and integers uses the Unicode values, of which ASCII is a subset. If you are handling binary data you should avoid characters and strings and instead use an integer array - note that Java doesn't have unsigned 8-bit integers.
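A short illustration of both points (a sketch, not part of the original answer): the char-to-int conversion yields the Unicode value, and a signed byte above 127 needs masking to recover its unsigned value.

public class UnicodeCastDemo {
    public static void main(String[] args) {
        char e = 'é';
        int code = e;                        // widening: the Unicode value 0x00E9 (233)
        System.out.println(code);            // 233

        byte b = (byte) 200;                 // Java bytes are signed: this stores -56
        int unsigned = b & 0xFF;             // mask to recover the unsigned value 200
        System.out.println(b + " " + unsigned);
    }
}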
What you are searching for is not a cast, it's a conversion.
There is a String constructor that takes a byte array and a charset encoding. That should help you.
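For example (a sketch; ISO-8859-1 is just one possible charset, chosen here because it maps every byte value 0-255 to a single character and so round-trips losslessly):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetConversionDemo {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC9, (byte) 0xE9 };                  // arbitrary byte values
        String text = new String(raw, StandardCharsets.ISO_8859_1); // bytes -> chars via a charset
        byte[] roundTrip = text.getBytes(StandardCharsets.ISO_8859_1); // chars -> the same bytes again
        System.out.println(text + " " + Arrays.equals(raw, roundTrip));
    }
}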
I'm working on a project in which I read in a string of binary characters, convert it into chunks, and convert the chunks into their values in decimal, ints, which I then cast as chars. I then need to be able to "expand" the resulting compressed characters back to binary by reversing the process.
You don't mention why you are doing that, and (to be honest) it's a little hard to follow what you're trying to describe (for one thing, I don't see why the resulting characters would be "compressed" in any way).
If you just want to represent binary data as text, there are plenty of standard ways of accomplishing that. But it sounds like you may be after something else?
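If that is the goal, Base64 (available in the JDK since Java 8) is one such standard way; a minimal sketch:

import java.util.Arrays;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, 0x00, 0x7F, (byte) 0x80 };  // arbitrary binary data
        String text = Base64.getEncoder().encodeToString(data);  // binary -> printable text
        byte[] back = Base64.getDecoder().decode(text);          // text -> the original binary
        System.out.println(text + " " + Arrays.equals(data, back));
    }
}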
Related
I can't believe I'm having a hard time with this, but so far haven't found the answer: Let's say I have a Java char (or a 1-character String) and I want to convert it into a byte of ASCII. How do I do this?
I know I could look up the decimal value of the ASCII character and create a byte from that, but it seems like there should be a simple conversion. I found what seems to be the answer for byte arrays:
byte[] asciiArray = "SomeString".getBytes( StandardCharsets.US_ASCII );
but not for just a single byte. Something like:
byte asciiA = <some conversion function>('A');
If you are sure that the character is in the range U+0000 to U+007F, it will be a single UTF-16 code unit (char) that is also in the ASCII character set, and that UTF-16 code unit will have the same value as the ASCII code unit.
You might want to add a guard because (byte)'½' wouldn't give you anything useful.
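Putting both points together, a small sketch with the suggested guard (the helper name is made up for illustration):

public class AsciiByteDemo {
    // Converts a char to its ASCII byte, rejecting anything outside U+0000..U+007F.
    static byte toAsciiByte(char c) {
        if (c > 0x7F) {
            throw new IllegalArgumentException("Not an ASCII character: " + c);
        }
        return (byte) c;   // same numeric value as the ASCII code
    }

    public static void main(String[] args) {
        byte asciiA = toAsciiByte('A');
        System.out.println(asciiA);   // 65
    }
}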
Recently I found some negative bytes hidden in a Java string in my code, which were causing a .equals String comparison to fail.
What is the significance of negative byte values in Strings? Can they mean anything? Is there any situation in which a negative byte value in a String could be interpreted as anything? I'm a noob at this encoding business, so if it requires an explanation of different encoding schemes, please feel free.
A Java string contains characters, but you can interpret them in different ways. If each character is treated as a byte, its value can range from 0 to 255, inclusive. That's 8 bits.
Now, the leftmost bit can be interpreted as a sign bit or as part of the magnitude of the character. If that bit is interpreted as a sign bit then you will have data items ranging from -128 to +127, inclusive.
You didn't post the code you used to print the characters, but if you used logic that interprets the characters as signed data items then you will get negative numbers in your output.
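A short sketch of how the same bit pattern shows up as a negative number when treated as a signed byte (the example character is just illustrative):

import java.nio.charset.StandardCharsets;

public class SignedByteDemo {
    public static void main(String[] args) {
        String s = "é";                                   // a non-ASCII character
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // encodes to two bytes, 0xC3 0xA9
        for (byte b : utf8) {
            // Printed as a signed byte the values are negative; masking with 0xFF
            // shows the unsigned value of the same bit pattern.
            System.out.println(b + " -> " + (b & 0xFF));  // -61 -> 195, -87 -> 169
        }
    }
}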
We read and write binary files using the Java primitive 'byte', e.g. fileInputStream.read(byte[]) etc. In other examples we see byte[] bytes = someString.getBytes(). A byte is just an 8-bit value. Why do we use byte[] to read binaries? What does a byte value contain after reading from a file or string?
We read and write binary files using the Java primitive 'byte', e.g. fileInputStream.read(byte[]) etc.
Because the operating system models files as sequences of bytes (or more precisely, as octets). The byte type is the most natural representation of an octet in Java.
Why do we use byte[] to read binaries?
Same answer as before. Though, in reality, you can also read binary files in other ways as well; e.g. using DataInputStream.
What does a byte value contain after reading from a file or string?
In the first case, the byte that was in the file.
In the second case, you don't "read" bytes from a String. Rather, when you call String.getBytes() you get the bytes that make up the String's characters when they are encoded in a particular character set. If you use the no-args getBytes() method you will get the JVM's default character set / encoding. You can also supply an argument to choose a different encoding.
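A small illustration (a sketch, not from the original answer) of how the chosen encoding changes the bytes you get back:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "é";
        // The same String yields different byte sequences under different encodings.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -87]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-23]
    }
}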
Java makes a clear distinction between byte (8-bit) quantities and characters. Conceptually, Java characters are Unicode code points, and strings and similar representations of text are sequences of characters ... not sequences of bytes.
(Unfortunately, there is a "wrinkle" in the implementation. When Java was designed, the Unicode character space fitted into 16 bits; i.e. there were <= 65536 recognized code points. Java was designed to match this ... and the char type was defined as a 16-bit unsigned integral type. But then Unicode was expanded to > 65536 code points, and Java was left with the awkward problem that some Unicode code points could not be represented using a single char value. Instead, they are represented by a pair of char values ... a so-called surrogate pair ... and Java strings are effectively represented in UTF-16. For most common characters / character sets, this doesn't matter. But if you need to deal with unusual characters / character sets, the correct way to deal with Strings is to use the "codepoint" methods.)
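A small sketch of the difference between the char-based and codepoint-based methods for a supplementary character:

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "𝄞";                                      // MUSICAL SYMBOL G CLEF, U+1D11E
        System.out.println(s.length());                      // 2 -- two UTF-16 code units (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1 -- one Unicode code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d11e
    }
}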
A String is built upon bytes, and bytes are built upon bits; the bits are what is "physically" stored on the drive.
So instead of reading data from the drive bit by bit, it is read in larger portions, namely bytes.
So the byte[] contains raw data. The raw data is identical to what is stored on the drive.
You always read raw data first; then you can apply a decoder that turns those bytes into characters, and eventually into letters displayed on the screen if it is a text file. If you deal with an image, the bytes you read store colour information instead of characters.
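As a sketch of that two-step view (raw bytes first, then a decoder), assuming a hypothetical file name and UTF-8 text:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawBytesDemo {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Path.of("example.txt")); // raw data, exactly as stored
        String text = new String(raw, StandardCharsets.UTF_8);   // interpret the bytes as text
        System.out.println(text);
    }
}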
Because the smallest storage unit is the byte.
I have a string in Radix64 characters:
HR5nYD8xGrw
and I need to be able to perform bitwise operations on the bits in this string, but preserve the Radix64 encoding. For example, if I do a left shift, I want it to drop the overflow bit and stay inside the Radix64 character set, not turn into some random ASCII character. Aside from manually converting them to binary and writing my own versions of all of the operators I would need, is there a way to do this?
You just convert them to plain numbers, apply the shift to them and convert back to "base64".
It's no different from applying bit operators to numbers written in base 10: you don't use the string, you use the number corresponding to the string, and then print it back as a string.
9 << 1 == 18
but "9" and "18" are not really related as strings...
When writing, from scratch, a custom string class that stores UTF-8 internally (to save memory) rather than UTF-16, is it feasible to cache, to some extent, the relationship between byte offset and character offset to improve performance when applications use the class for random access?
Does Perl do this kind of caching of character offset to byte offset relationship? How do Python strings work internally?
What about Objective-C and Java? Do they use UTF-8 internally?
EDIT
Found this reference to Perl 5 using UTF-8 internally:
"$flag = utf8::is_utf8(STRING)
(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally. Functionally the same as Encode::is_utf8()."
On page
http://perldoc.perl.org/utf8.html
EDIT
In the applications I have in mind, the strings have 1-2K XML stanzas in an XMPP stream. About 1% of the messages will, I expect, have up to 50% (by character count) Unicode values > 127 (this is XML). In servers, the messages are rule-checked and routed conditionally on a small (character-volume-wise) subset of fields. The servers are Wintel boxes operating in a farm. In clients, the data comes from and is fed into UI toolkits.
EDIT
But the app will inevitably evolve and want to do some random access too. Can the performance hit when this happens be minimised? I was also interested in whether a more general class design exists that e.g. manages B-trees of character offset <-> byte offset relationships for big UTF-8 strings (or some other algorithm found to be efficient in the general case).
Perl distinguishes between Unicode and non-Unicode strings. Unicode strings are implemented using UTF-8 internally. Non-Unicode does not necessarily mean 7-bit ASCII, though, it could be any character that can be represented in the current locale as a single byte.
I think the answer is: in general, it's not really worth trying to do this. In your specific case, maybe.
If most of your characters are plain ASCII, and you rarely have UTF sequences, then it might be worth building some kind of sparse data structure with the offsets.
In the general case, every single character might be non-ASCII and you might have many many offsets to store. Really, the most general case would be to make a string of bytes that is exactly as long as your string of Unicode characters, and have each byte value be the offset of the next character. But this means one whole byte per character, and thus a net savings of only one byte per Unicode character; probably not worth the effort. And that implies that indexing into your string is now an O(n) operation, as you run through these offsets and sum them to find the actual index.
If you do want to try the sparse data structure, I suggest an array of pairs of values, the first value being the index within the Unicode string of a character, and the second one being the index within the byte sequence where this character actually appears. Then after each UTF-8 multi-byte sequence, you would add the two values to find the next character in the string. Finally, when given an index to a Unicode character, your code could do a binary search of this array, to find the highest index within the sparse array that is lower than the requested index, and then use that to find the actual byte that represents the start of the desired character.
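A rough sketch of that kind of sparse (character index, byte index) table with the binary search described above; the class and method names are made up for illustration, a checkpoint is recorded for every multi-byte sequence, and "character" here means a Unicode code point rather than a UTF-16 char:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8Index {
    private final byte[] utf8;
    private final List<int[]> checkpoints = new ArrayList<>(); // {charIndex, byteIndex} pairs

    public Utf8Index(String s) {
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        int charIndex = 0;
        for (int byteIndex = 0; byteIndex < utf8.length; ) {
            int len = sequenceLength(utf8[byteIndex]);
            if (len > 1) {
                checkpoints.add(new int[] { charIndex, byteIndex }); // record each multi-byte character
            }
            byteIndex += len;
            charIndex++;
        }
    }

    // Byte offset of the code point at the given character index.
    public int byteOffsetOf(int charIndex) {
        // Binary search for the last checkpoint at or before charIndex.
        int lo = 0, hi = checkpoints.size() - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (checkpoints.get(mid)[0] <= charIndex) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        int chars = (best < 0) ? 0 : checkpoints.get(best)[0];
        int bytes = (best < 0) ? 0 : checkpoints.get(best)[1];
        while (chars < charIndex) {            // walk forward over the remaining (mostly ASCII) characters
            bytes += sequenceLength(utf8[bytes]);
            chars++;
        }
        return bytes;
    }

    // Length in bytes of a UTF-8 sequence, given its lead byte.
    private static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;        // ASCII
        if (b < 0xE0) return 2;        // 2-byte sequence
        if (b < 0xF0) return 3;        // 3-byte sequence
        return 4;                      // 4-byte sequence
    }
}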
If you need to save memory, you might want to consider using a data compression library. Slurp in the Unicode strings as full Unicode, then compress them; to index into a string, first uncompress it. This will really save memory, and it will be quick and easy to get the code working correctly; but it may add too much CPU overhead to be reasonable.
Java's strings are UTF-16 internally:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
java.lang.String