I am accustomed to coding in java, but recently I have been making some ASP webpages that use C#.
In Java, chars are by default represented by their numeric ASCII value unless you concatenate them with a string. I have been unable to reproduce this in C#.
What do I need to do to get ascii values of chars in C#?
char in .NET is a 2-byte structure representing a UTF-16 code unit. Most Unicode code points, of which ASCII is a tiny subset, fit in a single char. But some Unicode code points, including certain Kanji characters, require more than two bytes, and these are represented in a .NET string as a surrogate pair. Thus the most general way to get a Unicode code point value for a character in a string at a specified index is Char.ConvertToUtf32(string s, int index)
For instance, the following enumerates the unicode code point values in a string:
public static IEnumerable<int> Utf32CodePoints(string s, int index)
{
for (int length = s.Length; index < length; index++)
{
yield return char.ConvertToUtf32(s, index);
if (char.IsSurrogatePair(s, index))
index++;
}
}
If you explicitly want only ASCII values and want to skip non-ASCII characters, you could use the ASCII decoder with appropriate exceptions, as shown here: Encoding.ASCII Property. Alternatively, just cast each char to an int and check if its value falls between U+0000 and U+007F, which is the defined range for ASCII.
ASCII is a very small subset of the characters that can be represented in C#/Java.
Fastest way to get ASCII code (assuming you know that value fits in ASCII range):
var ascii = ((int)c) & 0x7F;
You may want to add range checks (0-0x7F) and fail if value falls outside the range. Alternatively you can use Encoding.ASCII to do conversion (will replace characters outside of the range with question marks).
Note: if by "ascii" you actually mean "numeric value"/UTF-16 code unit, then a basic cast to ushort (or int) will work:
var code = (int)c;
I need to test the processing of a string which contains valid non-ascii characters + invalid non-ascii characters + invalid ascii characters.
Can someone please give me a couple of examples of such characters. It would be great if you could let me know the range of their value in their category as I am not quite able to differentiate which non-ascii values could be valid and which ones are invalid.
Ex : String str = "Bj��rk����oacute�";
^
Is it a valid or invalid non-ascii
FYI I am a beginner in Java.
There are 128 valid basic ASCII characters, mapped to the values 0 (the NUL byte) to 127 (the DEL character). See here.
The word 'character' must be used wisely. The definition of 'character' is a special one. For example, the è, is that one character? Or is it two characters (e and `)? It depends.
Secondly, a sequence of characters is completely independent from its encoding. For simplicity, I assume that each byte is interpreted as one character.
To determine whether a byte can be parsed as an ASCII character, you can simply do this:
byte[] bytes = "Bj��rk����oacute�".getBytes();
for (byte b : bytes) {
// What's happening here? A byte that is in the range from 0 to 127 is
// valid, and other values are invalid. A byte in Java is signed, that
// means that valid ranges are from -128 to 127.
if (b >= 0) {
System.out.println("Valid ASCII");
}
else {
System.out.println("Invalid ASCII");
}
}
Some background
When Java was designed, a very important decision was that text in Java would be Unicode: a numbering system for all the characters in the world. Hence char is two bytes (a UTF-16 code unit; UTF-16 is one of the Unicode "universal character set transformation formats"). And byte is a distinct type for binary data.
Unicode numbers all symbols as so-called code points; ♫, for example, is U+266B. Those numbers go up to U+10FFFF, which needs three bytes. Hence code points in Java are represented as int.
ASCII is the 7-bit subset of Unicode, code points 0 to 127.
UTF-8 is a multibyte Unicode format in which ASCII is a valid subset, and higher symbols are encoded as sequences of two to four bytes.
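To make the byte counts concrete, here is a small sketch that measures how many bytes a few code points take in UTF-8. The class name Utf8Lengths and the helper utf8Length are made up for this illustration; only the standard library calls are real.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Lengths {

    // Number of bytes that the UTF-8 encoding of one code point occupies.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'A' (ASCII), 'é' (Latin-1 range), '♫' (BMP), and a cat emoji (supplementary).
        int[] samples = {0x41, 0xE9, 0x266B, 0x1F408};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s) in UTF-8%n", cp, utf8Length(cp));
        }
    }
}
```

ASCII code points stay at one byte each, which is why plain ASCII text is also valid UTF-8.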
Validity
You were asked to identify "invalid" characters = wrongly produced code points.
You could also identify code parts that produce invalid characters. (Easier.)
In the above, � is a placeholder character (like ?) that substitutes for a code point that cannot be represented in the current character set. If the code produced a ? as placeholder, one cannot tell whether substitution took place. For some West European languages the encoding is Windows-1252 (Cp1252, MS Windows Latin-1). You can check whether a code point from a String can be converted to that Charset.
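One way such a convertibility check might look is sketched below, using CharsetEncoder.canEncode. The class EncodableCheck and the method firstUnencodable are names invented for this example, and the main method assumes the runtime ships a charset registered as "windows-1252" (Charset.forName throws if it does not).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, for illustration only.
class EncodableCheck {

    // Returns the char index of the first code point the charset cannot
    // encode, or -1 if every code point in the text can be encoded.
    static int firstUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
            if (!encoder.canEncode(text.subSequence(i, i + len))) {
                return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumption: this charset name is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(firstUnencodable("Bj\u00F6rk", cp1252));
        System.out.println(firstUnencodable("Bj\uFFFDrk", cp1252));
    }
}
```

A result of -1 only says the text is representable in that charset; it does not prove the text is what the author intended.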
That leaves false positives: wrong characters that nevertheless exist in Cp1252. These could come from a multi-byte UTF-8 sequence interpreted as several Windows-1252 characters. So an acceptable non-ASCII char adjacent to an unacceptable non-ASCII char is suspect too. That means you need to list the special characters in your language, plus extras such as typographic quotes and, in English, letters from borrowed words like ç and ñ.
For MS-Windows Latin-1 (an altered ISO Latin-1) something like:
boolean isSuspect(char ch) {
if (ch < 32) {
return "\f\n\r\t".indexOf(ch) == -1; // Control chars other than common whitespace are suspect.
} else if (ch < 127) {
return false; // Printable ASCII is not suspect.
} else {
return suspects.get((int) ch); // Better use a positive list.
}
}
static BitSet suspects = new BitSet(256);
static {
...
}
I was practicing example interview questions and one of them was:
"implement an algorithm to determine if a string has all unique characters".
It's easy when we assume that is ASCII/ANSI.
But my question is: how should that be solved if let's say string can contain e.g. hieroglyphic symbols or whatever (code points are greater than U+FFFF... ?).
So if I understood it correctly I can easily think of solution if given string contains characters that belong to the set of characters from U+0000 to U+FFFF - they can be converted into 16-bit char, but what if I encounter a character whose code points are greater than U+FFFF... ?
Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF)
But I have no idea how to solve this puzzle in that case, how do I handle those surrogate pairs ?
Thanks!
Java 8 has a CharSequence#codePoints method that produces an IntStream of the Unicode codepoints in a string. From there it just becomes a matter of writing code to test uniqueness of elements in the IntStream.
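A minimal sketch of that idea, assuming Java 8 or later: compare the number of distinct code points with the total number of code points. The class name UniqueCodePoints and the method allUnique are invented for this example.

```java
// Hypothetical helper class, for illustration only.
class UniqueCodePoints {

    // True if no Unicode code point occurs more than once in the string.
    // Surrogate pairs are combined by codePoints(), so two different
    // supplementary characters that share a high surrogate still count
    // as distinct.
    static boolean allUnique(String s) {
        long distinct = s.codePoints().distinct().count();
        return distinct == s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(allUnique("abc"));
        System.out.println(allUnique("abca"));
        // U+1F600 and U+1F601 share the high surrogate \uD83D.
        System.out.println(allUnique("\uD83D\uDE00\uD83D\uDE01"));
    }
}
```

This checks code points, not user-perceived characters, so a letter followed by a combining accent is still treated as two separate elements.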
If you're still on Java 7 or below, there are codepoint-based methods that can be used to solve this as well, but they are more complex to use. You'd have to loop over the chars of the string and take surrogate pairs into account when advancing the index. Something like (thoroughly untested):
for (int i = 0; i < str.length(); ) {
    // codePointAt combines a valid surrogate pair into one code point;
    // an unpaired surrogate `char` is returned as-is.
    int codepoint = str.codePointAt(i);
    // Advance one char for a BMP code point, two for a supplementary one.
    i += Character.charCount(codepoint);
    // do stuff with codepoint...
}
When I do Collections.sort(List), it sorts based on String's compareTo() logic, where it compares both strings char by char.
List<String> file1 = new ArrayList<String>();
file1.add("1,7,zz");
file1.add("11,2,xx");
file1.add("331,5,yy");
Collections.sort(file1);
My understanding is char means it specifies the unicode value, I want to know the unicode values of char like ,(comma) etc. How can I do it? Any url contains the numeric value of these?
My understanding is char means it specifies the unicode value, I want to know the unicode values of char like ,(comma) etc
Well there's an implicit conversion from char to int, which you can easily print out:
int value = ',';
System.out.println(value); // Prints 44
This is the UTF-16 code unit for the char. (As fge notes, a char in Java is a UTF-16 code unit, not a Unicode character. There are Unicode code points greater than 65535, which are represented as two UTF-16 code units.)
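To tie this back to the sorting question, the sketch below prints the numeric values involved and sorts the sample strings. The class CommaSortDemo and the method sortedCopy are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, for illustration only.
class CommaSortDemo {

    // Returns a sorted copy, using String's natural (char by char) ordering.
    static List<String> sortedCopy(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println((int) ','); // 44
        System.out.println((int) '1'); // 49
        // "1,7,zz" and "11,2,xx" first differ at index 1: ',' versus '1'.
        System.out.println("1,7,zz".compareTo("11,2,xx") < 0);
        System.out.println(sortedCopy(Arrays.asList("331,5,yy", "11,2,xx", "1,7,zz")));
    }
}
```

Because the comma (44) has a lower value than any digit (48 to 57), "1,7,zz" sorts before "11,2,xx".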
Any url contains the numeric value of these?
Yes - for more information about Unicode, go to the Unicode web site.
Uhm no, char is not a "unicode value" (and the word to use is Unicode code point).
A char is a code unit in the UTF-16 encoding. And it so happens that in Unicode's Basic Multilingual Plane (ie, Unicode code points ranging from U+0000 to U+FFFF, for code points defined in this range), yes, there is a 1-to-1 mapping between char and Unicode.
In order to know the numeric value of a code point you can just do:
System.out.println((int) myString.charAt(0));
But this IS NOT THE CASE for code points outside the BMP. For these, one code point translates to two chars. See Character.toChars(). And more generally, all static methods in Character relating to code points. There are quite a few!
This also means that String's .length() is actually misleading, since it returns the number of chars, not the number of code points (let alone graphemes).
Demonstration with one Unicode emoticon (the first in that page):
System.out.println(new String(Character.toChars(0x1f600)).length());
prints 2. Whereas:
final String s = new String(Character.toChars(0x1f600));
System.out.println(s.codePointCount(0, s.length()));
prints 1.
How can I display a Unicode Character above U+FFFF using char in Java?
I need something like this (if it were valid):
char u = '\u+10FFFF';
You can't do it with a single char (which holds a UTF-16 code unit), but you can use a String:
// This represents U+10FFFF
String x = "\udbff\udfff";
Alternatively:
String y = new StringBuilder().appendCodePoint(0x10ffff).toString();
That is a surrogate pair (two UTF-16 code units which combine to form a single Unicode code point beyond the Basic Multilingual Plane). Of course, you need whatever's going to display your data to cope with it too...
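For the curious, the arithmetic behind a surrogate pair can be sketched as follows. The class SurrogateMath and the method toSurrogates are invented for this example; in real code Character.toChars does the same job.

```java
// Hypothetical helper class, for illustration only.
class SurrogateMath {

    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10FFFF);
        // Prints the two code units in hex: dbff dfff
        System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]);
    }
}
```

For U+10FFFF the offset is 0xFFFFF, so both 10-bit halves are 0x3FF, which gives \udbff and \udfff as in the string literal above.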
Instead of using StringBuilder, you can also use a function
directly found in the Character class. The function is
toChars() and it has the following spec:
Converts the specified character (Unicode code point) to
its UTF-16 representation stored in a char array.
So you don't need to know exactly what the surrogate pairs look
like, and you can use the code point directly. Example code
then looks as follows:
int ch = 0x10FFFF;
String s = new String(Character.toChars(ch));
Note that the datatype for the code point is int and not char.
Unicode characters can take more than two bytes, so in general they cannot be held in a char.
Source
The char data type are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
In the J2SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.
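The difference between the char and int overloads described above can be tried out with a short sketch. The class IsLetterDemo and the method isLetterAt are names invented for this illustration.

```java
// Hypothetical demo class, for illustration only.
class IsLetterDemo {

    // Tests the full code point starting at the given char index,
    // so a surrogate pair is evaluated as one character.
    static boolean isLetterAt(String s, int index) {
        return Character.isLetter(s.codePointAt(index));
    }

    public static void main(String[] args) {
        String s = "\uD840\uDC00"; // U+20000, a CJK ideograph

        // char overload: sees only the lone high surrogate.
        System.out.println(Character.isLetter(s.charAt(0)));
        // int overload via codePointAt: sees the whole code point.
        System.out.println(isLetterAt(s, 0));
        // The example from the documentation quoted above.
        System.out.println(Character.isLetter(0x2F81A));
    }
}
```

The practical upshot is to pass code points (int) rather than chars whenever the input may contain supplementary characters.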
A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?
Does this boil down to what character encoding you are using?
You can handle them all if you're careful enough.
Java's char is a UTF-16 code unit. For characters with code-point > 0xFFFF it will be encoded with 2 chars (a surrogate pair).
See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)
Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
To add to the other answers, some points to remember:
A Java char takes always 16 bits.
A Unicode character, when encoded as UTF-16, takes "almost always" (not always) 16 bits: that's because there are more than 64K unicode characters. Hence, a Java char is NOT a Unicode character (though "almost always" is).
"Almost always", above, means the 64K first code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.
A non-BMP ("rare") Unicode character is represented as two Java chars (surrogate representation). This applies also to the literal representation as a string: For example, the character U+20000 is written as "\uD840\uDC00".
Corollary: string.length() returns the number of Java chars, not of Unicode characters. A string that has just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences.
Java has little intelligence for dealing with non-BMP unicode characters as a whole. There are some utility methods that treat characters as code-points, represented as ints eg: Character.isLetter(int ch). Those are the real fully-Unicode methods.
You said:
A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.
Unicode grows
Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow — and not just because of emojis.
143,859 characters in Unicode 13 (Java 15, release notes)
137,994 characters in Unicode 12.1 (Java 13 & 14)
136,755 characters in Unicode 10 (Java 11 & 12)
120,737 characters in Unicode 8 (Java 9)
110,182 characters in Unicode 6.2 (Java 8)
109,449 characters in Unicode 6.0 (Java 7)
96,447 characters in Unicode 4.0 (Java 5 & 6)
49,259 characters in Unicode 3.0 (Java 1.4)
38,952 characters in Unicode 2.1 (Java 1.1.7)
38,950 characters in Unicode 2.0 (Java 1.1)
34,233 characters in Unicode 1.1.5 (Java 1.0)
char is legacy
The char type is long outmoded, now legacy.
Use code point numbers
Instead, you should be working with code point numbers.
You asked:
Does this mean that you can't handle certain Unicode characters in a Java application?
The char type can address less than half of today's Unicode characters.
To represent any Unicode character, use code point numbers. Never use char.
Every character in Unicode is assigned a code point number. There are over a million of these numbers: 1,114,112 in total, ranging from 0 to 1,114,111 (U+10FFFF). Comparing that to the counts listed above, most of the numbers in that range have not yet been assigned to a character. Some of those numbers are reserved as Private Use Areas and will never be assigned.
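The bounds of the code point range are available as constants on the Character class, as this small sketch shows. The class CodePointRange and the method slotCount are names made up for the example.

```java
// Hypothetical demo class, for illustration only.
class CodePointRange {

    // Total number of code point slots, from MIN_CODE_POINT to
    // MAX_CODE_POINT inclusive.
    static long slotCount() {
        return (long) Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT); // 1114111, i.e. U+10FFFF
        System.out.println(slotCount());              // 1114112
        // U+E000 is the first code point of the BMP Private Use Area.
        System.out.println(Character.getType(0xE000) == Character.PRIVATE_USE);
        // One past the maximum is not a valid code point.
        System.out.println(Character.isValidCodePoint(0x110000));
    }
}
```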
The String class has gained methods for working with code point numbers, as did the Character class.
Get the code point number for any character in a string, by zero-based index number. Here we get 97 for the letter a.
int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.
For the more general CharSequence rather than String, use Character.codePointAt.
We can get the Unicode name for a code point number.
String name = Character.getName( 97 ) ; // letter `a`
LATIN SMALL LETTER A
We can get a stream of the code point numbers of all the characters in a string.
IntStream codePointsStream = "Cat".codePoints() ;
We can turn that into a List of Integer objects. See How do I convert a Java 8 IntStream to a List?.
List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;
Any code point number can be changed into a String of a single character by calling Character.toString.
String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A.
a
We can produce a String object from an IntStream of code point numbers. See Make a string from an IntStream of code point numbers?.
IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).
String output =
intStream
.collect( // Collect the results of processing each code point.
StringBuilder :: new , // Supplier<R> supplier
StringBuilder :: appendCodePoint , // ObjIntConsumer<R> accumulator
StringBuilder :: append // BiConsumer<R,R> combiner
) // Returns a `CharSequence` object.
.toString(); // If you would rather have a `String` than `CharSequence`, call `toString`.
Cat 🐈
You asked:
Does this boil down to what character encoding you are using?
Internally, a String in Java is always using UTF-16.
You only use other character encoding when importing or exporting text in or out of Java strings.
So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.
When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8 as UTF-16 is now considered harmful.
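The export problem can be observed with a round trip: encode the string to bytes, decode it again, and compare. This is only a sketch; the class ExportRoundTrip and the method survives are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ExportRoundTrip {

    // True if the text is unchanged after encoding to bytes in the given
    // charset and decoding those bytes again with the same charset.
    static boolean survives(String text, Charset charset) {
        byte[] exported = text.getBytes(charset);
        return new String(exported, charset).equals(text);
    }

    public static void main(String[] args) {
        String text = "Cat \uD83D\uDC08"; // "Cat " followed by U+1F408 (cat emoji)
        System.out.println(survives(text, StandardCharsets.UTF_8));
        // ISO-8859-1 has no mapping for the emoji, so it gets replaced.
        System.out.println(survives(text, StandardCharsets.ISO_8859_1));
    }
}
```

Note that String.getBytes silently substitutes a replacement byte for unmappable characters rather than throwing, so the loss goes unnoticed unless you check for it.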
Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.
In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:
char is a UTF-16 code unit, not a code point
new low-level APIs use an int to represent a Unicode code point
high level APIs have been updated to understand surrogate pairs
a preference towards char sequence APIs instead of char based methods
Since Java doesn't have 32 bit chars, I'll let you judge if we can call this good Unicode support.
Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, a more thorough documentation here.
The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities. The Unicode
standard has since been changed to allow for characters whose
representation requires more than 16 bits. The range of legal code
points is now U+0000 to U+10FFFF, known as Unicode scalar value.
(Refer to the definition of the U+n notation in the Unicode standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to
as the Basic Multilingual Plane (BMP). Characters whose code points
are greater than U+FFFF are called supplementary characters. The Java
2 platform uses the UTF-16 representation in char arrays and in the
String and StringBuffer classes. In this representation, supplementary
characters are represented as a pair of char values, the first from
the high-surrogates range, (\uD800-\uDBFF), the second from the
low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP)
code points, including the surrogate code points, or code units of the
UTF-16 encoding. An int value represents all Unicode code points,
including supplementary code points. The lower (least significant) 21
bits of int are used to represent Unicode code points and the upper
(most significant) 11 bits must be zero. Unless otherwise specified,
the behavior with respect to supplementary characters and surrogate
char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate
ranges as undefined characters. For example,
Character.isLetter('\uD840') returns false, even though this specific
value if followed by any low-surrogate value in a string would
represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example,
Character.isLetter(0x2F81A) returns true because the code point value
represents a letter (a CJK ideograph).
From the OpenJDK7 documentation for String:
A String represents a string in the
UTF-16 format in which supplementary
characters are represented by
surrogate pairs (see the section
Unicode Character Representations in
the Character class for more
information). Index values refer to
char code units, so a supplementary
character uses two positions in a
String.
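Since index values count char code units, addressing the n-th code point needs a translation step. A minimal sketch using String.offsetByCodePoints follows; the class CodePointIndexing and its method are names invented for this example.

```java
// Hypothetical helper class, for illustration only.
class CodePointIndexing {

    // Returns the code point at the given code point index (not char index).
    // Throws IndexOutOfBoundsException if cpIndex is out of range.
    static int codePointAtCodePointIndex(String s, int cpIndex) {
        int charIndex = s.offsetByCodePoints(0, cpIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b': 3 code points, 4 chars
        System.out.println(s.length());
        System.out.println(s.codePointCount(0, s.length()));
        // charAt(1) yields only the high surrogate of the emoji.
        System.out.println(Integer.toHexString(s.charAt(1)));
        // Code point index 2 is 'b', which sits at char index 3.
        System.out.println((char) codePointAtCodePointIndex(s, 2));
    }
}
```

Each call walks the string from the start, so for repeated access it is cheaper to iterate once with codePointAt and Character.charCount.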