Valid/invalid non-ascii and invalid ascii characters - java

I need to test the processing of a string which contains valid non-ascii characters + invalid non-ascii characters + invalid ascii characters.
Can someone please give me a couple of examples of such characters. It would be great if you could let me know the range of their value in their category as I am not quite able to differentiate which non-ascii values could be valid and which ones are invalid.
Ex : String str = "Bj��rk����oacute�";
^
Is it a valid or invalid non-ascii
FYI I am a beginner in Java.

There are 128 valid basic ASCII characters, mapped to the values 0 (the NUL byte) to 127 (the DEL character). See here.
The word 'character' must be used wisely. The definition of 'character' is a special one. For example, the è, is that one character? Or is it two characters (e and `)? It depends.
Secondly, a sequence of characters is completely independent from its encoding. For simplicity, I assume that each byte is interpreted as one character.
To determine whether a byte can be parsed as an ASCII character, you can simply do this:
byte[] bytes = "Bj��rk����oacute�".getBytes();
for (byte b : bytes) {
// What's happening here? A byte in the range 0 to 127 is valid ASCII;
// any other value is invalid. A byte in Java is signed (-128 to 127),
// so the unsigned values 128 to 255 show up as negative numbers.
if (b >= 0) {
System.out.println("Valid ASCII");
}
else {
System.out.println("Invalid ASCII");
}
}
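Note that getBytes() without an argument uses the platform default charset, so the loop above can give different results on different machines. A minimal sketch that looks at the char values of the String instead, assuming a yes/no answer for the whole string is enough (the class and method names are made up for illustration):

```java
class AsciiCheck {
    // True if every UTF-16 code unit of the string lies in the ASCII range 0..127.
    public static boolean isPureAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("Bjork"));       // true
        System.out.println(isPureAscii("Bj\u00f6rk"));  // false, o-umlaut is U+00F6
    }
}
```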

Some background
When Java was invented, a very important design decision was that text in Java would be Unicode: a numbering system for all graphemes in the world. Hence char is two bytes (in UTF-16, one of the Unicode "universal character set transformation formats"). And byte is a distinct type for binary data.
Unicode numbers all symbols, so-called code points, like ♫, as U+266B. Those numbers reach into the three-byte range. Hence code points in Java are represented as int.
ASCII is the 7-bit subset of Unicode (and of UTF-8), 0 - 127.
UTF-8 is a multibyte Unicode format, where ASCII is a valid subset, and higher symbols are encoded as sequences of two to four bytes.
Validity
You were asked to identify "invalid" characters = wrongly produced code points.
You could also identify code parts that produce invalid characters. (Easier.)
In the above, � is a placeholder character (like ?) that substitutes for a code point that is not representable in the current character set. If the code produced a ? as placeholder, one cannot guess whether substitution took place. For some west European languages the encoding is Windows-1252 (Cp1252, MS Windows Latin-1). You can check whether a code point from a String can be converted to that Charset.
Then false positives remain: wrong characters that nevertheless exist in Cp1252. That could be a multi-byte UTF-8 sequence interpreted as several Windows-1252 characters. So an acceptable non-ASCII char adjacent to an unacceptable non-ASCII char is suspect too. That means you need to list the special characters in your language, plus extras such as special quotes and, in English, borrowed letters like ç, ñ.
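The convertibility check mentioned above could be sketched with CharsetEncoder.canEncode. This is only an illustration: the class name, the method name and the choice to return the index of the first offending character are assumptions, not part of the original question.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class Cp1252Check {
    // Returns the index of the first character that Windows-1252 cannot encode, or -1 if none.
    public static int firstUnencodable(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String single = new String(Character.toChars(cp));
            if (!encoder.canEncode(single)) {
                return i;
            }
            i += Character.charCount(cp);  // step over both halves of a surrogate pair
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnencodable("Bj\u00f6rk"));  // -1, o-umlaut exists in Cp1252
        System.out.println(firstUnencodable("Bj\uFFFDrk"));  // 2, U+FFFD does not
    }
}
```

This only catches characters outside Cp1252; the false positives described above still need a list of characters you actually expect.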
For MS-Windows Latin-1 (an altered ISO Latin-1) something like:
boolean isSuspect(char ch) {
if (ch < 32) {
return "\f\n\r\t".indexOf(ch) == -1; // Control chars other than common whitespace are suspect.
} else if (ch >= 127) {
return false;
} else {
return suspects.get((int) ch); // Better use a positive list.
}
}
static BitSet suspects = new BitSet(256);
static {
...
}

Related

How do I convert a single character code to a `char` given a character set?

I want to convert a decimal value to ASCII, and this code returns unexpected results. Here is the code I am using.
public static void main(String[] args) {
char ret = (char) 146;
System.out.println(ret); // prints nothing
}
I expect to get the single character "'" as per http://www.ascii-code.com/
Has anyone come across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define code point 146 (and another reference just because). So that's already asking for trouble. You can see this also if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", but this name is almost completely obsolete these days), which does define that code point as a right single quote (for other charsets that provide this character at 146 see this list and look for encodings that provide it at 0x92), and we can verify this as such:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that page is confusing.
But the big mistake is you can't do what you're trying to do in the way you are doing it. A char in Java is a UTF-16 code unit: a single char corresponds to a BMP code point, while a supplementary character (> 0xFFFF) needs a pair of chars, or an int, to cover the full range.
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.
However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
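On that open question: as far as I can tell, Charset#decode produces UTF-16 just like String does, so a supplementary character comes back as a surrogate pair in the CharBuffer. A small sketch, assuming UTF-8 input for U+1F408 (an arbitrary supplementary code point picked for illustration; the class and method names are made up):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class DecodeSupplementary {
    // Decodes the four UTF-8 bytes of U+1F408 into a CharBuffer.
    public static CharBuffer decodeCat() {
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x90, (byte) 0x88};
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
    }

    public static void main(String[] args) {
        CharBuffer cb = decodeCat();
        System.out.println(cb.length());                            // 2 chars: a surrogate pair
        System.out.printf("0x%x%n", Character.codePointAt(cb, 0));  // 0x1f408
    }
}
```

So cb.get(0) alone would only give the high surrogate here; Character.codePointAt on the buffer gives the full code point.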
As an aside: In your case, 146 (0x92) cast directly to char corresponds to the UTF-16 character "PRIVATE USE TWO" (see also), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and seems to fall in the range of characters reserved for ANSI terminal control (although AFAIK isn't actually used, but it's in that range regardless). I wouldn't be surprised if perhaps browsers in some locales rendered it as a right-single-quote for compatibility, but terminals did something weird with it.
Also, fyi, the official Unicode code point for right single quote is 0x2019. You could reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer to mentions that values 160 to 255 correspond to the ISO-8859-1 (aka Latin 1) table; the values in the range 128 to 159 come from the Windows-specific variant of Latin 1 (ISO-8859-1 leaves that range undefined, to be assigned by the operating system).
Java characters are based on UTF-16, which is itself based on the Unicode table. If you want to refer specifically to the right quote character, you can specify it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).

Why character can't print after 128 number

In my project I am trying to convert a binary number to an integer and then the integer to a character. But above 128 it prints only the '?' character. Please help me print characters up to 250. My code is
class b
{
public static void main(String[] args)
{
String dec1="11011001" ;
System.out.println(dec1);
int dec = Integer.parseInt(dec1, 2);
System.out.println(dec);
String str = new Character((char)dec).toString();
System.out.println("decrypted number is "+str);
}
}
Thank you.
Not all byte values have a printable character associated with them, ASCII does not, many/most unicode bytes do not, and the range 0x00 - 0x1f are all unprintable controls such as DC1, Bell, Backspace, etc. Unicode has the same first 32 characters reserved as non-printable.
Byte values above 127 (0x7f) have different meanings in different encodings, there are many encodings. Historically ASCII was the default encoding and there were many extensions to it. These days the standard is unicode which exists in several varieties including UTF-8, UTF-16 (LE, BE and BOM) and UTF-32 (LE, BE and BOM). UTF8 is common for interchange especially over the net and UTF-16 internally in many systems.
Depending on the encoding and glyph (displayed representation) it may take from one to over 16 bytes to represent a single glyph. Emoji mostly are in code plane 1 meaning that they require more than 16-bits for their code point (unicode is a 21-bit encoding system). Additionally some glyphs are represented by a sequence of code points, examples are flags which combine a country with the flag and Emoji joined with "joiners".
In the case of 217 (0xd9): as a single byte it is not a legal UTF-8 sequence (0xd9 is a lead byte that must be followed by a continuation byte), but 217 as a 16-bit integer (0x00d9) is a valid UTF-16 representation of Ù.
See ASCII and Unicode.
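To make the 217 case concrete, here is a rough sketch that decodes the single byte 0xd9 under two encodings and then encodes Ù as UTF-8. The class and helper names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Byte217 {
    // Decodes one byte value with the given charset and returns the first code point.
    public static int decodeSingleByte(int value, Charset cs) {
        return new String(new byte[] {(byte) value}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // Latin-1 maps every byte to the code point with the same number.
        System.out.printf("U+%04X%n", decodeSingleByte(217, StandardCharsets.ISO_8859_1)); // U+00D9
        // A lone 0xD9 is malformed UTF-8, so the decoder substitutes U+FFFD.
        System.out.printf("U+%04X%n", decodeSingleByte(217, StandardCharsets.UTF_8));      // U+FFFD
        // Encoding U+00D9 as UTF-8 takes two bytes.
        for (byte b : "\u00D9".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);                                                 // C3 99
        }
        System.out.println();
    }
}
```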
As per your code, first the binary string is converted to an integer, and then the integer is cast to the character with the same numeric value. The ASCII table only goes up to 127, so you get an ASCII character only for values up to 127; for values of dec greater than 127 the character is non-ASCII and, on a console that cannot display it, you get ? instead. The first 32 values are non-printable control characters, so you will get some strange symbol for them, but for values of dec in the range 32-126 you will get the character assigned to that value in the ASCII table. Since the value 127 is assigned to DEL, you will also get a strange symbol when dec is 127.
The issue is that your console's encoding doesn't match the encoding of the output of your Java program. I don't know what console you're using, but on Windows, you can run this command to see your current encoding:
chcp
The default console's encoding for USA is 437 and for Western Europe and Canada 850. These encodings have the 128 characters from ASCII encoding and 128 additional characters that are different from one encoding to another. You get nothing beyond the 128 ASCII characters because your Java output's encoding doesn't match the console's encoding. You have to change one of them to match the other.
You can change your console's encoding to UTF-8 by running this command:
chcp 65001
If you're not on Windows, you'll have to search for the equivalent commands for your system. But I believe on most Linux & Unix derived systems, you can use the locale command to see the current encoding and the export command to change it.
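The other direction is to make the Java side emit the encoding the console expects. A minimal sketch, assuming the console really is set to UTF-8 (for example after chcp 65001); the helper bytesPrinted is hypothetical and only there to show which bytes each encoding produces:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ConsoleEncoding {
    // Returns the bytes a PrintStream with the given encoding would send for the text.
    public static byte[] bytesPrinted(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = String.valueOf((char) 217);  // U+00D9
        for (String enc : new String[] {"US-ASCII", "ISO-8859-1", "UTF-8"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : bytesPrinted(text, enc)) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(enc + ": " + hex.toString().trim());
        }
        // Write to the real console as UTF-8, whatever the default encoding is.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

With US-ASCII the character is replaced by the single byte 3F, which is the '?' from the question.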
I receive the following output from your code. I assume that you run the program in an environment/console that doesn't support the character. You need a console that supports UTF-8, UTF-16 or similar to be able to print all the characters you set up numerical values for.
11011001
217
decrypted number is Ù

Obtain ascii values of chars in C#

I am accustomed to coding in java, but recently I have been making some ASP webpages that use C#.
In Java, chars are by default represented by their numeric ASCII value unless you combine them with a string. I have been unable to repeat this in C#.
What do I need to do to get ascii values of chars in C#?
char in .Net is a 2-byte structure representing a UTF-16 code unit of a Unicode code point, of which ASCII is a tiny subset. But some Unicode code points, including certain Kanji characters, require more than two bytes, and these are represented in a .Net string as a surrogate pair. Thus the most general way to get a Unicode code point value for a character in a string at a specified index is Char.ConvertToUtf32(string s, int index)
For instance, the following enumerates the unicode code point values in a string:
public static IEnumerable<int> Utf32CodePoints(string s, int index)
{
for (int length = s.Length; index < length; index++)
{
yield return char.ConvertToUtf32(s, index);
if (char.IsSurrogatePair(s, index))
index++;
}
}
If you explicitly want only ASCII values and want to skip non-ASCII characters, you could use the ASCII decoder with appropriate exceptions, as shown here: Encoding.ASCII Property. Alternatively, just cast each char to an int and check if its value falls between U+0000 and U+007F, which is the defined range for ASCII.
ASCII is very small subset of characters that can be represented in C#/Java.
Fastest way to get ASCII code (assuming you know that value fits in ASCII range):
var ascii = ((int)c) & 0x7F;
You may want to add range checks (0-0x7F) and fail if value falls outside the range. Alternatively you can use Encoding.ASCII to do conversion (will replace characters outside of the range with question marks).
Note: if by "ascii" you actually mean "numeric value"/UTF-16 Unicode code, then a basic cast to ushort (or int) will work:
var code = (int)c;

char to Unicode more than U+FFFF in java?

How can I display a Unicode Character above U+FFFF using char in Java?
I need something like this (if it were valid):
char u = '\u+10FFFF';
You can't do it with a single char (which holds a UTF-16 code unit), but you can use a String:
// This represents U+10FFFF
String x = "\udbff\udfff";
Alternatively:
String y = new StringBuilder().appendCodePoint(0x10ffff).toString();
That is a surrogate pair (two UTF-16 code units which combine to form a single Unicode code point beyond the Basic Multilingual Plane). Of course, you need whatever's going to display your data to cope with it too...
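For the curious, the two code units can be worked out by hand. The sketch below follows the UTF-16 rule (subtract 0x10000, split the remaining 20 bits into two groups of 10); the class and method names are just for illustration.

```java
import java.util.Arrays;

class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its two UTF-16 code units.
    public static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (v >> 10));     // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));    // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10FFFF);
        System.out.printf("%04x %04x%n", (int) pair[0], (int) pair[1]);        // dbff dfff
        System.out.println(Arrays.equals(pair, Character.toChars(0x10FFFF)));  // true
    }
}
```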
Instead of using StringBuilder, you can also use a function
directly found in the Character class. The function is
toChars() and it has the following spec:
Converts the specified character (Unicode code point) to
its UTF-16 representation stored in a char array.
So you don't need to know exactly what the surrogate pairs look
like and you can directly use the code point. Example code
then looks as follows:
int ch = 0x10FFFF;
String s = new String(Character.toChars(ch));
Note that the datatype for the code point is int and not char.
Unicode characters can take more than two bytes, which in general cannot be held in a char.
Source
The char data type are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value.
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
In the J2SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.
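The two isLetter examples from the quoted documentation can be tried directly. A short sketch (the class name and the pairIsLetter helper are made up; U+20000 is used as the pair because it is a CJK ideograph whose high surrogate is the value from the quote):

```java
class IsLetterDemo {
    // Combines a surrogate pair into one code point and asks whether it is a letter.
    public static boolean pairIsLetter(char high, char low) {
        return Character.isLetter(Character.toCodePoint(high, low));
    }

    public static void main(String[] args) {
        // char overload: a lone high surrogate is not a letter.
        System.out.println(Character.isLetter('\uD840'));    // false
        // int overload: the full code point is a letter.
        System.out.println(Character.isLetter(0x2F81A));     // true
        // The same high surrogate plus a low surrogate forms U+20000.
        System.out.println(pairIsLetter('\uD840', '\uDC00')); // true
    }
}
```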

Java Unicode encoding

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?
Does this boil down to what character encoding you are using?
You can handle them all if you're careful enough.
Java's char is a UTF-16 code unit. For characters with code-point > 0xFFFF it will be encoded with 2 chars (a surrogate pair).
See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)
Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
To add to the other answers, some points to remember:
A Java char takes always 16 bits.
A Unicode character, when encoded as UTF-16, takes "almost always" (not always) 16 bits: that's because there are more than 64K unicode characters. Hence, a Java char is NOT a Unicode character (though "almost always" is).
"Almost always", above, means the 64K first code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.
A non-BMP ("rare") Unicode character is represented as two Java chars (surrogate representation). This applies also to the literal representation as a string: For example, the character U+20000 is written as "\uD840\uDC00".
Corollary: string.length() returns the number of Java chars, not of Unicode chars. A string that has just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences.
Java has little intelligence for dealing with non-BMP unicode characters as a whole. There are some utility methods that treat characters as code-points, represented as ints eg: Character.isLetter(int ch). Those are the real fully-Unicode methods.
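A quick sketch of the length() point above, using the same U+20000 example. The class and method names are invented for the illustration.

```java
class LengthVsCodePoints {
    // Counts Unicode code points rather than UTF-16 code units.
    public static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD840\uDC00";  // 'a' followed by U+20000
        System.out.println(s.length());          // 3 Java chars
        System.out.println(countCodePoints(s));  // 2 Unicode characters
        // Iterate by code point, not by char, to see each character whole.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```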
You said:
A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.
Unicode grows
Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow — and not just because of emojis.
143,859 characters in Unicode 13 (Java 15, release notes)
137,994 characters in Unicode 12.1 (Java 13 & 14)
136,755 characters in Unicode 10 (Java 11 & 12)
120,737 characters in Unicode 8 (Java 9)
110,182 characters in Unicode 6.2 (Java 8)
109,449 characters in Unicode 6.0 (Java 7)
96,447 characters in Unicode 4.0 (Java 5 & 6)
49,259 characters in Unicode 3.0 (Java 1.4)
38,952 characters in Unicode 2.1 (Java 1.1.7)
38,950 characters in Unicode 2.0 (Java 1.1)
34,233 characters in Unicode 1.1.5 (Java 1.0)
char is legacy
The char type is long outmoded, now legacy.
Use code point numbers
Instead, you should be working with code point numbers.
You asked:
Does this mean that you can't handle certain Unicode characters in a Java application?
The char type can address less than half of today's Unicode characters.
To represent any Unicode character, use code point numbers. Never use char.
Every character in Unicode is assigned a code point number. These range over a million, from 0 to 1,114,111 (U+10FFFF), which gives 1,114,112 possible numbers. Doing the math when comparing to the numbers listed above, this means most of the numbers in that range have not yet been assigned to a character. Some of those numbers are reserved as Private Use Areas and will never be assigned.
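The bounds of that range can be read from the Character class: the highest code point is U+10FFFF, which is 1,114,111, so counting zero there are 1,114,112 possible numbers. A tiny sketch (the class and method names are hypothetical):

```java
class CodePointRange {
    // Number of possible code point values, counting zero.
    public static int possibleCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT);              // 1114111
        System.out.println(possibleCodePoints());                  // 1114112
        System.out.println((int) Character.MAX_VALUE);             // 65535, the largest char
        System.out.println(Character.isValidCodePoint(0x110000));  // false, one past the end
    }
}
```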
The String class has gained methods for working with code point numbers, as did the Character class.
Get the code point number for any character in a string, by zero-based index number. Here we get 97 for the letter a.
int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.
For the more general CharSequence rather than String, use Character.codePointAt.
We can get the Unicode name for a code point number.
String name = Character.getName( 97 ) ; // letter `a`
LATIN SMALL LETTER A
We can get a stream of the code point numbers of all the characters in a string.
IntStream codePointsStream = "Cat".codePoints() ;
We can turn that into a List of Integer objects. See How do I convert a Java 8 IntStream to a List?.
List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;
Any code point number can be changed into a String of a single character by calling Character.toString.
String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A.
a
We can produce a String object from an IntStream of code point numbers. See Make a string from an IntStream of code point numbers?.
IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).
String output =
intStream
.collect( // Collect the results of processing each code point.
StringBuilder :: new , // Supplier<R> supplier
StringBuilder :: appendCodePoint , // ObjIntConsumer<R> accumulator
StringBuilder :: append // BiConsumer<R,R> combiner
) // Returns a `CharSequence` object.
.toString(); // If you would rather have a `String` than `CharSequence`, call `toString`.
Cat 🐈
You asked:
Does this boil down to what character encoding you are using?
Internally, a String in Java is always using UTF-16.
You only use other character encoding when importing or exporting text in or out of Java strings.
So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.
When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8 as UTF-16 is now considered harmful.
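To illustrate the export problem, here is a rough sketch that encodes a string and decodes it again with the same charset, then checks whether the text survived. The class and method names are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExportRoundTrip {
    // True if the text is unchanged after encoding to bytes and decoding back.
    public static boolean survives(String text, Charset cs) {
        byte[] exported = text.getBytes(cs);  // unmappable characters are replaced
        return new String(exported, cs).equals(text);
    }

    public static void main(String[] args) {
        String text = "Cat \uD83D\uDC08";  // "Cat " followed by U+1F408
        System.out.println(survives(text, StandardCharsets.UTF_8));       // true
        System.out.println(survives(text, StandardCharsets.ISO_8859_1));  // false
    }
}
```

If silent replacement is not acceptable, a CharsetEncoder configured with CodingErrorAction.REPORT throws instead of substituting.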
Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.
In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:
char is a UTF-16 code unit, not a code point
new low-level APIs use an int to represent a Unicode code point
high level APIs have been updated to understand surrogate pairs
a preference towards char sequence APIs instead of char based methods
Since Java doesn't have 32 bit chars, I'll let you judge if we can call this good Unicode support.
Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, a more thorough documentation here.
The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities. The Unicode
standard has since been changed to allow for characters whose
representation requires more than 16 bits. The range of legal code
points is now U+0000 to U+10FFFF, known as Unicode scalar value.
(Refer to the definition of the U+n notation in the Unicode standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to
as the Basic Multilingual Plane (BMP). Characters whose code points
are greater than U+FFFF are called supplementary characters. The Java
2 platform uses the UTF-16 representation in char arrays and in the
String and StringBuffer classes. In this representation, supplementary
characters are represented as a pair of char values, the first from
the high-surrogates range, (\uD800-\uDBFF), the second from the
low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP)
code points, including the surrogate code points, or code units of the
UTF-16 encoding. An int value represents all Unicode code points,
including supplementary code points. The lower (least significant) 21
bits of int are used to represent Unicode code points and the upper
(most significant) 11 bits must be zero. Unless otherwise specified,
the behavior with respect to supplementary characters and surrogate
char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate
ranges as undefined characters. For example,
Character.isLetter('\uD840') returns false, even though this specific
value if followed by any low-surrogate value in a string would
represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example,
Character.isLetter(0x2F81A) returns true because the code point value
represents a letter (a CJK ideograph).
From the OpenJDK7 documentation for String:
A String represents a string in the
UTF-16 format in which supplementary
characters are represented by
surrogate pairs (see the section
Unicode Character Representations in
the Character class for more
information). Index values refer to
char code units, so a supplementary
character uses two positions in a
String.
