In Java, for Double, we have a value for NaN (Not A Number).
Now, for Character, do we have a similar equivalent for "Not A Character"?
If the answer is no, then I think a safe substitute may be Character.MIN_VALUE (which is of type char and has value \u0000). Do you think this substitute is safe enough? Or do you have another suggestion?
In mathematics, there is a concept of "not a number": 5 divided by 0 is undefined, not a number. Since this concept exists, the double type has NaN. (Strictly, in Java floating point, 0.0 / 0.0 evaluates to NaN, while 5.0 / 0.0 evaluates to Infinity.)
Characters are an abstraction: a mapping from numbers to glyphs. The idea of "not a character" doesn't really exist as a single agreed value, since the encoding in use can vary (UTF-8, UTF-16, etc.).
Think of it this way. If I ask you, "what is 5 divided by 0?", you would say it's "not a number". But, we do have a defined way to represent the value, even though it's not a number. If I draw a random squiggle and ask you, "what letter is this?", you would say "it's not a letter". But, we don't have a way to actually represent that squiggle outside of what I just drew. There's no real way to communicate the "non-character" I've just drawn, but there is a way to communicate the "non-number" of 5 divided by 0.
\u0000 is the null character, which is still a character. What exactly are you trying to achieve? Depending on your goal \u0000 may suffice.
The "not-a-number" concept does not really belong to Java; rather, Java defines double as being IEEE 754 double precision floating-point numbers, which have that concept. (That said, if I recall correctly, Java does specify some details about NaN that IEEE 754 leaves open to implementations.)
The analogous standard for Java char is Unicode: Java defines char as being UTF-16 code units.
Unicode does have various reserved-undefined characters that you could use; for example, U+FFFF ('\uFFFF') will never be a character. Alternatively, you could use U+FFFD ('\uFFFD'), which is a character, but is specifically the "replacement character" suitable for replacing garbage or invalid characters.
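As a sketch of that suggestion (the firstVowel method is made up for illustration; only the choice of '\uFFFF' as a sentinel comes from the paragraph above):

static final char NO_CHAR = '\uFFFF'; // a permanent Unicode noncharacter, never assigned

// Hypothetical example: returns the first vowel, or NO_CHAR if there is none.
static char firstVowel(String s) {
    for (char c : s.toCharArray()) {
        if ("aeiouAEIOU".indexOf(c) >= 0) return c;
    }
    return NO_CHAR;
}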
It depends on what you're trying to do. If you're trying to represent the lack of a character, you could do:
Optional<Character> noCharacter = Optional.empty();
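Fleshed out a little (all standard java.util.Optional API):

import java.util.Optional;

Optional<Character> noCharacter = Optional.empty();
Optional<Character> someCharacter = Optional.of('x');

System.out.println(noCharacter.isPresent());   // false: callers must handle absence explicitly
System.out.println(someCharacter.orElse('?')); // x
System.out.println(noCharacter.orElse('?'));   // ? (the fallback)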
You could check whether the character's code falls between 'a' and 'z' or between 'A' and 'Z'. Anything outside those ranges would qualify as "not a character", if by "not a character" you mean "not an alphabet letter". You could extend this to symbols like the question mark, full stop, comma, etc., but if you want to go further than ASCII territory, I think it gets out of hand.
Another approach would be to check whether something is a number; if it's not, check whether it's a whitespace character; if it's not that either, everything else qualifies as a character, and you have your answer. (See the sketch after this answer for the built-in classification helpers.)
It's a long discussion IMO, because answers vary, depending on your view on what's a character.
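For what it's worth, the Character class already encapsulates classification checks like these, so a sketch of the approach above could lean on the standard library:

char c = '7';
System.out.println(Character.isLetter(c));        // false
System.out.println(Character.isDigit(c));         // true
System.out.println(Character.isWhitespace(c));    // false
System.out.println(Character.isLetterOrDigit(c)); // true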
I cannot think of any explanation other than "a string of digits would be a valid identifier as well as a valid number."
Is there any other explanation besides this one?
Because that would make telling number literals apart from symbol names a serious PITA.
For example, if a digit were valid as the first character, variables named 0xdeadbeef or 0xc00lcafe would be legal, but each could be interpreted as a hexadecimal number as well. By limiting the first character of a symbol to a non-digit, ambiguities of that kind are avoided.
If identifiers could start with a digit, then this assignment would be possible:
int 33 = 44; // oh oh
How would the compiler then distinguish between a numeric literal and a variable?
It's to keep the rules simple for the compiler as well as for the programmer.
An identifier could be defined as any alphanumeric sequence that cannot be interpreted as a number, but then you would get into situations where the compiler interprets the code differently from what you expect.
Example:
double 1e = 9;
double x = 1e-4;
The result in x would not be 5 but 0.0001, because 1e-4 is parsed as a number in scientific notation, not as 1e minus 4.
This is done in Java and in many other languages so that a parser could classify a terminal symbol uniquely regardless of its surrounding context. Technically, it is entirely possible to allow identifiers that look like numbers or even like keywords: for example, it is possible to write a parser that lifts the restriction on identifiers, allowing you to write something like this:
int 123 = 321; // 123 is an identifier in this imaginary compiler
The compiler knows enough to "understand" that whatever comes after the type name must be a variable name, so 123 is an identifier, and it could treat this as a valid declaration. However, this would create more ambiguities down the road, because 123 becomes an invalid number, "shadowed" by your new "identifier".
In the end, the rule works both ways: it helps compiler designers write simpler compilers, and it also helps programmers write readable code.
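Java even exposes this rule through the standard library, so you can check it directly:

System.out.println(Character.isJavaIdentifierStart('a')); // true
System.out.println(Character.isJavaIdentifierStart('3')); // false: a digit may not start an identifier
System.out.println(Character.isJavaIdentifierPart('3'));  // true: digits are fine after the first character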
Note that there were attempts in the past to build compilers that are not particularly picky about the names of identifiers; for example,
int a real int = 3
would declare an identifier with spaces (i.e. "a real int" is a single identifier). This did not help readability, though, so modern compilers abandoned the trend.
I know that ASCII codes are between 0-127 in decimal and 0000 0000 to 0111 1111 in binary, and that values between 128-255 are extended ASCII.
I also thought that int accepts 9 digits (I was wrong: the range of int is -2,147,483,648 to 2,147,483,647), so if we cast every number between 0 and the maximum int value to a char, there will be many, many symbols; for example:
(char) 999999999 gives 짿, which is a Korean symbol (I don't even know what it means; Google Translate can't find any meaning!).
The same thing happens with values between the minimum int value and 0.
It doesn't make sense that those symbols were input one by one.
I don't understand: how could they assign each of those big numbers its own character?
The assignments are made by the Unicode consortium. See http://unicode.org for details.
In your particular case, however, you are doing something completely nonsensical. You have the integer 999999999, which in hex is 0x3B9AC9FF. You then cast that to char, which discards the top two bytes and gives you 0xC9FF. If you then look that up at unicode.org (http://www.unicode.org/cgi-bin/Code2Chart.pl), you discover that yes, it is a Korean character.
Unicode code points can in fact be quite large; there are over a million of them. But you can't get to them just by casting. To get to Unicode code points that are outside of the "normal" 16-bit range using UTF-16 (as Java does), you need to use two chars: a surrogate pair. See http://en.wikipedia.org/wiki/UTF-16, the section on surrogate pairs.
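To see both halves of that concretely, here is a minimal sketch (the printed values follow from the Unicode code charts):

int n = 999999999;                       // 0x3B9AC9FF
char c = (char) n;                       // the cast keeps only the low 16 bits
System.out.printf("U+%04X%n", (int) c);  // U+C9FF, inside the Hangul Syllables block

int codePoint = 0x1D11E;                 // MUSICAL SYMBOL G CLEF, beyond the 16-bit range
char[] pair = Character.toChars(codePoint);
System.out.println(pair.length);         // 2: UTF-16 needs a surrogate pair for it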
To address some of the other concerns in your question:
I know that ACCII codes are between (0-127) in decimal and (0000 0000 to 0000 1111) in binary.
That's ASCII, not ACCII, and 127 in binary is 01111111, not 00001111
Also we know that int accepts 9 digits, so if we cast every number between
The range of an int is larger than that.
don't know what it mean even Google translate can't find any meaning
Korean is not like Chinese, where each glyph represents a word. Those are letters. They don't have a meaning unless they happen to accidentally form a word. You'd have about as much luck googling randomly chosen English letters and trying to find their meaning; maybe sometimes you'd choose CAT at random, but most of the time you'd choose XRO or some such thing that is not a word.
Read this if you want to understand how the Korean alphabet works: http://en.wikipedia.org/wiki/Hangul
I have a list of Strings of arbitrary length. I need to ensure each String element in the list is alphanumeric or numeric, with no spaces; special characters such as - \ / _ are allowed.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc.; basically, no words.
I’m currently using stringInstance.matches("regex"), but I'm not too sure how to write the appropriate expression.
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression, which works, but I feel it may be bad in terms of being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way around, you most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters).
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
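The difference is easy to demonstrate (Java 7 or later, for the UNICODE_CHARACTER_CLASSES flag):

import java.util.regex.Pattern;

String word = "élève";
System.out.println(word.matches("\\w+"));  // false: \w is ASCII-only by default

Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASSES); // same as embedding (?U)
System.out.println(p.matcher(word).matches()); // true: \w now covers the full character set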
It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: .*\pN.*. Or, if you would like to limit your search to letters, numbers, and punctuation, you should use [\pL\pN\pP]*\pN[\pL\pN\pP]*. I've tested it on your examples and it works fine.
You can further refine your regex by using lazy quantifiers, like this: .*?\pN.*?. This way it would fail faster if there are no numbers.
I would also like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work, and a chapter on regular expressions in Java.
It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^[\d\\/\-_]*\d[\d\\/\-_]*$
This can be read as: any mix of digits and the special characters \ / - _, containing at least one digit. It accepts digits alone, and it rejects both the empty string and strings consisting only of special characters.
I used a regex tester to check several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.
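Putting the answers above together, here is a hedged sketch of the whole check, using the Unicode categories rather than ASCII ranges per the warning earlier: the pattern requires at least one digit and otherwise allows letters, digits, and punctuation (which covers - \ / _ and parentheses), while rejecting spaces.

import java.util.regex.Pattern;

private static final Pattern VALID =
        Pattern.compile("^[\\pL\\pN\\pP]*\\pN[\\pL\\pN\\pP]*$");

static boolean isValid(String s) {
    return VALID.matcher(s).matches(); // "Hello" fails (no digit); "J0hn-132ss/sda" passes
}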
Is the Java char type guaranteed to be stored in any particular encoding?
Edit: I phrased this question incorrectly. What I meant to ask is are char literals guaranteed to use any particular encoding?
"Stored" where? All Strings in Java are represented in UTF-16. When written to a file, sent across a network, or whatever else, it's sent using whatever character encoding you specify.
Edit: Specifically for the char type, see the Character docs. Specifically: "The char data type ... are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities." Therefore, casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it obviously won't necessarily be a valid UTF-16 character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, and if you're operating at this level, it might be important to get familiar with those semantics.
A Java char is conventionally used to hold a Unicode code unit; i.e. a 16 bit unit that is part of a valid UTF-16 sequence. However, there is nothing to prevent an application from putting any 16 bit unsigned value into a char, irrespective of what it actually means.
So you could say that a Unicode code unit can be represented by a char and a char can represent a Unicode code unit ... but neither of these is necessarily true, in the general case.
Your question about how a Java char is stored cannot be answered. Simply said, it depends on what you mean by "stored":
If you mean "represented in an executing program", then the answer is JVM implementation specific. (The char data type is typically represented as a 16 bit machine integer, though it may or may not be machine word aligned, depending on the specific context.)
If you mean "stored in a file" or something like that, then the answer is entirely dependent on how the application chooses to store it.
Is the Java char type guaranteed to be stored in any particular encoding?
In the light of what I said above the answer is "No". In an executing application, it is up to the application to decide what a char means / contains. When a char is stored to a file, the application decides how it wants to store it and what on-disk representation it will use.
FOLLOWUP
What about char literals? For example, 'c' must have some value that is defined by the language.
Java source code is required (by the language spec) to be Unicode text, represented in some character encoding that the tool chain understands; see the javac -encoding option. In theory, a character encoding could map the c in 'c' in your source code to something unexpected.
In practice though, the c will map to the Unicode lower-case C code-point (U+0063) and will be represented as the 16-bit unsigned value 0x0063.
To the extent that char literals have a meaning ascribed by the Java language, they represent (and are represented as) UTF-16 code units. Note that they may or may not be assigned Unicode code points ("characters"). Some Unicode code points in the range U+0000 to U+FFFF are unassigned.
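A minimal sketch of those semantics (Character.isSurrogate requires Java 7):

char c = 'c';
System.out.printf("U+%04X%n", (int) c);          // U+0063: the literal maps to that code point

char lone = '\uD800';                            // any 16-bit value is a legal char...
System.out.println(Character.isSurrogate(lone)); // true: alone, it is not a valid character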
Originally, Java used UCS-2 internally; now it uses UTF-16. The two are virtually identical, except for U+D800 through U+DFFF, which UTF-16 uses as surrogate code units in the extended representation of larger characters.
Is it possible to determine whether data is in English or Chinese?
This is possible, for example, using statistical methods. The English language has a very distinctive distribution of which characters appear at all, and a very distinctive distribution of which character follows a given character (that would be called a first-order model).
If 'e' is the most common symbol, it is very unlikely that the language is not something of European origin.
It may also be possible, rather trivially (but maybe not 100% reliably), to make such a distinction by looking at Unicode character values (converting between character sets if necessary). If there are characters with a Unicode value greater than 127, English is somewhat unlikely (note that there are symbols like €, though).
If there are many characters with Unicode values in the thousands, East Asian languages become more and more likely, with code points above 65535 (for example, in the CJK Unified Ideographs Extension B block) strongly suggesting Chinese.
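A minimal sketch of that heuristic, assuming Java 8 for the codePoints() stream (the majority threshold is an arbitrary choice):

static boolean looksChinese(String text) {
    long han = text.codePoints()
                   .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                   .count();
    return han * 2 > text.codePoints().count(); // more than half the code points are Han
}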
My idea is to calculate the average position of the characters in the Unicode table. Since Chinese characters are located well beyond the ASCII range (i.e., above value 127), you could easily determine whether the text is English or Chinese.
Edit: basically the same as what Damon added. >_>